Using Metrics

Custom evaluation metrics capture the dimensions of quality you want to evaluate against.

Each metric is defined by an evaluation prompt that can be iteratively refined to get the best possible results.

Building Custom Metrics

1. Initialize Atla

Initialize Atla by creating an instance of the Atla class.

from atla import Atla

client = Atla()
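
The client reads its API key from the ATLA_API_KEY environment variable by default. If you prefer to pass it explicitly, here is a minimal sketch (assuming the constructor accepts an api_key argument):

import os

from atla import Atla

# Assumption: the client accepts an explicit api_key argument; the key is read
# from the ATLA_API_KEY environment variable rather than hard-coded.
client = Atla(api_key=os.environ["ATLA_API_KEY"])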

2. Create a metric

Create a metric by specifying the metric name, type and (optionally) a description.

clarity_metric_id = client.metrics.create(
    name="clarity",
    metric_type="likert_1_to_5",
    description="How clear and easy to understand a response is.",  # optional
).metric_id

3. Create a prompt

Create an evaluation prompt that describes how to evaluate the metric, and set it as the active version.

prompt = "Rate the clarity of the response."
client.metrics.prompts.create(
    metric_id=clarity_metric_id,
    content=prompt,
)
client.metrics.prompts.set_active_prompt_version(
    metric_id=clarity_metric_id,
    version=1,
)

A metric will use the “active” version of its evaluation prompt unless otherwise specified. This must be set before running an evaluation.

4. (Optional) Add few-shot examples

Add few-shot examples to help the model understand how to evaluate the metric.

from atla.types.metrics.few_shot_example import FewShotExample

example_1 = FewShotExample(
    model_input="Why does my car make a strange noise?",
    model_output="The noise is likely due to a loose part.",
    score=5,
    critique="The response is clear and concise."
)
example_2 = FewShotExample(
    model_input="Why does my car make a strange noise?",
    model_output="Air-fuel mixture issues often appear in carburetors.",
    score=1,
    critique="The response is unclear and uses jargon."
)

response = client.metrics.few_shot_examples.set(
    metric_id=clarity_metric_id,
    few_shot_examples=[example_1, example_2]
)

5. Run an evaluation

Run an evaluation using the metric.

evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="Why does my car make a strange noise?",
    model_output="Something might be loose, broken, or worn out",
    metric_name="clarity",
)
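
The returned evaluation contains the judge's score and critique. The attribute path below is an assumption about the response shape; check the API Reference for the exact fields in your SDK version.

# Assumption: the verdict is nested under result.evaluation.
print(evaluation.result.evaluation.score)     # numeric score on the 1-5 Likert scale
print(evaluation.result.evaluation.critique)  # written explanation of the score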

6. (Optional) Iteratively refine the metric

Iteratively refine the metric’s evaluation prompt. New prompts are automatically versioned.

new_prompt = "Rate the response's clarity. A good response uses simple language."

client.metrics.prompts.create(
    metric_id=clarity_metric_id,
    content=new_prompt,
)

client.metrics.prompts.set_active_prompt_version(
    metric_id=clarity_metric_id,
    version=2,
)

new_evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="Why does my car make a strange noise?",
    model_output="Something might be loose, broken, or worn out",
    metric_name="clarity",
)

You can still use the old version of the prompt by specifying the version number:

new_evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="Why does my car make a strange noise?",
    model_output="Something might be loose, broken, or worn out",
    metric_name="clarity",
    prompt_version=1,
)
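
To check whether a prompt revision actually improves results, you can evaluate the same input/output pair against each prompt version and compare the scores. The loop below is a sketch that reuses the calls above and assumes the score is exposed at result.evaluation.score:

# Sketch: compare prompt versions on the same example. The result attribute
# path is an assumption; see the API Reference for the exact response shape.
for version in (1, 2):
    result = client.evaluation.create(
        model_id="atla-selene",
        model_input="Why does my car make a strange noise?",
        model_output="Something might be loose, broken, or worn out",
        metric_name="clarity",
        prompt_version=version,
    )
    print(f"prompt v{version}: score={result.result.evaluation.score}")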

Custom metrics can also be created without any coding via the Eval Copilot (beta).

Learn more about using custom metrics in our API Reference.