Evaluation metrics are the core of Atla.

They define the different dimensions of quality you want to evaluate and are the primary way to customize your evaluations.

Understanding Metrics

“Good” and “bad” performance are fundamentally vague, highly context-dependent concepts. At their core, Atla evaluation metrics are a way to define what constitutes “good” and “bad” performance for your use case.

Each metric defines an aspect of performance that you want to evaluate against. For example, a copywriting AI model might be evaluated on aspects such as clarity and engagement, whereas a medical AI model might be evaluated on aspects such as clinical_relevance and legal_compliance.

To run an evaluation with a metric, you can specify the metric_name parameter as follows:

from atla import Atla

# Initialize the client (expects your Atla API key, e.g. via an environment variable)
client = Atla()

# Run an evaluation against the default correctness metric
evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="What is the capital of France?",
    model_output="Paris",
    expected_model_output="The capital of France is Paris.",
    metric_name="atla_default_correctness",
)
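The returned evaluation object contains the judge's verdict. As a minimal sketch — the result, score, and critique attribute names below are assumptions about the response shape, so check the API reference for the exact schema — you can inspect the verdict like this:

# Attribute names are assumptions; consult the API reference for the exact response schema.
print(evaluation.result.evaluation.score)     # e.g. 1 if the answer is judged correct
print(evaluation.result.evaluation.critique)  # the judge's reasoning behind the score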

Atla provides a set of pre-configured default metrics, which are enumerated here. These can be plugged into your evaluations as-is and are optimized for immediate use.

Once you are familiar with the basics, you can create custom metrics and tailor them to your specific evaluation needs.
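Once created, a custom metric is used exactly like a default one: pass its name via the metric_name parameter. For illustration only — my_clinical_relevance below is a hypothetical custom metric name, not one that ships with Atla:

evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="What are the common side effects of ibuprofen?",
    model_output="Common side effects include nausea and heartburn.",
    metric_name="my_clinical_relevance",  # hypothetical custom metric created beforehand
)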

For cases where you need one-off evaluation criteria, you can bypass metric creation entirely by passing an evaluation prompt directly via the evaluation_criteria parameter:

from atla import Atla

client = Atla()

# Evaluate against a one-off prompt instead of a pre-defined metric
evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="What is the capital of France?",
    model_output="Paris",
    expected_model_output="The capital of France is Paris.",
    evaluation_criteria="Rate whether the response is correct. Give a score of 1 if it is correct, or 0 if it is incorrect.",
)