Overview
Dive deeper into Atla evals
Inputs
Base Input Variables
Every evaluation requires the following variables:
Input Variable | Description |
---|---|
model_id | The Atla evaluator model to use. See our models page for details. |
model_input | The user input to your GenAI app (e.g., a question, instruction, or chat dialogue). |
model_output | The response from your LLM. |
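To make the base inputs concrete, here is a minimal sketch of a single evaluation call using the Python SDK. The client and method names, the example model ID, and the extra `evaluation_criteria` argument (introduced in the next section) are assumptions based on the SDK's general shape; check the SDK reference and the models page for exact details.

```python
# Minimal sketch, assuming the Atla Python SDK ("pip install atla") and an
# ATLA_API_KEY environment variable. Exact client/method names and the
# model ID are assumptions -- check the SDK reference and the models page.
from atla import Atla

client = Atla()  # reads ATLA_API_KEY from the environment

response = client.evaluation.create(
    model_id="atla-selene",                          # the Atla evaluator model
    model_input="What is the capital of France?",    # user input to your GenAI app
    model_output="Paris is the capital of France.",  # your LLM's response
    # ...plus a metric_name OR evaluation_criteria, described below.
    evaluation_criteria="Score 1 if the response is factually correct, otherwise 0.",
)
```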
Evaluation Metrics vs Evaluation Criteria
After selecting an evaluator model and specifying the LLM interaction, you need to define what to evaluate.
Metrics or evaluation criteria that you create and refine for one model are optimized for that model. We advise against reusing the same criteria across models interchangeably without testing.
Use either an evaluation metric or evaluation criteria, not both.
Choose one of these approaches:
- Evaluation metrics: Use a prompt that captures a specific metric (e.g., `logical_coherence`) to evaluate the LLM interaction. You can use our default metrics or create your own custom metrics.

  Input Variable | Description |
  ---|---|
  metric_name | The name of the metric to use. See our metrics page for details. |

- Evaluation criteria: For rapid experimentation or to use an existing evaluation prompt, provide evaluation criteria directly (see the sketch after this list):

  Input Variable | Description |
  ---|---|
  evaluation_criteria | A prompt instruction defining how to evaluate the LLM interaction. |
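Concretely, the two approaches differ only in which input variable carries the evaluation definition. A minimal sketch, reusing the assumed client from the earlier example; the exact call signature remains an assumption:

```python
# The two approaches, side by side (use one or the other, never both).
# Same assumed client as in the earlier sketch.
user_question = "Is the moon a planet?"  # placeholder app input
llm_answer = "No, the moon is a natural satellite of Earth."

# Option 1: a named metric -- its prompt defines the evaluation.
metric_eval = client.evaluation.create(
    model_id="atla-selene",
    model_input=user_question,
    model_output=llm_answer,
    metric_name="logical_coherence",
)

# Option 2: evaluation criteria passed directly as a prompt instruction.
criteria_eval = client.evaluation.create(
    model_id="atla-selene",
    model_input=user_question,
    model_output=llm_answer,
    evaluation_criteria="Score 1 if the answer is logically coherent, otherwise 0.",
)
```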
When using `atla-selene-mini` as your evaluator model, we strongly advise using the recommended template for best results.

The `evaluation_criteria` template for Selene Mini has the following 3 components:
- Description of the evaluation
- List of scores and their corresponding criteria
- A sentence that specifies constraints on the score. This sentence should contain the string `Your score should be`, followed by the corresponding criteria for the binary or Likert type.
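As an illustration only (not the official template text), a binary `evaluation_criteria` prompt with the three components might read:

```
Evaluate whether the response answers the user's question accurately.

Score 0: The response is inaccurate or does not answer the question.
Score 1: The response is accurate and fully answers the question.

Your score should be either 0 or 1.
```

The first line is the description, the score lines map each score to its criteria, and the final sentence is the constraint (for a Likert scale it would instead read, e.g., `Your score should be between 1 and 5.`).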
Additional Inputs
Depending on your evaluation, you may need to provide additional inputs.
RAG Contexts
For RAG evaluations, provide the context available to the model:
Input Variable | Description |
---|---|
model_context | The context provided to the LLM for grounding. |
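A hedged extension of the earlier sketch for a RAG evaluation; `model_context` is simply passed alongside the base inputs:

```python
# Sketch: a RAG evaluation grounded in retrieved context (same assumed client).
rag_eval = client.evaluation.create(
    model_id="atla-selene",
    model_input="When was the Eiffel Tower completed?",
    model_output="It was completed in 1889.",
    model_context="The Eiffel Tower was built between 1887 and 1889.",  # retrieved passage(s)
    evaluation_criteria="Score 1 if the response is fully supported by the context, otherwise 0.",
)
```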
Reference Answers
When a reference answer is available, we recommend including it in your evaluation:
Input Variable | Description |
---|---|
expected_model_output | A reference “ground truth” that meets the evaluation criteria. |
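When a ground-truth answer exists, pass it as `expected_model_output`; a sketch under the same SDK assumptions as above:

```python
# Sketch: evaluating against a known-good reference answer (same assumed client).
ref_eval = client.evaluation.create(
    model_id="atla-selene",
    model_input="What is the boiling point of water at sea level in Celsius?",
    model_output="Water boils at 100 degrees Celsius at sea level.",
    expected_model_output="100 °C",  # reference "ground truth"
    evaluation_criteria="Score 1 if the response agrees with the reference answer, otherwise 0.",
)
```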
Few-Shot Examples
Providing few-shot examples is one of the best ways to align your evaluation, regardless of your use case:
Input Variable | Description |
---|---|
few_shot_examples | A list of examples with known evaluation scores. |
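A sketch of passing scored examples; the per-example field names below mirror the base inputs plus a score and are an assumption, so check the SDK reference for the exact schema:

```python
# Sketch: aligning the evaluator with scored examples (same assumed client).
# The per-example field names and the score type are assumptions.
few_shot_eval = client.evaluation.create(
    model_id="atla-selene",
    model_input="Summarise the article in one sentence.",
    model_output="The article argues that remote work can boost productivity.",
    evaluation_criteria="Score 1 if the summary is faithful and concise, otherwise 0.",
    few_shot_examples=[
        {
            "model_input": "Summarise the article in one sentence.",
            "model_output": "Remote work article.",
            "score": 0,  # too terse to qualify as a summary
        },
        {
            "model_input": "Summarise the article in one sentence.",
            "model_output": "The article reviews evidence that remote work improves focus.",
            "score": 1,
        },
    ],
)
```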
Evaluator Output
Each evaluation produces two outputs:
Output Variable | Description |
---|---|
score | A numerical score indicating how well the LLM interaction meets the criteria. |
critique | A brief explanation justifying the score. |
Atla models generate the critique before deciding on a score.
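Reading the two outputs from the first sketch's response; the attribute path below is an assumption about how the SDK nests the result, so inspect the actual response object if it differs:

```python
# Sketch: reading the evaluator's outputs. The attribute path is an
# assumption about how the SDK nests the result.
evaluation = response.result.evaluation
print(f"Score: {evaluation.score}")        # numerical score against the criteria
print(f"Critique: {evaluation.critique}")  # explanation generated before the score
```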