Evaluation Components
Understanding Atla’s evaluation framework
Inputs
Base Input Variables
Every evaluation requires the following variables:
| Input Variable | Description |
|---|---|
| `model_id` | The Atla evaluator model to use. See our models page for details. |
| `model_input` | The user input to your GenAI app (e.g., a question, instruction, or chat dialogue). |
| `model_output` | The response from your LLM. |
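To make this concrete, the sketch below assembles the base variables into a Python dictionary mirroring a request payload. It is a minimal sketch: the model id, the example question and answer, and the payload shape itself are illustrative assumptions, so consult the models page and API reference for exact values and the actual call interface.

```python
# Illustrative payload built from the base variables above. The model id
# "atla-selene" is a placeholder; see the models page for available ids.
base_request = {
    "model_id": "atla-selene",
    "model_input": "What causes ocean tides?",
    "model_output": (
        "Tides are caused mainly by the Moon's gravitational pull on "
        "Earth's oceans, with a smaller contribution from the Sun."
    ),
}
```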
Evaluation Metrics vs Evaluation Criteria
After selecting an evaluator model and specifying the LLM interaction, you need to define what to evaluate. Choose one of these approaches:
- Evaluation metrics
Use a prompt that captures a specific metric (e.g., `logical_coherence`) to evaluate the LLM interaction.
You can use our default metrics or create your own custom metrics:
| Input Variable | Description |
|---|---|
| `metric_name` | The name of the metric to use. See our metrics page for details. |
- Evaluation criteria
For rapid experimentation or to use an existing evaluation prompt, provide evaluation criteria directly:
| Input Variable | Description |
|---|---|
| `evaluation_criteria` | A prompt instruction defining how to evaluate the LLM interaction. |
Use either an evaluation metric or evaluation criteria, not both. Both options are sketched below.
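Continuing the `base_request` sketch above, the snippet below contrasts the two approaches. The metric name and the criteria text are illustrative assumptions, not defaults shipped with Atla.

```python
# Option 1: name a metric (default or custom) whose prompt drives the
# evaluation; "logical_coherence" follows the example mentioned above.
metric_request = {**base_request, "metric_name": "logical_coherence"}

# Option 2: pass the evaluation instruction directly. Useful for rapid
# experimentation or for reusing an existing evaluation prompt.
criteria_request = {
    **base_request,
    "evaluation_criteria": (
        "Score 1-5: is the response factually accurate and does it "
        "directly answer the user's question?"
    ),
}

# A single request carries metric_name OR evaluation_criteria, never both.
```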
Additional Inputs
Depending on your evaluation, you may need to provide additional inputs.
RAG Contexts
For RAG evaluations, provide the context available to the model:
| Input Variable | Description |
|---|---|
| `model_context` | The context provided to the LLM for grounding. |
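For instance, a RAG request extends the base payload with the retrieved context. In this sketch the metric name `faithfulness` is a hypothetical stand-in; check the metrics page for the metrics actually available.

```python
# RAG evaluation: include the retrieved context so the evaluator can check
# whether the output is grounded in it.
rag_request = {
    **base_request,
    "metric_name": "faithfulness",  # hypothetical metric name, for illustration
    "model_context": (
        "Tides result from the combined gravitational forces of the Moon "
        "and the Sun acting on Earth's rotating oceans."
    ),
}
```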
Reference Answers
When a reference answer is available, we recommend including it in your evaluation:
| Input Variable | Description |
|---|---|
| `expected_model_output` | A reference “ground truth” answer that meets the evaluation criteria. |
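A sketch of a reference-based request, extending the `metric_request` from earlier; the reference text is illustrative:

```python
# Reference-based evaluation: supply a known-good answer for the evaluator
# to compare against, which typically makes scoring more reliable.
reference_request = {
    **metric_request,
    "expected_model_output": (
        "Tides are caused primarily by the Moon's gravity acting on the "
        "oceans, with the Sun contributing a smaller effect."
    ),
}
```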
Few-Shot Examples
Providing few-shot examples is one of the best ways to align your evaluation, regardless of your use case:
| Input Variable | Description |
|---|---|
| `few_shot_examples` | A list of examples with known evaluation scores. |
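The sketch below shows one plausible shape for `few_shot_examples`, again extending `metric_request`. The per-example field names and the scoring scale are assumptions for illustration; the API reference documents the exact schema.

```python
# Few-shot alignment: scored examples show the evaluator what good and bad
# outputs look like for your use case. Field names here are assumed; match
# the score scale to your metric or criteria.
few_shot_request = {
    **metric_request,
    "few_shot_examples": [
        {
            "model_input": "What causes ocean tides?",
            "model_output": "Tides happen because the ocean is salty.",
            "score": 1,
            "critique": "Factually wrong; salinity does not cause tides.",
        },
        {
            "model_input": "What causes ocean tides?",
            "model_output": "The Moon's and Sun's gravity pull on Earth's oceans.",
            "score": 5,
            "critique": "Accurate and directly answers the question.",
        },
    ],
}
```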
Evaluator Output
Each evaluation produces two outputs:
| Output Variable | Description |
|---|---|
| `score` | A numerical score indicating how well the LLM interaction meets the criteria. |
| `critique` | A brief explanation justifying the score. |
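As a minimal sketch of consuming these outputs (the response shape shown is illustrative, not the exact structure returned by the API):

```python
# Illustrative evaluator response containing the two documented fields.
evaluation_result = {
    "score": 4,
    "critique": "Accurate and coherent, though it omits the Sun's smaller role.",
}

print(f"Score: {evaluation_result['score']}")        # rating against the criteria
print(f"Critique: {evaluation_result['critique']}")  # justification for the score
```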
See our API reference for complete details.