Dive deeper into Atla evals
Every evaluation requires the following variables:
Input Variable | Description |
---|---|
model_id | The Atla evaluator model to use. See our models page for details. |
model_input | The user input to your GenAI app (e.g., a question, instruction, or chat dialogue). |
model_output | The response from your LLM. |
After selecting an evaluator model and specifying the LLM interaction, you need to define what to evaluate.
Metrics or evaluation criteria that you create and refine for one model are optimized for that model. We advise against using models interchangeably on the same criteria without testing.
Use either an evaluation metric or evaluation criteria, not both.
Choose one of these approaches:
Evaluation metrics
Use a prompt that captures a specific metric (e.g., logical_coherence) to evaluate the LLM interaction. You can use our default metrics or create your own custom metrics.
Input Variable | Description |
---|---|
metric_name | The name of the metric to use. See our metrics page for details. |
Example with evaluation metrics
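Below is a minimal sketch of an evaluation that references a metric by name. The input variable names (model_id, model_input, model_output, metric_name) come from the tables above; the Atla client class, the evaluation.create method, and the example values are assumptions for illustration and may differ from the actual SDK.

```python
# Hypothetical Atla Python client; exact import and method names may differ.
from atla import Atla

client = Atla()  # assumes your Atla API key is configured in the environment

evaluation = client.evaluation.create(
    model_id="atla-selene",                        # Atla evaluator model to use
    model_input="What is the capital of France?",  # user input to your GenAI app
    model_output="The capital of France is Paris.",
    metric_name="logical_coherence",               # a default or custom metric
)
```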
Evaluation criteria
For rapid experimentation or to use an existing evaluation prompt, provide evaluation criteria directly:
Input Variable | Description |
---|---|
evaluation_criteria | A prompt instruction defining how to evaluate the LLM interaction. |
Example with evaluation criteria
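The same call, sketched with the evaluation prompt passed inline via evaluation_criteria instead of a named metric. The client and method names remain hypothetical, as above.

```python
from atla import Atla  # hypothetical import, as in the earlier sketch

client = Atla()

evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="Summarise the attached meeting notes.",
    model_output="The team agreed to ship the beta on Friday.",
    evaluation_criteria=(
        "Evaluate whether the response answers the instruction concisely "
        "and without adding unsupported information. "
        "Your score should be either 0 or 1."
    ),
)
```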
When using atla-selene-mini as your evaluation model, we strongly advise using the recommended template below for best results.
The evaluation_criteria template for Selene Mini has the following three components:
- Description of the evaluation
- List of scores and their corresponding criteria
- A sentence that specifies constraints on the score. This sentence should contain the string "Your score should be" followed by the corresponding criteria for the binary or the Likert type, as shown in the examples below.
Example evaluation_criteria for binary metrics
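An illustrative binary template, following the three components above (description of the evaluation, score list, and the constraint sentence containing "Your score should be"). The wording is a hypothetical example to adapt to your task, not an official prompt.

```python
# Illustrative binary evaluation_criteria; adapt the wording to your own task.
evaluation_criteria = (
    "Evaluate whether the response is factually supported by the provided context.\n"
    "Score 0: The response contains claims that are not supported by the context.\n"
    "Score 1: Every claim in the response is supported by the context.\n"
    "Your score should be 0 or 1."
)
```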
Example evaluation_criteria template for Likert metrics
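And an illustrative 1-5 Likert template with the same three components; again, the score descriptions are hypothetical and should be tailored to your evaluation.

```python
# Illustrative Likert evaluation_criteria; adapt the score descriptions to your task.
evaluation_criteria = (
    "Evaluate the logical coherence of the response.\n"
    "Score 1: The response is incoherent or contradicts itself.\n"
    "Score 3: The response is mostly coherent but contains minor logical gaps.\n"
    "Score 5: The response is fully coherent and well reasoned.\n"
    "Your score should be an integer between 1 and 5."
)
```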
Depending on your evaluation, you may need to provide additional inputs.
For RAG evaluations, provide the context available to the model:
Input Variable | Description |
---|---|
model_context | The context provided to the LLM for grounding. |
Example with model context
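A sketch of a RAG evaluation that passes the retrieved context through model_context. The client, method, and metric name are assumptions for illustration.

```python
from atla import Atla  # hypothetical import, as in the earlier sketches

client = Atla()

evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="When was the Eiffel Tower completed?",
    model_output="The Eiffel Tower was completed in 1889.",
    model_context=(
        "The Eiffel Tower was completed in March 1889 "
        "as the entrance arch to the 1889 World's Fair."
    ),
    metric_name="faithfulness",  # hypothetical metric name for groundedness
)
```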
Where available, providing a reference answer is recommended:
Input Variable | Description |
---|---|
expected_model_output | A reference “ground truth” that meets the evaluation criteria. |
Example with expected model output
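A sketch passing a reference "ground truth" answer via expected_model_output, under the same hypothetical client as above.

```python
from atla import Atla  # hypothetical import, as in the earlier sketches

client = Atla()

evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="What is 12 * 9?",
    model_output="12 * 9 = 108.",
    expected_model_output="108",  # reference answer that meets the criteria
    evaluation_criteria=(
        "Evaluate whether the response matches the expected answer. "
        "Your score should be 0 or 1."
    ),
)
```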
Providing few-shot examples is one of the best ways to align your evaluation, regardless of your use case:
Input Variable | Description |
---|---|
few_shot_examples | A list of examples with known evaluation scores. |
Example with few-shot examples
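A sketch passing few_shot_examples as a list of examples with known scores. The field names inside each example (model_input, model_output, score, critique) are an assumption about the expected structure, not confirmed by this page.

```python
from atla import Atla  # hypothetical import, as in the earlier sketches

client = Atla()

# Assumed structure for scored examples; check the SDK docs for exact field names.
few_shot_examples = [
    {
        "model_input": "What is the capital of Spain?",
        "model_output": "Madrid is the capital of Spain.",
        "score": "1",
        "critique": "Correct and directly answers the question.",
    },
    {
        "model_input": "What is the capital of Spain?",
        "model_output": "Spain is a country in Europe.",
        "score": "0",
        "critique": "Does not answer the question asked.",
    },
]

evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="What is the capital of France?",
    model_output="Paris.",
    evaluation_criteria="Evaluate correctness. Your score should be 0 or 1.",
    few_shot_examples=few_shot_examples,
)
```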
Each evaluation produces two outputs:
Output Variable | Description |
---|---|
score | A numerical score indicating how well the LLM interaction meets the criteria. |
critique | A brief explanation justifying the score. |
Atla models generate the critique before deciding on a score.
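Given an evaluation created as in the sketches above, reading the two outputs might look like the following; the exact shape of the response object is an assumption and may differ in the SDK.

```python
# Field access is illustrative; consult the SDK response model for exact attributes.
print(evaluation.score)     # e.g. "4"
print(evaluation.critique)  # e.g. "The response is coherent but omits one step."
```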