Inputs

Base Input Variables

Every evaluation requires the following variables:

| Input Variable | Description |
| --- | --- |
| `model_id` | The Atla evaluator model to use. See our models page for details. |
| `model_input` | The user input to your GenAI app (e.g., a question, instruction, or chat dialogue). |
| `model_output` | The response from your LLM. |
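
As a concrete illustration, a minimal request body might look like the Python sketch below. The field names come from the table above; the values are illustrative and the model name is a placeholder, not a real `model_id`:

```python
# The three required base variables, assembled as a request payload.
# Field names come from the table above; all values are illustrative.
payload = {
    "model_id": "atla-evaluator",  # placeholder; see the models page for real IDs
    "model_input": "What is the capital of France?",
    "model_output": "The capital of France is Paris.",
}
```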

Evaluation Metrics vs Evaluation Criteria

After selecting an evaluator model and specifying the LLM interaction, you need to define what to evaluate. Choose one of these approaches:

1. Evaluation metrics

Use a prompt that captures a specific metric (e.g., `logical_coherence`) to evaluate the LLM interaction. You can use our default metrics or create your own custom metrics:

| Input Variable | Description |
| --- | --- |
| `metric_name` | The name of the metric to use. See our metrics page for details. |

2. Evaluation criteria

For rapid experimentation or to use an existing evaluation prompt, provide evaluation criteria directly:

| Input Variable | Description |
| --- | --- |
| `evaluation_criteria` | A prompt instruction defining how to evaluate the LLM interaction. |

Use either an evaluation metric or evaluation criteria, not both; the sketch below shows the two request shapes side by side.
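
A hedged Python sketch of the two shapes. The field names come from the tables above, while the model name, metric value, and criteria text are illustrative assumptions:

```python
# Shared base variables (values are illustrative).
base = {
    "model_id": "atla-evaluator",  # placeholder; see the models page
    "model_input": "What is the capital of France?",
    "model_output": "The capital of France is Paris.",
}

# Option 1: name a default or custom metric.
metric_request = {**base, "metric_name": "logical_coherence"}

# Option 2: provide the evaluation instruction inline instead.
criteria_request = {
    **base,
    "evaluation_criteria": "Score how factually accurate the response is.",
}

# A single request carries metric_name OR evaluation_criteria, never both.
```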

Additional Inputs

Depending on your evaluation, you may need to provide additional inputs.

RAG Contexts

For RAG evaluations, provide the context available to the model:

| Input Variable | Description |
| --- | --- |
| `model_context` | The context provided to the LLM for grounding. |
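
As a sketch, a RAG evaluation request simply adds the retrieved passage under `model_context`; all values below are illustrative:

```python
# Evaluating groundedness against retrieved context (values illustrative).
rag_request = {
    "model_id": "atla-evaluator",  # placeholder
    "model_input": "When did Paris become the capital of France?",
    "model_output": "Paris became the capital in 987 CE.",
    "model_context": "In 987 CE, Hugh Capet chose Paris as the seat of his rule.",
    "evaluation_criteria": "Is the response grounded in the provided context?",
}
```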

Reference Answers

When reference answers are available, including them in the evaluation is recommended:

| Input Variable | Description |
| --- | --- |
| `expected_model_output` | A reference “ground truth” answer that meets the evaluation criteria. |
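
A sketch of a request that scores against a reference answer; the values are illustrative:

```python
# Scoring against a known-good reference answer (values illustrative).
reference_request = {
    "model_id": "atla-evaluator",  # placeholder
    "model_input": "What is the capital of France?",
    "model_output": "The capital of France is Paris.",
    "expected_model_output": "Paris.",  # the "ground truth" reference
    "evaluation_criteria": "Does the response agree with the reference answer?",
}
```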

Few-Shot Examples

Providing few-shot examples is one of the most effective ways to align the evaluator with your expectations, regardless of your use case:

| Input Variable | Description |
| --- | --- |
| `few_shot_examples` | A list of examples with known evaluation scores. |
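
A sketch of a request with few-shot examples. The key names inside each example are assumptions for illustration; the exact schema is in the API reference:

```python
# Aligning the evaluator with scored examples (values illustrative).
few_shot_request = {
    "model_id": "atla-evaluator",  # placeholder
    "model_input": "Name the largest ocean.",
    "model_output": "The Pacific Ocean.",
    "evaluation_criteria": "Score 1-5 for factual accuracy.",
    "few_shot_examples": [
        # The keys below are assumed for illustration; check the API
        # reference for the exact example schema.
        {"model_input": "Name the smallest planet.",
         "model_output": "Mercury.",
         "score": 5},
        {"model_input": "Name the smallest planet.",
         "model_output": "Pluto.",
         "score": 1},
    ],
}
```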

Evaluator Output

Each evaluation produces two outputs:

| Output Variable | Description |
| --- | --- |
| `score` | A numerical score indicating how well the LLM interaction meets the criteria. |
| `critique` | A brief explanation justifying the score. |
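
For instance, a parsed evaluation result might look like the following sketch. The two field names match the table above, though the exact response envelope is defined in the API reference:

```python
# Hypothetical parsed response with the two documented output fields.
evaluation = {
    "score": 4,
    "critique": "Accurate and concise, but does not cite the provided context.",
}
print(f"score={evaluation['score']}: {evaluation['critique']}")
```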

See our API reference for complete details.

Atla models generate the critique before deciding on a score.