Our default metrics cover common evaluation scenarios and are optimized for immediate use.

| Metric | Description | Scale | When to use | Requires |
| --- | --- | --- | --- | --- |
| atla_default_conciseness | How concise and to-the-point the LLM's response is. | Likert (1-5) | When you want to evaluate whether responses are brief and efficient. | model_input, model_output |
| atla_default_correctness | How factually accurate the LLM's response is. | Binary | When you want to check whether responses contain correct information. | model_input, model_output, expected_model_output |
| atla_default_faithfulness | How faithful the LLM is to the provided context. | Likert (1-5) | When you want to check for hallucinations. | model_input, model_output, model_context |
| atla_default_helpfulness | How effectively the LLM's response addresses the user's needs. | Likert (1-5) | When you want to assess practical value to users. | model_input, model_output |
| atla_default_logical_coherence | How well-reasoned and internally consistent the response is. | Likert (1-5) | When you want to check whether responses follow logical reasoning. | model_input, model_output |
| atla_default_relevance | How well the response addresses the specific query or context. | Likert (1-5) | When you want to ensure responses stay on topic. | model_input, model_output |
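For reference, here is a minimal sketch of the data each metric expects, using the field names from the Requires column above. The example values and variable names are purely illustrative; the structures below are plain dictionaries, not an actual SDK call.

```python
# Illustrative only: field names come from the "Requires" column above.
# The example content is made up for demonstration purposes.

# atla_default_correctness needs an expected answer in addition to input/output.
correctness_example = {
    "model_input": "What is the capital of France?",
    "model_output": "The capital of France is Paris.",
    "expected_model_output": "Paris",
}

# atla_default_faithfulness grades the output against the provided context.
faithfulness_example = {
    "model_input": "Summarize the refund policy.",
    "model_output": "Refunds are available within 30 days of purchase.",
    "model_context": "Our refund policy allows returns within 30 days of purchase.",
}

# Most other default metrics only need the input and the output.
helpfulness_example = {
    "model_input": "How do I reset my password?",
    "model_output": "Go to Settings > Account > Reset Password and follow the prompts.",
}
```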

Scoring Scales

Our evaluator models produce scores that indicate how well an LLM interaction matches a specific metric.

The interpretation of our default metric scales is as follows:

Binary

| Score | Meaning |
| --- | --- |
| 0 | Failure or Incorrect |
| 1 | Success or Correct |

Likert (1-5)

| Score | Meaning |
| --- | --- |
| 1 | Very Poor |
| 2 | Poor |
| 3 | Acceptable |
| 4 | Good |
| 5 | Excellent |

Understanding these scales helps you interpret evaluation results and compare performance across different prompts and models.
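If you log raw scores, a small helper like the one below can translate them into the labels above for reporting. This is an illustrative snippet only; the function name and scale identifiers are hypothetical and not part of any SDK.

```python
# Illustrative helper: maps raw scores to the interpretation labels above.
BINARY_LABELS = {0: "Failure or Incorrect", 1: "Success or Correct"}
LIKERT_LABELS = {1: "Very Poor", 2: "Poor", 3: "Acceptable", 4: "Good", 5: "Excellent"}

def interpret_score(score: int, scale: str) -> str:
    """Return the human-readable label for a score on the given scale."""
    labels = BINARY_LABELS if scale == "binary" else LIKERT_LABELS
    return labels[score]

print(interpret_score(1, "binary"))   # Success or Correct
print(interpret_score(4, "likert"))   # Good
```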

When creating custom metrics, you can use any scoring scale, though we recommend using one of the scales above.