# Default Metrics

## Using our default metrics
Our default metrics cover common evaluation scenarios and are ready to use out of the box.
| Metric | Description | Scale | When to use | Requires |
|---|---|---|---|---|
| `atla_default_conciseness` | How concise and to-the-point the LLM’s response is. | Likert (1-5) | When you want to evaluate whether responses are brief and efficient. | `model_input`, `model_output` |
| `atla_default_correctness` | How factually accurate the LLM’s response is. | Binary | When you want to check whether responses contain correct information. | `model_input`, `model_output`, `expected_model_output` |
| `atla_default_faithfulness` | How faithful the LLM is to the provided context. | Likert (1-5) | When you want to check for hallucinations. | `model_input`, `model_output`, `model_context` |
| `atla_default_helpfulness` | How effectively the LLM’s response addresses the user’s needs. | Likert (1-5) | When you want to assess practical value to users. | `model_input`, `model_output` |
| `atla_default_logical_coherence` | How well-reasoned and internally consistent the response is. | Likert (1-5) | When you want to check whether responses follow logical reasoning. | `model_input`, `model_output` |
| `atla_default_relevance` | How well the response addresses the specific query or context. | Likert (1-5) | When you want to ensure responses stay on topic. | `model_input`, `model_output` |
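For example, `atla_default_faithfulness` needs the user query, the LLM’s response, and the context the response should be grounded in. The snippet below is a minimal sketch of assembling those required fields into a request payload; the `evaluate` helper is a hypothetical placeholder, not part of any SDK.

```python
# Minimal sketch of assembling the fields required by atla_default_faithfulness.
# The `evaluate` helper is a hypothetical placeholder; swap in your actual
# evaluation client or HTTP call.

def evaluate(metric: str, **fields: str) -> dict:
    """Bundle the metric name and its required fields into a request payload."""
    return {"metric": metric, **fields}


request = evaluate(
    "atla_default_faithfulness",
    model_input="What is the capital of France?",           # the user's query
    model_output="The capital of France is Paris.",         # the LLM's response
    model_context="Paris is the capital city of France.",   # retrieved context
)
print(request)
```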
## Scoring Scales
Our evaluator models produce scores that indicate how well an LLM interaction performs against a specific metric.
The interpretation of our default metric scales is as follows:
### Binary

| 0 | 1 |
|---|---|
| Failure or Incorrect | Success or Correct |
### Likert (1-5)

| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Very Poor | Poor | Acceptable | Good | Excellent |
Understanding these scales helps you interpret evaluation results and compare performance across different prompts and models.
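For instance, the helpers below map raw scores to the labels above and aggregate a batch of results. They assume scores arrive as plain integers and are not tied to any particular SDK.

```python
# Illustrative helpers for interpreting default-metric scores.
# Assumes scores are plain integers: 0/1 for Binary, 1-5 for Likert.

LIKERT_LABELS = {1: "Very Poor", 2: "Poor", 3: "Acceptable", 4: "Good", 5: "Excellent"}
BINARY_LABELS = {0: "Failure or Incorrect", 1: "Success or Correct"}


def likert_label(score: int) -> str:
    """Map a Likert (1-5) score to its qualitative label."""
    return LIKERT_LABELS[score]


def binary_pass_rate(scores: list[int]) -> float:
    """Fraction of Binary (0/1) scores that were successful."""
    return sum(scores) / len(scores)


# Example: comparing two prompts on a Likert (1-5) metric such as
# atla_default_helpfulness, and summarising a Binary metric.
prompt_a = [4, 5, 3, 4]
prompt_b = [2, 3, 3, 2]
print(sum(prompt_a) / len(prompt_a))   # 4.0 -> roughly "Good" on average
print(sum(prompt_b) / len(prompt_b))   # 2.5 -> between "Poor" and "Acceptable"
print(likert_label(5))                 # "Excellent"
print(BINARY_LABELS[1])                # "Success or Correct"
print(binary_pass_rate([1, 1, 0, 1]))  # 0.75
```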
When creating custom metrics, you can use any scoring scale, though we recommend using one of the scales above.
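If you do define a custom metric, one simple convention is to record its scale and required fields alongside the metric itself. The sketch below is purely illustrative; `CustomMetric` is a local placeholder type, not part of any SDK.

```python
# Purely illustrative: recording a custom metric's scale and required fields.
# `CustomMetric` is a local placeholder, not part of any SDK.
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomMetric:
    name: str
    description: str
    scale: str                        # e.g. "binary" or "likert_1_5" (recommended scales)
    required_fields: tuple[str, ...]


politeness = CustomMetric(
    name="politeness",
    description="How courteous and respectful the LLM's response is.",
    scale="likert_1_5",
    required_fields=("model_input", "model_output"),
)
```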