# Default Metrics

## Using our default metrics
Our default metrics cover common evaluation scenarios and are ready to use out of the box. The Requires column lists the fields each metric needs, as illustrated in the sketch after the table.
| Metric | Description | Scale | When to use | Requires |
|---|---|---|---|---|
| `atla_default_conciseness` | How concise and to-the-point the LLM’s response is. | Likert (1-5) | When you want to evaluate whether responses are brief and efficient. | `model_input`, `model_output` |
| `atla_default_correctness` | How factually accurate the LLM’s response is. | Binary | When you want to check whether responses contain correct information. | `model_input`, `model_output`, `expected_model_output` |
| `atla_default_faithfulness` | How faithful the LLM is to the provided context. | Likert (1-5) | When you want to check for hallucinations. | `model_input`, `model_output`, `model_context` |
| `atla_default_helpfulness` | How effectively the LLM’s response addresses the user’s needs. | Likert (1-5) | When you want to assess practical value to users. | `model_input`, `model_output` |
| `atla_default_logical_coherence` | How well-reasoned and internally consistent the response is. | Likert (1-5) | When you want to check whether responses follow logical reasoning. | `model_input`, `model_output` |
| `atla_default_relevance` | How well the response addresses the specific query or context. | Likert (1-5) | When you want to ensure responses stay on topic. | `model_input`, `model_output` |
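
The Requires column maps directly to the fields you pass when requesting an evaluation. Below is a minimal sketch assuming a hypothetical Python client with an `evaluate()` method; the import, client name, and call signature are illustrative assumptions, not the documented SDK API.

```python
# Hypothetical client: the package, class, and method names below are
# assumptions for illustration and may differ from the actual SDK.
from atla import Atla  # assumed import

client = Atla(api_key="YOUR_API_KEY")

# atla_default_faithfulness requires model_input, model_output, and model_context.
result = client.evaluate(
    metric="atla_default_faithfulness",
    model_input="What is the capital of France?",
    model_output="The capital of France is Paris.",
    model_context="France is a country in Western Europe. Its capital is Paris.",
)

print(result.score)  # Likert (1-5): 5 indicates a fully faithful response
```
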
## Scoring Scales
Our evaluator models produce scores that indicate how well an LLM interaction performs against a specific metric.
The interpretation of our default metric scales is as follows:
### Binary
| 0 | 1 |
|---|---|
| Failure or Incorrect | Success or Correct |
### Likert (1-5)
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Very Poor | Poor | Acceptable | Good | Excellent |
Understanding these scales helps you interpret evaluation results and compare performance across different prompts and models.
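
For example, a small helper that converts raw scores into the labels above can make results easier to scan in reports. The mappings below simply restate the two tables and are not part of the Atla API.

```python
# Reporting helper only: these mappings mirror the score tables above.
LIKERT_LABELS = {1: "Very Poor", 2: "Poor", 3: "Acceptable", 4: "Good", 5: "Excellent"}
BINARY_LABELS = {0: "Failure or Incorrect", 1: "Success or Correct"}

def label_score(metric: str, score: int) -> str:
    """Translate a numeric score into its human-readable label."""
    # atla_default_correctness is the only default metric scored on the binary scale.
    labels = BINARY_LABELS if metric == "atla_default_correctness" else LIKERT_LABELS
    return labels[score]

print(label_score("atla_default_faithfulness", 5))  # Excellent
print(label_score("atla_default_correctness", 1))   # Success or Correct
```
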