Evaluate your LLM application against reference answers
Atla provides specific metrics for evaluating against a reference (also known as ground truth) that help assess the accuracy and completeness of the AI’s responses. These metrics check that generated answers align well with the reference answers provided. Atla scores results on a 1-5 Likert scale.
You can pass any or all of these ground truth metrics to the evaluate function of the Atla client, which will perform an independent evaluation for each metric over the input and generated output provided.
Metrics prefixed with atla_ additionally use a multi-step process to achieve more reliable scores on long and complex samples; this method uses slightly more tokens.
| Metric Name | Description |
| --- | --- |
| `hallucination` | Assesses the presence of incorrect or unrelated content in the AI’s response. |
| `atla_precision` | Assesses the relevance of all the information in the response. |
| `atla_recall` | Measures how completely the response captures the key facts and details. |
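As a minimal sketch of the calling pattern described above: the function and parameter names below are hypothetical, and a stub stands in for the real Atla client so the example is runnable. Consult the client's actual API reference for the real signature.

```python
# Hypothetical sketch of passing ground-truth metrics to an evaluate
# function; the real Atla client's signature may differ.
GROUND_TRUTH_METRICS = ["hallucination", "atla_precision", "atla_recall"]

def evaluate(model_input: str, response: str,
             reference: str, metrics: list[str]) -> dict[str, int]:
    # Stub: the real client scores each metric independently
    # on a 1-5 Likert scale using an LLM evaluator.
    return {metric: 5 for metric in metrics}

scores = evaluate(
    model_input="What is the capital of France?",
    response="Paris is the capital of France.",
    reference="The capital of France is Paris.",
    metrics=GROUND_TRUTH_METRICS,
)
```

Each metric in the list yields its own independent 1-5 score for the same input/output pair.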
When given a reference, hallucination assesses incorrect or unrelated content in the AI’s response compared to the reference response (the "ground truth" answer). The hallucination score reflects the proportion of claims in the AI answer that also appear in the reference response, out of the total number of claims made in the AI’s answer. A higher score indicates that most claims found in the AI answer are also in the reference. A lower score may indicate factual inaccuracies or unfounded claims.
Hallucination = |Number of claims in the AI answer that are also in the reference response| / |Total number of claims in the AI answer|
When given a reference, Atla precision evaluates the specificity and directness of the information provided in the response compared to the reference response (the "ground truth" answer). The precision score is the ratio of claims in the AI answer that are also in the reference response to the total number of claims made in the AI answer. A high score signifies that the content is specifically relevant to the question with minimal extraneous information. A low score indicates the presence of superfluous or less relevant details. This score focuses on the accuracy and relevance of the details within the answer, not the volume of information.
Precision = |Number of claims in the AI answer that are also in the reference| / |Total number of claims in the AI answer|
When given a reference, Atla recall assesses how completely the response covers the requested information compared to the reference response (the "ground truth" answer). The recall score is the proportion of claims in the AI answer that are also in the reference response, out of the total number of claims made in the reference response. A high score indicates a comprehensive response that includes all necessary details. A low score signifies an incomplete response lacking in specifics. This score reflects the completeness of the information given with respect to the user’s question, not just its presence.
Recall = |Number of claims in the AI answer that are also in the reference| / |Total number of claims in the reference|
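The three ratios above can be illustrated with plain set arithmetic over extracted claims. Note this is only a toy illustration: the real evaluator extracts and matches claims with an LLM and reports a 1-5 Likert score, not a raw ratio, and the claim strings below are invented for the example.

```python
# Illustrative computation of the three reference-based ratios from
# already-extracted claim sets (claim extraction itself is done by
# the evaluator, not shown here).
def overlap_scores(answer_claims: set[str],
                   reference_claims: set[str]) -> dict[str, float]:
    shared = answer_claims & reference_claims
    return {
        # Proportion of AI-answer claims that also appear in the reference.
        "hallucination": len(shared) / len(answer_claims),
        "precision": len(shared) / len(answer_claims),
        # Proportion of reference claims recovered by the AI answer.
        "recall": len(shared) / len(reference_claims),
    }

answer = {"Paris is the capital of France",
          "Paris hosted the 1900 Olympics"}
reference = {"Paris is the capital of France",
             "France is in Europe"}
scores = overlap_scores(answer, reference)
# One shared claim out of two answer claims and two reference claims,
# so all three ratios are 0.5 in this toy example.
```

One unsupported claim in the answer lowers hallucination and precision; one reference claim missing from the answer lowers recall.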