Evaluations against a “ground truth” reference are crucial for determining how well an AI’s responses align with established correct answers. By comparing the generated responses to a known reference, we can measure the accuracy and relevance of the AI’s output, ensuring that it meets the expected standards.

This method of evaluation helps identify discrepancies, such as hallucinations or omissions, and provides insight into where the AI’s performance can be improved. To perform these evaluations with Atla, pass a reference response alongside the input and the generated response, enabling a detailed assessment of the AI’s alignment with the ground truth.

Atla has been trained on specific reference-based metrics, ensuring the best performance when a ground truth is available.

Running evals against a ‘ground truth’ answer

If you have access to a ‘ground truth’ answer, Atla can evaluate responses against it. To do this, pass the reference via the reference parameter.

from atla import Atla

client = Atla()

# A question, a 'ground truth' reference answer, and the model's response to evaluate
messages = {
    "question": "Is NP equal to NP-Hard?",
    "reference": "NP-Hard problems are at least as hard as the hardest NP problems. NP-Hard problems don't have to be in NP, and their relationship is an open question.",
    "response": "Richard Karp expanded on the Cook-Levin Theorem to show that all NP problems can be reduced to NP-Hard problems in polynomial time.",
}

# Evaluate the response against the reference on the hallucination metric
score = client.evaluation.create(
    input=messages["question"],
    response=messages["response"],
    reference=messages["reference"],
    metrics=["hallucination"],
)

print(f"Atla's score: {score.evaluations['hallucination'].score} / 5")
print(f"Atla's critique: {score.evaluations['hallucination'].critique}")