Ground truth evaluations measure how well AI responses match known correct answers. By comparing generated responses against reference answers, you can quantify accuracy and verify that outputs meet expected standards.

Using Reference Answers

When you have a reference answer, Atla can evaluate LLM outputs against it using the expected_model_output parameter:

from atla import Atla

client = Atla()

# Evaluate the model output against the reference answer supplied via
# the expected_model_output parameter.
evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="What is compound interest?",
    model_output="Compound interest is when you earn interest on both your initial investment and previously earned interest.",
    expected_model_output="Compound interest is interest earned on both the principal amount and accumulated interest from previous periods.",
    metric_name="atla_default_correctness",
).result.evaluation

# A score of 1 means the output matches the reference; anything else is
# treated as incorrect.
if int(evaluation.score) == 1:
    print("The response is correct given the provided reference.")
else:
    print("The response is incorrect given the provided reference.")

print(f"Atla's critique: {evaluation.critique}")

For evaluations against known correct answers, we recommend using the default atla_default_correctness metric.
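To measure accuracy over a larger set of reference answers, you can run the same call for each example and aggregate the binary scores. The sketch below is a minimal example under assumptions: the dataset (a list of dictionaries with input, output, and reference keys) is hypothetical, and the only API call used is the client.evaluation.create call shown above.

from atla import Atla

client = Atla()

# Hypothetical ground-truth dataset: each item pairs a question with the
# model's answer and a known correct reference answer.
dataset = [
    {
        "input": "What is compound interest?",
        "output": "Compound interest is interest earned on both principal and prior interest.",
        "reference": "Compound interest is interest earned on both the principal amount and accumulated interest from previous periods.",
    },
    # ... more examples ...
]

correct = 0
for example in dataset:
    evaluation = client.evaluation.create(
        model_id="atla-selene",
        model_input=example["input"],
        model_output=example["output"],
        expected_model_output=example["reference"],
        metric_name="atla_default_correctness",
    ).result.evaluation
    # Count the example as correct when the binary score is 1.
    correct += int(evaluation.score) == 1

print(f"Accuracy: {correct / len(dataset):.0%}")

Aggregating the binary correctness scores this way gives a single accuracy figure you can track across model or prompt changes.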