The Eval Copilot helps you create and refine custom evaluation metrics. Think of it as a tool that helps you prompt-engineer Selene effectively.

Naturally, different use cases require different ways to evaluate AI responses. We trained Selene to be steerable: you can add few-shot examples, make it stricter or more lenient, and much more.

For example, if you’re building an AI therapy chatbot, you can edit the atla_default_helpfulness metric to ensure responses show empathy while avoiding therapeutic claims.

Walkthrough

1. Generate your evaluation prompt

Define your own eval or adapt one of our templates.

  • Tailor the evaluation criteria to your domain / use case
  • Select your desired scoring format in the Metric Type dropdown
  • Select your input variables in the Input Variables dropdown

2. Review your generated prompt

  • Ensure the generated evaluation criteria and scoring rubric align with your objective
  • Add a metric name (used later via the Atla API as a Custom Metric)

3. Add test data

There are two main ways you can add test data to the Eval Copilot:

Upload your CSV

  • Upload your own data via Upload CSV
  • Map the column names that correspond to each input variable

If you don’t have ‘Expected Score’ labels already, you can add these in the UI once your data has been uploaded.
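
As a rough sketch, here is what a small test-data CSV might look like before upload. The column names (question, response, expected_score) and the 1-5 labels are hypothetical: you map the columns to your own input variables and 'Expected Score' field after uploading, and the labels should follow whatever scoring format you chose.

import csv

# Hypothetical test-data CSV for a helpfulness-style metric. Column names are
# arbitrary; you map them to your input variables in the Eval Copilot UI.
rows = [
    {
        "question": "How do I reset my password?",
        "response": "Click 'Forgot password' on the login page and follow the emailed link.",
        "expected_score": 5,  # expected human/expert score (assuming a 1-5 format)
    },
    {
        "question": "How do I reset my password?",
        "response": "Passwords are important for security.",
        "expected_score": 1,
    },
]

with open("test_cases.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "response", "expected_score"])
    writer.writeheader()
    writer.writerows(rows)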

Generate test cases

  • Use Generate a test case to return a synthetic test case
  • Review and adjust the ‘Expected Score’ label as needed

4. Test out the metric

  • Click Run evaluations to return Selene scores on your test data

The Alignment Score measures how closely Selene’s predictions match expected human/expert scores. For reliable deployment in CI/CD pipelines or monitoring systems, aim for moderate (50-75%) or high (≥75%) alignment scores.

The Alignment Score is calculated from the normalized Mean Absolute Error (MAE) of Selene’s predictions against your expected scores: the lower the error, the higher the alignment.
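
For intuition, here is a minimal sketch of that calculation, assuming scores are normalized by the width of the scoring scale and the alignment is reported as one minus the MAE (the Eval Copilot's exact normalization may differ):

def alignment_score(predicted, expected, score_min=1, score_max=5):
    # Sketch only: normalized MAE on an assumed 1-5 scale, reported as a
    # percentage where 100% means Selene matches the expected scores exactly.
    scale = score_max - score_min
    errors = [abs(p - e) / scale for p, e in zip(predicted, expected)]
    mae = sum(errors) / len(errors)
    return (1 - mae) * 100

# e.g. predictions [4, 2, 5] against expected [5, 2, 4] -> ~83% alignment
print(alignment_score([4, 2, 5], [5, 2, 4]))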

5. Align your eval metric

There are two main ways you can align your eval metric to your expected scores:

Adjust your prompt

  • Access your prompt via the Show prompt toggle
  • Directly edit your prompt OR use the Describe how to edit the prompt functionality to let AI make the edit for you

Your prompts are versioned in the Eval Copilot, so you can revert to an earlier version whenever you wish.

Add few-shot examples

  • Click the icon to directly add misaligned test cases (highlighted in amber or red) as few-shot examples

OR

  • Select Edit few-shot examples (beneath your prompt) to access your few-shot library
  • Click Add few-shot to add your own example
  • Use Generate few-shot to return a synthetic example

6. Deploy!

When you are confident that your evaluation metric is calibrated, you can deploy it to be used with the Atla API.

This evaluation metric is custom, so only you will be able to access it via your API key.
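
For example, using the Atla Python SDK (the import and client setup lines below are assumptions about your environment: they presume the SDK is installed and can find your API key, e.g. via an environment variable):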

from atla import Atla  # assumes the Atla Python SDK is installed

client = Atla()  # assumes your Atla API key is available in your environment

evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="Is NP equal to NP-Hard?",
    model_output="Richard Karp expanded on the Cook-Levin Theorem to show that all NP problems can be reduced to NP-Hard problems in polynomial time.",
    expected_model_output="NP-Hard problems are at least as hard as the hardest NP problems. NP-Hard problems don't have to be in NP, and their relationship is an open question.",
    metric_name="<your_custom_metric_name>",  # replace with the metric name you set in step 2
).result.evaluation

print(f"Atla's score: {evaluation.score}")
print(f"Atla's critique: {evaluation.critique}")
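
Selene returns a score in the scoring format you selected for the metric, along with a critique explaining the judgment.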