Eval Copilot (beta)
Using the Eval Copilot, formerly known as the Alignment Platform
The Eval Copilot helps users create and refine custom evaluation metrics. Think of it as a tool that helps you prompt engineer Selene effectively.
Naturally, different use cases require different ways to evaluate AI responses. We trained Selene to be steerable. You can use few-shot examples, make Selene stricter / more lenient, and much more.
For example, if you’re building an AI therapy chatbot, you can edit the atla_default_helpfulness metric to ensure responses show empathy while avoiding therapeutic claims.
Walkthrough
Generate your evaluation prompt
Define your own eval or adapt one of our templates.
- Tailor the evaluation criteria to your domain / use case
- Select your desired scoring format in the Metric Type dropdown
- Select your input variables in the Input Variables dropdown
We add domain context to the Helpfulness criteria and a note to penalise therapeutic claims. We also add Ground truth response as an input variable.
Review your generated prompt
- Ensure the generated evaluation criteria and scoring rubric align with your objective
- Add a metric name (used later via the Atla API as a Custom Metric)
We name our custom metric medical_chatbot_helpfulness.
Add test data
There are two main ways you can add test data to the Eval Copilot:
Upload your CSV
- Upload your own data via Upload CSV
- Map the column names that correspond to each input variable
We upload our human-labeled test set with ground truth answers.
If you don’t have ‘Expected Score’ labels already, you can add these in the UI once your data has been uploaded.
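As a rough illustration, a test set CSV for this use case might look like the sketch below. The column names (user_query, response, ground_truth_response, expected_score) are hypothetical; during upload you map your own column names to the metric's input variables.

```python
import csv

# Hypothetical column names: map these to your input variables after upload.
rows = [
    {
        "user_query": "I've been feeling anxious every morning. What can I do?",
        "response": "That sounds really difficult. Some people find breathing "
                    "exercises helpful, and a licensed therapist can offer "
                    "support tailored to you.",
        "ground_truth_response": "Acknowledge the feeling, suggest general coping "
                                 "strategies, and encourage professional help "
                                 "without making therapeutic claims.",
        "expected_score": 5,  # the human/expert label Selene should match
    },
]

with open("test_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```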
Generate test cases
- Use Generate a test case to return a synthetic test case
- Review and adjust the ‘Expected Score’ label as needed
We generate a test case related to mental health support.
Test out the metric
- Click Run evaluations to return Selene scores on your test data
The Alignment Score measures how closely Selene’s predictions match expected human/expert scores. For reliable deployment in CI/CD pipelines or monitoring systems, aim for moderate (50-75%) or high (≥75%) alignment scores.
After running evaluations, we get a score suggesting moderate alignment.
The Alignment Score is calculated as the normalized Mean Absolute Error (MAE) of Selene’s predictions against your expected scores.
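As a minimal sketch of the idea, assuming a 1–5 scoring scale and normalisation by the maximum possible error (the exact normalisation used by the Eval Copilot may differ):

```python
def alignment_score(expected, predicted, min_score=1, max_score=5):
    """Normalised MAE expressed as a percentage: 100% means perfect alignment."""
    assert len(expected) == len(predicted) and expected, "need matched, non-empty lists"
    mae = sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)
    max_error = max_score - min_score  # worst possible per-example error
    return 100 * (1 - mae / max_error)

# Expected human scores vs. Selene's predictions on four test cases
print(alignment_score([5, 4, 2, 5], [4, 4, 3, 5]))  # 87.5 -> high alignment
```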
Align your eval metric
There are two main ways you can align your eval metric to your expected scores:
Adjust your prompt
- Access your prompt via the Show prompt toggle
- Directly edit your prompt OR use the Describe how to edit the prompt functionality to let AI make the edit for you
We can adjust the score distribution of our eval metric by making the prompt harsher or more lenient in scoring.
Your prompts are versioned in the Eval Copilot, so you can revert to an earlier version at any time.
Add few-shot examples
- Click the icon to directly add misaligned test cases (highlighted in amber or red) as few-shot examples
Selene’s score doesn’t align with our expected score. We can add this failure case as a few-shot example to seed our evaluation metric.
OR
- Select Edit few-shot examples (beneath your prompt) to access your few-shot library
- Click Add few-shot to add your own example
- Use Generate few-shot to return a synthetic example
We access our few-shot library (which includes the failure case from the previous example).
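To make this concrete, a few-shot example for the medical chatbot metric pairs an input and response with the score you expect and a short rationale. The structure below is illustrative only; the Eval Copilot captures the same fields through its UI.

```python
# Illustrative only: one misaligned case turned into a few-shot example.
few_shot_example = {
    "user_query": "Can you diagnose whether I have depression?",
    "response": "Based on what you've told me, you definitely have depression.",
    "expected_score": 1,
    "rationale": "The response makes a therapeutic claim (a diagnosis) instead of "
                 "showing empathy and directing the user to a professional.",
}
```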
Deploy!
When you are confident that your evaluation metric is calibrated, you can deploy it to be used with the Atla API.
This evaluation metric is custom, so only you will be able to access it via your API key.
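Once deployed, calling the metric from code might look roughly like the sketch below. This is a minimal sketch under assumptions: the client class, method, and parameter names (Atla, evaluation.create, metric_name, and so on) reflect a typical evaluation-request shape rather than a confirmed interface, so check the Atla API reference for the exact SDK surface.

```python
# Minimal sketch; method and parameter names are assumptions, not the confirmed SDK surface.
from atla import Atla  # assumes the Atla Python SDK is installed and ATLA_API_KEY is set

client = Atla()

result = client.evaluation.create(
    metric_name="medical_chatbot_helpfulness",  # the custom metric deployed above
    model_input="I've been feeling anxious every morning. What can I do?",
    model_output="That sounds really difficult. Breathing exercises can help, "
                 "and a licensed therapist can support you further.",
)

print(result)
```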