Naturally, different users require different ways to evaluate AI responses, so we trained Selene to be highly steerable, meaning it is great at following instructions. It excels at fitting criteria to few-shot examples, making evaluations stricter, and scoring according to your desired format.

Our Alignment Platform helps users create and refine custom evaluation criteria. Think of it as a way to prompt engineer Selene effectively.

For example, if you have built an AI Tweet generator, you can edit the atla_default_conciseness metric to adhere to Tweet lengths.

Walkthrough

1

Generate your evaluation prompt

Define your own eval or adapt one of our templates.

  1. Tailor the evaluation criteria to your domain / use case

  2. Adjust the Metric Type to your desired scoring format

  3. Select the input variables to be used in the evaluation

We adjusted the Conciseness criteria for assessing relevant legal information. We also added Ground truth response as an input variable.
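For illustration, the tailored criteria for this legal use case might read roughly as follows. The wording and rubric below are a hypothetical sketch, not the platform's generated output:

    Evaluate whether the response conveys the relevant legal information
    concisely, without omitting material facts, holdings, or caveats.
    Compare against the ground truth response where provided. Score on a
    1-5 scale, where 1 means verbose or missing key legal points and 5
    means complete and succinct.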

2

Review your generated prompt

  1. Ensure the generated evaluation criteria and scoring rubric align with your objectives

  2. Add a name for your metric, which you will use later via the Atla API as a Custom Metric.

We name this legal_conciseness.

3

Add your data

You can upload your own data via the Upload CSV function. Add labels for the ‘Expected score’ if you haven’t already.

Try to make sure your column names are somewhat similar to ours, as we use a fuzzy match to populate the right columns.
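For example, an uploaded CSV could look like the sketch below. The column names and values here are purely illustrative; the fuzzy match should map similarly named columns onto the right fields:

    model_input,model_output,ground_truth_response,expected_score
    "What is the limitation period for contract claims?","Generally six years from the breach, subject to exceptions.","In most common-law jurisdictions the limitation period for contract claims is six years from the date of breach, with statutory exceptions.",5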

You can also use our Generate a test case function to return tailored synthetic test cases. Make sure that you agree with the ‘Expected Score’ labels assigned.

Here, we successfully generate a test case in a specific legal domain.

4

Test out the metric

Use the Run evaluations function to assess your new evaluation metric on your test cases.

The Alignment Score provides a real-time signal of how closely your evals align with your desired eval objectives. To be confident in deploying and using your metric in CI/CD or in monitoring, you usually want a moderate or high Alignment Score.

It is calculated as the Mean Absolute Error (MAE) of Selene’s predictions against your expected scores.
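As a rough sketch of the arithmetic (the scores below are made up), a lower MAE means Selene's scores sit closer to your expected scores:

    # Hypothetical expected (labelled) scores and Selene's predicted scores
    expected = [4, 2, 5, 3]
    predicted = [4, 3, 5, 2]

    # Mean Absolute Error: average absolute difference between the two
    mae = sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)
    print(f"MAE: {mae:.2f}")  # lower MAE indicates closer alignment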

5

Align your eval metric

There are two main ways you can align your eval metric to your expected scores.

Adjust your prompt

Access your prompt via the Show prompt toggle.

Directly edit your prompt or let AI edit the prompt for you.

Your prompts will be versioned in the Alignment Platform so you can revert to earlier versions as you wish.

Add few-shot examples

If adjusting the prompt doesn’t get you there, try adding few-shot examples to seed your evaluation metric.

You can add these using the Edit few-shot example function beneath the prompt, or you can directly add trickier examples where the Selene score is misaligned by clicking on the score.

You can also use the Generate few-shot function to generate similar examples for you to grade.
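Conceptually, a few-shot example is just a graded test case that Selene sees alongside your criteria. The fields and wording below are a hypothetical sketch of an entry for legal_conciseness:

    Input: What must a claimant show to establish negligence?
    Response: Duty of care, breach, causation, and damage.
    Ground truth: A claimant must establish a duty of care, breach of that duty, causation, and resulting damage.
    Expected score: 5 (complete and succinct)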

6

Deploy!

Once you are confident that your evaluation metric is calibrated, you can deploy it and use it with the Atla API. This evaluation metric is custom to you, so only you will be able to access it via your API key.

from atla import Atla

# Assumes the atla Python SDK; the client is configured with your API key
# (for example via an environment variable or an api_key argument).
client = Atla()

evaluation = client.evaluation.create(
    model_id="atla-selene",
    model_input="Is NP equal to NP-Hard?",
    model_output="Richard Karp expanded on the Cook-Levin Theorem to show that all NP problems can be reduced to NP-Hard problems in polynomial time.",
    expected_model_output="NP-Hard problems are at least as hard as the hardest NP problems. NP-Hard problems don't have to be in NP, and their relationship is an open question.",
    metric_name="<your_custom_metric_name>",  # e.g. "legal_conciseness"
).result.evaluation

print(f"Atla's score: {evaluation.score}")
print(f"Atla's critique: {evaluation.critique}")