Once you have your scoring criteria defined with the help of Atla metrics, the next step is to develop test cases. Test cases should include the following information:

  1. Model Input - This is the prompt (both system and user prompt) that your LLM received. This is a required element for all use cases.
  2. Model Output - This is the response from your model. This is a required element for all use cases.
  3. Context - This is the context available to your model. This is applicable to RAG use cases.
  4. Ground Truth - When you run evaluations, it is strongly recommended that you have a reference ground truth answer for your test cases.
  5. Expected Score - This is the expected score for the test case. Labeled test cases go a long way toward setting up an evaluation and aligning Selene with how an expert would score. A sketch of this structure follows the list.
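
As a concrete reference, a test case can be represented as a small record like the one below. The field names and the 1-5 score scale are illustrative assumptions, not Atla's exact schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestCase:
    """One labeled evaluation test case (illustrative field names, not Atla's exact schema)."""
    model_input: str                      # system + user prompt sent to your LLM (required)
    model_output: str                     # the model's response (required)
    context: Optional[str] = None         # retrieved context, for RAG use cases
    ground_truth: Optional[str] = None    # reference answer, strongly recommended
    expected_score: Optional[int] = None  # expert label, e.g. on a 1-5 scale
```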

Best practices while developing test cases:

  1. Close to real-life scenarios: Your test cases should mimic real-life situations as closely as possible, and should cover a diverse set of scenarios. If you only include high-scoring test cases, Selene never sees what low-quality output looks like (illustrated below).
  2. Include edge cases: Every scenario has edge cases that even experts debate. Including edge cases as few-shot examples in the metric helps Selene learn from them quickly.
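
Reusing the TestCase sketch above, a small labeled set might deliberately span the score range, including at least one clear failure. The prompts, policies, and scores below are invented purely for illustration:

```python
test_cases = [
    TestCase(
        model_input="What is our refund window?",
        model_output="Refunds are accepted within 30 days of purchase.",
        context="Policy: refunds within 30 days of purchase, with receipt.",
        ground_truth="30 days, with receipt.",
        expected_score=4,  # mostly correct, but omits the receipt requirement
    ),
    TestCase(
        model_input="What is our refund window?",
        model_output="We offer lifetime refunds on all items.",
        context="Policy: refunds within 30 days of purchase, with receipt.",
        ground_truth="30 days, with receipt.",
        expected_score=1,  # contradicts the context: a clear low-quality example
    ),
]
```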


Developing test cases takes effort, but approximately 30 test cases per metric, built on the principles above, is a good place to start and will help you reliably assess the quality of your LLM outputs at scale.
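
With a labeled set of this size, you can sanity-check how well Selene agrees with your expert labels. The sketch below assumes a hypothetical `score_with_selene` callable standing in for whatever client call actually runs the evaluation:

```python
def alignment_report(test_cases, score_with_selene):
    """Compare Selene's scores against expert labels on a labeled test set.

    `score_with_selene` is a hypothetical callable: TestCase -> int.
    """
    diffs = [abs(score_with_selene(case) - case.expected_score) for case in test_cases]
    exact = sum(d == 0 for d in diffs)
    print(f"Exact agreement: {exact}/{len(diffs)}")
    print(f"Mean absolute score difference: {sum(diffs) / len(diffs):.2f}")
```

Large or systematic disagreements usually point back at the metric definition or the few-shot examples rather than the test cases themselves.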

Next steps

Read about aligning Selene with experts here.