Build
Develop your test cases
Gathering the right test cases for evaluation
Once you have your scoring criteria defined with the help of Atla metrics, the next step is to develop test cases. Test cases should include the following information:
- Model Input - This is the prompt (both system and user prompt) that your LLM received. This is a required element for all use cases.
- Model Output - This is the output response from your model. This is a required element for all use cases.
- Context - This is the retrieved context available to your model. This applies to RAG use cases.
- Ground Truth - When you run evaluations, it is strongly recommended that you include a reference (ground truth) answer for your test cases.
- Expected Score - This is the expected score for the test case. Labeled test cases go a long way in setting up an evaluation and aligning Selene to how an expert would score.
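Putting these fields together, a single test case might look like the sketch below. This is purely illustrative: the `TestCase` class and its field names are assumptions made for the example, not the exact schema the Atla platform expects.

```python
# Illustrative sketch only: the field names mirror the elements described above,
# not an official Atla/Selene schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TestCase:
    model_input: str                          # system + user prompt sent to your LLM (required)
    model_output: str                         # response produced by your LLM (required)
    context: Optional[str] = None             # retrieved context, for RAG use cases
    ground_truth: Optional[str] = None        # reference answer, strongly recommended
    expected_score: Optional[float] = None    # score an expert would assign


# Example: a labeled test case for a RAG question-answering metric.
example = TestCase(
    model_input=(
        "System: Answer using only the provided context.\n"
        "User: When was the company founded?"
    ),
    model_output="The company was founded in 2019.",
    context="Acme Corp was founded in 2019 in Berlin.",
    ground_truth="2019",
    expected_score=5.0,
)
```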
Best practices for developing test cases:
- Close to real-life scenarios: Your test cases should mimic real-life situations as closely as possible. Make sure to include a diverse set of scenarios. If you only include high-scoring test cases, Selene will not learn what low-quality outputs look like (see the sketch after this list).
- Include edge cases: Every scenario has edge cases that even experts debate. Including edge cases as few-shot examples in the metric helps Selene learn from them quickly.
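To make these points concrete, the continuation below reuses the illustrative `TestCase` sketch from above to show a labeled low-scoring example of the kind Selene benefits from seeing; all values are invented for illustration.

```python
# Continuing the illustrative TestCase sketch: pair high- and low-scoring
# examples so Selene sees both ends of the scale.
low_quality = TestCase(
    model_input=(
        "System: Answer using only the provided context.\n"
        "User: When was the company founded?"
    ),
    model_output="I think it was sometime in the early 2000s.",  # contradicts the context
    context="Acme Corp was founded in 2019 in Berlin.",
    ground_truth="2019",
    expected_score=1.0,  # an expert would rate this answer poorly
)
```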
Developing test cases is hard work, but having approximately 30 test cases per metric that follow the above principles is a good place to start, and will help you reliably assess the quality of your LLM output at scale. To do this, it is important to align Selene with how experts would score. We have built an alignment tool within the Eval Copilot to help you align Selene to experts through a simple-to-use workflow.
Next steps
Read about aligning Selene with experts here.