Defining your scoring criteria
Learn best practices and metrics for defining scoring criteria.
Defining the right scoring criteria to evaluate your LLM output against is at the core of an evaluation. Even though “good” and “bad” quality are fundamentally vague, highly context-dependent concepts, you can build strong criteria using these tips:
- Make it specific - Be specific about what you mean by “good”. For example, instead of saying the result needs to be “brief”, say the result must be “at most one sentence long”.
- Make it measurable - You can use quantitative metrics like accuracy, precision, or recall. For qualitative metrics, use binary (0/1) or Likert (1-5) scales.
- Keep it simple - Don’t try to evaluate too many things at once. While most real-life use cases require evaluating multiple dimensions, evaluate one criterion at a time in order to get specific and actionable results.
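The tips above can be sketched in code. Below is a minimal, hypothetical example of turning the vague criterion “brief” into the specific, measurable, binary criterion “at most one sentence long”, then aggregating the scores into an accuracy figure; the sentence-splitting heuristic is a toy stand-in, not a production-grade segmenter.

```python
import re

def at_most_one_sentence(text: str) -> int:
    """Binary (0/1) criterion: pass if the output has at most one sentence."""
    # Rough heuristic: treat runs of ., !, ? as sentence boundaries.
    sentences = [s for s in re.split(r"[.!?]+", text.strip()) if s.strip()]
    return 1 if len(sentences) <= 1 else 0

outputs = [
    "Paris is the capital of France.",
    "Paris is the capital. It is in France.",
]
scores = [at_most_one_sentence(o) for o in outputs]
accuracy = sum(scores) / len(scores)  # fraction of outputs meeting the criterion
```

Because the criterion is binary and precisely defined, every score is directly interpretable: a failing output is “too long”, not just “bad”.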
For example, a copywriting AI model might be evaluated on aspects such as clarity, engagement, and brand relevance, whereas a medical AI model might be evaluated on aspects such as clinical relevance and legal compliance.
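One way to keep multi-dimensional evaluation “one criterion at a time” is to run a separate scorer per criterion and report each score independently. The sketch below uses toy heuristic scorers and a hypothetical brand name (“Acme”) purely for illustration; in practice each scorer would be a proper metric or an LLM judge.

```python
# Toy stand-ins for real judges: each scorer returns a binary (0/1) score
# for exactly one criterion, so results stay specific and actionable.
criteria = {
    # Clarity heuristic: no sentence longer than 20 words.
    "clarity": lambda text: 1 if max(
        (len(s.split()) for s in text.split(".") if s.strip()), default=0
    ) <= 20 else 0,
    # Brand-relevance heuristic: the (hypothetical) brand name appears.
    "brand_relevance": lambda text: 1 if "acme" in text.lower() else 0,
}

copy_text = "Acme makes moving day simple. Book a van in two taps."
report = {name: scorer(copy_text) for name, scorer in criteria.items()}
```

Scoring each dimension separately means a failure points at exactly one criterion, instead of a single blended score that is hard to act on.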
You can define your scoring criteria with the help of Atla metrics. You can choose from the set of default metrics that Atla provides, or create your own custom metrics if the defaults don’t fit your use case.