Evaluation Components
Understanding Atla’s evaluation framework
Input Variables
Input Variable | Description |
---|---|
input | The user input to your GenAI app, e.g. a question, an instruction, or a previous chat dialogue. For chat dialogue, the input should be a list of alternating user and assistant messages. |
context | Additional context fetched by your retrieval model. |
ground_truth | The ‘gold standard’ or expected response. |
response | The response given by your LLM. When using chat dialogue, the response should be the last assistant message in the conversation. |
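For illustration, the variables above can be assembled as a plain dictionary before being sent for evaluation. This is a minimal sketch: the role/content message format and the example values are assumptions, and the surrounding API call is omitted.

```python
# Sketch of the input variables for a single evaluation.
# The keys mirror the table above; how you pass them to the API
# depends on the client you use. Values are illustrative only.
evaluation_inputs = {
    # Previous chat dialogue: a list of alternating user / assistant messages.
    # The role/content message shape here is an assumption.
    "input": [
        {"role": "user", "content": "What is your refund window?"},
        {"role": "assistant", "content": "Which product did you purchase?"},
        {"role": "user", "content": "The annual subscription."},
    ],
    # Additional context fetched by your retrieval model.
    "context": "Refund policy: annual subscriptions can be refunded within 30 days of purchase.",
    # The 'gold standard' or expected response.
    "ground_truth": "Annual subscriptions can be refunded within 30 days of purchase.",
    # The last assistant message in the conversation, i.e. your LLM's response.
    "response": "You can request a refund within 30 days of purchasing the annual subscription.",
}
```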
Evaluation Metrics
Evaluation metrics represent the different dimensions along which to assess the performance of your GenAI app.
Use our predefined metrics to comprehensively evaluate the different components of your GenAI application. The prompts underlying these metrics have been carefully prepared to maximise the effectiveness of our eval model.
- Retrieval evaluation
- Response generation evaluation
- Language quality evaluation
Alternatively, build your own custom metrics in the custom eval prompt UI (coming soon) and deploy them for use with our API.
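As a rough sketch of how a chosen metric and the input variables might come together in a single request: the endpoint URL, payload field names, and metric identifier below are illustrative assumptions, not Atla's actual API schema, so consult the API reference for the real interface.

```python
import requests

# Hypothetical request: the URL, payload fields, and metric name are
# assumptions for illustration only, not Atla's actual API schema.
payload = {
    "metric": "groundedness",  # assumed metric identifier
    "input": "What is your refund window?",
    "context": "Refund policy: annual subscriptions can be refunded within 30 days.",
    "response": "You can request a refund within 30 days of purchase.",
}
resp = requests.post(
    "https://api.example.com/v1/eval",  # placeholder endpoint
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
result = resp.json()  # expected to contain a score and a critique
```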
Scoring Format
Scoring | Description |
---|---|
Score of 1 - 5 | A Likert-scale scoring system. Commonly used by human raters for subjective assessments. The default option across our predefined metrics. |
Binary 0 / 1 | A simple binary scoring system. Commonly used for classification purposes. 0 typically represents a negative outcome (no, fail, incorrect), while 1 represents a positive outcome (yes, pass, correct). |
Float 0.0 - 1.0 | A continuous scoring format. Commonly used as a precise representation to quantify accuracy. |
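To make the relationship between the three formats concrete, here is a small sketch that normalises a 1 - 5 Likert score onto the 0.0 - 1.0 float range and applies a threshold to obtain a binary outcome. The 0.5 cut-off is an arbitrary choice for illustration, not a recommendation.

```python
def likert_to_float(score: int) -> float:
    """Map a 1-5 Likert score onto the 0.0-1.0 float range."""
    if not 1 <= score <= 5:
        raise ValueError("Likert scores must be between 1 and 5.")
    return (score - 1) / 4


def float_to_binary(score: float, threshold: float = 0.5) -> int:
    """Collapse a 0.0-1.0 float score into a binary 0/1 outcome.

    The 0.5 threshold is illustrative; choose a cut-off that suits your use case.
    """
    return 1 if score >= threshold else 0


assert likert_to_float(5) == 1.0
assert float_to_binary(likert_to_float(3)) == 1  # 3 -> 0.5 -> pass
```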
Evaluation Result
Field | Description |
---|---|
score | The evaluation score, in one of the formats above: 1 - 5, Binary 0 / 1, or Float 0.0 - 1.0. |
critique | A brief justification for the provided score. Use it to understand your GenAI app’s performance as you experiment with different model architectures, prompts, hyperparameters, etc. |
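A short sketch of how the two result fields might be consumed, assuming the evaluation result is available as a dictionary with `score` and `critique` keys; the surrounding API call is omitted and the failure threshold is an arbitrary example.

```python
def review_result(result: dict, min_score: float = 4) -> None:
    """Log the critique for any evaluation that falls below a chosen score.

    `result` is assumed to be a dict with 'score' and 'critique' keys,
    matching the fields in the table above.
    """
    score = result["score"]
    critique = result["critique"]
    if score < min_score:
        print(f"Low score ({score}): {critique}")
    else:
        print(f"Passed with score {score}.")


# Example usage with an illustrative result:
review_result({"score": 2, "critique": "The response contradicts the retrieved context."})
```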