Create Evaluation
Send an input that an AI generated a response for, and the model will generate an evaluation based on your desired metrics.
The evaluate API can be used for single-metric or multi-metric evaluations. It can also evaluate in RAG settings via an optional context field, and against a ground-truth answer if one is available.
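For illustration, here is a minimal request sketch in Python using the requests library. The endpoint URL and the field names other than metrics are assumptions based on the parameter descriptions below; check your API reference for the exact values.

import requests

# Hypothetical endpoint URL -- substitute the actual Atla API endpoint.
ATLA_EVAL_URL = "https://api.atla-ai.com/v1/eval"
AUTH_TOKEN = "your-auth-token"

payload = {
    "metrics": ["recall"],
    # Field names below are assumptions based on the parameter docs.
    "input_messages": "Is it permissible for a cookie banner to obscure the imprint?",
    "response": "I could not find a specific source addressing the permissibility of a cookie banner obscuring the imprint.",
}

r = requests.post(ATLA_EVAL_URL, json=payload, headers={"Authorization": f"Bearer {AUTH_TOKEN}"})
r.raise_for_status()
print(r.json())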
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
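As a sketch, the header can be constructed in Python like so (the token value is a placeholder):

# Bearer authentication header; replace the placeholder with your auth token.
headers = {"Authorization": "Bearer your-auth-token"}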
Body
Metrics to evaluate on.
Our models have been trained to evaluate on specific metrics, ensuring the highest performance. By default, each metric passed will use the best model for that metric.
You can include multiple metrics to get multiple evaluations in one request or pass just a single metric.
Example with a single metric:
['recall']
Example with multiple metrics:
['recall', 'precision']
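Passing multiple metrics returns one evaluation per metric in a single response. As a sketch (field names other than metrics are assumptions):

# One request, two evaluations: the response will contain one
# evaluation object per metric listed here.
payload = {
    "metrics": ["recall", "precision"],
    # ...input messages, response, etc. as described below
}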
Input messages to evaluate the assistant's response for.
Atla will evaluate an AI response based on the input message that was used to generate the response. Typically the input message is a single question or prompt used within some context.
Example with a single input message:
Is it permissible for a cookie banner to obscure the imprint?
Atla is able to generate an evaluation for multi-turn input messages, typically a conversation. The input message should be a list of alternating user and assistant messages. Each message should be a dictionary with a role and content key.
Example with multiple conversational turns:
[
    {'role': 'user', 'content': 'Is it permissible for a cookie banner to obscure the imprint?'},
    {'role': 'assistant', 'content': 'I could not find a specific source addressing the permissibility of a cookie banner obscuring the imprint.'}
]
The AI-generated response that will be evaluated.
When using multi-turn input messages, the response should be the last assistant message in the conversation.
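For example, one plausible way to pair a multi-turn input with the response to evaluate, where the response parameter carries the final assistant message (field names are assumptions):

conversation = [
    {"role": "user", "content": "Is it permissible for a cookie banner to obscure the imprint?"},
    {"role": "assistant", "content": "I could not find a specific source addressing the permissibility of a cookie banner obscuring the imprint."},
]

# The evaluated response is the last assistant message in the conversation.
payload = {
    "metrics": ["recall"],
    "input_messages": conversation,
    "response": conversation[-1]["content"],
}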
The model version that will perform the evaluation. By default, the latest atla model will be used.
The context in which the input message is used.
In a Retrieval-Augmented Generation (RAG) setting, the context parameter is crucial for evaluating how well the AI system integrates retrieved information into its generated responses. When you provide the relevant context, Atla can measure the accuracy and relevance of the AI's response against it.
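As a hedged sketch, a RAG-style request might attach the retrieved passage alongside the input and response (the context field name follows the description above; the other field names are assumptions):

retrieved_context = "Legal notices (imprints) must be easily recognizable and directly reachable at all times."

payload = {
    "metrics": ["recall"],
    "input_messages": "Is it permissible for a cookie banner to obscure the imprint?",
    "response": "No, the imprint must remain directly reachable at all times.",
    # The retrieved passage the AI relied on when generating its response.
    "context": retrieved_context,
}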
The reference or ground-truth answer against which the AI response will be evaluated.
This parameter provides the correct or expected answer for the given input. Atla will compare the AI-generated response against this reference to assess its correctness and relevance. Providing a reference enables Atla to perform a detailed evaluation of the AI's accuracy and factual consistency.
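Similarly, a sketch of supplying a ground-truth answer (the reference field name is taken from the description above and may differ in practice):

payload = {
    "metrics": ["recall"],
    "input_messages": "Is it permissible for a cookie banner to obscure the imprint?",
    "response": "No, a cookie banner may not obscure the imprint.",
    # Ground-truth answer the AI response is compared against.
    "reference": "No. The imprint must remain easily recognizable and directly reachable at all times.",
}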
Response
The model version that performed the evaluation.
Evaluations generated by the model.
This is an object where the key is the metric evaluated (as per the metrics field in the request) and the value is the evaluation object. The evaluation object contains the score (1-5) and a critique of the model's evaluation.
Example:
{
    'recall': {
        'score': 3,
        'critique': 'The model was able to recall some of the information but not all.'
    }
}
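A sketch of reading these evaluations from a decoded response in Python. The top-level key name used here is an assumption; only the per-metric shape is documented above:

result = {
    "evaluations": {  # top-level key name is an assumption
        "recall": {
            "score": 3,
            "critique": "The model was able to recall some of the information but not all.",
        }
    }
}

# Iterate over each metric's evaluation object.
for metric, evaluation in result["evaluations"].items():
    print(f"{metric}: score {evaluation['score']}/5")
    print(f"critique: {evaluation['critique']}")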
Billing and rate-limit usage
Atla's API is billed based on the number of evaluation tokens used.
Unique identifier for the evaluation.
Unix timestamp of when the evaluation was created.
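Since this is a Unix timestamp, it can be converted to a human-readable datetime, for example:

from datetime import datetime, timezone

created = 1700000000  # example Unix timestamp from an evaluation response
print(datetime.fromtimestamp(created, tz=timezone.utc).isoformat())
# -> 2023-11-14T22:13:20+00:00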