POST /v1/eval

Send an input that an AI generated a response for, and the model will generate an evaluation based on your desired metrics.

The evaluate API can be used for single-metric or multi-metric evaluations. It can also evaluate in RAG settings, with an optional context field, and against a ground-truth answer if one is available.
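For illustration, here is a minimal request sketch in Python using the requests library. The base URL is a placeholder (substitute your actual API host); only the /v1/eval path comes from this page, and the payload values follow the field examples below.

import requests

API_URL = 'https://api.example.com/v1/eval'  # placeholder host; only the /v1/eval path is from this page
API_TOKEN = '<token>'  # your auth token

payload = {
    'metrics': ['recall'],
    'input': 'Is it permissible for a cookie banner to obscure the imprint?',
    'response': 'I could not find a specific source addressing the permissibility of a cookie banner obscuring the imprint.',
}

resp = requests.post(
    API_URL,
    headers={'Authorization': f'Bearer {API_TOKEN}'},
    json=payload,
)
resp.raise_for_status()
print(resp.json())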

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
metrics
string[]
required

Metrics to evaluate on.

Our models have been trained to evaluate on specific metrics, ensuring the highest performance. Each metric passed will, by default, use the best model for that metric.

You can include multiple metrics to get multiple evaluations in one request or pass just a single metric.

Example with a single metric:

['recall']

Example with multiple metrics:

['recall', 'precision']
input
string | object[]
required

Input messages to evaluate the assistant's response for.

Atla will evaluate an AI response based on the input message that was used to generate the response. Typically the input message is a single question or prompt used within some context.

Example with a single input message:

'Is it permissible for a cookie banner to obscure the imprint?'

Atla is also able to generate an evaluation for multi-turn input messages, typically a conversation. The input message should be a list of alternating user and assistant messages. Each message should be a dictionary with a role and a content key.

Example with multiple conversational turns:

[
   {'role': 'user', 'content': 'Is it permissible for a cookie banner to obscure the imprint?'},
   {'role': 'assistant', 'content': 'I could not find a specific source addressing the permissibility of a cookie banner obscuring the imprint.'}
]
response
string
required

The response generated by the AI model which will be evaluated.

When using multi-turn input messages, the response should be the last assistant message in the conversation.
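As a hedged sketch of that convention, the response can simply repeat the conversation's final assistant message:

conversation = [
    {'role': 'user', 'content': 'Is it permissible for a cookie banner to obscure the imprint?'},
    {'role': 'assistant', 'content': 'I could not find a specific source addressing the permissibility of a cookie banner obscuring the imprint.'}
]

payload = {
    'metrics': ['recall'],
    'input': conversation,
    # Per the note above, the response to evaluate is the final assistant message.
    'response': conversation[-1]['content'],
}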

model
string
default: atla

The model version that will perform the evaluation. By default, the latest atla model will be used.

context
string | null
default: null

The context in which the input message is used.

In a Retrieval-Augmented Generation (RAG) setting, the context parameter is crucial for evaluating how well the AI system integrates retrieved information into its responses. By providing the retrieved context, you enable Atla to measure the accuracy and relevance of the AI's response against it.
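As a hedged illustration, a RAG-style request might pass the retrieved passage via context; the passage text here is invented:

payload = {
    'metrics': ['recall'],
    'input': 'Is it permissible for a cookie banner to obscure the imprint?',
    'response': 'A cookie banner should not block access to the imprint.',
    # Hypothetical retrieved passage supplied as evaluation context.
    'context': 'Retrieved passage: the imprint must be easily recognisable and directly accessible at all times.',
}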

reference
string | null

The reference or ground-truth answer against which the AI response will be evaluated.

This parameter is used to provide the correct or expected answer for the given input. Atla will compare the AI-generated response against this reference to assess its correctness and relevance. Providing a reference enables Atla to perform a detailed evaluation of the AI's accuracy and factual consistency.
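Extending the RAG sketch above, a ground-truth answer can be supplied alongside the context (again, the answer text is invented for illustration):

payload['reference'] = 'No. The imprint must remain directly accessible at all times, so a cookie banner may not obscure it.'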

Response

200 - application/json
model
string
required

The model version that performed the evaluation.

evaluations
object
required

Evaluations generated by the model.

This is an object where the key is the metric evaluated (as per the metrics field in the request) and the value is the evaluation object.

The evaluation object contains the score (1-5) and the critique from the model's evaluation.

Example:

{
   'recall': {
       'score': 3,
       'critique': 'The model was able to recall some of the information but not all.'
   }
}
usage
object
required

Billing and rate-limit usage.

Atla's API is billed based on the number of evaluation tokens used.

id
string

Unique identifier for the evaluation.

created
integer

Unix timestamp of when the evaluation was created.
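Putting the response fields together, a hedged sketch of handling the result (continuing from the request sketch above; the usage object's fields aren't specified on this page, so it is printed as-is):

data = resp.json()

print(data['model'])  # model version that performed the evaluation
for metric, evaluation in data['evaluations'].items():
    # Each value holds a 1-5 score and a critique, per the example above.
    print(metric, evaluation['score'], evaluation['critique'])
print(data['usage'])  # billing and rate-limit usage
print(data.get('id'), data.get('created'))  # optional identifier and Unix creation timestamp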