Use Atla with Langfuse
A guide on how to evaluate Langfuse traces using Atla metrics as you develop / monitor your AI application.
About Langfuse
Langfuse is an open-source LLM engineering platform that helps teams collaboratively debug, analyze, and iterate on their LLM applications.
About Atla
Atla trains frontier models to test & evaluate GenAI applications. Our models help developers run fast and accurate evaluations, so they can ship quickly and with confidence.
Using Langfuse
With this integration, use Atla to evaluate the quality of your GenAI application and add these scores to your Langfuse traces for observability.
We’ve created a cookbook here if you want to jump into the code immediately, or you can follow the guide below! We will walk through two key methods:
- Individual Trace Scoring: Apply Atla evaluations to each Trace item as it is created.
- Batch Scoring: Evaluate Traces from a seed dataset or from production using Atla. In production, you might want to evaluate a sample of the traces rather than all of them, to keep costs down.
Individual Trace Scoring
Here’s how to score individual traces as they’re created, using a RAG application as an example:
Create Traces in Langfuse
We mock the instrumentation of your application by using this sample RAG data. See the quickstart to integrate Langfuse with your application.
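As a minimal sketch (assuming the Langfuse Python SDK v2 interface; the `rag-example` trace name and the inline `rag_data` sample are illustrative stand-ins for the cookbook's data), the traces could be created like this:

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment.
langfuse = Langfuse()

# Illustrative sample RAG data; replace with the sample data from the cookbook
# or with traffic from your own instrumented application.
rag_data = [
    {
        "question": "What is Langfuse?",
        "context": "Langfuse is an open-source LLM engineering platform for debugging, analyzing and iterating on LLM applications.",
        "answer": "Langfuse is an open-source platform that helps teams debug and monitor their LLM applications.",
    },
]

trace_ids = []
for item in rag_data:
    # One trace per question, with a retrieval span and a generation for the answer.
    trace = langfuse.trace(name="rag-example", input=item["question"], output=item["answer"])
    trace.span(name="retrieval", input=item["question"], output=item["context"])
    trace.generation(name="generation", input=item["question"], output=item["answer"])
    trace_ids.append(trace.id)

langfuse.flush()  # make sure the events are sent before scoring
```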
Score each Trace with Atla
Set up a function to evaluate the retrieval and generation components of your RAG pipeline as Traces are created.
We evaluate on `atla_groundedness` and `context_relevance`. The first determines whether the response is factually based on the provided context, and the second measures how on-point the retrieved context is:
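Here is a sketch of such a function. The `atla_evaluate` helper below is a hypothetical placeholder for the Atla client call (the cookbook shows the exact interface); the scores are attached with the Langfuse v2 `score` method:

```python
def atla_evaluate(metric: str, model_input: str, model_output: str, model_context: str):
    """Hypothetical placeholder for the Atla evaluation call on a single metric.
    Replace with the Atla client call from the cookbook; it should return the
    metric's (score, critique) pair."""
    raise NotImplementedError


def score_trace(trace_id: str, question: str, context: str, answer: str) -> None:
    # atla_groundedness: is the response factually based on the provided context?
    # context_relevance: how on-point is the retrieved context for the question?
    for metric in ("atla_groundedness", "context_relevance"):
        score, critique = atla_evaluate(
            metric=metric,
            model_input=question,
            model_output=answer,
            model_context=context,
        )
        # Attach the Atla score and critique to the trace in Langfuse.
        langfuse.score(trace_id=trace_id, name=metric, value=score, comment=critique)


# Once the placeholder is wired up to the Atla client, score every trace as it is created.
for item, trace_id in zip(rag_data, trace_ids):
    score_trace(trace_id, item["question"], item["context"], item["answer"])
```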
We can monitor Atla’s evaluation scores and critiques across different traces in Langfuse’s UI.
Batch Scoring
To evaluate batches of data, such as from a seed dataset or production logs in Langfuse:
Retrieve and Prepare Traces
We convert the traces into a dataframe in order to run parallel evaluations in the next step.
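A sketch of this step, assuming the Langfuse v2 SDK’s `fetch_traces` method and pandas:

```python
import pandas as pd

# Fetch a batch of recent traces from Langfuse; filter or paginate as needed.
traces = langfuse.fetch_traces(limit=50).data

df = pd.DataFrame(
    [
        {
            "trace_id": t.id,
            "question": t.input,
            "answer": t.output,
        }
        for t in traces
    ]
)

# For atla_precision we also need ground-truth responses, e.g. joined in from the
# seed dataset that was uploaded alongside these traces (column name is illustrative).
# df["expected_answer"] = ...
```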
Set Up Parallel Evaluations with Atla
We can treat the retrieved traces like regular production data and evaluate them with Atla. For more information on this asynchronous capability, see Parallel Evaluations.
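A sketch of the parallel set-up, using asyncio and a hypothetical `atla_evaluate_async` placeholder for the Atla client’s asynchronous call:

```python
import asyncio


async def atla_evaluate_async(metric, model_input, model_output, expected_output=None):
    """Hypothetical placeholder for the Atla client's asynchronous evaluation call.
    It should return the metric's (score, critique) pair."""
    raise NotImplementedError


async def evaluate_batch(df, metric):
    # Launch one evaluation per dataframe row and await them all concurrently.
    tasks = [
        atla_evaluate_async(
            metric=metric,
            model_input=row["question"],
            model_output=row["answer"],
            expected_output=row.get("expected_answer"),
        )
        for _, row in df.iterrows()
    ]
    return await asyncio.gather(*tasks)
```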
Score each Trace with Atla
We evaluate on `atla_precision`, as we have uploaded a dataset with ground-truth responses. This metric assesses the presence of incorrect or unrelated content in the AI’s response against the ground truth:
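Building on the sketch above (the `evaluate_batch` helper and `expected_answer` column are the illustrative names from the previous snippets), the batch run might look like this:

```python
# Each response is judged against the ground-truth answer from the uploaded dataset.
# results is a list of (score, critique) pairs, one per row of df.
results = asyncio.run(evaluate_batch(df, metric="atla_precision"))
```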
Log Evaluation Results to Langfuse
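A sketch of writing the results back with the Langfuse v2 `score` method, so each evaluation lands on its originating trace:

```python
for (_, row), (score, critique) in zip(df.iterrows(), results):
    langfuse.score(
        trace_id=row["trace_id"],
        name="atla_precision",
        value=score,
        comment=critique,  # Atla's critique appears alongside the score in the UI
    )

langfuse.flush()
```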
We can monitor Atla’s evaluation scores and critiques over the batch on Langfuse’s UI.
Next Steps
- Set up continuous monitoring with Atla and Langfuse to identify mistakes with your application before your users do
- Consider optimising your AI application even further by experimenting with different hyperparameters; visit our ‘how-to-build-with-Atla guide’
For more information on Atla’s capabilities and different use cases, check out our full documentation.