Use Atla with Langfuse
A guide on how to evaluate Langfuse traces using Atla metrics as you develop and monitor your AI application.
About Langfuse
Langfuse is an open-source LLM engineering platform that helps teams collaboratively debug, analyze, and iterate on their LLM applications.
About Atla
Atla trains frontier models to test & evaluate GenAI applications. Our models help developers run fast and accurate evaluations, so they can ship quickly and with confidence.
Using Langfuse
With this integration, use Atla to evaluate the quality of your GenAI application and add these scores to your Langfuse traces for observability.
We’ve created a cookbook here if you want to jump into the code immediately, or you can follow the guide below! We will walk through two key methods:
- Individual Trace Scoring: Apply Atla evaluations to each Trace item as it is created.
- Batch Scoring: Evaluate Traces from a seed data set or from production using Atla. In production, you might want to evaluate a sample of the traces rather than all of them, to keep costs down.
Individual Trace Scoring
Here’s how to score individual traces as they’re created, using a RAG application as an example:
Create Traces in Langfuse
We mock the instrumentation of your application by using this sample RAG data. See the quickstart to integrate Langfuse with your application.
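The snippets below assume a Langfuse client has been initialized. A minimal sketch, assuming your Langfuse credentials are available as environment variables (the key values shown are placeholders):

import os
from langfuse import Langfuse

# Placeholder credentials — replace with your own project keys
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
# os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # change if self-hosting

langfuse = Langfuse()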
# Sample RAG data
data_RAG = [
    {
        "question": "What is the process of water purification in a typical water treatment plant?",
        "context": "[2] Water purification in treatment plants typically involves several stages: coagulation, sedimentation, filtration, and disinfection. Coagulation adds chemicals that bind with impurities, forming larger particles. Sedimentation allows these particles to settle out of the water. Filtration removes smaller particles by passing water through sand, gravel, or charcoal. Finally, disinfection kills any remaining microorganisms, often using chlorine or ultraviolet light.",
        "response": "Water purification involves coagulation, which removes impurities, and then the water is filtered. Finally, salt is added to kill any remaining bacteria before the water is distributed for use."
    }
]
# Start a new trace
question = data_RAG[0]['question']
trace = langfuse.trace(name="Atla RAG")

# Retrieve the relevant chunks
# chunks = get_similar_chunks(question)
context = data_RAG[0]['context']
trace.span(
    name="retrieval",
    input={'question': question},
    output={'context': context}
)

# Use the LLM to generate an answer with the chunks
# answer = get_response_from_llm(question, chunks)
response = data_RAG[0]['response']
trace.span(
    name="generation",
    input={'question': question, 'context': context},
    output={'response': response}
)

trace.update(input={'question': question}, output={'response': response})
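To confirm the trace was created, you can print a direct link to it (optional; this assumes the v2 Langfuse Python SDK used above, whose trace client exposes get_trace_url):

# Print a link to inspect this trace in the Langfuse UI
print(trace.get_trace_url())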
Score each Trace with Atla
Set up a function to evaluate the retrieval and generation components of your RAG pipeline as Traces are created.
We evaluate on `atla_groundedness` and `context_relevance`. The first determines whether the response is factually based on the provided context, and the second measures how on-point the retrieved context is:
from atla import Atla

# Assumes ATLA_API_KEY is set in the environment
client = Atla()

def atla_eval(metrics, user_input, assistant_response, context=None, reference=None):
    # Run the Atla evaluation for all requested metrics in one call
    response = client.evaluation.create(
        input=user_input,
        response=assistant_response,
        metrics=metrics,
        context=context,
        reference=reference
    )
    results = {}
    for metric in metrics:
        score = response.evaluations[metric].score
        critique = response.evaluations[metric].critique
        # Attach the score and critique to the current Langfuse trace
        trace.score(name=metric, value=score, comment=critique)
        results[metric] = {'score': score, 'critique': critique}
    return results
metrics = ['atla_groundedness', 'context_relevance']
results = atla_eval(metrics, question, response, context=context)
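The returned dictionary mirrors what was logged to Langfuse, so you can also inspect the results locally, for example:

# Print each metric's score and critique
for metric, result in results.items():
    print(f"{metric}: {result['score']}")
    print(result['critique'])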
We can monitor Atla’s evaluation scores and critiques over different traces in the Langfuse UI.
Batch Scoring
To evaluate batches of data, such as from a seed dataset or production logs in Langfuse:
Retrieve and Prepare Traces
We convert the traces into a dataframe in order to run parallel evaluations in the next step.
import pandas as pd

# Retrieve the relevant traces from Langfuse
traces = langfuse.fetch_traces(page=1,
                               limit=100,  # adjust to the number of traces you want to fetch
                               name="Atla ground truth").data

# Optionally evaluate only a sample of traces to keep costs down
traces_sample = traces

# Prep the evaluation batch
evaluation_batch = []

# Retrieve observations from each trace
for t in traces_sample:
    observations = langfuse.fetch_observations(trace_id=t.id).data
    context, reference = None, None
    for o in observations:
        if o.name == 'retrieval':
            question = o.input['question']
            context = o.output.get('context')
        if o.name == 'generation':
            reference = o.output.get('reference')
            response = o.output['response']
    evaluation_batch.append({
        'question': question,
        'response': response,
        'trace_id': t.id,
        'context': context,
        'reference': reference
    })

evaluation_batch = pd.DataFrame(evaluation_batch)
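Before kicking off evaluations, a quick sanity check on the prepared batch can help catch missing fields:

# Sanity-check the prepared batch
print(evaluation_batch.head())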
Set Up Parallel Evaluations with Atla
We can retrieve the traces like regular production data and evaluate them with Atla. For more info on this asynchronous capability, see Parallel Evaluations.
import asyncio
from atla import AsyncAtla

# Assumes ATLA_API_KEY is set in the environment
async_client = AsyncAtla()

async def send_eval(index, metrics, user_input, assistant_response, context=None, reference=None):
    try:
        response = await async_client.evaluation.create(
            input=user_input,
            response=assistant_response,
            metrics=metrics,
            context=context,
            reference=reference
        )
        return index, response
    except Exception as e:
        print(f"Error at index {index} during API call: {str(e)}")
        return index, None

async def main(evaluation_batch, metrics):
    tasks = []
    for index, row in evaluation_batch.iterrows():
        tasks.append(send_eval(index, metrics, row["question"], row["response"], row.get("context"), row.get("reference")))
    evals = await asyncio.gather(*tasks)
    # Write each evaluation's score and critique back onto the dataframe
    for index, evaluation in evals:
        if evaluation is not None:
            for metric, result in evaluation.evaluations.items():
                evaluation_batch.at[index, f'{metric}_score'] = result.score
                evaluation_batch.at[index, f'{metric}_critique'] = result.critique
    return evaluation_batch
Score each Trace with Atla
We evaluate on `atla_precision`, as we have uploaded a dataset with ground-truth responses. This metric assesses the presence of incorrect or unrelated content in the AI’s response against the ground truth.
import asyncio
import nest_asyncio

# nest_asyncio allows asyncio.run to be called inside an already-running
# event loop (e.g. in a Jupyter notebook)
nest_asyncio.apply()

metrics = ['atla_precision']
evaluation_batch = asyncio.run(main(evaluation_batch, metrics))
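Once the run completes, the score and critique columns are available on the dataframe. For example, with the `atla_precision` metric used here:

# Preview the evaluation results before logging them to Langfuse
print(evaluation_batch[['trace_id', 'atla_precision_score']].head())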
Log Evaluation Results to Langfuse
for _, row in evaluation_batch.iterrows():
    for metric in metrics:
        # Attach the Atla score and critique back to the original trace
        langfuse.score(
            name=metric,
            value=row[f'{metric}_score'],
            comment=row[f'{metric}_critique'],
            trace_id=row["trace_id"]
        )
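If you run this as a short-lived script, flush the client before exiting so all queued scores are delivered (the Langfuse Python SDK batches events in the background):

# Ensure all buffered events reach Langfuse before the process exits
langfuse.flush()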
We can monitor Atla’s evaluation scores and critiques over the batch in the Langfuse UI.
Next Steps
- Set up continuous monitoring with Atla and Langfuse to identify mistakes in your application before your users do
- Consider optimising your AI application even further by experimenting with different hyperparameters; see our ‘how-to-build-with-Atla’ guide
For more information on Atla’s capabilities and different use cases, check out our full documentation.