About Langfuse

Langfuse is an open-source LLM engineering platform that helps teams collaboratively debug, analyze, and iterate on their LLM applications.

About Atla

Atla trains frontier models to test & evaluate GenAI applications. Our models help developers run fast and accurate evaluations, so they can ship quickly and with confidence.

Using Langfuse

With this integration, use Atla to evaluate the quality of your GenAI application and add these scores to your Langfuse traces for observability.

We’ve created a cookbook here if you want to jump into the code immediately, or you can follow the guide below! We will walk through two key methods:

  1. Individual Trace Scoring: Apply Atla evaluations to each Trace item as it is created.
  2. Batch Scoring: Evaluate Traces from a seed dataset or from production using Atla. In production, you might want to evaluate a sample of the traces rather than all of them, to keep costs down.
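
Before running the snippets below, make sure both SDKs are installed and your API keys are configured. Here is a minimal setup sketch; the package names follow each SDK's import name and the Atla environment variable is an assumption, so check the cookbook and each provider's docs for the exact values:

# Install the SDKs (run in your shell)
#   pip install langfuse atla pandas nest_asyncio

import os

# Langfuse credentials, from your Langfuse project settings
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

# Atla API key (variable name assumed; see the Atla docs)
os.environ["ATLA_API_KEY"] = "..."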

Individual Trace Scoring


Here’s how to score individual traces as they’re created, using a RAG application as an example:

1. Create Traces in Langfuse

We mock the instrumentation of your application using the sample RAG data below. See the quickstart to integrate Langfuse with your own application.

from langfuse import Langfuse

# Initialise the Langfuse client (reads your LANGFUSE_* credentials from the environment)
langfuse = Langfuse()

# Sample RAG data
data_RAG = [
    {
        "question": "What is the process of water purification in a typical water treatment plant?",
        "context": "[2] Water purification in treatment plants typically involves several stages: coagulation, sedimentation, filtration, and disinfection. Coagulation adds chemicals that bind with impurities, forming larger particles. Sedimentation allows these particles to settle out of the water. Filtration removes smaller particles by passing water through sand, gravel, or charcoal. Finally, disinfection kills any remaining microorganisms, often using chlorine or ultraviolet light.",
        "response": "Water purification involves coagulation, which removes impurities, and then the water is filtered. Finally, salt is added to kill any remaining bacteria before the water is distributed for use."
    }
]

# Start a new trace
question = data_RAG[0]['question']
trace = langfuse.trace(name="Atla RAG")

# retrieve the relevant chunks
# chunks = get_similar_chunks(question)
context = data_RAG[0]['context']
trace.span(
    name="retrieval",
    input={'question': question},
    output={'context': context}
)

# use llm to generate an answer with the chunks
# answer = get_response_from_llm(question, chunks)
response = data_RAG[0]['response']
trace.span(
    name="generation",
    input={'question': question, 'context': context},
    output={'response': response}
)

trace.update(input={'question': question}, output={'response': response})

2. Score each Trace with Atla

Set up a function to evaluate the retrieval and generation components of your RAG pipeline as Traces are created.

We evaluate on atla_groundedness and context_relevance. The first checks whether the response is factually grounded in the provided context, and the second measures how relevant the retrieved context is to the question:

from atla import Atla

# Initialise the Atla client (see the Atla docs for configuring your API key)
client = Atla()

def atla_eval(metrics, user_input, assistant_response, context=None, reference=None):
    response = client.evaluation.create(
        input=user_input,
        response=assistant_response,
        metrics=metrics,
        context=context,
        reference=reference
    )

    results = {}
    for metric in metrics:
        score = response.evaluations[metric].score
        critique = response.evaluations[metric].critique
        # Attach Atla's score and critique to the current Langfuse trace
        trace.score(name=metric, value=score, comment=critique)
        results[metric] = {'score': score, 'critique': critique}

    return results

metrics = ['atla_groundedness', 'context_relevance']
results = atla_eval(metrics, question, response, context=context)

We can monitor Atla’s evaluation scores and critiques across different traces in the Langfuse UI.
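
Before opening the UI, you can also sanity-check the returned results locally and flush the Langfuse client so any buffered events are sent immediately. A minimal sketch, assuming the results dict and langfuse client from the snippets above:

# Inspect Atla's verdicts locally
for metric, result in results.items():
    print(f"{metric}: score={result['score']}")
    print(f"  critique: {result['critique']}")

# Send any buffered traces and scores to Langfuse
langfuse.flush()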

Batch Scoring


To evaluate batches of data, such as from a seed dataset or production logs in Langfuse:

1. Retrieve and Prepare Traces

We convert the traces into a dataframe in order to run parallel evaluations in the next step.

import pandas as pd

# Retrieve the "Atla ground truth" traces (adjust `limit` to the number of traces you want to fetch)
traces = langfuse.fetch_traces(
    page=1,
    limit=50,
    name="Atla ground truth"
).data

# Prep df
evaluation_batch = []

# Retrieve observations from each trace
for t in traces:
    observations = langfuse.fetch_observations(trace_id=t.id).data

    question = context = response = reference = None
    for o in observations:
        if o.name == 'retrieval':
            question = o.input['question']
            context = o.output.get('context')
        if o.name == 'generation':
            reference = o.output.get('reference')
            response = o.output['response']

    evaluation_batch.append({
        'question': question,
        'response': response,
        'trace_id': t.id,
        'context': context,
        'reference': reference
    })

evaluation_batch = pd.DataFrame(evaluation_batch)
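
As noted above, in production you may want to score only a sample of the batch rather than every trace to keep evaluation costs down. A minimal sketch using pandas; the sample size of 20 is an arbitrary value for illustration:

# Optionally evaluate a random sample of the batch to keep costs down
sample_size = min(len(evaluation_batch), 20)
evaluation_batch = evaluation_batch.sample(n=sample_size, random_state=42).reset_index(drop=True)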

2. Set Up Parallel Evaluations with Atla

We can retrieve the traces like regular production data and evaluate them with Atla. For more info on this asynchronous capability, see Parallel Evaluations.

import asyncio
from atla import AsyncAtla

async_client = AsyncAtla()

async def send_eval(index, metrics, user_input, assistant_response, context=None, reference=None):
    try:
        response = await async_client.evaluation.create(
            input=user_input,
            response=assistant_response,
            metrics=metrics,
            context=context,
            reference=reference
        )
        return index, response
    except Exception as e:
        print(f"Error at index {index} during API call: {str(e)}")
        return index, None

async def main(evaluation_batch, metrics):
    tasks = []
    for index, row in evaluation_batch.iterrows():
        tasks.append(send_eval(index, metrics, row["question"], row["response"], row.get("context"), row.get("reference")))

    evals = await asyncio.gather(*tasks)

    for index, result in evals:
        if result is not None:
            for metric, evaluation in result.evaluations.items():
                evaluation_batch.at[index, f'{metric}_score'] = evaluation.score
                evaluation_batch.at[index, f'{metric}_critique'] = evaluation.critique

    return evaluation_batch

3. Score each Trace with Atla

We evaluate on atla_precision, as we have uploaded a dataset with ground-truth responses. This metric assesses the presence of incorrect or unrelated content in the AI’s response against the ground truth.

import asyncio
import nest_asyncio

# Allow asyncio.run inside an already-running event loop (e.g. a Jupyter notebook)
nest_asyncio.apply()
metrics = ['atla_precision']
evaluation_batch = asyncio.run(main(evaluation_batch, metrics))

4. Log Evaluation Results to Langfuse

for _, row in evaluation_batch.iterrows():
    for metric in metrics:
        langfuse.score(
            name=metric,
            value=row[f'{metric}_score'],
            comment=row[f'{metric}_critique'],
            trace_id=row["trace_id"]
        )

We can monitor Atla’s evaluation scores and critiques over the batch in the Langfuse UI.
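
You can also summarise the batch results directly from the dataframe before checking the UI. A minimal sketch, assuming the evaluation_batch dataframe and metrics list from the previous steps:

# Quick local summary of the batch scores
for metric in metrics:
    scores = evaluation_batch[f'{metric}_score']
    print(f"{metric}: mean={scores.mean():.2f}, min={scores.min()}, max={scores.max()}")

# Ensure all scores have been sent to Langfuse before checking the UI
langfuse.flush()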

Next Steps

  • Set up continuous monitoring with Atla and Langfuse to identify mistakes in your application before your users do
  • Consider optimising your AI application further by experimenting with different hyperparameters; see our ‘how-to-build-with-Atla’ guide

For more information on Atla’s capabilities and different use cases, check out our full documentation.