Build with Atla
A start-to-end guide on optimising your AI app with Atla evaluations. Build more reliable GenAI applications, and find mistakes before your customers do.
Choosing the Best LLM for Your AI Application
There are many hyperparameters we can iterate on when building with GenAI. We will focus on arguably the most crucial: selecting the right LLM to use. This guide will walk you through the process of evaluating and comparing different LLMs using Atla and a large test dataset from HuggingFace. It takes ~20 mins to run through and you can find the cookbook here.
Prepare Your Test Dataset
Load a dataset for testing. In this example, we’re using a basketball Q&A dataset:
import ast
from datasets import load_dataset
import pandas as pd
# Load the first 20 items from the basketball QA dataset
dataset = load_dataset("PedroCJardim/QASports", "basketball", split="test[:20]")
df = dataset.to_pandas()
test_dataset = df[['question']].copy()
# Extract the 'text' field from the 'answer' column.
# ast.literal_eval parses the stringified dict without executing arbitrary code.
test_dataset['answer'] = df['answer'].apply(lambda x: ast.literal_eval(x)['text'])
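The `answer` column appears to hold each answer as a string-serialized dictionary, which is why it has to be parsed before its `text` field can be read. Below is a minimal sketch of that parsing step in isolation. The sample string and the `extract_answer_text` helper are illustrative assumptions, not part of the dataset or of Atla, and the real rows may carry different keys alongside `text`.

```python
import ast

# Hypothetical raw value, assuming each 'answer' cell is a stringified dict
# with a 'text' key (the other fields in the real dataset may differ).
raw_answer = "{'offset': [12, 26], 'text': 'Michael Jordan'}"


def extract_answer_text(raw):
    """Parse a stringified dict and return its 'text' field."""
    # literal_eval only accepts Python literals, so it never runs code
    parsed = ast.literal_eval(raw)
    return parsed["text"]


print(extract_answer_text(raw_answer))
```

Because `ast.literal_eval` rejects anything that is not a plain literal, a malformed or malicious cell raises an error instead of being executed.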
Set Up Your RAG Pipeline
Create a Retrieval-Augmented Generation (RAG) pipeline. This involves loading your document corpus, creating embeddings, building a vector store, and setting up a retriever.
We load a corpus of ~50k preprocessed, cleaned and organized documents extracted from Wikipedia articles related to basketball. For instance, this corpus includes articles about NBA teams, FIBA World Cup histories, biographies of famous players like Michael Jordan, and explanations of basketball positions and strategies. You can download this data here.
This setup provides a starting point, but there are many parameters you can tune to optimize your RAG pipeline:
- Chunking parameters: Adjust chunk_size and chunk_overlap in the text splitter to find the optimal balance between context preservation and retrieval granularity.
- Embedding model: Experiment with different embedding models to improve semantic understanding and retrieval accuracy.
- Vector store type: Consider alternatives to Chroma, such as FAISS or Pinecone, depending on your needs.
- Retrieval parameters: Fine-tune the number of retrieved documents and relevance thresholds in your retriever setup.
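To make the chunking trade-off concrete, here is a rough sketch of a character-based splitter. It is not the text splitter used in this guide: `split_text` is a hypothetical helper written only to show how chunk size and overlap change the number of chunks the retriever has to search.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of placeholder text
for size, overlap in [(10, 0), (10, 5), (20, 5)]:
    chunks = split_text(sample, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(chunks)} chunks")
```

Smaller chunks with more overlap yield more, finer-grained pieces to retrieve from; larger chunks keep more context together at the cost of precision.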
While optimizing the RAG pipeline is crucial for overall performance, this guide focuses specifically on evaluating LLM performance within the RAG system. We’ll assume you have a vectorstore set up using a method similar to the one shown above.
To evaluate and optimize your RAG pipeline, you should refer to our RAG metrics guide.
Define LLMs for Comparison
Set up the LLMs you want to compare. Here we’re using Zephyr and Falcon as they’re small enough to run in this experiment:
from langchain_huggingface import HuggingFaceEndpoint
# Initialize Zephyr 7B Alpha
zephyr_llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    temperature=0.5,  # Controls randomness: 0.5 is balanced
    max_length=512,  # Max length of generated text
    top_p=0.95  # Nucleus sampling parameter
)
# Initialize Falcon 7B Instruct
falcon_llm = HuggingFaceEndpoint(
    repo_id="tiiuae/falcon-7b-instruct",
    temperature=0.5,  # Same params as Zephyr for consistency
    max_length=512,
    top_p=0.95
)
Generate Responses
Create a function to generate responses using your RAG pipeline:
import pandas as pd
from langchain.chains import RetrievalQA
def generate_responses(llm, vectorstore, test_dataset):
    # Create a retriever from the vectorstore
    retriever = vectorstore.as_retriever()
    # Initialize RetrievalQA chain
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # "stuff" passes all retrieved docs to the LLM at once
        retriever=retriever,
        return_source_documents=True  # Include source documents in the output
    )
    answers = []
    contexts = []
    # Generate LLM responses for each question in the dataset
    for question in test_dataset['question']:
        # invoke() replaces the deprecated direct call qa_chain({...})
        result = qa_chain.invoke({"query": question})
        answers.append(result['result'])
        # Combine the content of all source documents, saved into 'contexts'
        contexts.append("\n\n".join([doc.page_content for doc in result['source_documents']]))
    # Return DataFrame with the results
    return pd.DataFrame({
        "question": test_dataset['question'],
        "answer": answers,
        "contexts": contexts,
        "ground_truth": test_dataset['answer']
    })
# Generate responses using Zephyr
zephyr_results = generate_responses(zephyr_llm, vectorstore, test_dataset)
# Generate responses using Falcon
falcon_results = generate_responses(falcon_llm, vectorstore, test_dataset)
print("Responses generated.")
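For intuition about the "stuff" chain type, the sketch below shows the general idea: every retrieved document is concatenated into one context block and sent to the LLM in a single prompt. The `build_stuffed_prompt` helper and its prompt wording are assumptions made for illustration; they are not LangChain's actual template.

```python
def build_stuffed_prompt(question, documents):
    """Concatenate all retrieved documents into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder documents standing in for retrieved passages
retrieved = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_stuffed_prompt("Which passage comes first?", retrieved)
print(prompt)
```

Since the whole context travels in one prompt, the number and size of retrieved documents must fit within the model's context window.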
Set Up Atla for Evaluation
Initialize the Atla client:
from atla import Atla
client = Atla(api_key="your_atla_api_key_here")
Define an evaluation function:
# Define Atla eval function
def evaluate_response(metrics, question, answer, ground_truth, context):
    return client.evaluation.create(
        input=question,
        response=answer,
        metrics=metrics,
        reference=ground_truth,
        context=context
    )
Evaluate LLM Responses
Create a function to evaluate the responses. We choose to evaluate on hallucination and groundedness in order to comprehensively evaluate the LLM responses. The first evaluates against the ground-truth answers and the second against the context that has been retrieved:
def evaluate_model(result_dataset, metrics):
    # Work on a copy so the caller's DataFrame is left unchanged
    result_dataset = result_dataset.copy()
    scores = {metric: [] for metric in metrics}
    critiques = {metric: [] for metric in metrics}
    # Default answer if ground truth is missing or empty
    empty_answer = "I do not have the necessary information to answer this question"
    for _, row in result_dataset.iterrows():
        ground_truth = row['ground_truth'] if pd.notna(row['ground_truth']) and row['ground_truth'] != '' else empty_answer
        # Evaluate the response
        evals = evaluate_response(metrics, row['question'], row['answer'], ground_truth, row['contexts'])
        # Store the scores and critiques for each metric
        for metric in metrics:
            score = evals.evaluations[metric].score
            critique = evals.evaluations[metric].critique
            scores[metric].append(score)
            critiques[metric].append(critique)
    # Add the scores and critiques to the result dataset
    for metric in metrics:
        result_dataset[f"Atla's {metric} Score"] = scores[metric]
        result_dataset[f"Atla's {metric} Critique"] = critiques[metric]
    return result_dataset
metrics = ['hallucination', 'groundedness']
zephyr_evaluated = evaluate_model(zephyr_results, metrics)
falcon_evaluated = evaluate_model(falcon_results, metrics)
print("Evaluation complete.")
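Before spending API calls, it can help to dry-run the collection logic offline. The sketch below uses a hypothetical stand-in evaluator, `fake_evaluate_response`, that returns fixed placeholder scores. It assumes only the response shape this guide's code already relies on (`evals.evaluations[metric].score` and `.critique`) and does not reproduce Atla's real output.

```python
from types import SimpleNamespace


def fake_evaluate_response(metrics, question, answer, ground_truth, context):
    """Hypothetical stand-in: returns a fixed score and critique for every metric."""
    return SimpleNamespace(
        evaluations={
            metric: SimpleNamespace(score=3, critique=f"placeholder critique for {metric}")
            for metric in metrics
        }
    )


def collect_scores(rows, metrics, evaluate=fake_evaluate_response):
    """Run the evaluator over each row and gather one score list per metric."""
    scores = {metric: [] for metric in metrics}
    for row in rows:
        evals = evaluate(metrics, row["question"], row["answer"],
                         row["ground_truth"], row["contexts"])
        for metric in metrics:
            scores[metric].append(evals.evaluations[metric].score)
    return scores


# Placeholder rows mirroring the columns produced by the response-generation step
rows = [
    {"question": "q1", "answer": "a1", "ground_truth": "g1", "contexts": "c1"},
    {"question": "q2", "answer": "a2", "ground_truth": "g2", "contexts": "c2"},
]
print(collect_scores(rows, ["hallucination", "groundedness"]))
```

Once the loop behaves as expected, the real evaluation function can be passed through the `evaluate` argument in place of the stub.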
Analyze and Visualize Results
Create a function to visualize the results:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
def analyze_results(df1, name1, df2, name2):
    metrics = ['hallucination', 'groundedness']
    # Create a figure with two subplots side by side
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    for i, (df, name) in enumerate([(df1, name1), (df2, name2)]):
        # Overlay one density curve per metric on this model's subplot
        for metric in metrics:
            sns.kdeplot(data=df[f"Atla's {metric} Score"], ax=axes[i], label=metric, fill=True)
        axes[i].set_title(f"{name} scores distribution")
        axes[i].set_xlabel("Score")
        axes[i].set_ylabel("Density")
        axes[i].legend()
        axes[i].set_xticks(np.arange(1, 6))  # Tick marks at scores 1 to 5
    plt.tight_layout()
    plt.show()
analyze_results(zephyr_evaluated, "Zephyr", falcon_evaluated, "Falcon")
This will generate a visualization comparing the performance of both models.
A higher hallucination score indicates that most claims found in the AI answer are also in the ground truth response.
A higher groundedness score indicates that most claims are backed by the context.
Interpret Results and Make a Decision
Based on the evaluation scores and visualizations:
- Compare the average scores for each metric across models.
- Look at the distribution of scores to understand consistency.
- Review critiques for qualitative insights.
- Consider other factors like inference speed and cost.
Choose the LLM that performs best on the metrics most important for your specific application.
We find that Zephyr does better on both metrics, which supports serving it over Falcon as the LLM for our basketball QA use case.
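The first two comparison steps, average score and consistency, can be sketched with the standard library. The score lists below are placeholder values chosen for illustration, not results from this guide, and `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize(scores_by_metric):
    """Return the mean and sample standard deviation of the scores for each metric."""
    return {
        metric: {
            "mean": mean(scores),
            # stdev needs at least two data points
            "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        }
        for metric, scores in scores_by_metric.items()
    }


# Placeholder scores only; substitute the score columns from your evaluated DataFrames
placeholder_scores = {
    "model_a": {"hallucination": [4, 5, 3, 4], "groundedness": [5, 4, 4, 5]},
    "model_b": {"hallucination": [2, 3, 3, 2], "groundedness": [3, 4, 2, 3]},
}

for model_name, scores_by_metric in placeholder_scores.items():
    for metric, stats in summarize(scores_by_metric).items():
        print(f"{model_name} {metric}: mean={stats['mean']:.2f}, stdev={stats['stdev']:.2f}")
```

A higher mean with a lower standard deviation suggests a model that is both better and more consistent on that metric.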
Next Steps
Now that you’ve evaluated your LLMs:
- Fine-tune your chosen model if necessary.
- Implement the selected LLM in your production environment.
- Set up continuous monitoring with Atla to ensure ongoing performance.
- Periodically re-evaluate as new models become available.
For more information on Atla’s capabilities and different use cases, check out our full documentation.