Build with Atla
An end-to-end guide to optimizing your AI app with Atla evaluations. Build more reliable GenAI applications, and find mistakes before your customers do.
Choosing the Best LLM for Your AI Application
There are many hyperparameters we can iterate on when building with GenAI. We will focus on arguably the most crucial: selecting the right LLM to use. This guide will walk you through the process of evaluating and comparing different LLMs using Atla and a large test dataset from HuggingFace. It takes ~20 mins to run through and you can find the cookbook here.
Prepare Your Test Dataset
Load a dataset for testing. In this example, we’re using a basketball Q&A dataset:
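As a minimal sketch, you might load it with the Hugging Face `datasets` library; the dataset identifier below is a placeholder for the one used in the cookbook:

```python
from datasets import load_dataset

# Placeholder identifier -- replace with the basketball Q&A dataset from the cookbook.
dataset = load_dataset("your-org/basketball-qa", split="test")

print(dataset[0])  # expect fields such as "question" and "answer"
```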
Set Up Your RAG Pipeline
Create a Retrieval-Augmented Generation (RAG) pipeline. This involves loading your document corpus, creating embeddings, building a vector store, and setting up a retriever.
We load a corpus of ~50k preprocessed, cleaned and organized documents extracted from Wikipedia articles related to basketball. For instance, this corpus includes articles about NBA teams, FIBA World Cup histories, biographies of famous players like Michael Jordan, and explanations of basketball positions and strategies. You can download this data here.
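A minimal sketch of this setup with LangChain, Hugging Face embeddings, and Chroma is shown below. The chunk sizes, embedding model, and number of retrieved documents are illustrative defaults rather than the cookbook's exact settings, and `documents` stands for the corpus you downloaded above:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# `documents` is the list of cleaned Wikipedia passages (one string per document).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents(documents)

# Embed the chunks and index them in a local Chroma vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings)

# Expose the store as a retriever that returns the top-k most similar chunks per query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```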
This setup provides a starting point, but there are many parameters you can tune to optimize your RAG pipeline:
- Chunking parameters: Adjust `chunk_size` and `chunk_overlap` in the text splitter to find the optimal balance between context preservation and retrieval granularity.
- Embedding model: Experiment with different embedding models to improve semantic understanding and retrieval accuracy.
- Vector store type: Consider alternatives to Chroma, such as FAISS or Pinecone, depending on your needs.
- Retrieval parameters: Fine-tune the number of retrieved documents and relevance thresholds in your retriever setup.
While optimizing the RAG pipeline is crucial for overall performance, this guide focuses specifically on evaluating LLM performance within the RAG system. We'll assume you have a `vectorstore` set up using a method similar to the one shown above.
To evaluate and optimize your RAG pipeline itself, refer to our RAG metrics guide.
Define LLMs for Comparison
Set up the LLMs you want to compare. Here we’re using Zephyr and Falcon as they’re small enough to run in this experiment:
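As a sketch, both models can be loaded as `transformers` text-generation pipelines. The model IDs below are the public instruct checkpoints and may differ from the exact ones used in the cookbook:

```python
from transformers import pipeline

zephyr = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", device_map="auto")
falcon = pipeline("text-generation", model="tiiuae/falcon-7b-instruct", device_map="auto")

llms = {"zephyr-7b-beta": zephyr, "falcon-7b-instruct": falcon}
```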
Generate Responses
Create a function to generate responses using your RAG pipeline:
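A sketch of such a helper, assuming the `retriever` and the `transformers` pipelines defined above:

```python
def generate_response(llm, question: str) -> dict:
    """Retrieve context for the question and answer it with the given LLM."""
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)

    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    output = llm(prompt, max_new_tokens=256, return_full_text=False)
    return {"answer": output[0]["generated_text"].strip(), "context": context}
```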
Set Up Atla for Evaluation
Initialize the Atla client:
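A minimal sketch; the client can also read your API key from the `ATLA_API_KEY` environment variable if you don't pass one explicitly (check the Atla docs for the constructor arguments of your SDK version):

```python
import os

from atla import Atla

client = Atla(api_key=os.environ["ATLA_API_KEY"])
```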
Define an evaluation function:
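For example, a thin wrapper around a single Atla evaluation call might look like the sketch below. The exact method name and keyword arguments depend on your SDK version, so treat this signature as an assumption and confirm it against the Atla API reference:

```python
def evaluate(metric: str, question: str, answer: str, context: str | None = None,
             reference: str | None = None):
    """Run one Atla evaluation and return its score and critique.

    NOTE: the call below is a sketch -- confirm the current `client.evaluation.create`
    signature in the Atla API reference before running it.
    """
    result = client.evaluation.create(
        metrics=[metric],
        input=question,
        response=answer,
        context=context,
        reference=reference,
    )
    evaluation = result.evaluations[metric]
    return evaluation.score, evaluation.critique
```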
Evaluate LLM Responses
Create a function to evaluate the responses. We evaluate on `hallucination` and `groundedness` to assess the LLM responses comprehensively: the first scores the answer against the ground-truth reference, the second against the context that has been retrieved:
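A sketch of the evaluation loop, assuming the dataset rows expose `question` and `answer` fields and using the `generate_response` and `evaluate` helpers above:

```python
def evaluate_model(model_name: str, llm, dataset) -> list[dict]:
    """Generate an answer per question and score it on hallucination and groundedness."""
    results = []
    for row in dataset:
        generated = generate_response(llm, row["question"])

        # Hallucination: score the answer against the ground-truth reference.
        hall_score, hall_critique = evaluate(
            "hallucination", row["question"], generated["answer"], reference=row["answer"]
        )
        # Groundedness: score the answer against the retrieved context.
        ground_score, ground_critique = evaluate(
            "groundedness", row["question"], generated["answer"], context=generated["context"]
        )

        results.append({
            "model": model_name,
            "question": row["question"],
            "hallucination": hall_score,
            "groundedness": ground_score,
            "hallucination_critique": hall_critique,
            "groundedness_critique": ground_critique,
        })
    return results
```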
Analyze and Visualize Results
Create a function to visualize the results:
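For instance, a simple comparison of average scores per model with pandas and matplotlib, assuming the result dictionaries produced by `evaluate_model` above:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_results(results: list[dict]) -> None:
    """Plot the mean hallucination and groundedness score for each model."""
    df = pd.DataFrame(results)
    summary = df.groupby("model")[["hallucination", "groundedness"]].mean()
    summary.plot(kind="bar", title="Average Atla scores per model")
    plt.ylabel("Score")
    plt.tight_layout()
    plt.show()
```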
This will generate a visualization comparing the performance of both models.
A higher `hallucination` score indicates that most claims found in the AI answer are also present in the ground-truth response. A higher `groundedness` score indicates that most claims in the answer are backed by the retrieved context.
Interpret Results and Make a Decision
Based on the evaluation scores and visualizations:
- Compare the average scores for each metric across models.
- Look at the distribution of scores to understand consistency.
- Review critiques for qualitative insights.
- Consider other factors like inference speed and cost.
Choose the LLM that performs best on the metrics most important for your specific application.
We find that Zephyr scores higher on both metrics, a positive indication that we should serve it over Falcon as the LLM for our basketball QA use case.
Next Steps
Now that you’ve evaluated your LLMs:
- Fine-tune your chosen model if necessary.
- Implement the selected LLM in your production environment.
- Set up continuous monitoring with Atla to ensure ongoing performance.
- Periodically re-evaluate as new models become available.
For more information on Atla’s capabilities and different use cases, check out our full documentation.