Choosing the Best LLM for Your AI Application

There are many parameters we can iterate on when building with GenAI. We will focus on arguably the most crucial: selecting the right LLM. This guide walks you through evaluating and comparing different LLMs using Atla and a large test dataset from HuggingFace. It takes ~20 mins to run through, and you can find the cookbook here.

1. Prepare Your Test Dataset

Load a dataset for testing. In this example, we’re using a basketball Q&A dataset:

from datasets import load_dataset
import pandas as pd

# Load the first 20 items from the basketball QA dataset
dataset = load_dataset("PedroCJardim/QASports", "basketball", split="test[:20]")
df = dataset.to_pandas()
test_dataset = df[['question']].copy()
# Extract the 'text' field from the 'answer' column
test_dataset['answer'] = df['answer'].apply(lambda x: eval(x)['text']) 
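
The prepared test_dataset now holds 20 question/answer pairs; you can take a quick look before moving on:

print(test_dataset.head())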

2. Set Up Your RAG Pipeline

Create a Retrieval-Augmented Generation (RAG) pipeline. This involves loading your document corpus, creating embeddings, building a vector store, and setting up a retriever.

We load a corpus of ~50k preprocessed, cleaned and organized documents extracted from Wikipedia articles related to basketball. For instance, this corpus includes articles about NBA teams, FIBA World Cup histories, biographies of famous players like Michael Jordan, and explanations of basketball positions and strategies. You can download this data here.
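
The exact ingestion code depends on the format of the downloaded corpus, but a minimal sketch of one possible setup looks like the following. It assumes LangChain's RecursiveCharacterTextSplitter, a sentence-transformers embedding model, and a Chroma vector store; the two inline documents are placeholders for the real corpus.

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Placeholder documents -- replace with the ~50k basketball documents you downloaded
documents = [
    Document(page_content="Michael Jordan played 15 seasons in the NBA..."),
    Document(page_content="The FIBA Basketball World Cup is the flagship event of FIBA..."),
]

# Split documents into overlapping chunks for retrieval
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

# Embed the chunks and store them in a Chroma vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings)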

This setup provides a starting point, but there are many parameters you can tune to optimize your RAG pipeline:

  • Chunking parameters: Adjust chunk_size and chunk_overlap in the text splitter to find the optimal balance between context preservation and retrieval granularity.
  • Embedding model: Experiment with different embedding models to improve semantic understanding and retrieval accuracy.
  • Vector store type: Consider alternatives to Chroma, such as FAISS or Pinecone, depending on your needs.
  • Retrieval parameters: Fine-tune the number of retrieved documents and relevance thresholds in your retriever setup.

While optimizing the RAG pipeline is crucial for overall performance, this guide focuses specifically on evaluating LLM performance within the RAG system. We’ll assume you have a vectorstore set up using a method similar to the one shown above.

To evaluate and optimize your RAG pipeline, refer to our RAG metrics guide.

3. Define LLMs for Comparison

Set up the LLMs you want to compare. Here we use Zephyr 7B Alpha and Falcon 7B Instruct, as they're small enough to run in this experiment:

from langchain_huggingface import HuggingFaceEndpoint

# Initialize Zephyr 7B Alpha
zephyr_llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    temperature=0.5,  # Controls randomness: 0.5 is balanced
    max_new_tokens=512,  # Max number of new tokens to generate
    top_p=0.95        # Nucleus sampling parameter
)

# Initialize Falcon 7B Instruct
falcon_llm = HuggingFaceEndpoint(
    repo_id="tiiuae/falcon-7b-instruct",
    temperature=0.5,  # Same params as Zephyr for consistency
    max_new_tokens=512,
    top_p=0.95
)
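
Before wiring these models into the RAG chain, you can optionally confirm that the endpoints respond. This assumes your HuggingFace API token is available, e.g. via the HUGGINGFACEHUB_API_TOKEN environment variable; the prompt is just an example:

# Optional smoke test of the Zephyr endpoint
print(zephyr_llm.invoke("In one sentence, who invented basketball?"))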

4. Generate Responses

Create a function to generate responses using your RAG pipeline:

import pandas as pd
from langchain.chains import RetrievalQA

def generate_responses(llm, vectorstore, test_dataset):
    # Create a retriever from the vectorstore
    retriever = vectorstore.as_retriever()

    # Initialize RetrievalQA chain
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # "stuff" passes all retrieved docs to the LLM at once
        retriever=retriever,
        return_source_documents=True  # Include source documents in the output
    )

    answers = []
    contexts = []

    # Generate an LLM response for each question in the dataset
    for question in test_dataset['question']:
        result = qa_chain.invoke({"query": question})
        answers.append(result['result'])
        # Combine the content of all retrieved source documents into a single context string
        contexts.append("\n\n".join([doc.page_content for doc in result['source_documents']]))

    # Return DataFrame with the results
    return pd.DataFrame({
        "question": test_dataset['question'],
        "answer": answers,
        "contexts": contexts,
        "ground_truth": test_dataset['answer']
    })

# Generate responses using Zephyr
zephyr_results = generate_responses(zephyr_llm, vectorstore, test_dataset)

# Generate responses using Falcon
falcon_results = generate_responses(falcon_llm, vectorstore, test_dataset)

print("Responses generated.")
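
Optionally, inspect a few of the generated answers before evaluating them:

print(zephyr_results[['question', 'answer']].head(3))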

5. Set Up Atla for Evaluation

Initialize the Atla client:

from atla import Atla

client = Atla(api_key="your_atla_api_key_here")
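
Rather than hard-coding the key, you may prefer to read it from an environment variable (ATLA_API_KEY is simply the variable name chosen for this example):

import os

client = Atla(api_key=os.environ["ATLA_API_KEY"])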

Define an evaluation function:

# Define Atla eval function
def evaluate_response(metrics, question, answer, ground_truth, context):
    return client.evaluation.create(
        input=question,
        response=answer,
        metrics=metrics,
        reference=ground_truth,
        context=context
    )
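
To see the shape of the object this returns, you can score a single row first. This is just an illustration using the zephyr_results DataFrame from step 4; the full loop in the next step handles missing ground truths more carefully:

row = zephyr_results.iloc[0]
# Fall back to a default reference if the ground truth is empty
reference = row['ground_truth'] or "I do not have the necessary information to answer this question"
evals = evaluate_response(['hallucination'], row['question'], row['answer'], reference, row['contexts'])
print(evals.evaluations['hallucination'].score)
print(evals.evaluations['hallucination'].critique)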

6. Evaluate LLM Responses

Create a function to evaluate the responses. We evaluate on both hallucination and groundedness to cover the LLM responses from two angles: hallucination scores the answer against the ground-truth reference, while groundedness scores it against the retrieved context:

def evaluate_model(result_dataset, metrics):
    result_dataset = result_dataset.copy()

    scores = {metric: [] for metric in metrics}
    critiques = {metric: [] for metric in metrics}

    for _, row in result_dataset.iterrows():
        # Default answer if ground truth is missing or empty
        empty_answer = "I do not have the necessary information to answer this question"
        ground_truth = row['ground_truth'] if pd.notna(row['ground_truth']) and row['ground_truth'] != '' else empty_answer

        # Evaluate the response
        evals = evaluate_response(metrics, row['question'], row['answer'], ground_truth, row['contexts'])

        # Store the scores and critiques for each metric
        for metric in metrics:
            score = evals.evaluations[metric].score
            critique = evals.evaluations[metric].critique

            scores[metric].append(score)
            critiques[metric].append(critique)

    # Add the scores and critiques to the result dataset
    for metric in metrics:
        result_dataset[f"Atla's {metric} Score"] = scores[metric]
        result_dataset[f"Atla's {metric} Critique"] = critiques[metric]

    return result_dataset

metrics = ['hallucination', 'groundedness']

zephyr_evaluated = evaluate_model(zephyr_results, metrics)
falcon_evaluated = evaluate_model(falcon_results, metrics)

print("Evaluation complete.")

7. Analyze and Visualize Results

Create a function to visualize the results:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def analyze_results(df1, name1, df2, name2):
    metrics = ['hallucination', 'groundedness']

    # Create a figure with two subplots side by side
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    for i, (df, name) in enumerate([(df1, name1), (df2, name2)]):
        for metric in metrics:
            sns.kdeplot(data=df[f"Atla's {metric} Score"], ax=axes[i], label=metric, fill=True)

        axes[i].set_title(f"{name} scores distribution")
        axes[i].set_xlabel("Score")
        axes[i].set_ylabel("Density")
        axes[i].legend()
        # Show ticks at scores 1 through 5
        axes[i].set_xticks(np.arange(1, 6))

    plt.tight_layout()
    plt.show()

analyze_results(zephyr_evaluated, "Zephyr", falcon_evaluated, "Falcon")

This will generate a visualization comparing the performance of both models.

LLM Performance Comparison

A higher hallucination score indicates that most claims in the AI answer also appear in the ground-truth response.

A higher groundedness score indicates that most claims are backed by the retrieved context.

8. Interpret Results and Make a Decision

Based on the evaluation scores and visualizations:

  1. Compare the average scores for each metric across models (see the sketch below).
  2. Look at the distribution of scores to understand consistency.
  3. Review critiques for qualitative insights.
  4. Consider other factors like inference speed and cost.

Choose the LLM that performs best on the metrics most important for your specific application.
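
For the first point, a quick way to compare the average scores (using the column names created by evaluate_model above):

for metric in ['hallucination', 'groundedness']:
    col = f"Atla's {metric} Score"
    print(f"{metric}: Zephyr {zephyr_evaluated[col].mean():.2f} vs Falcon {falcon_evaluated[col].mean():.2f}")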


We find that Zephyr scores higher on both metrics, a positive indication that we should serve it over Falcon as the LLM for our basketball QA use case.

Next Steps

Now that you’ve evaluated your LLMs:

  • Fine-tune your chosen model if necessary.
  • Implement the selected LLM in your production environment.
  • Set up continuous monitoring with Atla to ensure ongoing performance.
  • Periodically re-evaluate as new models become available.

For more information on Atla’s capabilities and different use cases, check out our full documentation.