About DeepEval

DeepEval is an open-source framework for evaluating and testing large language model (LLM) systems. It is similar to Pytest, but specialized for unit testing LLM outputs.

About Atla

Atla trains frontier models to test & evaluate GenAI applications. Our models help developers run fast and accurate evaluations, so they can ship quickly and with confidence.

Using DeepEval with Selene Mini


DeepEval supports custom models from Hugging Face as well as locally hosted models. Use DeepEval’s plug-and-use metrics and prompts with Selene Mini to quickly start evaluating the quality of your GenAI application.

Below you will find quick start guides to do so with:

  1. Selene Mini through Ollama

  2. Selene Mini through Hugging Face Transformers

Use Selene Mini Locally with Ollama


Step 1: Download Selene Mini on Ollama

Start by pulling the Selene Mini model. From the terminal run the following:

ollama pull atla/selene-mini

If this has run correctly, you should see atla/selene-mini:latest after you run the following:

ollama list

Step 2: Configure Selene Mini on DeepEval

Install DeepEval if you haven’t already:

pip install deepeval

Run the following command to configure Selene Mini as the LLM for all LLM-based metrics.

deepeval set-local-model --model-name=atla/selene-mini:latest --base-url="http://localhost:11434/v1/" --api-key="ollama"

Where:

  • The model name is the NAME that appears in the output of ollama list

  • The base URL is Ollama’s local OpenAI-compatible endpoint, as given in the Ollama documentation (http://localhost:11434/v1/ for a default install); a quick sanity check is sketched below
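
Before running any metrics, you can optionally confirm that Ollama is serving the model on that endpoint. This is a minimal sketch, assuming Ollama is running on its default port and that the requests package is installed:

import requests

# List the models served by Ollama's OpenAI-compatible API
# (assumes Ollama is running locally on its default port)
response = requests.get("http://localhost:11434/v1/models")
response.raise_for_status()

model_ids = [m["id"] for m in response.json()["data"]]
print(model_ids)  # expect "atla/selene-mini:latest" to appear in this list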

Step 3: Run your DeepEval scripts as usual

Within the same terminal, run your code as usual. The model used to calculate evals will now default to Selene Mini.

Here is some example code you can try:

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

metric = FaithfulnessMetric(
  threshold=0.7,
  # model = <model_name> does not need to be specified
  include_reason=True,
  async_mode=False
)
test_case = LLMTestCase(
  input="What if these shoes don't fit?",
  actual_output=actual_output,
  retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
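
The evaluate function imported at the top of the example can also run metrics over a batch of test cases. A minimal sketch, reusing the metric and test_case defined above:

# Evaluate one or more test cases against one or more metrics in a single call
evaluate(test_cases=[test_case], metrics=[metric])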

Load Selene Mini using Hugging Face


Follow the steps below or jump straight into the code in our cookbook.

Step 1: Load Selene from Hugging Face

You will require the following packages:

!pip install -U torch transformers deepeval bitsandbytes lm-format-enforcer --quiet

Load the model and tokenizer:

We load the model in 4-bit quantized format to make it easy to run in a Colab notebook. Remove the quantization config to run the full precision model (requires 32GB VRAM).

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

selene_model = AutoModelForCausalLM.from_pretrained(
    "AtlaAI/Selene-1-Mini-Llama-3.1-8B",
    device_map="auto",
    quantization_config=quantization_config # remove to load FP16 model
)
selene_tokenizer = AutoTokenizer.from_pretrained(
    "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
)
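
Optionally, you can check that the model loaded correctly before wrapping it for DeepEval. A minimal sketch using the standard transformers chat template (the prompt here is only an illustration):

# Quick smoke test: run a single generation through Selene Mini
messages = [{"role": "user", "content": "Say hello in one short sentence."}]
input_ids = selene_tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(selene_model.device)

output_ids = selene_model.generate(input_ids, max_new_tokens=32)
print(selene_tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))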

Step 2: Create a CustomModel class for DeepEval

  1. Inherit the class DeepEvalBaseLLM.

  2. Implement load_model(), which is responsible for returning the model object.

  3. Implement generate() with a parameter of type string that acts as the prompt to your custom LLM. This function returns the generated string output from Selene.

  4. Implement a_generate(), the asynchronous version of generate() required by DeepEvalBaseLLM. Here it simply reuses generate().

  5. Implement get_model_name(), which simply returns a string representing our custom model name.

from deepeval.models.base_model import DeepEvalBaseLLM
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)
from pydantic import BaseModel
import json

class CustomModel(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel = None) -> BaseModel | str:
        model = self.load_model()

        # HF pipeline for text-generation inference
        text_gen_pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_new_tokens=256,
            num_return_sequences=1,
        )

        if schema is not None:
            # Create the parser required for JSON confinement using lmformatenforcer
            parser = JsonSchemaParser(schema.model_json_schema())
            prefix_function = build_transformers_prefix_allowed_tokens_fn(
                text_gen_pipeline.tokenizer, parser
            )

            # Generate output and load it as valid JSON
            output_dict = text_gen_pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
            output = output_dict[0]["generated_text"][len(prompt):]
            json_result = json.loads(output)

            # Return a valid JSON object according to the schema DeepEval supplied
            return schema(**json_result)

        # No schema supplied: return the generated text as a plain string
        output_dict = text_gen_pipeline(prompt)
        return output_dict[0]["generated_text"][len(prompt):]

    async def a_generate(self, prompt: str, schema: BaseModel = None) -> BaseModel | str:
        # Asynchronous version required by DeepEvalBaseLLM; reuse the synchronous path
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Atla Selene Mini"

custom_selene = CustomModel(model=selene_model, tokenizer=selene_tokenizer)
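
As a quick check, calling generate() directly on the wrapper should return plain text when no schema is supplied (the prompt below is only an illustration):

# Expect a plain string back when no schema is supplied
print(custom_selene.generate("Reply with the single word: ready"))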

Step 3: Start running evals with DeepEval

Set model=custom_selene when you initialize the evaluation metric, and Selene will be used to run the evaluation.

Here is a RAG Hallucination example you can try:

DeepEval’s default metrics output a score between 0 and 1. The metric is successful if the evaluation score is greater than or equal to the instantiated threshold.

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Replace this with the documents that you are passing as input to your LLM
context=["A man with blond-hair, and a brown shirt drinking out of a public water fountain."]

# Replace this with the actual output from your LLM application
actual_output="A blond drinking water in public."

test_case = LLMTestCase(
    input="What was the blond doing?",
    actual_output=actual_output,
    context=context
)
metric = HallucinationMetric(
    threshold=0.5,
    model=custom_selene
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
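
Since the threshold determines whether the metric passes, you can also check the boolean result directly. A minimal sketch, reusing the metric measured above:

# True if metric.score >= the threshold set above
print(metric.is_successful())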

You can find more examples to try out in our cookbook!