DeepEval
A guide on how to run evaluations with DeepEval using a self-deployed Selene Mini as you develop and monitor your AI application.
About DeepEval
DeepEval is an open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest, but specialized for unit testing LLM outputs.
About Atla
Atla trains frontier models to test & evaluate GenAI applications. Our models help developers run fast and accurate evaluations, so they can ship quickly and with confidence.
Using DeepEval with Selene Mini
DeepEval supports the use of custom models from Hugging Face, as well as locally hosted models. Use DeepEval's plug-and-play metrics and prompts with Selene Mini to quickly start evaluating the quality of your GenAI application.
Below you will find quick start guides to do so with:
- Selene Mini through Ollama
- Selene Mini through Hugging Face Transformers
Use Selene Mini Locally with Ollama
Download Selene Mini on Ollama
Start by pulling the Selene Mini model. From the terminal run the following:
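Assuming Ollama is already installed, a minimal pull command would be:

```shell
ollama pull atla/selene-mini
```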
If this ran correctly, you should see atla/selene-mini:latest in the output after you run the following:
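To verify the model downloaded successfully, list your local models:

```shell
ollama list
```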
Configure Selene Mini on DeepEval
Install DeepEval if you haven’t already:
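DeepEval is distributed on PyPI, so a standard install looks like:

```shell
pip install deepeval
```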
Run the following command to configure Selene Mini as the LLM for all LLM-based metrics.
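A sketch of the configuration command, using DeepEval's local-model CLI. The base URL below is Ollama's default OpenAI-compatible endpoint, and the API key is a placeholder (Ollama does not check it); confirm the exact flag names against DeepEval's documentation:

```shell
deepeval set-local-model --model-name="atla/selene-mini:latest" \
    --base-url="http://localhost:11434/v1/" \
    --api-key="ollama"
```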
Where:
- The model name is the NAME that appears after executing ollama list
- The base URL is the endpoint shown in the cURL examples in the Ollama documentation
Run your DeepEval scripts as usual
Within the same terminal, run your code as usual. The model being used to calculate evals will default to Selene Mini.
Here is some example code you can try:
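A minimal sketch using DeepEval's answer relevancy metric. The input and output strings are illustrative; with Selene Mini configured as the local model, it will be used as the judge:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Build a test case: the metric judges whether the output answers the input.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to get a full refund at no extra cost.",
)

# Selene Mini (the configured local model) scores the test case.
metric = AnswerRelevancyMetric(threshold=0.5)
evaluate(test_cases=[test_case], metrics=[metric])
```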
Load Selene Mini using Hugging Face
Follow the steps below or jump straight into the code in our cookbook.
Load Selene from HF
You will require the following packages:
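An assumed package list, based on the 4-bit quantized loading described in this guide (bitsandbytes for quantization, accelerate for device placement):

```shell
pip install deepeval transformers torch bitsandbytes accelerate
```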
Load the model and tokenizer:
We load the model in 4-bit quantized format to make it easy to run in a Colab notebook. Remove the quantization config to run the full precision model (requires 32GB VRAM).
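A sketch of the loading step with the Transformers library. The model ID below is an assumption; check the Atla organization page on Hugging Face for the exact repository name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed Hugging Face model ID for Selene Mini.
model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"

# 4-bit quantization so the model fits in a Colab GPU; drop this config
# to load the full-precision model instead (requires 32GB VRAM).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```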
Create CustomModel class for DeepEval
- Inherit the class DeepEvalBaseLLM.
- Implement load_model(), which will be responsible for returning a model object.
- Implement generate() with a parameter of type string that acts as the prompt to your custom LLM. This function returns the generated string output from Selene.
- Implement get_model_name(), which simply returns a string representing our custom model name.
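The steps above can be sketched as follows, assuming the model and tokenizer loaded in the previous step are in scope. Note that DeepEval's base class also expects an async a_generate() method, which here simply delegates to generate(); the generation parameters are illustrative:

```python
from deepeval.models.base_model import DeepEvalBaseLLM

class CustomSelene(DeepEvalBaseLLM):
    """Wraps the Hugging Face model/tokenizer for use as a DeepEval judge."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        # Return the underlying model object.
        return self.model

    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=512)
        # Strip the prompt tokens so only the completion is returned.
        completion = outputs[0][inputs["input_ids"].shape[-1]:]
        return self.tokenizer.decode(completion, skip_special_tokens=True)

    async def a_generate(self, prompt: str) -> str:
        # Async variant required by DeepEval; delegate to generate().
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "Selene Mini"

custom_selene = CustomSelene(model, tokenizer)
```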
Start running evals with DeepEval
Set model = custom_selene as you initialize the evaluation metric, and Selene will be used to run the evaluation.
Here is a RAG Hallucination example you can try:
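A minimal sketch of a hallucination check, assuming custom_selene from the previous step. The context and answer strings are illustrative; the metric checks whether the output stays faithful to the retrieved context:

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Retrieved context the answer should stay faithful to.
context = ["The refund policy allows returns within 30 days of purchase."]

test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="You can return items within 30 days of purchase.",
    context=context,
)

# Pass Selene as the judge model; scores at or above the threshold pass.
metric = HallucinationMetric(threshold=0.5, model=custom_selene)
metric.measure(test_case)
print(metric.score, metric.reason)
```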
DeepEval's default metrics output a score between 0 and 1. The metric succeeds if the evaluation score is greater than or equal to the instantiated threshold.
You can find more examples to try out in our cookbook!