Tutorials
End-to-end tutorials using Selene with the Atla API
Dive straight into the code and start running evals instantly.
Absolute Scoring
This cookbook gets you started with absolute-score evals on a sample from the public FLASK benchmark, a collection of 1,740 human-annotated samples drawn from 120 NLP datasets. Evaluators assign each annotated skill a score from 1 to 5, based on the reference (ground-truth) answer and skill-specific scoring rubrics.
We evaluate logical robustness (whether the model avoids logical contradictions in its reasoning) using a default metric and completeness (whether the response provides sufficient explanation) using a custom-defined metric, then compare how closely Selene’s scores align with the human labels.
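As a rough illustration of what comparing judge scores with human labels can look like, the sketch below computes a Pearson correlation between two lists of 1-to-5 scores. The choice of Pearson correlation, the helper name `pearson`, and the score values are assumptions made for this example; they are placeholders, not FLASK data or Selene outputs, and the cookbook may measure alignment differently.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Placeholder scores on the 1-5 scale (illustrative only, not real results).
human_labels = [1, 2, 3, 4, 5]
judge_scores = [1, 2, 3, 5, 4]

print(f"Pearson correlation: {pearson(human_labels, judge_scores):.2f}")
```

A value close to 1 would indicate that the judge ranks responses much as the human annotators do; rank-based measures such as Spearman or Kendall correlation are common alternatives for ordinal scores.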
Hallucination Scoring
This cookbook gets you started with hallucination detection on a sample from the public RAGTruth benchmark, a large-scale corpus of naturally generated hallucinations with detailed word-level annotations, designed specifically for retrieval-augmented generation (RAG) scenarios.
We check AI responses for hallucination by asking ‘Is the information provided in the response directly supported by the context given in the related passages?’, then compare how closely Selene’s scores align with the human labels.
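For a yes/no check like this one, alignment with human labels can be summarised with simple agreement counts. The sketch below assumes both the human label and the judge score are binary, with 1 meaning "supported by the context" and 0 meaning "hallucinated"; that encoding, the helper name `agreement_report`, and the values are assumptions for illustration, not RAGTruth data or Selene outputs.

```python
def agreement_report(human, judge):
    """Compare binary judge scores with binary human labels.

    Assumed encoding: 1 = supported by the context, 0 = hallucinated.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(human, judge))
    return {
        # Fraction of samples where judge and human agree.
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        # Hallucinations flagged by both human and judge.
        "caught": sum(h == 0 and j == 0 for h, j in pairs),
        # Hallucinations the judge marked as supported.
        "missed": sum(h == 0 and j == 1 for h, j in pairs),
        # Supported responses the judge marked as hallucinated.
        "false_alarms": sum(h == 1 and j == 0 for h, j in pairs),
    }


# Placeholder labels (illustrative only, not real results).
human_labels = [1, 1, 0, 0, 1, 0]
judge_scores = [1, 0, 0, 0, 1, 1]

print(agreement_report(human_labels, judge_scores))
```

Reporting missed hallucinations separately from false alarms is useful because the two errors usually carry different costs in a RAG application.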
Multi-Criteria Evals
This cookbook gets you started running multi-criteria evals with Selene, giving you a comprehensive picture of your model’s performance. We follow eval best practices by evaluating each criterion as an individual metric, which yields clearer insights and more reliable scores.
The first section shows you how to run multi-criteria evals on one or many datapoints across 3 criteria using our async client. The second section showcases how our model performs on multi-criteria evals across 12 criteria on the public FLASK dataset.
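The general pattern of scoring each criterion as its own concurrent call can be sketched with `asyncio`. Everything below is a stand-in: `fake_judge` is a hypothetical stub that returns a placeholder number rather than calling the Atla API, and the three criterion names are invented for the example. In the cookbook, the stub's role is played by the async client.

```python
import asyncio

# Hypothetical criteria, one evaluation prompt each (names are illustrative).
CRITERIA = {
    "helpfulness": "Does the response address the user's request?",
    "conciseness": "Is the response free of unnecessary detail?",
    "coherence": "Is the response logically organised?",
}


async def fake_judge(criterion_prompt, model_input, model_output):
    """Stub standing in for a real judge call; returns a placeholder 1-5 score."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return 1 + len(model_output) % 5  # placeholder value, not a real score


async def evaluate_datapoint(model_input, model_output):
    """Score one datapoint on every criterion, one concurrent call per criterion."""
    scores = await asyncio.gather(
        *(fake_judge(prompt, model_input, model_output) for prompt in CRITERIA.values())
    )
    return dict(zip(CRITERIA, scores))


async def evaluate_many(datapoints):
    """Score many (input, output) pairs concurrently."""
    return await asyncio.gather(*(evaluate_datapoint(i, o) for i, o in datapoints))


datapoints = [
    ("What is the capital of France?", "Paris."),
    ("Name a prime number.", "Seven is a prime number."),
]
results = asyncio.run(evaluate_many(datapoints))
for (question, _), scores in zip(datapoints, results):
    print(question, scores)
```

Keeping one call per criterion means each score comes from a prompt focused on a single question, and `asyncio.gather` keeps the total wall-clock time close to that of the slowest individual call.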
Atla on Langfuse
You can use Selene as an LLM Judge in Langfuse to monitor your app’s performance in production using traces, and to run experiments over datasets before production. We provide demo videos and cookbooks for both use cases; see our Langfuse cookbooks.
Missing anything?
Get in touch with us if there’s another use case you’d like to see a cookbook for!