Tutorials
End-to-end tutorials using Selene with the Atla API
Dive straight into the code and start running evals instantly.
Absolute Scoring
This cookbook gets you started running evals with absolute scores, using a sample set from the public FLASK benchmark, a collection of 1,740 human-annotated samples drawn from 120 NLP datasets. Evaluators assign scores from 1 to 5 for each annotated skill, based on the reference (ground-truth) answer and skill-specific scoring rubrics.
We evaluate logical robustness (whether the model avoids logical contradictions in its reasoning) using a default metric and completeness (whether the response provides sufficient explanation) using a custom-defined metric, then compare how Selene’s scores align with the human labels.
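As a rough sketch, an absolute-scoring eval of this kind could look like the snippet below. The client class `Atla`, the `evaluation.create` method, and the parameter names shown are assumptions based on common SDK patterns, not a confirmed API surface; check the API reference for the exact signature before running it.

```python
# pip install atla  -- package name assumed; adjust to the actual SDK
from atla import Atla  # assumed client class

client = Atla(api_key="YOUR_ATLA_API_KEY")  # assumed constructor argument

# FLASK-style sample: score logical robustness on a 1-5 scale.
model_input = "Explain why the sum of two even numbers is always even."
model_output = (
    "Two even numbers can be written as 2a and 2b. Their sum is 2(a + b), "
    "which is divisible by 2, so it is even."
)
reference = "An even number is 2k; 2a + 2b = 2(a + b), hence the sum is even."

# Assumed method and parameter names for requesting an absolute score.
evaluation = client.evaluation.create(
    model_id="atla-selene",            # assumed model identifier
    model_input=model_input,
    model_output=model_output,
    expected_model_output=reference,   # reference (ground-truth) answer
    evaluation_criteria=(
        "Logical robustness: does the response avoid logical contradictions "
        "in its reasoning? Score from 1 (many contradictions) to 5 (none)."
    ),
)

print(evaluation.result.evaluation.score)     # e.g. 5
print(evaluation.result.evaluation.critique)  # Selene's reasoning
```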
Hallucination Scoring
This cookbook gets you started detecting hallucinations, running over a sample set from the public RAGTruth benchmark, a large-scale corpus of naturally generated hallucinations with detailed word-level annotations designed for retrieval-augmented generation (RAG) scenarios.
We check for hallucinations in AI responses, i.e. ‘Is the information provided in the response directly supported by the context given in the related passages?’, and compare how Selene’s scores align with the human labels.
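A hallucination check of this kind could be sketched by passing the retrieved passages alongside the question and response, as below. Again, the SDK surface shown (`Atla`, `evaluation.create`, the `model_context` field) is an assumption; consult the API reference for the exact field names.

```python
from atla import Atla  # assumed client class

client = Atla(api_key="YOUR_ATLA_API_KEY")

# RAG-style sample: the response must be grounded in the retrieved passages.
context = (
    "Passage 1: The Eiffel Tower was completed in 1889 and stands 330 metres tall. "
    "Passage 2: It was the tallest man-made structure until 1930."
)
question = "When was the Eiffel Tower completed and how tall is it?"
response = "The Eiffel Tower was completed in 1889 and is 330 metres tall."

evaluation = client.evaluation.create(
    model_id="atla-selene",      # assumed model identifier
    model_input=question,
    model_output=response,
    model_context=context,       # assumed name for the retrieved-passages field
    evaluation_criteria=(
        "Is the information provided in the response directly supported by "
        "the context given in the related passages? Answer 1 for yes, 0 for no."
    ),
)

print(evaluation.result.evaluation.score)  # 1 if grounded, 0 if hallucinated
```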
Missing anything?
Get in touch with us if there’s another use case you’d like to see a cookbook for!