Dive straight into the code and start running evals instantly.

Choosing a Model (Guide)

This cookbook presents a structured way to approach picking the right model for your use case.

We take Chat as an example use case, building a playful, helpful, and cost-effective assistant. We evaluate two popular models against the criteria we care about: clarity, objectivity, and tone.

We demonstrate how Selene can be used to guide the decision.
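As a rough sketch of the comparison loop (the prompts, criteria, and `judge_score` helper below are illustrative stand-ins, not the cookbook's actual code or the Atla SDK):

```python
from statistics import mean

CRITERIA = ["clarity", "objectivity", "tone"]

def judge_score(criterion: str, prompt: str, response: str) -> int:
    """Stand-in for a Selene evaluation call returning a 1-5 score."""
    return 4  # placeholder score

# Responses from each candidate model to the same prompts (placeholders).
prompts = ["How do I reset my password?"]
responses = {
    "model_a": ["Click 'Forgot password' on the login page and follow the email link."],
    "model_b": ["Password resets are available via the login page."],
}

# Average each model's scores across all prompts and criteria, then compare.
for model, outputs in responses.items():
    scores = [judge_score(c, p, r) for p, r in zip(prompts, outputs) for c in CRITERIA]
    print(model, round(mean(scores), 2))
```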

Improving your Prompts (Guide)

This cookbook presents a structured way to improve your prompts to get the best out of your foundation model for your use case.

We take Chat as an example use case and demonstrate how Selene can be used to guide each prompt iteration.

Implementing Guardrails (Guide)

This cookbook demonstrates how to implement inference-time guardrails to validate and filter your AI outputs. We evaluate GPT-4o outputs against example safety dimensions (toxicity, bias, and medical advice) to replace problematic outputs before they are delivered to users.

We use Selene Mini, our state-of-the-art small-LLM-as-a-Judge that excels in low-latency use cases.
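A minimal sketch of the guardrail pattern, assuming a judge call that returns a 1-5 score (`call_model`, `judge_score`, and the threshold are illustrative, not the cookbook's actual code):

```python
FALLBACK = "Sorry, I can't help with that request."
SAFETY_CRITERIA = ["toxicity", "bias", "medical advice"]

def call_model(prompt: str) -> str:
    """Stand-in for the production model call (e.g. GPT-4o)."""
    return "model output goes here"

def judge_score(criterion: str, prompt: str, response: str) -> int:
    """Stand-in for a low-latency Selene Mini evaluation returning a 1-5 score."""
    return 5  # placeholder score

def guarded_response(prompt: str) -> str:
    response = call_model(prompt)
    # Check each safety dimension; replace the output if any check fails.
    for criterion in SAFETY_CRITERIA:
        if judge_score(criterion, prompt, response) < 3:  # threshold is illustrative
            return FALLBACK
    return response

print(guarded_response("Should I double my medication dose?"))
```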

Absolute Scoring (Tutorial)

This cookbook gets you started running evals with absolute scores, using a sample set from the public FLASK benchmark - a collection of 1,740 human-annotated samples from 120 NLP datasets. Evaluators assign scores from 1 to 5 for each annotated skill based on the reference (ground-truth) answer and skill-specific scoring rubrics.

We evaluate logical robustness (whether the model avoids logical contradictions in its reasoning) and completeness (whether the response provides sufficient explanation) using default and custom-defined metrics respectively, then compare how Selene’s scores align with the human labels.
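For instance, alignment with the human labels can be summarised with a simple correlation over the per-sample scores (the numbers below are placeholders, not FLASK results):

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Placeholder 1-5 scores for the same samples from human annotators and Selene.
human_scores = [4, 2, 5, 3, 1, 4]
judge_scores = [4, 3, 5, 3, 2, 4]

print(f"Pearson r = {correlation(human_scores, judge_scores):.2f}")
```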

Hallucination Scoring (Tutorial)

This cookbook gets you started detecting hallucinations, running over a sample set from the public RAGTruth benchmark - a large-scale corpus of naturally generated hallucinations, featuring detailed word-level annotations specifically designed for retrieval-augmented generation (RAG) scenarios.

We check for hallucination in AI responses, i.e. ‘Is the information provided in the response directly supported by the context given in the related passages?’, and compare how Selene’s scores align with the human labels.
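Roughly, each RAG sample pairs the retrieved passages with the model's answer, and the judge returns a supported/unsupported verdict (the sample and `judge_supported` helper below are illustrative stand-ins, not the cookbook's code):

```python
def judge_supported(passages: list[str], response: str) -> bool:
    """Stand-in for a Selene hallucination check: is the response directly
    supported by the retrieved passages?"""
    return True  # placeholder verdict

sample = {
    "passages": ["The company was founded in 1998 in Seattle."],
    "response": "The company was founded in 1998.",
}

# A response counts as a hallucination when the judge finds it unsupported.
hallucinated = not judge_supported(**sample)
print("hallucinated:", hallucinated)
```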

Multi-Criteria Evals (Tutorial)

This cookbook gets you started on running multi-criteria evals with Selene, to help you get a comprehensive picture of your model’s performance. We follow eval best practices by evaluating each criterion as an individual metric to receive clearer insights and more reliable scores.

The first section shows you how to run multi-criteria evals on one or many data points across three criteria using our async client. The second section showcases how our model performs on multi-criteria evals across 12 criteria on the public FLASK dataset.
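A minimal sketch of the first pattern using asyncio (the criteria and the `evaluate` helper stand in for real async Selene calls):

```python
import asyncio

CRITERIA = ["clarity", "objectivity", "tone"]  # illustrative criteria

async def evaluate(criterion: str, model_input: str, model_output: str) -> float:
    """Stand-in for an async Selene call scoring one criterion as its own metric."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    return 5.0  # placeholder score

async def evaluate_all(model_input: str, model_output: str) -> dict:
    # Run one evaluation per criterion concurrently and collect the scores.
    scores = await asyncio.gather(
        *(evaluate(c, model_input, model_output) for c in CRITERIA)
    )
    return dict(zip(CRITERIA, scores))

if __name__ == "__main__":
    print(asyncio.run(evaluate_all("user question", "assistant answer")))
```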

Atla on Langfuse

You can use Selene as an LLM Judge in Langfuse to monitor your app’s performance in production using traces, as well as to run experiments over datasets pre-production. We provide demo videos and cookbooks for both use cases. Click here to go to our Langfuse cookbooks.

Missing anything?

Get in touch with us if there’s another use case you’d like to see a cookbook for!