Model card


You can find our Hugging Face model card here.

Quickstart:


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = "cuda" # the device to load the model onto

model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,  # remove to load in FP16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "I heard you can evaluate my responses?" # replace with your eval prompt
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
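
The decoded response is plain text containing the model's evaluation; print it to inspect the result:

print(response)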

API reference

Prompt templates


To achieve the best results, use the prompts we used for training, provided here.
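
As a rough sketch of how a template is filled in before being passed to the tokenizer's chat template (the template text below is only an illustrative placeholder, not one of the actual training prompts, which you should copy from the link above):

# Placeholder template for illustration only; use the actual prompts linked above.
EVAL_TEMPLATE = """You are tasked with evaluating a response based on a given instruction and a scoring rubric.

Instruction: {instruction}
Response: {response}
Rubric: {rubric}

Provide your reasoning, then a final score."""

prompt = EVAL_TEMPLATE.format(
    instruction="Explain why the sky is blue.",
    response="The sky is blue because of Rayleigh scattering.",
    rubric="Score 1-5 for completeness of the explanation.",
)
messages = [{"role": "user", "content": prompt}]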

Cookbooks


Try our cookbooks to start running two popular use cases straight away.

Absolute scoring

This example gets you started running evals with absolute scores on a sample set from the public FLASK benchmark, a collection of 1,740 human-annotated samples from 120 NLP datasets. Evaluators assign scores from 1 to 5 for each annotated skill, based on the reference (ground-truth) answer and skill-specific scoring rubrics.

Here, we evaluate the completeness of AI responses, i.e. ‘Does the response provide a sufficient explanation?’
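
Below is a minimal sketch of such an eval, reusing the model, tokenizer and device from the quickstart above; the sample and the prompt wording are illustrative placeholders, not the exact cookbook prompt.

# Illustrative only: an absolute-scoring eval on one sample, reusing the quickstart setup.
eval_prompt = (
    "Evaluate the following response for completeness on a 1-5 scale.\n\n"
    "Question: What causes tides on Earth?\n"
    "Response: Tides are caused by the gravitational pull of the Moon and the Sun.\n\n"
    "Does the response provide a sufficient explanation? "
    "Give your reasoning, then a score from 1 to 5."
)
messages = [{"role": "user", "content": eval_prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(device)
output_ids = model.generate(inputs.input_ids, max_new_tokens=512, do_sample=True)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))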

RAG hallucination

This example gets you started detecting hallucinations, running over a sample set from the public RAGTruth benchmark, a large-scale corpus of naturally generated hallucinations with detailed word-level annotations designed specifically for retrieval-augmented generation (RAG) scenarios.

Here, we check AI responses for hallucinations, i.e. ‘Is the information provided in the response directly supported by the context given in the related passages?’
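
Below is a minimal sketch of such a check, again reusing the quickstart setup; the context, response and prompt wording are illustrative placeholders rather than the cookbook's exact RAGTruth prompt.

# Illustrative only: a RAG hallucination check on one sample, reusing the quickstart setup.
context = "The Eiffel Tower was completed in 1889 and stands 330 metres tall."
answer = "The Eiffel Tower, completed in 1889, is 400 metres tall."
eval_prompt = (
    "Is the information provided in the response directly supported by the context "
    "given in the related passages?\n\n"
    f"Context: {context}\n"
    f"Response: {answer}\n\n"
    "Answer Yes or No, and explain which parts, if any, are unsupported."
)
messages = [{"role": "user", "content": eval_prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(device)
output_ids = model.generate(inputs.input_ids, max_new_tokens=512, do_sample=True)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))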