Evaluating AI Systems

Evaluation is the hardest part of building AI systems. Without it, you cannot tell whether a change to your prompt, retrieval, or model improved anything.

Why evaluation is hard

Unlike traditional software, AI systems don’t have deterministic outputs. A change that improves one output may degrade another. Evaluation must:

Cover a representative sample of the input distribution
Measure what actually matters to users (not just perplexity or BLEU)
Be fast enough to run on every change

You can’t improve what you don’t measure. Build your eval harness before you start tuning.

Offline vs Online Evaluation

	Offline	Online
When	Before deployment	After deployment
Data source	Curated test set	Real user traffic
Speed	Fast	Slow (requires traffic)
Risk	None	Affects real users
Signal	Leading indicator	Ground truth

Use offline evaluation to iterate quickly. Use online evaluation (A/B tests, shadow mode) to confirm offline gains hold in production.

Evaluation methods

Reference-based

Compare model output to a gold-standard answer. Useful when there is a clear correct answer.

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())

def f1_token_overlap(prediction: str, reference: str) -> float:
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    common = pred_tokens & ref_tokens
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

LLM-as-judge

Use a stronger model to score the output of a weaker model. Scales better than human evaluation and is more nuanced than reference-based metrics.

Judge prompt:
Rate the following answer on a scale of 1–5 for:
- Accuracy (does it correctly answer the question?)
- Completeness (does it cover all relevant points?)
- Conciseness (is it appropriately brief?)

Question: {question}
Reference answer: {reference}
Model answer: {prediction}

Return JSON: {"accuracy": N, "completeness": N, "conciseness": N, "reasoning": "..."}

LLM judges have known biases: preference for longer answers, position bias, self-preference. Mitigate by averaging across multiple judge runs and using calibration examples.

Task-specific metrics

Task	Metrics
RAG / QA	Answer correctness, context recall, context precision
Summarisation	ROUGE, BERTScore, factual consistency
Code generation	Execution accuracy, test pass rate
Classification	Accuracy, F1, confusion matrix
Chatbot	User satisfaction, session length, deflection rate

Building a test set

A good test set is:

Representative — covers the real distribution of user queries, including edge cases
Challenging — easy examples don’t differentiate models
Stable — doesn’t change between runs (use a fixed seed for sampling)
Labelled — each example has a ground-truth answer or criteria

Start with 50–100 examples. This is enough to detect regressions on most tasks. Scale up when you need finer discrimination between similar models.

Adversarial examples

Include cases that are known to fail or are particularly hard:

Queries outside the knowledge base (expected: “I don’t know”)
Ambiguous questions with multiple valid answers
Questions requiring multi-hop reasoning
Prompt injection attempts