
LLM Evaluation Metrics: How to Measure Output Quality

Q: What are the best LLM evaluation metrics?

The best metrics depend on your task: for text generation use ROUGE-L and BERTScore, for classification use accuracy and F1, for code generation use pass@k, for conversational AI use LLM-as-judge with rubric scoring, and for safety use toxicity classifiers and PII detection. No single metric is sufficient — use 3-5 metrics per evaluation to capture different quality dimensions.

Q: What is the difference between BLEU, ROUGE, and BERTScore?

BLEU measures n-gram precision (how much of the output matches the reference). ROUGE is recall-oriented (how much of the reference appears in the output): ROUGE-N counts overlapping n-grams, while ROUGE-L uses the longest common subsequence. BERTScore uses contextual embeddings to measure semantic similarity regardless of exact wording. BLEU's brevity penalty punishes outputs that are too short, ROUGE penalises missing information, and BERTScore gives credit for paraphrases. For most LLM evaluation, BERTScore + ROUGE-L is the strongest combination.

Q: When should I use LLM-as-judge?

Use LLM-as-judge when evaluating subjective qualities that reference-based metrics miss: helpfulness, coherence, conversational fluency, tone appropriateness, and creative quality. Provide a detailed rubric (a 1-5 scale with a definition for each score) and use a stronger model as the judge (e.g., GPT-4 or Claude 3.5 judging outputs from smaller models). Always calibrate the judge against human judgements on a subset before trusting it at scale.

Q: How many evaluation examples do I need?

For a production evaluation suite, use at least 50 test cases for basic confidence, 100+ for robust evaluation, and 200+ for regulated or high-stakes applications. Distribute the cases across normal inputs (60%), edge cases (25%), and adversarial inputs (15%). Quality of test cases matters more than quantity: 50 well-designed cases beat 500 random ones. Update your test suite monthly to capture new failure modes.

Quick Answer

LLM evaluation metrics fall into five categories: lexical (BLEU, ROUGE — word overlap), semantic (BERTScore — meaning similarity), LLM-as-judge (rubric scoring for subjective quality), task-specific (accuracy, F1, pass@k), and safety (toxicity, PII detection). Use 3-5 metrics per evaluation, build a test suite of 50-100 cases, and integrate into your CI/CD pipeline.


Metric Categories

📝

Lexical Metrics

Measure surface-level text overlap between generated output and reference answers. Fast, deterministic, and easy to automate — but miss paraphrasing and semantic equivalence.

BLEU

0-1 (0.3+ good for generation)

N-gram precision — how much of the generated text matches the reference.

✓ Fast, well-understood, good for translation.

△ Penalises valid paraphrases, no semantic understanding.
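To make "n-gram precision" concrete, here is a minimal sketch of a BLEU-style score for a single candidate and a single reference. It is a simplified illustration, not a replacement for a tested BLEU implementation: it assumes pre-tokenised input, uses only unigrams and bigrams by default, and applies no smoothing. The function names are made up for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def brevity_penalty(candidate, reference):
    """Penalise candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)
```

A valid paraphrase that shares no words with the reference scores 0.0 here, which is exactly the weakness noted above.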

ROUGE-L

0-1 (0.4+ good for summaries)

Longest-common-subsequence overlap, reported as recall (how much of the reference appears in the output) or as an F-measure that also accounts for precision.

✓ Captures sentence-level structure, good for summarisation.

△ Still lexical — two semantically identical sentences may score low.
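The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines. This is an illustrative version that assumes whitespace tokenisation and a single reference; the function names are hypothetical, and real evaluations would normally use an established ROUGE package.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """LCS-based precision, recall and F1 for one candidate/reference pair."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For the candidate "the cat sat" against the reference "the cat sat on the mat", precision is 1.0 but recall is only 0.5, because half of the reference is missing from the output.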

METEOR

0-1 (0.35+ good)

Combines precision, recall, stemming, and synonym matching.

✓ Better than BLEU for paraphrases, handles synonyms.

△ Slower than BLEU/ROUGE, language-dependent.

🧠

Semantic Metrics

Use neural embeddings to measure meaning similarity rather than word overlap. Capture paraphrasing, synonym usage, and semantic equivalence that lexical metrics miss.

BERTScore

F1: 0-1 (0.85+ good)

Contextual embedding similarity using BERT — measures semantic overlap at the token level.

✓ Captures paraphrasing, correlates better with human judgement than n-gram metrics.

△ Model-dependent, slower than lexical metrics.

Embedding Cosine

-1 to 1, typically 0-1 in practice (0.8+ good)

Cosine similarity between sentence-level embedding vectors.

✓ Very fast, good for ranking and retrieval evaluation.

△ Loses fine-grained detail at sentence level.
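As a rough illustration, cosine similarity itself is a short calculation once the two texts have been embedded. The sketch below assumes the vectors come from some sentence-embedding model of your choice, which is not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)
```

The result depends only on direction, not magnitude, so vectors pointing the same way score 1.0 regardless of their length.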

⚖️

LLM-as-Judge

Use a strong LLM to evaluate outputs against a rubric. Best for subjective qualities (helpfulness, coherence, tone, creative quality) that reference-based metrics cannot capture.

Rubric Scoring

1-5 scale (4.0+ good)

Score outputs 1-5 against defined criteria (e.g., "5 = fully addresses all aspects, well-structured, actionable").

✓ Captures nuance, customisable per use case, scales well.

△ Requires calibration against human judgement, judge bias.
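One possible shape for rubric scoring is sketched below: build a prompt from the rubric, parse the judge's reply, and check agreement with human scores on a calibration subset. Only the level-5 definition comes from this guide. The other rubric levels, the `SCORE:` reply convention, and all function names are assumptions made for illustration, and the call to the judge model itself is deliberately left out.

```python
import re

# Only the level-5 wording comes from the guide; levels 1-4 are illustrative.
RUBRIC = {
    5: "Fully addresses all aspects, well-structured, actionable.",
    4: "Addresses most aspects with minor gaps.",
    3: "Partially addresses the request; notable gaps.",
    2: "Mostly off-target or poorly structured.",
    1: "Does not address the request.",
}

def build_judge_prompt(question, answer, rubric=RUBRIC):
    """Assemble the text sent to the judge model."""
    levels = "\n".join(
        f"{score} = {text}" for score, text in sorted(rubric.items(), reverse=True)
    )
    return (
        "Rate the answer against the rubric.\n"
        f"Rubric:\n{levels}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        "Explain briefly, then finish with 'SCORE: <1-5>'."
    )

def parse_score(judge_reply):
    """Extract the 1-5 score, or None if the reply is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None

def agreement_rate(judge_scores, human_scores, tolerance=1):
    """Share of items where judge and human differ by at most `tolerance`."""
    pairs = list(zip(judge_scores, human_scores))
    if not pairs:
        return 0.0
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)
```

Treating unparseable replies as `None` rather than guessing a score keeps malformed judge output visible in the results.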

Pairwise Comparison

Win rate (>60% = meaningful)

The judge compares two outputs and selects the better one.

✓ Easier than absolute scoring, avoids inconsistent use of a rating scale.

△ Doesn't give an absolute quality measure, prone to position bias (favouring whichever output is shown first).
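A common way to reduce position bias is to ask the judge twice with the order swapped and only count a win when both orderings agree. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns the string "first" or "second"; in practice this would wrap a call to your judge model.

```python
def pairwise_verdict(judge, prompt, out_a, out_b):
    """Return 'A', 'B', or 'tie' after judging both presentation orders."""
    forward = judge(prompt, out_a, out_b)   # A shown first
    backward = judge(prompt, out_b, out_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # the judge contradicted itself, so treat as undecided

def win_rate(verdicts, side="A"):
    """Win rate for one side among decided comparisons."""
    decided = [v for v in verdicts if v != "tie"]
    if not decided:
        return 0.0
    return sum(v == side for v in decided) / len(decided)
```

A judge that always picks whichever output it sees first produces only ties under this scheme, so its bias cannot inflate either side's win rate.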

🎯

Task-Specific Metrics

Purpose-built metrics for specific task types. These are the most directly meaningful metrics because they measure exactly what matters for your use case.

Accuracy / F1

F1: 0-1 (0.9+ good)

Classification correctness (accuracy) and harmonic mean of precision/recall (F1).

✓ Direct measure of task success, universally understood.

△ Only for classification/extraction tasks.
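For reference, accuracy, precision, recall, and F1 for a binary task can be computed directly from the label lists. This is a minimal sketch with a hypothetical function name; multi-class averaging is out of scope here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

On imbalanced data, accuracy can look high while F1 stays low, which is why the two are usually reported together.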

Pass@k

Pass@1: 0-1 (0.7+ good)

Probability that at least one of k generated code samples passes all test cases.

✓ Direct measure of code generation quality.

△ Requires executable test suites, expensive to run.
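In practice pass@k is usually estimated by generating n samples per problem (with n at least k), counting the c samples that pass, and applying the unbiased estimator 1 - C(n-c, k) / C(n, k). A minimal sketch for a single problem, with a hypothetical function name:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value across all problems.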

Exact Match

0-100% (context-dependent)

Binary — did the output exactly match the expected answer?

✓ Unambiguous, perfect for structured outputs.

△ Too strict for free-text, misses valid alternatives.
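Exact match is often paired with light normalisation so that trivial differences in case, punctuation, or spacing do not count as failures. The sketch below shows one possible normalisation, chosen here for illustration; for structured outputs such as JSON, a strict comparison (or comparing parsed values) is usually more appropriate.

```python
import string

def normalise(text):
    """Lower-case, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match_rate(predictions, references, normalised=True):
    """Fraction of predictions that equal their reference."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and the same length")
    prep = normalise if normalised else (lambda s: s)
    hits = sum(prep(p) == prep(r) for p, r in zip(predictions, references))
    return hits / len(references)
```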

🛡️

Safety Metrics

Detect harmful, biased, toxic, or leaked content in model outputs. Non-negotiable for production deployment — a brilliant output that contains PII or toxicity is a failure.

Toxicity Score

< 0.05 (production)

Probability of toxic content (Perspective API, Detoxify).

✓ Well-calibrated, fast, catches most harmful content.

△ May miss subtle bias, context-dependent toxicity.

PII Detection

0 PII in output (hard req)

Identifies personally identifiable information in outputs (regex + NER).

✓ Critical for compliance (GDPR, HIPAA), automatable.

△ False positives with NER, context matters.
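The regex half of a PII check might look like the sketch below. The three patterns (email, US-style phone number, US Social Security number) are simplified examples picked for illustration and will miss many real formats. A production check would add an NER model and locale-specific patterns.

```python
import re

# Simplified, illustrative patterns; not an exhaustive PII definition.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text):
    """Return {pattern_name: [matches]} for every pattern that matched."""
    found = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found

def contains_pii(text):
    """True if any pattern matched; suitable for a hard pass/fail gate."""
    return bool(find_pii(text))
```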

Decision Guide: Which Metric When?

Summarisation → ROUGE-L + BERTScore + LLM-as-judge
Classification → Accuracy + F1 + Confusion matrix
Code generation → Pass@k + Exact Match + BERTScore
Conversational AI → LLM-as-judge (rubric) + Toxicity + User satisfaction
Data extraction → Exact Match + F1 + Schema compliance
Content generation → BERTScore + LLM-as-judge + ROUGE-L
Any production use → Add safety metrics (toxicity + PII) to the above

Building Your Evaluation Suite

Test Cases

50-100+

60% normal inputs
25% edge cases
15% adversarial inputs
Gold-standard expected outputs
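The 60/25/15 split can be turned into concrete case counts for a suite of any size. The helper below is a small sketch with hypothetical names; it uses integer percentages to avoid floating-point rounding surprises and hands any leftover cases to the categories with the largest remainders.

```python
# Percentages taken from the split described above.
SPLIT = {"normal": 60, "edge": 25, "adversarial": 15}

def allocate_cases(total, split=SPLIT):
    """Split `total` test cases across categories by percentage."""
    if sum(split.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {name: total * pct // 100 for name, pct in split.items()}
    leftover = total - sum(counts.values())
    # Give remaining cases to the categories with the largest remainders.
    by_remainder = sorted(split, key=lambda name: total * split[name] % 100, reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

For a 100-case suite this gives 60 normal, 25 edge, and 15 adversarial cases.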

Metrics

3-5 per eval

1 lexical or semantic
1 task-specific
1 LLM-as-judge
1 safety metric

Pipeline

Every deploy

Run in CI/CD
Gate on thresholds
Track trends over time
Alert on regression
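Gating on thresholds can be as simple as comparing each aggregated metric against a limit and failing the build if any check fails. In the sketch below the threshold values are the "good" ranges quoted in this guide, while the metric key names and the shape of the `results` dictionary are assumptions made for illustration.

```python
# ("min", x): value must be at least x. ("max", x): value must be below x.
THRESHOLDS = {
    "rouge_l": ("min", 0.4),
    "bertscore_f1": ("min", 0.85),
    "judge_score": ("min", 4.0),
    "toxicity": ("max", 0.05),
}

def check_thresholds(results, thresholds=THRESHOLDS):
    """Return a list of human-readable failures; empty means the gate passes."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = results.get(name)
        if value is None:
            failures.append(f"{name}: missing from results")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below minimum {limit}")
        elif kind == "max" and value >= limit:
            failures.append(f"{name}: {value} is not below {limit}")
    return failures
```

A CI job would typically print the failures and exit with a non-zero status when the list is not empty. Treating a missing metric as a failure prevents a broken evaluation step from silently passing the gate.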

📌 Key Takeaways

No single metric captures quality — use 3-5 metrics across different categories.
BERTScore + ROUGE-L is the strongest general-purpose combination for text generation.
LLM-as-judge fills the gap for subjective quality — but calibrate against human judgement first.
See LLM Output Quality for what to measure, Structured Output Prompting for format reliability, and Prompt Testing & Evaluation for testing workflows.



LLM Evaluation: The Evidence

Every claim below is sourced from peer-reviewed research and industry reports.

JSON Schema enforcement eliminates parse errors.

OpenAI structured outputs with JSON Schema achieve 99.9% schema adherence vs <70% with unconstrained generation, which is roughly a 300x reduction in parse failures (from more than 30% of responses to 0.1%).

Without schema enforcement, every 1M requests generate 300K+ malformed responses requiring retries, error handling, and downstream data corruption.

OpenAI, 'Structured Outputs: JSON Schema' documentation, 2024

Fallback model chains prevent downstream failures.

Claude Opus → GPT-4o → Gemini 1.5 Pro fallback chain achieves 99.995% uptime for critical inference paths, with <500ms failover latency.

Without provider fallback, one API outage takes down the entire product. Teams only discover this when pager duty wakes them at 3am.

Portkey AI, 'AI Gateway: Fallback' documentation, 2024

Chain-of-thought prompting improves complex reasoning accuracy.

Adding 'Let's think step by step' improves zero-shot accuracy on the MultiArith arithmetic benchmark from 17.7% to 78.7%, a 4.4x improvement on multi-step reasoning tasks.

Without chain-of-thought, models attempt to produce answers in a single leap, failing on problems requiring intermediate steps.

Kojima et al., 'Large Language Models are Zero-Shot Reasoners', NeurIPS 2022