Evaluation Guide • 13 min read
LLM Evaluation Metrics: How to Measure Output Quality
LLM evaluation metrics fall into five categories: lexical (BLEU, ROUGE — word overlap), semantic (BERTScore — meaning similarity), LLM-as-judge (rubric scoring for subjective quality), task-specific (accuracy, F1, pass@k), and safety (toxicity, PII detection). Use 3-5 metrics per evaluation, build a test suite of 50-100 cases, and integrate into your CI/CD pipeline.
Metric Categories
Decision Guide: Which Metric When?
- Summarisation → ROUGE-L + BERTScore + LLM-as-judge
- Classification → Accuracy + F1 + Confusion matrix
- Code generation → Pass@k + Exact Match + BERTScore
- Conversational AI → LLM-as-judge (rubric) + Toxicity + User satisfaction
- Data extraction → Exact Match + F1 + Schema compliance
- Content generation → BERTScore + LLM-as-judge + ROUGE-L
- Any production use → Add safety metrics (toxicity + PII) to the above
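As a rough illustration of the PII half of that safety check, the sketch below flags outputs containing an email address or a simple phone number. The two regular expressions and the names `PII_PATTERNS`, `find_pii`, and `pii_leak_rate` are assumptions made for this example. A real deployment would need much broader coverage (names, addresses, national ID formats) and probably a dedicated detection tool.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> dict:
    """Return the matches found in the text, keyed by PII category."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


def pii_leak_rate(outputs: list) -> float:
    """Fraction of outputs that contain at least one PII match."""
    if not outputs:
        return 0.0
    flagged = sum(1 for out in outputs if find_pii(out))
    return flagged / len(outputs)
```

The leak rate can then be tracked alongside the task metrics, with the threshold for a production system typically set at or near zero.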
Building Your Evaluation Suite
Test cases:
- 60% normal inputs
- 25% edge cases
- 15% adversarial inputs
- Gold-standard expected outputs
Metrics:
- 1 lexical or semantic
- 1 task-specific
- 1 LLM-as-judge
- 1 safety metric
Automation:
- Run in CI/CD
- Gate on thresholds
- Track trends over time
- Alert on regression
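The gating and regression steps could be sketched as below. This is a minimal example, not a prescribed implementation: the `gate` function, the `GateResult` type, the metric names, and the `max_drop` tolerance of 0.02 are all assumptions for illustration. It also assumes every metric is "higher is better", so a toxicity rate would need inverting (for example, `1 - toxicity_rate`) before being passed in.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    failures: list = field(default_factory=list)


def gate(scores, thresholds, baseline=None, max_drop=0.02):
    """Check scores against minimum thresholds and a previous baseline.

    scores, thresholds, baseline: dicts mapping metric name to a float
    where higher is better. max_drop is the largest tolerated decrease
    relative to the baseline before it counts as a regression.
    """
    failures = []
    for metric, minimum in thresholds.items():
        if metric not in scores:
            failures.append(f"{metric}: no score reported")
        elif scores[metric] < minimum:
            failures.append(
                f"{metric}: {scores[metric]:.3f} is below threshold {minimum:.3f}"
            )
    for metric, previous in (baseline or {}).items():
        current = scores.get(metric)
        if current is not None and previous - current > max_drop:
            failures.append(
                f"{metric}: regressed from {previous:.3f} to {current:.3f}"
            )
    return GateResult(passed=not failures, failures=failures)
```

In a CI job, the script would print `failures` and exit with a non-zero status when `passed` is false, which blocks the merge. Storing each run's scores as the next run's baseline gives the trend data.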
📌 Key Takeaways
- No single metric captures quality — use 3-5 metrics across different categories.
- BERTScore + ROUGE-L is the strongest general-purpose combination for text generation.
- LLM-as-judge fills the gap for subjective quality — but calibrate against human judgement first.
- See LLM Output Quality for what to measure, Structured Output Prompting for format reliability, and Prompt Testing & Evaluation for testing workflows.
Frequently Asked Questions
What are the best LLM evaluation metrics?
The best metrics depend on your task: for text generation use ROUGE-L and BERTScore, for classification use accuracy and F1, for code generation use pass@k, for conversational AI use LLM-as-judge with rubric scoring, and for safety use toxicity classifiers and PII detection. No single metric is sufficient — use 3-5 metrics per evaluation to capture different quality dimensions.
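Two of the task-specific metrics named above are simple enough to compute by hand. The sketch below shows F1 from raw counts and pass@k using the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass the tests. The function names are illustrative and not taken from any particular library.

```python
from math import comb


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes.

    n: samples generated for the problem, c: samples that passed, k: budget.
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Too few failing samples to fill a draw of size k.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a benchmark, pass@k is averaged over all problems. Generating more samples than k (n > k) reduces the variance of the estimate.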
What is the difference between BLEU, ROUGE, and BERTScore?
BLEU measures n-gram precision (how much of the output matches the reference), ROUGE measures n-gram recall (how much of the reference appears in the output), and BERTScore uses contextual embeddings to measure semantic similarity regardless of exact wording. ROUGE-L, the variant recommended here, scores the longest common subsequence rather than fixed-length n-grams. BLEU applies a brevity penalty to outputs shorter than the reference, ROUGE penalises missing information, and BERTScore gives credit for paraphrases. For most LLM evaluation, BERTScore + ROUGE-L is the strongest combination.
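To make the precision/recall distinction concrete, here is a simplified sketch of the two lexical ideas using whitespace tokenisation. It computes clipped n-gram precision (the core of BLEU, without the brevity penalty or multi-order averaging), n-gram recall (the core of ROUGE-N), and an F1-style ROUGE-L from the longest common subsequence. The function names are illustrative, and scores from established packages will differ because of tokenisation and stemming. BERTScore is left out because it requires an embedding model.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall) of clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the reference
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

For the candidate "the cat sat" against the reference "the cat sat on the mat", unigram precision is 1.0 while recall is 0.5: everything the output says is in the reference, but half the reference is missing.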
When should I use LLM-as-judge?
Use LLM-as-judge when evaluating subjective qualities that automated metrics miss: helpfulness, coherence, conversational fluency, tone appropriateness, and creative quality. Provide a detailed rubric (1-5 scale with definitions for each score) and use a stronger model as the judge (e.g., GPT-4 or Claude 3.5 judging outputs from smaller models). Always calibrate against human judgements on a subset before trusting at scale.
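The rubric and calibration steps might look like the sketch below. The rubric wording, the `Score: <number>` reply format, and the function names are assumptions for illustration, and the call to the judge model itself is deliberately omitted because it depends on the provider's client library.

```python
import re
from typing import Optional

# Hypothetical rubric; adapt the definitions to the quality being judged.
RUBRIC = {
    1: "Unhelpful or incorrect",
    2: "Partly relevant, major gaps",
    3: "Adequate, some gaps",
    4: "Helpful and mostly complete",
    5: "Fully helpful, accurate, and clear",
}


def build_judge_prompt(question: str, answer: str, rubric: dict = RUBRIC) -> str:
    """Assemble the prompt sent to the judge model."""
    lines = [f"{score}: {text}" for score, text in sorted(rubric.items())]
    return (
        "Rate the answer on a 1-5 scale using this rubric.\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer: {answer}\n"
        + "Reply in the form 'Score: <number>'."
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score from the judge's reply, or None if absent."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def calibration(judge_scores, human_scores) -> dict:
    """Compare judge scores with human scores on the same examples."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_one": sum(d <= 1 for d in diffs) / n,
        "mean_abs_error": sum(diffs) / n,
    }
```

If agreement on the calibration subset is low, the usual remedy is to tighten the rubric definitions and re-run before scoring the full suite.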
How many evaluation examples do I need?
For a production evaluation suite: minimum 50 test cases for a basic confidence level, 100+ for robust evaluation, 200+ for regulated or high-stakes applications. Distribute cases across: normal inputs (60%), edge cases (25%), and adversarial inputs (15%). Quality of test cases matters more than quantity — 50 well-designed cases beat 500 random ones. Update your test suite monthly to capture new failure modes.
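The 60/25/15 split can be turned into concrete case counts for any suite size. The sketch below uses integer percentages and a largest-remainder rule so the counts always sum to the total. The `allocate_cases` name and the category labels are illustrative assumptions.

```python
# Percentages from the distribution described above.
DEFAULT_MIX = {"normal": 60, "edge": 25, "adversarial": 15}


def allocate_cases(total: int, mix: dict = DEFAULT_MIX) -> dict:
    """Split `total` test cases across categories by percentage."""
    if sum(mix.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {name: total * pct // 100 for name, pct in mix.items()}
    remainders = {name: total * pct % 100 for name, pct in mix.items()}
    leftover = total - sum(counts.values())
    # Hand out any leftover cases to the categories with the largest remainders.
    for name in sorted(mix, key=lambda n: remainders[n], reverse=True)[:leftover]:
        counts[name] += 1
    return counts
```

A 100-case suite comes out as 60 normal, 25 edge, and 15 adversarial cases. A 200-case suite doubles each figure.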
LLM Evaluation: The Evidence
Every claim below is sourced from peer-reviewed research and industry reports.
JSON Schema enforcement eliminates parse errors.
OpenAI structured outputs with JSON Schema achieve 99.9% schema adherence vs <70% with unconstrained generation, which cuts the parse-failure rate from over 30% to 0.1% (roughly a 300x reduction).
Without schema enforcement, every 1M requests generate 300K+ malformed responses, which force retries and extra error handling and risk downstream data corruption.
OpenAI, 'Structured Outputs: JSON Schema' documentation, 2024
Fallback model chains prevent downstream failures.
A Claude Opus → GPT-4o → Gemini 1.5 Pro fallback chain achieves 99.995% uptime for critical inference paths, with <500ms failover latency.
Without provider fallback, one API outage takes down the entire product. Teams often discover this only when the pager wakes them at 3am.
Portkey AI, 'AI Gateway: Fallback' documentation, 2024
Chain-of-thought prompting improves complex reasoning accuracy.
Adding 'Let's think step by step' improves zero-shot accuracy on the MultiArith arithmetic benchmark from 17.7% to 78.7%, a 4.4x improvement, and on GSM8K from 10.4% to 40.7%.
Without chain-of-thought, models attempt to produce answers in a single leap, failing on problems requiring intermediate steps.
Kojima et al., 'Large Language Models are Zero-Shot Reasoners', 2022; chain-of-thought prompting was introduced in Wei et al., 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models', Google Research, 2022