How to Test AI Prompts: A/B Testing and Evaluation Frameworks
You wouldn't deploy code without tests. Why are you deploying prompts without them? Prompt testing is the missing discipline in most AI engineering teams. This guide covers how to build evaluation frameworks that catch regressions, measure improvements, and give you confidence that your prompts work.
Why Prompt Testing Is Hard
Traditional software testing has a clear oracle: given input X, the expected output is Y. Prompt testing doesn't have this luxury. LLM outputs are nondeterministic — the same prompt can produce different outputs on different runs. This means you can't write simple assertEquals tests. You need evaluation criteria rather than exact expectations.
The Three Levels of Prompt Testing
Level 1: Format Compliance
The lowest bar — does the output match the expected structure? These tests are deterministic and automatable:
- Schema validation: If you expect JSON, parse it. If it doesn't parse, the prompt failed.
- Field presence: Every required field in your schema must be present.
- Type checking: If `confidence` should be "HIGH", "MEDIUM", or "LOW", validate that the value is one of those strings.
- Length constraints: If your summary should be under 200 words, count words.
```typescript
// Automated format compliance test
import assert from 'node:assert';

function testFormatCompliance(output: string): boolean {
  try {
    const parsed = JSON.parse(output);
    assert(typeof parsed.answer === 'string');
    assert(['HIGH', 'MEDIUM', 'LOW'].includes(parsed.confidence));
    assert(Array.isArray(parsed.sources));
    return true;
  } catch {
    return false;
  }
}
```
Run this against 100+ outputs from your prompt. If format compliance is below 95%, your prompt needs work.
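The batch check is easy to script; a minimal sketch that reuses a compliance test like the one above:

```typescript
// Measure format compliance across a batch of model outputs.
// `test` is any per-output compliance check, e.g. testFormatCompliance above.
function complianceRate(outputs: string[], test: (o: string) => boolean): number {
  const passed = outputs.filter(test).length;
  return passed / outputs.length;
}

// Usage: gate on the 95% threshold
// const rate = complianceRate(sampledOutputs, testFormatCompliance);
// if (rate < 0.95) throw new Error(`Format compliance ${rate} is below threshold`);
```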
Level 2: Content Quality (Automated)
Use a judge LLM to evaluate output quality. This sounds circular, but it works when the evaluation criteria are well-defined:
- Relevance: Does the output address the question asked?
- Accuracy: Are factual claims correct? (Requires ground-truth data)
- Completeness: Are all aspects of the question addressed?
- Tone: Does the output match the specified persona/tone?
Build an evaluation prompt that scores outputs on each criterion using a 1-5 scale. Run it against a fixed set of 20-50 test cases and track scores over time.
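One way to wire this up, as a sketch: construct the judge prompt and strictly validate the judge's JSON reply. The actual model call is omitted here since it is whatever LLM client you already use; the criterion names and JSON shape are illustrative choices, not a fixed standard.

```typescript
// Scores returned by the judge LLM, one 1-5 integer per criterion.
interface JudgeScores {
  relevance: number;
  accuracy: number;
  completeness: number;
  tone: number;
}

// Build the evaluation prompt sent to the judge model.
function buildJudgePrompt(question: string, output: string): string {
  return [
    'Score the RESPONSE to the QUESTION on each criterion from 1 (poor) to 5 (excellent).',
    'Reply with JSON only: {"relevance": n, "accuracy": n, "completeness": n, "tone": n}',
    `QUESTION: ${question}`,
    `RESPONSE: ${output}`,
  ].join('\n');
}

// Parse and validate the judge's reply; reject anything outside the 1-5 scale.
function parseJudgeReply(reply: string): JudgeScores {
  const scores = JSON.parse(reply) as JudgeScores;
  for (const key of ['relevance', 'accuracy', 'completeness', 'tone'] as const) {
    const v = scores[key];
    if (!Number.isInteger(v) || v < 1 || v > 5) {
      throw new Error(`Invalid ${key} score: ${v}`);
    }
  }
  return scores;
}
```

Validating the reply matters: judge models occasionally return prose or out-of-range scores, and silently accepting those corrupts your trend data.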
Level 3: Human Evaluation
For high-stakes applications, there's no substitute for human review. But make it structured:
- Build a rating interface (even a simple spreadsheet works)
- Define specific criteria with examples of each score level
- Use multiple raters and measure inter-rater agreement
- Sample strategically — don't review random outputs, review edge cases and failures
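For two raters scoring the same items, inter-rater agreement is commonly measured with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch:

```typescript
// Cohen's kappa for two raters over the same items. Labels can be any
// strings (e.g. score levels "1".."5"). Returns 1 for perfect agreement,
// 0 for agreement no better than chance.
function cohensKappa(raterA: string[], raterB: string[]): number {
  const n = raterA.length;
  const labels = [...new Set([...raterA, ...raterB])];
  let observed = 0;
  const countA = new Map<string, number>();
  const countB = new Map<string, number>();
  for (let i = 0; i < n; i++) {
    if (raterA[i] === raterB[i]) observed++;
    countA.set(raterA[i], (countA.get(raterA[i]) ?? 0) + 1);
    countB.set(raterB[i], (countB.get(raterB[i]) ?? 0) + 1);
  }
  const po = observed / n; // observed agreement
  let pe = 0; // expected agreement by chance
  for (const label of labels) {
    pe += ((countA.get(label) ?? 0) / n) * ((countB.get(label) ?? 0) / n);
  }
  return (po - pe) / (1 - pe);
}
```

A common rule of thumb is that kappa below roughly 0.6 means your criteria descriptions are too vague for raters to apply consistently.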
A/B Testing Prompts
The gold standard for prompt improvement is controlled A/B testing. Here's the methodology:
Step 1: Define Your Metric
Choose one primary metric. For most applications, this is one of:
- Task completion rate: Did the output achieve the user's goal?
- User satisfaction: Thumbs up/down on the response
- Downstream action: Did the user accept the suggestion, click the link, complete the flow?
Step 2: Set Up the Experiment
- Route 50% of traffic to Prompt A (control) and 50% to Prompt B (variant)
- Ensure consistent routing — the same user should see the same variant throughout their session
- Run for a minimum of 1,000 requests per variant; smaller samples rarely give the modest effect sizes typical of prompt changes a chance of reaching statistical significance
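Consistent routing is usually done by hashing the user ID rather than randomising per request, so assignment is stable across servers and sessions. A sketch using an FNV-1a string hash (any stable hash works):

```typescript
// Deterministic variant assignment: the same user always lands in the same
// bucket for a given experiment, regardless of which server handles the
// request. Uses a 32-bit FNV-1a hash of "experiment:userId".
function assignVariant(userId: string, experiment: string): 'A' | 'B' {
  const key = `${experiment}:${userId}`;
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return hash % 2 === 0 ? 'A' : 'B';
}
```

Including the experiment name in the hash key prevents users from landing in the same bucket across every experiment you run.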
Step 3: Analyse with Caution
LLM outputs have high variance. A 2% difference in satisfaction scores is likely noise. Look for differences of 5%+ and validate with confidence intervals. If your metric is binary (pass/fail), use a chi-squared test. If continuous, use a t-test with Welch's correction for unequal variances.
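For a binary metric, the chi-squared statistic over the 2×2 pass/fail table can be computed directly; compare the result against 3.84, the critical value for one degree of freedom at p = 0.05. A sketch:

```typescript
// Chi-squared statistic for a binary-metric A/B test (2x2 contingency
// table). A statistic above ~3.84 indicates significance at p = 0.05.
function chiSquared2x2(passA: number, failA: number, passB: number, failB: number): number {
  const n = passA + failA + passB + failB;
  const rowA = passA + failA;
  const rowB = passB + failB;
  const colPass = passA + passB;
  const colFail = failA + failB;
  // Each cell: [observed count, its row total, its column total]
  const cells: Array<[number, number, number]> = [
    [passA, rowA, colPass],
    [failA, rowA, colFail],
    [passB, rowB, colPass],
    [failB, rowB, colFail],
  ];
  let chi2 = 0;
  for (const [obs, row, col] of cells) {
    const expected = (row * col) / n; // count expected if variants were identical
    chi2 += (obs - expected) ** 2 / expected;
  }
  return chi2;
}
```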
Building a Regression Suite
Every time you fix a prompt bug, add a test case. Over time, you build a comprehensive regression suite that catches future regressions. Structure it as:
```typescript
// Test case format
interface PromptTestCase {
  id: string;
  input: string;
  expectedBehaviour: string; // Natural language description
  format: 'json' | 'markdown' | 'text';
  requiredFields?: string[];
  bannedPhrases?: string[]; // Things the output should NEVER contain
  addedAfterBug?: string; // Reference to the bug that prompted this test
}
```
Run your regression suite in CI. Flag any test case where format compliance drops below 90% across 10 runs (to account for nondeterminism).
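A sketch of that CI check, assuming a test case shape like the interface above and a hypothetical `runPrompt` function that executes the prompt under test and returns the raw model output:

```typescript
interface CaseResult {
  id: string;
  complianceRate: number;
  flagged: boolean;
}

// Does one output satisfy a test case's structural constraints?
function compliant(
  output: string,
  testCase: { requiredFields?: string[]; bannedPhrases?: string[] },
): boolean {
  if (testCase.bannedPhrases?.some((p) => output.includes(p))) return false;
  if (testCase.requiredFields) {
    try {
      const parsed = JSON.parse(output);
      return testCase.requiredFields.every((f) => f in parsed);
    } catch {
      return false;
    }
  }
  return true;
}

// Run one test case `runs` times and flag it if compliance drops below the
// threshold. `runPrompt` is hypothetical: plug in your own execution layer.
async function checkCase(
  testCase: { id: string; input: string; requiredFields?: string[]; bannedPhrases?: string[] },
  runPrompt: (input: string) => Promise<string>,
  runs = 10,
  threshold = 0.9,
): Promise<CaseResult> {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    const output = await runPrompt(testCase.input);
    if (compliant(output, testCase)) passed++;
  }
  const complianceRate = passed / runs;
  return { id: testCase.id, complianceRate, flagged: complianceRate < threshold };
}
```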
Continuous Improvement Loops
The final piece is closing the feedback loop:
- Monitor: Track format compliance, judge LLM scores, and user feedback in production.
- Triage: When scores drop, identify the failing test cases and input patterns.
- Iterate: Modify the prompt to address the failure mode.
- Test: Run the modified prompt against your regression suite + a targeted A/B test.
- Deploy: If the new prompt passes regression and wins the A/B test, promote it to production.
How AI Prompt Architect Helps
AI Prompt Architect's Analyse workflow implements Level 2 automated evaluation. It scores your prompts against structured criteria and identifies specific weaknesses. The Refine workflow then generates targeted improvements based on that analysis. This creates a built-in improvement loop: Generate → Analyse → Refine → Analyse again — closing the gap between "good enough" and production-grade.
