Skip to Main Content

Cost Engineering • 13 min read

Prompt Optimization: How to Cut Costs 60% Without Losing Quality

Quick Answer

Prompt optimization reduces AI costs by 60-80% without quality loss through three levers: token compression (shorter prompts, fewer examples), prompt caching (50-90% discount on repeated prefixes), and tiered model routing (cheap models for simple tasks, frontier models for reasoning). Always A/B test optimized prompts against your baseline to prevent quality regression.

60-80%
Cost reduction achievable with systematic optimization
90%
Discount on cached prompt tokens (Anthropic)
45%
Savings from tiered model routing alone

The 3 Dimensions of Prompt Optimization

Every prompt optimization targets one or more of these dimensions. The best optimizations improve all three simultaneously:

📝 Token Efficiency

Reduce input and output token count. Fewer tokens = lower cost + faster responses. The most impactful optimization: a 50% token reduction directly halves your API spend.

⚡ Latency Reduction

Faster time-to-first-token and total response time. Critical for user-facing applications where every 100ms affects engagement. Achieve through caching, smaller models, and parallel execution.

💰 Cost Control

Lower the dollar cost per API call. Combines token efficiency with model routing — use the cheapest model that meets quality requirements for each task type.

7 Token Optimization Techniques

1

Remove Verbal Padding

Saves 10-30%

LLMs don't need politeness or verbose preambles. "Please analyze the following data carefully and provide a comprehensive summary" → "Summarize this data." Saves 10-30% tokens with zero quality loss.

2

Compress Few-Shot Examples

Saves 40-60%

Use 2-3 examples instead of 10. Format examples as concise input/output pairs, not full conversations. One high-quality example outperforms five mediocre ones.

3

Use Structured Formats

Saves 20-40%

YAML and JSON instructions are more token-efficient than prose. "Given a product, return: name, category, price" in YAML uses 40% fewer tokens than the equivalent paragraph.

4

Reference Instead of Repeat

Saves 15-25%

"Use the schema defined above" instead of repeating the full schema in every section of a long prompt. Especially impactful in multi-section system prompts.

5

Trim Context Windows

Saves 50-80%

Don't stuff the entire document into context — extract only the relevant sections. Use RAG to retrieve the 3-5 most relevant chunks instead of passing 50 pages. Quality often improves because the model focuses better.

6

Constrain Output Length

Saves 30-50%

"Maximum 3 sentences" or "respond in under 50 words." Without length constraints, models default to verbose responses. Output tokens are typically 2-4× more expensive than input tokens.

7

Use System Prompt Efficiently

Saves 20-40%

Move stable instructions to the system prompt (cacheable) and keep the user prompt minimal. The system prompt is processed once; the user prompt changes per request.

Prompt Caching: The Biggest Cost Lever

Prompt caching stores the computed KV-cache of your prompt prefix so repeated requests skip re-processing. This is the single highest-impact optimization for production systems with stable system prompts:

ProviderCache DiscountMin Prefix LengthTTLAuto-enabled?
Anthropic (Claude)90% off1,024 tokens5 min (extendable)Yes
OpenAI (GPT-4o)50% off1,024 tokens5-10 minYes
Google (Gemini)75% off32,768 tokensConfigurableManual

💡 Pro Tip

Structure your prompts so the system prompt + stable instructions come first (cacheable prefix), followed by the variable user input. Keep the prefix stable across requests — even changing one token invalidates the cache.

Tiered Model Routing: Right Model, Right Task

Not every task needs a $15/MTok frontier model. Route tasks to the cheapest model that meets quality requirements. Most production systems find 70% of tasks run perfectly on cheap models:

Tier 1 — Fast & Cheap

$0.15-0.25/MTok

GPT-4o-mini / Haiku

Classification, extraction, formatting, simple Q&A, summarization

Tier 2 — Balanced

$2.50-3.00/MTok

GPT-4o / Sonnet

Analysis, content generation, code review, multi-step reasoning

Tier 3 — Maximum Quality

$15-60/MTok

Claude Opus / o1-pro

Complex reasoning, novel problem-solving, safety-critical decisions

The Optimization Workflow

Follow this systematic process to optimize any production prompt. Always measure before and after:

# prompt-optimization-workflow.py
import tiktoken

def measure_baseline(prompt: str, eval_set: list) -> dict:
    """Step 1: Measure current performance."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    token_count = len(enc.encode(prompt))
    
    results = run_eval(prompt, eval_set)
    return {
        "tokens": token_count,
        "cost_per_call": token_count * 2.50 / 1_000_000,
        "quality_score": results["accuracy"],
        "latency_p50": results["latency_p50"]
    }

def optimize_tokens(prompt: str) -> str:
    """Step 2: Apply token compression."""
    # Remove verbal padding
    prompt = remove_padding(prompt)
    # Compress few-shot examples  
    prompt = compress_examples(prompt, max_examples=3)
    # Convert prose → structured format
    prompt = convert_to_yaml(prompt)
    return prompt

def ab_test(original: str, optimized: str, eval_set: list):
    """Step 3: A/B test — never deploy without this."""
    baseline = measure_baseline(original, eval_set)
    candidate = measure_baseline(optimized, eval_set)
    
    quality_delta = candidate["quality_score"] - baseline["quality_score"]
    cost_delta = (candidate["cost_per_call"] - baseline["cost_per_call"]) 
                 / baseline["cost_per_call"]
    
    if quality_delta >= -0.02:  # Allow max 2% quality drop
        print(f"✅ Deploy: {cost_delta:.0%} cost, {quality_delta:+.1%} quality")
    else:
        print(f"❌ Reject: quality dropped {quality_delta:.1%}")

Optimization Pitfalls to Avoid

✂️

Over-Compression

Removing critical context to save tokens. Always A/B test — if quality drops >2%, the optimization isn't worth it.

🔀

Wrong Model Routing

Sending complex reasoning tasks to cheap models. Start conservative (Tier 2 default), then selectively downgrade tasks that score well on cheap models.

📤

Ignoring Output Tokens

Optimizing input but allowing unbounded output. Output tokens cost 2-4× more. Always set max_tokens and add length constraints.

💥

Cache Invalidation

Changing one token in the prompt prefix invalidates the entire cache. Keep your system prompt frozen — move variable content to the user message.

📌 Key Takeaways

  • Optimize across three dimensions: token count, latency, and cost.
  • Prompt caching gives 50-90% discount — structure prompts with a stable prefix.
  • Route 70% of tasks to cheap models (Tier 1) for 45% cost reduction.
  • Always A/B test optimizations — never deploy without measuring quality impact.
  • Constrain output tokens — they cost 2-4× more than input tokens.
  • Combine with production-ready prompt patterns for the full lifecycle.

Frequently Asked Questions

What is prompt optimization?

Prompt optimization is the systematic process of reducing token count, latency, and API cost while maintaining or improving output quality. It covers three dimensions: token efficiency (shorter prompts that produce equal results), latency reduction (faster responses through caching and model selection), and cost control (routing tasks to the cheapest capable model). A well-optimized prompt can cost 60-80% less than a naive one with identical quality.

How do I reduce prompt tokens without losing quality?

Four techniques: (1) Remove redundant instructions — LLMs don't need "please" or verbose preambles, (2) Compress few-shot examples — use 2-3 examples instead of 10, and use concise formatting, (3) Use references instead of full context — "Use the schema from the previous step" rather than repeating it, (4) Switch to structured formats (YAML, JSON) which are more token-efficient than prose instructions.

What is prompt caching and how much does it save?

Prompt caching stores the computed representation of your prompt prefix so repeated calls don't reprocess the same instructions. Anthropic offers 90% discount on cached tokens, OpenAI offers 50%. If your system prompt is 2,000 tokens and you make 10,000 calls/day, caching saves $150-300/month. Enable by keeping a stable prompt prefix across requests.

Should I use a cheaper model for simple tasks?

Yes — this is called tiered model routing. Route simple tasks (classification, extraction, formatting) to GPT-4o-mini or Claude Haiku ($0.25/MTok) and reserve frontier models ($15/MTok) for complex reasoning. Most production systems find that 70% of tasks can run on cheap models, cutting total cost by 45-60%.

How do I measure prompt optimization success?

Track four metrics: (1) Tokens per request — total input + output tokens, (2) Cost per call — actual API spend per request, (3) Latency P50/P95 — response time at median and tail, (4) Quality score — accuracy/relevance measured via eval suite. Optimization succeeds when cost drops without quality regression.

What is the biggest prompt optimization mistake?

Over-optimization — cutting so many tokens that the model loses critical context and quality drops. Always A/B test optimized prompts against the original on a fixed eval set. The goal is finding the Pareto frontier: minimum cost at maximum quality, not minimum cost at any quality.

Optimize with STCO

AI Prompt Architect's structured framework generates token-efficient prompts by default — every output is formatted for production cost control.

Start Optimizing Free →

Prompt Optimization: The Evidence

Every claim below is sourced from peer-reviewed research and industry reports.Browse all 141 citations →

Prompt caching reduces static context costs.

Cached prompt tokens cost $0.30/MTok vs $3.00/MTok uncached on Claude 3.5 Sonnet — a 90% reduction on repeated system instructions.

Without prompt caching, enterprise pipelines re-tokenise and re-bill the same system prompt across thousands of requests, paying 10x more for identical static context.

Anthropic, 'Prompt Caching (Beta)' documentation, 2024

Output tokens are significantly more expensive than input tokens.

GPT-4o charges $15.00/MTok for output vs $5.00/MTok for input — a 3x premium. Constraining max_tokens from 4096 to 500 saves $11.25 per million requests.

Without output length constraints, LLMs generate verbose responses that consume the most expensive billing vector — output tokens — at 3x the input rate.

OpenAI, 'API Pricing' page, updated 2024

Few-shot extraction minimizes context window usage vs zero-shot verbose.

3 well-crafted few-shot examples (150 tokens) outperform a 600-token verbose instruction block, saving 75% on input costs per request.

Without concise few-shot examples, developers write lengthy prose instructions that consume 4x more tokens for equivalent or inferior output quality.

Brown et al., 'Language Models are Few-Shot Learners', NeurIPS 2020

Tiered model routing based on prompt complexity.

Routing 70% of queries to Haiku ($0.25/MTok) and 30% to Opus ($15/MTok) reduces average cost by 45% compared to Opus-only, with only 2% quality degradation.

Without complexity-based routing, every query — including trivial classification and formatting tasks — hits the most expensive model tier, wasting 60x on tasks that a cheap model handles identically.

Unify AI, 'Dynamic Model Routing for Cost-Optimized LLM Inference' documentation, 2024

Outlines' grammar-guided generation produces valid JSON on every call with 0% retry rate, versus 15% retry rates with un.Outlines, '.txt: Structured Generation with Gramma…