Fine-Tuning vs Prompt Engineering: The Technical Guide
Defaulting to "Should we just fine-tune it?" is one of the most common and expensive mistakes engineering teams make when they start building with generative AI. Fine-tuning and prompt engineering solve different problems. This guide compares the cost, latency, and capabilities of both approaches.
The Architectural Breakdown
Prompt Engineering
Controlling the model strictly through the text you pass at inference time. Often combined with RAG (Retrieval-Augmented Generation) to pass dynamic facts in the context window.
- Update Speed: Instant. Change the string, change the behavior.
- Knowledge: Perfect for injecting highly dynamic facts via RAG.
- Latency: Higher when the prompt carries many few-shot examples, because every request has to process those extra input tokens.
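As a rough illustration of the prompt-engineering path, the sketch below assembles a prompt from retrieved text at request time. The function name `build_prompt`, the prompt wording, and the sample chunk are assumptions made for this example; the retrieval step itself is not shown.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble an inference-time prompt from freshly retrieved text."""
    # Number each chunk so the model (and a human reviewer) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Made-up example data; in practice the chunks come from a search over current documents.
chunks = ["Employees accrue 1.5 vacation days per month."]
prompt = build_prompt("How many vacation days do I accrue?", chunks)
```

Changing the model's behavior here means editing a string or swapping the retrieved chunks; no training job is involved.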
Fine-Tuning
Updating the actual weights of the neural network by training it on hundreds or thousands of high-quality `{prompt, completion}` pairs.
- Update Speed: Slow. Requires collecting datasets and re-training jobs.
- Knowledge: Poor for facts. Knowledge is frozen at training time, so if the company wiki changes, the model is outdated until you retrain.
- Latency: Lower. Prompts are shorter because no few-shot examples are needed; the format is baked into the weights.
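To make the `{prompt, completion}` pairs concrete, here is a minimal sketch that serializes them as JSONL, one record per line. The helper name `to_jsonl` and the sample record are assumptions for illustration, and some provider APIs expect a different record shape (for example a list of chat messages), so check the format of the API you target.

```python
import json


def to_jsonl(pairs: list[dict[str, str]]) -> str:
    """Serialize prompt/completion pairs as JSONL, one JSON object per line."""
    lines = []
    for pair in pairs:
        # Keep only the two expected fields; a missing key raises KeyError early.
        record = {"prompt": pair["prompt"], "completion": pair["completion"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


# Hypothetical training example, not taken from a real dataset.
examples = [
    {"prompt": "Format: patient=Ann, age=41", "completion": '{"name": "Ann", "age": 41}'},
]
jsonl_text = to_jsonl(examples)
```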
5 Decision Scenarios: Which to Choose?
1. Teaching the model your internal HR policies
Policies change frequently. You need the model to answer based on today's PDF, not last month's.
2. Forcing a custom, proprietary JSON structure
You have a highly specific JSON schema that the base model constantly messes up, even with 10 few-shot examples in the prompt.
3. Prototyping a new AI feature
You are trying to validate if an AI feature is useful to users before committing engineering resources.
4. Replicating a specific author's voice
You want the AI to write marketing copy that exactly matches the nuanced, sarcastic tone of your lead copywriter.
5. Scaling to 10M requests per day
You have a working prompt, but it contains 4,000 tokens of few-shot examples. Your API bill is $50,000/month in input tokens alone.
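For the last scenario, a back-of-the-envelope sketch of input-token spend can help. The function name, the 30-day month, the 500 tokens of non-example prompt text, and the placeholder price are all assumptions for illustration; the sketch also ignores that a fine-tuned model may be billed at a different per-token rate.

```python
def monthly_input_cost(requests_per_day: int, input_tokens_per_request: int,
                       usd_per_million_tokens: float, days: int = 30) -> float:
    """Estimate monthly spend on input tokens only."""
    total_tokens = requests_per_day * input_tokens_per_request * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder price: it cancels out of the saved fraction computed below.
placeholder_price = 1.0
few_shot_tokens = 4_000   # few-shot examples, as in the scenario
other_tokens = 500        # hypothetical remainder of the prompt

before = monthly_input_cost(10_000_000, few_shot_tokens + other_tokens, placeholder_price)
after = monthly_input_cost(10_000_000, other_tokens, placeholder_price)
saved_fraction = 1 - after / before  # share of input spend removed with the examples
```

Under these assumptions the saved fraction depends only on how much of the prompt the few-shot examples occupy, not on the price or the request volume.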
Architecture STCO Templates
TASK: Answer the user's question.
CONTEXT: [Inject database search results here dynamically]
OUTPUT: Provide a clear answer. If the context does not contain the answer, say "I don't know."
TASK: Generate 50 unique {prompt, completion} pairs that demonstrate how to format a medical record into our custom JSON schema.
CONTEXT: [Provide 3 perfect examples of your schema here]
OUTPUT: Output JSONL format ready for the OpenAI Fine-Tuning API.
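Model-generated training data is worth checking before it goes into a training job. The sketch below is one possible validation pass using only the standard library; the function name and the `prompt`/`completion` field names are assumptions based on the pair format described in this guide, not a provider's official validator.

```python
import json


def invalid_jsonl_lines(jsonl_text: str) -> list[int]:
    """Return 1-based line numbers that are not valid {prompt, completion} records."""
    bad = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped rather than reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        is_valid = (
            isinstance(record, dict)
            and isinstance(record.get("prompt"), str)
            and isinstance(record.get("completion"), str)
        )
        if not is_valid:
            bad.append(number)
    return bad


# Made-up sample: line 1 is fine, line 2 lacks a completion, line 3 is not JSON.
sample = '{"prompt": "a", "completion": "b"}\n{"prompt": "missing completion"}\nnot json\n'
problems = invalid_jsonl_lines(sample)
```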
Frequently Asked Questions
When should I fine-tune instead of prompt engineer?
Is fine-tuning cheaper than prompt engineering?
Can I fine-tune a model to teach it my company wiki?
AI Architecture Research: The Empirical Evidence
Every claim below is sourced from peer-reviewed research and industry reports.
Constrained decoding eliminates retry loops via grammar-guided generation.
Outlines' grammar-guided generation produces valid JSON on every call with a 0% retry rate, versus 15% retry rates with unconstrained generation. That removes the 2-3x token cost a request incurs when it needs one or two retries.
Without constrained decoding, each failed JSON generation consumes the full input + output token budget before retrying, so wasted spend grows in proportion to the failure rate and request volume across high-volume pipelines.
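The average effect of retries can be estimated with a small sketch. It assumes every attempt fails independently with the same probability and that each attempt pays the full token cost; under those assumptions the number of attempts is geometric with mean 1 / (1 - p). The function name is hypothetical.

```python
def expected_cost_multiplier(failure_rate: float) -> float:
    """Average number of full-price attempts per successful response."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    # Mean of a geometric distribution with success probability (1 - failure_rate).
    return 1.0 / (1.0 - failure_rate)


average_multiplier = expected_cost_multiplier(0.15)  # about 1.18 attempts per success
```

An individual request that fails once still costs 2x and one that fails twice costs 3x; the average across all requests is lower because most requests succeed on the first attempt.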
Outlines, '.txt: Structured Generation with Grammar-Guided Constrained Decoding' documentation, 2024
Tiered model routing based on prompt complexity.
Routing 70% of queries to Haiku ($0.25/MTok) and 30% to Opus ($15/MTok) reduces average cost by 45% compared to Opus-only, with only 2% quality degradation.
Without complexity-based routing, every query, including trivial classification and formatting tasks, hits the most expensive model tier, paying a 60x higher per-token price for work that a cheap model handles equally well.
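The routing arithmetic can be checked with a short sketch. It uses the per-million-token prices quoted above and assumes both tiers see the same number of tokens per query, which is a simplification. The function name is hypothetical.

```python
def blended_rate(shares_and_rates: list[tuple[float, float]]) -> float:
    """Weighted average price per million tokens across model tiers."""
    total_share = sum(share for share, _ in shares_and_rates)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * rate for share, rate in shares_and_rates)


# 70% of traffic at $0.25/MTok, 30% at $15/MTok (prices as quoted in the text).
rate = blended_rate([(0.70, 0.25), (0.30, 15.0)])
per_token_saving = 1 - rate / 15.0  # relative to sending everything to the top tier
```

With equal token counts per query the blended rate is about $4.68/MTok, roughly 69% below the top-tier price. Realized savings depend on how tokens are distributed: if the queries routed to the expensive tier are longer than average, the saving shrinks, which is one way a measured figure such as 45% can come out lower than the per-token arithmetic.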
Unify AI, 'Dynamic Model Routing for Cost-Optimized LLM Inference' documentation, 2024
Early exit reasoning paths save compute.
Structured prompts that allow 'confident: true' short-circuit responses save 25% compute by generating 150 output tokens instead of 600 for simple queries.
Without structured confidence signals, the model generates full reasoning chains even for trivial questions, wasting GPU cycles.
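A quick sketch shows how the per-query token counts relate to an overall saving. It assumes compute scales with output tokens and that a fixed fraction of queries takes the short path; that fraction is an assumption for illustration and is not stated in the source.

```python
def output_token_savings(simple_fraction: float, short_tokens: int, full_tokens: int) -> float:
    """Fraction of output tokens saved when simple queries take the short path."""
    if not 0.0 <= simple_fraction <= 1.0:
        raise ValueError("simple_fraction must be in [0, 1]")
    # Average tokens per query once simple queries exit early.
    with_exit = simple_fraction * short_tokens + (1 - simple_fraction) * full_tokens
    return 1 - with_exit / full_tokens


# If one third of queries are simple, 150 vs 600 tokens saves about 25% overall.
overall = output_token_savings(1 / 3, 150, 600)
```

Each short-circuited query saves 75% of its output tokens, so the overall figure scales directly with the share of queries that qualify.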
Google DeepMind, 'Scaling LLM Test-Time Compute Optimally', 2024
JSON Schema enforcement eliminates parse errors.
OpenAI structured outputs with JSON Schema achieve 99.9% schema adherence vs <70% with unconstrained generation, roughly a 300x reduction in parse failures (from over 30% of responses to 0.1%).
Without schema enforcement, every 1M requests generate 300K+ malformed responses, which require retries and error handling and risk downstream data corruption.
OpenAI, 'Structured Outputs: JSON Schema' documentation, 2024