Strategy · 13 March 2026 · 14 min read · AI Prompt Architect

RAG vs Long Context Windows: Architectural Decision Guide

The Context Problem

Every production LLM application faces the same fundamental challenge: how do you give the model access to relevant information that wasn't in its training data? Your company's documentation, user data, product catalogue, codebase — none of it exists in the weights of GPT-4 or Claude. You have two architectural approaches:

  • RAG (Retrieval-Augmented Generation) — Retrieve relevant chunks from an external datastore and inject them into the prompt
  • Long Context Windows — Stuff the entire knowledge base directly into the prompt (e.g., Claude's 200K tokens, Gemini's 1M+ tokens)

Neither approach is universally better. The right choice depends on your data volume, query patterns, latency requirements, and budget.

How RAG Works

  1. Indexing — Split your documents into chunks (typically 256-1024 tokens), generate embeddings for each chunk, store in a vector database
  2. Retrieval — When a user query arrives, embed the query, find the top-K most similar chunks via vector similarity search
  3. Generation — Inject the retrieved chunks into the prompt as context, then generate the answer

System: Answer the user's question using ONLY the following context documents.
If the answer isn't in the provided context, say "I don't have information about that."

Context Documents:
---
{chunk_1}
---
{chunk_2}
---
{chunk_3}

User Question: {query}
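The three steps above can be sketched end-to-end with an in-memory store. In this sketch, embed is a toy word-hashing stand-in for a real embedding model (normally an API call), and all names are illustrative rather than any specific library's API:

```typescript
// Toy embedding: hash each word into a fixed-size bag-of-words vector.
// A real pipeline would call an embedding model here instead.
function embed(text: string, dims = 64): number[] {
  const v = new Array(dims).fill(0);
  for (const word of text.toLowerCase().split(/\W+/)) {
    if (!word) continue;
    let h = 0;
    for (const ch of word) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    v[h % dims] += 1;
  }
  return v;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

interface Chunk { text: string; vector: number[]; }

// Step 1: Indexing — embed every chunk once, up front.
function index(chunks: string[]): Chunk[] {
  return chunks.map(text => ({ text, vector: embed(text) }));
}

// Step 2: Retrieval — embed the query, take the top-K most similar chunks.
function retrieve(store: Chunk[], query: string, topK: number): string[] {
  const q = embed(query);
  return store
    .map(c => ({ text: c.text, score: cosine(c.vector, q) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(c => c.text);
}

// Step 3: Generation — inject the retrieved chunks into the prompt template.
function buildPrompt(chunks: string[], query: string): string {
  return [
    "Answer the user's question using ONLY the following context documents.",
    "Context Documents:",
    ...chunks.map(c => `---\n${c}`),
    `\nUser Question: ${query}`,
  ].join("\n");
}
```

In production the store would be a real vector database and retrieval would be approximate nearest-neighbour search, but the indexing/retrieval/generation shape stays the same.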

How Long Context Works

With models supporting 100K-1M+ token context windows, you can skip retrieval entirely:

System: You are a documentation expert. Answer questions based on the following 
complete documentation set.

{entire_documentation_contents}

User Question: {query}

This is conceptually simpler — no embedding pipeline, no vector database, no chunk management. But it comes with significant trade-offs.

Decision Matrix

| Factor | RAG Wins | Long Context Wins |
| --- | --- | --- |
| Data Volume | > 200K tokens (millions of docs) | < 200K tokens total |
| Cost per Query | Lower (only relevant chunks sent) | Higher (entire corpus every query) |
| Latency | Lower (smaller prompts = faster generation) | Higher (processing entire context) |
| Accuracy | Can miss relevant context (retrieval failures) | Model sees everything (no retrieval gap) |
| Freshness | Near real-time (re-index changed docs) | Requires re-sending updated corpus |
| Infrastructure | Vector DB + embedding pipeline required | No additional infrastructure |
| Multi-hop Reasoning | Weak (chunks may lack cross-references) | Strong (model sees full context) |
| Complexity | Higher (chunking, embedding, retrieval tuning) | Lower (just concatenate and send) |
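The cost row is easy to quantify. A back-of-envelope sketch, using an illustrative input-token price rather than any specific vendor's rate:

```typescript
// Illustrative input price in USD — substitute your model's actual rate.
const INPUT_PRICE_PER_1K_TOKENS = 0.003;

function costPerQuery(contextTokens: number, queryTokens: number): number {
  return ((contextTokens + queryTokens) / 1000) * INPUT_PRICE_PER_1K_TOKENS;
}

// RAG sends only the retrieved chunks; long context sends the whole corpus.
const ragCost = costPerQuery(5 * 512, 100);         // 5 chunks of 512 tokens
const longContextCost = costPerQuery(200_000, 100); // full 200K-token corpus
```

At these assumed numbers the long-context query costs roughly 75× the RAG query, which is why per-token cost dominates the decision at high query volume.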

When to Use RAG

  • Large knowledge bases — Customer support with 10,000+ articles, legal document search, enterprise wikis
  • Cost-sensitive applications — High query volume where per-token costs matter
  • Frequently updated data — Product catalogues, pricing databases, inventory systems
  • Multi-tenant applications — Each user has their own data; retrieval naturally scopes to their documents
  • Latency-critical — Sub-second response times where processing 200K tokens is too slow

When to Use Long Context

  • Small, stable corpora — A single codebase, a company handbook, a product specification
  • Cross-reference-heavy tasks — Legal contract analysis, code review across multiple files, research synthesis
  • Simplicity is paramount — Prototyping, MVP stage, small teams without ML infrastructure
  • Summarisation tasks — The model needs to see the entire document to summarise it properly
  • One-shot analysis — Upload a document, ask questions, discard. No need to persist embeddings

The Hybrid Approach

The best production systems often combine both approaches. The sketch below assumes hypothetical vectorDB, reranker, embed, and longContextModel clients — substitute your own:

async function answerQuery(query: string): Promise<string> {
  // Step 1: RAG retrieval for candidate documents
  const candidates = await vectorDB.search(await embed(query), { topK: 20 });
  
  // Step 2: Re-rank candidates
  const reranked = await reranker.rank(query, candidates);
  
  // Step 3: Take top results and use long context for synthesis
  const topDocs = reranked.slice(0, 5).map(d => d.fullContent);
  
  // Step 4: Send full documents (not chunks) to a long-context model
  const answer = await longContextModel.generate({
    system: "Answer based on these documents. Cite specific sections.",
    context: topDocs.join('\n---\n'),
    query: query
  });
  
  return answer;
}

This gives you RAG's efficiency for retrieval with long-context's accuracy for synthesis. You search broadly (RAG) then reason deeply (long context).

RAG Pitfalls to Avoid

  • Chunk sizes too small — Chunks under 200 tokens often lack sufficient context. Start with 512-1024 tokens
  • No overlap between chunks — Use 10-20% overlap so sentences aren't split mid-thought
  • Ignoring metadata — Always store and filter by document metadata (source, date, category) alongside embeddings
  • No evaluation pipeline — You need to measure retrieval quality (hit rate, MRR) separately from generation quality
  • Embedding model mismatch — Use the same embedding model for indexing and querying. Mixing models destroys similarity metrics
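The first two pitfalls can be addressed with a small chunking helper. This sketch approximates tokens as whitespace-split words for simplicity; a real pipeline would count with the model's actual tokenizer:

```typescript
// Split text into fixed-size chunks with a configurable overlap, so that
// sentences near a boundary appear in both neighbouring chunks.
function chunkWithOverlap(
  text: string,
  chunkSize = 512,
  overlapRatio = 0.15, // 10-20% overlap is a reasonable starting point
): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  // Advance by (chunkSize - overlap) tokens each step.
  const step = Math.max(1, Math.floor(chunkSize * (1 - overlapRatio)));
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}
```

With chunkSize 100 and overlapRatio 0.2, consecutive chunks share their last and first 20 tokens, so a sentence split at a boundary is still seen whole in at least one chunk.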

How AI Prompt Architect Helps

Whether you use RAG or long context, the prompt matters most. AI Prompt Architect's Generate workflow creates structured prompts that work with either architecture — including context injection points, grounding instructions ("only answer from the provided context"), and citation formatting. The Analyse workflow can evaluate your RAG prompts for common failure modes like context window overflow and instruction confusion.

Building a RAG backend in Python with Django? Our guide on scaffolding Django REST Framework APIs shows how to structure your retrieval endpoints with proper serialization, filtering, and pagination.

Tags: RAG, context-windows, architecture, vector-databases, LLM, retrieval
