RAG Patterns

RAG is not one architecture — it’s a family. These patterns are the recurring shapes, ordered simple to complex. Match the pattern to the problem and resist climbing higher than you need.

Simple Q&A RAG

The baseline: one knowledge source, one retrieval step, one generation step.

Flow: Query → Embed → Vector search → Top-k chunks → Grounded prompt → Answer

Use when: a single, reasonably uniform knowledge base; questions answerable from one retrieval. Strengths: simple, fast, cheap, debuggable. Start here for almost everything — and add complexity only when evaluation proves you need it.
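
A minimal sketch of the baseline flow follows. It assumes a toy bag-of-words vector in place of a real embedding model and an in-memory list in place of a vector database. The names `embed`, `retrieve`, and `grounded_prompt` are hypothetical, and the final LLM call is left out.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge-base chunks; a real system would load and chunk documents.
CHUNKS = [
    "Refunds are issued within 14 days of a return being received.",
    "Our API rate limit is 100 requests per minute per key.",
    "Support is available Monday to Friday, 9am to 5pm.",
]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Index step: embed every chunk once, ahead of query time.
INDEX = [(chunk, embed(chunk)) for chunk in CHUNKS]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Vector search: rank chunks by similarity to the query embedding, keep the top k."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def grounded_prompt(query: str, chunks: list[str]) -> str:
    """Grounded prompt: numbered context plus an instruction to answer only from it."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below, citing sources by number.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


question = "What is the API rate limit?"
prompt = grounded_prompt(question, retrieve(question))
# `prompt` would then go to an LLM (not shown) to generate the answer.
```

Every stage is a plain function, so each one can be inspected or logged on its own. That is much of what makes this pattern easy to debug.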

Query routing

Different question types need different handling. A router classifies each query first, then dispatches it to the right handler.

Flow: Query → Router → one of: Product-docs index · Past-tickets index · SQL query (for "how many" / metrics questions) · Plain LLM (chit-chat, no retrieval)

Use when: queries are heterogeneous, or some shouldn’t trigger retrieval at all. Watch: the router is a failure point — misroute and the answer is doomed; evaluate it on its own.
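
Below is a sketch of a router. It assumes simple keyword rules in place of the LLM or trained classifier a real system would more likely use, and the route names and handlers are hypothetical. The last lines follow the advice to evaluate the router on its own: they score it against a small labeled set before anyone looks at end-to-end answers.

```python
import re

# Keyword rules stand in for an LLM or trained classifier (an assumption for
# illustration). Rules are checked in order, so the first match wins.
ROUTES = [
    ("sql", re.compile(r"\b(how many|count|average|total|metrics?)\b", re.I)),
    ("tickets", re.compile(r"\b(error|bug|broken|fails?|crash(es)?)\b", re.I)),
    ("docs", re.compile(r"\b(how do i|configure|set ?up|install)\b", re.I)),
]
CHITCHAT = re.compile(r"^\s*(hi|hello|thanks|thank you)\b", re.I)

# Hypothetical handlers, one per branch: product docs, past tickets, SQL, plain LLM.
HANDLERS = {
    "docs": lambda q: f"search product-docs index: {q}",
    "tickets": lambda q: f"search past-tickets index: {q}",
    "sql": lambda q: f"generate and run SQL: {q}",
    "plain_llm": lambda q: f"answer directly, no retrieval: {q}",
}


def route(query: str) -> str:
    """Classify the query and return the name of the handler to dispatch to."""
    if CHITCHAT.search(query):
        return "plain_llm"
    for name, pattern in ROUTES:
        if pattern.search(query):
            return name
    return "docs"  # Default: fall back to the main knowledge base.


def dispatch(query: str) -> str:
    return HANDLERS[route(query)](query)


# Evaluate the router on its own against hand-labeled queries (hypothetical examples).
LABELED = [
    ("How many refunds did we issue in May?", "sql"),
    ("Login fails with error 403", "tickets"),
    ("How do I configure SSO?", "docs"),
    ("thanks!", "plain_llm"),
]
accuracy = sum(route(q) == label for q, label in LABELED) / len(LABELED)
misrouted = [(q, route(q), label) for q, label in LABELED if route(q) != label]
```

Listing `misrouted` rather than reporting accuracy alone shows which query types the router confuses.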

Multi-source retrieval

Some questions genuinely need several sources at once. Retrieve from each source in parallel, merge the results, rerank them, then generate.

Flow: Query → Docs · Tickets · Code / wiki (retrieved in parallel) → Merge → Rerank → Generate

Use when: complete answers span multiple repositories. Watch: merging needs a reranker so the best chunks survive regardless of source; cost and latency rise with each source.
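
Here is a sketch of parallel fan-out, merge, and rerank. The sources, `search`, and `rerank_score` are hypothetical stand-ins; a real reranker would typically be a cross-encoder model. The key design choice is that one function scores every merged chunk, so the final order does not depend on which source returned a chunk.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory sources standing in for separate indexes.
SOURCES = {
    "docs": ["Set the timeout with the TIMEOUT env variable.", "Install with pip."],
    "tickets": ["Ticket 812: timeout errors fixed by raising TIMEOUT to 30s."],
    "wiki": ["Team lunch is on Fridays.", "Timeout defaults are documented in the runbook."],
}


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(source: str, query: str, k: int = 2) -> list[tuple[str, str]]:
    """Per-source retrieval: up to k (source, chunk) pairs sharing a term with the query."""
    q = tokens(query)
    hits = [chunk for chunk in SOURCES[source] if q & tokens(chunk)]
    return [(source, chunk) for chunk in hits[:k]]


def rerank_score(query: str, chunk: str) -> float:
    """Stand-in for a reranker model: fraction of query terms found in the chunk."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0


def multi_source_retrieve(query: str, top_n: int = 3) -> list[tuple[str, str]]:
    # Fan out: query every source in parallel.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        per_source = list(pool.map(lambda name: search(name, query), SOURCES))
    # Merge all hits, then rerank with one scorer so the source of origin does not matter.
    merged = [hit for hits in per_source for hit in hits]
    merged.sort(key=lambda hit: rerank_score(query, hit[1]), reverse=True)
    return merged[:top_n]


results = multi_source_retrieve("How do I fix timeout errors?")
```

Each added source is one more parallel call plus more candidates to rerank. This is where the extra cost and latency come from.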

Query transformation

When the raw question is a poor search query, transform it before retrieving: rewrite it, expand it into multiple queries, or decompose it into sub-questions. See Chunking & Retrieval.

Flow: Query → Transform query (rewrite · expand · decompose) → Retrieve per query → Union → Generate

Use when: conversational follow-ups (“what about the other one?”), vague questions, or multi-part questions. Watch: the transformation adds an LLM call before retrieval, trading latency for recall.
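
The following sketch covers decomposition only. It assumes a crude split on `?` and `and` in place of the LLM call; rewriting and expansion would be further prompts to the same model. The corpus, stopword list, and `retrieve` are hypothetical. With one chunk retrieved per query, the original two-part question surfaces only the Basic pricing chunk. The decomposed sub-queries also recover the Pro SSO chunk.

```python
import re

# Hypothetical corpus and keyword retriever, used only to show the transform step.
CORPUS = [
    "Basic tier cost: $10 per month.",
    "Pro tier cost: $25 per month.",
    "Pro tier includes SSO.",
    "Basic tier includes email support.",
]
STOPWORDS = {"what", "does", "do", "how", "much", "the", "a", "is", "it", "per", "and"}


def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank chunks by content-term overlap with the query and keep the top k."""
    q = terms(query)
    return sorted(CORPUS, key=lambda chunk: len(q & terms(chunk)), reverse=True)[:k]


def decompose(query: str) -> list[str]:
    """Crude stand-in for an LLM decomposition prompt: split on '?' and the word 'and'."""
    parts = [part.strip() for part in re.split(r"\?|\band\b", query) if part.strip()]
    return parts or [query]


def transformed_retrieve(query: str) -> list[str]:
    """Retrieve once per sub-query, then take the order-preserving union."""
    union: list[str] = []
    for sub_query in decompose(query):
        for chunk in retrieve(sub_query):
            if chunk not in union:
                union.append(chunk)
    return union


question = "What does Basic cost and does Pro include SSO?"
```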

Agentic RAG

Retrieval becomes a tool that an agent decides when to use. The agent chooses whether to retrieve and from where, judges whether the results suffice, and can search again.

Flow: Query → Agent → Retrieve → Assess: enough? If yes → Answer; if no → refine and retry.

Use when: unpredictable queries, multi-hop questions, or results that need iterative refinement. Watch: every step is another LLM call, so this is the most expensive and slowest pattern, and the hardest to make predictable. Adopt it last.
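
This sketch shows the retrieve, assess, refine loop. The three decisions are passed in as functions, so LLM-backed versions could be swapped in. The toy index and the `toy_*` functions are hypothetical. The `max_steps` cap is an added assumption: bounding the loop is one way to keep cost and latency predictable.

```python
from typing import Callable

# Hypothetical toy index keyed by search phrase; a real agent would call real tools.
INDEX = {
    "sso login loop": ["SSO docs: enable SSO under Settings > Security."],
    "saml clock skew": ["Ticket 91: SAML login loop fixed by syncing server clocks."],
}


def toy_retrieve(search: str) -> list[str]:
    return INDEX.get(search.lower(), [])


def toy_assess(query: str, evidence: list[str]) -> bool:
    """Stand-in for an LLM judge: results suffice once some chunk describes a fix."""
    return any("fixed" in chunk for chunk in evidence)


def toy_refine(query: str, last_search: str, evidence: list[str]) -> str:
    """Stand-in for an LLM proposing a better search after a miss."""
    return "saml clock skew"


def agentic_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    assess: Callable[[str, list[str]], bool],
    refine: Callable[[str, str, list[str]], str],
    max_steps: int = 3,
) -> tuple[list[str], int, bool]:
    """Retrieve, assess, and refine until evidence suffices or the step budget runs out."""
    evidence: list[str] = []
    search = query
    for step in range(1, max_steps + 1):
        for chunk in retrieve(search):
            if chunk not in evidence:
                evidence.append(chunk)
        if assess(query, evidence):
            return evidence, step, True
        search = refine(query, search, evidence)
    # Budget exhausted: the caller can answer with caveats or escalate.
    return evidence, max_steps, False


evidence, steps, enough = agentic_retrieve("SSO login loop", toy_retrieve, toy_assess, toy_refine)
```

Here the first search misses, the refined search finds the fix, and the loop stops after two steps. In a real agent, each `assess` and `refine` call is another LLM round trip.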

Most production RAG, whatever the pattern, also includes:

  • Hybrid search (vector + keyword) — so exact terms aren’t lost.
  • Reranking — retrieve broad, rerank, keep the best few.
  • Metadata filtering — permissions, recency, tenant isolation.
  • Caching — for repeated or similar queries.
  • Citations — every claim traceable to a source.
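
The sketch below combines three of these: metadata filtering (tenant isolation applied before any ranking), hybrid search, and caching. The two rankers are hypothetical stand-ins: character-trigram cosine for the vector side and exact term counts for the keyword side. The rankings are fused with reciprocal rank fusion, one common choice, using its usual constant k = 60. `functools.lru_cache` only catches exact repeats; catching similar queries would need a semantic cache.

```python
import math
import re
from collections import Counter
from functools import lru_cache

# Hypothetical chunks with metadata; "tenant" drives isolation filtering.
DOCS = [
    {"id": "a1", "tenant": "acme", "text": "Error E1234 means the license key expired."},
    {"id": "a2", "tenant": "acme", "text": "Renew an expired license from the billing page."},
    {"id": "b1", "tenant": "globex", "text": "Error E1234 is reserved for internal use."},
]


def trigrams(text: str) -> Counter:
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def vector_rank(query: str, docs: list[dict]) -> list[str]:
    """Vector side (stand-in for embeddings): cosine over character-trigram counts."""
    q = trigrams(query)
    q_norm = math.sqrt(sum(v * v for v in q.values()))

    def similarity(doc: dict) -> float:
        d = trigrams(doc["text"])
        dot = sum(count * d[g] for g, count in q.items())
        return dot / (q_norm * math.sqrt(sum(v * v for v in d.values())))

    return [doc["id"] for doc in sorted(docs, key=similarity, reverse=True)]


def keyword_rank(query: str, docs: list[dict]) -> list[str]:
    """Keyword side (stand-in for BM25): count exact query-term matches, drop non-matches."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    scored = []
    for doc in docs:
        doc_terms = re.findall(r"[a-z0-9]+", doc["text"].lower())
        scored.append((sum(term in q for term in doc_terms), doc["id"]))
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each ranking adds 1 / (k + rank) to a document's score."""
    scores: Counter = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in scores.most_common()]


@lru_cache(maxsize=1024)  # Caching: repeated (query, tenant) pairs skip the search.
def hybrid_search(query: str, tenant: str, top_n: int = 2) -> tuple[str, ...]:
    # Metadata filter first, so another tenant's chunks can never be ranked or returned.
    allowed = [doc for doc in DOCS if doc["tenant"] == tenant]
    fused = rrf([vector_rank(query, allowed), keyword_rank(query, allowed)])
    return tuple(fused[:top_n])
```

The keyword side guarantees that an exact term such as an error code lifts the chunk containing it, even where the vector side is fuzzy. Because the cache is keyed on the arguments, it must be cleared with `hybrid_search.cache_clear()` whenever the index changes.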

| Situation | Pattern |
|---|---|
| One knowledge base, direct questions | Simple Q&A RAG |
| Distinct query types / some need no retrieval | Query routing |
| Answers span multiple sources | Multi-source retrieval |
| Vague, conversational, or multi-part queries | Query transformation |
| Unpredictable or multi-hop, need iteration | Agentic RAG |

RAG is a family of patterns of rising complexity: simple Q&A, query routing, multi-source, query transformation, and agentic RAG. Almost every system should begin with simple Q&A. Most production RAG layers in hybrid search, reranking, metadata filtering, caching, and citations regardless of pattern. Choose by the problem’s actual shape, and move to a more complex pattern only when evaluation proves the simpler one is failing.