The RAG Pipeline
RAG has two phases that run at completely different times: indexing happens ahead of time (offline), and retrieval and generation happen per request (online). Keeping them straight is the foundation for everything else.
Phase 1 — Indexing (offline)
You prepare your knowledge so it can be searched. This is a batch job, run when content is added or changed.
- Load — pull raw content from its sources: files, a CMS, a database, wikis, support tickets.
- Chunk — split each document into passages small enough to embed and to fit usefully in a prompt. This step makes or breaks quality — see Chunking & Retrieval.
- Embed — convert each chunk into a vector with an embedding model.
- Store — write the vector, the original chunk text, and metadata (source, date, permissions) into a vector database.
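The chunk step above can be sketched as a sliding window with overlap. This is a minimal illustration, not the splitter the baseline uses: it assumes sizes are counted in characters (real pipelines often count tokens or split on sentence or heading boundaries), and `Chunk` and `chunk_text` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    id: str
    text: str


def chunk_text(doc_id: str, text: str, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Split text into windows of `size` characters; neighbours share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for n, start in enumerate(range(0, len(text), step)):
        chunks.append(Chunk(id=f"{doc_id}-{n}", text=text[start:start + size]))
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Usage: a 1200-character document with the default size and overlap.
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text("doc", sample)
print([len(p.text) for p in pieces])
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.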
Phase 2 — Retrieval & generation (online)
This runs for every user query, in real time.
- Embed the query with the same embedding model used at indexing.
- Retrieve the top-k most similar chunks from the vector database.
- Augment — build a prompt placing the retrieved chunks as context.
- Generate — the LLM answers using that context, ideally citing sources.
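To make the retrieve step concrete, here is a brute-force sketch of top-k similarity search. It assumes cosine similarity as the measure and a tiny in-memory list as the index; both are illustrative choices, and a real vector database typically uses an approximate nearest-neighbour index rather than scanning every vector. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=5):
    """index is a list of (chunk_id, vector) pairs; returns the k best (chunk_id, score) pairs."""
    scored = [(chunk_id, cosine_similarity(query_vector, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:k]


# Usage: hand-written 2-d vectors stand in for real embeddings.
toy_index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print([chunk_id for chunk_id, _ in top_k([1.0, 0.0], toy_index, k=2)])  # ['a', 'c']
```

The query and the stored chunks must come from the same embedding model for these scores to mean anything, which is why the first online step reuses the indexing model.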
A baseline implementation
```python
# --- Phase 1: Indexing (run offline) ---
def index(documents):
    for doc in documents:
        for chunk in split_into_chunks(doc, size=500, overlap=50):
            vector = embed(chunk.text)
            vector_db.upsert(
                id=chunk.id,
                vector=vector,
                text=chunk.text,
                metadata={"source": doc.source, "date": doc.date},
            )


# --- Phase 2: Retrieval + generation (per request) ---
def answer(question):
    q_vector = embed(question)  # same model as indexing
    chunks = vector_db.search(q_vector, top_k=5)  # retrieve

    context = "\n\n---\n\n".join(
        f"[{c.metadata['source']}]\n{c.text}" for c in chunks
    )
    prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.
Cite the source in brackets after each claim.

Context:
{context}

Question: {question}"""

    return llm(prompt, temperature=0)  # low temp: be faithful
```

That is naive RAG, and it’s a legitimate starting point. Ship it, measure it, then improve the weak parts.
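One simple way to start measuring is a retrieval hit rate: for a small set of questions whose correct source you already know, check how often that source appears in the top-k results. The sketch below is an assumption about what such a check could look like, not a prescribed metric; `hit_rate_at_k`, the `(question, expected_source)` pair format, and the `retrieve` callable are all hypothetical.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected source shows up in the top-k results.

    eval_set: list of (question, expected_source) pairs.
    retrieve: callable(question, k) -> list of source names, best match first.
    """
    if not eval_set:
        return 0.0
    hits = sum(
        1 for question, expected in eval_set
        if expected in retrieve(question, k)[:k]
    )
    return hits / len(eval_set)


# Usage: a canned lookup stands in for the real retriever.
canned = {
    "How do refunds work?": ["billing.md", "faq.md"],
    "What is the API rate limit?": ["faq.md"],
}
eval_set = [
    ("How do refunds work?", "billing.md"),
    ("What is the API rate limit?", "limits.md"),
]
print(hit_rate_at_k(eval_set, lambda q, k: canned[q], k=5))  # 0.5
```

A low hit rate points at chunking or retrieval; a high hit rate with poor answers points at the prompt or the generation step.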
The grounding prompt
The instruction wrapping the context is doing real work. A good RAG prompt should:
- Restrict the model to the context — “use ONLY the context below.”
- Permit “I don’t know” — give it an honest exit so it doesn’t invent an answer when retrieval misses.
- Require citations — so users (and you) can verify, and hallucinations become visible.
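Requiring citations pays off most when something checks them. As a rough sketch, the function below pulls bracketed citations out of an answer and reports any that do not match a retrieved source. It assumes sources are cited as `[name]`, matching the bracket format in the baseline prompt; the function name `unknown_citations` is hypothetical, and a production check would also need to handle brackets used for other purposes.

```python
import re

# Matches the text inside one pair of square brackets, e.g. "[billing.md]".
CITATION_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def unknown_citations(answer: str, retrieved_sources: set[str]) -> set[str]:
    """Return cited source names that were not among the retrieved chunks."""
    cited = {match.strip() for match in CITATION_PATTERN.findall(answer)}
    return cited - retrieved_sources


# Usage: the second citation was never retrieved, so it is flagged.
reply = "Refunds take 5 days [billing.md]. Plans renew yearly [pricing.md]."
print(unknown_citations(reply, {"billing.md", "faq.md"}))  # {'pricing.md'}
```

An empty result does not prove the answer is faithful, only that every cited source was actually in the context; an answer with no citations at all also returns an empty set and deserves its own check.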
Where naive RAG breaks
The baseline fails in predictable ways, each with a known fix:
| Symptom | Likely cause | Fixed in |
|---|---|---|
| Answers miss obvious info | Bad chunking; pure vector search | Chunking & Retrieval |
| Right topic, wrong specifics | No reranking; k too low | Chunking & Retrieval |
| Exact terms / IDs not found | Vector-only search | Hybrid search |
| Vague or multi-part questions fail | Query needs transformation | Advanced RAG |
| Good context, bad answer | Generation/prompt problem | Evaluation |
The rest of this section walks through the fixes.
Key takeaways
- RAG has an offline indexing phase (load, chunk, embed, store) and an online phase (embed query, retrieve, augment, generate).
- Use the same embedding model for both.
- Naive RAG (top-k vector search plus a grounding prompt) is a valid v1.
- The grounding prompt must restrict the model to the context, allow “I don’t know,” and require citations.
- Naive RAG fails predictably, and every failure has a specific remedy.