The RAG Pipeline

RAG has two phases that run at completely different times: indexing happens ahead of time (offline), and retrieval + generation happens per request (online). Keeping them straight is the foundation for everything else.

Phase 1 — Indexing (offline)

You prepare your knowledge base so it can be searched. This is a batch job, rerun whenever content is added or changed.

Load — pull raw content from its sources: files, a CMS, a database, wikis, support tickets.
Chunk — split each document into passages small enough to embed and to fit usefully in a prompt. This step makes or breaks quality — see Chunking & Retrieval.
Embed — convert each chunk into a vector with an embedding model.
Store — write the vector, the original chunk text, and metadata (source, date, permissions) into a vector database.
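
The Chunk step can be sketched as a fixed-size sliding window with overlap. This is a minimal illustration, assuming sizes are measured in characters; the Chunk dataclass, the id scheme, and this split_into_chunks signature are hypothetical stand-ins for the helper of the same name in the baseline implementation. Real pipelines often split on tokens, sentences, or headings instead.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    id: str    # hypothetical id scheme: "<doc_id>:<start offset>"
    text: str


def split_into_chunks(text: str, doc_id: str, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Split text into windows of `size` characters; neighbours share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(id=f"{doc_id}:{start}", text=text[start:start + size]))
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means a sentence cut at a window boundary still appears whole in one of the two neighbouring chunks, at the cost of storing some text twice.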

Phase 2 — Retrieval & generation (online)

This runs for every user query, in real time.

Embed the query with the same embedding model used at indexing.
Retrieve the top-k most similar chunks from the vector database.
Augment — build a prompt placing the retrieved chunks as context.
Generate — the LLM answers using that context, ideally citing sources.
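
The Retrieve step amounts to ranking stored vectors by similarity to the query vector. A brute-force sketch follows, assuming cosine similarity as the metric and a plain in-memory list as the store. The names top_k_search and cosine and the record layout are illustrative, not any particular vector database's API; a real store would typically use an approximate nearest-neighbour index rather than scoring every record.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_search(query_vec, records, top_k=5):
    """records: list of (chunk_id, vector) pairs. Returns the top_k (chunk_id, score) pairs, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in records]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions or more.
records = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print([chunk_id for chunk_id, _ in top_k_search([1.0, 0.1], records, top_k=2)])  # ['a', 'c']
```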

A baseline implementation

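# NOTE: split_into_chunks, embed, vector_db, and llm are placeholders for your
# chunker, embedding model, vector store client, and LLM client.
# --- Phase 1: Indexing (run offline) ---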
def index(documents):
    for doc in documents:
        for chunk in split_into_chunks(doc, size=500, overlap=50):
            vector = embed(chunk.text)
            vector_db.upsert(
                id=chunk.id,
                vector=vector,
                text=chunk.text,
                metadata={"source": doc.source, "date": doc.date},
            )

# --- Phase 2: Retrieval + generation (per request) ---
def answer(question):
    q_vector = embed(question)                       # same model as indexing
    chunks   = vector_db.search(q_vector, top_k=5)   # retrieve

    context = "\n\n---\n\n".join(
        f"[{c.metadata['source']}]\n{c.text}" for c in chunks
    )
    prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.
Cite the source in brackets after each claim.

Context:
{context}

Question: {question}"""

    return llm(prompt, temperature=0)                # low temp: be faithful

That is naive RAG — and it’s a legitimate starting point. Ship it, measure it, then improve the weak parts.
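
One way to act on "measure it" is a retrieval hit rate: for a small set of questions with a known correct source, check how often that source shows up in the top-k results. The sketch below assumes a retrieve(question, k) callable that returns a list of source identifiers. That callable, the hit_rate name, and the sample data are all hypothetical.

```python
from typing import Callable, Iterable


def hit_rate(
    cases: Iterable[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of (question, expected_source) cases whose source appears in the top-k results."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(1 for question, expected in cases if expected in retrieve(question, k))
    return hits / len(cases)


# Toy retriever standing in for a real embed + vector search call.
def fake_retrieve(question: str, k: int) -> list[str]:
    canned = {
        "How do I reset my password?": ["faq.md", "auth.md"],
        "What is the refund window?": ["shipping.md", "faq.md"],
    }
    return canned.get(question, [])[:k]


cases = [
    ("How do I reset my password?", "auth.md"),     # found at rank 2
    ("What is the refund window?", "billing.md"),   # never retrieved
]
print(hit_rate(cases, fake_retrieve, k=5))  # 0.5
```

Tracking this number before and after a change to chunking or retrieval gives a rough signal of whether the change helped, independent of the generation step.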

The grounding prompt

The instruction wrapping the context is doing real work. A good RAG prompt should:

Restrict the model to the context — “use ONLY the context below.”
Permit “I don’t know” — give it an honest exit so it doesn’t invent an answer when retrieval misses.
Require citations — so users (and you) can verify, and hallucinations become visible.
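
Requiring citations also makes automated checks possible. As a rough sketch, assuming answers cite sources as [source] in the bracket format the baseline prompt asks for, one could extract the cited names and flag any that were not among the retrieved chunks. The function name and the regular expression are illustrative. This catches invented sources, not claims that misrepresent a real one.

```python
import re


def check_citations(answer: str, retrieved_sources: set[str]) -> dict[str, set[str]]:
    """Split the bracketed citations found in `answer` into known and unknown sources."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return {
        "known": cited & retrieved_sources,
        "unknown": cited - retrieved_sources,  # cited but never retrieved: likely invented
    }


report = check_citations(
    "Refunds take 5 days [billing.md]. Shipping is free [promo.md].",
    retrieved_sources={"billing.md", "faq.md"},
)
print(report["unknown"])  # {'promo.md'}
```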

Where naive RAG breaks

The baseline fails in predictable ways, each with a known fix:

Symptom	Likely cause	Fixed in
Answers miss obvious info	Bad chunking; pure vector search	Chunking & Retrieval
Right topic, wrong specifics	No reranking; k too low	Chunking & Retrieval
Exact terms / IDs not found	Vector-only search	Hybrid search
Vague or multi-part questions fail	Query needs transformation	Advanced RAG
Good context, bad answer	Generation/prompt problem	Evaluation

The rest of this section walks through the fixes.

Key takeaways

RAG has an offline indexing phase (load, chunk, embed, store) and an online phase (embed query, retrieve, augment, generate).
Use the same embedding model for both.
Naive RAG (top-k vector search plus a grounding prompt) is a valid v1.
The grounding prompt must restrict the model to the context, allow “I don’t know,” and require citations.
Naive RAG fails predictably, and every failure has a specific remedy.