The RAG Pipeline

RAG has two phases that run at completely different times: indexing happens ahead of time (offline), and retrieval + generation happens per request (online). Keeping them straight is the foundation for everything else.

Phase 1: Indexing (offline)

You prepare your knowledge so it can be searched. This is a batch job, run when content is added or changed.

[Diagram: offline phase, runs when content is added or changed. Documents (PDFs, tickets) → Load (raw content) → Chunk (split to passages) → Embed (chunk → vector) → Store (vector DB)]

  1. Load — pull raw content from its sources: files, a CMS, a database, wikis, support tickets.
  2. Chunk — split each document into passages small enough to embed and to fit usefully in a prompt. This step makes or breaks quality — see Chunking & Retrieval.
  3. Embed — convert each chunk into a vector with an embedding model.
  4. Store — write the vector, the original chunk text, and metadata (source, date, permissions) into a vector database.
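
The chunk step above can be sketched as a sliding window over the text. This is a minimal illustration, not a recommended chunker: it assumes `size` and `overlap` are measured in characters, and the `Chunk` class, the `doc_id` parameter, and the `doc_id-n` id scheme are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    id: str
    text: str


def split_into_chunks(doc_id: str, text: str, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Split text into windows of `size` characters.

    Each window shares its first `overlap` characters with the end of the
    previous window. Character-based sizing is an assumption of this sketch.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for n, start in enumerate(range(0, len(text), step)):
        chunks.append(Chunk(id=f"{doc_id}-{n}", text=text[start:start + size]))
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, at the cost of storing some text twice.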

Phase 2 — Retrieval & generation (online)

This runs for every user query, in real time.

[Diagram: online phase, runs for every user query. Query → Embed query → Search vector DB → Top-k chunks → Assemble prompt (system instructions, retrieved chunks, user question) → LLM → Grounded answer + citations]

  1. Embed the query with the same embedding model used at indexing.
  2. Retrieve the top-k most similar chunks from the vector database.
  3. Augment — build a prompt placing the retrieved chunks as context.
  4. Generate — the LLM answers using that context, ideally citing sources.
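
To make the retrieve step concrete, here is a rough sketch of what a top-k search does, assuming cosine similarity as the similarity measure and a brute-force scan instead of a real vector database. The `cosine` and `top_k` helpers and the three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector: list[float], index: list[tuple[str, list[float]]], k: int = 5):
    """Score every stored vector against the query and return the k best, highest first."""
    scored = [(chunk_id, cosine(query_vector, vector)) for chunk_id, vector in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made vectors standing in for embedded chunks.
index = [
    ("refunds", [0.9, 0.1, 0.0]),
    ("shipping", [0.1, 0.9, 0.0]),
    ("returns", [0.7, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], index, k=2)
print([chunk_id for chunk_id, _ in results])  # ['refunds', 'returns']
```

A vector database typically does the same ranking with an approximate index, so it avoids scoring every stored vector on each query.
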

# --- Phase 1: Indexing (run offline) ---
# split_into_chunks, embed, vector_db, and llm stand in for your own components.
def index(documents):
    for doc in documents:
        for chunk in split_into_chunks(doc, size=500, overlap=50):
            vector = embed(chunk.text)
            vector_db.upsert(
                id=chunk.id,
                vector=vector,
                text=chunk.text,
                metadata={"source": doc.source, "date": doc.date},
            )


# --- Phase 2: Retrieval + generation (per request) ---
def answer(question):
    q_vector = embed(question)  # same model as indexing
    chunks = vector_db.search(q_vector, top_k=5)  # retrieve
    context = "\n\n---\n\n".join(
        f"[{c.metadata['source']}]\n{c.text}" for c in chunks
    )
    prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.
Cite the source in brackets after each claim.

Context:
{context}

Question: {question}"""
    return llm(prompt, temperature=0)  # low temp: be faithful

That is naive RAG — and it’s a legitimate starting point. Ship it, measure it, then improve the weak parts.

The instruction wrapping the context is doing real work. A good RAG prompt should:

  • Restrict the model to the context — “use ONLY the context below.”
  • Permit “I don’t know” — give it an honest exit so it doesn’t invent an answer when retrieval misses.
  • Require citations — so users (and you) can verify, and hallucinations become visible.
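
Requiring citations also gives you something you can check automatically. The sketch below assumes the model follows the bracketed `[source]` format requested in the prompt; the `unsupported_citations` helper and the file names are hypothetical.

```python
import re


def unsupported_citations(answer: str, retrieved_sources: set[str]) -> set[str]:
    """Return bracketed citations in `answer` that match none of the retrieved sources."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return cited - retrieved_sources


answer_text = (
    "Refunds take 5 days [refund-policy.md]. "
    "Shipping is free [pricing.md]."
)
retrieved = {"refund-policy.md", "faq.md"}
print(unsupported_citations(answer_text, retrieved))  # {'pricing.md'}
```

This only catches citations to sources that were never retrieved. It does not verify that a cited source actually supports the claim, which still needs human review or a separate evaluation step.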

The baseline fails in predictable ways, each with a known fix:

| Symptom | Likely cause | Fixed in |
| --- | --- | --- |
| Answers miss obvious info | Bad chunking; pure vector search | Chunking & Retrieval |
| Right topic, wrong specifics | No reranking; k too low | Chunking & Retrieval |
| Exact terms / IDs not found | Vector-only search | Hybrid search |
| Vague or multi-part questions fail | Query needs transformation | Advanced RAG |
| Good context, bad answer | Generation/prompt problem | Evaluation |

The rest of this section walks through the fixes.

RAG has an offline indexing phase (load, chunk, embed, store) and an online phase (embed query, retrieve, augment, generate). Use the same embedding model for both. Naive RAG — top-k vector search plus a grounding prompt — is a valid v1. The grounding prompt must restrict the model to the context, allow “I don’t know,” and require citations. Naive RAG fails predictably, and every failure has a specific remedy.