Advanced RAG & Evaluation
Once a baseline RAG system runs, two things move it from “demo” to “dependable”: advanced patterns for hard cases, and evaluation so you know whether any change actually helped.
Advanced RAG patterns
Reach for these only when evaluation shows naive RAG is failing on a specific class of query.
Agentic RAG
Instead of always retrieving once, an LLM decides how to handle the query: whether to retrieve at all, from which source, and whether to search again after seeing weak results. Retrieval becomes a tool an agent calls, with self-correcting loops. Powerful for varied queries, at the cost of more LLM calls and higher latency.
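The loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `agentic_answer`, `decide`, `search`, and `generate` are hypothetical names, and the toy functions stand in for real LLM and index calls.

```python
from typing import Callable, List, Optional

def agentic_answer(
    question: str,
    decide: Callable[[str, List[str]], Optional[str]],
    search: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    max_steps: int = 3,
) -> str:
    """Let `decide` choose between searching again and answering.

    `decide` returns a search query, or None once the context is sufficient.
    `max_steps` caps the loop so cost and latency stay bounded.
    """
    context: List[str] = []
    for _ in range(max_steps):
        query = decide(question, context)
        if query is None:
            break
        context.extend(search(query))
    return generate(question, context)


# Toy stand-ins (hypothetical) so the sketch runs without a model or an index.
DOCS = {"refund policy": ["Refunds are issued within 14 days."]}

def toy_decide(question: str, context: List[str]) -> Optional[str]:
    return None if context else "refund policy"

def toy_search(query: str) -> List[str]:
    return DOCS.get(query, [])

def toy_generate(question: str, context: List[str]) -> str:
    return context[0] if context else "I don't know."

print(agentic_answer("How long do refunds take?", toy_decide, toy_search, toy_generate))
```

The step cap is what keeps the extra calls and latency predictable; in a real system `decide` would typically be an LLM call with tool definitions.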
Multi-hop RAG
Some questions need chained retrieval: “What did the CEO of our biggest competitor say about pricing?” requires finding the competitor, then the CEO, then the quote. The system uses query decomposition: it retrieves, reasons over the partial result, and retrieves again with what it learned.
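A rough sketch of that chaining, assuming the question has already been decomposed into ordered sub-questions. The `multi_hop` helper, the `{0}`-style placeholders, and the toy facts are all illustrative assumptions; a real system would produce the sub-questions with an LLM and answer them against an index.

```python
from typing import Callable, Dict, List

def multi_hop(sub_questions: List[str], retrieve: Callable[[str], str]) -> Dict[str, str]:
    """Answer sub-questions in order, feeding earlier answers into later ones.

    A sub-question may reference a previous answer as {0}, {1}, ...
    """
    answers: List[str] = []
    trace: Dict[str, str] = {}
    for template in sub_questions:
        query = template.format(*answers)  # fill in what earlier hops found
        answer = retrieve(query)
        answers.append(answer)
        trace[query] = answer
    return trace


# Toy facts (invented for the example) standing in for a retriever.
FACTS = {
    "biggest competitor": "Acme",
    "CEO of Acme": "Dana Lee",
    "Dana Lee on pricing": "We will not raise prices this year.",
}

steps = ["biggest competitor", "CEO of {0}", "{1} on pricing"]
for query, answer in multi_hop(steps, lambda q: FACTS.get(q, "unknown")).items():
    print(f"{query} -> {answer}")
```

Keeping the per-hop trace makes it easy to see which hop went wrong when the final answer is off.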
Graph RAG
When relationships between entities matter, build a knowledge graph and retrieve connected subgraphs rather than isolated text chunks. Strong for “how do these things relate” questions; heavier to build and maintain.
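To make "retrieve connected subgraphs" concrete, here is a small sketch that walks outward from a seed entity over (subject, relation, object) triples. The triple representation, the `subgraph` function, and the sample data are assumptions for illustration; production systems usually sit on a graph database and an entity-extraction step.

```python
from collections import deque
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def subgraph(triples: List[Triple], seed: str, max_hops: int = 2) -> List[Triple]:
    """Return every triple reachable within `max_hops` of the seed entity."""
    frontier = deque([(seed, 0)])
    seen: Set[str] = {seed}
    found: List[Triple] = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for triple in triples:
            subj, _, obj = triple
            if node not in (subj, obj):
                continue
            if triple not in found:
                found.append(triple)
            neighbor = obj if subj == node else subj
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return found


# Invented sample graph.
TRIPLES: List[Triple] = [
    ("Pro plan", "includes", "API access"),
    ("API access", "limited_by", "rate limit"),
    ("rate limit", "documented_in", "billing FAQ"),
    ("Free plan", "excludes", "API access"),
]

for triple in subgraph(TRIPLES, "Pro plan", max_hops=2):
    print(triple)
```

The hop limit is the main tuning knob: too small misses relationships, too large floods the prompt with loosely related facts.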
Contextual retrieval
At indexing time, an LLM adds a short, document-aware context blurb to each chunk before embedding, so an isolated chunk that says “the limit is 500” becomes “In the Pro plan billing section: the limit is 500.” A cheap, one-time indexing step that can measurably raise retrieval accuracy.
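The indexing step might look roughly like this. `contextualize` and `describe` are hypothetical names, and `toy_describe` replaces the LLM call with a nearest-heading lookup purely so the example runs offline.

```python
from typing import Callable, List

def contextualize(
    document: str,
    chunks: List[str],
    describe: Callable[[str, str], str],
) -> List[str]:
    """Prefix each chunk with a short blurb that situates it in the document."""
    return [f"{describe(document, chunk)}: {chunk}" for chunk in chunks]

def toy_describe(document: str, chunk: str) -> str:
    # Stand-in for an LLM call: use the nearest "# " heading above the chunk.
    before = document[: document.index(chunk)]
    headings = [line[2:] for line in before.splitlines() if line.startswith("# ")]
    return f"In the {headings[-1]} section" if headings else "In this document"


# Invented sample document.
DOCUMENT = "# Pro plan billing\nthe limit is 500\n# Free plan\nuploads are capped at 50"
CHUNKS = ["the limit is 500", "uploads are capped at 50"]

for text in contextualize(DOCUMENT, CHUNKS, toy_describe):
    print(text)  # this enriched string is what gets embedded
```

Embed the enriched string, but consider storing the original chunk alongside it so the generation prompt is not padded with blurbs.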
| Pattern | Solves | Cost |
|---|---|---|
| Agentic RAG | Varied, unpredictable queries | High — multiple LLM calls |
| Multi-hop | Questions needing chained facts | Medium |
| Graph RAG | Relationship-heavy questions | High — graph to build/maintain |
| Contextual retrieval | Chunks lacking standalone context | Low — one-time indexing cost |
Evaluating RAG
You cannot improve RAG without measuring it, and “it seemed fine” is not a measurement. RAG has two failure surfaces, retrieval and generation, and you must score them separately to know where to fix things.
Build a test set
Assemble 30–100+ representative (question, ideal answer, source chunks)
triples. Include the hard cases: multi-part questions, questions your data
can’t answer, and exact-term lookups. This set is the most valuable artifact
in a RAG project.
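One possible shape for such a test set is sketched below. The `EvalCase` fields, the tag names, and the sample questions and chunk IDs are all placeholders, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class EvalCase:
    question: str
    ideal_answer: str
    source_chunk_ids: Tuple[str, ...]  # empty when the corpus cannot answer
    tags: Tuple[str, ...] = ()


# Placeholder cases covering the hard categories named above.
CASES: List[EvalCase] = [
    EvalCase("What is the Pro plan limit?", "500", ("billing-03",), ("exact-term",)),
    EvalCase(
        "What is the limit, and how do I raise it?",
        "500; contact support.",
        ("billing-03", "support-01"),
        ("multi-part",),
    ),
    EvalCase("Who founded the company?", "I don't know.", (), ("unanswerable",)),
]

def tag_counts(cases: List[EvalCase]) -> Counter:
    """Count cases per tag to check that the hard categories are represented."""
    return Counter(tag for case in cases for tag in case.tags)

print(tag_counts(CASES))
```

Tagging each case lets you report scores per category, which is how you find out that a change helped exact-term lookups but hurt multi-part questions.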
Retrieval metrics
Did the right chunks come back?
- Context recall — were all the chunks needed to answer actually retrieved?
- Context precision — of the retrieved chunks, how many were relevant (and ranked high)?
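Both metrics can be computed directly from chunk IDs when the test set records which chunks are needed. The sketch below uses one common rank-aware formulation of precision (the mean of precision@k over the ranks that hold a relevant chunk); evaluation frameworks may define it differently, so treat the exact formula as an assumption.

```python
from typing import List, Set

def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the needed chunks that were actually retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)

def context_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Mean precision@k over the ranks where a relevant chunk appears.

    Rewards rankings that place relevant chunks near the top.
    """
    hits = 0
    precisions: List[float] = []
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


retrieved = ["c7", "c2", "c9", "c4"]  # ranked retriever output (made up)
relevant = {"c2", "c4", "c5"}         # chunks the test case says are needed

print(round(context_recall(retrieved, relevant), 3))  # 0.667
print(context_precision(retrieved, relevant))         # 0.5
```

Here recall is 2/3 because `c5` never came back, and precision is 0.5 because the two relevant chunks sit at ranks 2 and 4 rather than 1 and 2.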
Generation metrics
Given the retrieved context, was the answer good?
- Faithfulness / groundedness — is every claim supported by the context, or did the model invent something? The core anti-hallucination metric.
- Answer relevance — does the answer actually address the question?
- Correctness — does it match the ideal answer?
Frameworks like RAGAS and TruLens automate much of this, typically with LLM-as-judge scoring. Run the suite on every change so improvements are proven and regressions are caught.
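To show the idea behind faithfulness scoring without depending on any framework's API, here is a minimal sketch: split the answer into claims, ask a judge whether each is supported by the context, and report the supported fraction. The `faithfulness` function and the `is_supported` judge interface are assumptions, and `toy_judge` is a crude substring check standing in for an LLM judge.

```python
from typing import Callable, List

def faithfulness(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of the answer's claims that the judge finds in the context."""
    if not claims:
        return 1.0  # an answer with no claims asserts nothing unsupported
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def toy_judge(claim: str, context: str) -> bool:
    # Stand-in for an LLM judge: case-insensitive substring match.
    return claim.lower() in context.lower()


# Invented context and claims.
context = "The Pro plan limit is 500 requests per day. Refunds are issued within 14 days."
claims = [
    "the pro plan limit is 500 requests per day",  # supported
    "refunds are issued within 30 days",           # contradicts the context
]

print(faithfulness(claims, context, toy_judge))  # 0.5
```

A real judge needs to handle paraphrase and partial support, which is why this step is usually delegated to an LLM; the aggregation stays the same.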
The failure-mode playbook
Diagnose RAG by isolating which stage failed:
| Symptom | Diagnosis | Fix |
|---|---|---|
| Right chunks retrieved, answer still wrong | Generation problem | Improve the prompt; lower temperature; try a stronger model |
| Needed chunks never retrieved | Retrieval problem | Better chunking, hybrid search, query rewriting |
| Right chunks retrieved but ranked low | Ordering problem | Add a reranker; raise k |
| Exact terms/IDs missed | Vector-only search | Add hybrid (keyword) search |
| Answer mixes in untrue outside facts | Faithfulness problem | Strengthen grounding prompt; enforce citations |
| Model invents an answer to an unanswerable question | No honest exit | Allow and instruct “I don’t know” |
| Slow responses | Too many stages / chunks | Trim k; cache; parallelize retrieval |
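The first rows of the table can be turned into a simple triage rule over per-case scores. This is a sketch only: the `diagnose` function, its labels, and the 0.8 threshold are assumptions to tune against your own test set.

```python
def diagnose(
    context_recall: float,
    faithfulness: float,
    correct: bool,
    threshold: float = 0.8,  # assumed cut-off, not a recommended value
) -> str:
    """Name the stage to fix first, checking retrieval before generation."""
    if context_recall < threshold:
        # Needed chunks never came back: chunking, hybrid search, query rewriting.
        return "retrieval"
    if faithfulness < threshold:
        # Context was there but the answer added unsupported facts.
        return "faithfulness"
    if not correct:
        # Right chunks, grounded answer, still wrong: prompt, temperature, model.
        return "generation"
    return "ok"


print(diagnose(context_recall=0.4, faithfulness=1.0, correct=False))  # retrieval
print(diagnose(context_recall=1.0, faithfulness=0.5, correct=False))  # faithfulness
print(diagnose(context_recall=1.0, faithfulness=1.0, correct=False))  # generation
```

Checking retrieval first matters because a generation fix cannot help when the needed chunks were never in the prompt.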
Key takeaways
Advanced RAG patterns (agentic, multi-hop, graph, contextual retrieval) each target a specific weakness; adopt one only when evaluation proves you need it. Evaluate retrieval and generation separately, using a curated test set and metrics like context recall and faithfulness. When debugging, inspect the retrieved chunks first: that tells you whether you have a retrieval or a generation problem, and that determines the fix.