Advanced RAG & Evaluation

Once a baseline RAG system runs, two things move it from “demo” to “dependable”: advanced patterns for hard cases, and evaluation so you know whether any change actually helped.

Advanced RAG patterns

Reach for these only when evaluation shows naive RAG is failing on a specific class of query.

Agentic RAG

Instead of always retrieving once, an LLM decides how to handle the query: retrieve or not, from which source, and whether to search again after seeing weak results. Retrieval becomes a tool the agent calls, and the agent can loop to correct itself. This is powerful for varied queries, but it costs more LLM calls and adds latency.
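
The retry-on-weak-results loop described above can be sketched in a few lines. This is a minimal illustration, not a full agent: `agentic_answer`, `toy_search`, `toy_rewrite`, the score threshold, and the tiny corpus are all hypothetical stand-ins, and in a real system the rewrite step (and the decision to retrieve at all) would be made by an LLM.

```python
from typing import Callable

Hit = tuple[str, float]  # (chunk text, relevance score)

def agentic_answer(
    question: str,
    search: Callable[[str], list[Hit]],
    rewrite: Callable[[str, int], str],
    min_score: float = 0.5,
    max_rounds: int = 3,
) -> tuple[list[str], int]:
    """Retrieve, check result quality, and retry with a rewritten query if weak."""
    query = question
    for round_no in range(1, max_rounds + 1):
        hits = search(query)
        good = [text for text, score in hits if score >= min_score]
        if good:
            return good, round_no  # strong enough evidence: stop searching
        query = rewrite(question, round_no)  # weak results: try a new query
    return [], max_rounds  # give up; the caller should answer "I don't know"

# Hypothetical stand-ins for a vector store and an LLM query rewriter.
def toy_search(query: str) -> list[Hit]:
    corpus = {"refund policy": "Refunds are issued within 14 days."}
    hits = [(text, 0.9) for key, text in corpus.items() if key in query.lower()]
    return hits or [("unrelated chunk", 0.1)]

def toy_rewrite(question: str, round_no: int) -> str:
    return "refund policy"

chunks, rounds = agentic_answer("Can I get my money back?", toy_search, toy_rewrite)
```

The extra round is where the added cost comes from: each retry is another search call plus, in practice, another LLM call.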

Multi-hop RAG

Some questions need chained retrieval: “What did the CEO of our biggest competitor say about pricing?” requires finding the competitor, then the CEO, then the quote. The system decomposes the question into sub-questions, retrieves for the first, reasons over the result to fill in the next, and retrieves again until the chain is complete.
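
As a rough sketch of query decomposition, the chain can be written as a list of sub-question templates where each hop fills in the answer from the previous one. The function `multi_hop`, the `{prev}` placeholder convention, and the facts in `FACTS` are invented for illustration; a real system would have an LLM produce the sub-questions and a retriever plus LLM answer each one.

```python
from typing import Callable

def multi_hop(steps: list[str], lookup: Callable[[str], str]) -> list[str]:
    """Answer sub-questions in order, substituting the previous answer into {prev}."""
    answers: list[str] = []
    prev = ""
    for template in steps:
        sub_question = template.format(prev=prev)
        prev = lookup(sub_question)  # stands in for retrieve + answer
        answers.append(prev)
    return answers

# Invented toy facts standing in for a retriever over real documents.
FACTS = {
    "Who is our biggest competitor?": "Acme Corp",
    "Who is the CEO of Acme Corp?": "Jane Doe",
    "What did Jane Doe say about pricing?": "Prices will stay flat this year.",
}

steps = [
    "Who is our biggest competitor?",
    "Who is the CEO of {prev}?",
    "What did {prev} say about pricing?",
]
hops = multi_hop(steps, lambda q: FACTS[q])
```

Each hop depends on the one before it, so an error early in the chain propagates; logging the intermediate answers in `hops` makes that visible.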

Graph RAG

When relationships between entities matter, build a knowledge graph and retrieve connected subgraphs rather than isolated text chunks. Strong for “how do these things relate” questions; heavier to build and maintain.
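
To make "retrieve connected subgraphs" concrete, here is a minimal sketch that stores the graph as (subject, relation, object) triples and collects every edge within a fixed number of hops of a starting entity. The `subgraph` function, the triple layout, and the example entities are assumptions for illustration; production systems typically use a graph database and an entity-extraction step that this sketch omits.

```python
from collections import deque

Edge = tuple[str, str, str]  # (subject, relation, object)

def subgraph(edges: list[Edge], start: str, max_hops: int = 2) -> list[Edge]:
    """Breadth-first search: return edges reachable within max_hops of start."""
    neighbors: dict[str, list[Edge]] = {}
    for edge in edges:
        subj, _, obj = edge
        neighbors.setdefault(subj, []).append(edge)
        neighbors.setdefault(obj, []).append(edge)

    seen_nodes = {start}
    found: list[Edge] = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for edge in neighbors.get(node, []):
            if edge not in found:
                found.append(edge)
            other = edge[2] if edge[0] == node else edge[0]
            if other not in seen_nodes:
                seen_nodes.add(other)
                queue.append((other, depth + 1))
    return found

# Invented example graph.
edges = [
    ("Acme", "acquired", "Beta"),
    ("Beta", "supplies", "Gamma"),
    ("Gamma", "based_in", "Oslo"),
    ("Delta", "competes_with", "Zeta"),
]
context_edges = subgraph(edges, "Acme", max_hops=2)
```

The returned triples would then be serialized into the prompt in place of, or alongside, ordinary text chunks.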

Contextual retrieval

At indexing time, an LLM adds a short, document-aware context blurb to each chunk before embedding, so an isolated chunk that says “the limit is 500” becomes “In the Pro plan billing section: the limit is 500.” The cost is one LLM call per chunk, paid once at indexing time, and the added context can noticeably improve retrieval accuracy.
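
The indexing step might look like the sketch below, which reuses the example from the text. The names `contextualize` and `template_blurb` and the chunk dictionary layout are hypothetical; `template_blurb` is a simple string template standing in for the LLM call that would normally write the blurb from the whole document.

```python
from typing import Callable

Chunk = dict[str, str]  # assumed layout: {"section": ..., "text": ...}

def contextualize(chunks: list[Chunk], make_blurb: Callable[[Chunk], str]) -> list[str]:
    """Prefix each chunk with a context blurb; the result is what gets embedded."""
    return [f"{make_blurb(chunk)}: {chunk['text']}" for chunk in chunks]

def template_blurb(chunk: Chunk) -> str:
    # Stand-in for an LLM prompt such as "situate this chunk within the document".
    return f"In the {chunk['section']} section"

chunks = [{"section": "Pro plan billing", "text": "the limit is 500"}]
texts_to_embed = contextualize(chunks, template_blurb)
```

Only the embedded text changes; the original chunk can still be stored and shown to the generator unmodified if that is preferable.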

Pattern	Solves	Cost
Agentic RAG	Varied, unpredictable queries	High — multiple LLM calls
Multi-hop	Questions needing chained facts	Medium
Graph RAG	Relationship-heavy questions	High — graph to build/maintain
Contextual retrieval	Chunks lacking standalone context	Low — one-time indexing cost

Evaluating RAG

You cannot improve RAG without measuring it, and “it seemed fine” is not a measurement. RAG has two failure surfaces — retrieval and generation — and you must score them separately to know where to fix things.

Build a test set

Assemble 30 to 100 or more representative (question, ideal answer, source chunks) triples. Include the hard cases: multi-part questions, questions your data can’t answer (where the ideal answer is “I don’t know”), and exact-term lookups. This set is the most valuable artifact in a RAG project.
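
One possible shape for such a test set is sketched below. The `RagCase` class, its field names, the tags, and the example rows (including the chunk IDs) are all illustrative assumptions; the point is only that each case records the question, the ideal answer, and the chunks that should be retrieved, with an empty chunk list marking an unanswerable question.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class RagCase:
    question: str
    ideal_answer: str
    source_chunk_ids: tuple[str, ...]  # empty means the data cannot answer this
    tags: tuple[str, ...] = ()

def tag_coverage(cases: list[RagCase]) -> Counter:
    """Count cases per tag, to check that the hard categories are represented."""
    return Counter(tag for case in cases for tag in case.tags)

# Invented example rows.
cases = [
    RagCase("What is the Pro plan limit?", "500", ("billing-03",), ("exact-term",)),
    RagCase(
        "What is the Pro plan limit and how is it billed?",
        "500, billed monthly",
        ("billing-03", "billing-07"),
        ("multi-part",),
    ),
    RagCase("What is the CEO's favorite color?", "I don't know", (), ("unanswerable",)),
]
coverage = tag_coverage(cases)
```

Keeping the cases in version control next to the code makes it easy to rerun the same set after every change.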

Retrieval metrics

Did the right chunks come back?

Context recall — were all the chunks needed to answer actually retrieved?
Context precision — of the retrieved chunks, how many were relevant (and ranked high)?
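
When the test set records which chunk IDs each question needs, simple ID-based versions of both metrics are easy to compute. The sketch below is one such version and assumes retrieved IDs are unique; frameworks often define these metrics differently (for example, with an LLM judging relevance), so treat the function names and formulas here as illustrative rather than standard.

```python
def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the needed chunks that were actually retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved)) / len(relevant)

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the retrieved chunks that were relevant."""
    if not retrieved:
        return 0.0
    return sum(cid in relevant for cid in retrieved) / len(retrieved)

def rank_weighted_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant chunk appears.

    Rewards rankings that place relevant chunks near the top.
    """
    hits = 0
    precisions: list[float] = []
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

retrieved = ["c2", "c7", "c4", "c9"]  # ranked best-first
relevant = {"c2", "c4", "c5"}         # from the test set

recall = context_recall(retrieved, relevant)        # 2 of 3 needed chunks found
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved were relevant
ranked = rank_weighted_precision(retrieved, relevant)
```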

Generation metrics

Given the retrieved context, was the answer good?

Faithfulness / groundedness — is every claim supported by the context, or did the model invent something? The core anti-hallucination metric.
Answer relevance — does the answer actually address the question?
Correctness — does it match the ideal answer?
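
Faithfulness is commonly scored as the share of claims in the answer that the context supports. The sketch below shows only that arithmetic: the `faithfulness` function and the `substring_judge` check are hypothetical, the claims are supplied by hand, and in practice both claim extraction and the support check would usually be done by an LLM judge rather than by substring matching.

```python
from typing import Callable

def faithfulness(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer claims that the judge finds supported by the context."""
    if not claims:
        return 1.0  # an answer with no claims invents nothing
    supported = sum(is_supported(claim, context) for claim in claims)
    return supported / len(claims)

def substring_judge(claim: str, context: str) -> bool:
    # Crude stand-in for an LLM judge: exact, case-insensitive containment.
    return claim.lower() in context.lower()

# Invented example context and claims.
context = "The Pro plan limit is 500 requests per day. Billing is monthly."
claims = [
    "limit is 500 requests per day",
    "billing is monthly",
    "refunds take 3 days",  # not in the context: an invented claim
]
score = faithfulness(claims, context, substring_judge)
```

Because the judge is passed in as a function, the same scoring code can be reused when the substring check is replaced by a model call.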

Frameworks like RAGAS and TruLens automate much of this, typically with LLM-as-judge scoring. Run the suite on every change so improvements are proven and regressions are caught.
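
Running the suite on every change is most useful when something compares the new scores with the previous ones. A minimal sketch of that comparison follows; the function name `regressions`, the metric names, the scores, and the tolerance are all made up for illustration, and the scores themselves would come from whichever evaluation framework is in use.

```python
def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, float]]:
    """Return metrics whose score dropped by more than `tolerance`.

    A metric missing from `candidate` is treated as a score of 0.0.
    """
    return {
        name: (old, candidate.get(name, 0.0))
        for name, old in baseline.items()
        if candidate.get(name, 0.0) < old - tolerance
    }

# Invented scores for two runs of the same test set.
baseline = {"context_recall": 0.80, "faithfulness": 0.90}
candidate = {"context_recall": 0.86, "faithfulness": 0.84}

failed = regressions(baseline, candidate)
```

Here retrieval improved while faithfulness dropped, which is exactly the kind of trade-off that goes unnoticed without separate scores. The tolerance exists because LLM-as-judge scores vary slightly between runs.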

The failure-mode playbook

Diagnose RAG by isolating which stage failed:

Symptom	Diagnosis	Fix
Right chunks retrieved, answer still wrong	Generation problem	Improve the prompt; lower temperature; try a stronger model
Needed chunks never retrieved	Retrieval problem	Better chunking, hybrid search, query rewriting
Right chunks retrieved but ranked low	Ordering problem	Add a reranker; raise k
Exact terms/IDs missed	Vector-only search	Add hybrid (keyword) search
Answer mixes in untrue outside facts	Faithfulness problem	Strengthen grounding prompt; enforce citations
Model invents an answer to an unanswerable question	No honest exit	Allow and instruct “I don’t know”
Slow responses	Too many stages / chunks	Trim k; cache; parallelize retrieval
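
The first three rows of the playbook can be turned into a small triage function that runs over failed test cases. This is a sketch under assumptions: `diagnose`, its parameters, and the choice of `top_n` as the cutoff for "ranked high" are hypothetical, and it only separates retrieval, ordering, and generation failures.

```python
def diagnose(
    retrieved: list[str],
    needed: set[str],
    answer_ok: bool,
    top_n: int = 3,
) -> str:
    """Classify a test-case result by the stage that most likely failed."""
    if answer_ok:
        return "ok"
    if needed - set(retrieved):
        return "retrieval problem"   # needed chunks never came back
    if not needed <= set(retrieved[:top_n]):
        return "ordering problem"    # retrieved, but ranked below the cutoff
    return "generation problem"      # right chunks on top, answer still wrong

# Invented chunk IDs; retrieved lists are ranked best-first.
results = [
    diagnose(["a", "b", "c"], {"a"}, answer_ok=True),
    diagnose(["x", "y"], {"a"}, answer_ok=False),
    diagnose(["x", "y", "z", "a"], {"a"}, answer_ok=False),
    diagnose(["a", "x"], {"a"}, answer_ok=False),
]
```

Counting the labels across the whole test set shows which stage deserves attention first.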

Key takeaways

Advanced RAG patterns (agentic, multi-hop, graph, contextual retrieval) each target a specific weakness; adopt one only when evaluation proves you need it. Evaluate retrieval and generation separately, using a curated test set and metrics like context recall and faithfulness. When debugging, inspect the retrieved chunks first: they tell you whether you have a retrieval or a generation problem, and that determines the fix.