
Advanced RAG & Evaluation

Once a baseline RAG system runs, two things move it from “demo” to “dependable”: advanced patterns for hard cases, and evaluation so you know whether any change actually helped.

Reach for the advanced patterns below only when evaluation shows naive RAG is failing on a specific class of query.

In agentic RAG, an LLM decides how to handle each query instead of always retrieving once: whether to retrieve at all, from which source, and whether to search again after seeing weak results. Retrieval becomes a tool an agent calls, with self-correcting loops. This is powerful for varied queries, but it costs more LLM calls and adds latency.
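The self-correcting loop can be sketched as below. This is a minimal illustration, not a definitive implementation: `agentic_retrieve`, `retrieve`, `is_sufficient`, and `rewrite` are hypothetical names, and in a real system the last two would be LLM calls rather than the toy lambdas used here. It covers only the "search again after weak results" decision, not source selection.

```python
from typing import Callable

def agentic_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    rewrite: Callable[[str], str],
    max_rounds: int = 3,
) -> tuple[list[str], int]:
    """Retrieve, judge the results, and retry with a rewritten query if they look weak."""
    chunks: list[str] = []
    current = query
    for round_no in range(1, max_rounds + 1):
        chunks = retrieve(current)
        if is_sufficient(query, chunks):
            return chunks, round_no
        current = rewrite(current)  # assumed to be an LLM call in practice
    return chunks, max_rounds  # give up after max_rounds to bound cost and latency

# Toy stand-ins so the loop can run without a model or an index.
docs = {"refund policy": ["Refunds are issued within 14 days."]}
chunks, rounds = agentic_retrieve(
    "can I get my money back",
    retrieve=lambda q: docs.get(q, []),
    is_sufficient=lambda q, found: len(found) > 0,
    rewrite=lambda q: "refund policy",
)
```

The `max_rounds` cap is the design choice that matters most here: every extra round is another retrieval plus at least one more model call.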

Multi-hop retrieval handles questions that need chained lookups: “What did the CEO of our biggest competitor say about pricing?” requires finding the competitor, then the CEO, then the quote. The system decomposes the query, retrieves, reasons over the partial result, and retrieves again.
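A rough sketch of the chained lookup follows. It assumes the decomposition into sub-questions has already been done (in practice by an LLM) and that each hop's answer feeds the next through a `{prev}` placeholder. The `multi_hop` function, the `facts` dictionary, and every value in it are hypothetical.

```python
from typing import Callable

def multi_hop(steps: list[str], lookup: Callable[[str], str]) -> list[str]:
    """Answer sub-questions in order, substituting the previous answer for '{prev}'."""
    answers: list[str] = []
    prev = ""
    for template in steps:
        sub_question = template.replace("{prev}", prev)
        prev = lookup(sub_question)  # stands in for retrieve + partial reasoning
        answers.append(prev)
    return answers

# Toy knowledge base; names and values are invented for illustration.
facts = {
    "biggest competitor": "Acme",
    "CEO of Acme": "J. Doe",
    "J. Doe on pricing": "(quote text would be retrieved here)",
}
steps = ["biggest competitor", "CEO of {prev}", "{prev} on pricing"]
hops = multi_hop(steps, lambda q: facts[q])
```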

Graph RAG applies when relationships between entities matter: build a knowledge graph and retrieve connected subgraphs rather than isolated text chunks. It is strong for “how do these things relate” questions, but heavier to build and maintain.
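One way to picture "retrieve a connected subgraph" is a bounded breadth-first walk from the entity named in the query. The sketch below assumes the graph is a plain adjacency dictionary. The `connected_subgraph` function and the toy entities are hypothetical, and a production system would more likely query a graph database.

```python
from collections import deque

Edge = tuple[str, str, str]  # (source, relation, target)

def connected_subgraph(
    graph: dict[str, list[tuple[str, str]]], start: str, max_hops: int = 2
) -> list[Edge]:
    """Collect every edge reachable within max_hops of the start entity."""
    seen = {start}
    edges: list[Edge] = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for relation, target in graph.get(node, []):
            edges.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges

# Toy graph; entities and relations are invented for illustration.
graph = {
    "Pro plan": [("includes", "API access"), ("billed_by", "Billing service")],
    "API access": [("limited_by", "Rate limit")],
    "Rate limit": [("documented_in", "Limits page")],
}
subgraph = connected_subgraph(graph, "Pro plan", max_hops=2)
```

The returned edges, rather than isolated chunks, would then be serialized into the prompt.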

Contextual retrieval works at indexing time: an LLM adds a short, document-aware context blurb to each chunk before embedding, so an isolated chunk that says “the limit is 500” becomes “In the Pro plan billing section: the limit is 500.” It is a one-time indexing cost that can raise retrieval accuracy for chunks that lack standalone context.
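The indexing step can be sketched as follows. This assumes a `describe(document, chunk)` callable that returns the blurb. In practice that would be an LLM call, and here it is a hard-coded stub. The function name `contextualize` is hypothetical.

```python
from typing import Callable

def contextualize(
    document: str, chunks: list[str], describe: Callable[[str, str], str]
) -> list[str]:
    """Prefix each chunk with a short context blurb; the result is what gets embedded."""
    return [f"{describe(document, chunk)}: {chunk}" for chunk in chunks]

# Stub in place of the LLM: always returns the same blurb.
stub_describe = lambda document, chunk: "In the Pro plan billing section"
enriched = contextualize("(full document text)", ["the limit is 500"], stub_describe)
```

Only the text that is embedded changes. The original chunk can still be stored separately for display.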

Pattern | Solves | Cost
Agentic RAG | Varied, unpredictable queries | High: multiple LLM calls
Multi-hop | Questions needing chained facts | Medium
Graph RAG | Relationship-heavy questions | High: graph to build/maintain
Contextual retrieval | Chunks lacking standalone context | Low: one-time indexing cost

You cannot improve RAG without measuring it, and “it seemed fine” is not a measurement. RAG has two failure surfaces — retrieval and generation — and you must score them separately to know where to fix things.

Assemble 30–100+ representative (question, ideal answer, source chunks) triples. Include the hard cases: multi-part questions, questions your data can’t answer, and exact-term lookups. This set is the most valuable artifact in a RAG project.
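A test set like this can be kept as plain structured records. The sketch below is one possible shape, not a prescribed schema: the `EvalCase` class, its field names, the chunk IDs, and the example questions are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    question: str
    ideal_answer: str
    source_chunk_ids: tuple[str, ...]  # empty when the data cannot answer the question
    tags: tuple[str, ...] = ()

# Invented entries; a real set would hold 30-100+ cases drawn from actual usage.
test_set = [
    EvalCase("What is the limit on the Pro plan?", "The limit is 500.",
             ("billing-03",), ("exact-term",)),
    EvalCase("What is the Pro limit, and where is it documented?",
             "(ideal answer written by a reviewer)",
             ("billing-03", "billing-07"), ("multi-part",)),
    EvalCase("What will the limit be next year?", "I don't know.",
             (), ("unanswerable",)),
]

# Hard cases can be selected by tag or by the absence of source chunks.
unanswerable = [case for case in test_set if not case.source_chunk_ids]
```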

Did the right chunks come back?

  • Context recall — were all the chunks needed to answer actually retrieved?
  • Context precision — of the retrieved chunks, how many were relevant (and ranked high)?
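Both retrieval metrics can be computed directly from chunk IDs. The sketch below uses one common rank-aware formulation of precision (the mean of precision@k at each rank that holds a relevant chunk). Evaluation frameworks may define it differently, so treat the function names and the exact formula as assumptions.

```python
def context_recall(retrieved: list[str], needed: set[str]) -> float:
    """Share of the needed chunks that were actually retrieved."""
    if not needed:
        return 1.0  # nothing was needed, so nothing was missed
    return len(needed & set(retrieved)) / len(needed)

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for k, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            hits += 1
            total += hits / k  # precision at this rank
    return total / hits if hits else 0.0

# Hypothetical chunk IDs in ranked order.
retrieved = ["c7", "c2", "c9", "c4"]
needed = {"c2", "c4", "c5"}
recall = context_recall(retrieved, needed)        # 2 of 3 needed chunks found
precision = context_precision(retrieved, needed)  # relevant hits at ranks 2 and 4
```

Because the precision score rewards relevant chunks that appear early, moving `c2` to rank 1 would raise it without changing recall.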

Given the retrieved context, was the answer good?

  • Faithfulness / groundedness — is every claim supported by the context, or did the model invent something? The core anti-hallucination metric.
  • Answer relevance — does the answer actually address the question?
  • Correctness — does it match the ideal answer?
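Faithfulness is often scored as the fraction of claims in the answer that the context supports. The sketch below assumes the answer has already been split into claims and takes the support check as a pluggable `supports` callable. In practice that would be an LLM judge; the substring stub here is a crude stand-in used only so the example runs.

```python
from typing import Callable

def faithfulness(
    claims: list[str], context: str, supports: Callable[[str, str], bool]
) -> float:
    """Fraction of answer claims that the judge marks as supported by the context."""
    if not claims:
        return 1.0  # an answer with no claims asserts nothing unsupported
    supported = sum(1 for claim in claims if supports(claim, context))
    return supported / len(claims)

# Stub judge: case-insensitive substring match (NOT a real groundedness check).
stub_judge = lambda claim, context: claim.lower() in context.lower()

context = "The Pro plan limit is 500."
claims = ["limit is 500", "limit resets weekly"]  # second claim is invented by the model
score = faithfulness(claims, context, stub_judge)
```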

Frameworks like RAGAS and TruLens automate much of this, typically with LLM-as-judge scoring. Run the suite on every change so improvements are proven and regressions are caught.
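Running the suite on every change is most useful when the scores are compared against a stored baseline. A minimal sketch, assuming scores are kept as a metric-name-to-value dictionary; the `find_regressions` name, the tolerance, and the numbers are all hypothetical.

```python
def find_regressions(
    baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.02
) -> dict[str, float]:
    """Return metrics that dropped by more than tolerance, mapped to the size of the drop."""
    return {
        name: round(baseline[name] - score, 4)
        for name, score in current.items()
        if name in baseline and baseline[name] - score > tolerance
    }

# Invented scores for illustration only.
baseline = {"context_recall": 0.80, "faithfulness": 0.90}
current = {"context_recall": 0.85, "faithfulness": 0.84}
regressions = find_regressions(baseline, current)
```

A non-empty result could fail a CI job. The tolerance absorbs the run-to-run noise that LLM-as-judge scoring tends to produce.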

Diagnose RAG by isolating which stage failed:

Symptom | Diagnosis | Fix
Right chunks retrieved, answer still wrong | Generation problem | Improve the prompt; lower temperature; try a stronger model
Needed chunks never retrieved | Retrieval problem | Better chunking, hybrid search, query rewriting
Right chunks retrieved but ranked low | Ordering problem | Add a reranker; raise k
Exact terms/IDs missed | Vector-only search | Add hybrid (keyword) search
Answer mixes in untrue outside facts | Faithfulness problem | Strengthen grounding prompt; enforce citations
Model invents an answer to an unanswerable question | No honest exit | Allow and instruct “I don’t know”
Slow responses | Too many stages / chunks | Trim k; cache; parallelize retrieval
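Part of this triage can be automated per test case. The sketch below covers only the rows that can be decided from chunk IDs and a correctness flag. The `diagnose` function, its `top_n` cutoff for "ranked low", and the chunk IDs are assumptions for illustration.

```python
def diagnose(
    needed: set[str], retrieved: list[str], answer_correct: bool, top_n: int = 3
) -> str:
    """Map one failed test case to the stage that most likely failed."""
    if answer_correct:
        return "ok"
    if not needed:
        return "no honest exit"      # unanswerable question, yet the model answered wrongly
    if not needed <= set(retrieved):
        return "retrieval problem"   # some needed chunk never came back
    if not needed <= set(retrieved[:top_n]):
        return "ordering problem"    # everything came back, but some of it ranked low
    return "generation problem"      # right chunks, ranked high, answer still wrong

# Hypothetical chunk IDs in ranked order.
ranked = ["c1", "c2", "c3", "c4", "c5"]
results = [
    diagnose({"c1", "c2"}, ranked, answer_correct=False),
    diagnose({"c9"}, ranked, answer_correct=False),
    diagnose({"c1", "c5"}, ranked, answer_correct=False),
    diagnose(set(), ranked, answer_correct=False),
]
```

Checking retrieval before generation mirrors the order of the pipeline: a generation diagnosis is only meaningful once the needed chunks are known to be present and ranked high.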

Advanced RAG patterns (agentic, multi-hop, graph, contextual retrieval) each target a specific weakness; adopt one only when evaluation proves you need it. Evaluate retrieval and generation separately, using a curated test set and metrics like context recall and faithfulness. When debugging, inspect the retrieved chunks first: they tell you whether you have a retrieval problem or a generation problem, and that determines the fix.