Why Your RAG System Returns Garbage (and Where to Actually Look)

A retrieval-augmented generation system gives a confidently wrong answer, and the team’s first move is to rewrite the prompt. It rarely helps, because the prompt is the last link in the chain, not the first. If the right information never made it into the context window, no amount of prompt engineering will conjure it back.

RAG fails in a predictable order, and debugging it in that order will save you weeks. Here’s where to look, top to bottom.

First: did retrieval even find the right chunk?

Before touching anything else, answer one question — for a query that’s giving a bad answer, is the correct information in the retrieved chunks at all?

Log the actual chunks your retriever returns for a failing query and read them. This single step resolves most RAG debugging, because it cleanly splits the problem in two: either the right context was retrieved and the model mishandled it (a generation problem), or the right context was never retrieved (a retrieval problem). These have completely different fixes, and you cannot tell which you have without looking.
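
One way to make this check routine is a small helper that prints what the retriever returned and flags whether a known-good snippet shows up in any chunk. This is a minimal sketch, not a definitive implementation: `retrieve` is a hypothetical stand-in for your own retriever (assumed to return the top-k chunks as plain strings), and the substring match is only a crude proxy for "the right information is present".

```python
from typing import Callable, List


def diagnose(query: str, expected_snippet: str,
             retrieve: Callable[[str, int], List[str]], k: int = 5) -> str:
    """Print the retrieved chunks for a failing query and classify the failure.

    `retrieve` is a hypothetical callable standing in for your retriever.
    """
    chunks = retrieve(query, k)
    for rank, chunk in enumerate(chunks, start=1):
        print(f"[{rank}] {chunk[:200]}")
    # Crude proxy: case-insensitive substring match on a known-good snippet.
    found = any(expected_snippet.lower() in c.lower() for c in chunks)
    return "generation problem" if found else "retrieval problem"


# Toy corpus and retriever, for illustration only.
corpus = [
    "Refunds are processed within 14 days of the return being received.",
    "Shipping is free on orders over $50.",
    "Support is available Monday through Friday.",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    # Returns the first k chunks regardless of the query.
    return corpus[:k]


print(diagnose("How long do refunds take?", "within 14 days", toy_retrieve, k=2))
```

Reading the printed chunks yourself still matters; the returned label only tells you which half of the problem to look at first.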

In my experience the answer is “the right chunk wasn’t retrieved” far more often than teams expect. In those cases, the prompt was never the issue.

The retrieval problems, in order of frequency

When the right chunk isn’t being found, it’s usually one of these:

Chunking that split the answer. The information that answers the question got cut across two chunks, so neither one is a strong match on its own. This is the most common and most overlooked cause. Naive fixed-size splitting cuts through the middle of the very passages that hold complete answers.
A query–document mismatch. The user asks in plain language; the document is written in formal or technical terms. Pure semantic similarity misses the connection. This is what hybrid search — combining keyword and semantic matching — exists to fix.
Too few chunks retrieved. Pulling the top 3 when the answer needed context from the top 8. Cheap to test: just raise the count and see if quality jumps.
No reranking. Your vector search returns roughly relevant chunks, but the best one is ranked fourth and falls outside your cutoff. A reranking pass over the top candidates fixes ordering cheaply.
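
For the split-answer case, one common mitigation is to chunk on sentence boundaries and let neighboring chunks share a sentence or two, so a passage that straddles a boundary still appears whole in at least one chunk. The sketch below assumes a naive regex sentence splitter and illustrative window sizes; `chunk_sentences` and its parameters are hypothetical names, not part of any library.

```python
import re
from typing import List


def chunk_sentences(text: str, max_sentences: int = 4, overlap: int = 1) -> List[str]:
    """Group sentences into chunks, repeating `overlap` sentences between neighbors."""
    if overlap >= max_sentences:
        raise ValueError("overlap must be smaller than max_sentences")
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # this window already reached the end of the text
    return chunks


text = "A one. B two. C three. D four. E five."
for chunk in chunk_sentences(text, max_sentences=3, overlap=1):
    print(chunk)
```

Whether overlap is enough, or whether you need structure-aware splitting along headings or paragraphs, depends on your documents; checking the retrieved chunks for failing queries is how you find out.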

Notice that none of these are prompt problems. They are all problems with what got into the window.
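
For the query-document mismatch in particular, hybrid search needs some way to merge a keyword ranking with a semantic ranking. Reciprocal rank fusion is one widely used option (a choice made here for illustration, not something the steps above prescribe): each document scores the sum of 1 / (k + rank) across the rankings it appears in. In this sketch the document IDs are made up, and k = 60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["doc_b", "doc_a", "doc_c"]
semantic_hits = ["doc_a", "doc_d", "doc_b"]

print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Documents that both retrievers rank well rise to the top, which is the behavior you want when neither signal is reliable on its own.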

Then, and only then: generation

If you’ve confirmed the right context is being retrieved and the answer is still wrong, now the generation layer is worth your attention:

The model has the right chunk but ignores it in favor of its own training knowledge. Instruct it to answer only from the provided context, and to say it doesn’t know when the context doesn’t cover the question.
The context is there but buried among many irrelevant chunks, diluting the signal. This is an argument for retrieving fewer, better chunks — quality over quantity in the window.
The model is hedging or padding. That’s a genuine prompt fix, and at this point in the chain it’s the right one.
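
The first fix above, answering only from the provided context, can be sketched as a small prompt builder. The wording of the instruction and the name `build_grounded_prompt` are assumptions for illustration; adapt the phrasing to whatever your model responds to best.

```python
from typing import List


def build_grounded_prompt(question: str, chunks: List[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    # Number the chunks so the model (and you) can refer to them.
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "How long do refunds take?",
    ["Refunds are processed within 14 days.", "Shipping is free on orders over $50."],
)
print(prompt)
```

Keeping the chunk list short here also addresses the dilution problem: the fewer irrelevant chunks in the context, the less there is to distract the model.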

Build the eval before you tune

The reason RAG debugging spirals is that teams change things and judge the result by vibes on a handful of queries. Improvement on three examples often means regression on thirty others you didn’t check.

Assemble a set of real questions with known-good answers — even twenty is enough to start. Run every change against it. When you adjust chunking, you’ll see immediately whether retrieval got better or just different. Without this, you’re optimizing blind, and you’ll chase your tail indefinitely.
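
A fixed eval set does not need much machinery. The sketch below measures one thing only: the fraction of questions for which a known-good snippet appears in the top-k retrieved chunks. Everything in it is illustrative: `retrieve` is a hypothetical stand-in for your retriever, the toy corpus and questions are invented, and substring matching is a rough proxy you may want to replace with labeled chunk IDs.

```python
from typing import Callable, Dict, List


def retrieval_hit_rate(eval_set: List[Dict[str, str]],
                       retrieve: Callable[[str, int], List[str]],
                       k: int = 5) -> float:
    """Fraction of questions whose expected snippet appears in the top-k chunks."""
    if not eval_set:
        return 0.0
    hits = 0
    for case in eval_set:
        chunks = retrieve(case["question"], k)
        if any(case["expected_snippet"].lower() in c.lower() for c in chunks):
            hits += 1
    return hits / len(eval_set)


# Toy corpus and word-overlap retriever, for illustration only.
corpus = [
    "Refunds are processed within 14 days.",
    "Shipping is free on orders over $50.",
    "Support is open Monday through Friday.",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda c: -len(words & set(c.lower().split())))
    return ranked[:k]


eval_set = [
    {"question": "when are refunds processed", "expected_snippet": "within 14 days"},
    {"question": "is shipping free", "expected_snippet": "orders over $50"},
    {"question": "what is the warranty period", "expected_snippet": "two-year warranty"},
]

print(f"hit rate at k=1: {retrieval_hit_rate(eval_set, toy_retrieve, k=1):.2f}")
```

Run the same function before and after each change to chunking, search, or k, and compare the numbers rather than impressions from a few queries.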

The order that saves weeks

1. Log and read the retrieved chunks for failing queries.
2. If the right chunk is missing: fix chunking, add hybrid search, retrieve more, rerank.
3. Only if the right chunk is present and the answer is still wrong: fix generation.
4. Hold all of it to a fixed eval set so you can tell improvement from noise.

Almost every “our RAG doesn’t work” problem resolves in steps one and two. The prompt, the part everyone reaches for first, is usually the last thing that needed changing.

If your RAG system is returning answers you can’t trust and you’re not sure where the breakdown is, that’s exactly the kind of thing I help teams diagnose. The first consultation is free — get in touch.