Your RAG Problem Might Be a Data Issue, Not a Retrieval Issue

My blog has a chat widget. It's a RAG system that answers questions using my posts. Today I asked it about Model Armor, a Google Cloud AI safety service I'd only mentioned in a blog draft, and it told me with total confidence that Model Armor "screens prompts and responses and runs scale-to-zero on Cloud Run." Half right. Model Armor is a managed API; the thing running on Cloud Run is my agent, not Model Armor. The bot had welded two facts into one wrong one.

[Screenshot: the widget confidently claiming Model Armor runs scale-to-zero on Cloud Run.]

I did what most people do when a RAG answer is wrong: I assumed retrieval was the problem. I added hybrid search. I tightened the system prompt. I even built a second LLM pass that verifies every answer against its retrieved context before returning it. The answers got shorter and more careful — and stayed wrong. Because the problem was never retrieval. It was the data, in two separate ways.

Lesson 1: your grounding data can't be ambiguous

Here's the sentence the bot was reading, lifted from an old summary bullet in my post:

"Model Armor screening prompts and responses, all of it scale-to-zero on Cloud Run."

I meant "all of it" to describe the whole build. But read it cold, the sentence bolts two distinct ideas together, and the second one sits right next to Model Armor. RAG doesn't retrieve ideas; it retrieves chunks of text. When a chunk fuses two thoughts, the model gets the fusion handed to it and faithfully attaches the trailing detail to the subject. It wasn't hallucinating. It was being loyal to muddy ground truth.

That reframed how I write. Your posts, your docs, your wiki: the moment they feed a RAG system, they're also a database, so write them like one. One idea per sentence. Don't park an unrelated attribute next to a subject in a list and trust the reader to untangle it, because the reader is now a language model that won't. The ambiguity a human glides past is exactly what a RAG pipeline amplifies.

Lesson 2: a stale index is poison

Then it got worse, and more interesting. I had already edited that sentence out of the post. The live post didn't contain it. The widget served it anyway.

The culprit was my own indexing code. When I re-indexed a post, I updated its chunks in place, but I never deleted the chunks that no longer existed in the post. Shorten a post and the now-orphaned chunks just sit in the vector store forever, fully retrievable. I went looking, and there it was: a ghost chunk of a draft I'd long since rewritten, still answering questions on my behalf.

An index is a cache of your content. The instant it drifts from the source of truth, you're confidently serving deleted or outdated material, and no amount of clever ranking saves you — you just rank the wrong thing more efficiently. The fix was to make indexing authoritative: replace each post's chunks wholesale on every update, and prune chunks for anything no longer published. Delete-then-insert, not patch-in-place.
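
Delete-then-insert can be sketched in a few lines. This is a minimal illustration, not the widget's actual indexing code: a plain dict stands in for the vector store, and `InMemoryIndex`, `replace_post`, and `prune_unpublished` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Toy stand-in for a vector store: maps post_id -> list of chunk texts."""
    chunks_by_post: dict[str, list[str]] = field(default_factory=dict)

    def replace_post(self, post_id: str, chunks: list[str]) -> None:
        # Delete-then-insert: drop every old chunk for this post,
        # then store the new set. Nothing from the old version survives.
        self.chunks_by_post.pop(post_id, None)
        self.chunks_by_post[post_id] = list(chunks)

    def prune_unpublished(self, published_ids: set[str]) -> list[str]:
        # Remove chunks for any post that is no longer published.
        stale = [pid for pid in self.chunks_by_post if pid not in published_ids]
        for pid in stale:
            del self.chunks_by_post[pid]
        return stale

    def all_chunks(self) -> list[str]:
        return [c for chunks in self.chunks_by_post.values() for c in chunks]


index = InMemoryIndex()
index.replace_post("post-1", ["draft chunk A", "draft chunk B"])
index.replace_post("post-1", ["rewritten chunk A"])  # the post shrank
index.replace_post("post-2", ["another chunk"])
removed = index.prune_unpublished({"post-1"})  # post-2 was unpublished
```

After the second `replace_post` call, "draft chunk B" is gone because the whole post was replaced, not patched. A patch-in-place update keyed by chunk position would have left it behind.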

[Screenshot: after the fix, the same question correctly returns "that topic isn't covered in my blog posts."]

The part nobody wants to hear

There is a genuinely deep menu of retrieval tricks, and I use several: hybrid search (dense embeddings plus keyword/BM25, fused with reciprocal rank fusion), cross-encoder rerankers, query rewriting like HyDE and multi-query, smarter chunking, contextual compression, metadata filters. They all work. But every one of them is in the business of ranking the right chunk higher. Not one helps if the right chunk is wrong, or if it should have been deleted three edits ago. Garbage in, ranked garbage out. Fix the data first; the retrieval cleverness is the finish, not the foundation.
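
To make the point concrete, here is a small sketch of reciprocal rank fusion. It assumes each retriever returns an ordered list of chunk IDs; the function name, the sample IDs, and the constant `k = 60` (a commonly used default) are illustrative choices, not taken from my setup.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical results: "ghost" is a stale chunk that still lives in the index.
dense = ["ghost", "a", "b"]
keyword = ["ghost", "b", "c"]
fused = reciprocal_rank_fusion([dense, keyword])
```

Both retrievers agree on the stale chunk, so fusion promotes it to the top. Nothing in the ranking math can know the chunk should have been deleted.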

Keeping the index honest at scale

My fix re-indexes the entire corpus on every change. For a couple dozen posts, that's fine. At scale it's wasteful — you don't want to re-embed your whole knowledge base because someone fixed a typo. The patterns that scale:

Event-driven indexing tied to the content lifecycle: on create or update, re-embed only the changed document; on delete or unpublish, remove its chunks.
Per-document delete-then-insert keyed by a stable ID, so a shrinking document can never orphan.
Content hashing, so you skip re-embedding chunks whose text didn't actually change and save the API bill.
A periodic reconciliation job that diffs the source of truth against the index by counts and hashes, catching the drift your event stream missed.
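
The second and third patterns combine naturally. The sketch below assumes a nested dict as the store (document ID, then chunk hash, then vector) and a caller-supplied `embed` function; `sync_document`, `chunk_hash`, and `fake_embed` are hypothetical names, and a real vector store would expose its own delete and upsert calls.

```python
import hashlib


def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_document(store, doc_id, chunks, embed) -> int:
    """Make store[doc_id] match `chunks` exactly.

    Returns how many chunks had to be re-embedded.
    """
    old = store.get(doc_id, {})
    new = {}
    embedded = 0
    for text in chunks:
        h = chunk_hash(text)
        if h in new:
            continue  # duplicate text within the same document
        if h in old:
            new[h] = old[h]  # unchanged text: reuse the stored vector
        else:
            new[h] = embed(text)  # new or edited text: pay for an embedding
            embedded += 1
    # Wholesale replace keyed by the stable doc_id, so removed chunks cannot linger.
    store[doc_id] = new
    return embedded


def fake_embed(text: str) -> list[float]:
    # Placeholder for a real embedding API call.
    return [float(len(text))]


store = {}
first = sync_document(store, "post-1", ["intro", "details", "summary"], fake_embed)
second = sync_document(store, "post-1", ["intro", "details v2"], fake_embed)
```

The first sync embeds three chunks. The second embeds only the edited one, and the dropped "summary" chunk disappears from the store.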

At real volume this becomes a pipeline: change data capture from the source database, a queue, embedding workers, idempotent upserts into the vector store. The principle underneath all of it is the same — the index is downstream of the truth, and it needs a process that keeps it that way.
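
The reconciliation job can start out very small. This sketch assumes you can list a document ID to content hash mapping from both the source of truth and the index; `reconcile` and the sample hashes are made up for illustration.

```python
def reconcile(source: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare doc_id -> content hash maps from the source and the index."""
    shared = source.keys() & indexed.keys()
    return {
        # Published but never indexed: the event stream dropped a create.
        "missing": sorted(source.keys() - indexed.keys()),
        # Indexed but no longer published: the event stream dropped a delete.
        "orphaned": sorted(indexed.keys() - source.keys()),
        # Present in both, but the content changed: a dropped update.
        "stale": sorted(d for d in shared if source[d] != indexed[d]),
    }


source = {"a": "h1", "b": "h2", "c": "h3"}
indexed = {"a": "h1", "b": "old", "d": "h9"}
report = reconcile(source, indexed)
```

Each bucket maps to one repair action: index the missing documents, delete the orphaned ones, and re-sync the stale ones.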

Let your users find the rot

One more, because automation only goes so far. The cheapest, highest-signal bug detector I have is the person reading the answer. I'm adding a thumbs up/down to the widget that logs the question and the exact chunks that grounded the answer. A thumbs down becomes a triage ticket: bad grounding (fix the source), stale index (re-index), or a true retrieval miss (tune the ranking). And those captured failures aren't just bug reports — they're the start of an eval set, or few-shot examples to harden the prompt.
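
A rough sketch of what that feedback record might look like, since the feature isn't built yet. Every name here (`FeedbackEvent`, `Triage`, `first_pass_triage`, the chunk ID format) is hypothetical. The one case a machine can label on its own is a stale index; telling bad grounding from a retrieval miss still takes a person reading the chunks.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Triage(Enum):
    BAD_GROUNDING = "fix the source"
    STALE_INDEX = "re-index"
    RETRIEVAL_MISS = "tune the ranking"


@dataclass
class FeedbackEvent:
    question: str
    chunk_ids: list[str]  # the exact chunks that grounded the answer
    thumbs_up: bool


def first_pass_triage(event: FeedbackEvent, live_chunk_ids: set[str]) -> Optional[Triage]:
    """Auto-label only what can be checked mechanically; None means 'needs a human'."""
    if event.thumbs_up:
        return None
    if any(cid not in live_chunk_ids for cid in event.chunk_ids):
        # The answer was grounded on a chunk the source no longer contains.
        return Triage.STALE_INDEX
    return None


event = FeedbackEvent("What is X?", ["post-1#0", "post-1#7"], thumbs_up=False)
label = first_pass_triage(event, live_chunk_ids={"post-1#0", "post-1#1"})
```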

Clean data and a fresh index are the foundation. Retrieval tuning is the polish. Your users are the smoke detector. Build all three, but build them in that order — because a RAG problem is a data problem long before it's a retrieval problem.