A week ago I added a RAG-powered chat widget to this site. You've probably noticed the floating button in the corner. Ask it a question and it answers from my blog posts - not from the internet or training data, just from what I've actually written here. I want to walk through how it works, because the implementation makes some deliberate choices worth explaining. One in particular: it costs almost nothing to run.
What I Didn't Build
I wanted it to be near-zero cost so I didn't use Gemini Enterprise Agent Search, a BigQuery embedding pipeline, or a dedicated vector database like Pinecone. Those are fine products for the right scale. A personal blog is not that scale.
What It Actually Uses
The entire stack runs as Docker containers on the same GCP VM that already hosts this site:
- pgvector - a Postgres extension that adds a vector column type and cosine similarity search. One additional container, zero new services, zero new bills.
- Python FastAPI - a lightweight service that handles ingestion and answering. Also runs on the same VM.
- Gemini Embedding API (gemini-embedding-001) - called via a free AI Studio API key. Embedding a blog with dozens of posts fits within the free tier; at paid rates it would still cost only cents.
- Gemini 2.5 Flash - for answer generation. Also via the free AI Studio key. Fast, cheap, and more than capable enough for this use case.
The ongoing cost is the fraction of the VM's compute that the two new containers use - which on a small GCP instance rounds to noise. And whenever a post is published, the index updates automatically.
How the Index Works - and Why There Are Two Different Kinds
This is worth clarifying because the terminology gets muddled.
When a blog post is published, the RAG service strips the HTML, splits the content into overlapping 400-word chunks, and sends each chunk to the Gemini embedding API. What comes back is a 3,072-dimensional vector - a list of numbers that encodes the semantic meaning of that text. Those vectors get stored in pgvector alongside the chunk text and post metadata.
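Here's roughly what the strip-and-chunk step could look like. This is a minimal sketch, not the service's actual code: the function names are made up for illustration, the HTML stripping is deliberately crude, and the 50-word overlap is an assumption (only the 400-word chunk size comes from the description above).

```python
import re
from html import unescape


def strip_html(html: str) -> str:
    """Crude stripper: drop script/style bodies, then tags, then collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", unescape(text)).strip()


def chunk_words(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `size` words.

    Consecutive chunks share `overlap` words (the overlap value is an assumption).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the post
    return chunks
```

Each string returned by `chunk_words(strip_html(post_html))` would then be sent to the embedding API and stored with its vector.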
That table of vectors is the index - the embedding index. It's what makes semantic search possible at all.
A separate concept is an index on that index - a data structure like IVFFlat or HNSW that organizes vectors spatially so similarity searches can skip most of the table instead of scanning every row. For millions of vectors, this is essential. For a blog with a few hundred chunks, it's unnecessary overhead. I deliberately skipped it. Postgres does a full sequential scan, finds the closest vectors in milliseconds, and moves on. No approximation, no tuning required.
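As the blog grows, I can add that acceleration layer later without touching ingestion or the query path. One caveat: pgvector's IVFFlat and HNSW indexes support at most 2,000 dimensions on the plain vector type, so 3,072-dimensional embeddings would have to be indexed as half-precision halfvec values or re-embedded at a lower output dimension.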
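To make the "table of vectors, no extra index" idea concrete, here is a sketch of the SQL such a service might keep as Python constants. The table name, column names, and psycopg-style named placeholders are all assumptions. The `vector(3072)` column type and the `<=>` cosine distance operator are real pgvector features.

```python
# Hypothetical table and column names; the actual schema is not shown in this post.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS post_chunks (
    id         bigserial PRIMARY KEY,
    post_url   text NOT NULL,
    post_title text NOT NULL,
    chunk_text text NOT NULL,
    embedding  vector(3072) NOT NULL
);
"""

# <=> is pgvector's cosine distance operator, so 1 - distance is cosine similarity.
# With no IVFFlat or HNSW index on the table, Postgres answers this with a
# sequential scan and the ranking is exact.
SEARCH_SQL = """
SELECT post_url, post_title, chunk_text,
       1 - (embedding <=> %(q)s::vector) AS similarity
FROM post_chunks
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

With a psycopg cursor (an assumption about the driver), the search would be something like `cur.execute(SEARCH_SQL, {"q": to_vector_literal(question_vec), "k": 5})`.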
How Grounding Actually Works
This is the part I'm most deliberate about, because a RAG system that answers off-topic questions with confident nonsense is worse than no RAG at all.
Grounding happens in two layers.
Layer one is mathematical. When a question comes in, it gets embedded using the same model as the blog content. The resulting vector is compared against every stored chunk using cosine similarity - a measure of how closely two vectors point in the same direction in 3,072-dimensional space. Only chunks that score above a 0.55 threshold are considered relevant. If nothing clears that threshold, the question never reaches the language model; the service just returns a "that's not covered here" message. No hallucination is possible because no LLM is involved.
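The gate itself is only a few lines of arithmetic. Below is a pure-Python sketch of the idea, with hypothetical function names. In the real service the similarity would presumably be computed inside Postgres, so treat this as an illustration of the math rather than the implementation. Only the 0.55 threshold comes from the description above.

```python
import math

SIMILARITY_THRESHOLD = 0.55  # threshold value stated in the post


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def relevant_chunks(question_vec, chunks, threshold=SIMILARITY_THRESHOLD):
    """Return (score, text) pairs scoring above the threshold, best first.

    `chunks` is a list of (text, vector) pairs. An empty result means the
    question should be answered with the fallback message, not sent to the LLM.
    """
    scored = [(cosine_similarity(question_vec, vec), text) for text, vec in chunks]
    return sorted((pair for pair in scored if pair[0] > threshold), reverse=True)
```

If `relevant_chunks(...)` comes back empty, the caller returns the out-of-scope message immediately and never builds a prompt.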
Layer two is the prompt itself. If relevant chunks do pass the threshold, they get assembled into a context block and injected into the prompt alongside a strict instruction set: answer only from the provided content, never blend in outside knowledge, and if the question still can't be answered from what's there, respond with the literal token OUT_OF_SCOPE. The service checks for that token before returning - if it sees it, sources are suppressed and the user gets a clean out-of-scope message instead of a sourced hallucination.
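A sketch of how that second layer might be wired up. The prompt wording, the fallback message, and the function names are all illustrative assumptions. Only the OUT_OF_SCOPE token and the "suppress sources when it appears" behavior come from the description above.

```python
OUT_OF_SCOPE = "OUT_OF_SCOPE"  # sentinel token named in the post
FALLBACK_MESSAGE = "That's not covered on this blog."  # hypothetical wording


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the retrieved chunks into a context block with strict instructions."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using ONLY the blog excerpts below. Do not use outside knowledge.\n"
        f"If the excerpts do not answer the question, reply with exactly {OUT_OF_SCOPE}.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}"
    )


def finalize(model_reply: str, sources: list[str]) -> dict:
    """Check for the sentinel before returning anything to the widget."""
    if OUT_OF_SCOPE in model_reply:
        # Suppress sources so a refusal never looks like a sourced answer.
        return {"answer": FALLBACK_MESSAGE, "sources": []}
    return {"answer": model_reply.strip(), "sources": sources}
```

Checking with `in` rather than strict equality is a defensive choice in this sketch: it still catches the token if the model wraps it in extra words or whitespace.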
The result is a system that is genuinely grounded. It won't answer questions about the weather in Frankfurt. It won't speculate about things I haven't written about. It stays in its lane.
A Few Other Details Worth Noting
Rate limiting is handled in PHP before requests even reach the Python service - 10 questions per minute per IP, enforced at the proxy layer. The widget itself is vanilla JS with no dependencies, styled to match the site's dark theme.
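The actual limiter is PHP, so the following is not that code. It's a Python illustration of one way a "10 per minute per IP" rule can be expressed, using a sliding window. The class name and the in-memory storage are assumptions made for the sketch.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each IP."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject without recording the attempt
        hits.append(now)
        return True
```

Passing `now` in explicitly (for example `time.monotonic()`) keeps the class easy to test. A rejected request would typically get an HTTP 429 before the RAG service ever sees it.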
The whole thing - ingestion, serving, storage - runs on infrastructure I was already paying for. The API calls are free tier. The marginal cost of the RAG feature on this site is effectively zero.
That felt worth building and worth writing up. If you want to ask it something, the button's in the corner.