Building RAG Pipelines That Don't Hallucinate: What I Learned

March 10, 2025

Retrieval-Augmented Generation (RAG) is one of those ideas that sounds almost too clean. You have a language model that doesn't know your data. You retrieve the relevant chunks from your data. You hand those chunks to the model as context. The model answers correctly.

In practice, it fails in at least five ways before you get it right.

I built a RAG system at HackDavis 2025 — a GenAI assistant for community health workers grounded in CDC and California county health datasets. What started as "just RAG" became a crash course in everything that breaks between the architecture diagram and a working system.

The retrieval step is where most teams give up too early

The most common mistake I see is treating retrieval as solved once you get cosine similarity working. It isn't.

When a user asks "What is the diabetes prevalence in South LA?", your vector store needs to return chunks that are actually about South LA diabetes rates — not just chunks that share vocabulary with the question. That sounds obvious. It's surprisingly hard to guarantee.

Problems I ran into:

  • Chunks were too large. A 1,500-token chunk from a CDC report might be about diabetes in California but bury the South LA statistic 800 tokens in, so the similarity score gets diluted. Smaller, semantically coherent chunks (300–500 tokens) worked much better.
  • Embeddings didn't match the query style. The documents used clinical language. Users asked in plain English. I switched from a general-purpose embedding model to one fine-tuned on biomedical text, and retrieval recall jumped immediately.
  • Top-K wasn't enough. Returning the top 3 chunks by similarity sometimes missed the correct chunk entirely. I moved to top-10 retrieval with a re-ranker (cross-encoder) to select the best 3 before passing to the LLM. Much more reliable.
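
Here is roughly what that retrieve-then-re-rank step looks like in code, using the CrossEncoder class from sentence-transformers. The model name, the vector_store object, and its similarity_search method are stand-ins for whatever store and model you actually use, not the exact setup from the hackathon:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly, which is slower
# than comparing precomputed embeddings but much better at ranking.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query, vector_store, k_retrieve=10, k_final=3):
    # Recall-oriented first pass: grab more candidates than we need.
    candidates = vector_store.similarity_search(query, k=k_retrieve)

    # Precision pass: re-score every candidate against the query.
    pairs = [(query, chunk.text) for chunk in candidates]
    scores = reranker.predict(pairs)

    # Keep only the best few chunks for the LLM context.
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k_final]]
```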

Context stuffing breaks your prompt budget fast

Once retrieval works, you have a new problem: you're passing 3–10 chunks plus conversation history plus a system prompt into a single context window. At 4K tokens of context, that fills up fast.

What helped:

  • Summarize long chunks before insertion when the chunk exceeds ~400 tokens. A quick map-reduce summary pass keeps the signal without the noise.
  • Maintain a rolling conversation buffer, not the full history. Keep the last 3 turns only.
  • Put the most relevant chunks last in the context, right before the question. LLMs attend best to the start and end of the context and tend to overlook what sits in the middle (the "lost in the middle" problem is real).
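
Sketched out, the assembly step looked roughly like this. The summarize and count_tokens callables and the 400-token threshold are stand-ins for whatever summarization pass and tokenizer you use:

```python
MAX_HISTORY_TURNS = 3
SUMMARY_THRESHOLD_TOKENS = 400  # chunks longer than this get compressed first

def build_context(chunks_best_first, history, question, summarize, count_tokens):
    # Compress oversized chunks so one long passage doesn't eat the budget.
    prepared = []
    for chunk in chunks_best_first:
        text = summarize(chunk) if count_tokens(chunk) > SUMMARY_THRESHOLD_TOKENS else chunk
        prepared.append(text)

    # Rolling buffer: only the last few turns, never the full history.
    recent_history = history[-MAX_HISTORY_TURNS:]

    # Reverse so the most relevant chunk sits last, right next to the question,
    # where the model is least likely to lose it.
    context = "\n\n".join(reversed(prepared))

    return "\n\n".join([
        "Conversation so far:\n" + "\n".join(recent_history),
        "Context:\n" + context,
        "Question: " + question,
    ])
```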

Grounding citations are non-negotiable for trust

The health workers using our system needed to trust the answers enough to act on them. An answer without a source is just an expensive guess.

I added a citation step: after the model generates a response, a second lightweight prompt extracts which chunks the answer draws from and formats them as footnote references. The output looks like:

"The diabetes prevalence in Los Angeles County is 10.3% as of 2023. [Source: CDPH County Health Data 2023, Table 4.2]"

This sounds like extra engineering. It's not optional if the output matters.
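
The second pass really is lightweight. A sketch of its shape, where call_llm is a placeholder for whichever model client you use and the chunk tuple format is illustrative:

```python
CITATION_PROMPT = """Below is an answer and the source chunks it was generated from.
List the chunk IDs the answer actually relies on, then rewrite the answer with a
bracketed [Source: ...] reference after each cited statistic.

Answer:
{answer}

Source chunks:
{chunks}
"""

def add_citations(answer, chunks, call_llm):
    # chunks is a list of (chunk_id, source_label, text) tuples from retrieval.
    formatted = "\n\n".join(
        f"[{chunk_id}] ({source}): {text}" for chunk_id, source, text in chunks
    )
    prompt = CITATION_PROMPT.format(answer=answer, chunks=formatted)
    return call_llm(prompt)  # returns the answer with footnote-style references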

Prompt engineering is 30% of your accuracy

I spent two days improving retrieval and one afternoon improving the system prompt. The prompt afternoon had more impact.

Key patterns that worked:

You are a data assistant for community health workers.
Answer ONLY using the provided context.
If the context does not contain enough information to answer, say: "I don't have enough data on this."
Do NOT generate statistics or dates not present in the context.
Format numbers clearly. Always cite the source document.

The explicit "do not generate statistics not in the context" instruction cut hallucinations by roughly 70% in our testing.
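
For completeness, here is roughly how that system prompt plugs into the chat call, assuming an OpenAI-style client. The model name and client setup are placeholders:

```python
SYSTEM_PROMPT = (
    "You are a data assistant for community health workers. "
    "Answer ONLY using the provided context. If the context does not contain "
    "enough information to answer, say: \"I don't have enough data on this.\" "
    "Do NOT generate statistics or dates not present in the context. "
    "Format numbers clearly. Always cite the source document."
)

def answer(question, context, client, model="gpt-4o-mini"):
    # Grounding rules live in the system message; the retrieved context and
    # the user's question go in the user message.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```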

Evaluation isn't optional

The hardest part of RAG isn't building it — it's knowing when it's good enough. I set up a small evaluation harness:

  1. 40 question-answer pairs drawn from the source documents (ground truth)
  2. Automated scoring using exact-match on numbers/statistics (the part that matters most)
  3. Manual review of 10 random outputs per experiment

Without this, you're flying blind. Every retrieval change, chunk size tweak, and prompt edit needs a signal.
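
The automated number check is nothing fancy. A sketch, where ask stands in for the end-to-end pipeline:

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_match(ground_truth, model_answer):
    # Pull every numeric value out of both answers; every ground-truth number
    # must appear somewhere in the model's answer.
    expected = set(NUMBER.findall(ground_truth))
    produced = set(NUMBER.findall(model_answer))
    return expected <= produced

def run_eval(eval_pairs, ask):
    # eval_pairs: list of (question, ground_truth) tuples; ask: the full pipeline.
    hits = sum(numbers_match(truth, ask(question)) for question, truth in eval_pairs)
    return hits / len(eval_pairs)
```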

What I'd do differently

  • Start with smaller chunks immediately. I wasted a week debugging retrieval before realizing chunk size was the root cause.
  • Build the evaluation harness on day one, not day five.
  • Use a hybrid retrieval approach (BM25 + semantic) from the start. Keyword search catches things embedding search misses, especially for proper nouns and numeric values.
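
A sketch of what that hybrid blend can look like with the rank_bm25 package layered on top of whatever embedding scores you already have. The min-max normalization and the 50/50 weighting are illustrative choices, not tuned values:

```python
from rank_bm25 import BM25Okapi

def hybrid_search(query, chunks, semantic_scores, k=10, alpha=0.5):
    # chunks: raw chunk texts; semantic_scores: cosine similarities from your
    # existing vector search, in the same order as chunks.
    bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
    keyword_scores = bm25.get_scores(query.lower().split())

    # Normalize both score lists to [0, 1] so the blend is meaningful.
    def normalize(scores):
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo + 1e-9) for s in scores]

    keyword_scores = normalize(keyword_scores)
    semantic_scores = normalize(semantic_scores)

    # Simple linear blend; keyword matching rescues proper nouns and exact numbers.
    blended = [alpha * kw + (1 - alpha) * sem
               for kw, sem in zip(keyword_scores, semantic_scores)]
    ranked = sorted(range(len(chunks)), key=lambda i: blended[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]
```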

RAG is one of the most practically useful things you can build with LLMs right now. It's also one of the most nuanced to get right. The gap between a demo and a system someone trusts with real decisions is where all the actual engineering lives.
