Most RAG systems quietly underperform their demos. The reason is consistent: roughly 80% of RAG failures trace back to the ingestion and chunking layer, not the LLM. The model is doing its job — generating fluent text from whatever context it gets. The problem is that the context you handed it was the wrong context.
This article is the operator's view of production RAG in 2026. The patterns that consistently work, the failures that consistently kill projects, and the order to attack them in.
1. The naive RAG pipeline (and why it underperforms)
The textbook RAG pipeline (sketched in code after the list):
- Split your documents into chunks.
- Embed each chunk and store in a vector database.
- For each user query, embed it and retrieve the top-K nearest chunks.
- Pass the chunks plus the query to an LLM, get an answer.
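In code, the whole loop fits on a page. A minimal sketch, with a toy `embed()` standing in for a real embedding model and a NumPy array standing in for the vector database:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding; swap in a real embedding model."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def chunk(doc: str, size: int = 2000) -> list[str]:
    """Naive fixed-size character chunking."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

documents = ["...your corpus here..."]

# Ingestion: split every document, embed every chunk, keep text and vector together.
chunks = [c for doc in documents for c in chunk(doc)]
index = np.stack([embed(c) for c in chunks])

def retrieve(query: str, k: int = 5) -> list[str]:
    """Top-k nearest chunks by cosine similarity (vectors are unit-norm)."""
    sims = index @ embed(query)
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# The retrieved chunks plus the user query then go into the LLM prompt.
```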
This works well enough for demos. In production it routinely fails: the retriever returns the wrong chunks, the chunks are too long or too short, queries that depend on specific terminology miss entirely, and the model hallucinates anyway because the retrieved context did not actually contain the answer.
Naive pipelines fail at retrieval roughly 40% of the time on real-world corpora. The model cannot answer correctly with bad context.
2. The five fixes that compound
[Figure: the production RAG pipeline after the five fixes]
The patterns below, applied in order, take a naive RAG system from "sometimes works" to "production-grade."
Fix 1: Better chunking
The single highest-leverage change. The goal is chunks that are semantically complete — each chunk should be able to answer a query on its own.
- Start with character-based chunking at ~2000 characters with ~200 character overlap (≈ 500 tokens at 4 chars/token). Tune for your embedding model's ideal input length; a sketch follows this list.
- For structured documents (markdown, HTML, code), respect the structure. Chunk by section heading, not arbitrary character count.
- For very long technical content, semantic chunking — splitting at topic boundaries detected by embedding similarity — typically outperforms fixed-size splitting.
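A sketch of the character-based starting point with overlap, plus a structure-aware splitter for markdown. The defaults mirror the numbers above; treat them as tuning knobs:

```python
import re

def chunk_with_overlap(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with overlap, so a sentence cut at one
    chunk boundary still appears whole in the neighbouring chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def chunk_markdown_sections(text: str) -> list[str]:
    """Structure-aware split for markdown: one chunk per heading section."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    return [s.strip() for s in sections if s.strip()]
```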
Fix 2: Hybrid search
Pure vector search misses queries that hinge on specific terminology — proper nouns, codes, identifiers, version numbers. Combine vector search with sparse keyword search (BM25 or TF-IDF) and merge the two ranked lists with reciprocal rank fusion. This is the change with the largest recall improvement per unit of complexity; implement it first, before reranking.
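Reciprocal rank fusion itself is only a few lines. A sketch, where each input is a ranked list of chunk IDs from one retriever (vector or BM25):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores 1 / (k + rank) in every list
    it appears in. k = 60 is the conventional constant; it damps the
    advantage of the very top ranks."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([vector_ids, bm25_ids])
```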
Fix 3: Reranking
Retrieve more candidates than you return. A reasonable starting point: retrieve 20 candidates, rerank with a model like Cohere rerank-v3.5, return the top 5 to the LLM. Reranking adds about 50ms of latency and $0.001–$0.01 per query. Skip reranking only when latency is critical or lower accuracy is acceptable.
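A sketch assuming the Cohere Python SDK; the client class and response fields have shifted across SDK versions, so check the docs for the version you install:

```python
import cohere

co = cohere.ClientV2()  # reads the CO_API_KEY environment variable

def rerank_top_n(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Retrieve wide, return narrow: score ~20 candidates, keep the best 5."""
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in response.results]

# top_chunks = rerank_top_n(query, retrieve(query, k=20))
```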
Fix 4: Contextual retrieval
An approach published by Anthropic in late 2024: before embedding each chunk, prepend a short LLM-generated summary placing the chunk in the context of its source document. This dramatically improves retrieval quality for chunks where the surrounding document context matters (legal documents, technical specifications, code).
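A sketch of the ingestion-time step. The prompt below paraphrases the idea rather than quoting Anthropic's published prompt, and `generate()` is a stand-in for whatever LLM completion call you use:

```python
SITUATE_PROMPT = """<document>
{document}
</document>

Here is a chunk from that document:

<chunk>
{chunk}
</chunk>

Write one or two sentences situating this chunk within the overall
document, for use in search retrieval. Answer with the context only."""

def contextualize_chunk(document: str, chunk: str, generate) -> str:
    """Prepend an LLM-written situating summary; embed the combined text."""
    context = generate(SITUATE_PROMPT.format(document=document, chunk=chunk))
    return f"{context.strip()}\n\n{chunk}"

# vector = embed(contextualize_chunk(doc_text, chunk_text, generate))
```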
Fix 5: Query transformation
For complex queries, transform them before retrieval. Common patterns include query rewriting (an LLM rewrites the user query into a clearer search query), query expansion (generate multiple variants of the query, retrieve for each, merge), and HyDE (hypothetical document embeddings — the LLM generates a hypothetical answer, you embed that, and retrieve based on it). Query transformation helps most when users ask vaguely or in conversational style.
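A sketch of HyDE, the least obvious of the three. `generate` and `retrieve` are stand-ins for your LLM call and your vector retriever:

```python
def hyde_retrieve(query: str, generate, retrieve, k: int = 5) -> list[str]:
    """Embed a hypothetical answer instead of the raw query: the fake
    answer usually lands closer to real answer passages in embedding
    space than a short conversational question does."""
    hypothetical = generate(
        "Write a short passage that plausibly answers the question below, "
        "as if quoted from documentation. Question: " + query
    )
    return retrieve(hypothetical, k=k)
```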
3. Evaluation: the discipline most teams skip
[Figure: the four RAGAS metrics every RAG system must track]
You cannot fix what you cannot measure. Yet most RAG deployments ship without any systematic evaluation, and the team then spends quarters debugging from user complaints.
The minimum viable evaluation, using RAGAS or equivalent:
- Context precision: how much of the retrieved context is actually relevant to the query.
- Context recall: whether the pipeline retrieved the information needed to answer correctly.
- Faithfulness: whether the generated answer stays grounded in the provided context (does not hallucinate beyond what was retrieved).
- Answer relevance: whether the final output directly addresses the user query.
Build a small evaluation set: 50–100 query-with-correct-answer pairs from real usage. Run your pipeline against it weekly. Watch the trend lines, not absolute numbers. A declining trend is the signal worth investigating.
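A sketch assuming the classic RAGAS API; metric names and imports have changed across releases, so verify against the version you install:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

# One row per evaluation case: the query, your pipeline's answer, the
# chunks it retrieved, and the known-correct answer.
eval_set = Dataset.from_dict({
    "question":     ["How do I rotate an API key?"],
    "answer":       ["Generated answer from your pipeline"],
    "contexts":     [["Retrieved chunk 1", "Retrieved chunk 2"]],
    "ground_truth": ["The known-correct answer"],
})

scores = evaluate(eval_set, metrics=[
    context_precision, context_recall, faithfulness, answer_relevancy,
])
print(scores)  # track weekly; watch trends, not absolutes
```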
4. The architecture decisions that scale
One: keep ingestion idempotent. You will re-process your corpus repeatedly — embedding model upgrades, chunking strategy changes, metadata schema evolution. Make the ingestion pipeline safe to re-run: the same input always produces the same indexed state, with no duplicates.
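The core trick is deterministic IDs. A sketch (the `store.upsert` call is a placeholder for your vector database's write API):

```python
import hashlib

def chunk_id(source_uri: str, version: str, offset: int, text: str) -> str:
    """Deterministic ID: identical input always yields the identical ID,
    so re-running ingestion upserts in place instead of duplicating."""
    key = f"{source_uri}|{version}|{offset}|{text}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# store.upsert(id=chunk_id(uri, ver, off, text), vector=embed(text), text=text)
```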
Two: store the source. Every retrieved chunk should be traceable to a specific document, version, and offset. Without this, you cannot debug retrieval failures and cannot answer audit questions about what data informed which response.
Three: log everything. The query, the retrieved chunks (with scores), the final prompt, the LLM response. Logs are the difference between a system you can improve and a system you can only complain about.
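A sketch of a per-request log record that covers both of the last two points: each retrieved chunk carries its source coordinates, and the whole exchange lands in one JSON line:

```python
import json
import time
import uuid

def log_rag_request(query, retrieved, prompt, response, path="rag_log.jsonl"):
    """Append one JSON line per request: enough to replay, debug, and
    audit any answer. Each entry in `retrieved` is expected to carry
    chunk_id, source_uri, version, offset, and score."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "retrieved": retrieved,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```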
Four: think about freshness. Vector search returns the closest match — even if all your data is six months stale. Build refresh pipelines, alert on staleness, and decide explicitly how to surface freshness to end users.
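A sketch of a staleness check, assuming each chunk's metadata records an `ingested_at` Unix timestamp (an assumption; adapt to whatever metadata your store keeps):

```python
import time

def stale_fraction(chunk_metadata: list[dict], max_age_days: int = 30) -> float:
    """Fraction of indexed chunks older than the freshness budget;
    alert when this crosses whatever threshold you choose."""
    cutoff = time.time() - max_age_days * 86400
    stale = sum(1 for m in chunk_metadata if m["ingested_at"] < cutoff)
    return stale / max(len(chunk_metadata), 1)
```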
5. When NOT to use RAG
RAG is the right answer for many problems but not all. Skip it when:
- Your knowledge base fits comfortably in the LLM's context window (under 50K tokens) and changes infrequently. Just include it directly.
- Your task is reasoning, not knowledge retrieval. RAG does not help a model think better — it helps it stay grounded in your data.
- You need real-time computation or transactional data — use direct API calls, not retrieval.
- The cost of being slightly wrong is unacceptably high (medical diagnosis, legal advice). RAG reduces hallucination but does not eliminate it.
6. The order to build in
If you are starting from zero, build in this order:
- Naive pipeline first. Get it working end-to-end. Establish a baseline.
- Build a small evaluation set immediately. Measure your baseline.
- Add hybrid search. Measure the change.
- Add reranking. Measure.
- Improve chunking. Measure.
- Add contextual retrieval and/or query transformation if your evaluation shows specific recall gaps.
The discipline matters more than the techniques. Most teams add all the techniques at once, cannot tell which one helped, and end up with a complex system they cannot improve. Iterate one change at a time, with measurement.
For Malaysian teams building production RAG with proper evaluation discipline, our AI Engineering programme covers the full RAG stack hands-on and is HRDC SBL-KHAS claimable for eligible employers.