
Production RAG: Patterns, Pitfalls, and What Actually Works

80% of RAG failures trace to ingestion, not the LLM. The 2026 patterns — chunking, hybrid search, reranking, and evaluation — that consistently work in production.

By Marcus Chia · 2025-12-23 · 11 min read

Most RAG systems quietly underperform their demos. The reason is consistent: roughly 80% of RAG failures trace back to the ingestion and chunking layer, not the LLM. The model is doing its job — generating fluent text from whatever context it gets. The problem is that the context you handed it was the wrong context.

This article is the operator's view of production RAG in 2026. The patterns that consistently work, the failures that consistently kill projects, and the order to attack them in.

1. The naive RAG pipeline (and why it underperforms)

The textbook RAG pipeline:

  1. Split your documents into chunks.
  2. Embed each chunk and store in a vector database.
  3. For each user query, embed it and retrieve the top-K nearest chunks.
  4. Pass the chunks plus the query to an LLM, get an answer.
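
In code, the whole loop is short. A minimal sketch, assuming the open-source sentence-transformers library for embeddings, an in-memory NumPy array standing in for the vector database, and a placeholder llm callable for generation:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

    def chunk(text, size=2000, overlap=200):
        # Step 1: fixed-size character chunks with overlap
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]

    documents = ["...your corpus goes here..."]  # placeholder corpus
    chunks = [c for doc in documents for c in chunk(doc)]

    # Step 2: embed every chunk; this array stands in for a vector database
    chunk_vectors = model.encode(chunks, normalize_embeddings=True)

    def retrieve(query, k=5):
        # Step 3: embed the query, take the top-K chunks by cosine similarity
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = chunk_vectors @ q
        return [chunks[i] for i in np.argsort(-scores)[:k]]

    def answer(query, llm):
        # Step 4: pass the retrieved chunks plus the query to the LLM
        context = "\n\n".join(retrieve(query))
        return llm(f"Answer using only this context:\n\n{context}\n\nQuestion: {query}")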

This works well enough for demos. In production it routinely fails: the retriever returns the wrong chunks, the chunks are too long or too short, queries that depend on specific terminology miss entirely, and the model hallucinates anyway because the retrieved context did not actually contain the answer.

Naive pipelines fail at retrieval roughly 40% of the time on real-world corpora. The model cannot answer correctly with bad context.

2. The five fixes that compound

Production RAG pipeline (after the 5 fixes):

  1. Query (user input). A plain question. Optionally: query rewriting or HyDE for vague conversational queries (Fix 5).
  2. Retrieve (hybrid search). Vector search (semantic) + BM25 (keyword) merged via reciprocal rank fusion. Largest recall improvement per unit of complexity (Fix 2).
  3. Rerank (20 → top 5). Cohere rerank-v3.5 or equivalent reorders 20 candidates by query relevance at ~50ms latency and returns the top 5 to the model (Fix 3).
  4. Generate (LLM with context). A frontier model receives the query plus the 5 reranked chunks and answers grounded in the retrieved context, with citations back to source documents.

The patterns below, applied in order, take a naive RAG system from "sometimes works" to "production-grade."

Fix 1: Better chunking

The single highest-leverage change. The goal is chunks that are semantically complete — each chunk should be able to answer a query on its own.

  • Start with character-based chunking at ~2000 characters with ~200 character overlap (≈ 500 tokens at 4 chars/token). Tune for your embedding model's ideal input length.
  • For structured documents (markdown, HTML, code), respect the structure. Chunk by section heading, not arbitrary character count (a sketch follows this list).
  • For very long technical content, semantic chunking — splitting at topic boundaries detected by embedding similarity — typically outperforms fixed-size splitting.
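
A sketch of the structure-aware option for markdown, splitting on headings first and falling back to fixed-size character splits only when a section is still too long:

    import re

    def chunk_markdown(text, max_chars=2000, overlap=200):
        # Split at markdown headings so each chunk stays inside one section
        sections = re.split(r"\n(?=#{1,6} )", text)
        chunks = []
        for section in sections:
            if len(section) <= max_chars:
                chunks.append(section)
            else:
                # Oversized section: fall back to character splits with overlap
                step = max_chars - overlap
                chunks.extend(section[i:i + max_chars]
                              for i in range(0, len(section), step))
        return chunks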

Fix 2: Hybrid search

Pure vector search misses queries that turn on specific terminology — proper nouns, codes, identifiers, version numbers. Combine vector search with sparse keyword search (BM25 or TF-IDF), merge with reciprocal rank fusion. This is the change with the largest recall improvement per unit of complexity. Implement first, before reranking.
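
Reciprocal rank fusion itself is only a few lines. A sketch, assuming vector_top20 and bm25_top20 are already-ranked result lists from the two retrievers:

    def reciprocal_rank_fusion(ranked_lists, k=60):
        # Each input list is ordered best-first (chunk IDs or the chunks themselves);
        # k=60 is the conventional damping constant
        scores = {}
        for ranking in ranked_lists:
            for rank, chunk_id in enumerate(ranking, start=1):
                scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Merge the vector and BM25 result lists into one fused ranking
    fused = reciprocal_rank_fusion([vector_top20, bm25_top20])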

Fix 3: Reranking

Retrieve more candidates than you return. A reasonable starting point: retrieve 20 candidates, rerank with a model like Cohere rerank-v3.5, return the top 5 to the LLM. Reranking adds about 50ms of latency and $0.001–$0.01 per query. Skip it only when latency is critical or lower accuracy is acceptable.
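
A sketch of the retrieve-20, return-5 pattern, using an open-source cross-encoder from sentence-transformers as a stand-in for a hosted reranker such as Cohere's; hybrid_top20 is a placeholder for the 20 fused candidate chunks from Fix 2:

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query, candidates, top_n=5):
        # Score every (query, chunk) pair, keep the top_n highest-scoring chunks
        scores = reranker.predict([(query, c) for c in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [chunk for chunk, _ in ranked[:top_n]]

    top5 = rerank(query, hybrid_top20)  # hybrid_top20: the fused candidate chunk texts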

Fix 4: Contextual retrieval

An approach published by Anthropic in late 2024: before embedding each chunk, prepend a short LLM-generated summary placing the chunk in the context of its source document. This dramatically improves retrieval quality for chunks where the surrounding document context matters (legal documents, technical specifications, code).
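
A sketch of the idea using the Anthropic Python SDK; the model name and prompt here are illustrative placeholders, not Anthropic's published recipe:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def contextualize(chunk, full_document):
        # Ask a small, cheap model for 1-2 sentences situating the chunk in its document
        message = client.messages.create(
            model="claude-3-5-haiku-latest",  # illustrative; any fast model will do
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": (
                    f"<document>\n{full_document}\n</document>\n\n"
                    f"Here is a chunk from that document:\n{chunk}\n\n"
                    "Write 1-2 sentences situating this chunk within the document "
                    "to improve search retrieval. Reply with only those sentences."
                ),
            }],
        )
        # Embed this combined text instead of the raw chunk
        return message.content[0].text + "\n\n" + chunk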

Fix 5: Query transformation

For complex queries, transform them before retrieval. Common patterns include query rewriting (an LLM rewrites the user query into a clearer search query), query expansion (generate multiple variants of the query, retrieve for each, merge), and HyDE (hypothetical document embeddings — the LLM generates a hypothetical answer, you embed that, and retrieve based on it). Query transformation helps most when users ask vaguely or in conversational style.
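
HyDE in particular is only a couple of extra lines. A sketch, reusing the retrieve() function from the naive-pipeline sketch and the same placeholder llm callable:

    def hyde_retrieve(query, llm, k=5):
        # Generate a hypothetical answer, then retrieve against it instead of the raw query
        hypothetical = llm(f"Write a short passage that answers this question: {query}")
        return retrieve(hypothetical, k=k)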

3. Evaluation: the discipline most teams skip

The 4 RAGAS metrics every RAG system must track:

  1. Context precision. How much of the retrieved context is actually relevant to the query. Poor precision → noise in the prompt → confused model outputs.
  2. Context recall. Whether the pipeline retrieved the information needed to answer correctly. Poor recall → no amount of model quality fixes the answer.
  3. Faithfulness. Whether the generated answer stays grounded in the provided context. Low faithfulness = hallucination beyond what was retrieved.
  4. Answer relevance. Whether the final output directly addresses the user query. High relevance + faithfulness = trustworthy production output.

You cannot fix what you cannot measure. Yet most RAG deployments ship without any systematic evaluation, and then the team spends quarters debugging on user complaints.

The minimum viable evaluation, using RAGAS or equivalent:

  • Context precision: how much of the retrieved context is actually relevant to the query.
  • Context recall: whether the pipeline retrieved the information needed to answer correctly.
  • Faithfulness: whether the generated answer stays grounded in the provided context (does not hallucinate beyond what was retrieved).
  • Answer relevance: whether the final output directly addresses the user query.

Build a small evaluation set: 50–100 query-with-correct-answer pairs from real usage. Run your pipeline against it weekly. Watch the trend lines, not absolute numbers. A declining trend is the signal worth investigating.
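
A sketch of that weekly run with the RAGAS library; import paths and column names shift between RAGAS versions, so treat the names below as approximate:

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy, context_precision, context_recall, faithfulness,
    )

    # eval_rows holds your 50-100 real queries, already run through the pipeline
    eval_rows = {
        "question": [...],      # the user queries
        "contexts": [...],      # the retrieved chunks for each query (one list per row)
        "answer": [...],        # the pipeline's generated answers
        "ground_truth": [...],  # the known-correct answers
    }

    results = evaluate(
        Dataset.from_dict(eval_rows),
        metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    )
    print(results)  # store these four scores each week and watch the trend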

4. The architecture decisions that scale

One: keep ingestion idempotent. You will re-process your corpus repeatedly — embedding model upgrades, chunking strategy changes, metadata schema evolution. Make the ingestion pipeline safe to re-run from scratch without duplicating or corrupting data.

Two: store the source. Every retrieved chunk should be traceable to a specific document, version, and offset. Without this, you cannot debug retrieval failures and cannot answer audit questions about what data informed which response.
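
One way to get both of the first two properties at once is to derive every chunk's ID deterministically from its source coordinates. A sketch (field names are illustrative):

    import hashlib
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ChunkRecord:
        doc_id: str       # which source document
        doc_version: str  # which revision of that document
        start: int        # character offset where the chunk begins
        end: int          # character offset where the chunk ends
        text: str

        @property
        def chunk_id(self) -> str:
            # The same input always yields the same ID, so re-running ingestion
            # upserts in place instead of duplicating rows, and every ID traces
            # back to a specific document, version, and offset.
            key = f"{self.doc_id}:{self.doc_version}:{self.start}:{self.end}"
            return hashlib.sha256(key.encode()).hexdigest()[:16]

Upserting by chunk_id means a re-run after a chunking change replaces stale rows instead of stacking new ones on top of them.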

Three: log everything. The query, the retrieved chunks (with scores), the final prompt, the LLM response. Logs are the difference between a system you can improve and a system you can only complain about.
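
In practice that means one structured record per request. A sketch using JSON lines, with field names as illustrative assumptions and chunk_id reused from the sketch above:

    import json
    import time

    def log_request(query, retrieved, prompt, response, path="rag_requests.jsonl"):
        # retrieved: list of (ChunkRecord, score) pairs returned by the retriever
        record = {
            "ts": time.time(),
            "query": query,
            "retrieved": [
                {"chunk_id": c.chunk_id, "score": float(score)} for c, score in retrieved
            ],
            "prompt": prompt,
            "response": response,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")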

Four: think about freshness. Vector search returns the closest match — even if all your data is six months stale. Build refresh pipelines, alert on staleness, and decide explicitly how to surface freshness to end users.

5. When NOT to use RAG

RAG is the right answer for many problems but not all. Skip it when:

  • Your knowledge base fits comfortably in the LLM's context window (under 50K tokens) and changes infrequently. Just include it directly.
  • Your task is reasoning, not knowledge retrieval. RAG does not help a model think better — it helps it stay grounded in your data.
  • You need real-time computation or transactional data — use direct API calls, not retrieval.
  • The cost of being slightly wrong is unacceptably high (medical diagnosis, legal advice). RAG reduces hallucination but does not eliminate it.

6. The order to build in

If you are starting from zero, build in this order:

  1. Naive pipeline first. Get it working end-to-end. Establish a baseline.
  2. Build a small evaluation set immediately. Measure your baseline.
  3. Add hybrid search. Measure the change.
  4. Add reranking. Measure.
  5. Improve chunking. Measure.
  6. Add contextual retrieval and/or query transformation if your evaluation shows specific recall gaps.

The discipline matters more than the techniques. Most teams add all the techniques at once, cannot tell which one helped, and end up with a complex system they cannot improve. Iterate one change at a time, with measurement.

For Malaysian teams building production RAG with proper evaluation discipline, our AI Engineering programme covers the full RAG stack hands-on, HRDC SBL-KHAS claimable for eligible employers.

About the author

Marcus Chia

12+ yrs Product Design · Vibe Coding Specialist · ASEAN-scale Products

Marcus has 12+ years in product design and front-end engineering, having shipped consumer and SaaS products used by millions across ASEAN. He specialises in vibe-coding workflows that turn Figma concepts into deployable apps using Claude Code, Antigravity, and Cursor — and teaches non-developers to ship polished, user-centric interfaces in days rather than sprints.

Frequently Asked Questions

Why do most RAG failures trace back to ingestion rather than the model?

About 80 percent of failures trace back to the ingestion and chunking layer rather than the LLM. The model generates fluent text from whatever context it receives — but if retrieval returned irrelevant chunks, no amount of model quality will produce a correct answer. The fix is to invest in chunking strategy, hybrid search, reranking, and evaluation in that order, before assuming the LLM is the problem.

What is hybrid search, and why does it matter for RAG?

Hybrid search combines dense vector retrieval (semantic similarity via embeddings) with sparse keyword retrieval (BM25 or TF-IDF). The two result sets are merged using reciprocal rank fusion. Pure vector search misses queries that turn on specific terminology — proper nouns, identifiers, version numbers — and hybrid search recovers most of that recall. It is the single highest-leverage change for naive RAG pipelines, and should be implemented before reranking.

Is reranking worth the added cost and latency?

Significant for accuracy, modest in cost. Adding a reranker like Cohere rerank-v3.5 typically delivers meaningful improvements in retrieval relevance at about 50ms additional latency and $0.001–$0.01 per query. The pattern: retrieve 20 candidates with hybrid search, rerank, return top 5 to the LLM. Skip reranking only for latency-critical applications or cost-sensitive internal tools where approximate answers are acceptable.

How should a production RAG system be evaluated?

RAGAS is the most widely adopted reference-free evaluation framework in 2026. Its four core metrics — context precision, context recall, faithfulness, and answer relevance — cover the full pipeline. Build a small evaluation set of 50–100 query-with-correct-answer pairs from real usage. Run weekly. Watch trend lines, not absolute numbers — a declining trend is the signal to investigate.

When is RAG the wrong approach?

When your knowledge base fits comfortably in the LLM's context window (under 50K tokens) and changes infrequently, direct inclusion typically outperforms RAG. Modern frontier models with 200K+ context handle moderately large knowledge bases natively. Use RAG when the corpus is too large for context, when freshness requires real-time retrieval, or when you need traceability to specific source documents.

Want to apply this in your organisation?

AITraining2U runs HRDC-claimable corporate AI training for Malaysian organisations — from leadership awareness to hands-on builder workshops. Talk to us about a programme tailored to your team.