Why Your RAG Pipeline Feels Off: Eight Failure Modes Production Teams Keep Hitting
Retrieval-augmented generation is the most common AI architecture in production right now and the one most often built wrong. The eight failure modes we keep seeing, with concrete fixes, tool recommendations, and the diagnostic order to apply them.
Retrieval-augmented generation looks deceptively simple in a diagram. Documents in a vector database, query gets embedded, top-k results stuffed into a prompt, model returns an answer. A weekend hackathon project. A demo that works.
Then it ships, and the team discovers that the answers are confidently wrong about 30% to 50% of the time. The model isn’t the problem. The retrieval is. RAG in production is mostly a retrieval problem with a language model bolted on top, and most of the failure modes live in the retrieval layer.
We’ve debugged enough production RAG systems at this point to recognize the patterns, and the same eight failure modes account for somewhere around 90% of the problems we see. Each one is covered below with its diagnostic, its fix, and the specific tools and parameter choices that have worked in production.
How to use this post
Two ways. If you’re triaging a system that’s already shipping, skip to the diagnostic order section at the bottom, work through the queries listed there, then jump to the relevant failure mode here. If you’re building a new system, read top to bottom; the failures show up roughly in the order listed.
A note on benchmarks: most of the parameter recommendations below come from internal evals on our client work plus public benchmarks from the BEIR retrieval suite, the MTEB embedding benchmark, and the LegalBench-RAG / FinanceBench task-specific evals. None of these substitute for evaluating on your own data, which is point #5 below.
1. Chunking that destroys structure
The naive chunker splits documents into 500-token blocks. The structure of the original document (headings, sections, lists, tables) gets shredded in the process. A query that should retrieve a specific subsection retrieves the middle of an unrelated paragraph that happens to share keywords.
Symptom: answers reference plausible-sounding context that, on inspection, doesn’t actually contain the information needed. The chunks pulled back contain related vocabulary but not the relevant fact.
Fix: chunk on document structure first, token count second. Preserve metadata about the section so the model knows what context it’s looking at. For tables and code blocks, chunk them as units.
Concretely, the chunking strategy we use as a default:
# Bad: fixed-size chunking with arbitrary token cutoffs
def chunk_naive(text, size=500, overlap=50):
    tokens = tokenize(text)
    return [tokens[i:i+size] for i in range(0, len(tokens), size-overlap)]

# Better: structure-aware chunking with metadata
def chunk_structured(doc):
    chunks = []
    for section in doc.sections:  # split on headings first
        if section.token_count <= 800:  # small enough, keep whole
            chunks.append({
                "text": section.text,
                "heading": section.heading,
                "section_path": section.path,  # e.g. ["Guides", "Auth", "OAuth"]
                "doc_id": doc.id,
            })
        else:
            for sub in split_paragraphs(section.text, target=600, max=800):
                chunks.append({
                    "text": sub,
                    "heading": section.heading,
                    "section_path": section.path,
                    "doc_id": doc.id,
                })
    return chunks
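The split_paragraphs helper called above is not defined in this post. One possible sketch follows, under two loud assumptions: whitespace-separated words stand in for model tokens, and paragraphs are separated by blank lines. In practice you would swap count_tokens for your embedding model's tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def split_paragraphs(text: str, target: int = 600, max: int = 800) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks near `target` tokens.

    The parameter is named `max` only to match the call in chunk_structured;
    it shadows the builtin inside this function, which is not used here.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = para.split()
        # A single paragraph above the hard cap is cut into cap-sized pieces.
        pieces = [" ".join(words[i:i + max]) for i in range(0, len(words), max)]
        for piece in pieces:
            n = count_tokens(piece)
            # Close the current chunk once it has reached the target,
            # or when adding this piece would exceed the hard cap.
            if current and (current_len >= target or current_len + n > max):
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(piece)
            current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Packing whole paragraphs keeps sentence boundaries intact, at the price of chunks that vary in size between the target and the cap.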
A few specifics that matter:
- Target chunk size of 600 tokens, hard cap at 800. For most current embedding models (OpenAI text-embedding-3-large, Cohere Embed v3, Voyage 3, Jina v3), 600-800 is the sweet spot. Larger and the embedding becomes too averaged; smaller and you lose context.
- Keep section headings inside the chunk. “Authentication > OAuth > Refresh tokens” prepended to the chunk text gives the retriever a useful keyword signal and tells the model where the chunk came from.
- Special-case tables and code blocks. Markdown tables broken across chunks become unreadable. Same for code. We extract them as atomic units and link to the surrounding context via metadata.
- For long-form documents (legal contracts, SEC filings), use semantic chunking. Libraries like langchain_experimental.text_splitter.SemanticChunker (or rolling-window embedding similarity) split where the topic shifts, not where the token count hits.
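As an illustration of keeping section headings inside the chunk, here is a minimal sketch that builds the text to embed from a chunk dict shaped like the ones chunk_structured produces. The function name and the blank-line layout between path and body are assumptions, not a fixed convention.

```python
def text_for_embedding(chunk: dict, separator: str = " > ") -> str:
    """Prefix the chunk body with its section path before embedding."""
    path = separator.join(chunk.get("section_path", []))
    # Chunks without a section path are embedded as-is.
    return f"{path}\n\n{chunk['text']}" if path else chunk["text"]


example = {
    "text": "Refresh tokens expire after 30 days.",
    "section_path": ["Authentication", "OAuth", "Refresh tokens"],
}
print(text_for_embedding(example))
```

The same string can be shown to the model at answer time, so the retriever and the LLM see identical context.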
The single highest-leverage change in most underperforming RAG systems is replacing naive chunking with structure-aware chunking. We’ve seen relevance@5 jump 20-40 percentage points from this alone.
2. Pure semantic search where keyword search would win
Vector search is great at finding semantically similar content. It is bad at finding exact matches. A user searching for “GPT-4 Turbo” expects results that contain that exact string. Pure semantic search will return things that are “about GPT models” and miss the specific match.
Symptom: queries that contain product names, function names, version numbers, or proper nouns return irrelevant-feeling results. Queries that paraphrase well return relevant results.
Fix: hybrid retrieval. Vector search plus BM25 (or similar keyword search) with a re-ranker on top.
The pattern that works in production:
1. Vector search: pull top 50 candidates by cosine similarity
2. BM25 search: pull top 50 candidates by keyword match
3. Reciprocal rank fusion (RRF) or weighted union: dedupe + merge into ~80 candidates
4. Re-rank with cross-encoder: top 5 returned to LLM
RRF is the simplest blend and works well: score(doc) = sum(1 / (k + rank_in_method)) across each retriever, where k=60 is the standard constant. No tuning required.
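The RRF formula is short enough to write out in full. A minimal sketch, assuming each retriever returns an ordered list of document ids with the best match first; the function name and the sample ids are illustrative.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; documents found by both retrievers rise.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = rrf_fuse([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['a', 'c', 'b', 'd']
```

Because only ranks are used, the raw cosine and BM25 scores never need to be put on a common scale.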
Tool choices we use:
| Layer | Default choice | When to swap |
|---|---|---|
| Embeddings | OpenAI text-embedding-3-large or Voyage voyage-3 | Use Cohere Embed v3 for multilingual; use a fine-tuned bi-encoder for very domain-specific text |
| BM25 | Postgres tsvector if you already run pg; otherwise Elasticsearch or OpenSearch | Tantivy/Meilisearch for self-hosted lightweight setups |
| Fusion | RRF with k=60 | Linear-weighted fusion if you have eval data to tune weights |
If you’re on Postgres, the whole stack can live in one database with pgvector for vector search and tsvector for keyword search. We’ve shipped this pattern at production scale and it removes an entire moving part.
3. No re-ranking
The top 5 results from a vector search are not the top 5 most relevant results. They are the top 5 most similar by vector distance. These are different things in practice.
Symptom: retrieval looks like it’s working (the right doc is in top 20) but the LLM is given the wrong subset (the right doc is at rank 17, the top 5 are tangentially related).
Fix: a re-ranker on top of the initial retrieval. Cross-encoders score query-document pairs jointly rather than embedding each separately, and the quality improvement is consistent across most domains.
Re-ranker comparison (April 2026 snapshot):
| Model | Quality (BEIR avg) | Latency (top 50) | Cost | When to use |
|---|---|---|---|---|
| Cohere Rerank v3.5 | ~0.55 | ~150ms | $1.00/1k searches | Default, easy API |
| Voyage rerank-2 | ~0.54 | ~140ms | $1.20/1k | Pairs well with Voyage embeddings |
| Jina Reranker v2 | ~0.52 | ~120ms | $0.50/1k | Cheaper option, slightly weaker |
| bge-reranker-v2-m3 (open) | ~0.50 | ~80ms (GPU) | Self-host | Self-hosted, no vendor lock |
| mxbai-rerank-large-v2 (open) | ~0.49 | ~70ms (GPU) | Self-host | Same |
For most teams, Cohere Rerank v3.5 is the right default. Plug it in over a weekend, get a measurable quality lift, decide later if the cost justifies self-hosting.
A correct implementation re-ranks after fusion, not before. Re-ranking only vector results misses keyword-strong candidates; re-ranking only keyword results misses semantically-strong ones. Fuse first, then re-rank the merged pool.
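The ordering can be sketched as follows. This is a simplified illustration: the merge is a plain deduplicated union rather than RRF, and word_overlap is a toy scoring function standing in for a real cross-encoder call, which would go in its place.

```python
from typing import Callable


def rerank_fused(
    query: str,
    vector_hits: list[str],
    keyword_hits: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    # Merge both candidate lists, dropping duplicates in first-seen order.
    pool = list(dict.fromkeys(vector_hits + keyword_hits))
    # Score every (query, candidate) pair jointly, then keep the best top_n.
    ranked = sorted(pool, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_n]


def word_overlap(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: share of query words found in doc.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)


top = rerank_fused(
    "gpt-4 turbo pricing",
    vector_hits=["overview of gpt models", "pricing for language models"],
    keyword_hits=["gpt-4 turbo pricing table", "overview of gpt models"],
    score_pair=word_overlap,
    top_n=2,
)
print(top)
```

The keyword-only candidate wins here, which is the case that re-ranking vector results alone would never surface.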
4. Context window stuffing without ranking
A common pattern: retrieve 20 chunks, stuff them all into the prompt, hope the model uses the relevant ones. The model is now distracted by 18 irrelevant chunks. Answer quality drops.
Symptom: with small context (3-5 chunks), answers are good but sometimes miss information. The team adds more chunks. Answers get worse. The team adds even more. Answers get noticeably worse and slower.
Fix: smaller, ranked context. Three to five carefully chosen chunks beat twenty chunks dumped in. Modern long-context models (Claude 4.6, GPT-5, Gemini 2.5) tolerate more padding but don’t reward it.
This is the “lost in the middle” effect, formalized in the Liu et al. paper of the same name. Models attend most strongly to the start and end of the context window. Information in the middle gets weighted less. With 20 chunks, the relevant one is likely to land in the middle.
What we do in production:
- Return top 3-5 chunks after re-ranking. Not top 20.
- Put the most relevant chunk last (closest to the question). It is counterintuitive, but it works.
- Include a brief “source” header before each chunk so the model knows the boundaries.
- For multi-document answers (where the LLM should synthesize across docs), bias toward 5-7 chunks. For single-fact answers, 3 is plenty.
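A minimal sketch of that context assembly, assuming the re-ranker returns chunks best first and each chunk carries a source_id. The "[source: ...]" header format is an illustrative choice, not a required syntax.

```python
def build_context(ranked_chunks: list[dict], max_chunks: int = 5) -> str:
    """Take the top chunks (best first) and emit them best last, with headers."""
    selected = ranked_chunks[:max_chunks]
    # Reverse so the top-ranked chunk sits last, right before the question.
    blocks = [
        f"[source: {chunk['source_id']}]\n{chunk['text']}"
        for chunk in reversed(selected)
    ]
    return "\n\n".join(blocks)


ranked = [
    {"source_id": "A1", "text": "best match"},
    {"source_id": "B1", "text": "second match"},
    {"source_id": "C1", "text": "third match"},
]
print(build_context(ranked, max_chunks=2))
```

Setting max_chunks per query type (3 for single-fact lookups, 5-7 for synthesis) keeps the cap in one place.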
If your team is dumping 15+ chunks into the prompt because “more context is better,” that’s costing you both money and quality.
5. Evals against synthetic queries
The team builds an eval suite by asking a model to generate questions about the documents. The eval passes. The system ships. Real users ask questions that are nothing like the synthetic ones, and the answers are bad.
Symptom: internal evals look great. Production user satisfaction is mediocre. The gap is wide enough that the team starts questioning the evals.
Fix: a real query set. Log production queries (even from a beta or pilot), annotate a sample, use that as the eval suite.
The minimum viable eval setup we recommend for a new RAG system:
# load_recent_queries, annotate_queries, and evaluate are placeholders for your own helpers.
# 1. Sample 100 production queries (real, not synthetic)
queries = load_recent_queries(days=14, sample=100)
# 2. Annotate: for each, the ideal retrieved chunks (1-3)
#    Have a domain expert do this manually. It takes about a day.
annotated = annotate_queries(queries)
# 3. Define metrics that match what matters
metrics = {
    "recall@5": "fraction of queries where >=1 ideal chunk is in the top 5",
    "mrr": "mean reciprocal rank of the first ideal chunk",
    "ndcg@5": "discounted gain accounting for rank order",
    "answer_acc": "LLM-judge or human eval on generated answers",
}
# 4. Run on every retrieval/model change
for run_id in ["baseline", "new_chunker", "new_reranker"]:
    results = evaluate(run_id, annotated, metrics)
    print(f"{run_id}: {results}")
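The three retrieval metrics named above can be made concrete in a few lines. This is a minimal sketch assuming binary relevance (a chunk either is one of the annotated ideal chunks or is not); the function names are illustrative and not taken from any eval framework.

```python
import math


def recall_at_k(retrieved: list[str], ideal: set[str], k: int = 5) -> float:
    # Hit-rate form used above: 1.0 if any ideal chunk is in the top k.
    return 1.0 if any(c in ideal for c in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: list[str], ideal: set[str]) -> float:
    # 1 / rank of the first ideal chunk, or 0.0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in ideal:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], ideal: set[str], k: int = 5) -> float:
    # Binary gain of 1 per ideal chunk, discounted by log2(rank + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in ideal
    )
    # Best achievable DCG: all ideal chunks packed into the top ranks.
    best = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(ideal), k) + 1))
    return dcg / best if best else 0.0


def evaluate_retrieval(runs: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    """Average each metric over (retrieved ids, ideal ids) pairs."""
    n = len(runs)
    return {
        "recall@5": sum(recall_at_k(r, i) for r, i in runs) / n,
        "mrr": sum(reciprocal_rank(r, i) for r, i in runs) / n,
        "ndcg@5": sum(ndcg_at_k(r, i) for r, i in runs) / n,
    }
```

Answer accuracy is left out on purpose: it needs a judge (human or LLM) and belongs in the end-to-end eval, not the retrieval one.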
Tooling: PromptFoo, Inspect (UK AI Security Institute), and Phoenix (Arize) all work for this. The framework choice matters less than the discipline of running evals before shipping retrieval changes.
A few realities about evals:
- Synthetic queries are okay as a starting point. They’re better than no evals. They’re not a substitute for real queries.
- Plan to annotate 100-300 queries. Below 100, your metrics will be too noisy. Above 300, the returns diminish.
- Re-annotate periodically. Users’ query patterns shift as they get more comfortable with the product.
- Separate retrieval evals from end-to-end answer evals. When the answer is wrong, you need to know whether retrieval failed or the model failed.
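To put a rough number on "too noisy", here is a back-of-envelope sketch that treats recall@5 as a proportion measured on independent queries and uses the normal approximation for its 95% interval. Both are simplifying assumptions, and the 0.80 recall figure is purely illustrative.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (50, 100, 300):
    print(n, round(margin_of_error(0.80, n), 3))
```

At a measured recall of 0.80 this gives roughly plus or minus 11 points with 50 queries, 8 points with 100, and 4.5 points with 300, so a 5-point improvement is hard to distinguish from noise on a small set.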
6. No grounding signal in the output
The model returns an answer with full confidence. The user has no way to tell whether the answer is grounded in the retrieved documents or was synthesized from the model’s prior knowledge. This is exactly where hallucinations sneak in.
Symptom: users start catching factual errors that the system was confident about. Trust erodes faster than the team expects.
Fix: make the model cite. Inline citations to specific chunks, with the source document accessible. When the model is forced to attribute, it tends to either ground its claims or admit it can’t.
The prompt pattern we use (simplified):
SYSTEM: You answer questions based only on the provided context.
For each factual claim in your answer, cite the source chunk using
[^id] where id is the chunk's source_id. If the context does not
contain enough information to answer, say so explicitly.
CONTEXT:
[chunk source_id=A1] {chunk_text}
[chunk source_id=A2] {chunk_text}
[chunk source_id=B1] {chunk_text}
QUESTION: {user_question}
Three downstream consequences:
- Hallucinations drop. Anthropic published internal data on Claude 4.5 showing citation-required prompts roughly halve hallucinated facts on factuality benchmarks.
- User trust improves. Users can click a citation to verify. The product feels more trustworthy.
- Failure modes become inspectable. When a wrong answer cites a real chunk, you find a retrieval problem. When it cites nothing, you find a prompting problem. Different fixes.
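The inspection step can be partly automated. A minimal sketch, assuming answers use the [^id] citation format from the prompt above; the function name and the returned field names are illustrative.

```python
import re

# Matches citations like [^A1] and captures the chunk id.
CITATION = re.compile(r"\[\^([A-Za-z0-9_-]+)\]")


def check_citations(answer: str, context_ids: set[str]) -> dict:
    """Compare the ids cited in an answer with the ids actually supplied."""
    cited = set(CITATION.findall(answer))
    return {
        "cited": sorted(cited & context_ids),    # grounded in supplied chunks
        "unknown": sorted(cited - context_ids),  # ids that were never supplied
        "uncited": not cited,                    # answer cites nothing at all
    }


report = check_citations(
    "Refresh tokens expire after 30 days [^A1]. See also [^Z9].",
    context_ids={"A1", "A2", "B1"},
)
print(report)
```

An answer flagged as uncited points at the prompt; a wrong answer with valid cited ids points at retrieval; unknown ids suggest the model invented a source.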
7. Stale indexes
The vector database was indexed three months ago. Documents have been updated since. The system is confidently answering questions with information that’s out of date. The user does not know.
Symptom: correct-looking answers that are based on superseded information. Most common in product docs, internal wikis, and policy documents.
Fix: incremental indexing on a schedule, with a “last indexed” timestamp surfaced to the user where appropriate. For documents with high churn (product docs, internal wikis), continuous indexing.
The implementation pattern that has worked across our deployments:
- Track document versions at ingestion. Every chunk has source_id, version, and indexed_at metadata.
- Webhook or polling on the source. When a document changes, re-chunk just that document and replace its chunks in the index. Don’t reindex the world.
- Soft delete with grace period. Don’t immediately remove old chunks; mark them and remove after 24-48 hours. Avoids race conditions during reindexing.
- Show indexed-at in the UI. “Based on docs from [date]” or similar. Users handle “this might be stale” much better than they handle confidently-wrong.
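The version tracking and soft-delete steps can be sketched with an in-memory stand-in for the vector store. The class, its method names, and the deleted_at field are hypothetical; a real deployment would issue the equivalent upserts and deletes against its own database.

```python
from datetime import datetime, timedelta, timezone


class ChunkIndex:
    """In-memory stand-in for a vector store with soft deletes."""

    def __init__(self, grace: timedelta = timedelta(hours=24)):
        self.rows: list[dict] = []
        self.grace = grace

    def upsert_document(self, source_id: str, version: str,
                        texts: list[str], now: datetime) -> None:
        # Soft-delete the live chunks of this document instead of removing them.
        for row in self.rows:
            if row["source_id"] == source_id and row["deleted_at"] is None:
                row["deleted_at"] = now
        # Insert the re-chunked text for the new version only.
        for text in texts:
            self.rows.append({
                "source_id": source_id, "version": version, "text": text,
                "indexed_at": now, "deleted_at": None,
            })

    def live(self) -> list[dict]:
        # Retrieval should only ever see chunks that are not soft-deleted.
        return [r for r in self.rows if r["deleted_at"] is None]

    def purge(self, now: datetime) -> int:
        # Hard-delete only chunks whose grace period has fully elapsed.
        keep = [r for r in self.rows
                if r["deleted_at"] is None or now - r["deleted_at"] < self.grace]
        removed = len(self.rows) - len(keep)
        self.rows = keep
        return removed


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
index = ChunkIndex()
index.upsert_document("doc_a4f12", "v1", ["old text"], now=t0)
index.upsert_document("doc_a4f12", "v2", ["new text"], now=t0 + timedelta(hours=1))
print([row["version"] for row in index.live()])  # ['v2']
```

Only the changed document is touched, and the indexed_at value on each live row is what the UI can surface as the "based on docs from" date.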
8. The metadata gap
Document metadata (date, author, source, version, access level) is dropped during chunking. The retrieval returns content but no provenance. The model can’t tell the user where the information came from, and the team can’t filter results by recency, source, or version.
Symptom: can’t answer “show me only the latest version” or “filter to docs from the engineering team.” The vector DB returns chunks without context the application needs.
Fix: preserve metadata as part of each chunk’s embedding metadata. Use it for filtering at retrieval time and for surfacing source in the output.
chunk = {
    "text": "...",
    "embedding": [...],
    "metadata": {
        "source_id": "doc_a4f12",
        "section_path": ["Guides", "Auth", "OAuth"],
        "doc_title": "OAuth Integration Guide",
        "doc_url": "/docs/auth/oauth",
        "doc_author": "platform-team",
        "doc_version": "v2.3.1",
        "published_at": "2026-02-14",
        "access_level": "public",
        "tags": ["auth", "oauth", "integration"],
    }
}
At retrieval time, the query can filter on metadata before semantic ranking:
-- pgvector example
SELECT chunk_id, text, embedding <=> $query_embedding AS distance
FROM chunks
WHERE metadata->>'access_level' = 'public'
AND (metadata->>'published_at')::date > '2025-01-01'
AND metadata->>'doc_author' = ANY($allowed_authors)
ORDER BY distance
LIMIT 50;
This is also how you implement permission-aware retrieval for multi-tenant systems. The metadata filter happens before vector ranking, so users only see chunks they’re authorized to see.
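The same filter-then-rank order can be shown outside the database. A minimal sketch over in-memory chunks, assuming distances to the query have already been computed; the chunk_id key, the function names, and the sample values are illustrative.

```python
from datetime import date


def allowed(chunk: dict, user_levels: set[str], published_after: date) -> bool:
    meta = chunk["metadata"]
    return (
        meta["access_level"] in user_levels
        and date.fromisoformat(meta["published_at"]) > published_after
    )


def filter_then_rank(chunks: list[dict], distances: dict[str, float],
                     user_levels: set[str], published_after: date,
                     limit: int = 50) -> list[str]:
    # Drop unauthorized or outdated chunks first, then order by distance.
    candidates = [
        (distances[c["chunk_id"]], c["chunk_id"])
        for c in chunks
        if allowed(c, user_levels, published_after)
    ]
    return [chunk_id for _, chunk_id in sorted(candidates)[:limit]]


chunks = [
    {"chunk_id": "c1", "metadata": {"access_level": "public", "published_at": "2026-02-14"}},
    {"chunk_id": "c2", "metadata": {"access_level": "internal", "published_at": "2026-02-14"}},
    {"chunk_id": "c3", "metadata": {"access_level": "public", "published_at": "2024-06-01"}},
]
distances = {"c1": 0.30, "c2": 0.05, "c3": 0.10}
print(filter_then_rank(chunks, distances, {"public"}, date(2025, 1, 1)))  # ['c1']
```

The closest chunk (c2) never reaches the ranking step because the user lacks access, which is the property permission-aware retrieval depends on.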
The diagnostic order we use
When a client comes to us with a production RAG system that’s underperforming, the order is consistent:
- Look at real query logs first. What are users actually asking? Sample 30-50 recent queries.
- Pick five failed queries. Walk through the retrieval, manually. What did it return? What should it have returned? Use the vector DB’s debug interface to inspect top 20, not just top 5.
- Identify which failure modes are present. Usually two or three at a time. The eight modes above cover almost everything.
- Fix retrieval first. The model is almost never the problem at this stage.
- Re-run evals. Quantify the improvement before moving on.
- Then look at prompting. Citation prompt, format prompt, refusal prompt.
- Then consider a model swap. Almost always the last lever.
A surprising share of “the LLM is wrong” complaints get fixed by changing the retrieval, not the model. We’ve shipped production RAG systems where the model is GPT-4o-mini and the answer quality matches systems running Claude 4.6, purely because the retrieval is doing more work.
What a production RAG system looks like
The setup we’d build today for a new client, defaults for a medium-complexity domain:
| Component | Choice |
|---|---|
| Document loader | Unstructured.io for PDFs/HTML; custom for structured sources |
| Chunker | Structure-aware, target 600 tokens, max 800 |
| Embeddings | OpenAI text-embedding-3-large (3072d) or Voyage voyage-3 (1024d) |
| Vector DB | pgvector if on Postgres; Qdrant otherwise |
| Keyword search | Postgres tsvector or BM25 via Tantivy |
| Fusion | RRF, k=60 |
| Re-ranker | Cohere Rerank v3.5 |
| LLM | Claude 4.5 Sonnet or GPT-5 (task-dependent) |
| Tracing | LangSmith, Phoenix, or Helicone (pick one) |
| Evals | 100+ real annotated queries, automated runs on every change |
| Citations | Required in system prompt, rendered as clickable in UI |
| Re-indexing | Webhook-driven, with indexed_at metadata |
None of this is exotic. The systems that ship all of it are uncommon. This is also the shape of the AI work we ship with clients building on production-grade retrieval.
A note on agents and “agentic RAG”
A common 2026 trend: replacing the retrieval pipeline above with an “agent” that decides what to retrieve. Tool-using LLMs that can call a search function multiple times, reason about results, and refine queries.
This works for some problems (complex multi-hop questions) and is overkill for most (single-fact lookups). The cost is latency and predictability. A single-shot retrieval pipeline returns in 300-500ms. An agent loop returns in 3-10 seconds.
Our default recommendation: ship single-shot RAG first, with strong evals. Add agentic retrieval as a fallback for the queries that fail in eval, not as a default for every query. Most production RAG systems we see that are running agentic-everywhere would be both faster and cheaper as single-shot with better retrieval.
Closing
If your RAG system feels off and you can’t pin down why, the answer is almost always in the retrieval layer. The eight failure modes above cover the great majority of what we see. Most are fixable in a sprint, not a quarter.
If you want a fresh pair of eyes on a production system, schedule a call. We’ll walk through the diagnostic order against your actual queries and identify the highest-leverage fix.
Key takeaways
- Structure-aware chunking with 600-800 token chunks and section-path metadata can lift relevance@5 by 20-40 percentage points over naive fixed-size chunking.
- Hybrid retrieval (vector + BM25 + reciprocal rank fusion with k=60) plus a cross-encoder re-ranker beats pure semantic search across almost every domain.
- Top 3-5 ranked chunks beat 20 stuffed chunks because of the lost-in-the-middle effect. Put the most relevant chunk last, closest to the question.
- Evaluate against 100-300 real annotated queries, not synthetic ones. Separate retrieval evals from end-to-end answer evals so you know what failed.
- Citation-required prompts roughly halve hallucinated facts and make failures inspectable: wrong answer citing a real chunk is a retrieval problem; citing nothing is a prompting problem.
- Fix retrieval first, then prompting, and consider a model swap last. Most 'the LLM is wrong' complaints get solved by changing the retrieval.
Frequently asked
Why is my RAG pipeline returning wrong or irrelevant answers?
RAG in production is mostly a retrieval problem with a language model bolted on top. The eight most common failure modes are naive chunking that destroys structure, pure semantic search where keyword search would win, no re-ranking, context window stuffing without ranking, evals against synthetic queries instead of real ones, no citation grounding, stale indexes, and dropped document metadata. Fix retrieval before swapping the model; most 'the LLM is wrong' complaints are actually retrieval failures.
What chunking strategy works best for RAG?
Structure-aware chunking with a target of 600 tokens and a hard cap at 800. Split on document headings first, then on paragraphs if a section is too large. Keep section headings inside the chunk as a path like 'Authentication > OAuth > Refresh tokens' so both the retriever and the model know where the chunk came from. Special-case tables and code blocks as atomic units. For long-form documents, use semantic chunking that splits where the topic shifts.
Do I need a re-ranker in my RAG pipeline?
Yes. The top 5 results from vector search are the top 5 most similar by vector distance, not the top 5 most relevant; these are different things in practice. Use a cross-encoder re-ranker after fusing vector and keyword candidates. Cohere Rerank v3.5 is the right default for most teams (~$1.00 per 1k searches, ~150ms latency, ~0.55 BEIR average). Self-host bge-reranker-v2-m3 if you need to avoid vendor lock-in.
How many chunks should I put in the LLM context for RAG?
Three to five carefully re-ranked chunks, not twenty. Modern long-context models (Claude 4.6, GPT-5, Gemini 2.5) tolerate more padding but don't reward it. The 'lost in the middle' effect means information in the middle of a long context gets weighted less by the model. Put the most relevant chunk last, closest to the question, and include a brief source header before each chunk so the model knows the boundaries.
Should I evaluate my RAG system with synthetic or real queries?
Real queries, annotated by a domain expert. Synthetic queries from an LLM are okay as a starting point but produce evals that pass while production users still get bad answers. Sample 100-300 production queries, annotate the ideal retrieved chunks for each, and measure recall@5, MRR, nDCG@5, and answer accuracy. Below 100 queries the metrics are too noisy; above 300 the returns diminish. Re-annotate periodically as user query patterns shift.