AI · 16 min read

Why Your RAG Pipeline Feels Off: Eight Failure Modes Production Teams Keep Hitting

Retrieval-augmented generation is the most common AI architecture in production right now and the one most often built wrong. The eight failure modes we keep seeing, with concrete fixes, tool recommendations, and the diagnostic order to apply them.

Hooman Digital Senior design + engineering studio for AI, Web3, developer products

    Retrieval-augmented generation looks deceptively simple in a diagram. Documents in a vector database, query gets embedded, top-k results stuffed into a prompt, model returns an answer. A weekend hackathon project. A demo that works.

    Then it ships, and the team discovers that the answers are confidently wrong about 30% to 50% of the time. The model isn’t the problem. The retrieval is. RAG in production is mostly a retrieval problem with a language model bolted on top, and most of the failure modes live in the retrieval layer.

    We’ve debugged enough production RAG systems at this point to recognize the patterns, and the same eight failure modes account for somewhere around 90% of the problems we see. Each one is covered below with the diagnostic, the fix, and the specific tools and parameter choices that have worked in production.

    How to use this post

    Two ways. If you’re triaging a system that’s already shipping, skip to the diagnostic order section at the bottom, work through the queries listed there, then jump to the relevant failure mode here. If you’re building a new system, read top to bottom; the failures show up roughly in the order listed.

    A note on benchmarks: most of the parameter recommendations below come from internal evals on our client work plus public benchmarks from the BEIR retrieval suite, the MTEB embedding benchmark, and the LegalBench-RAG / FinanceBench task-specific evals. None of these substitute for evaluating on your own data, which is point #5 below.


    1. Chunking that destroys structure

    The naive chunker splits documents into 500-token blocks. The structure of the original document (headings, sections, lists, tables) gets shredded in the process. A query that should retrieve a specific subsection retrieves the middle of an unrelated paragraph that happens to share keywords.

    Symptom: answers reference plausible-sounding context that, on inspection, doesn’t actually contain the information needed. The chunks pulled back contain related vocabulary but not the relevant fact.

    Fix: chunk on document structure first, token count second. Preserve metadata about the section so the model knows what context it’s looking at. For tables and code blocks, chunk them as units.

    Concretely, the chunking strategy we use as a default:

    # Bad: fixed-size chunking with arbitrary token cutoffs
    def chunk_naive(text, size=500, overlap=50):
        tokens = tokenize(text)
        return [tokens[i:i+size] for i in range(0, len(tokens), size-overlap)]
    
    # Better: structure-aware chunking with metadata
    def chunk_structured(doc):
        chunks = []
        for section in doc.sections:           # split on headings first
            if section.token_count <= 800:     # small enough, keep whole
                chunks.append({
                    "text": section.text,
                    "heading": section.heading,
                    "section_path": section.path,   # e.g. ["Guides", "Auth", "OAuth"]
                    "doc_id": doc.id,
                })
            else:
                for sub in split_paragraphs(section.text, target=600, max=800):
                    chunks.append({
                        "text": sub,
                        "heading": section.heading,
                        "section_path": section.path,
                        "doc_id": doc.id,
                    })
        return chunks
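    The structured chunker above calls a split_paragraphs helper that is not shown. A minimal sketch of what it could look like follows, assuming paragraphs are separated by blank lines and approximating token counts by whitespace-separated words (count_tokens is a hypothetical stand-in; swap in your embedding model's tokenizer). The parameter is named max only to match the call site above.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: word count as a rough proxy for tokens, not a real tokenizer.
    return len(text.split())


def split_paragraphs(text: str, target: int = 600, max: int = 800) -> list[str]:
    """Greedily pack paragraphs into chunks of about `target` tokens, never above `max`."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        n = count_tokens(para)
        if n > max:
            # A single oversized paragraph: flush what we have, then hard-split on words.
            if current:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            words = para.split()
            for i in range(0, len(words), max):
                chunks.append(" ".join(words[i:i + max]))
            continue
        if current and current_len + n > target:
            # Adding this paragraph would overshoot the target: start a new chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

    Paragraph boundaries are kept wherever possible; the hard word-level split only applies to a single paragraph that is larger than the cap on its own.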

    A few specifics that matter:

    • Target chunk size of 600 tokens, hard cap at 800. For most current embedding models (OpenAI text-embedding-3-large, Cohere Embed v3, Voyage 3, Jina v3), 600-800 is the sweet spot. Larger and the embedding becomes too averaged; smaller and you lose context.
    • Keep section headings inside the chunk. “Authentication > OAuth > Refresh tokens” prepended to the chunk text gives the retriever a useful keyword signal and tells the model where the chunk came from.
    • Special-case tables and code blocks. Markdown tables broken across chunks become unreadable. Same for code. We extract them as atomic units and link to the surrounding context via metadata.
    • For long-form documents (legal contracts, SEC filings), use semantic chunking. Libraries like langchain_experimental.text_splitter.SemanticChunker (or rolling-window embedding similarity) split where the topic shifts, not where the token count hits a limit.

    The single highest-leverage change in most underperforming RAG systems is replacing naive chunking with structure-aware chunking. We’ve seen relevance@5 jump 20-40 percentage points from this alone.


    2. Pure semantic search where keyword search would win

    Vector search is great at finding semantically similar content. It is bad at finding exact matches. A user searching for “GPT-4 Turbo” expects results that contain that exact string. Pure semantic search will return things that are “about GPT models” and miss the specific match.

    Symptom: queries that contain product names, function names, version numbers, or proper nouns return irrelevant-feeling results. Queries that paraphrase well return relevant results.

    Fix: hybrid retrieval. Vector search plus BM25 (or similar keyword search) with a re-ranker on top.

    The pattern that works in production:

    1. Vector search:      pull top 50 candidates by cosine similarity
    2. BM25 search:        pull top 50 candidates by keyword match
    3. Reciprocal rank fusion (RRF) or weighted union: dedupe + merge into ~80 candidates
    4. Re-rank with cross-encoder: top 5 returned to LLM

    RRF is the simplest blend and works well: score(doc) = sum(1 / (k + rank_in_method)) across each retriever, where k=60 is the standard constant. No tuning required.
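    The RRF formula above fits in a few lines. This is a minimal sketch, assuming each retriever returns an ordered list of document ids (best first); the function name and the sample ids are illustrative, not taken from any particular library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings: score(doc) = sum of 1 / (k + rank) per retriever."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; documents found by both retrievers rise to the top.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative inputs: ids returned by vector search and by BM25.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

    Because only ranks are used, the raw cosine and BM25 scores never need to be normalized against each other.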

    Tool choices we use:

    | Layer | Default choice | When to swap |
    |---|---|---|
    | Embeddings | OpenAI text-embedding-3-large or Voyage voyage-3 | Use Cohere Embed v3 for multilingual; use a fine-tuned bi-encoder for very domain-specific text |
    | BM25 | Postgres tsvector if you already run pg; otherwise Elasticsearch or OpenSearch | Tantivy/Meilisearch for self-hosted lightweight setups |
    | Fusion | RRF with k=60 | Linear-weighted fusion if you have eval data to tune weights |

    If you’re on Postgres, the whole stack can live in one database with pgvector for vector search and tsvector for keyword search. We’ve shipped this pattern at production scale and it removes an entire moving part.


    3. No re-ranking

    The top 5 results from a vector search are not the top 5 most relevant results. They are the top 5 most similar by vector distance. These are different things in practice.

    Symptom: retrieval looks like it’s working (the right doc is in top 20) but the LLM is given the wrong subset (the right doc is at rank 17, the top 5 are tangentially related).

    Fix: a re-ranker on top of the initial retrieval. Cross-encoders score query-document pairs jointly rather than embedding each separately, and the quality improvement is consistent across most domains.

    Re-ranker comparison (April 2026 snapshot):

    | Model | Quality (BEIR avg) | Latency (top 50) | Cost | When to use |
    |---|---|---|---|---|
    | Cohere Rerank v3.5 | ~0.55 | ~150ms | $1.00/1k searches | Default, easy API |
    | Voyage rerank-2 | ~0.54 | ~140ms | $1.20/1k | Pairs well with Voyage embeddings |
    | Jina Reranker v2 | ~0.52 | ~120ms | $0.50/1k | Cheaper option, slightly weaker |
    | bge-reranker-v2-m3 (open) | ~0.50 | ~80ms (GPU) | Self-host | Self-hosted, no vendor lock |
    | mxbai-rerank-large-v2 (open) | ~0.49 | ~70ms (GPU) | Self-host | Same |

    For most teams, Cohere Rerank v3.5 is the right default. Plug it in over a weekend, get a measurable quality lift, decide later if the cost justifies self-hosting.

    A correct implementation re-ranks after fusion, not before. Re-ranking only vector results misses keyword-strong candidates; re-ranking only keyword results misses semantically-strong ones. Fuse first, then re-rank the merged pool.
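    The ordering can be sketched as follows. The scorer is pluggable: in production it would be a cross-encoder or a hosted re-rank API, while overlap_score here is a deliberately crude stand-in (shared-word fraction) so the example runs without a model. All names and sample chunks are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: dict[str, str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair in the merged pool and keep the best top_n ids."""
    scored = [(score_fn(query, text), chunk_id) for chunk_id, text in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query: str, text: str) -> float:
    # Stand-in scorer (NOT a cross-encoder): fraction of query words present in the chunk.
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


# Merge the vector and keyword pools first (dict merge dedupes by chunk id) ...
vector_pool = {
    "v1": "how to rotate oauth refresh tokens",
    "v2": "overview of authentication methods",
}
keyword_pool = {
    "k1": "refresh tokens expire after 30 days",
    "v1": "how to rotate oauth refresh tokens",
}
merged = {**vector_pool, **keyword_pool}

# ... then re-rank the merged pool, so candidates from both retrievers compete.
top = rerank("rotate refresh tokens", merged, overlap_score, top_n=2)
print(top)  # ['v1', 'k1']
```

    Note that k1 only exists in the keyword pool; re-ranking the vector results alone would never have surfaced it.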


    4. Context window stuffing without ranking

    A common pattern: retrieve 20 chunks, stuff them all into the prompt, hope the model uses the relevant ones. The model is now distracted by 18 irrelevant chunks. Answer quality drops.

    Symptom: with small context (3-5 chunks), answers are good but sometimes miss information. The team adds more chunks. Answers get worse. The team adds even more. Answers get noticeably worse and slower.

    Fix: smaller, ranked context. Three to five carefully chosen chunks beat twenty chunks dumped in. Modern long-context models (Claude 4.6, GPT-5, Gemini 2.5) tolerate more padding but don’t reward it.

    This is the “lost in the middle” effect, described in the Liu et al. paper of the same name. Models make the best use of information at the start and end of the context window; information in the middle is used less reliably. With 20 chunks, the relevant one often lands in the middle.

    What we do in production:

    • Return top 3-5 chunks after re-ranking. Not top 20.
    • Put the most relevant chunk last (closest to the question). This is counterintuitive, but it works.
    • Include a brief “source” header before each chunk so the model knows the boundaries.
    • For multi-document answers (where the LLM should synthesize across docs), bias toward 5-7 chunks. For single-fact answers, 3 is plenty.
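    The first three points can be combined into one small context builder. This is a sketch under assumptions: chunks arrive best-first from the re-ranker as dicts with source_id, section_path, and text keys (the field names follow the chunk metadata used elsewhere in this post), and the header format is illustrative.

```python
def build_context(ranked_chunks: list[dict], max_chunks: int = 5) -> str:
    """Keep the top chunks, add a source header to each, and place the best chunk last."""
    selected = ranked_chunks[:max_chunks]
    blocks = []
    # Input is best-first; reversing puts the most relevant chunk nearest the question.
    for chunk in reversed(selected):
        path = " > ".join(chunk["section_path"])
        header = f"[source: {chunk['source_id']} | {path}]"
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)


ranked = [
    {"source_id": "A1", "section_path": ["Guides", "Auth"], "text": "best"},
    {"source_id": "B1", "section_path": ["FAQ"], "text": "second"},
    {"source_id": "C1", "section_path": ["Misc"], "text": "third"},
]
print(build_context(ranked, max_chunks=2))
```

    With max_chunks=2 the third chunk is dropped, B1 comes first, and A1 (the top-ranked chunk) closes the context block.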

    If your team is dumping 15+ chunks into the prompt because “more context is better,” that’s costing you both money and quality.


    5. Evals against synthetic queries

    The team builds an eval suite by asking a model to generate questions about the documents. The eval passes. The system ships. Real users ask questions that are nothing like the synthetic ones, and the answers are bad.

    Symptom: internal evals look great. Production user satisfaction is mediocre. The gap is wide enough that the team starts questioning the evals.

    Fix: a real query set. Log production queries (even from a beta or pilot), annotate a sample, use that as the eval suite.

    The minimum viable eval setup we recommend for a new RAG system:

    # 1. Sample 100 production queries (real, not synthetic)
    queries = load_recent_queries(days=14, sample=100)
    
    # 2. Annotate: for each, the ideal retrieved chunks (1-3)
    # Have a domain expert do this manually. It takes about a day.
    annotated = annotate_queries(queries)
    
    # 3. Define metrics that match what matters
    # (descriptions shown as strings; map each name to a scoring function in practice)
    metrics = {
        "recall@5":   "fraction of queries with >=1 ideal chunk in the top 5",
        "mrr":        "mean reciprocal rank of the first ideal chunk",
        "ndcg@5":     "discounted gain accounting for rank order",
        "answer_acc": "LLM-judge or human eval on generated answers",
    }
    
    # 4. Run on every retrieval/model change
    for run_id in ["baseline", "new_chunker", "new_reranker"]:
        results = evaluate(run_id, annotated, metrics)
        print(f"{run_id}: {results}")
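    The three retrieval metrics named above are simple enough to compute by hand. A minimal sketch, assuming binary relevance (a chunk is either one of the annotated ideal chunks or it is not) and using the hit-rate definition of recall@5 given above; the function names are illustrative.

```python
import math


def recall_at_k(retrieved: list[str], ideal: set[str], k: int = 5) -> float:
    """1.0 if at least one ideal chunk appears in the top k, else 0.0."""
    return 1.0 if any(chunk_id in ideal for chunk_id in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: list[str], ideal: set[str]) -> float:
    """1 / rank of the first ideal chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in ideal:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], ideal: set[str], k: int = 5) -> float:
    """Binary-relevance nDCG: DCG of the actual ranking over DCG of a perfect one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in ideal
    )
    best_hits = min(len(ideal), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, best_hits + 1))
    return dcg / idcg if idcg else 0.0


def evaluate_retrieval(runs: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    """runs holds one (retrieved_ids, ideal_ids) pair per annotated query."""
    if not runs:
        raise ValueError("need at least one annotated query")
    n = len(runs)
    return {
        "recall@5": sum(recall_at_k(r, i) for r, i in runs) / n,
        "mrr": sum(reciprocal_rank(r, i) for r, i in runs) / n,
        "ndcg@5": sum(ndcg_at_k(r, i) for r, i in runs) / n,
    }
```

    Answer accuracy is left out on purpose: it needs a judge (human or LLM) and belongs in the separate end-to-end eval.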

    Tooling: Promptfoo, Inspect (from the UK AI Security Institute), and Phoenix (Arize) all work for this. The framework choice matters less than the discipline of running evals before shipping retrieval changes.

    A few realities about evals:

    • Synthetic queries are okay as a starting point. They’re better than no evals. They’re not a substitute for real queries.
    • Plan to annotate 100-300 queries. Below 100, your metrics will be too noisy. Above 300, the returns diminish.
    • Re-annotate periodically. Users’ query patterns shift as they get more comfortable with the product.
    • Separate retrieval evals from end-to-end answer evals. When the answer is wrong, you need to know whether retrieval failed or the model failed.

    6. No grounding signal in the output

    The model returns an answer with full confidence. The user has no way to tell whether the answer is grounded in the retrieved documents or was synthesized from the model’s prior knowledge. This is exactly where hallucinations sneak in.

    Symptom: users start catching factual errors that the system was confident about. Trust erodes faster than the team expects.

    Fix: make the model cite. Inline citations to specific chunks, with the source document accessible. When the model is forced to attribute, it tends to either ground its claims or admit it can’t.

    The prompt pattern we use (simplified):

    SYSTEM: You answer questions based only on the provided context.
    For each factual claim in your answer, cite the source chunk using
    [^id] where id is the chunk's source_id. If the context does not
    contain enough information to answer, say so explicitly.
    
    CONTEXT:
    [chunk source_id=A1] {chunk_text}
    [chunk source_id=A2] {chunk_text}
    [chunk source_id=B1] {chunk_text}
    
    QUESTION: {user_question}

    Three downstream consequences:

    • Hallucinations drop. Anthropic published internal data on Claude 4.5 showing citation-required prompts roughly halve hallucinated facts on factuality benchmarks.
    • User trust improves. Users can click a citation to verify. The product feels more trustworthy.
    • Failure modes become inspectable. When a wrong answer cites a real chunk, you find a retrieval problem. When it cites nothing, you find a prompting problem. Different fixes.
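    To make that last point operational, the citations in each answer can be checked automatically. A minimal sketch, assuming the [^id] marker format from the prompt above and ids made of letters, digits, underscores, and hyphens; the function and key names are illustrative.

```python
import re

# Matches markers like [^A1] and captures the id.
CITATION_PATTERN = re.compile(r"\[\^([A-Za-z0-9_-]+)\]")


def check_citations(answer: str, context_ids: set[str]) -> dict:
    """Classify the citations in an answer against the chunk ids that were in the prompt."""
    cited = set(CITATION_PATTERN.findall(answer))
    return {
        "cited": cited & context_ids,    # citations pointing at real context chunks
        "unknown": cited - context_ids,  # ids that were never in the context
        "uncited": not cited,            # True when the answer cites nothing at all
    }


report = check_citations(
    "Tokens expire after 30 days [^A1]. Rotation is automatic [^Z9].",
    context_ids={"A1", "A2", "B1"},
)
print(report)
```

    An answer with a non-empty "unknown" set or with "uncited" set to True is a candidate for the prompting-side fix; a wrong answer whose citations all resolve points back at retrieval.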

    7. Stale indexes

    The vector database was indexed three months ago. Documents have been updated since. The system is confidently answering questions with information that’s out of date. The user does not know.

    Symptom: correct-looking answers that are based on superseded information. Most common in product docs, internal wikis, and policy documents.

    Fix: incremental indexing on a schedule, with a “last indexed” timestamp surfaced to the user where appropriate. For documents with high churn (product docs, internal wikis), continuous indexing.

    The implementation pattern that has worked across our deployments:

    • Track document versions at ingestion. Every chunk has source_id, version, and indexed_at metadata.
    • Webhook or polling on the source. When a document changes, re-chunk just that document and replace its chunks in the index. Don’t reindex the world.
    • Soft delete with grace period. Don’t immediately remove old chunks; mark them and remove after 24-48 hours. Avoids race conditions during reindexing.
    • Show indexed-at in the UI. “Based on docs from [date]” or similar. Users handle “this might be stale” much better than they handle confidently-wrong.
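    The first three points can be sketched with an in-memory stand-in for the index. Everything here is an assumption for illustration: ChunkIndex is a hypothetical class, a content hash is used as the version, and a real deployment would run the same logic against the vector store and call upsert_document from the webhook or polling job.

```python
import hashlib
from datetime import datetime, timedelta, timezone


class ChunkIndex:
    """Hypothetical in-memory index; each row mirrors the per-chunk metadata above."""

    def __init__(self):
        self.rows = []  # dicts: source_id, version, text, indexed_at, deleted_at

    def current_version(self, source_id):
        for row in self.rows:
            if row["source_id"] == source_id and row["deleted_at"] is None:
                return row["version"]
        return None

    def upsert_document(self, source_id, text, chunker, now=None):
        """Re-chunk one document if its content changed. Returns True when reindexed."""
        now = now or datetime.now(timezone.utc)
        version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        if self.current_version(source_id) == version:
            return False  # unchanged: don't reindex the world
        for row in self.rows:
            if row["source_id"] == source_id and row["deleted_at"] is None:
                row["deleted_at"] = now  # soft delete; purged after the grace period
        for chunk_text in chunker(text):
            self.rows.append({
                "source_id": source_id, "version": version, "text": chunk_text,
                "indexed_at": now, "deleted_at": None,
            })
        return True

    def purge(self, grace=timedelta(hours=24), now=None):
        """Drop soft-deleted rows whose grace period has elapsed."""
        now = now or datetime.now(timezone.utc)
        self.rows = [
            r for r in self.rows
            if r["deleted_at"] is None or now - r["deleted_at"] < grace
        ]
```

    Queries should filter on deleted_at being empty once the new chunks are in place; the grace period only exists so that requests already in flight can still resolve the old chunk ids.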

    8. The metadata gap

    Document metadata (date, author, source, version, access level) is dropped during chunking. The retrieval returns content but no provenance. The model can’t tell the user where the information came from, and the team can’t filter results by recency, source, or version.

    Symptom: can’t answer “show me only the latest version” or “filter to docs from the engineering team.” The vector DB returns chunks without context the application needs.

    Fix: carry document metadata through chunking and store it alongside each chunk’s embedding. Use it for filtering at retrieval time and for surfacing the source in the output.

    chunk = {
        "text": "...",
        "embedding": [...],
        "metadata": {
            "source_id": "doc_a4f12",
            "section_path": ["Guides", "Auth", "OAuth"],
            "doc_title": "OAuth Integration Guide",
            "doc_url": "/docs/auth/oauth",
            "doc_author": "platform-team",
            "doc_version": "v2.3.1",
            "published_at": "2026-02-14",
            "access_level": "public",
            "tags": ["auth", "oauth", "integration"],
        }
    }

    At retrieval time, the query can filter on metadata before semantic ranking:

    -- pgvector example ($1 = query embedding, $2 = array of allowed authors)
    SELECT chunk_id, text, embedding <=> $1 AS distance
    FROM chunks
    WHERE metadata->>'access_level' = 'public'
      AND (metadata->>'published_at')::date > '2025-01-01'
      AND metadata->>'doc_author' = ANY($2)
    ORDER BY distance
    LIMIT 50;

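    This is also how you implement permission-aware retrieval for multi-tenant systems. The WHERE clause restricts the candidate set, so users only see chunks they’re authorized to see. One caveat: with an approximate index (HNSW or IVFFlat), pgvector applies the filter after the index scan, so a highly selective filter can return fewer than 50 rows. Check the query plan on your own data.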


    The diagnostic order we use

    When a client comes to us with a production RAG system that’s underperforming, the order is consistent:

    1. Look at real query logs first. What are users actually asking? Sample 30-50 recent queries.
    2. Pick five failed queries. Walk through the retrieval, manually. What did it return? What should it have returned? Use the vector DB’s debug interface to inspect top 20, not just top 5.
    3. Identify which failure modes are present. Usually two or three at a time. The eight modes above cover almost everything.
    4. Fix retrieval first. The model is almost never the problem at this stage.
    5. Re-run evals. Quantify the improvement before moving on.
    6. Then look at prompting. Citation prompt, format prompt, refusal prompt.
    7. Then consider model swap. Almost always the last lever.

    A surprising share of “the LLM is wrong” complaints get fixed by changing the retrieval, not the model. We’ve shipped production RAG systems where the model is GPT-4o-mini and the answer quality matches systems running Claude 4.6, purely because the retrieval is doing more work.

    What a production RAG system looks like

    The setup we’d build today for a new client, defaults for a medium-complexity domain:

    | Component | Choice |
    |---|---|
    | Document loader | Unstructured.io for PDFs/HTML; custom for structured sources |
    | Chunker | Structure-aware, target 600 tokens, max 800 |
    | Embeddings | OpenAI text-embedding-3-large (3072d) or Voyage voyage-3 (1024d) |
    | Vector DB | pgvector if on Postgres; Qdrant otherwise |
    | Keyword search | Postgres tsvector or BM25 via Tantivy |
    | Fusion | RRF, k=60 |
    | Re-ranker | Cohere Rerank v3.5 |
    | LLM | Claude 4.5 Sonnet or GPT-5 (task-dependent) |
    | Tracing | LangSmith, Phoenix, or Helicone (pick one) |
    | Evals | 100+ real annotated queries, automated runs on every change |
    | Citations | Required in system prompt, rendered as clickable in UI |
    | Re-indexing | Webhook-driven, with indexed_at metadata |

    None of this is exotic. The systems that ship all of it are uncommon. This is also the shape of the AI work we ship with clients building on production-grade retrieval.

    A note on agents and “agentic RAG”

    A common 2026 trend: replacing the retrieval pipeline above with an “agent” that decides what to retrieve. Tool-using LLMs that can call a search function multiple times, reason about results, and refine queries.

    This works for some problems (complex multi-hop questions) and is overkill for most (single-fact lookups). The cost is latency and predictability. A single-shot retrieval pipeline returns in 300-500ms. An agent loop returns in 3-10 seconds.

    Our default recommendation: ship single-shot RAG first, with strong evals. Add agentic retrieval as a fallback for the queries that fail in eval, not as a default for every query. Most production RAG systems we see that are running agentic-everywhere would be both faster and cheaper as single-shot with better retrieval.

    Closing

    If your RAG system feels off and you can’t pin down why, the answer is almost always in the retrieval layer. The eight failure modes above cover the great majority of what we see. Most are fixable in a sprint, not a quarter.

    If you want a fresh pair of eyes on a production system, schedule a call. We’ll walk through the diagnostic order against your actual queries and identify the highest-leverage fix.

    Key takeaways

    • Structure-aware chunking with 600-800 token chunks and section-path metadata can lift relevance@5 by 20-40 percentage points over naive fixed-size chunking.
    • Hybrid retrieval (vector + BM25 + reciprocal rank fusion with k=60) plus a cross-encoder re-ranker beats pure semantic search across almost every domain.
    • Top 3-5 ranked chunks beat 20 stuffed chunks because of the lost-in-the-middle effect. Put the most relevant chunk last, closest to the question.
    • Evaluate against 100-300 real annotated queries, not synthetic ones. Separate retrieval evals from end-to-end answer evals so you know what failed.
    • Citation-required prompts roughly halve hallucinated facts and make failures inspectable: wrong answer citing a real chunk is a retrieval problem; citing nothing is a prompting problem.
    • Fix retrieval first, then prompting, and consider a model swap last. Most 'the LLM is wrong' complaints get solved by changing the retrieval.

    Frequently asked

    Why is my RAG pipeline returning wrong or irrelevant answers?

    RAG in production is mostly a retrieval problem with a language model bolted on top. The eight most common failure modes are naive chunking that destroys structure, pure semantic search where keyword search would win, no re-ranking, context window stuffing without ranking, evals against synthetic queries instead of real ones, no citation grounding, stale indexes, and dropped document metadata. Fix retrieval before swapping the model; most 'the LLM is wrong' complaints are actually retrieval failures.

    What chunking strategy works best for RAG?

    Structure-aware chunking with a target of 600 tokens and a hard cap at 800. Split on document headings first, then on paragraphs if a section is too large. Keep section headings inside the chunk as a path like 'Authentication > OAuth > Refresh tokens' so both the retriever and the model know where the chunk came from. Special-case tables and code blocks as atomic units. For long-form documents, use semantic chunking that splits where the topic shifts.

    Do I need a re-ranker in my RAG pipeline?

    Yes. The top 5 results from vector search are the top 5 most similar by vector distance, not the top 5 most relevant; these are different things in practice. Use a cross-encoder re-ranker after fusing vector and keyword candidates. Cohere Rerank v3.5 is the right default for most teams (~$1.00 per 1k searches, ~150ms latency, ~0.55 BEIR average). Self-host bge-reranker-v2-m3 if you need to avoid vendor lock-in.

    How many chunks should I put in the LLM context for RAG?

    Three to five carefully re-ranked chunks, not twenty. Modern long-context models (Claude 4.6, GPT-5, Gemini 2.5) tolerate more padding but don't reward it. The 'lost in the middle' effect means information in the middle of a long context gets weighted less by the model. Put the most relevant chunk last, closest to the question, and include a brief source header before each chunk so the model knows the boundaries.

    Should I evaluate my RAG system with synthetic or real queries?

    Real queries, annotated by a domain expert. Synthetic queries from an LLM are okay as a starting point but produce evals that pass while production users still get bad answers. Sample 100-300 production queries, annotate the ideal retrieved chunks for each, and measure recall@5, MRR, nDCG@5, and answer accuracy. Below 100 queries the metrics are too noisy; above 300 the returns diminish. Re-annotate periodically as user query patterns shift.

    Tags: RAG, retrieval, embeddings, re-ranking, vector search, pgvector, Cohere Rerank, AI evals
