Retrieval-Augmented Generation sounds simple in theory: upload your documents, build an index, query with natural language, get answers grounded in your data. In practice, most teams hit the same set of pitfalls — and most of them are not obvious until you are debugging wrong answers in production.

Here are five mistakes we have seen (and made ourselves) building Wauldo, a production RAG system that scores 100% on our internal retrieval benchmarks with zero observed hallucinations. Each one looks innocent. Each one will quietly degrade your system.

Mistake 1: No Tenant Isolation

This is the most dangerous mistake because it is both a security disaster and a quality disaster. If your RAG chunks from User A can leak into User B's search results, you have two problems: a data breach, and answers that cite irrelevant documents. Users lose trust in both cases.

The common approach is to filter results after retrieval — run BM25 or vector search across the entire corpus, then remove chunks that do not belong to the requesting tenant. This is wrong. Post-filtering corrupts your confidence scores because the system scored against the global corpus, not the tenant's documents.

Wrong: filter after scoring

BM25 scores all 100k chunks globally, returns top 10, then removes chunks not belonging to the current tenant. Your confidence score reflects the global corpus, not the tenant's data. A 92% confidence might drop to noise after filtering.

Right: filter before scoring

Filter the chunk pool by tenant ID first, then run BM25 scoring on only that tenant's chunks. Confidence now reflects the actual signal in the tenant's documents.

pseudocode
// Wrong: score globally, filter later
let results = bm25.search(query, all_chunks);
let filtered = results.filter(|c| c.tenant == tenant_id); // scores are stale

// Right: scope first, then score
let tenant_chunks = all_chunks.filter(|c| c.tenant == tenant_id);
let results = bm25.search(query, tenant_chunks); // scores reflect reality
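To see why scoping first matters, here is a minimal self-contained Python sketch with a hand-rolled Okapi BM25 and a toy two-tenant corpus (all data and names are illustrative). The key mechanic: IDF is computed over whatever corpus you score against, so the same chunk gets a different score depending on whether you filtered first.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 over `docs`. IDF is computed from `docs` itself,
    so the corpus you pass in determines what the scores mean."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy data: (tenant_id, chunk_text)
chunks = [
    ("a", "late fee is two percent per month"),
    ("a", "termination requires thirty days notice"),
    ("b", "late fee late fee waived for first offense"),
    ("b", "late payments accrue interest"),
]
query = "late fee"

# Wrong: score against the global corpus, filter afterwards
global_scores = bm25_scores(query, [text for _, text in chunks])
wrong = [s for (t, _), s in zip(chunks, global_scores) if t == "a"]

# Right: scope to the tenant first, then score
right = bm25_scores(query, [text for t, text in chunks if t == "a"])

print(wrong, right)  # same chunks, different scores
```

Running it shows the tenant's matching chunk gets a different score under each regime, purely because the document frequencies changed.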

Mistake 2: Trusting the LLM's Confidence

You ask the LLM "how confident are you?" and it says 95%. Reassuring, right? Except the source document only marginally matched the query, with a BM25 score of 0.12. The LLM is not measuring how well the retrieved context supports the answer — it is measuring how fluent and plausible its own output sounds. These are very different things.

LLM confidence is not retrieval confidence. A model will happily generate a confident-sounding answer from weak context because that is what it is trained to do: produce coherent text. The retrieval score is the actual signal. If your BM25 score is 0.12, it does not matter that the LLM thinks it nailed it.

Wrong: ask the LLM how confident it is

The model's self-assessment is uncalibrated. It will say "95% confident" from a chunk that barely matched the query.

Right: compute confidence from retrieval scores

Use BM25 and/or vector similarity scores to derive a calibrated confidence label (high / medium / low). Return it alongside the answer so your application can decide what to show users vs. what to flag for human review.

json — audit trail
{
  "answer": "The late fee is 2% per month...",
  "audit": {
    "confidence": 0.72,
    "confidence_label": "high",    // derived from retrieval score
    "grounded": true,             // fact-checked against sources
    "retrieval_path": "BM25Only",
    "sources_used": 2
  }
}
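Deriving the label is the easy part; the hard part is calibration. A minimal sketch, with threshold values that are illustrative assumptions you would tune against your own score distribution:

```python
def confidence_label(retrieval_score: float) -> str:
    """Bucket a raw retrieval score into a coarse label.
    The 0.6 / 0.3 thresholds are assumptions; calibrate on your data."""
    if retrieval_score >= 0.6:
        return "high"
    if retrieval_score >= 0.3:
        return "medium"
    return "low"

print(confidence_label(0.72))  # high
print(confidence_label(0.12))  # low
```

The 0.12-score chunk from the example above lands squarely in "low", no matter how confident the LLM claims to be.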

Mistake 3: Using Only Keyword Search OR Only Vector Search

This is the retrieval version of "tabs vs. spaces" — except the answer is "both." BM25 (keyword search) is excellent at finding exact term matches. If your user types "section 4.2 penalty clause," BM25 will find it instantly. But if they type "what happens if I pay late?" BM25 might miss the chunk entirely because none of those exact words appear in the penalty clause section.

Dense vector search catches semantic meaning. "Pay late" and "penalty clause" are close in embedding space. But vectors are poor at exact terminology — they might rank a vaguely related paragraph higher than the one containing the exact phrase your user needs.

Fix: hybrid retrieval with rank fusion

Run both BM25 and dense vector search, then merge the results using Reciprocal Rank Fusion (RRF). You get keyword precision and semantic recall in a single merged ranking.

pseudocode — hybrid retrieval
// Cost-aware routing: pick the cheapest path that works
let bm25_score = bm25.best_score(query, tenant_chunks);

if bm25_score >= 0.45 {
  // Strong keyword match — BM25 is enough
  return bm25.search(query, tenant_chunks);
} else if bm25_score >= 0.20 {
  // Moderate match — rerank with BGE for precision
  return reranker.rerank(bm25.search(query, tenant_chunks));
} else {
  // Weak keyword signal — full hybrid (BM25 + dense + RRF)
  let bm25_results = bm25.search(query, tenant_chunks);
  let dense_results = vector.search(query, tenant_chunks);
  return rrf_merge(bm25_results, dense_results);
}

This approach is not just more accurate — it is also cheaper. Most queries hit the BM25-only fast path. You only pay the embedding cost for queries that actually need it.
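The `rrf_merge` step in the sketch above is only a few lines. A minimal Python version, using k=60 (the constant from the original RRF paper); inputs are ranked lists of document IDs, best first:

```python
from collections import defaultdict

def rrf_merge(*ranked_lists, k=60):
    """Reciprocal Rank Fusion: each document earns 1 / (k + rank)
    from every list that ranks it, then documents are sorted by
    their fused score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d1", "d2", "d3"]
dense_results = ["d3", "d1", "d4"]
# d1 and d3 appear in both lists, so they rise to the top
print(rrf_merge(bm25_results, dense_results))
```

Because RRF works on ranks rather than raw scores, you never have to normalize BM25 scores against cosine similarities, which live on incompatible scales.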

Mistake 4: Returning Answers Without Sources

Your user asks "What is the late fee?" and gets back "The late fee is 2% per month." Sounds right. But is it? Which document said that? Which paragraph? What if there are two contracts with different late fees?

Without source attribution, every answer is a black box. The user has no way to verify it, no way to click through to the original document, and no way to know if the system pulled from the right source. This is especially dangerous in legal, financial, and compliance contexts where the provenance of information matters as much as the information itself.

Wrong: return just the answer

"The late fee is 2%." — From which document? Which version? The user has no way to verify.

Right: return answer + sources + confidence

Every answer should include the source passages, relevance scores, and a grounded/ungrounded flag. Make the audit trail a first-class part of your API response.

json — response with sources
{
  "answer": "The late fee is 2% per month, as specified in...",
  "sources": [
    {
      "title": "vendor-contract-2026.pdf",
      "chunk": "Section 4.2: Late payments incur a 2% monthly...",
      "relevance": 0.87
    }
  ],
  "audit": {
    "grounded": true,
    "confidence_label": "high",
    "sources_used": 1,
    "retrieval_path": "BM25Only"
  }
}
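Making the audit trail first-class is mostly a matter of never constructing a bare answer. A sketch of a response builder (field names follow the example above; the thresholds and the grounding shortcut are illustrative):

```python
def build_response(answer, sources, retrieval_path):
    """Assemble answer + sources + audit in one place, so no code
    path can return a bare answer. `sources` is a list of
    (title, chunk_text, relevance) tuples from the retriever."""
    top = max((rel for _, _, rel in sources), default=0.0)
    label = "high" if top >= 0.6 else ("medium" if top >= 0.3 else "low")
    return {
        "answer": answer,
        "sources": [
            {"title": title, "chunk": chunk, "relevance": rel}
            for title, chunk, rel in sources
        ],
        "audit": {
            # Illustrative: real grounding needs a fact-check pass
            "grounded": bool(sources),
            "confidence_label": label,
            "sources_used": len(sources),
            "retrieval_path": retrieval_path,
        },
    }

resp = build_response(
    "The late fee is 2% per month, as specified in...",
    [("vendor-contract-2026.pdf",
      "Section 4.2: Late payments incur a 2% monthly...", 0.87)],
    "BM25Only",
)
```

Funneling every answer through one constructor means the audit trail cannot be forgotten on some rarely-exercised code path.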

Mistake 5: One Model for Everything

Using GPT-4 for every query is like taking a taxi to the mailbox. It works, but you are burning money on a task that a bicycle could handle. Conversely, routing complex multi-document reasoning queries to a cheap, fast model gives you bad answers at a low price — which is worse than no answer at all.

The right approach is cost-aware model routing. Simple lookups where BM25 found a strong match can use a lightweight model. Complex queries that require cross-document synthesis or reasoning need a more capable model. The routing decision should be automatic, based on measurable signals like retrieval confidence and query complexity.

pseudocode — model routing
// Route each query to the right model tier
if rag_confidence >= 0.6 {
  model = "qwen-3.5-flash";      // strong retrieval — fast model is enough
} else if rag_confidence >= 0.3 {
  model = "gpt-4.1-mini";         // moderate retrieval — needs reasoning
} else {
  model = "gemini-2.0-flash";     // weak retrieval — fast chat fallback
}
// User can override with quality_mode: "premium" → GPT-4.1

This pattern cuts costs by 70-90% on simple queries while preserving quality on hard ones. In our benchmarks, the auto-routing approach matches GPT-4.1-mini's accuracy at one-third the cost and half the latency.

Conclusion

These are not exotic edge cases. They are the table stakes of production RAG:

  • Tenant isolation before scoring — not after
  • Retrieval-based confidence — not LLM self-assessment
  • Hybrid retrieval — BM25 + dense vectors + rank fusion
  • Source attribution — with every answer, always
  • Cost-aware model routing — right model for the right query

Wauldo handles all five out of the box. Every response includes an audit trail with confidence scores, source passages, retrieval path, and a grounded/ungrounded flag. The system routes queries to the optimal model tier automatically.

Try it yourself: Upload a document to the live demo and inspect the audit trail. Or grab an API key and build with it — the free tier gives you 300 requests/month.