Large language models predict the most likely next token, not the most accurate one. When GPT-4 tells you a contract clause says something it does not say, or Claude cites a paper that was never written, the model is not malfunctioning. It is doing exactly what it was trained to do: produce fluent, plausible text. The problem is that plausible and true are not the same thing.
For toy projects, this is a curiosity. For production systems — legal tools, financial dashboards, healthcare applications, internal knowledge bases — it is a dealbreaker. You cannot ship an answer to a user unless you can prove it comes from their actual documents.
This article is a technical walkthrough of how Wauldo's RAG pipeline eliminates hallucinations. Not by hoping the LLM will be accurate, but by verifying every answer against the source material before it reaches the user.
The 3-Path Retrieval
Not every query needs the same retrieval strategy. A keyword search for "Q3 revenue" is fundamentally different from a semantic question like "What were the main themes discussed across all board meetings?" Using dense vector search for the first is wasteful. Using BM25 for the second will miss relevant passages entirely.
Wauldo's retrieval engine automatically selects one of three paths based on the BM25 score of the incoming query:
- BM25Only (score ≥ 0.45) — The fast path. When the query has strong keyword overlap with your documents, BM25 alone produces excellent results. No embedding computation, no reranking. Response times under 500ms. This handles the majority of straightforward factual queries: specific names, dates, numbers, exact phrases
- BM25Reranked (score ≥ 0.20) — The balanced path. BM25 finds candidates, then a BGE reranker model rescores them by semantic relevance. This catches cases where the keywords are present but the most relevant passage is not the one with the highest keyword density. Best balance of speed and accuracy
- DenseFull (score < 0.20) — The thorough path. When BM25 scores are low, the query likely uses different vocabulary than the documents. Dense vector search (BGE embeddings) finds semantically similar passages even without keyword overlap. Results are merged with BM25 via Reciprocal Rank Fusion (RRF) to get the best of both worlds
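The RRF merge on the DenseFull path can be sketched in a few lines. This is an illustrative implementation of standard Reciprocal Rank Fusion, not Wauldo's actual code; the constant k=60 comes from the original RRF formulation and Wauldo's real value is not documented here:

```python
def rrf_merge(bm25_ranked, dense_ranked, k=60):
    """Reciprocal Rank Fusion over two ranked lists of chunk IDs.
    Each chunk's fused score is the sum of 1/(k + rank) across the
    lists it appears in (rank is 1-based), so chunks ranked well by
    both retrievers rise to the top."""
    scores = {}
    for ranking in (bm25_ranked, dense_ranked):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that appears in only one list still gets a score, so dense-only matches (no keyword overlap) survive the merge — that is the point of the fusion.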
The routing is automatic. Your application sends a query; the system picks the right path. No configuration needed, no knobs to tune. Each query takes the cheapest path that will produce good results.
Why not always use DenseFull? Because it is slower and, for keyword-heavy queries, BM25 alone is often more precise. Dense retrieval can surface semantically related but factually irrelevant passages. The 3-path approach gives you speed when you can afford it and thoroughness when you need it.
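The routing itself reduces to a threshold check on the query's BM25 score, using the cutoffs from the list above. A minimal sketch (the production implementation lives server-side; this just makes the decision rule concrete):

```python
def choose_path(bm25_score: float) -> str:
    """Select the retrieval path from the query's BM25 score,
    using the 0.45 and 0.20 thresholds described above."""
    if bm25_score >= 0.45:
        return "BM25Only"      # strong keyword overlap: fast path
    if bm25_score >= 0.20:
        return "BM25Reranked"  # moderate overlap: rerank candidates
    return "DenseFull"         # weak overlap: dense search + RRF merge
```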
Multi-Source Merge
Real questions rarely have answers in a single paragraph. "Compare the pricing terms across our three vendor contracts" requires synthesizing information from multiple documents. Most RAG systems retrieve a single best chunk and pass it to the LLM. This forces the model to answer with incomplete context — or hallucinate the missing parts.
Wauldo includes all chunks scoring ≥ 0.20, up to a maximum of 3 sources. Each source is labeled with its relevance score so the LLM knows which evidence is strongest:
[Source 1 — relevance 87%] Q3 revenue grew 23% year-over-year, driven by enterprise expansion. Total ARR reached $4.2M.
[Source 2 — relevance 52%] Operating expenses increased 12% due to the new sales team hired in Q2. Margins improved despite the headcount increase.
[Source 3 — relevance 31%] Q2 revenue was $3.1M, representing 18% YoY growth. The board approved the expansion plan.
When sources contain conflicting numbers — one document says revenue was $4.2M, another says $4.1M — the LLM is instructed to follow a deterministic conflict resolution rule: Source 1 wins on numeric disagreements. This is not a heuristic. It is a hard rule baked into the prompt. The highest-relevance source is the most likely to contain the correct figure, and the LLM is told to treat it as authoritative.
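Putting the two rules together — label sources in descending relevance order, cap at three, append the conflict instruction — looks roughly like this. The prompt wording and the function name are illustrative, not Wauldo's actual template:

```python
def build_context(sources):
    """Format retrieved chunks as labeled sources, highest relevance
    first, then append the deterministic conflict rule. `sources` is
    a list of (relevance, text) pairs."""
    sources = sorted(sources, reverse=True)[:3]  # cap at 3 sources
    parts = [
        f"[Source {i} — relevance {round(rel * 100)}%] {text}"
        for i, (rel, text) in enumerate(sources, start=1)
    ]
    parts.append(
        "If sources disagree on a number, use the figure from Source 1."
    )
    return "\n".join(parts)
```

Because the sources are sorted before labeling, "Source 1 wins" is always "highest relevance wins", which is what makes the rule deterministic.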
The result: multi-document synthesis without the hallucinated "blending" that happens when LLMs try to reconcile contradictory inputs on their own.
The Fact-Checking Layer
Retrieval and multi-source merge get the right context to the LLM. But the LLM can still hallucinate. It might add a number that was not in any source. It might rephrase a conclusion in a way that changes its meaning. It might extrapolate a trend that the documents do not support.
After the LLM generates its answer, Wauldo runs a post-generation fact-checker that compares the answer against the source passages using two complementary methods:
- Token overlap — Checks whether the key terms in the answer actually appear in the source chunks. Fast, deterministic, catches obvious fabrications (invented names, numbers, dates)
- Semantic similarity — Uses BGE embeddings to verify that the meaning of each claim in the answer is supported by the source material. Catches subtler hallucinations where the LLM uses the right words but draws an unsupported conclusion
The two signals are combined. If the answer contains claims that are not supported by the retrieved passages, the response is flagged: grounded: false. Your application can then decide what to do — show a warning, fall back to a different answer, or escalate to a human reviewer.
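A stripped-down version of the two checks, to make the mechanics concrete. The `embed` argument stands in for the BGE embedding model, and the two thresholds are illustrative values, not Wauldo's calibrated ones:

```python
import math
import re

def token_overlap(answer: str, sources: list[str]) -> float:
    """Fraction of the answer's tokens that appear in any source.
    Numbers, names, and dates that occur nowhere in the sources drag
    this score down — a cheap check for obvious fabrications."""
    tokens = set(re.findall(r"[a-z0-9]+", answer.lower()))
    source_tokens = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    if not tokens:
        return 0.0
    return len(tokens & source_tokens) / len(tokens)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_grounded(answer, sources, embed, overlap_min=0.6, sim_min=0.75):
    """Combine both signals: the answer must share enough tokens with
    the sources AND be semantically close to at least one of them."""
    overlap = token_overlap(answer, sources)
    sims = [cosine(embed(answer), embed(s)) for s in sources]
    return overlap >= overlap_min and max(sims) >= sim_min
```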
This is the critical difference from systems that rely on the LLM to self-report its confidence. LLMs are notoriously bad at knowing when they are wrong. Our fact-checker does not ask the model whether it is confident. It checks the evidence independently.
Confidence Calibration
The confidence score in a Wauldo response is not the LLM's self-assessment. It is not a softmax probability. It is a retrieval-derived metric computed from how well the query matched the documents.
- High (≥ 0.45) — Strong keyword and semantic match. The system found highly relevant passages. Answers at this level are almost always correct and well-grounded
- Medium (≥ 0.20) — Partial match. The system found related passages but the evidence is not as strong. Answers are usually correct but may lack specificity
- Low (< 0.20) — Weak match. The documents may not contain the answer to this question. Instead of guessing, the system responds with "insufficient evidence" and tells you exactly why
This is a deliberate design choice. When evidence is weak, we do not ask the LLM to try harder or be creative. We refuse to answer. A non-answer is always better than a wrong answer in production systems.
The confidence_label field (high/medium/low) is designed for application logic. You can use it to show green/yellow/red trust badges in your UI, route low-confidence answers to human review, or suppress answers below a threshold entirely.
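One way to wire those two fields into application logic — the routing choices here are examples of the pattern, not prescriptions:

```python
def handle_response(resp):
    """Decide what to do with an answer based on the audit object.
    `resp` is the parsed JSON body of a Wauldo response."""
    audit = resp["audit"]
    if not audit["grounded"]:
        return "suppress"                 # failed the fact-check: never show
    if audit["confidence_label"] == "low":
        return "insufficient_evidence"    # surface the system's non-answer
    if audit["confidence_label"] == "medium":
        return "human_review"             # show with a caution badge, queue review
    return "show"                         # high confidence and grounded
```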
The Audit Trail
Every Wauldo response includes a structured audit object. This is not a debug feature — it is a first-class part of the API designed for production use.
```json
{
  "audit": {
    "confidence": 0.87,
    "confidence_label": "high",
    "grounded": true,
    "retrieval_path": "BM25Only",
    "sources_used": 2,
    "sources_evaluated": 12,
    "model": "qwen/qwen3.5-flash-02-23",
    "latency_ms": 1243
  }
}
```
With the debug: true parameter, you get the full retrieval funnel — how many candidate chunks were found, how many survived tenant filtering, how many passed the score threshold, and how many were actually used:
```json
{
  "audit": {
    "confidence": 0.87,
    "confidence_label": "high",
    "grounded": true,
    "retrieval_path": "BM25Only",
    "sources_used": 2,
    "sources_evaluated": 12,
    "model": "qwen/qwen3.5-flash-02-23",
    "latency_ms": 1243,
    "candidates_found": 47,
    "candidates_after_tenant": 31,
    "candidates_after_score": 12,
    "query_type": "factual"
  }
}
```
This funnel is how you diagnose issues in production. If candidates_found is high but candidates_after_tenant drops sharply, the user's documents do not contain the answer — other tenants' documents do. If candidates_after_score is low, the query wording does not match the document content well. Each answer is self-explanatory without access to server logs.
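That diagnostic reasoning is mechanical enough to automate. A sketch that turns the debug funnel into a human-readable verdict — the ratio and count thresholds here are illustrative, chosen only to mirror the reasoning above:

```python
def diagnose_funnel(audit):
    """Classify a debug audit object from a response made with
    debug: true. Returns a short diagnosis string."""
    found = audit["candidates_found"]
    after_tenant = audit["candidates_after_tenant"]
    after_score = audit["candidates_after_score"]
    if found == 0:
        return "no candidates: the index may not cover this topic"
    if after_tenant / found < 0.25:
        return "sharp tenant drop: matches exist in other tenants' documents, not this user's"
    if after_score < 3:
        return "few survivors: query wording does not match the document content"
    return "healthy funnel"
```

Run against the example above (47 → 31 → 12), this reports a healthy funnel; a 40 → 4 → 1 funnel would flag the tenant drop.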
Results
We run a public benchmark suite against the live production API. The results as of the latest run:
- 100% retrieval accuracy on 12 RAG benchmark tasks (9 retrieval + 3 negative tests)
- 0 hallucinations across all test categories
- Average latency: 1.5 seconds per query (including LLM generation)
- 3 retrieval paths used automatically — no manual configuration required
We do not claim the system is perfect. No RAG system is. What we claim is more specific: when the evidence is insufficient, the system says so instead of guessing. The zero hallucination rate is not a result of the LLM being exceptionally accurate. It is a result of the pipeline refusing to pass through answers that are not grounded in the source documents.
Reproducible: The benchmark suite is open. You can run it against the live API yourself: cargo run -p benchmarks --bin quality_bench -- --suite eval --url https://api.wauldo.com
The architecture is intentionally conservative. We use the cheapest retrieval path that works, the smallest model that produces correct answers, and a fact-checker that flags anything unsupported. The result is a system that is fast, affordable, and — most importantly — trustworthy.
If you are building an application where wrong answers have consequences, this is the pipeline you want behind your API.