Large language models predict the most likely next token, not the most accurate one. When GPT-4 tells you a contract clause says something it does not say, or Claude cites a paper that was never written, the model is not malfunctioning. It is doing exactly what it was trained to do: produce fluent, plausible text. The problem is that plausible and true are not the same thing.
For toy projects, this is a curiosity. For production systems — legal tools, financial dashboards, healthcare applications, internal knowledge bases — it is a dealbreaker. You cannot ship an answer to a user unless you can prove it comes from their actual documents.
This article is a technical walkthrough of how Wauldo's RAG pipeline eliminates hallucinations. Not by hoping the LLM will be accurate, but by verifying every answer against the source material before it reaches the user.
The 3-Path Retrieval
Not every query needs the same retrieval strategy. A keyword search for "Q3 revenue" is fundamentally different from a semantic question like "What were the main themes discussed across all board meetings?" Using dense vector search for the first is wasteful. Using BM25 for the second will miss relevant passages entirely.
Wauldo's retrieval engine automatically selects one of three paths based on the BM25 score of the incoming query:
- BM25Only (score ≥ 0.45) — The fast path. When the query has strong keyword overlap with your documents, BM25 alone produces excellent results. No embedding computation, no reranking. Response times under 500ms. This handles the majority of straightforward factual queries: specific names, dates, numbers, exact phrases
- BM25Reranked (score ≥ 0.20) — The balanced path. BM25 finds candidates, then a BGE reranker model rescores them by semantic relevance. This catches cases where the keywords are present but the most relevant passage is not the one with the highest keyword density. Best balance of speed and accuracy
- DenseFull (score < 0.20) — The thorough path. When BM25 scores are low, the query likely uses different vocabulary than the documents. Dense vector search (BGE embeddings) finds semantically similar passages even without keyword overlap. Results are merged with BM25 via Reciprocal Rank Fusion (RRF) to get the best of both worlds
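The RRF merge on the DenseFull path can be sketched in a few lines. This is an illustrative implementation of standard Reciprocal Rank Fusion, not Wauldo's actual code; the constant k=60 comes from the original RRF formulation and Wauldo's real value is not documented here:

```python
def rrf_merge(bm25_ranked, dense_ranked, k=60):
    """Reciprocal Rank Fusion over two ranked lists of chunk IDs.
    Each chunk's fused score is the sum of 1/(k + rank) across the
    lists it appears in (rank is 1-based), so chunks ranked well by
    both retrievers rise to the top."""
    scores = {}
    for ranking in (bm25_ranked, dense_ranked):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that appears in only one list still gets a score, so dense-only matches (no keyword overlap) survive the merge — that is the point of the fusion.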
The routing is automatic. Your application sends a query; the system picks the right path. No configuration needed, no knobs to tune. Each query takes the cheapest path that will produce good results.
Why not always use DenseFull? Because it is slower and, for keyword-heavy queries, BM25 alone is often more precise. Dense retrieval can surface semantically related but factually irrelevant passages. The 3-path approach gives you speed when you can afford it and thoroughness when you need it.
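The routing itself reduces to a threshold check on the query's BM25 score, using the cutoffs from the list above. A minimal sketch (the production implementation lives server-side; this just makes the decision rule concrete):

```python
def choose_path(bm25_score: float) -> str:
    """Select the retrieval path from the query's BM25 score,
    using the 0.45 and 0.20 thresholds described above."""
    if bm25_score >= 0.45:
        return "BM25Only"      # strong keyword overlap: fast path
    if bm25_score >= 0.20:
        return "BM25Reranked"  # moderate overlap: rerank candidates
    return "DenseFull"         # weak overlap: dense search + RRF merge
```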
Multi-Source Merge
Real questions rarely have answers in a single paragraph. "Compare the pricing terms across our three vendor contracts" requires synthesizing information from multiple documents. Most RAG systems retrieve a single best chunk and pass it to the LLM. This forces the model to answer with incomplete context — or hallucinate the missing parts.
Wauldo includes all chunks scoring ≥ 0.20, up to a maximum of 3 sources. Each source is labeled with its relevance score so the LLM knows which evidence is strongest:
[Source 1 — relevance 87%] Q3 revenue grew 23% year-over-year, driven by enterprise expansion. Total ARR reached $4.2M.
[Source 2 — relevance 52%] Operating expenses increased 12% due to the new sales team hired in Q2. Margins improved despite the headcount increase.
[Source 3 — relevance 31%] Q2 revenue was $3.1M, representing 18% YoY growth. The board approved the expansion plan.
When sources contain conflicting numbers — one document says revenue was $4.2M, another says $4.1M — the LLM is instructed to follow a deterministic conflict resolution rule: Source 1 wins on numeric disagreements. This is not a heuristic. It is a hard rule baked into the prompt. The highest-relevance source is the most likely to contain the correct figure, and the LLM is told to treat it as authoritative.
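Putting the two rules together — label sources in descending relevance order, cap at three, append the conflict instruction — looks roughly like this. The prompt wording and the function name are illustrative, not Wauldo's actual template:

```python
def build_context(sources):
    """Format retrieved chunks as labeled sources, highest relevance
    first, then append the deterministic conflict rule. `sources` is
    a list of (relevance, text) pairs."""
    sources = sorted(sources, reverse=True)[:3]  # cap at 3 sources
    parts = [
        f"[Source {i} — relevance {round(rel * 100)}%] {text}"
        for i, (rel, text) in enumerate(sources, start=1)
    ]
    parts.append(
        "If sources disagree on a number, use the figure from Source 1."
    )
    return "\n".join(parts)
```

Because the sources are sorted before labeling, "Source 1 wins" is always "highest relevance wins", which is what makes the rule deterministic.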
The result: multi-document synthesis without the hallucinated "blending" that happens when LLMs try to reconcile contradictory inputs on their own.
The Fact-Checking Layer
Retrieval and multi-source merge get the right context to the LLM. But the LLM can still hallucinate. It might add a number that was not in any source. It might rephrase a conclusion in a way that changes its meaning. It might extrapolate a trend that the documents do not support.
After the LLM generates its answer, Wauldo runs a post-generation fact-checker that compares the answer against the source passages using two complementary methods:
- Token overlap — Checks whether the key terms in the answer actually appear in the source chunks. Fast, deterministic, catches obvious fabrications (invented names, numbers, dates)
- Semantic similarity — Uses BGE embeddings to verify that the meaning of each claim in the answer is supported by the source material. Catches subtler hallucinations where the LLM uses the right words but draws an unsupported conclusion
The two signals are combined. If the answer contains claims that are not supported by the retrieved passages, the response is flagged: grounded: false. Your application can then decide what to do — show a warning, fall back to a different answer, or escalate to a human reviewer.
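A stripped-down version of the two checks, to make the mechanics concrete. The `embed` argument stands in for the BGE embedding model, and the two thresholds are illustrative values, not Wauldo's calibrated ones:

```python
import math
import re

def token_overlap(answer: str, sources: list[str]) -> float:
    """Fraction of the answer's tokens that appear in any source.
    Numbers, names, and dates that occur nowhere in the sources drag
    this score down — a cheap check for obvious fabrications."""
    tokens = set(re.findall(r"[a-z0-9]+", answer.lower()))
    source_tokens = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    if not tokens:
        return 0.0
    return len(tokens & source_tokens) / len(tokens)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_grounded(answer, sources, embed, overlap_min=0.6, sim_min=0.75):
    """Combine both signals: the answer must share enough tokens with
    the sources AND be semantically close to at least one of them."""
    overlap = token_overlap(answer, sources)
    sims = [cosine(embed(answer), embed(s)) for s in sources]
    return overlap >= overlap_min and max(sims) >= sim_min
```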
This is the critical difference from systems that rely on the LLM to self-report its confidence. LLMs are notoriously bad at knowing when they are wrong. Our fact-checker does not ask the model whether it is confident. It checks the evidence independently.
Confidence Calibration
The confidence score in a Wauldo response is not the LLM's self-assessment. It is not a softmax probability. It is a retrieval-derived metric computed from how well the query matched the documents.
- High (≥ 0.45) — Strong keyword and semantic match. The system found highly relevant passages. Answers at this level are almost always correct and well-grounded
- Medium (≥ 0.20) — Partial match. The system found related passages but the evidence is not as strong. Answers are usually correct but may lack specificity
- Low (< 0.20) — Weak match. The documents may not contain the answer to this question. Instead of guessing, the system responds with "insufficient evidence" and tells you exactly why
This is a deliberate design choice. When evidence is weak, we do not ask the LLM to try harder or be creative. We refuse to answer. A non-answer is always better than a wrong answer in production systems.
The confidence_label field (high/medium/low) is designed for application logic. You can use it to show green/yellow/red trust badges in your UI, route low-confidence answers to human review, or suppress answers below a threshold entirely.
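One way to wire those two fields into application logic — the routing choices here are examples of the pattern, not prescriptions:

```python
def handle_response(resp):
    """Decide what to do with an answer based on the audit object.
    `resp` is the parsed JSON body of a Wauldo response."""
    audit = resp["audit"]
    if not audit["grounded"]:
        return "suppress"                 # failed the fact-check: never show
    if audit["confidence_label"] == "low":
        return "insufficient_evidence"    # surface the system's non-answer
    if audit["confidence_label"] == "medium":
        return "human_review"             # show with a caution badge, queue review
    return "show"                         # high confidence and grounded
```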
The Audit Trail
Every Wauldo response includes a structured audit object. This is not a debug feature — it is a first-class part of the API designed for production use.
```json
{
  "audit": {
    "confidence": 0.87,
    "confidence_label": "high",
    "grounded": true,
    "retrieval_path": "BM25Only",
    "sources_used": 2,
    "sources_evaluated": 12,
    "model": "qwen/qwen3.5-flash-02-23",
    "latency_ms": 1243
  }
}
```
With the debug: true parameter, you get the full retrieval funnel — how many candidate chunks were found, how many survived tenant filtering, how many passed the score threshold, and how many were actually used:
```json
{
  "audit": {
    "confidence": 0.87,
    "confidence_label": "high",
    "grounded": true,
    "retrieval_path": "BM25Only",
    "sources_used": 2,
    "sources_evaluated": 12,
    "model": "qwen/qwen3.5-flash-02-23",
    "latency_ms": 1243,
    "candidates_found": 47,
    "candidates_after_tenant": 31,
    "candidates_after_score": 12,
    "query_type": "factual"
  }
}
```
This funnel is how you diagnose issues in production. If candidates_found is high but candidates_after_tenant drops sharply, the user's documents do not contain the answer — other tenants' documents do. If candidates_after_score is low, the query wording does not match the document content well. Each answer is self-explanatory without access to server logs.
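That diagnostic reasoning is mechanical enough to automate. A sketch that turns the debug funnel into a human-readable verdict — the ratio and count thresholds here are illustrative, chosen only to mirror the reasoning above:

```python
def diagnose_funnel(audit):
    """Classify a debug audit object from a response made with
    debug: true. Returns a short diagnosis string."""
    found = audit["candidates_found"]
    after_tenant = audit["candidates_after_tenant"]
    after_score = audit["candidates_after_score"]
    if found == 0:
        return "no candidates: the index may not cover this topic"
    if after_tenant / found < 0.25:
        return "sharp tenant drop: matches exist in other tenants' documents, not this user's"
    if after_score < 3:
        return "few survivors: query wording does not match the document content"
    return "healthy funnel"
```

Run against the example above (47 → 31 → 12), this reports a healthy funnel; a 40 → 4 → 1 funnel would flag the tenant drop.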
Results
We run a public benchmark suite against the live production API. The results as of the latest run:
- 100% retrieval accuracy on 12 RAG benchmark tasks (9 retrieval + 3 negative tests)
- 0 hallucinations across all test categories
- Average latency: 1.5 seconds per query (including LLM generation)
- 3 retrieval paths used automatically — no manual configuration required
We do not claim the system is perfect. No RAG system is. What we claim is more specific: when the evidence is insufficient, the system says so instead of guessing. The zero hallucination rate is not a result of the LLM being exceptionally accurate. It is a result of the pipeline refusing to pass through answers that are not grounded in the source documents.
Reproducible: The benchmark suite is open. You can run it against the live API yourself: cargo run -p benchmarks --bin quality_bench -- --suite eval --url https://api.wauldo.com
The architecture is intentionally conservative. We use the cheapest retrieval path that works, the smallest model that produces correct answers, and a fact-checker that flags anything unsupported. The result is a system that is fast, affordable, and — most importantly — trustworthy.
If you are building an application where wrong answers have consequences, this is the pipeline you want behind your API.