Looks grounded. Isn't.
You retrieve chunks. You stuff them in a prompt. The LLM returns a fluent paragraph with a footnote number at the end. Nothing in that pipeline actually checks whether the answer is supported by what you retrieved. Your eval set passes. Prod drifts. Users notice before your dashboards do.
Three failure modes every RAG team in production has already shipped to users, and discovered at the worst possible moment:
The LLM paraphrases a retrieved chunk, reorders the entities, and drops a plausible number that was nowhere in context. The answer reads correct. There is no mechanism in your stack that says otherwise.
Your golden eval is a narrow slice of a long tail. Prod traffic drifts into out-of-distribution queries, your retriever pulls irrelevant neighbors, the LLM fills the gap with fluent fiction. Your dashboards never notice.
You open your tracing tool. Which of the 8 chunks derailed the response? Which claim in the answer was unsupported? Without per-claim verdicts on every request, you are reconstructing evidence from logs, not reading it.
One call. Takes the answer plus the chunks you already retrieved. Returns a support score and a per-claim breakdown.
```
# Flow
query
  → retriever.search(k=8)
  → llm.generate(prompt, chunks)
  → answer

# Ground truth signal: none
# Per-claim attribution: none
# Audit trail: hope
```
```
# Flow
query
  → retriever.search(k=8)
  → llm.generate(prompt, chunks)
  → wauldo.fact_check(answer, chunks)
  → answer + support_score + verdicts[]

# support_score: 0.0–1.0
# per-claim verdict: supported | partial | unsupported
# audit trail: stored
```
Wauldo takes two inputs: the generated answer, and the retrieved source text. It returns the grounding breakdown. No retraining. No prompt changes. No vector DB migration. Your retriever, your LLM, your chunking strategy, your reranker — all untouched. You add one line, you gain a measurable floor under every response.
Drop into any existing retriever. LangChain, LlamaIndex, Haystack, or homegrown — the signature is the same.
```python
import os

from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])

# Your existing RAG
chunks = retriever.retrieve(query)
answer = llm.generate(query, context=chunks)

# Add verification — 1 line
result = w.fact_check(text=answer, source_context="\n".join(chunks))
if result.verdict == "UNVERIFIED":
    answer = "I don't have confident sources for that."
```
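When the answer clears the gate, the same result carries the per-claim breakdown. A sketch of threshold gating on it; the `support_score` and `verdicts` attribute names mirror the response fields described above, but the exact object shape is an assumption, not confirmed API:

```python
SUPPORT_THRESHOLD = 0.8  # tune per surface; customer-facing channels want a stricter floor

def gate(result, answer: str) -> str:
    """Pass, annotate, or block an answer based on its grounding verdicts."""
    # Block outright when overall support is weak.
    if result.support_score < SUPPORT_THRESHOLD:
        return "I don't have confident sources for that."
    # Surface individually unsupported claims instead of hiding them.
    flagged = [v.claim for v in result.verdicts if v.verdict == "unsupported"]
    if flagged:
        return answer + "\n\nUnverified: " + "; ".join(flagged)
    return answer
```

The threshold and the annotation string are policy decisions, not library defaults; stricter surfaces can block on any single unsupported claim instead.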
70 adversarial cases across factual retrieval, prompt injection, and citation accuracy. Four reproducible runs. Median across the cohort is 91%, with a +48-point gap on injection.
| Framework | RAG factual | RAG injection | Citation accuracy |
|---|---|---|---|
| Wauldo | 100 | 92 | 100 |
| LlamaIndex | 81 | 48 | 72 |
| LangChain | 78 | 44 | 70 |
| Haystack | 73 | 41 | 65 |
Reproduce: `git clone github.com/wauldo/wauldo-leaderboard && cargo run` · full leaderboard →
The fact-check call doesn't know whether it sits behind a RAG pipeline, an agent loop, or a support bot. Pick the shape that matches your stack.
**Agents**
Every tool call produces an intermediate answer the agent then reasons over. Verify each hop so errors don't compound three steps deep.
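A minimal sketch of that per-hop gate. The `verify` callable stands in for the fact-check call, and the verdict value is assumed to match the single-answer flow; the discard string is illustrative:

```python
def verify_hop(intermediate_answer: str, observation: str, verify) -> str:
    """Gate one agent hop: keep the intermediate answer only if it is
    grounded in the tool observation it was derived from."""
    result = verify(text=intermediate_answer, source_context=observation)
    if result.verdict == "UNVERIFIED":
        # Feed the failure back to the agent loop instead of letting it compound.
        return "[hop discarded: not supported by tool output]"
    return intermediate_answer
```

The agent sees the discard marker in its context and can retry the tool call rather than reasoning over a fabricated intermediate.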
Agents use case →

**Support**
A wrong refund policy or a hallucinated SLA in customer channels is a revenue event. Gate every outbound answer on a grounding threshold.
Support use case →

**Overview**
One primitive powers all of the above: retrieve context, score grounding, emit verdicts. See the full product surface and architecture.
Product overview →

**Does Wauldo replace your RAG pipeline?**
No. It runs alongside. You pass in the answer your pipeline already produced, plus the chunks your retriever already fetched, and you get a grounding score back. Your vector DB, your reranker, your LLM, your prompt — all unchanged. Wauldo is a measurement layer, not a framework rewrite.
**Which vector database does it require?**
Wauldo does not care where your chunks live. It only needs two strings: the generated answer, and the concatenated source text. Works with Pinecone, Weaviate, Qdrant, pgvector, Chroma, Elasticsearch, an in-memory dictionary, or a flat file you read on every query. The storage layer is orthogonal.
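Since the verification layer only ever sees two strings, even a flat file qualifies as a retriever. A throwaway sketch, not a recommendation; the ranking is crude term overlap:

```python
def load_chunks(path: str, query: str, k: int = 8) -> list[str]:
    """Naive retriever over a flat file: one chunk per blank-line-separated paragraph."""
    with open(path, encoding="utf-8") as f:
        paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]
    # Rank by term overlap with the query; any scoring works here, since the
    # fact-check call only receives the concatenated winners.
    terms = set(query.lower().split())
    ranked = sorted(paragraphs, key=lambda p: -len(terms & set(p.lower().split())))
    return ranked[:k]
```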
**How fast is it?**
Three modes, pick per query. Lexical mode runs in roughly one second and matches tokens and entities against the source. Hybrid mode adds a 384-dim multilingual embedding comparison for paraphrase cases, roughly three to five seconds. Semantic mode adds an LLM-judge pass for ambiguous claims, five to fifteen seconds. Most RAG teams default to lexical on hot paths and hybrid on high-stakes surfaces.
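Per-query mode selection can then be a one-line routing table. The `mode` parameter name in the commented call is an assumption, not confirmed API; the latency bands come from the figures above:

```python
# Latency bands: lexical ~1 s, hybrid ~3-5 s, semantic ~5-15 s.
MODE_BY_SURFACE = {
    "chat": "lexical",         # hot path: token/entity matching is enough
    "support": "hybrid",       # paraphrase-heavy: add the embedding comparison
    "compliance": "semantic",  # ambiguous claims: add the LLM-judge pass
}

def pick_mode(surface: str) -> str:
    """Default to the cheapest mode for any surface not explicitly routed."""
    return MODE_BY_SURFACE.get(surface, "lexical")

# Assumed call shape:
# result = w.fact_check(text=answer, source_context=src, mode=pick_mode("support"))
```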
**Does it work with streaming?**
Verification runs on the full answer — you cannot score a half-formed claim against a source. For streaming UX, the pattern is to emit tokens optimistically to the user, buffer the full response server-side, run the fact-check on the buffered text, and either confirm, annotate, or retract once the verdict lands. Wauldo's own streaming transport follows exactly this shape.
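The buffer-then-verify pattern as a generator. `verify` stands in for the fact-check call, the verdict value is assumed to match the single-answer flow, and the retraction string is illustrative:

```python
def stream_with_verification(token_stream, chunks, verify):
    """Emit tokens optimistically, buffer the full answer, verify once complete."""
    buffered = []
    for token in token_stream:
        buffered.append(token)
        yield token  # the user sees tokens immediately
    # The full answer now exists; run the grounding check on the buffered text.
    result = verify(text="".join(buffered), source_context="\n".join(chunks))
    if result.verdict == "UNVERIFIED":
        yield "\n[retracted: this answer could not be verified against sources]"
```

In a real transport the final yield would map to a distinct frame type (confirm, annotate, retract) rather than plain text appended to the stream.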
**Which languages are supported?**
Hybrid mode uses a 384-dim multilingual embedding model. Cross-lingual grounding is tested on French, English, and Spanish — an English answer grounded against a French source chunk returns a defensible support score. Adding more languages is a model swap, not a rewrite.
Free on RapidAPI. 300 verifications/month. No credit card. Paste an answer, paste the chunks, watch the verdict land.
```shell
$ curl api.wauldo.com/v1/fact-check
```