// for rag pipelines

Your RAG is confidently wrong.
Measure it.

You retrieve chunks. You stuff them in a prompt. The LLM returns a fluent paragraph with a footnote number at the end. Nothing in that pipeline actually checks whether the answer is supported by what you retrieved. Your eval set passes. Prod drifts. Users notice before your dashboards do.

MIT SDKs · 300 verifications/mo free
// the pain

Retrieves, answers, cites nothing.

Three failure modes every production RAG team has already shipped to users, then discovered at the worst possible moment.

01 · CITATION GAP

Looks grounded. Isn't.

The LLM paraphrases a retrieved chunk, reorders the entities, and drops a plausible number that was nowhere in context. The answer reads correct. There is no mechanism in your stack that says otherwise.

02 · EVAL BLINDNESS

100% on your set. 15–40% off.

Your golden eval is a narrow slice of a long tail. Prod traffic drifts into out-of-distribution queries, your retriever pulls irrelevant neighbors, the LLM fills the gap with fluent fiction. Your dashboards never notice.

03 · AUDIT PANIC

A user flagged an answer.

You open your tracing tool. Which of the 8 chunks sent the response off course? Which claim in the answer was unsupported? Without per-claim verdicts on every request, you are reconstructing evidence from logs, not reading it.

// before / after

Drop Wauldo between retrieval and response.

One call. Takes the answer plus the chunks you already retrieved. Returns a support score and a per-claim breakdown.

BEFORE · your pipeline today · unverified
# Flow
query
   retriever.search(k=8)
   llm.generate(prompt, chunks)
   answer

# Ground truth signal: none
# Per-claim attribution: none
# Audit trail: hope
AFTER · one extra call · verified
# Flow
query
   retriever.search(k=8)
   llm.generate(prompt, chunks)
   wauldo.fact_check(answer, chunks)
   answer + support_score + verdicts[]

# support_score: 0.0–1.0
# per-claim verdict: supported | partial | unsupported
# audit trail: stored

Wauldo takes two inputs: the generated answer, and the retrieved source text. It returns the grounding breakdown. No retraining. No prompt changes. No vector DB migration. Your retriever, your LLM, your chunking strategy, your reranker — all untouched. You add one line, you gain a measurable floor under every response.

// 8 lines

Your RAG + Wauldo in 8 lines of Python.

Drop into any existing retriever. LangChain, LlamaIndex, Haystack, or homegrown — the signature is the same.

python · wauldo sdk 0.10.0 · pip install wauldo
import os

from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])

# Your existing RAG
chunks = retriever.retrieve(query)
answer = llm.generate(query, context=chunks)

# Add verification — 1 line
result = w.fact_check(text=answer, source_context="\n".join(chunks))

if result.verdict == "UNVERIFIED":
    answer = "I don't have confident sources for that."
Policy · Pick your own gate. Tight gates eliminate the top of the hallucination distribution but reject more partial answers. Loose gates let partial answers through with a caveat. Regulated surfaces want the strictest gate you can tolerate. The score is the primitive; the policy is yours — use the verdict field (SAFE / PARTIAL / UNVERIFIED) or set a numeric floor you own.
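A numeric floor can be sketched like this. The `verdict` values and the 0.0–1.0 `support_score` come from the response fields described above; the `Verification` dataclass, the threshold, and the policy names are illustrative assumptions, not the SDK's actual types.

```python
from dataclasses import dataclass


@dataclass
class Verification:
    # Shape assumed from the fields described above; the real SDK object may differ.
    verdict: str          # "SAFE" | "PARTIAL" | "UNVERIFIED"
    support_score: float  # 0.0 to 1.0


def gate(result: Verification, floor: float = 0.7) -> str:
    """Map a verification result to a response policy you own."""
    if result.verdict == "UNVERIFIED" or result.support_score < floor:
        return "refuse"                  # strict path: drop the answer
    if result.verdict == "PARTIAL":
        return "answer_with_caveat"      # loose path: ship it, flagged
    return "answer"
```

A regulated surface might raise the floor to 0.9 and collapse the caveat path into refusal; a low-stakes chat UI might lower it and lean on the caveat.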
// rag accuracy bench

89% on RAG-only tasks. 100% injection defense.

70 adversarial cases across factual retrieval, prompt injection, and citation accuracy. 4 runs, reproducible. Median across the cohort is 91%, with a +48 point gap on injection.

RAG benchmark · 70 cases · 4 runs live
Framework     RAG factual   RAG injection   Citation accuracy
Wauldo            100            92               100
LlamaIndex         81            48                72
LangChain          78            44                70
Haystack           73            41                65

Reproduce: git clone github.com/wauldo/wauldo-leaderboard && cargo run  ·  full leaderboard →

// faq

Questions RAG teams ask first.

Does Wauldo replace my retriever or LLM?

No. It runs alongside. You pass in the answer your pipeline already produced, plus the chunks your retriever already fetched, and you get a grounding score back. Your vector DB, your reranker, your LLM, your prompt — all unchanged. Wauldo is a measurement layer, not a framework rewrite.

Does it work with my vector DB?

Wauldo does not care where your chunks live. It only needs two strings: the generated answer, and the concatenated source text. Works with Pinecone, Weaviate, Qdrant, pgvector, Chroma, Elasticsearch, an in-memory dictionary, or a flat file you read on every query. The storage layer is orthogonal.
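In practice "orthogonal" means one flattening step between your retriever and the call. A minimal sketch of that adapter, assuming only the two-strings contract above (the `.text` and `.page_content` attribute names follow LlamaIndex and LangChain conventions; the helper itself is hypothetical):

```python
def to_source_context(chunks) -> str:
    """Flatten any retriever's output into the single source string
    the fact-check call expects.

    Accepts plain strings, or framework objects exposing .text
    (LlamaIndex-style) or .page_content (LangChain-style);
    anything else falls back to str().
    """
    parts = []
    for chunk in chunks:
        text = (
            getattr(chunk, "text", None)
            or getattr(chunk, "page_content", None)
            or str(chunk)
        )
        parts.append(text)
    return "\n\n".join(parts)
```

Whatever store produced the chunks, the verification call only ever sees the joined string.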

What's the latency impact?

Three modes, pick per query. Lexical mode runs in roughly one second and matches tokens and entities against the source. Hybrid mode adds a 384-dim multilingual embedding comparison for paraphrase cases, roughly three to five seconds. Semantic mode adds an LLM-judge pass for ambiguous claims, five to fifteen seconds. Most RAG teams default to lexical on hot paths and hybrid on high-stakes surfaces.
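The hot-path/high-stakes split can be made explicit with a tiny router. The three mode names come from the description above; whether the SDK accepts a mode parameter, and how you classify queries, are assumptions to check against the docs:

```python
def pick_mode(high_stakes: bool, ambiguous: bool) -> str:
    """Route a query to a verification mode by cost/precision trade-off.
    Mode names follow the three tiers described above."""
    if ambiguous:
        return "semantic"   # LLM-judge pass, roughly 5-15 s
    if high_stakes:
        return "hybrid"     # adds multilingual embeddings, roughly 3-5 s
    return "lexical"        # token/entity match, roughly 1 s
```

Default everything to `lexical`, then promote the surfaces where a wrong answer is expensive.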

Does it support streaming?

Verification runs on the full answer — you cannot score a half-formed claim against a source. For streaming UX, the pattern is to emit tokens optimistically to the user, buffer the full response server-side, run the fact-check on the buffered text, and either confirm, annotate, or retract once the verdict lands. Wauldo's own streaming transport follows exactly this shape.
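The emit-buffer-verify pattern above can be sketched framework-free. Here `fact_check` stands in for the Wauldo call and `emit` for your transport (SSE, WebSocket, whatever you stream with); both are injected so the shape is clear, and the retraction copy is a placeholder:

```python
def stream_with_verification(token_stream, fact_check, emit) -> str:
    """Emit tokens optimistically, buffer the full text server-side,
    verify once the stream ends, and retract if the verdict fails."""
    buffer = []
    for token in token_stream:
        emit(token)            # user sees output immediately
        buffer.append(token)   # server keeps the full text
    verdict = fact_check("".join(buffer))
    if verdict == "UNVERIFIED":
        emit("\n[retracted: no confident sources]")
    return verdict
```

Confirm-or-annotate variants are the same shape: branch on the verdict after the buffer is scored instead of only retracting.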

What about multi-language?

Hybrid mode uses a 384-dim multilingual embedding model. Cross-lingual grounding is tested on French, English, and Spanish — an English answer grounded against a French source chunk returns a defensible support score. Adding more languages is a model swap, not a rewrite.

Reproducible build · MIT, self-hostable · 5ms p50, CDG region · 300 verifications/mo free

Measure your RAG grounding. Stop guessing.

Free on RapidAPI. 300 verifications/month. No credit card. Paste an answer, paste the chunks, watch the verdict land.

$ curl api.wauldo.com/v1/fact-check