Looks grounded. Isn't.
You retrieve chunks. You stuff them in a prompt. The LLM returns a fluent paragraph with a footnote number at the end. Nothing in that pipeline actually checks whether the answer is supported by what you retrieved. Your eval set passes. Prod drifts. Users notice before your dashboards do.
Three failure modes every RAG team in production has already shipped to users, and discovered at the worst possible moment:
The LLM paraphrases a retrieved chunk, reorders the entities, and drops a plausible number that was nowhere in context. The answer reads correct. There is no mechanism in your stack that says otherwise.
Your golden eval is a narrow slice of a long tail. Prod traffic drifts into out-of-distribution queries, your retriever pulls irrelevant neighbors, the LLM fills the gap with fluent fiction. Your dashboards never notice.
You open your tracing tool. Which of the 8 chunks derailed the response? Which claim in the answer was unsupported? Without per-claim verdicts on every request, you are reconstructing evidence from logs, not reading it.
One call. Takes the answer plus the chunks you already retrieved. Returns a support score and a per-claim breakdown.
```
# Flow
query
  → retriever.search(k=8)
  → llm.generate(prompt, chunks)
  → answer

# Ground truth signal: none
# Per-claim attribution: none
# Audit trail: hope
```
```
# Flow
query
  → retriever.search(k=8)
  → llm.generate(prompt, chunks)
  → wauldo.fact_check(answer, chunks)
  → answer + support_score + verdicts[]

# support_score: 0.0–1.0
# per-claim verdict: supported | partial | unsupported
# audit trail: stored
```
Wauldo takes two inputs: the generated answer, and the retrieved source text. It returns the grounding breakdown. No retraining. No prompt changes. No vector DB migration. Your retriever, your LLM, your chunking strategy, your reranker — all untouched. You add one line, you gain a measurable floor under every response.
Drop into any existing retriever. LangChain, LlamaIndex, Haystack, or homegrown — the signature is the same.
```python
import os

from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])

# Your existing RAG
chunks = retriever.retrieve(query)
answer = llm.generate(query, context=chunks)

# Add verification — 1 line
result = w.fact_check(text=answer, source_context="\n".join(chunks))
if result.verdict == "UNVERIFIED":
    answer = "I don't have confident sources for that."
```
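When the answer clears the gate, the same result carries the per-claim breakdown. A sketch of threshold gating on it; the `support_score` and `verdicts` attribute names mirror the response fields described above, but the exact object shape is an assumption, not confirmed API:

```python
SUPPORT_THRESHOLD = 0.8  # tune per surface; customer-facing channels want a stricter floor

def gate(result, answer: str) -> str:
    """Pass, annotate, or block an answer based on its grounding verdicts."""
    # Block outright when overall support is weak.
    if result.support_score < SUPPORT_THRESHOLD:
        return "I don't have confident sources for that."
    # Surface individually unsupported claims instead of hiding them.
    flagged = [v.claim for v in result.verdicts if v.verdict == "unsupported"]
    if flagged:
        return answer + "\n\nUnverified: " + "; ".join(flagged)
    return answer
```

The threshold and the annotation string are policy decisions, not library defaults; stricter surfaces can block on any single unsupported claim instead.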
70 adversarial cases across factual retrieval, prompt injection, and citation accuracy. Four reproducible runs. Median across the cohort is 91%, with a +48-point gap on injection.
| Framework | RAG factual | RAG injection | Citation accuracy |
|---|---|---|---|
| Wauldo | 100 | 92 | 100 |
| LlamaIndex | 81 | 48 | 72 |
| LangChain | 78 | 44 | 70 |
| Haystack | 73 | 41 | 65 |
Reproduce: `git clone github.com/wauldo/wauldo-leaderboard && cargo run` · full leaderboard →
The fact-check call doesn't know whether it sits behind a RAG pipeline, an agent loop, or a support bot. Pick the shape that matches your stack.
**Agents**
Every tool call produces an intermediate answer the agent then reasons over. Verify each hop so errors don't compound three steps deep.
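A minimal sketch of that per-hop gate. The `verify` callable stands in for the fact-check call, and the verdict value is assumed to match the single-answer flow; the discard string is illustrative:

```python
def verify_hop(intermediate_answer: str, observation: str, verify) -> str:
    """Gate one agent hop: keep the intermediate answer only if it is
    grounded in the tool observation it was derived from."""
    result = verify(text=intermediate_answer, source_context=observation)
    if result.verdict == "UNVERIFIED":
        # Feed the failure back to the agent loop instead of letting it compound.
        return "[hop discarded: not supported by tool output]"
    return intermediate_answer
```

The agent sees the discard marker in its context and can retry the tool call rather than reasoning over a fabricated intermediate.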
Agents use case →

**Support**
A wrong refund policy or a hallucinated SLA in customer channels is a revenue event. Gate every outbound answer on a grounding threshold.
Support use case →

**Overview**
One primitive powers all of the above: retrieve context, score grounding, emit verdicts. See the full product surface and architecture.
Product overview →

**Does Wauldo replace your RAG pipeline?**
No. It runs alongside. You pass in the answer your pipeline already produced, plus the chunks your retriever already fetched, and you get a grounding score back. Your vector DB, your reranker, your LLM, your prompt — all unchanged. Wauldo is a measurement layer, not a framework rewrite.
**Which vector database does it require?**
Wauldo does not care where your chunks live. It only needs two strings: the generated answer, and the concatenated source text. Works with Pinecone, Weaviate, Qdrant, pgvector, Chroma, Elasticsearch, an in-memory dictionary, or a flat file you read on every query. The storage layer is orthogonal.
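Since the verification layer only ever sees two strings, even a flat file qualifies as a retriever. A throwaway sketch, not a recommendation; the ranking is crude term overlap:

```python
def load_chunks(path: str, query: str, k: int = 8) -> list[str]:
    """Naive retriever over a flat file: one chunk per blank-line-separated paragraph."""
    with open(path, encoding="utf-8") as f:
        paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]
    # Rank by term overlap with the query; any scoring works here, since the
    # fact-check call only receives the concatenated winners.
    terms = set(query.lower().split())
    ranked = sorted(paragraphs, key=lambda p: -len(terms & set(p.lower().split())))
    return ranked[:k]
```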
**How fast is it?**
Three modes, pick per query. Lexical mode runs in roughly one second and matches tokens and entities against the source. Hybrid mode adds a 384-dim multilingual embedding comparison for paraphrase cases, roughly three to five seconds. Semantic mode adds an LLM-judge pass for ambiguous claims, five to fifteen seconds. Most RAG teams default to lexical on hot paths and hybrid on high-stakes surfaces.
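Per-query mode selection can then be a one-line routing table. The `mode` parameter name in the commented call is an assumption, not confirmed API; the latency bands come from the figures above:

```python
# Latency bands: lexical ~1 s, hybrid ~3-5 s, semantic ~5-15 s.
MODE_BY_SURFACE = {
    "chat": "lexical",         # hot path: token/entity matching is enough
    "support": "hybrid",       # paraphrase-heavy: add the embedding comparison
    "compliance": "semantic",  # ambiguous claims: add the LLM-judge pass
}

def pick_mode(surface: str) -> str:
    """Default to the cheapest mode for any surface not explicitly routed."""
    return MODE_BY_SURFACE.get(surface, "lexical")

# Assumed call shape:
# result = w.fact_check(text=answer, source_context=src, mode=pick_mode("support"))
```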
**Does it work with streaming?**
Verification runs on the full answer — you cannot score a half-formed claim against a source. For streaming UX, the pattern is to emit tokens optimistically to the user, buffer the full response server-side, run the fact-check on the buffered text, and either confirm, annotate, or retract once the verdict lands. Wauldo's own streaming transport follows exactly this shape.
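The buffer-then-verify pattern as a generator. `verify` stands in for the fact-check call, the verdict value is assumed to match the single-answer flow, and the retraction string is illustrative:

```python
def stream_with_verification(token_stream, chunks, verify):
    """Emit tokens optimistically, buffer the full answer, verify once complete."""
    buffered = []
    for token in token_stream:
        buffered.append(token)
        yield token  # the user sees tokens immediately
    # The full answer now exists; run the grounding check on the buffered text.
    result = verify(text="".join(buffered), source_context="\n".join(chunks))
    if result.verdict == "UNVERIFIED":
        yield "\n[retracted: this answer could not be verified against sources]"
```

In a real transport the final yield would map to a distinct frame type (confirm, annotate, retract) rather than plain text appended to the stream.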
**Which languages are supported?**
Hybrid mode uses a 384-dim multilingual embedding model. Cross-lingual grounding is tested on French, English, and Spanish — an English answer grounded against a French source chunk returns a defensible support score. Adding more languages is a model swap, not a rewrite.
Free on RapidAPI. 300 verifications/month. No credit card. Paste an answer, paste the chunks, watch the verdict land.
```shell
$ curl api.wauldo.com/v1/fact-check
```