91%
Adversarial pass rate across 4 runs. Range 86–97% on 70 cases.
Wauldo is a verification layer for AI agents. It returns a numeric support_score on every answer, grounded against the sources you provide. No LLM-as-judge, no vibes. A measurement you can log, audit, and regress against.
LLMs are trained to sound helpful. "Helpful" is orthogonal to "correct" — and on adversarial inputs, the two diverge fast. Wauldo is the instrument that measures the divergence, claim-by-claim, in the loop.
Every verification runtime on the market today is either (a) an LLM-as-judge that inherits the same failure modes it's supposed to detect, or (b) a rules-based guardrail that filters obvious bad strings and calls it a day. Neither measures grounding. Neither returns a number you can regress against week over week.
Wauldo ships a primitive: send (answer, sources), get back (support_score ∈ [0,1], per-claim verdicts, hallucination_rate). The scorer is deterministic. The dataset is public. The command to reproduce the leaderboard fits in three lines.
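In code, the call shape looks roughly like this. The base URL, endpoint path, auth header, and field names below are placeholders, not the documented API; the only part taken from the primitive above is the shape: (answer, sources) in, (support_score, per-claim verdicts, hallucination_rate) out.

```python
# Hypothetical call shape for the primitive above: (answer, sources) in,
# (support_score, per-claim verdicts, hallucination_rate) out.
# The endpoint, auth header, and field names are placeholders, not the real API.
import os
import requests

resp = requests.post(
    "https://api.wauldo.example/v1/verify",  # placeholder URL and path
    headers={"Authorization": f"Bearer {os.environ['WAULDO_API_KEY']}"},
    json={
        "answer": "The warranty covers water damage for 24 months.",
        "sources": [
            "Section 4.2: water damage is covered for the first 12 months.",
        ],
    },
    timeout=10,
)
result = resp.json()

print(result["support_score"])       # float in [0, 1]
print(result["hallucination_rate"])  # share of claims with no source support
for claim in result["claims"]:       # per-claim verdicts
    print(claim["text"], "->", claim["verdict"])
```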
The goal is not to sell a layer. The goal is to make "is this AI output grounded?" a measurable question instead of a vibe check.
What exists today, measurable and public.
91% adversarial pass rate across 4 runs (range 86–97% on 70 cases).
Improvement over the LangChain baseline. Reproducible; a post-hoc layer doesn't close the gap.
Fast-path verification (/v1/fact-check lexical mode).
Hand-crafted factual retrieval bench, 4 runs, no claim confabulated. Small by design, public, reproducible — we ship the dataset, not the headline.
We don't ship "trust signals" or "confidence indicators" that obscure the math. Every verification returns a number between 0 and 1, with the per-claim breakdown.
The adversarial bench is on GitHub. The scorer is Python you can read in 60 lines. One command runs it against any framework.
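For intuition only, here is a toy version of what a deterministic lexical scorer looks like: split the answer into claims, score each by token overlap against the sources, average. This is not the shipped scorer; the function names and the min_overlap cutoff are illustrative. It is here to show what "deterministic" buys you: same input, same number, every run.

```python
# Toy illustration of a deterministic, readable grounding scorer.
# NOT the shipped scorer: sentence-level token overlap against the sources,
# averaged into a support_score in [0, 1]. Same input, same number, every run.
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(answer: str, sources: list[str], min_overlap: float = 0.6) -> dict:
    claims = [c.strip() for c in re.split(r"(?<=[.!?])\s+", answer) if c.strip()]
    source_tokens = [_tokens(s) for s in sources]
    verdicts = []
    for claim in claims:
        ct = _tokens(claim)
        best = max((len(ct & st) / max(len(ct), 1) for st in source_tokens), default=0.0)
        verdicts.append({
            "text": claim,
            "overlap": round(best, 3),
            "verdict": "supported" if best >= min_overlap else "unsupported",
        })
    supported = sum(v["verdict"] == "supported" for v in verdicts)
    n = max(len(claims), 1)
    return {
        "support_score": supported / n,
        "hallucination_rate": 1 - supported / n,
        "claims": verdicts,
    }

if __name__ == "__main__":
    print(score("Water damage is covered for 12 months.",
                ["Section 4.2: water damage is covered for the first 12 months."]))
```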
We publish all 4 runs, not the best one. If a metric regresses, the CI bot commits the regression. No silent rollbacks.
500 verifications a month, free. Paste anything, get the number.