// about

Built by one engineer. Measured by everyone.

Wauldo is a verification layer for AI agents. It returns a numeric support_score on every answer, grounded against the sources you provide. No LLM-as-judge, no vibes. A measurement you can log, audit, and regress against.

MIT · live since 2026-03-08 · reproducible build · public leaderboard

Built and maintained by Nizar Benmebrouk in Lyon, France · solo founder · contact: contact@wauldo.com · legal entity at /mentions-legales

// the thesis

Most AI agents lie with confidence.

LLMs are trained to sound helpful. "Helpful" is orthogonal to "correct" — and on adversarial inputs, the two diverge fast. Wauldo is the instrument that measures the divergence, claim-by-claim, in the loop.

Every verification runtime on the market today is either (a) an LLM-as-judge that inherits the same failure modes it's supposed to detect, or (b) a rules-based guardrail that filters obvious bad strings and calls it a day. Neither measures grounding. Neither returns a number you can regress against week over week.

Wauldo ships a primitive: send (answer, sources), get back (support_score ∈ [0,1], per-claim verdicts, hallucination_rate). The scorer is deterministic. The dataset is public. The command to reproduce the leaderboard fits in three lines.
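
A minimal sketch of that round trip, in Python. The answer/sources request fields and the returned metric names mirror the primitive above; the exact wire format, auth, and the per-claim response key are assumptions, so check the API docs before copying:

import requests

# Illustrative call; field names follow the primitive described above,
# but the wire format and auth are assumptions.
resp = requests.post(
    "https://api.wauldo.com/v1/fact-check",
    json={
        "answer": "The Eiffel Tower is 330 metres tall.",
        "sources": ["Official site: the tower stands 330 metres high."],
    },
    timeout=10,
)
body = resp.json()
print(body["support_score"])            # float in [0, 1]
print(body["hallucination_rate"])       # fraction of claims unsupported
for verdict in body.get("claims", []):  # per-claim verdicts; key name assumed
    print(verdict)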

The goal is not to sell a layer. The goal is to make "is this AI output grounded?" a measurable question instead of a vibe check.


// by the numbers

No funding story. Just metrics.

What exists today, measurable and public.

// median

91%

Adversarial pass rate across 4 runs. Range 86–97% on 70 cases.

// injection gap

+48pt

Over the LangChain baseline. Reproducible; a post-hoc layer doesn't close the gap.

// p50 latency

5ms

Fast-path verification (/v1/fact-check lexical mode).

// confabulation count

0/70

Hand-crafted factual retrieval bench, 4 runs, no claim confabulated. Small by design, public, reproducible — we ship the dataset, not the headline.


// philosophy

Three commitments.

01 · MEASURABLE

Every claim → a number.

We don't ship "trust signals" or "confidence indicators" that obscure the math. Every verification returns a number between 0 and 1, with the per-claim breakdown.

02 · REPRODUCIBLE

Public dataset, public scorer.

The adversarial bench is on GitHub. The scorer is 60 lines of readable Python. One command runs it against any framework.
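
To make "deterministic" concrete, here is a toy lexical scorer in the same spirit. It is an illustration of the shape, not Wauldo's actual 60 lines; the tokenizer and threshold are arbitrary:

import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_score(claims: list[str], sources: list[str]) -> float:
    # Deterministic by construction: same inputs, same number, every run.
    vocab: set[str] = set()
    for source in sources:
        vocab |= _tokens(source)
    supported = 0
    for claim in claims:
        toks = _tokens(claim)
        # Count a claim as supported if enough of its tokens appear in the sources.
        if toks and len(toks & vocab) / len(toks) >= 0.8:  # arbitrary threshold
            supported += 1
    return supported / len(claims) if claims else 1.0  # no claims: trivially supported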

03 · HONEST

Ranges, not just medians.

We publish all 4 runs, not the best one. If a metric regresses, the CI bot commits the regression. No silent rollbacks.


// get in touch

Questions about the verification primitive, integration help, or Enterprise self-host? Reach out.

Measuring, not hoping.

500 verifications a month, free. Paste anything, get the number.

$ curl api.wauldo.com/v1/fact-check