Verified AI,
in numbers
Most AI companies ship glossy charts. We publish the raw JSON. Reproduce every number from this page with one command.
Last updated — · — runs committed to git
Four charts. Zero smoothing.
The only line that should be flat is hallucination. Everything else is a story of the past two months.
Hallucination rate
Lower is better · anti-hallucination suite
Adversarial pass rate
Injection · OOS · contradiction · multilingual
Eval accuracy
61 tasks · RAG retrieval + reasoning + tools
Latency (avg ms)
Eval suite · fast-path 26% · P50
How we measure
Three public suites. Same prompts every run. Failures stay in git forever.
61 tasks
RAG retrieval, cross-doc, reasoning, tool-use, anti-hallucination, negative cases. Every eval run compares against a baseline snapshot — regressions block the merge.
20 tasks
5 adversarial + 5 multi-hop + 10 RAG adversarial. Includes out-of-scope, entity confusion, negation, temporal, cross-doc contradiction, verbose distractors, false premises.
70 tasks
10 factual · 15 out-of-scope · 25 injection (5 types) · 10 contradiction · 10 semantic + multilingual. Measures ISR, NOT_FOUND accuracy, conflict detection.
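The baseline comparison described above can be sketched as a small gate script. This is a minimal sketch, not the repo's actual implementation: the file names and the "pass_rate" field are illustrative assumptions, so check a real result file for the actual schema.

```shell
# Sketch of a regression gate: compare the current run's pass rate to a
# committed baseline snapshot and exit nonzero on a drop.
# "baseline.json", "current.json", and "pass_rate" are assumed names,
# not the repo's real schema; the sample data is fabricated.
cat > baseline.json <<'EOF'
{"pass_rate": 0.90}
EOF
cat > current.json <<'EOF'
{"pass_rate": 0.93}
EOF

read_rate() {
  python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))["pass_rate"])' "$1"
}

b=$(read_rate baseline.json)
c=$(read_rate current.json)

# A nonzero exit here is what would block the merge in CI.
if python3 -c "import sys; sys.exit(0 if $c >= $b else 1)"; then
  echo "OK: pass rate $c >= baseline $b"
else
  echo "REGRESSION: pass rate $c < baseline $b"
  exit 1
fi
```

The design point is the exit code: CI only needs pass/fail, so the gate stays a one-file script with no dependencies beyond python3.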
Run it yourself
One clone, one cargo run. No API key required. Point it at your own API, or at ours.
# Clone the repo
git clone https://github.com/Benmebrouk/agentagentique.git
cd agentagentique
# Eval suite — 61 tasks, general quality
cargo run -p benchmarks --bin quality_bench -- --suite eval
# Hard suite — 20 tasks, adversarial RAG + multi-hop
cargo run -p benchmarks --bin quality_bench -- --suite hard
# Task adversarial — 70 tasks, injection / OOS / contradiction
cargo run -p benchmarks --bin task_adversarial -- \
--url https://api.wauldo.com
# All results land in benchmarks/results/history/ as JSON
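Since every run lands as JSON, the history directory can be post-processed directly. A minimal sketch, assuming a suite/passed/total schema; the sample file is fabricated for illustration, so inspect a real result file for the actual keys:

```shell
# Sketch: summarize committed runs from benchmarks/results/history/.
# The "suite", "passed", and "total" fields are an assumed schema,
# and the sample file below is fabricated for the demo.
mkdir -p benchmarks/results/history
cat > benchmarks/results/history/sample_run.json <<'EOF'
{"suite": "eval", "passed": 58, "total": 61}
EOF

for f in benchmarks/results/history/*.json; do
  python3 - "$f" <<'PY'
import json, sys
d = json.load(open(sys.argv[1]))
rate = d["passed"] / d["total"]
print(f'{d["suite"]}: {d["passed"]}/{d["total"]} = {rate:.1%}')
PY
done
```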
Want to compare models? Model Arena runs the same suite across 14 LLMs and reports the hallucination rate for each model.
Numbers you can verify
beat numbers you can't.
Wauldo is the only RAG API with a trust_score on every answer. Start with the free tier — same pipeline, same benchmarks.
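A per-answer trust_score is only useful if you act on it. A minimal sketch of a consumer-side gate: the saved response below is fabricated, and the page only promises a trust_score field, so treat the rest of the shape (and the 0.7 threshold) as assumptions.

```shell
# Sketch: gate on the per-answer trust_score.
# The response body is a fabricated example; only the trust_score
# field is guaranteed by the page, and 0.7 is an arbitrary cutoff.
cat > answer.json <<'EOF'
{"answer": "NOT_FOUND", "trust_score": 0.42}
EOF

# Escalate low-trust answers instead of surfacing them.
python3 -c "
import json
d = json.load(open('answer.json'))
print('accept' if d['trust_score'] >= 0.7 else 'escalate')
"
```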