Every run. Every failure. In public.

Verified AI,
in numbers

Most AI companies ship glossy charts. We publish the raw JSON. Reproduce every number from this page with one command.

All runs committed to git

Hallucination rate · anti-hallucination suite
Adversarial · 70 injection / contradiction tests
Eval accuracy · 61 RAG + reasoning tasks
Factual (adv) · clean-source factual tests

Four charts. Zero smoothing.

The only line that should be flat is hallucination. Everything else is a story of the past two months.

Hallucination rate

Lower is better · anti-hallucination suite

target 0%

Adversarial pass rate

Injection · OOS · contradiction · multilingual

≥ 81%

Eval accuracy

61 tasks · RAG retrieval + reasoning + tools

≥ 77%

Latency (ms)

Eval suite · fast-path 26% · P50

trend ↓

How we measure

Three public suites. Same prompts every run. Failures stay in git forever.

Eval suite

61 tasks

RAG retrieval, cross-doc, reasoning, tool-use, anti-hallucination, negative cases. Every eval run compares against a baseline snapshot — regressions block the merge.

Hard suite

20 tasks

5 adversarial + 5 multi-hop + 10 RAG adversarial. Includes out-of-scope, entity confusion, negation, temporal, cross-doc contradiction, verbose distractors, false premises.

Task adversarial

70 tasks

10 factual · 15 out-of-scope · 25 injection (5 types) · 10 contradiction · 10 semantic + multilingual. Measures ISR, NOT_FOUND accuracy, conflict detection.
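The baseline comparison that gates eval-suite merges can be sketched in a few lines of shell. This is an illustrative sketch, not the repo's actual script; `baseline` and `current` stand in for pass counts read from the JSON snapshots.

```shell
# Illustrative regression gate (not the repo's actual script).
# baseline / current stand in for pass counts read from snapshots.
baseline=58
current=59

if [ "$current" -lt "$baseline" ]; then
  # Fewer passes than the baseline snapshot: fail the job.
  echo "regression: $current < $baseline" >&2
  exit 1
fi
echo "ok: $current >= $baseline"
```

In CI, the non-zero exit status is what blocks the merge.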

Run it yourself

One clone, one cargo run. No API key required. Point it at your own API, or ours.

reproduce.sh
# Clone the repo
git clone https://github.com/Benmebrouk/agentagentique.git
cd agentagentique

# Eval suite — 61 tasks, general quality
cargo run -p benchmarks --bin quality_bench -- --suite eval

# Hard suite — 20 tasks, adversarial RAG + multi-hop
cargo run -p benchmarks --bin quality_bench -- --suite hard

# Task adversarial — 70 tasks, injection / OOS / contradiction
cargo run -p benchmarks --bin task_adversarial -- \
    --url https://api.wauldo.com

# All results land in benchmarks/results/history/ as JSON
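Once a run finishes, the history directory can be inspected with standard tools. The snippet below is a sketch with a made-up schema (`passed` and `total` are illustrative field names, not the repo's actual format): it writes a stand-in result file, then pulls the newest run's pass count.

```shell
# Sketch only: this schema is illustrative, not the repo's actual format.
mkdir -p benchmarks/results/history
printf '{"suite":"eval","passed":59,"total":61}\n' \
    > benchmarks/results/history/2024-01-01-eval.json

# Newest result file wins; grep pulls one field without needing jq.
latest=$(ls -t benchmarks/results/history/*.json | head -n 1)
grep -o '"passed":[0-9]*' "$latest"   # prints "passed":59
```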

Want to compare models? Model Arena runs the same suite across 14 LLMs and reports hallucination rate per model.

Numbers you can verify
beat numbers you can't.

Wauldo is the only RAG API with a trust_score on every answer. Start with the free tier — same pipeline, same benchmarks.