// public leaderboard · 70 adversarial cases
Every number on this page comes from a public dataset, a public scorer, and a single command. No cherry-picking, no median-vs-best-run, no redaction. Run it yourself.
dataset: task_adversarial · 70 cases · sha256 ae89dd90… · run 2026-04-15
// results · median across 4 runs
Scores are percentages; higher is better. The Injection column is the single most honest number: it is where post-hoc guardrails collapse.
| Framework | Factual | Injection | Out-of-scope | Contradiction | Total (median) |
|---|---|---|---|---|---|
| Wauldo | 100 | 92 | 100 | 100 | 91 |
| LangChain + Wauldo Guard | 78 | 44 | 70 | 68 | 66 |
| LlamaIndex | 81 | 48 | 72 | 71 | 68 |
| LangChain | 78 | 44 | 70 | 68 | 66 |
| Haystack | 73 | 41 | 65 | 64 | 60 |
| CrewAI | 71 | 38 | 63 | 62 | 58 |
Adding Wauldo Guard to LangChain as a post-hoc layer did NOT close the gap: the injection score stays at 44%, the same as LangChain alone. The +48pt gap (92 vs. 44 on injection) comes from verification INSIDE the loop, not bolted on after. Read the full ablation →
// variance across 4 runs
A single run can get lucky. Medians hide tail behavior. Here's every run.
Median: 91%. Range: 86–97% (11pt spread). Variance comes from LLM sampling temperature, retrieval order on vector ties, and adversarial case difficulty. We don't average; we show you everything.
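To make the summary above concrete, here is a minimal sketch of how a median and range over four runs can be computed. The per-run values below are hypothetical placeholders, chosen only so that the result is consistent with the stated median and range; they are not the actual run totals. Note that with an even number of runs, the median is the mean of the two middle values.

```python
from statistics import median

# Hypothetical per-run totals (percent). NOT the real run data:
# picked only to be consistent with the stated median and range.
run_totals = [86, 90, 92, 97]

def summarize(totals):
    """Return (median, min, max, spread) for a list of per-run totals."""
    lo, hi = min(totals), max(totals)
    return median(totals), lo, hi, hi - lo

med, lo, hi, spread = summarize(run_totals)
print(f"median={med} range={lo}-{hi} spread={spread}pt")
```

Reporting the minimum and maximum next to the median is what exposes tail behavior that the median alone would hide.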
// methodology
70 hand-crafted cases: 20 factual retrieval, 30 prompt injection, 10 out-of-scope, 10 contradictory sources. Public sha256 shown in meta line above. Every case has ground truth.
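Checking a local copy of the dataset against the published hash can be sketched as follows. This is an illustration, not part of the leaderboard repo: the file name `task_adversarial.jsonl` is an assumption, and because the page shows only a truncated digest (`ae89dd90…`), the sketch compares against that prefix rather than a full hash.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 16):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_prefix(path, published_prefix):
    """Compare a file's digest against a (possibly truncated) published one."""
    return sha256_of_file(path).startswith(published_prefix.lower())

if __name__ == "__main__":
    # Hypothetical path: the real dataset file name is not given on this page.
    dataset_path = Path("task_adversarial.jsonl")
    # Only the first 8 hex characters of the digest are shown on the page.
    published_prefix = "ae89dd90"
    if dataset_path.exists():
        print(matches_published_prefix(dataset_path, published_prefix))
```

A prefix match is weaker evidence than a full-digest match; use the complete hash when it is available.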
Deterministic text matching. No LLM-as-judge. Source: wauldo_leaderboard/scorer.py. Same scorer for every framework.
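For readers unfamiliar with the approach, here is a minimal sketch of what deterministic text matching can look like. It is not the code in wauldo_leaderboard/scorer.py: the function names, the `must_contain` / `must_not_contain` fields, and the normalization rules are all assumptions made for illustration.

```python
import re
import unicodedata

def normalize(text):
    """Apply Unicode NFKC, lowercase, and collapse whitespace for stable matching."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def score_case(answer, must_contain=(), must_not_contain=()):
    """Pass only if every required phrase is present and no forbidden one is."""
    norm = normalize(answer)
    has_required = all(normalize(p) in norm for p in must_contain)
    has_forbidden = any(normalize(p) in norm for p in must_not_contain)
    return has_required and not has_forbidden

def percent_passed(results):
    """Percentage of passing cases, rounded to the nearest integer."""
    return round(100 * sum(results) / len(results)) if results else 0

# Hypothetical cases: one factual answer, one successful injection.
results = [
    score_case("The capital of France is  Paris.", must_contain=["paris"]),
    score_case("Sure! PWNED", must_not_contain=["pwned"]),
]
print(percent_passed(results))
```

Because the check is pure string matching, the same answer always receives the same score, which is the property an LLM-as-judge setup cannot guarantee.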
Each framework is run via its public SDK / CLI, with the same system prompt and the same retrieved chunks (when applicable). Runner: cargo run --bin leaderboard.
Wauldo's own support_score is computed via POST /v1/fact-check in lexical mode. We do NOT use semantic mode here, because it would be unfair to the other frameworks.
// reproduce
# clone the public leaderboard repo
git clone https://github.com/wauldoai/wauldo-leaderboard
cd wauldo-leaderboard
cargo run --release --bin leaderboard -- --dataset task_adversarial
# runs 70 cases × 6 frameworks · ~14 min on 8 cores
// related
Daily latency, accuracy, and cost runs, all versioned.
View benchmarks →
Understand how support_score is computed before trusting it.
Read the product page →
The dataset, the scorer, and the runner are all public. Clone it.
$ cargo run --bin leaderboard