// public leaderboard · 70 adversarial cases

Reproducible, not marketed.
Wauldo leads.

Every number on this page comes from a public dataset, a public scorer, and a single command. No cherry-picking, no best run passed off as the median, no redaction. Run it yourself.

dataset: task_adversarial · 70 cases · sha256 ae89dd90… · run 2026-04-15

// results · median across 4 runs

Framework comparison.

Higher is better. The Injection column is the single most honest number: it is where post-hoc guardrails collapse.

70-case adversarial · 4 runs · deterministic scoring
Framework                  Factual  Injection  Out-of-scope  Contradiction  Total (median)
Wauldo                         100         92           100            100              91
LangChain + Wauldo Guard        78         44            70             68              66
LlamaIndex                      81         48            72             71              68
LangChain                       78         44            70             68              66
Haystack                        73         41            65             64              60
CrewAI                          71         38            63             62              58
ABLATION PROOF

LangChain + Wauldo Guard as a post-hoc layer did NOT close the gap. Same 44% injection score as LangChain alone. The +48pt gap comes from verification INSIDE the loop, not bolted on after. Read the full ablation →
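The two figures in this claim can be checked directly against the Injection column of the table above. A minimal sketch, with the scores copied by hand into a dictionary; the variable names are illustrative and not part of the leaderboard repo:

```python
# Injection scores copied from the framework comparison table (percent).
injection = {
    "Wauldo": 92,
    "LangChain + Wauldo Guard": 44,
    "LangChain": 44,
}

# Effect of adding the guard as a post-hoc layer on top of LangChain.
post_hoc_gain = injection["LangChain + Wauldo Guard"] - injection["LangChain"]

# Gap between in-loop verification and the post-hoc configuration.
in_loop_gap = injection["Wauldo"] - injection["LangChain + Wauldo Guard"]

print(f"post-hoc gain: {post_hoc_gain:+d} pt")
print(f"in-loop gap:   {in_loop_gap:+d} pt")
```

The post-hoc gain comes out at zero and the in-loop gap at 48 points, which is the arithmetic behind the +48pt figure.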

// variance across 4 runs

Why we publish the range, not the best run.

A single run can get lucky. Medians hide tail behavior. Here's every run.

RUN 1 · 86% · 2026-04-10
RUN 2 · 91% · 2026-04-12
RUN 3 · 93% · 2026-04-14
RUN 4 · 97% · 2026-04-15

Median: 91% (the lower of the two middle runs). Range 86–97 (11pt spread). Variance comes from LLM sampling temperature, retrieval order on vector ties, and adversarial case difficulty. We don't average; we show you everything.
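The summary figures can be recomputed from the four run totals. A minimal sketch using Python's standard `statistics` module. With an even number of runs there are two middle values (91 and 93): the reported 91% matches the lower median, while the interpolated median would be 92.

```python
import statistics

# Per-run totals from the variance section (percent).
runs = [86, 91, 93, 97]

spread = max(runs) - min(runs)
median_low = statistics.median_low(runs)   # lower of the two middle values
median_interp = statistics.median(runs)    # mean of the two middle values

print(f"range {min(runs)}-{max(runs)}, spread {spread} pt")
print(f"lower median {median_low}, interpolated median {median_interp}")
```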

// methodology

How we measure, without hedging.

DATASET

70 hand-crafted cases: 20 factual retrieval, 30 prompt injection, 10 out-of-scope, 10 contradictory sources. Public sha256 shown in meta line above. Every case has ground truth.
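To confirm that a local copy of the dataset is the one scored here, its digest can be compared with the published value. A minimal sketch, assuming the dataset is a single file; the path below is hypothetical (the page does not say where the file lives in the repo), and because the page shows only the first eight hex characters, the check is a prefix comparison.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_prefix(digest: str, published_prefix: str) -> bool:
    """Compare a full digest against the truncated prefix shown on the page."""
    return digest.lower().startswith(published_prefix.lower())


if __name__ == "__main__":
    # Hypothetical path: adjust to wherever the dataset sits in your clone.
    dataset_path = Path("datasets/task_adversarial.jsonl")
    if dataset_path.exists():
        digest = sha256_of_file(dataset_path)
        # Only the prefix "ae89dd90" is published in the meta line.
        print(digest, matches_published_prefix(digest, "ae89dd90"))
```

A prefix match on eight hex characters is a quick sanity check, not a full integrity guarantee; compare against the complete digest if the repo publishes it.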

SCORER

Deterministic text matching. No LLM-as-judge. Source: wauldo_leaderboard/scorer.py. Same scorer for every framework.
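For readers unfamiliar with the approach, deterministic text matching can look roughly like the sketch below. This is an illustration of the general technique, not the contents of wauldo_leaderboard/scorer.py; the function names and the `must_contain` / `must_not_contain` fields are assumptions made for the example.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting does not affect the match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def score_case(answer: str, must_contain: list[str], must_not_contain: list[str]) -> bool:
    """Pass only if every required string appears and no forbidden string does."""
    haystack = normalize(answer)
    has_required = all(normalize(s) in haystack for s in must_contain)
    has_forbidden = any(normalize(s) in haystack for s in must_not_contain)
    return has_required and not has_forbidden


def score_percent(results: list[bool]) -> float:
    """Share of passed cases, as a percentage."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Hypothetical cases: one factual answer, one successful injection.
results = [
    score_case("The capital of France is  Paris.", ["paris"], []),
    score_case("Sure! PWNED", [], ["pwned"]),
]
print(score_percent(results))
```

Because the same inputs always produce the same verdict, re-scoring a saved run gives identical numbers, which is the property an LLM-as-judge setup cannot guarantee.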

RUNNER

Each framework is run via its public SDK or CLI, with the same system prompt and the same retrieved chunks (when applicable). Entry point: cargo run --bin leaderboard.

TRUST FLOOR

Wauldo's own support_score is computed via POST /v1/fact-check in lexical mode. We do NOT use semantic mode here; it would be unfair to the other frameworks.
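The page does not define how lexical mode computes support_score. As a generic illustration only, one common lexical measure is token overlap between a claim and its source: it depends on surface words alone, with no embedding model involved. The functions below are hypothetical and are not the /v1/fact-check implementation.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_support(claim: str, source: str) -> float:
    """Fraction of the claim's distinct tokens that also occur in the source."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(source)) / len(claim_tokens)


score = lexical_support(
    "refund within 30 days",
    "Refunds are accepted within 30 days of purchase",
)
print(score)
```

Here "refund" does not match "Refunds", so a purely lexical measure gives partial credit where a semantic comparison would likely treat the two as equivalent.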

// reproduce

Three commands. No hidden config.

# clone the public leaderboard repo
git clone https://github.com/wauldoai/wauldo-leaderboard
cd wauldo-leaderboard
cargo run --release --bin leaderboard -- --dataset task_adversarial
# runs 70 cases × 6 frameworks · ~14 min on 8 cores

// related

Keep going.

→ /benchmarks

See internal benchmark trends

Daily latency, accuracy, cost runs — all versioned.

View benchmarks →
→ /product

What's the primitive?

Understand how support_score is computed before trusting it.

Read the product page →
→ github

The bench is MIT.

Fork it, add your framework, send a PR.

Open source →
Reproducible build
MIT SDKs
Open-source leaderboard

Stop trusting vendor benchmarks. Run your own.

The dataset, the scorer, and the runner are all public. Clone it.

$ cargo run --bin leaderboard