// public leaderboard · 70 adversarial cases

Reproducible, not marketed.
Wauldo leads.

Every number on this page comes from a public dataset, a public scorer, and a single command. No cherry-picking, no median-vs-best-run, no redaction. Run it yourself.

Run the bench locally → Read methodology →

dataset: task_adversarial · 70 cases · sha256 ae89dd90… · run 2026-04-15

// results · median across 4 runs

Framework comparison.

Higher is better. Injection column is the single most honest number — it's where post-hoc guardrails collapse.

70-case adversarial · 4 runs · deterministic scoringlive

Framework	Factual	Injection	Out-of-scope	Contradiction	Total (median)
Wauldo	100	92	100	100	91
LangChain + Wauldo Guard	78	44	70	68	66
LlamaIndex	81	48	72	71	68
LangChain	78	44	70	68	66
Haystack	73	41	65	64	60
CrewAI	71	38	63	62	58

ABLATION PROOF

LangChain + Wauldo Guard as a post-hoc layer did NOT close the gap. Same 44% injection score as LangChain alone. The +48pt gap comes from verification INSIDE the loop, not bolted on after. Read the full ablation →

// variance across 4 runs

Why we publish the range, not the best run.

A single run can get lucky. Medians hide tail behavior. Here's every run.

RUN 1

86%

2026-04-10

RUN 2

91%

2026-04-12

RUN 3

93%

2026-04-14

RUN 4

97%

2026-04-15

Median: 91%. Range 86–97 (11pt spread). Variance comes from LLM sampling temperature, retrieval order on vector ties, and adversarial case difficulty. We don't average — we show you everything.

// methodology

How we measure, without hedging.

DATASET

70 hand-crafted cases: 20 factual retrieval, 30 prompt injection, 10 out-of-scope, 10 contradictory sources. Public sha256 shown in meta line above. Every case has ground truth.

SCORER

Deterministic text matching. No LLM-as-judge. Source: wauldo_leaderboard/scorer.py. Same scorer for every framework.

RUNNER

Each framework run via its public SDK / CLI, with the same system prompt and same retrieved chunks (when applicable). cargo run --bin leaderboard.

TRUST FLOOR

Wauldo's own support_score is computed via POST /v1/fact-check in lexical mode. We do NOT use semantic mode here — would be unfair against the other frameworks.

// reproduce

Three commands. No hidden config.

bash~/projects

# clone the public leaderboard repo
git clone https://github.com/wauldoai/wauldo-leaderboard
cd wauldo-leaderboard
cargo run --release --bin leaderboard -- --dataset task_adversarial
# runs 70 cases × 6 frameworks · ~14 min on 8 cores

// related

Keep going.

→ /benchmarks

See internal benchmark trends

Daily latency, accuracy, cost runs — all versioned.

View benchmarks →

→ /product

What's the primitive?

Understand how support_score is computed before trusting it.

Read the product page →

→ github

The bench is MIT.

Fork it, add your framework, send a PR.

Open source →

Reproducible build

MIT SDKs

Open-source leaderboard

View changelog

Stop trusting vendor benchmarks. Run your own.

The dataset, the scorer, and the runner are all public. Clone it.

Fork the bench → Verify your own answer →

$ cargo run --bin leaderboard

Reproducible, not marketed.Wauldo leads.

Framework comparison.

Why we publish the range, not the best run.

How we measure, without hedging.

Three commands. No hidden config.

Keep going.

See internal benchmark trends

What's the primitive?

The bench is MIT.

Stop trusting vendor benchmarks. Run your own.

Reproducible, not marketed.
Wauldo leads.