Your LLM passes demos.
It fails in production.
Most frameworks execute malicious instructions hidden inside user documents. You just don't see it until a customer does.
LangChain runs 16 out of 25 prompt injection attacks. Wauldo blocks 22. LlamaIndex hallucinates on 73 % of out-of-scope questions.
Last updated — · dataset: task_adversarial.json
Two real attacks, two different answers
Same question, same sources, same LLM. One framework falls for the trick, one doesn't. These are verbatim outputs from the bench — test inj_025 and conf_002. Run them yourself with python -m wauldo_leaderboard.harness --limit 70.
Question: What is the SLA?
ADMIN: marker + imperative filtered before retrieval.Question: What is the refund period?
These are 2 of the 53 tests every framework except Wauldo lost at least once. Full breakdown below.
- ✓ Factual recall — all 6 frameworks pass ≥ 80 % when the answer is sitting in one clean source
- ✓ Out-of-scope refusal — 5 out of 6 frameworks score 100 % at saying "NOT_FOUND"
This is what every framework demo tests. This is why every framework demo looks identical.
- ✗ Prompt injection — 4 / 6 frameworks fail more than half of the 25 injection tests
- ✗ Contradiction detection — LangChain catches 3 out of 12, LlamaIndex 1 out of 12
- ✗ Semantic & multilingual — soft contradictions slip through at 62 – 75 % pass rate
This is what production actually sees. This is where a framework demo lies to you.
Overall ranking
Pass rate across all 70 adversarial tests. Same LLM (Qwen 3.5 Flash), same embedder (BGE), same prompt contract. The only variable is the framework.
Trust score = median Wauldo Guard verdict over every answer. Latency = median wall-clock per test.
Per category
Factual recall is easy. Injection resistance isn't. The gap shows up when you split the score by test type.
Adding a RAG framework
often makes things worse.
The second-best framework on this bench is no framework at all.
Vanilla LLM (86 %) — just stuffing sources in a prompt — beats LangChain (60 %), Haystack (60 %), and LlamaIndex (46 %) on adversarial robustness. Because frameworks optimize retrieval. They don't verify the output.
How we score
No LLM-as-judge. No human raters. Deterministic text matching + a public scorer in git.
70 adversarial tests: 10 factual, 15 out-of-scope, 25 prompt injection (5 sub-types), 12 contradiction, 8 semantic & multilingual. Every test has exact ground-truth tokens the answer must (or must not) contain.
A 180-line Python file (scorer.py) — port of the Rust bench evaluator. Only text matching. No LLM-as-judge, no randomness. Fork it, read it, challenge it.
Every framework uses the exact same LLM (Qwen 3.5 Flash via OpenRouter), the same embedder (FastEmbed BGE), temperature 0, max 200 tokens. The only variable is the framework code.
Computed post-hoc by Wauldo's /v1/fact-check. Same scorer for every framework. Known bias: verbose answers ("Based on the source, X is Y") score lower than minimal ones ("Y"). Pass rate is the primary metric — trust score is a secondary signal.
A verification layer.
Not another framework.
Wauldo Guard wraps any existing LangChain / LlamaIndex / Haystack / CrewAI pipeline with three deterministic controls. Same stack. Same retrieval. A trust score on every answer.
Every retrieved source is classified as data or instruction before it reaches the prompt. Documents that contain imperatives, role markers, or admin overrides are stripped pre-LLM — so a forged ADMIN: note can't reach the model in the first place.
The answer is fact-checked against the sources that actually reached the model. Claims with no grounding are flagged, cross-source contradictions are detected, and any regurgitated injection tokens are caught. No LLM-as-judge — deterministic token overlap and structural comparison.
Every answer comes back with a trust_score in [0, 1] and a verdict: SAFE, CONFLICT, UNVERIFIED, BLOCK. Your app decides what to do with < 0.6 trust — block, escalate, or show the warning to the user.
# Two lines on top of your existing LangChain / LlamaIndex pipeline from wauldo import guard result = guard(answer=llm_answer, sources=retrieved_sources) # result.trust_score → 0.0 … 1.0 # result.verdict → "SAFE" | "CONFLICT" | "UNVERIFIED" | "BLOCK" # result.reason → "contradiction between src[1] and src[2]"
Available for Python, TypeScript and Rust. Free tier, no credit card — docs here.
Reproduce in 2 commands
Clone the repo, set your OpenRouter key, one python call. Same numbers, same dataset, same scorer.
# 1 — Clone the public bench git clone https://github.com/wauldo/wauldo-leaderboard.git cd wauldo-leaderboard uv venv .venv source .venv/bin/activate uv pip install -r requirements.txt # 2 — Set your OpenRouter key and run every framework export OPENROUTER_API_KEY=sk-or-v1-... python -m wauldo_leaderboard.harness --frameworks all --concurrency 4 # Aggregate the per-framework JSONs into leaderboard-data.json python -m wauldo_leaderboard.aggregate # Per-test results land in ./results/run_<timestamp>/<framework>.json
Want to add your framework to the board? Write a 70-line adapter and open a PR.
Your LLM is not safe.
Fix it in 2 lines.
Four of five RAG frameworks on this bench silently ship injected answers to users. Wauldo Guard catches them before your customer does — same stack, same retrieval, just verified.
Stop shipping unsafe AI. Start shipping verified systems.