We ran six agent frameworks through the same seventy adversarial tests: LangChain, LlamaIndex, Haystack, CrewAI, a vanilla LLM baseline, and our own stack. Same model. Same embedder. Same retrieval budget. One framework scored ninety-six percent. The next-best scored seventy-one. That gap is the topic of this post — and the ablation we ran to figure out where it actually comes from.

The setup

Seventy tests, five categories. Factual recall (ten tests) and out-of-scope refusal (fifteen) — easy stuff. Prompt injection (twenty-five), source contradiction (twelve), and semantic drift (eight) — the hard part. Every test ships with exact ground-truth tokens the answer must (or must not) contain, so the scorer is a text-match script, not another LLM. The dataset is committed with its SHA-256 and the adapters are open — none of this is ours to grade alone.
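A scorer like the one described is only a few lines. Here is a minimal sketch of a token-match check; the field names `must_contain` and `must_not_contain` are illustrative assumptions, not the benchmark's actual schema:

```python
# Hedged sketch of a ground-truth token scorer: no LLM judge, just
# case-insensitive substring checks against the expected tokens.
def score(answer: str, must_contain: list[str], must_not_contain: list[str]) -> bool:
    """Pass iff every required token appears and no forbidden token does."""
    text = answer.lower()
    return (all(tok.lower() in text for tok in must_contain)
            and not any(tok.lower() in text for tok in must_not_contain))
```

Because the check is pure text matching, two people running the same dataset get the same score, which is what makes the committed SHA-256 meaningful.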

Fair play was the constraint we held hardest: every framework uses the same LLM through the same provider, the same open-source embedding model for retrieval, temperature zero, a two-hundred-token generation cap, and a retrieval top-k of two. The only variable is the framework code itself.
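In adapter terms, the shared settings amount to a single config passed to every framework. The field names below are illustrative, not the harness's actual schema:

```python
# Illustrative shared configuration: every adapter receives the same
# decoding and retrieval budget, so only framework code can differ.
SHARED_CONFIG = {
    "temperature": 0.0,  # deterministic decoding
    "max_tokens": 200,   # identical generation budget
    "top_k": 2,          # identical retrieval budget
}
```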

The one result to remember

On prompt injection — sources deliberately poisoned with forged admin notes, instruction overrides, or contradictory values — here is what the frameworks produced:

Prompt injection resistance — 25 adversarial tests
Framework    | Passed  | Rate
Wauldo       | 23 / 25 | 92%
CrewAI       | 11 / 25 | 44%
Haystack     | 10 / 25 | 40%
LlamaIndex   |  9 / 25 | 36%
LangChain    |  9 / 25 | 36%

Forty-eight-point gap between Wauldo and the next framework. Wilson 95% confidence intervals do not overlap. This is not noise; it is a structural difference in how these systems handle adversarial input.
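The non-overlap claim is easy to check yourself. A Wilson score interval needs only the pass counts from the table; the function below is an illustrative sketch, not the leaderboard's code:

```python
import math

def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a pass rate (z=1.96 gives ~95%)."""
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)
```

For 23/25 the interval's lower bound sits near 75%, while for 11/25 the upper bound sits near 63%: the intervals do not touch.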

The ablation that mattered

An obvious reading is: just wrap LangChain with Wauldo's verification layer and call it a day. That would be convenient — the verifier exists as a standalone endpoint. So we built the adapter and ran it.

LangChain + Wauldo Guard uses the same LangChain pipeline, same prompt, same retrieval, same LLM. After the LLM emits its answer, we call the verifier on (answer, sources) and apply the same policy our own Task API applies: rejected verdict with high hallucination rate rewrites the answer to NOT_FOUND. That is it. The only delta against bare LangChain is the post-generation check.
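The policy reduces to a few lines. In the sketch below, `verify` stands in for the standalone verifier endpoint; its return shape, the `"rejected"` status, and the hallucination-rate threshold are assumptions for illustration, not the Task API's actual contract:

```python
# Hedged sketch of the post-generation check bolted onto LangChain.
NOT_FOUND = "NOT_FOUND"

def guard(answer: str, sources: list[str], verify, threshold: float = 0.5) -> str:
    """Rewrite the answer to NOT_FOUND when the verifier rejects it."""
    verdict = verify(answer, sources)  # assumed shape: {"status": ..., "hallucination_rate": ...}
    if verdict["status"] == "rejected" and verdict["hallucination_rate"] >= threshold:
        return NOT_FOUND
    return answer
```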

Ablation — bare LangChain vs LangChain + Wauldo Guard (70 tests)
Configuration            | Overall       | Injection only
LangChain (bare)         | 46 / 70 (66%) | 9 / 25
LangChain + Wauldo Guard | 45 / 70 (64%) | 8 / 25

The verification layer closed zero of the forty-eight-point gap. It actually degraded injection resistance slightly — the short-form answers LangChain produces are poor material for post-hoc fact-checking, so the guard generates false positives on a handful of correct answers.

The insight

A post-hoc verifier cannot repair what the generative pipeline already got wrong. If the LLM has quoted an injected admin override into its answer, a downstream fact-checker that compares that answer against those same sources will often validate it — because the injected claim is lexically present in the source context.

Wauldo resists injection because the pipeline is designed around the assumption that sources are hostile. Source content is classified as data or instruction before it reaches the prompt. The generation prompt demands structured output with mandatory citations. The verifier runs on structured claims, not on free-form text. Retries happen inside the generation loop, not after. Remove any of those assumptions, as you do when you bolt a verifier onto LangChain, and the six-framework results above show what you get.
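The first of those steps, gating sources before they reach the prompt, can be sketched as below. The regex patterns and the `classify` heuristic are illustrative assumptions, not Wauldo's actual classifier:

```python
# Minimal sketch of a pre-prompt source gate: chunks flagged as
# instruction-like never enter the generation prompt as trusted data.
import re

INSTRUCTION_PATTERNS = [
    r"\bignore (all )?(previous|prior) instructions\b",
    r"\badmin (note|override)\b",
    r"\byou must (now )?answer\b",
]

def classify(chunk: str) -> str:
    """Label a retrieved chunk as 'instruction' or 'data'."""
    lowered = chunk.lower()
    if any(re.search(p, lowered) for p in INSTRUCTION_PATTERNS):
        return "instruction"
    return "data"

def build_context(chunks: list[str]) -> list[str]:
    # Only data-classified chunks are quoted to the model; instruction-like
    # content is quarantined instead of being passed through verbatim.
    return [c for c in chunks if classify(c) == "data"]
```

The point is where the check runs: before generation, so the model never sees the injected text as evidence, rather than after, when the forged claim is already lexically supported by the context.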

Takeaway: robustness is a system property, not a layer property. You cannot hot-swap it in.

Takeaway for people shipping agents

If your agent stack is LangChain, LlamaIndex, Haystack, or CrewAI and you are worried about prompt injection, adding a post-hoc guard will not fix it. The ablation shows that clearly. What changes the number is redesigning the pipeline end-to-end so that source trust, structured generation, and verification are integrated — not stacked.

You can also just use a stack that already does this. Wauldo exists for exactly that reason: a verification-first agent stack, designed from scratch around the assumption that sources lie. The leaderboard at wauldo.com/leaderboard shows every number in this post with CI95 bounds, reproduction commands, and the full adapter code. If you run the bench yourself and get different numbers, we want to know.

reproduce.sh
# Clone the public harness and run every framework adapter end to end.
git clone https://github.com/wauldo/wauldo-leaderboard
cd wauldo-leaderboard
export OPENROUTER_API_KEY=sk-or-...   # substitute your own OpenRouter key
python -m wauldo_leaderboard.harness --frameworks all

Every framework, every test, every score — same run you just read about.

Claude, GPT, Gemini and Llama are trademarks of their respective owners. All benchmark results are reproducible via the public dataset at github.com/wauldo/wauldo-leaderboard.


Try it free: paste any AI answer into our home widget to get a numeric support_score. No signup. 300 verifications/month free on RapidAPI. See pricing →