Agent "calls" a search tool but misremembers the results in its reasoning two steps later. The tool output is in the trace. The claim in the final answer is not. No one notices until a user pastes it into Slack.
Agent hallucinations compound. Each step's slight drift becomes the next step's ground truth. A tool call is "remembered" inaccurately two steps later. A fabricated citation gets quoted as a source. By the time a user sees the final answer, the reasoning is fiction dressed up as an audit trail. Wauldo scores every step, or the whole run, against the actual tool outputs.
Agent "calls" a search tool but misremembers the results in its reasoning two steps later. The tool output is in the trace. The claim in the final answer is not. No one notices until a user pastes it into Slack.
Step N treats step N-1's fabricated claim as ground truth. The chain looks rigorous, each step "cites" the previous one, but the root is invented. ReAct loops with introspection make this worse, not better.
Your agent scores well on the canonical benchmarks (AgentBench, SWE-bench, ToolBench), yet adversarial users in production hit failure rates of 40%+. Those benchmarks test task completion, not output grounding. Shipping requires a separate measurement.
Verify each agent step inside the loop. Pass the step's output and the step's source context (tool result, retrieved chunk, previous reasoning). Use the returned support_score to short-circuit weak branches, retry the tool call, or force the agent to re-plan. Adds latency per step, catches drift early, keeps final answers tight.
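A minimal sketch of the inline pattern, assuming a ReAct-style loop that exposes each step. The iter_steps, step.output, step.context, and step.retry names are illustrative, not part of any framework, and the 0.6 threshold is a tuning choice; only the Wauldo client and fact_check call match the quickstart below.

import os
from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])
agent = build_agent(tools=[search, calculator, db_query])  # as in the quickstart below

# Hypothetical step-wise loop: verify each step against its own local context
for step in agent.iter_steps(query):          # iter_steps is illustrative
    verdict = w.fact_check(
        text=step.output,                     # the claim the step just produced
        source_context=step.context,          # tool result, retrieved chunk, or prior reasoning
        mode="hybrid",
    )
    if verdict.support_score < 0.6:           # threshold is a policy choice, not a fixed value
        step.retry()                          # or force a re-plan, prune the branch, or log and continue

Whether a weak score triggers a retry, a re-plan, or just a log entry is your policy decision; the verifier only supplies the score.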
Run the full agent. At the end, verify the final output against the concatenated tool traces as sources. Good for log-time analysis without changing the hot path. Use the verdict to tag runs for human review, or as a post-hoc filter before returning to the user. One call, no loop disruption.
import os
from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])

# Any agent framework: LangChain, LlamaIndex, custom
agent = build_agent(tools=[search, calculator, db_query])
result = agent.run(query)

# tool_outputs = concatenated observations across steps
verified = w.fact_check(
    text=result.final_answer,
    source_context=result.tool_outputs,
    mode="hybrid",
)

response = {"answer": result.final_answer}
if verified.support_score < 0.6:
    response["warning"] = "low grounding"
| Framework | Factual (%) | Injection (%) | Out-of-scope (%) | Total (%) |
|---|---|---|---|---|
| Wauldo | 100 | 92 | 100 | 91 |
| LlamaIndex | 81 | 48 | 72 | 68 |
| LangChain | 78 | 44 | 70 | 66 |
| Haystack | 73 | 41 | 65 | 60 |
| CrewAI | 71 | 38 | 63 | 58 |
Reproduce: git clone https://github.com/wauldo/wauldo-leaderboard && cargo run · full methodology →
We ran LangChain + Wauldo Guard as a post-hoc layer. The injection score stayed at 44%. The +48pt gap doesn't come from adding a layer — it comes from verification inside the loop. System-level robustness, not bolt-on. Read the ablation study →
Score retrieved-chunk grounding. Catch citations that don't exist. Stop shipping plausible fiction.
Every reply scored against your knowledge base. Block low-grounding answers before they reach customers.
How Wauldo works under the hood. Architecture, verdicts, support score, API.
Framework-agnostic. Any agent that produces (final_answer, tool_outputs) — or per-step (step_output, step_context) — can be wrapped. LangChain, LlamaIndex, CrewAI, AutoGen, custom ReAct loops, Rust, Go, Python. The API is HTTP, not a framework integration.
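Since the endpoint is plain HTTP, a raw call works from any language without an SDK. A sketch in Python with requests, assuming the endpoint path from the curl snippet below and the field names from the quickstart above; the Bearer auth scheme and response shape are assumptions, so check the API reference.

import os
import requests

resp = requests.post(
    "https://api.wauldo.com/v1/fact-check",
    headers={"Authorization": f"Bearer {os.environ['WAULDO_API_KEY']}"},  # auth scheme assumed
    json={
        "text": result.final_answer,            # result from agent.run(query), as in the quickstart
        "source_context": result.tool_outputs,  # concatenated tool observations
        "mode": "hybrid",
    },
    timeout=30,
)
print(resp.json().get("support_score"))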
No. Wauldo is orthogonal to prompting. It scores what comes out. Change prompts to reduce the rate of drift; keep Wauldo to measure whether it worked.
Yes. Call fact_check per step with the step's output and its local context (the tool result, the retrieved chunk, the previous reasoning). See inline mode above. Trade-off is latency per step versus tighter drift control.
Verification runs on full text. Buffer the final answer (or the step output) before scoring. For token-by-token UX, stream to the user, then verify in a background task and either confirm or retract.
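A sketch of that stream-then-verify pattern with asyncio, assuming the Wauldo client from the quickstart; stream_answer, the agent's tool_outputs attribute, and the trailing warning string are illustrative, not Wauldo API.

import asyncio
import os
from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])

async def stream_then_verify(agent, query):
    chunks = []
    # Stream tokens to the user as they arrive (stream_answer is illustrative)
    async for token in agent.stream_answer(query):
        chunks.append(token)
        yield token

    # Verify off the hot path once the full answer is buffered
    verdict = await asyncio.to_thread(
        w.fact_check,
        text="".join(chunks),
        source_context=agent.tool_outputs,  # illustrative attribute
        mode="hybrid",
    )
    if verdict.support_score < 0.6:
        yield "\n[warning: low grounding against tool outputs]"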
Self-reflection assumes the agent knows when it's wrong. If it did, it wouldn't have hallucinated in the first place. Self-critique correlates with the same blind spots as the original generation. Use external measurement.
Free tier. 300 verifications/month. No credit card.
$ curl api.wauldo.com/v1/fact-check