Agent "calls" a search tool but misremembers the results in its reasoning two steps later. The tool output is in the trace. The claim in the final answer is not. No one notices until a user pastes it into Slack.
Agent hallucinations compound. Each step's slight drift becomes the next step's ground truth. A tool call is "remembered" inaccurately two steps later. A fabricated citation gets quoted as a source. By the time a user sees the final answer, the reasoning is fiction dressed up as an audit trail. Wauldo scores every step, or the whole run, against the actual tool outputs.
Agent "calls" a search tool but misremembers the results in its reasoning two steps later. The tool output is in the trace. The claim in the final answer is not. No one notices until a user pastes it into Slack.
Step N treats step N-1's fabricated claim as ground truth. The chain looks rigorous, each step "cites" the previous one, but the root is invented. ReAct loops with introspection make this worse, not better.
Your agent scores well on the canonical benchmarks (AgentBench, SWE-bench, ToolBench), yet adversarial users in production hit failure rates of 40%+. Those benchmarks test task completion, not output grounding. Shipping requires a separate measurement.
Verify each agent step inside the loop. Pass the step's output and the step's source context (tool result, retrieved chunk, previous reasoning). Use the returned support_score to short-circuit weak branches, retry the tool call, or force the agent to re-plan. Adds latency per step, catches drift early, keeps final answers tight.
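A minimal sketch of the inline pattern, assuming a ReAct-style loop that exposes each step. The iter_steps, step.output, step.context, and step.retry names are illustrative, not part of any framework, and the 0.6 threshold is a tuning choice; only the Wauldo client and fact_check call match the quickstart below.

import os
from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])
agent = build_agent(tools=[search, calculator, db_query])  # as in the quickstart below

# Hypothetical step-wise loop: verify each step against its own local context
for step in agent.iter_steps(query):          # iter_steps is illustrative
    verdict = w.fact_check(
        text=step.output,                     # the claim the step just produced
        source_context=step.context,          # tool result, retrieved chunk, or prior reasoning
        mode="hybrid",
    )
    if verdict.support_score < 0.6:           # threshold is a policy choice, not a fixed value
        step.retry()                          # or force a re-plan, prune the branch, or log and continue

Whether a weak score triggers a retry, a re-plan, or just a log entry is your policy decision; the verifier only supplies the score.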
Run the full agent. At the end, verify the final output against the concatenated tool traces as sources. Good for log-time analysis without changing the hot path. Use the verdict to tag runs for human review, or as a post-hoc filter before returning to the user. One call, no loop disruption.
import os
from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])

# Any agent framework: LangChain, LlamaIndex, custom
agent = build_agent(tools=[search, calculator, db_query])
result = agent.run(query)

# tool_outputs = concatenated observations across steps
verified = w.fact_check(
    text=result.final_answer,
    source_context=result.tool_outputs,
    mode="hybrid",
)

response = {"answer": result.final_answer}
if verified.support_score < 0.6:
    response["warning"] = "low grounding"
| Framework | Factual (%) | Injection (%) | Out-of-scope (%) | Total (%) |
|---|---|---|---|---|
| Wauldo | 100 | 92 | 100 | 91 |
| LlamaIndex | 81 | 48 | 72 | 68 |
| LangChain | 78 | 44 | 70 | 66 |
| Haystack | 73 | 41 | 65 | 60 |
| CrewAI | 71 | 38 | 63 | 58 |
Reproduce: git clone https://github.com/wauldo/wauldo-leaderboard && cargo run · full methodology →
We ran LangChain + Wauldo Guard as a post-hoc layer. The injection score stayed at 44%. The +48pt gap doesn't come from adding a layer — it comes from verification inside the loop. System-level robustness, not bolt-on. Read the ablation study →
Score retrieved-chunk grounding. Catch citations that don't exist. Stop shipping plausible fiction.
Every reply scored against your knowledge base. Block low-grounding answers before they reach customers.
How Wauldo works under the hood. Architecture, verdicts, support score, API.
Framework-agnostic. Any agent that produces (final_answer, tool_outputs) — or per-step (step_output, step_context) — can be wrapped. LangChain, LlamaIndex, CrewAI, AutoGen, custom ReAct loops, Rust, Go, Python. The API is HTTP, not a framework integration.
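Since the endpoint is plain HTTP, a raw call works from any language without an SDK. A sketch in Python with requests, assuming the endpoint path from the curl snippet below and the field names from the quickstart above; the Bearer auth scheme and response shape are assumptions, so check the API reference.

import os
import requests

resp = requests.post(
    "https://api.wauldo.com/v1/fact-check",
    headers={"Authorization": f"Bearer {os.environ['WAULDO_API_KEY']}"},  # auth scheme assumed
    json={
        "text": result.final_answer,            # result from agent.run(query), as in the quickstart
        "source_context": result.tool_outputs,  # concatenated tool observations
        "mode": "hybrid",
    },
    timeout=30,
)
print(resp.json().get("support_score"))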
No. Wauldo is orthogonal to prompting. It scores what comes out. Change prompts to reduce the rate of drift; keep Wauldo to measure whether it worked.
Yes. Call fact_check per step with the step's output and its local context (the tool result, the retrieved chunk, the previous reasoning). See inline mode above. Trade-off is latency per step versus tighter drift control.
Verification runs on full text. Buffer the final answer (or the step output) before scoring. For token-by-token UX, stream to the user, then verify in a background task and either confirm or retract.
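A sketch of that stream-then-verify pattern with asyncio, assuming the Wauldo client from the quickstart; stream_answer, the agent's tool_outputs attribute, and the trailing warning string are illustrative, not Wauldo API.

import asyncio
import os
from wauldo import Wauldo

w = Wauldo(api_key=os.environ["WAULDO_API_KEY"])

async def stream_then_verify(agent, query):
    chunks = []
    # Stream tokens to the user as they arrive (stream_answer is illustrative)
    async for token in agent.stream_answer(query):
        chunks.append(token)
        yield token

    # Verify off the hot path once the full answer is buffered
    verdict = await asyncio.to_thread(
        w.fact_check,
        text="".join(chunks),
        source_context=agent.tool_outputs,  # illustrative attribute
        mode="hybrid",
    )
    if verdict.support_score < 0.6:
        yield "\n[warning: low grounding against tool outputs]"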
Self-reflection assumes the agent knows when it's wrong. If it did, it wouldn't have hallucinated in the first place. Self-critique correlates with the same blind spots as the original generation. Use external measurement.
Free tier. 300 verifications/month. No credit card.
$ curl api.wauldo.com/v1/fact-check