// engineering notes

We write when we have something to measure.

No thought leadership. No "state of AI" posts. Just essays with code, benchmarks, and repro steps. If a post has a number in the title, the number is in the post.

// featured · 2026-04-12

System-level robustness vs bolt-on layer.

We ran the obvious experiment: take LangChain, add Wauldo Guard as a post-hoc check, and see if the +48pt injection gap closes. Spoiler: it doesn't. Here's why verification inside the loop is not the same as verification around the loop.
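
For concreteness, here is roughly what "around the loop" means, as a minimal Python sketch; run_chain() and guard_check() are hypothetical stand-ins for a LangChain-style agent and Wauldo Guard, not real APIs.

    def run_chain(user_input: str) -> str:
        # placeholder: imagine a multi-step agent with tool calls in here
        return f"final answer for: {user_input}"

    def guard_check(text: str) -> bool:
        # placeholder: flags text that looks injected or unsupported
        return "ignore previous instructions" in text.lower()

    def around_the_loop(user_input: str) -> str:
        # Bolt-on guard: the chain runs unchecked; the guard only sees the final text.
        answer = run_chain(user_input)
        return "[blocked]" if guard_check(answer) else answer

    # "Inside the loop" would instead check every intermediate tool result and
    # model step before it feeds the next one -- the part a post-hoc layer never sees.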

Read time ~9 min · Published 2026-04-12 · /blog/ablation-system-vs-layer

KEY TAKEAWAY

Guard around LangChain: injection 44%. LangChain alone: injection 44%. Wauldo: injection 92%. The gap lives in the reasoning path, not at its boundary.

// all posts

12 essays, newest first.

2026-04-12 · benchmark

System-level robustness vs bolt-on layer: why LangChain + Guard didn't close the gap

We bolted Wauldo Guard onto LangChain, re-ran the 70-test suite, and watched injection accuracy stay flat at 44%.

Read post →
2026-04-11 · deep-dive

Wauldo Deploy — shipping the verification primitive to prod

How we turned a research repo into a production API — and the plumbing nobody writes about.

Read post →
2026-04-05 · tutorial

How to get verified AI answers in 5 minutes

Copy-paste a curl, read back a support_score, ship a verification gate before your next deploy. Three steps, five minutes, no framework lock-in.
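
As a rough Python equivalent of that curl (the endpoint URL, payload shape, and 0.9 threshold below are illustrative, not the real API; support_score is the field the gate reads):

    import requests

    def verify(answer: str, sources: list[str]) -> float:
        # placeholder endpoint and payload; swap in the real ones from the post
        resp = requests.post(
            "https://api.example.com/v1/verify",
            json={"answer": answer, "sources": sources},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["support_score"]

    score = verify("Our SLA is 99.95% uptime.", ["SLA doc: uptime target is 99.95%."])
    if score < 0.9:  # pick a threshold that matches your risk tolerance
        raise RuntimeError(f"verification gate failed (support_score={score})")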

Read post →
2026-04-02 · benchmark

We tested 14 LLMs for hallucination on Claude 3.5 Sonnet queries — here are the results

Same 61 eval tasks, 14 models, one honest scoreboard. The hallucination rates are worse than the vendors' release notes suggest.

Read post →
2026-03-29 · essay

Your LLM is lying in production — here's how to prove it

You can't fix what you don't measure. A practical playbook for catching silent hallucinations in the wild, without waiting for a user to tweet screenshots.

Read post →
2026-03-25 · deep-dive

LangChain hallucinations: why retrieval alone doesn't fix them

RAG retrieves the right chunks, then the model ignores them and writes fiction. We trace the failure mode to the generation step, not the index.

Read post →
2026-03-20 · tutorial

How to verify OpenAI responses before users see them

Drop-in middleware pattern: OpenAI call, verification call, reject or annotate before the user loads the page. Same latency budget, fewer apologies.
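
A minimal sketch of that middleware shape, assuming the official openai Python SDK for the generation call; verify() stands in for the verification call and is a hypothetical helper, not a real client.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def verify(answer: str, sources: list[str]) -> float:
        # stand-in for the verification call; see the gate sketch above
        return 1.0

    def answer_with_gate(question: str, sources: list[str]) -> dict:
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
        )
        answer = completion.choices[0].message.content
        score = verify(answer, sources)
        if score < 0.9:
            # reject (or annotate) before the user ever sees the text
            return {"answer": None, "error": "failed verification", "support_score": score}
        return {"answer": answer, "support_score": score}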

Read post →
2026-03-15 · deep-dive

Zero hallucinations: how our RAG pipeline works

Hybrid BM25 + vector, tenant-scoped chunks, claim-level extraction, per-claim grounding. The architecture behind 0% measured hallucination on 61 eval tasks.
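
A naive sketch of the per-claim grounding gate, to make the idea concrete; the claim splitter and grounded() check below are crude placeholders, not the pipeline from the post.

    def split_claims(draft: str) -> list[str]:
        # crude stand-in for claim-level extraction
        return [s.strip() for s in draft.split(".") if s.strip()]

    def grounded(claim: str, chunks: list[str]) -> bool:
        # crude stand-in for per-claim grounding (the real check is model-based)
        return any(claim.lower() in chunk.lower() for chunk in chunks)

    def passes_gate(draft: str, chunks: list[str]) -> bool:
        # every claim must be supported by at least one retrieved chunk
        return all(grounded(claim, chunks) for claim in split_claims(draft))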

Read post →
2026-03-10 · opinion

The real cost of shipping unverified AI to users

The $0.002 LLM call is not the expensive part. Support tickets, legal exposure, and churn from one bad answer dwarf the entire inference bill.

Read post →
2026-03-05 · tutorial

How to fact-check LLM outputs automatically

Claim extraction, source grounding, numerical-mismatch detection, a support_score you can gate on. Wire it into your pipeline in one afternoon.
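
The numerical-mismatch piece is the easiest to show in a few lines; this toy version flags any number in the answer that appears in no source, and is illustrative only.

    import re

    def numbers(text: str) -> set[str]:
        return set(re.findall(r"\d+(?:\.\d+)?%?", text))

    def numeric_mismatches(answer: str, sources: list[str]) -> set[str]:
        source_nums = set().union(*(numbers(s) for s in sources)) if sources else set()
        return numbers(answer) - source_nums

    print(numeric_mismatches("Uptime was 99.99% in Q3.", ["Report: uptime was 99.95% in Q3."]))
    # {'99.99%'} -- the answer's figure appears in no source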

Read post →
2026-02-28 · opinion

'It works most of the time' is not good enough for AI

91% accuracy sounds great until you run a million requests a month. Why "most of the time" is the new "dropped every tenth packet" of AI infrastructure.
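
The arithmetic behind that line, spelled out:

    requests_per_month = 1_000_000
    error_rate = 0.09                      # i.e. "91% accuracy"
    bad_answers_per_month = round(requests_per_month * error_rate)
    print(bad_answers_per_month)           # 90000 -- roughly 3,000 wrong answers a day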

Read post →
2026-02-22 · tutorial

5 mistakes to avoid when building RAG systems

Global BM25 across tenants, naive chunking, no claim grounding, no verification gate, no evals. We've made four of these. Don't repeat them.
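
The first mistake on that list is the cheapest to show; a sketch of the fix, assuming a hypothetical index object whose search() accepts a metadata filter.

    def retrieve(index, query: str, tenant_id: str, k: int = 5):
        # WRONG: index.search(query, k) -- a global index ranks every tenant's chunks
        # RIGHT: scope candidates to the requesting tenant before ranking
        return index.search(query, k, filter={"tenant_id": tenant_id})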

Read post →
SUBSCRIBE

Posts ship sporadically — when we have a number to report or a pattern to share. Subscribe via RSS to catch every one. No email capture.

// adjacent reading

The numbers, not the essays.

Essays are fine. Measuring is better.

Paste an AI answer into the widget on our home page to see its support_score live.