No thought leadership. No "state of AI" post. Essays with code, benchmarks, and repro steps. If the post has a number in the title, the number is in the post.
We ran the obvious experiment: take LangChain, add Wauldo Guard as a post-hoc check, and see if the +48pt injection gap closes. Spoiler: it doesn't. Here's why verification inside the loop is not the same as verification around the loop.
Guard around LangChain: injection 44%. LangChain alone: injection 44%. Wauldo: injection 92%. The gap lives in the reasoning path, not at its boundary.
We bolted Wauldo Guard onto LangChain, re-ran the 70-test suite, and watched injection accuracy stay flat at 44%.
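The inside-vs-around distinction fits in a few lines. Everything below is a toy illustration, not Wauldo Guard's actual API: `verify` stands in for any grounding check, and the step lists are hypothetical.

```python
def verify(text: str, sources: list[str]) -> bool:
    # Toy verifier: a step passes only if it appears verbatim in some source.
    # Stand-in for any real grounding check.
    return any(text in s for s in sources)

def guard_around(steps: list[str], sources: list[str]):
    # Post-hoc guard: run the whole agent loop, then check only the final answer.
    final = steps[-1]
    return final if verify(final, sources) else None

def guard_inside(steps: list[str], sources: list[str]):
    # In-loop verification: check every intermediate step, so an injected
    # step is caught before later steps can build on it.
    for step in steps:
        if not verify(step, sources):
            return None
    return steps[-1]
```

If an injected intermediate step steers the loop toward an answer that happens to be groundable, the post-hoc guard waves it through while the in-loop check rejects it: that is the gap a boundary check cannot close.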
2026-04-11 · deep-dive
How we turned a research repo into a production API — and the plumbing nobody writes about.
2026-04-05 · tutorial
Copy-paste a curl, read back a support_score, ship a verification gate before your next deploy. Three steps, five minutes, no framework lock-in.
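The gate in that tutorial reduces to one comparison. A minimal sketch in Python rather than curl; the `support_score` field name comes from the post, while the response shape and the 0.8 threshold are assumptions for illustration.

```python
import json

# Hypothetical response body: only the `support_score` field name is from
# the post, the rest of the shape is assumed.
SAMPLE_RESPONSE = '{"support_score": 0.42, "claims": 3}'

def gate(response_body: str, threshold: float = 0.8) -> bool:
    """Deploy gate: ship the answer only if support_score clears the bar."""
    return json.loads(response_body)["support_score"] >= threshold
```

In a pipeline, the curl step would POST the model's answer to the verification endpoint and feed the JSON it gets back into `gate` before rendering anything to the user.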
2026-04-02 · benchmark
Same 61 eval tasks, 14 models, one honest scoreboard. The hallucination rates are worse than the vendors' release notes suggest.
2026-03-29 · essay
You can't fix what you don't measure. A practical playbook for catching silent hallucinations in the wild, without waiting for a user to tweet screenshots.
2026-03-25 · deep-dive
RAG retrieves the right chunks, then the model ignores them and writes fiction. We trace the failure mode to the generation step, not the index.
2026-03-20 · tutorial
Drop-in middleware pattern: OpenAI call, verification call, reject or annotate before the user loads the page. Same latency budget, fewer apologies.
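The middleware pattern can be sketched as a plain wrapper. `call_model` and `call_verifier` are stand-ins for the OpenAI call and the verification call, and the 0.8 threshold is an assumption, not a recommended default.

```python
def with_verification(call_model, call_verifier, threshold: float = 0.8):
    """Wrap a model call so every answer is verified before it is returned."""
    def handler(prompt: str) -> dict:
        answer = call_model(prompt)          # e.g. the OpenAI call
        score = call_verifier(answer)        # e.g. the verification call
        if score < threshold:
            # Reject: the caller decides whether to retry or show a fallback.
            return {"status": "rejected", "answer": None, "support_score": score}
        # Annotate: ship the answer together with its score.
        return {"status": "ok", "answer": answer, "support_score": score}
    return handler
```

Because both calls are injected, the same wrapper works in a web handler, a worker queue, or a test harness with stubbed functions.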
2026-03-15 · deep-dive
Hybrid BM25 + vector, tenant-scoped chunks, claim-level extraction, per-claim grounding. The architecture behind 0% measured hallucination on 61 eval tasks.
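Hybrid retrieval needs some way to merge the BM25 ranking with the vector ranking. The post's exact fusion method isn't stated here; reciprocal rank fusion is one common choice and is sketched below purely as an assumption.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists with reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in; documents
    ranked well by both retrievers float to the top of the fused list.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With `k = 60` (the value from the original RRF paper), a chunk that both BM25 and the vector index rank near the top beats a chunk that only one retriever loves.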
2026-03-10 · opinion
The $0.002 LLM call is not the expensive part. Support tickets, legal exposure, and churn from one bad answer dwarf the entire inference bill.
2026-03-05 · tutorial
Claim extraction, source grounding, numerical-mismatch detection, a support_score you can gate on. Wire it into your pipeline in one afternoon.
2026-02-28 · opinion
91% accuracy sounds great until you run a million requests a month. Why "most of the time" is the new "dropped every tenth packet" of AI infrastructure.
2026-02-22 · tutorial
Global BM25 across tenants, naive chunking, no claim grounding, no verification gate, no evals. We've made four of these. Don't repeat them.
Posts ship sporadically — when we have a number to report or a pattern to share. Subscribe via RSS to catch every one. No email capture.
Six adapters, 70 adversarial tests, Wilson 95% CI. Wauldo 96%, LangChain 66%, LangChain+Guard 66%.
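The Wilson 95% CI quoted above is cheap to compute yourself. A minimal sketch; the exact pass counts behind the headline percentages aren't stated here, so 67/70 ≈ 96% is used purely as an illustration.

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half
```

At n = 70 the intervals are wide, which is exactly why the suite reports them: a single headline percentage on 70 tests overstates the precision.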
weekly · reproducible
Eval suite 77%, hard suite 85%, RAG-only accuracy 89%, 0% hallucination. Auto-refreshed every Monday.
product · API
Every answer grounded against sources, support_score on every claim, OpenAI-compatible endpoint, 5ms p50.
Paste an AI answer in our home widget to see support_score live.