“Our AI is 92% accurate.”
I have heard this sentence in every AI product review I have been part of. It is always said with pride. 92% feels like an A-minus. Feels like something you can ship. And so teams ship it, open the champagne, and move on to the next feature.
Nobody does the math. So let me do it for you.
The 92% Problem
92% accuracy at 10,000 queries per month means 800 wrong answers. Every single month. Not hypothetical wrong answers. Real ones. Served to real users who asked real questions and got back responses that looked perfectly correct but were not.
800 users who now believe something false about your product, your policies, your data. 800 decisions made on bad information. 800 moments where your AI sounded confident and authoritative while being completely wrong.
And here is the part that should keep you up at night: almost none of those 800 users will report the error. They will not file a ticket. They will not flag the response. They will just trust it — because it sounded right, because it cited something that looked like a source, because why would they second-guess a system that your company built and deployed?
The silent failure
10,000 queries/month × 8% error rate = 800 wrong answers. Per month. None reported. All trusted. Every one of them a small breach of the implicit promise your product makes: that the answers it gives are correct.
At 100,000 queries, it is 8,000 wrong answers. At a million, 80,000. Scale does not fix this problem. Scale makes it worse. When we tested 14 LLMs for hallucination, even the best models produced wrong answers at rates that would be unacceptable in any other category of software.
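The arithmetic is worth running yourself, if only to watch the numbers scale:

```python
# Wrong answers served per month at a fixed 8% error rate.
ERROR_RATE = 0.08

for queries_per_month in (10_000, 100_000, 1_000_000):
    wrong = round(queries_per_month * ERROR_RATE)
    print(f"{queries_per_month:>9,} queries/month -> {wrong:>6,} wrong answers")
# 10,000 -> 800; 100,000 -> 8,000; 1,000,000 -> 80,000
```

The error rate is a constant; the damage is not.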
We Don't Accept This in Traditional Software
Let me put 92% in context.
92% uptime means 29 days of downtime per year. Your API is offline for an entire month, scattered across the calendar. No SaaS company would survive that. No customer would tolerate it. The industry standard is 99.9% — and even that generates angry tweets.
92% delivery rate for an email provider means 1 in 12 emails vanishes. Your invoices, your password resets, your onboarding sequences — silently dropped. You would switch providers in a week.
92% accuracy for a calculator means it gives you the wrong number roughly once per dozen calculations. You would throw it in the trash.
But somehow, 92% accuracy for AI answers is considered production-ready. We have collectively decided that a system which is wrong 8% of the time is good enough — not because the standard makes sense, but because we do not know how to do better. Or we think we do not.
The double standard is stark: we demand at least three nines for uptime (and brag when we hit four or five), many nines for data integrity, near-perfect email delivery — but barely one nine for AI accuracy? We have set the bar on the floor and congratulated ourselves for stepping over it.
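Translating each availability level into annual downtime is one line of arithmetic, and it makes the comparison concrete:

```python
# Annual downtime implied by each availability level.
HOURS_PER_YEAR = 365 * 24  # 8,760

for availability in (0.92, 0.999, 0.9999, 0.99999):
    down_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.3%} uptime -> {down_hours:8.2f} h/year down "
          f"(~{down_hours / 24:.1f} days)")
```

At 92%, that is roughly 700 hours — the 29 days above. At three nines it drops to under nine hours a year.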
When Wrong Looks Exactly Like Right
A crashed API returns an error code. A failed database query throws an exception. A broken payment flow shows a clear error message. These failures are visible, immediate, and actionable.
Wrong AI answers fail differently. They look exactly like right ones.
The response is fluent. The tone is confident. The structure is professional. If the answer includes citations, they look plausible. If it includes numbers, they are formatted correctly. There is no red banner, no error code, no stack trace. The wrong answer wears the same clothes as the right one.
This is what makes AI errors uniquely dangerous. The user has no signal that something went wrong. In traditional software, failure modes are obvious. In AI, the failure mode is invisible confidence.
- A wrong search result is clearly wrong — the user sees the title and skips it.
- A wrong AI answer is a paragraph of fluent, authoritative text that the user reads, believes, and acts on.
- A database error says “something went wrong.” A hallucination says “here is your answer” — and the answer is fabricated.
This is why the 8% matters more than the 92%. In most software, errors are caught at the boundary. In AI, errors pass through the boundary wearing a disguise. Read more about how LLMs lie in production — the failure patterns are structural, not accidental.
Verification, Not Hope
The instinct when confronted with the 92% problem is to try to improve the model. Fine-tune it. Add more data. Rewrite the prompt. Maybe the next model version will be better.
This is hope disguised as a strategy.
Better models help at the margin. Going from 88% to 92% is real progress. But going from 92% to 99.5% through model improvements alone is not realistic with current architectures. The last few percentage points are exponentially harder, and the hallucination problem is structural — it is how these models work, not a bug to be patched.
The answer is not a better model. The answer is a verification layer for LLM outputs. Every answer checked against its sources. Every claim grounded in retrievable evidence. And when the evidence is not there — when the model is guessing, confabulating, or extrapolating — the answer is blocked. Not served with a disclaimer. Not shown with a warning. Blocked.
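As a deliberately naive sketch of what such a gate looks like: the names below are invented for illustration, and the grounding check is a crude keyword-overlap stand-in for the retrieval-plus-entailment checks a real verification layer would use. The shape is what matters — verify each claim, and when verification fails, block rather than disclaim.

```python
# Illustrative sketch of a verification gate for LLM answers.
# All names here are invented; the grounding check is a toy stand-in.
from dataclasses import dataclass

REFUSAL = "I cannot confirm this from the available sources."

@dataclass
class VerifiedAnswer:
    text: str
    sources: list[str]
    served: bool

def is_grounded(claim: str, sources: list[str]) -> bool:
    """Toy grounding check: every content word appears in some source."""
    words = [w for w in claim.lower().split() if len(w) > 3]
    return all(any(w in src.lower() for src in sources) for w in words)

def verification_gate(answer: str, sources: list[str]) -> VerifiedAnswer:
    # Check each sentence-level claim against the retrieved sources.
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    if claims and all(is_grounded(c, sources) for c in claims):
        return VerifiedAnswer(answer, sources, served=True)
    # Evidence missing: block the answer instead of serving it with a warning.
    return VerifiedAnswer(REFUSAL, sources, served=False)
```

Against a source saying “The refund window is 30 days from purchase,” a restated claim passes the gate; “Refunds are valid for 90 days” gets blocked and the user sees the refusal instead.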
The right default
“I don't know” is always better than a wrong answer. A verified system that says “I cannot confirm this from the available sources” builds more trust than an unverified system that sounds confident 100% of the time and is wrong 8% of the time.
This is not a new idea. We do not let financial software report numbers without an audit trail. We do not let medical systems make recommendations without evidence grounding. We do not let legal tools cite cases without verification. The principle is simple: when the cost of being wrong is high, you verify before you serve.
See the real cost of unverified AI for the dollar amounts. The math is not close. Verification costs pennies per request. A single wrong answer costs orders of magnitude more.
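That break-even reduces to a single expected-value comparison. The dollar figures below are invented for illustration — only the structure of the comparison matters:

```python
# Break-even check: verification pays off when its per-request cost
# is below the expected cost of serving a wrong answer.
error_rate = 0.08             # wrong answers per unverified request
cost_per_wrong_answer = 50.0  # assumed: support, churn, liability ($)
cost_per_verification = 0.02  # assumed: "pennies per request" ($)

expected_loss = error_rate * cost_per_wrong_answer  # $4.00 per request
print(f"Expected loss per unverified request: ${expected_loss:.2f}")
print(f"Verification cost per request:        ${cost_per_verification:.2f}")
print("Verify." if cost_per_verification < expected_loss else "Skip.")
```

Even if the assumed cost of a wrong answer is off by an order of magnitude, the inequality still holds.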
The New Standard for Production AI
The era of “it works most of the time” is ending. Not because someone decided it should, but because the consequences are catching up. Companies are getting sued over AI hallucinations. Users are losing trust in AI-powered features. Regulators are writing rules that assume verification is a baseline, not a feature.
The new standard is straightforward:
- Every answer verified — checked against source documents before it reaches the user.
- Every claim grounded — if it cannot be traced to a source, it does not get served.
- Every source cited — so users and auditors can check for themselves.
- Wrong answers blocked — not flagged, not disclaimed. Blocked. The user sees “I don't have enough information” instead of a fabrication.
This is not about making AI slower or more cautious. It is about making AI trustworthy. A verified pipeline that answers 85% of questions correctly and refuses the other 15% is infinitely more valuable than an unverified pipeline that answers 100% of questions and gets 8% of them wrong.
The difference is that with the first system, you know which answers to trust. With the second, you are guessing — and so are your users.
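Putting that comparison in numbers, at the same 10,000 queries per month:

```python
QUERIES = 10_000

# Verified pipeline: answers 85% of questions, refuses the other 15%,
# and by construction serves no unverified wrong answers.
verified_answered = round(QUERIES * 0.85)  # 8,500 answers you can trust
verified_wrong = 0

# Unverified pipeline: answers everything, and 8% of it is wrong.
unverified_answered = QUERIES
unverified_wrong = round(QUERIES * 0.08)   # 800 wrong answers, unmarked

print(f"Verified:   {verified_answered:,} served, {verified_wrong} wrong")
print(f"Unverified: {unverified_answered:,} served, {unverified_wrong:,} wrong")
```

The refused 15% is visible and actionable; the 800 wrong answers are invisible until they do damage.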
Start here: Get verified AI answers in 5 minutes with a step-by-step setup guide. Or see how verified RAG compares to prompt engineering, fine-tuning, and other approaches. The gap is larger than you think.