You are using OpenAI's API. GPT-4 or GPT-4.1, maybe GPT-4o for speed. It works remarkably well — until it does not. A customer asks a question grounded in your data, and the model confidently returns a number that is off by 40%, cites a source that does not exist, or inverts a policy clause. The response looks perfect. Your user trusts it. And you have just shipped a wrong answer to production.
This is not a rare edge case. Research shows GPT-4 hallucinates on 6-8% of grounded tasks — questions where the correct answer exists in the provided context. That means for every 100 answers your OpenAI-powered feature returns, 6 to 8 are wrong. Not vaguely wrong. Confidently, specifically wrong.
Yes, GPT-4 Hallucinates Too
There is a persistent myth that GPT-4 "solved" hallucination. It did not. It reduced the rate compared to GPT-3.5, but the failure modes are the same — and in some ways more dangerous, because GPT-4's fluency makes wrong answers harder to spot.
The most common hallucination types in production:
- Numerical errors — The source says "14-day return policy" and GPT-4 responds with "30-day return policy." The sentence structure is perfect. The number is fabricated.
- Fabricated sources — GPT-4 cites "Section 3.2 of the agreement" when the document only has two sections. It invents references that sound plausible but do not exist.
- Confident wrong claims — Asked whether a product supports a feature, GPT-4 says yes with detailed instructions — for a feature that was deprecated two versions ago.
- Subtle inversions — The source says "not covered under warranty" and GPT-4 rephrases it as "covered under the standard warranty." One missing word, completely opposite meaning.
Real-world impact
A support chatbot powered by GPT-4 tells a customer their subscription includes a feature it does not. The customer buys. They discover the truth. Now you have a refund, a support ticket, and a one-star review — all from a single hallucinated sentence.
OpenAI's API Has No Built-in Verification
Here is what OpenAI's chat completion API returns: the generated text, token usage, and a finish reason. That is it. No confidence score. No verification flag. No indication of whether the answer is grounded in the context you provided.
```python
# What OpenAI gives you
{
  "choices": [{
    "message": { "content": "The return policy is 30 days..." },
    "finish_reason": "stop"
  }],
  "usage": { "total_tokens": 142 }
}

# What OpenAI does NOT give you
# - confidence score
# - verification against source
# - grounding flag
# - hallucination probability
```
OpenAI assumes you will handle verification yourself. Most teams do not. They pipe the response directly to the user and hope for the best. This works until it does not — and when it fails, the cost falls on you, not on OpenAI. Understanding how LLMs lie in production is the first step toward fixing this gap.
Adding Guard to Your OpenAI Pipeline
The fix is a verification layer between OpenAI's response and your user. The pattern is simple: call OpenAI as usual, then verify the response against the source material before serving it.
The pattern
openai.chat.completions.create() → get answer → wauldo.guard() → check verdict → serve or handle gracefully. One extra API call. Zero infrastructure changes. Every answer verified before it reaches a user.
The Wauldo Guard hallucination firewall sits between your LLM and your users. It takes the claim (what the LLM said) and the source (what it should have said), and returns a verdict: verified, weak, or rejected. Along with a confidence score and the specific reason for the verdict.
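That verdict structure lends itself to a tiny routing helper. Here is a minimal sketch, assuming a plain-dict view of the response with `verdict`, `confidence`, and `reason` keys; treat the exact field names as an approximation of the real schema:

```python
# Illustrative shape of a Guard verdict, based on the fields described
# above (verdict, confidence, reason). Field names are assumptions,
# not the authoritative schema.
example_verdict = {
    "verdict": "rejected",           # one of: "verified", "weak", "rejected"
    "confidence": 0.12,              # score backing the verdict
    "reason": "numerical_mismatch",  # machine-readable cause of rejection
}

def should_serve(result: dict) -> bool:
    """Serve only answers Guard did not reject."""
    return result["verdict"] in ("verified", "weak")
```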
Python Example: Verify Before Returning
Here is a complete working example. You keep your existing OpenAI code exactly as-is — just add the verification step before returning the answer to the user.
```python
from openai import OpenAI
from wauldo import Wauldo

openai_client = OpenAI(api_key="sk-...")
wauldo_client = Wauldo(api_key="your-wauldo-key")

def answer_question(question, source_doc):
    # Step 1: Get answer from OpenAI (your existing code)
    completion = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer based on the source."},
            {"role": "user", "content": f"{question}\n\nSource: {source_doc}"},
        ],
    )
    answer = completion.choices[0].message.content

    # Step 2: Verify before serving
    result = wauldo_client.guard(claim=answer, source=source_doc)

    # Step 3: Act on the verdict
    if result.verdict == "verified":
        return {"answer": answer, "confidence": result.confidence}
    elif result.verdict == "weak":
        return {"answer": answer, "warning": "Low confidence"}
    else:
        # Rejected — do not serve this answer
        return {"answer": "I could not verify this. Please contact support."}
```
The same pattern works in TypeScript, Rust, or any language that can make HTTP calls. See the Guard API reference for the full request and response schema.
```typescript
import OpenAI from "openai";
import { Wauldo } from "wauldo";

const openai = new OpenAI();
const wauldo = new Wauldo({ apiKey: "your-wauldo-key" });

const completion = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [{ role: "user", content: question }],
});
const answer = completion.choices[0].message.content;

const result = await wauldo.guard({ claim: answer, source: doc });
if (result.verdict === "rejected") {
  // Don't serve — hallucination detected
  return fallbackResponse();
}
```
What Guard Catches That OpenAI Doesn't
Guard is not a wrapper around GPT-4. It is a purpose-built verification engine that checks specific failure modes LLMs cannot self-detect. Here is what it catches:
- Numerical mismatches — Source says "$500/month" and the LLM says "$50/month." Guard detects the numerical discrepancy and returns rejected with reason numerical_mismatch.
- Unsupported claims — The LLM adds information not present in the source. Guard checks every claim against the source text and flags those with insufficient evidence.
- Phantom citations — The LLM references "Section 4.1" but the source has no such section. Guard's citation validator detects phantom references.
- Negation conflicts — Source says "does not include" and LLM says "includes." Guard detects the negation inversion and blocks the response.
OpenAI catches none of these. The API has no mechanism to compare the output against the input context. That is not a criticism — it is a design choice. OpenAI builds the generation layer. Verification is a separate concern, and it requires a separate system. Learn how to set up automated fact-checking for LLM outputs in your pipeline.
How it works under the hood: Guard uses hybrid verification — token overlap for fast lexical checks, plus BGE embeddings for semantic matching. It runs contradiction detectors for numbers, negations, and opposition patterns. No LLM in the loop. Deterministic, sub-second, and consistent. See the full comparison of verification approaches.
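To make the deterministic idea concrete, here is an illustrative sketch of a numeric-contradiction check in the spirit of the detectors described above. This is not Wauldo's implementation, just the core pattern: extract the numbers from the claim and the source, and flag any claim number the source never mentions.

```python
import re

# Matches integers and decimals, e.g. "14", "3.5", "500".
NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text: str) -> set:
    """Extract the set of number literals appearing in a text."""
    return set(NUM_RE.findall(text))

def numeric_mismatch(claim: str, source: str) -> bool:
    """True if the claim contains a number the source does not."""
    return bool(numbers_in(claim) - numbers_in(source))
```

No model call is involved, which is why a check like this stays deterministic and sub-second: the same claim and source always yield the same verdict.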
Works With Any LLM Provider
The pattern above is not OpenAI-specific. Guard verifies outputs, not providers. The same three-line integration works with any LLM:
- Anthropic Claude — Same flow. Call Claude, verify with Guard, serve or reject.
- Google Gemini — Same flow. Gemini's grounding API is limited to Google Search. Guard works with your own documents.
- Meta Llama — Running Llama locally or via an API? Guard works the same way. Self-hosted models need verification even more, since they lack the RLHF tuning of commercial models.
- Mistral, Cohere, any provider — If it generates text, Guard can verify it.
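Provider-agnosticism falls out naturally if the model call is injected as a plain function. A minimal sketch, where `generate` and `guard` are hypothetical stand-ins for any LLM client and the Guard call respectively (for brevity it serves only "verified" answers, unlike the fuller example above):

```python
from typing import Callable

def verified_answer(
    generate: Callable[[str], str],        # any LLM: prompt -> text
    guard: Callable[[str, str], str],      # (claim, source) -> verdict string
    question: str,
    source: str,
    fallback: str = "I could not verify this. Please contact support.",
) -> str:
    """Generate an answer with any provider, then serve it only if verified."""
    answer = generate(f"{question}\n\nSource: {source}")
    verdict = guard(answer, source)
    return answer if verdict == "verified" else fallback
```

Swapping GPT-4 for Claude or a local Llama means changing only the `generate` argument; the verification path is untouched.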
This is the key advantage of a provider-agnostic verification layer. You can switch LLMs, run A/B tests between models, or use different providers for different use cases — and your verification layer stays the same. Read about LangChain hallucination prevention if you are using an orchestration framework.
Get started in 5 minutes
pip install wauldo or npm install wauldo. Get an API key (free tier: 300 requests/month). Add the three lines above to your existing code. Every OpenAI response verified before it reaches your users.