You shipped an LLM feature. Users are asking questions and getting answers. Everything looks great in the demo. But here is the question nobody wants to ask: how do you know the outputs are correct? You don't — unless you check. And checking manually does not scale past the first ten queries.
This guide walks through three approaches to automated fact-checking, from simple token overlap to full hybrid verification. By the end, you will have a working pipeline that catches hallucinations before they reach your users. If you want to understand why this matters in production, read about LLMs lying in production first.
Why You Need Post-Generation Fact-Checking
The instinct is to prevent hallucinations at the prompt level: "Only answer from the provided context. If you don't know, say so." This works some of the time. It does not work reliably. LLMs are trained to produce fluent, confident-sounding text. When the retrieved context is ambiguous or partially relevant, the model will fill gaps with plausible-sounding fabrications. It is not being malicious — it is doing exactly what it was trained to do.
Prompt engineering reduces hallucination rates. It does not eliminate them. A well-crafted system prompt might get you from a 15% hallucination rate down to 5%. That remaining 5% is the problem, because those are the confident-sounding wrong answers that users trust the most. This is the core insight behind how our zero-hallucination pipeline works — prompts are a first line of defense, not the last.
Post-generation verification treats the LLM output as an untrusted claim and checks it against the source material. It is the difference between hoping the model got it right and knowing whether it did.
Three Approaches to Verification
There are three main ways to verify LLM outputs against source documents, each with different tradeoffs:
- Token overlap — Fast, free, catches exact mismatches. Misses paraphrases.
- Semantic similarity — Catches paraphrases, handles rephrased claims. Requires embedding model.
- Hybrid — Combines both. Best accuracy. This is what production systems use.
The right choice depends on your latency budget, accuracy requirements, and whether you are willing to run an embedding model. Let's walk through each one.
Approach 1: Token Overlap (Fast, Free)
Token overlap is the simplest verification method. Tokenize the claim and the source, compute the ratio of shared tokens, and threshold it. If the overlap is above 0.7, the claim is likely grounded. Below 0.4, it is likely fabricated.
This sounds naive, but it catches more than you'd expect. Its real strength is numerical mismatch detection. When the source says "14 days" and the LLM output says "60 days," token overlap catches this instantly because the numbers simply don't match. No embedding model needed. Sub-millisecond latency.
```python
def token_overlap(claim: str, source: str) -> float:
    claim_tokens = set(claim.lower().split())
    source_tokens = set(source.lower().split())
    if not claim_tokens:
        return 0.0
    overlap = claim_tokens & source_tokens
    return len(overlap) / len(claim_tokens)

# "The refund period is 60 days" vs source "refund period is 14 days"
# overlap = 4/6 ≈ 0.67 → below the 0.7 threshold → flagged for review
```
Where token overlap fails: paraphrases. If the source says "subscribers can cancel at any time" and the LLM writes "users are free to terminate their subscription whenever they choose," the overlap is low despite the claim being correct. For this, you need semantics.
Approach 2: Semantic Similarity (Catches Paraphrases)
Semantic verification uses embedding models — typically BGE, E5, or similar — to compare the meaning of the claim against the source text. You embed both, compute cosine similarity, and threshold it. A score above 0.8 means the claim is semantically grounded. Below 0.5, it is likely unsupported.
This catches the paraphrase problem that token overlap misses. "Cancel at any time" and "terminate whenever they choose" will have high cosine similarity because they mean the same thing. It also handles synonyms, different phrasings of the same fact, and multilingual content where the claim might be in a different language than the source.
The tradeoff is cost and latency. You need an embedding model running (BGE-small is a good choice at ~130MB). Each verification call takes 50–500ms depending on your hardware, compared to sub-millisecond for token overlap. For high-throughput APIs, this adds up.
The subtler problem: semantic similarity can be too forgiving. Two sentences can be semantically similar while disagreeing on a specific number. "The fee is 2%" and "the fee is 5%" are semantically close — both discuss a fee percentage — but factually different. This is why you need both approaches.
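The semantic check described above reduces to embedding both texts and comparing cosine similarity against the two thresholds. Here is a minimal sketch of that logic. The `embed` function below is a toy bag-of-words stand-in, not a real embedding model — in production you would swap in BGE or E5 sentence embeddings — but the cosine computation and threshold handling are what carry over.

```python
import math
from collections import Counter

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def embed(text: str) -> dict:
    """Toy bag-of-words 'embedding'. A stand-in for a real model
    (BGE, E5, ...) that would actually capture paraphrases."""
    return dict(Counter(text.lower().split()))

def semantically_grounded(claim: str, source: str,
                          accept: float = 0.8, reject: float = 0.5) -> str:
    """Threshold cosine similarity: >= 0.8 grounded, < 0.5 unsupported."""
    score = cosine_similarity(embed(claim), embed(source))
    if score >= accept:
        return "verified"
    if score < reject:
        return "rejected"
    return "uncertain"
```

With a real embedding model, the paraphrase pair from earlier ("cancel at any time" vs. "terminate whenever they choose") would score high despite sharing almost no tokens; the bag-of-words stand-in cannot show that, which is exactly why it should not ship.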
Approach 3: Hybrid Verification (Best of Both)
Hybrid verification runs both token overlap and semantic similarity, then combines the signals. Token overlap catches numerical mismatches and exact contradictions. Semantic similarity catches paraphrases and rephrased facts. Together, they cover each other's blind spots.
This is what the Wauldo Guard verification API uses in production. The pipeline works in two stages:
- Stage 1 (token overlap) — Fast pass. If confidence is above 0.7, mark as verified. If below 0.4, mark as rejected. Check for numerical mismatches and negation conflicts regardless of overlap score.
- Stage 2 (semantic) — Only runs for claims in the 0.4–0.7 gray zone. Embeds the claim and source with BGE, computes cosine similarity, and makes a final determination.
This two-stage approach keeps latency low for clear-cut cases (most claims are either obviously grounded or obviously fabricated) while using the more expensive semantic check only when it matters. In practice, about 70% of claims resolve at stage 1.
Key insight: Numerical mismatch detectors run as overrides in both stages. Even if the semantic similarity between "60 days" and "14 days" is high (both discuss a time period), the numerical mismatch detector catches the contradiction and forces a rejection. See 5 common RAG mistakes for more on why source attribution matters alongside verification.
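The two-stage flow with the numerical-mismatch override can be sketched as follows. This is an illustrative reimplementation using the thresholds quoted above, not Wauldo's actual production code; the `semantic_check` callback stands in for the BGE-based stage 2.

```python
import re

def extract_numbers(text: str) -> set:
    """Pull out numeric literals (integers and decimals)."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def token_overlap(claim: str, source: str) -> float:
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & set(source.lower().split())) / len(claim_tokens)

def verify(claim: str, source: str, semantic_check=None) -> str:
    # Override: a number in the claim that never appears in the source
    # is a contradiction, no matter how similar the text looks overall.
    if extract_numbers(claim) - extract_numbers(source):
        return "rejected"  # numerical_mismatch

    # Stage 1: fast token-overlap pass for clear-cut cases.
    score = token_overlap(claim, source)
    if score > 0.7:
        return "verified"
    if score < 0.4:
        return "rejected"

    # Stage 2: gray zone (0.4-0.7) falls through to the expensive
    # semantic check, if one was provided.
    if semantic_check is not None:
        return "verified" if semantic_check(claim, source) >= 0.8 else "rejected"
    return "uncertain"
```

Because the numerical override runs before either stage, "The refund period is 60 days" is rejected against a "14 days" source immediately, without spending an embedding call on it.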
Adding Guard to Your Pipeline
The fastest way to add fact-checking is with the Wauldo Python SDK. Install it, pass a claim and a source, and get back a verdict with a confidence score and reason.
```shell
pip install wauldo
```
```python
from wauldo import Wauldo

client = Wauldo(api_key="your-api-key")

result = client.guard(
    claim="The refund period is 60 days.",
    source="Our refund policy allows returns within 14 days of purchase."
)

print(result.verdict)     # "rejected"
print(result.confidence)  # 0.3
print(result.reason)      # "numerical_mismatch"
print(result.supported)   # False
```
The "60 days vs 14 days" example is a classic case. The claim sounds plausible. A human reviewer might skim past it. But the numbers don't match, and Guard catches it automatically. The verdict comes back as rejected with numerical_mismatch as the reason.
You can integrate this into any LLM pipeline. Generate the answer, extract the claims, verify each one, and block or flag responses that fail. For the full fact-check API reference, see the documentation. To understand how Wauldo compares to basic RAG, the comparison page breaks down each layer.
```python
# After generating an LLM response
claims = extract_claims(llm_response)

for claim in claims:
    result = client.guard(claim=claim, source=source_text)
    if not result.supported:
        # Flag, rewrite, or block the response
        print(f"Unverified claim: {claim}")
        print(f"Reason: {result.reason}")
```
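The `extract_claims` helper above is whatever claim-splitting step fits your pipeline; it is not part of the SDK call shown. A naive version splits the response into sentences and keeps the ones that look factual — this is a deliberately simple sketch (a real pipeline would use a proper sentence splitter and a claim-detection model):

```python
import re

def extract_claims(response: str) -> list:
    """Naive claim extraction: split into sentences and keep the ones
    that contain a digit or a capitalized word after the first character
    (a rough proxy for a named entity)."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    claims = []
    for s in sentences:
        if not s:
            continue
        has_number = bool(re.search(r"\d", s))
        # Capitalized word that is not just the sentence-initial letter
        has_entity = bool(re.search(r"\b[A-Z][a-z]+", s[1:]))
        if has_number or has_entity:
            claims.append(s)
    return claims
```

This filter already drops filler like "Sure, I can help with that." before it ever reaches the verifier, which matters for cost: every sentence you skip is an API call you don't make.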
What to Verify and What to Skip
Not every sentence in an LLM output needs fact-checking. Verifying everything wastes compute and introduces false positives. The goal is to check the claims that matter and skip the ones that don't.
Verify these:
- Factual claims with numbers — prices, dates, durations, percentages, quantities. These are the most common source of hallucination and the easiest to catch.
- Specific named entities — company names, product names, people. If the LLM says "according to Section 4.2" and the document has no Section 4.2, that is a phantom citation.
- Comparative claims — "faster than," "more expensive than," "unlike." These require cross-referencing multiple sources and are prone to fabrication.
- Negation claims — "does not support," "never expires," "no limit." Negation is hard for LLMs. A source saying "expires after 12 months" can get distorted into "never expires."
Skip these:
- Greetings and filler — "Sure, I can help with that." Not a factual claim.
- Opinions and subjective statements — "This is a good approach." No ground truth to check against.
- Direct quotes from the source — If the LLM is quoting verbatim from the retrieved context, the overlap will be near 1.0 by definition. Checking these wastes cycles.
- Hedged statements — "It might be around 14 days." Hedged language indicates the model is already uncertain. Flag it differently than a confident wrong claim.
Rule of thumb: If the sentence contains a number, a name, or a negation, verify it. If it is filler or opinion, skip it. This simple heuristic catches 90% of the hallucinations that matter while keeping your verification costs low.
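The rule of thumb above fits in a few lines of code. The negation word list here is an illustrative assumption — tune it to your domain:

```python
import re

# Illustrative negation cues; extend for your domain.
NEGATION_WORDS = {"not", "no", "never", "none", "cannot", "n't", "without"}

def should_verify(sentence: str) -> bool:
    """Verify if the sentence contains a number, a capitalized word
    past the first character (rough entity proxy), or a negation cue."""
    if re.search(r"\d", sentence):
        return True
    if re.search(r"\b[A-Z][a-z]+", sentence[1:]):
        return True
    # Split contractions so "doesn't" surfaces the "n't" token.
    words = set(sentence.lower().replace("n't", " n't ").split())
    return bool(words & NEGATION_WORDS)
```

Run this as a pre-filter before the verifier: sentences that fail it are filler or opinion and can skip verification entirely.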