You shipped an LLM feature. Users are asking questions and getting answers. Everything looks great in the demo. But here is the question nobody wants to ask: how do you know the outputs are correct? You don't — unless you check. And checking manually does not scale past the first ten queries.
This guide walks through three approaches to automated fact-checking, from simple token overlap to full hybrid verification. By the end, you will have a working pipeline that catches hallucinations before they reach your users. If you want to understand why this matters in production, read about LLMs lying in production first.
Why You Need Post-Generation Fact-Checking
The instinct is to prevent hallucinations at the prompt level: "Only answer from the provided context. If you don't know, say so." This works some of the time. It does not work reliably. LLMs are trained to produce fluent, confident-sounding text. When the retrieved context is ambiguous or partially relevant, the model will fill gaps with plausible-sounding fabrications. It is not being malicious — it is doing exactly what it was trained to do.
Prompt engineering reduces hallucination rates. It does not eliminate them. A well-crafted system prompt might get you from a 15% hallucination rate down to 5%. That remaining 5% is the problem, because those are the confident-sounding wrong answers that users trust the most. This is the core insight behind how our zero-hallucination pipeline works — prompts are a first line of defense, not the last.
Post-generation verification treats the LLM output as an untrusted claim and checks it against the source material. It is the difference between hoping the model got it right and knowing whether it did.
Three Approaches to Verification
There are three main ways to verify LLM outputs against source documents, each with different tradeoffs:
- Token overlap — Fast, free, catches exact mismatches. Misses paraphrases.
- Semantic similarity — Catches paraphrases, handles rephrased claims. Requires embedding model.
- Hybrid — Combines both. Best accuracy. This is what production systems use.
The right choice depends on your latency budget, accuracy requirements, and whether you are willing to run an embedding model. Let's walk through each one.
Approach 1: Token Overlap (Fast, Free)
Token overlap is the simplest verification method. Tokenize the claim and the source, compute the ratio of shared tokens, and threshold it. If the overlap is above 0.7, the claim is likely grounded. Below 0.4, it is likely fabricated.
This sounds naive, but it catches more than you'd expect. Its real strength is numerical mismatch detection. When the source says "14 days" and the LLM output says "60 days," token overlap catches this instantly because the numbers simply don't match. No embedding model needed. Sub-millisecond latency.
```python
def token_overlap(claim: str, source: str) -> float:
    claim_tokens = set(claim.lower().split())
    source_tokens = set(source.lower().split())
    if not claim_tokens:
        return 0.0
    overlap = claim_tokens & source_tokens
    return len(overlap) / len(claim_tokens)

# "The refund period is 60 days" vs source "refund period is 14 days"
# overlap = 4/6 ≈ 0.67 → below the 0.7 threshold → flagged for review
```
Where token overlap fails: paraphrases. If the source says "subscribers can cancel at any time" and the LLM writes "users are free to terminate their subscription whenever they choose," the overlap is low despite the claim being correct. For this, you need semantics.
Approach 2: Semantic Similarity (Catches Paraphrases)
Semantic verification uses embedding models — typically BGE, E5, or similar — to compare the meaning of the claim against the source text. You embed both, compute cosine similarity, and threshold it. A score above 0.8 means the claim is semantically grounded. Below 0.5, it is likely unsupported.
This catches the paraphrase problem that token overlap misses. "Cancel at any time" and "terminate whenever they choose" will have high cosine similarity because they mean the same thing. It also handles synonyms, different phrasings of the same fact, and multilingual content where the claim might be in a different language than the source.
The tradeoff is cost and latency. You need an embedding model running (BGE-small is a good choice at ~130MB). Each verification call takes 50–500ms depending on your hardware, compared to sub-millisecond for token overlap. For high-throughput APIs, this adds up.
The subtler problem: semantic similarity can be too forgiving. Two sentences can be semantically similar while disagreeing on a specific number. "The fee is 2%" and "the fee is 5%" are semantically close — both discuss a fee percentage — but factually different. This is why you need both approaches.
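The semantic check described above reduces to embedding both texts and comparing cosine similarity against the two thresholds. Here is a minimal sketch of that logic. The `embed` function below is a toy bag-of-words stand-in, not a real embedding model — in production you would swap in BGE or E5 sentence embeddings — but the cosine computation and threshold handling are what carry over.

```python
import math
from collections import Counter

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def embed(text: str) -> dict:
    """Toy bag-of-words 'embedding'. A stand-in for a real model
    (BGE, E5, ...) that would actually capture paraphrases."""
    return dict(Counter(text.lower().split()))

def semantically_grounded(claim: str, source: str,
                          accept: float = 0.8, reject: float = 0.5) -> str:
    """Threshold cosine similarity: >= 0.8 grounded, < 0.5 unsupported."""
    score = cosine_similarity(embed(claim), embed(source))
    if score >= accept:
        return "verified"
    if score < reject:
        return "rejected"
    return "uncertain"
```

With a real embedding model, the paraphrase pair from earlier ("cancel at any time" vs. "terminate whenever they choose") would score high despite sharing almost no tokens; the bag-of-words stand-in cannot show that, which is exactly why it should not ship.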
Approach 3: Hybrid Verification (Best of Both)
Hybrid verification runs both token overlap and semantic similarity, then combines the signals. Token overlap catches numerical mismatches and exact contradictions. Semantic similarity catches paraphrases and rephrased facts. Together, they cover each other's blind spots.
This is what the Wauldo Guard verification API uses in production. The pipeline works in two stages:
- Stage 1 (token overlap) — Fast pass. If confidence is above 0.7, mark as verified. If below 0.4, mark as rejected. Check for numerical mismatches and negation conflicts regardless of overlap score.
- Stage 2 (semantic) — Only runs for claims in the 0.4–0.7 gray zone. Embeds the claim and source with BGE, computes cosine similarity, and makes a final determination.
This two-stage approach keeps latency low for clear-cut cases (most claims are either obviously grounded or obviously fabricated) while using the more expensive semantic check only when it matters. In practice, about 70% of claims resolve at stage 1.
Key insight: Numerical mismatch detectors run as overrides in both stages. Even if the semantic similarity between "60 days" and "14 days" is high (both discuss a time period), the numerical mismatch detector catches the contradiction and forces a rejection. See 5 common RAG mistakes for more on why source attribution matters alongside verification.
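The two-stage flow with the numerical-mismatch override can be sketched as follows. This is an illustrative reimplementation using the thresholds quoted above, not Wauldo's actual production code; the `semantic_check` callback stands in for the BGE-based stage 2.

```python
import re

def extract_numbers(text: str) -> set:
    """Pull out numeric literals (integers and decimals)."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def token_overlap(claim: str, source: str) -> float:
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & set(source.lower().split())) / len(claim_tokens)

def verify(claim: str, source: str, semantic_check=None) -> str:
    # Override: a number in the claim that never appears in the source
    # is a contradiction, no matter how similar the text looks overall.
    if extract_numbers(claim) - extract_numbers(source):
        return "rejected"  # numerical_mismatch

    # Stage 1: fast token-overlap pass for clear-cut cases.
    score = token_overlap(claim, source)
    if score > 0.7:
        return "verified"
    if score < 0.4:
        return "rejected"

    # Stage 2: gray zone (0.4-0.7) falls through to the expensive
    # semantic check, if one was provided.
    if semantic_check is not None:
        return "verified" if semantic_check(claim, source) >= 0.8 else "rejected"
    return "uncertain"
```

Because the numerical override runs before either stage, "The refund period is 60 days" is rejected against a "14 days" source immediately, without spending an embedding call on it.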
Adding Guard to Your Pipeline
The fastest way to add fact-checking is with the Wauldo Python SDK. Install it, pass a claim and a source, and get back a verdict with a confidence score and reason.
```shell
pip install wauldo
```
```python
from wauldo import Wauldo

client = Wauldo(api_key="your-api-key")

result = client.guard(
    claim="The refund period is 60 days.",
    source="Our refund policy allows returns within 14 days of purchase."
)

print(result.verdict)     # "rejected"
print(result.confidence)  # 0.3
print(result.reason)      # "numerical_mismatch"
print(result.supported)   # False
```
The "60 days vs 14 days" example is a classic case. The claim sounds plausible. A human reviewer might skim past it. But the numbers don't match, and Guard catches it automatically. The verdict comes back as rejected with numerical_mismatch as the reason.
You can integrate this into any LLM pipeline. Generate the answer, extract the claims, verify each one, and block or flag responses that fail. For the full fact-check API reference, see the documentation. To understand how Wauldo compares to basic RAG, the comparison page breaks down each layer.
```python
# After generating an LLM response
claims = extract_claims(llm_response)

for claim in claims:
    result = client.guard(claim=claim, source=source_text)
    if not result.supported:
        # Flag, rewrite, or block the response
        print(f"Unverified claim: {claim}")
        print(f"Reason: {result.reason}")
```
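The `extract_claims` helper above is whatever claim-splitting step fits your pipeline; it is not part of the SDK call shown. A naive version splits the response into sentences and keeps the ones that look factual — this is a deliberately simple sketch (a real pipeline would use a proper sentence splitter and a claim-detection model):

```python
import re

def extract_claims(response: str) -> list:
    """Naive claim extraction: split into sentences and keep the ones
    that contain a digit or a capitalized word after the first character
    (a rough proxy for a named entity)."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    claims = []
    for s in sentences:
        if not s:
            continue
        has_number = bool(re.search(r"\d", s))
        # Capitalized word that is not just the sentence-initial letter
        has_entity = bool(re.search(r"\b[A-Z][a-z]+", s[1:]))
        if has_number or has_entity:
            claims.append(s)
    return claims
```

This filter already drops filler like "Sure, I can help with that." before it ever reaches the verifier, which matters for cost: every sentence you skip is an API call you don't make.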
What to Verify and What to Skip
Not every sentence in an LLM output needs fact-checking. Verifying everything wastes compute and introduces false positives. The goal is to check the claims that matter and skip the ones that don't.
Verify these:
- Factual claims with numbers — prices, dates, durations, percentages, quantities. These are the most common source of hallucination and the easiest to catch.
- Specific named entities — company names, product names, people. If the LLM says "according to Section 4.2" and the document has no Section 4.2, that is a phantom citation.
- Comparative claims — "faster than," "more expensive than," "unlike." These require cross-referencing multiple sources and are prone to fabrication.
- Negation claims — "does not support," "never expires," "no limit." Negation is hard for LLMs. A source saying "expires after 12 months" can get distorted into "never expires."
Skip these:
- Greetings and filler — "Sure, I can help with that." Not a factual claim.
- Opinions and subjective statements — "This is a good approach." No ground truth to check against.
- Direct quotes from the source — If the LLM is quoting verbatim from the retrieved context, the overlap will be near 1.0 by definition. Checking these wastes cycles.
- Hedged statements — "It might be around 14 days." Hedged language indicates the model is already uncertain. Flag it differently than a confident wrong claim.
Rule of thumb: If the sentence contains a number, a name, or a negation, verify it. If it is filler or opinion, skip it. This simple heuristic catches 90% of the hallucinations that matter while keeping your verification costs low.
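The rule of thumb above fits in a few lines of code. The negation word list here is an illustrative assumption — tune it to your domain:

```python
import re

# Illustrative negation cues; extend for your domain.
NEGATION_WORDS = {"not", "no", "never", "none", "cannot", "n't", "without"}

def should_verify(sentence: str) -> bool:
    """Verify if the sentence contains a number, a capitalized word
    past the first character (rough entity proxy), or a negation cue."""
    if re.search(r"\d", sentence):
        return True
    if re.search(r"\b[A-Z][a-z]+", sentence[1:]):
        return True
    # Split contractions so "doesn't" surfaces the "n't" token.
    words = set(sentence.lower().replace("n't", " n't ").split())
    return bool(words & NEGATION_WORDS)
```

Run this as a pre-filter before the verifier: sentences that fail it are filler or opinion and can skip verification entirely.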