Every run. Every failure. In public.

Verified AI,
in numbers

Most AI companies ship glossy charts. We publish the raw JSON. Reproduce every number from this page with one command.

All runs committed to git

Hallucination rate · anti-hallucination suite
Adversarial · 70 injection / contradiction tests
Eval accuracy · 61 RAG + reasoning tasks
Factual (adv) · clean-source factual tests

Four charts. Zero smoothing.

The only line that should be flat is hallucination. Everything else is a story of the past two months.

Hallucination rate

Lower is better · anti-hallucination suite

target 0%

Adversarial pass rate

Injection · OOS · contradiction · multilingual

≥ 81%

Eval accuracy

61 tasks · RAG retrieval + reasoning + tools

≥ 77%

Latency (ms)

Eval suite · fast-path 26% · P50

trend ↓

How we measure

Three public suites. Same prompts every run. Failures stay in git forever.

Eval suite

61 tasks

RAG retrieval, cross-doc, reasoning, tool-use, anti-hallucination, negative cases. Every eval run compares against a baseline snapshot — regressions block the merge.

Hard suite

20 tasks

5 adversarial + 5 multi-hop + 10 RAG adversarial. Includes out-of-scope, entity confusion, negation, temporal, cross-doc contradiction, verbose distractors, false premises.

Task adversarial

70 tasks

10 factual · 15 out-of-scope · 25 injection (5 types) · 10 contradiction · 10 semantic + multilingual. Measures ISR, NOT_FOUND accuracy, conflict detection.
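The baseline comparison that gates eval-suite merges can be sketched in a few lines of shell. This is an illustrative sketch, not the repo's actual script; `baseline` and `current` stand in for pass counts read from the JSON snapshots.

```shell
# Illustrative regression gate (not the repo's actual script).
# baseline / current stand in for pass counts read from snapshots.
baseline=58
current=59

if [ "$current" -lt "$baseline" ]; then
  # Fewer passes than the baseline snapshot: fail the job.
  echo "regression: $current < $baseline" >&2
  exit 1
fi
echo "ok: $current >= $baseline"
```

In CI, the non-zero exit status is what blocks the merge.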

Run it yourself

One clone, one cargo run. No API key required. Point it at your own API, or ours.

reproduce.sh
# Clone the repo
git clone https://github.com/Benmebrouk/agentagentique.git
cd agentagentique

# Eval suite — 61 tasks, general quality
cargo run -p benchmarks --bin quality_bench -- --suite eval

# Hard suite — 20 tasks, adversarial RAG + multi-hop
cargo run -p benchmarks --bin quality_bench -- --suite hard

# Task adversarial — 70 tasks, injection / OOS / contradiction
cargo run -p benchmarks --bin task_adversarial -- \
    --url https://api.wauldo.com

# All results land in benchmarks/results/history/ as JSON
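Once a run finishes, the history directory can be inspected with standard tools. The snippet below is a sketch with a made-up schema (`passed` and `total` are illustrative field names, not the repo's actual format): it writes a stand-in result file, then pulls the newest run's pass count.

```shell
# Sketch only: this schema is illustrative, not the repo's actual format.
mkdir -p benchmarks/results/history
printf '{"suite":"eval","passed":59,"total":61}\n' \
    > benchmarks/results/history/2024-01-01-eval.json

# Newest result file wins; grep pulls one field without needing jq.
latest=$(ls -t benchmarks/results/history/*.json | head -n 1)
grep -o '"passed":[0-9]*' "$latest"   # prints "passed":59
```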

Want to compare models? Model Arena runs the same suite across 14 LLMs and reports hallucination rate per model.

Numbers you can verify
beat numbers you can't.

Wauldo is the only RAG API with a trust_score on every answer. Start with the free tier — same pipeline, same benchmarks.