Benchmarks are vanity metrics
In early 2025, OpenAI's o1 scored 8.8% on Humanity's Last Exam, a benchmark designed by nearly 1,000 subject-matter experts to be the hardest test ever given to an AI. It was supposed to take years to crack. By April 2026, Claude Opus 4.6 and Gemini 3.1 Pro are clearing 50%. GPT-5.4 sits at 41.6%. The test that was built to humble AI got humbled instead. This pattern should be familiar. The same arc repeats with MMLU, GPQA, and SWE-bench: introduced with fanfare, saturated within months, replaced by something harder. On SWE-bench Verified, AI systems went from solving 4.4% of real-world coding problems in 2023 to near-100% by 2025. The numbers are staggering. But here's the uncomfortable question: if AI keeps acing every test we throw at it, why does trust keep falling?
The trust gap
A March 2026 Quinnipiac poll found that 76% of Americans say they trust AI "rarely" or "only sometimes." Just 21% trust it most of the time. And the trend is moving in the wrong direction: in 2021, 37% of Americans were more concerned than excited about AI. By mid-2025, that number hit 50%. A separate survey from Ohio State's Wexner Medical Center showed that openness to AI in healthcare dropped from 52% to 42% between 2024 and 2026. Models are getting smarter by every measure we have. People are getting less comfortable. Something doesn't add up, unless we're measuring the wrong thing.
The startup vanity metric playbook
Startup founders know this trap well. Downloads, signups, page views: these are vanity metrics. They look impressive on a pitch deck, they move up and to the right, and they tell you almost nothing about whether the product actually works. AI benchmarks have become the industry's version of vanity metrics. They're the numbers companies put on blog posts and keynote slides. They create a horse-race narrative that generates headlines. And they're increasingly disconnected from what matters to anyone trying to actually use these systems. The parallel runs deeper than the surface. Just as startups can game downloads through paid acquisition without building retention, AI labs can optimize for benchmark performance without improving real-world reliability. A study from METR in early 2026 found that roughly half of SWE-bench Verified solutions generated by AI agents would not actually be merged by repository maintainers. The benchmark said "solved." The real world said "not quite." Daniel Kang's analysis of AI agent benchmarks found severe issues in 8 out of 10 popular benchmarks, with some causing up to 100% misestimation of agent capabilities. In one case, WebArena marked an answer of "45 + 8 minutes" as correct when the right answer was 63 minutes. The test passed. The math didn't.
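To make that failure mode concrete, here is a minimal, hypothetical sketch of how an overly lenient grader can mark a wrong answer correct. Nothing below is WebArena's actual evaluation code; the function names, matching rule, and answer formats are illustrative assumptions.

```python
import re

# Illustrative only: NOT WebArena's real grader. The point is the shape of the bug,
# not the specific implementation.

def naive_fuzzy_grade(prediction: str, gold: str) -> bool:
    """Pass the answer if it 'looks like' the gold answer: same units,
    and it contains some number. The numeric value is never checked."""
    same_units = ("minute" in prediction.lower()) == ("minute" in gold.lower())
    has_a_number = bool(re.findall(r"\d+", prediction))
    return same_units and has_a_number

def strict_grade(prediction: str, gold: str) -> bool:
    """Evaluate the arithmetic before comparing: sum every number in each answer."""
    pred_total = sum(int(n) for n in re.findall(r"\d+", prediction))
    gold_total = sum(int(n) for n in re.findall(r"\d+", gold))
    return pred_total == gold_total

print(naive_fuzzy_grade("45 + 8 minutes", "63 minutes"))  # True  -> leaderboard says "solved"
print(strict_grade("45 + 8 minutes", "63 minutes"))       # False -> 45 + 8 = 53, not 63
```

A grader like the first one inflates scores silently; the second costs a few extra lines and catches the error. Multiply that gap across thousands of tasks and the leaderboard number drifts away from reality.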
Why companies optimize for the wrong thing
The Stanford HAI 2026 AI Index report reveals a telling dynamic. Industry now produces over 90% of all notable AI models. But the most capable models have stopped disclosing training code, parameter counts, dataset sizes, and training duration. OpenAI, Anthropic, and Google have all drawn the curtain shut. When internal details go dark, benchmarks become the only public signal of progress. They're the one number journalists can compare, investors can track, and competitors can try to beat. So that's what gets optimized. The Stanford Foundation Model Transparency Index tells the rest of the story: AI companies now average just 40 out of 100 on transparency, a significant decline from the prior year. Less transparency means more reliance on benchmarks as the primary measure of capability. More reliance on benchmarks means more incentive to optimize for them, regardless of whether the improvements translate to real-world performance. It's a flywheel, and it spins in the wrong direction.
The convergence problem
Here's what makes the benchmark arms race even more pointless: the models are converging. Claude Opus 4.6 and Gemini 3.1 Pro both clear 50% on Humanity's Last Exam. GPT-5.4 is right behind at 41.6%. Across the top tier, the gaps are narrowing. When every model aces the same tests, those tests stop being useful as differentiators. It's like comparing sprinters who all run a 9.8-second 100 meters. The number is impressive, but it doesn't tell you who wins the race that actually matters. If intelligence, at least as benchmarks measure it, is becoming a commodity, then the differentiator has to be something else. Taste. User experience. Reliability. How well the system handles the messy, ambiguous, context-dependent problems that benchmarks never capture.
What benchmarks can't measure
An Oxford study published at NeurIPS reviewed 445 AI benchmarks and found that many are "built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety." The EU's AI Watch identified nine fundamental challenges with AI benchmarking, including the fact that many benchmarks "fail to measure what they claim to measure." The things that matter most in production are exactly the things benchmarks are worst at capturing:
- Error recovery. When the model gets something wrong, how gracefully does it fail? Does it hallucinate with confidence or flag its uncertainty?
- User retention. Do people keep coming back? Not because the model scored well on a test, but because it actually helped them do their job.
- Deployment success rates. How often does the AI output work the first time, without human correction?
- Security and safety. Can the system be jailbroken? Does it leak sensitive data? No benchmark score tells you this.
- Net time saved. After accounting for error correction, fact-checking, and rework, does the AI actually save time? Or does it just shift the work from creation to verification?
These are hard to measure. They don't fit neatly into a leaderboard. And that's precisely the point.
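Hard, but not impossible to approximate. As a rough sketch (the event fields and the toy numbers below are hypothetical, not any particular product's telemetry), two of these metrics could be computed from ordinary usage logs:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    # Hypothetical per-task usage log; the fields are assumptions for illustration.
    accepted_first_try: bool        # did the AI output work without human correction?
    minutes_saved_drafting: float   # time the AI saved on the initial work
    minutes_verifying: float        # time spent reviewing, fact-checking, reworking

def first_pass_success_rate(records: list[TaskRecord]) -> float:
    """Deployment success: fraction of outputs that worked the first time."""
    return sum(r.accepted_first_try for r in records) / len(records)

def net_minutes_saved(records: list[TaskRecord]) -> float:
    """Net time saved: drafting time saved minus verification and rework time."""
    return sum(r.minutes_saved_drafting - r.minutes_verifying for r in records)

# Toy data: three tasks, one of which needed heavy rework.
logs = [
    TaskRecord(True, 30, 5),
    TaskRecord(True, 20, 10),
    TaskRecord(False, 25, 60),
]
print(f"first-pass success: {first_pass_success_rate(logs):.0%}")  # 67%
print(f"net minutes saved: {net_minutes_saved(logs):.0f}")         # 0
```

Numbers like these never look as clean as a benchmark score, and the third task in the toy data shows why: one bad output can erase the time savings of the two good ones. That is exactly the kind of signal a leaderboard hides.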
What honest evaluation looks like
Some efforts are moving in the right direction. SWE-bench Pro raised task difficulty and tightened dataset controls to better approximate professional software work. SWE-bench Live updates monthly with fresh tasks to avoid data contamination. METR's work on evaluating whether AI-generated code would actually be accepted by human maintainers is a step toward measuring real-world usefulness rather than test-passing ability. But the industry needs a more fundamental shift. Honest AI evaluation would look less like standardized testing and more like the metrics that mature software companies track: deployment success rates, user retention, error recovery, time-to-value, and customer satisfaction. It would measure outcomes, not outputs. The EU AI Act's push for transparency requirements is a policy-level acknowledgment that benchmark scores alone are insufficient. California's AI Training Data Transparency Act goes further, requiring disclosure of what models were actually trained on. These are imperfect tools, but they point in the right direction: toward accountability that goes beyond a number on a leaderboard.
Benchmarks served a purpose
This isn't a case for throwing benchmarks out entirely. In the early days of the current AI wave, they were genuinely useful. When GPT-3 couldn't reliably answer basic questions, MMLU was a meaningful measure of progress. When no AI could fix a real bug in a real codebase, SWE-bench was a reasonable proxy for coding ability. The problem isn't that benchmarks exist. It's that they've been captured by marketing. They've become the primary language through which AI progress is communicated to the public, to investors, and to policymakers. And that language is increasingly inadequate for the conversation we need to be having. The conversation we need is not "which model scored highest on the latest test." It's "which model actually works when you use it for something that matters." Those are very different questions, and right now, we're only asking the first one.
References
- Humanity's Last Exam, Center for AI Safety & Scale AI
- The 2026 AI Index Report, Stanford HAI
- 12 Graphs That Explain the State of AI in 2026, IEEE Spectrum
- Humanity's Last Exam Leaderboard 2026, Price Per Token
- Quinnipiac University Poll on AI, Quinnipiac University
- Key findings about how Americans view artificial intelligence, Pew Research Center
- AI Agent Benchmarks are Broken, Daniel Kang
- AI benchmarking: Nine challenges and a way forward, EU AI Watch
- Study identifies weaknesses in how AI systems are evaluated, Oxford Internet Institute
- Transparency in AI is on the decline, Stanford Report