How to read an LLM benchmark without being fooled
LLM benchmarks have become marketing. Five questions to ask before believing any 'Model X beats Model Y' claim.
Every launch comes with a “beats GPT-X on N benchmarks” slide. Those slides are marketing. The numbers aren’t meaningless, but they mean less than the vendor wants you to think.
Five questions to ask before believing.
Question 1 · Which benchmark, and what does it measure?
MMLU measures English academic curriculum knowledge. HumanEval measures basic Python. GSM8K measures grade-school math. Each covers a narrow slice.
“Beats GPT-5 on MMLU by 2 points” means the model is marginally better at English-language academic knowledge questions. It doesn’t mean “better model.”
For your specific case (customer support in PT-BR, invoice classification, legal analysis), MMLU says almost nothing. Ask: which benchmark covers my task?
Question 2 · Did the model train on the benchmark?
This is the contamination check. Many benchmarks leak into training corpora, and the model “memorizes” the answers. When you see “100% on benchmark X,” be suspicious: the model may have learned the test.
Labs such as Anthropic, Google, and OpenAI sometimes include contamination analyses in their technical reports or model cards. Look for them. If there is none, be suspicious.
Warning sign: a new model with high scores on an old benchmark (4+ years) is likely contaminated.
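One common way to look for contamination is word n-gram overlap between benchmark items and training text. The sketch below is a minimal illustration, not any vendor's actual method. The function names and the 8-word window are assumptions made for the example. It also only works on text you can actually inspect, such as your own fine-tuning data, because outsiders rarely have access to a vendor's training corpus.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in `corpus_text`.

    The window size n=8 is an arbitrary choice for illustration.
    Items shorter than n words have no n-grams and return 0.0.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)
```

A ratio close to 1.0 means the item appears nearly verbatim in the inspected text. A low ratio does not prove the item is clean, since paraphrased or translated copies would slip through.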
Question 3 · Variance and setup
LLM results vary by seed, prompt format, and temperature. A reported “62.3%” can move by ±2 percentage points across runs, so a comparison showing a 1-point gain may be noise.
Ask: how many runs? What standard deviation? What prompt format? What temperature? If the answer is “we ran once,” the number isn’t reliable.
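To make the noise argument concrete, here is a rough sketch that compares the gap between two models with the run-to-run spread. It is a crude heuristic, not a formal significance test. The function names, the factor `k`, and the scores in the usage note are invented for illustration.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over repeated runs."""
    return mean(scores), stdev(scores)


def gap_exceeds_noise(scores_a: list[float], scores_b: list[float], k: float = 2.0) -> bool:
    """True if the gap between means is larger than k times the larger
    per-model standard deviation. The threshold k=2.0 is an assumption."""
    mean_a, sd_a = summarize(scores_a)
    mean_b, sd_b = summarize(scores_b)
    return abs(mean_a - mean_b) > k * max(sd_a, sd_b)
```

With five invented runs per model, `gap_exceeds_noise([62.3, 60.9, 63.8, 61.5, 62.0], [63.3, 61.7, 64.4, 62.1, 63.0])` returns `False`: the means differ by less than a point while each model swings by about a point between runs.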
Question 4 · Cherry-picking tasks
A composite benchmark has 30 subtasks. The vendor picks the 8 where they lead and ignores the 22 where they tie or lose. It’s legitimate marketing, but it distorts the picture.
For aggregate benchmarks (MMLU, MMLU-Pro, BIG-bench), take the global score. For per-category claims, demand the full table.
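The effect of task selection is easy to show with arithmetic. The sketch below computes a model's average lead over a chosen set of subtasks. The function name and the four-subtask scores in the usage note are invented, and it assumes an unweighted mean across subtasks, which is not how every benchmark aggregates.

```python
def average_lead(ours: dict[str, float], theirs: dict[str, float], tasks: list[str]) -> float:
    """Mean of (our score - their score) over the given subtasks."""
    return sum(ours[t] - theirs[t] for t in tasks) / len(tasks)


def winning_tasks(ours: dict[str, float], theirs: dict[str, float]) -> list[str]:
    """Subtasks where we strictly lead: the ones a selective slide would show."""
    return [t for t in ours if ours[t] > theirs[t]]
```

With `ours = {"a": 80, "b": 60, "c": 70, "d": 50}` and `theirs = {"a": 75, "b": 65, "c": 72, "d": 58}`, the lead over `winning_tasks(ours, theirs)` is +5.0, while the lead over all four subtasks is -2.5. Same data, opposite headline.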
Question 5 · Reproducibility
Does the paper publish the prompts used, the parameters, and the evaluation code? Can you (or a third party) run it and get the same number? If the answer is “trust us,” the number hasn’t been independently verified.
For a vendor-internal benchmark without public reproduction, treat the result as marketing: useful as a directional signal, not as truth.
The practical rule
In 2026, the gap between the vendors’ top models (Claude Opus, GPT-5, Gemini Ultra) on standard benchmarks is small, almost always 0-5 points. Choosing what to buy on benchmark scores alone isn’t worth the effort.
Use benchmarks to:
- Eliminate clearly trailing models (10+ points below top).
- Identify specialty (“this model is strong on code, weak on multimodal”).
Don’t use benchmarks to:
- Conclude “Model X is better than Model Y” from a 1-3 point lead.
- Estimate performance on YOUR use case (which probably isn’t in the benchmark).
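The rule above can be written as a small triage function. This is a sketch under stated assumptions: the 10-point elimination cutoff and the 3-point noise band come from the lists above, while the function name, the label strings, and the "unclear" bucket for gaps in between are additions made for illustration.

```python
def triage(scores: dict[str, float], eliminate_gap: float = 10.0, noise_gap: float = 3.0) -> dict[str, str]:
    """Label each model by its distance from the top score.

    - gap >= eliminate_gap: clearly trailing, drop it
    - gap <= noise_gap: too close to call, keep it on the shortlist
    - anything in between: the benchmark alone does not settle it
    """
    top = max(scores.values())
    labels = {}
    for model, score in scores.items():
        gap = top - score
        if gap >= eliminate_gap:
            labels[model] = "eliminate"
        elif gap <= noise_gap:
            labels[model] = "shortlist"
        else:
            labels[model] = "unclear"
    return labels
```

Everything labeled "shortlist" or "unclear" still has to be judged on your own task, since the benchmark does not separate those models.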
The alternative: internal eval
For real decisions, build an internal eval: take 50-100 examples from your use case, run them through 3-5 models, and have humans rate the output quality. It costs about a day of work and delivers far more signal than any public benchmark.
How to build an internal eval is a topic of its own, covered in the AI Engineering cluster.
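As a rough picture of the shape such an eval takes, here is a minimal sketch. Everything in it is hypothetical: `run_eval`, the `generate` callables standing in for model API calls, and the `rate` callable standing in for a human reviewer's score. A real setup would store the outputs and collect ratings from people, not from a function.

```python
from statistics import mean
from typing import Callable


def run_eval(
    examples: list[str],
    models: dict[str, Callable[[str], str]],
    rate: Callable[[str, str], int],
) -> dict[str, float]:
    """Run every example through every model and average the ratings.

    `models` maps a model name to a function prompt -> output.
    `rate` takes (prompt, output) and returns a quality score, e.g. 1-5.
    """
    results = {}
    for name, generate in models.items():
        ratings = [rate(prompt, generate(prompt)) for prompt in examples]
        results[name] = mean(ratings)
    return results
```

To try it without any API, pass stub models such as `{"echo": lambda p: p, "empty": lambda p: ""}` and a stub rater such as `lambda prompt, out: 5 if out else 1`. Swapping the stubs for real model calls and real human ratings leaves the loop unchanged.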
How to go deeper
For a practical comparison rather than a benchmark, read Claude vs Copilot vs Gemini for enterprise.