🔵 Practitioner

How to read an LLM benchmark without being fooled

LLM benchmarks have become marketing. Five questions to ask before believing any 'Model X beats Model Y' claim.

Every launch comes with a “beats GPT-X on N benchmarks” slide. Those slides are marketing. They don’t mean zero — they mean less than the vendor wants.

Here are five questions to ask before believing one.

Question 1 · Which benchmark, and what does it measure?

MMLU measures academic knowledge through English multiple-choice questions across 57 subjects. HumanEval measures basic Python programming on 164 short problems. GSM8K measures grade-school math word problems. Each covers a narrow slice.

“Beats GPT-5 on MMLU by 2 points” means marginally better at English-language factual knowledge. It doesn’t mean “better model.”

For your specific case (customer support in PT-BR, invoice classification, legal analysis), MMLU says almost nothing. Ask: which benchmark covers my task?

Question 2 · Did the model train on the benchmark?

Check for contamination. Many benchmarks leak into training corpora, and the model “memorizes” the answers. When you see “100% on benchmark X,” be suspicious: the model may have learned the test.

Anthropic, Google, and OpenAI have published contamination analyses for some of their models. Look for one. If it is absent, be suspicious.

Warning sign: a new model with very high scores on an old benchmark (4+ years) is likely contaminated.
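If you have a sample of text that plausibly resembles training data (a public web crawl, for instance), a rough n-gram overlap check can flag benchmark items that appear in it verbatim. The sketch below is only an illustration of that idea, not the method any vendor uses; the function names and the 8-word window are assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in corpus_text.

    The window size n=8 is an illustrative assumption, not a standard.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item too short to compare at this window size
    corpus_grams = ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A ratio near 1.0 means the item appears almost word for word in the sample. A low ratio proves little, since paraphrased or translated copies would not match.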

Question 3 · Variance and setup

LLM results vary by seed, prompt format, and temperature. A reported “62.3%” can move by ±2 points across runs, so a comparison showing a 1-point gain may be noise.

Ask: how many runs? What standard deviation? What prompt format? What temperature? If the answer is “we ran once,” the number isn’t reliable.
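To make "may be noise" concrete, here is a minimal sketch that summarizes repeated runs and asks whether the gap between two models is larger than the run-to-run spread. The scores are invented for illustration, and the threshold of two standard errors is a crude rule of thumb assumed here, not a formal significance test.

```python
import statistics


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation); needs at least two runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def gap_exceeds_noise(a: list[float], b: list[float], k: float = 2.0) -> bool:
    """Crude check: is the gap between means larger than k standard errors?"""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    # Standard error of the difference between two independent means.
    se = (sd_a ** 2 / len(a) + sd_b ** 2 / len(b)) ** 0.5
    return abs(mean_a - mean_b) > k * se


# Hypothetical scores from five runs each (different seeds, same prompt).
model_x = [62.3, 60.9, 63.8, 61.5, 62.0]
model_y = [61.2, 63.0, 60.4, 62.7, 61.9]
print(summarize(model_x), summarize(model_y))
print(gap_exceeds_noise(model_x, model_y))
```

With these made-up numbers the two means differ by well under one point while each model swings by about a point between runs, so the check reports that the gap does not stand out from the noise.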

Question 4 · Cherry-picking tasks

Say a composite benchmark has 30 subtasks. The vendor highlights the 8 where it leads and ignores the 22 where it ties or loses. That is legitimate marketing, but it distorts the picture.

For aggregate benchmarks (MMLU, MMLU-Pro, BIG-bench), look at the overall score. For per-category claims, demand the full table.
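The effect of cherry-picking is easy to see with a small example. The sketch below compares the average over all subtasks with the average over only the subtasks where "our" model leads. The subtask names and scores are invented, and the unweighted mean is an assumption (real benchmarks may weight subtasks by question count).

```python
def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean across subtasks."""
    return sum(scores.values()) / len(scores)


def winning_subtasks(ours: dict[str, float], theirs: dict[str, float]) -> list[str]:
    """Subtasks where our score is strictly higher than theirs."""
    return [task for task in ours if ours[task] > theirs[task]]


# Hypothetical per-subtask scores for two models.
ours = {"algebra": 71.0, "law": 58.0, "biology": 80.0, "history": 66.0, "coding": 75.0}
theirs = {"algebra": 68.0, "law": 64.0, "biology": 79.0, "history": 72.0, "coding": 75.0}

picked = winning_subtasks(ours, theirs)
print("full table:   ", macro_average(ours), "vs", macro_average(theirs))
print("cherry-picked:", picked,
      macro_average({t: ours[t] for t in picked}), "vs",
      macro_average({t: theirs[t] for t in picked}))
```

On the full table "our" model trails, yet on the hand-picked subtasks it leads. Both statements are true, which is why the full table matters.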

Question 5 · Reproducibility

Does the paper publish the prompts used, the parameters, and the evaluation code? Can you (or a third party) run it and get the same number? If the answer is “trust us,” the number hasn’t been independently verified.

For a vendor-internal benchmark without public reproduction, treat the result as marketing: useful as a directional signal, not as truth.
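As a rough picture of what "publishing the setup" can look like, the sketch below records the settings of an evaluation run in one place and derives a short fingerprint from them, so that two reported numbers can be checked for identical settings. The class and every field name are hypothetical, not a standard format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunConfig:
    """Hypothetical record of everything needed to rerun an evaluation."""
    benchmark: str
    model: str
    prompt_template: str
    temperature: float
    seed: int
    num_runs: int

    def to_json(self) -> str:
        # sort_keys makes the serialization stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def fingerprint(self) -> str:
        # Short hash of the settings; any change to a field changes it.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:12]


config = EvalRunConfig(
    benchmark="example-benchmark",   # placeholder name
    model="example-model",           # placeholder name
    prompt_template="Q: {question}\nA:",
    temperature=0.0,
    seed=0,
    num_runs=5,
)
print(config.to_json())
print(config.fingerprint())
```

A score reported next to a record like this can be rerun by a third party. A score reported without one cannot.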

The practical rule

In 2026, the gap between the leading vendors’ top models (Claude Opus, GPT-5, Gemini Ultra) on standard benchmarks is small, almost always 0-5 points. Choosing what to buy based on benchmarks alone isn’t worth the effort.

Use benchmarks to:

  • Eliminate clearly trailing models (10+ points below top).
  • Identify specialty (“this model is strong on code, weak on multimodal”).

Don’t use benchmarks to:

  • Declare “Model X is better than Model Y” based on a 1-3 point lead.
  • Estimate performance on YOUR use case (which probably isn’t in the benchmark).
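The two lists above can be sketched as a small filter: drop models that trail the top score by 10 or more points, and refuse to call a winner when the lead is 3 points or less. The model names and scores are invented, and the function names are assumptions for illustration.

```python
def shortlist(scores: dict[str, float], cutoff: float = 10.0) -> list[str]:
    """Keep models that are less than `cutoff` points below the top score."""
    top = max(scores.values())
    return [model for model, score in scores.items() if top - score < cutoff]


def is_meaningful_lead(a: float, b: float, noise_band: float = 3.0) -> bool:
    """Treat leads of `noise_band` points or less as too close to call."""
    return abs(a - b) > noise_band


# Hypothetical benchmark scores.
scores = {"model_a": 88.0, "model_b": 86.5, "model_c": 77.0}
print(shortlist(scores))
print(is_meaningful_lead(scores["model_a"], scores["model_b"]))
```

Here model_c is eliminated for trailing by 11 points, while model_a and model_b stay on the shortlist with no winner declared between them.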

The alternative: internal eval

For real decisions, build an internal eval: collect 50-100 examples from your use case, run 3-5 models on them, and have humans rate the output quality. It costs about a day of work and delivers orders of magnitude more signal than any public benchmark.
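A minimal skeleton of such an eval might look like the sketch below. Everything in it is a stand-in: the models are plain functions where real API calls would go, and the rater is a stub where a human reviewer's 1-5 score would be collected.

```python
from statistics import mean
from typing import Callable

Model = Callable[[str], str]        # prompt -> model output
Rater = Callable[[str, str], int]   # (prompt, output) -> rating from 1 to 5


def run_eval(examples: list[str], models: dict[str, Model], rate: Rater) -> dict[str, float]:
    """Run every model on every example and return the mean rating per model."""
    results = {}
    for name, generate in models.items():
        ratings = [rate(prompt, generate(prompt)) for prompt in examples]
        results[name] = mean(ratings)
    return results


# Stand-ins for illustration only; replace with real model calls and human ratings.
examples = ["Classify this invoice: ...", "Reply to this support ticket: ..."]
models = {
    "model_a": lambda prompt: "some answer to: " + prompt,
    "model_b": lambda prompt: "",
}


def stub_rater(prompt: str, output: str) -> int:
    """Placeholder for a human rating: empty answers get 1, anything else 5."""
    return 5 if output else 1


print(run_eval(examples, models, stub_rater))
```

The structure is the point: the same examples go to every model, and the ratings come from people who know the task.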

How to build an internal eval is a topic of its own, covered in the AI Engineering cluster.

How to go deeper

For a practical comparison instead of a benchmark, read Claude vs Copilot vs Gemini for enterprise.