16 benchmarks tracked · 154 total results · 9 categories
Real-world coding edits across 6 programming languages — measures whether the model produces a correct edit within two attempts.
Real GitHub issues solved end-to-end by the model.
Massive Multitask Language Understanding — 57-subject multiple-choice exam.
Harder reformulation of MMLU with 10 answer choices per question and questions requiring deeper reasoning.
Long-context understanding across documents of 32K–2M tokens — tests whether the model can retrieve and reason over facts planted deep in the input (see the sketch after this list).
Multi-Round Conversational Reasoning — tests whether a model can maintain facts and context across long multi-turn dialogues.
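For intuition on the long-context retrieval test, evaluations of this kind are often built by embedding a known fact ("needle") at a chosen depth inside a long stretch of filler text, then checking whether the model's answer recovers it. The sketch below illustrates that construction; the helper names, the planted fact, and the containment-based scoring are assumptions for illustration, not the benchmark's actual harness.

    # Minimal sketch of a needle-in-a-haystack long-context probe.
    # All names and the planted fact are illustrative assumptions.

    def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
        """Embed `needle` at a relative `depth` (0.0 = start, 1.0 = end) in filler text."""
        body = (filler * (total_chars // len(filler) + 1))[:total_chars]
        pos = int(len(body) * depth)
        return body[:pos] + "\n" + needle + "\n" + body[pos:]

    needle = "The access code for the vault is 7412."  # hypothetical planted fact
    context = build_haystack(
        needle,
        filler="The sky was grey that morning. ",
        total_chars=200_000,  # scale up toward the 32K-2M token range
        depth=0.5,            # bury the fact at the midpoint of the document
    )
    question = "What is the access code for the vault?"

    # `context` + `question` would be sent to the model under test;
    # scoring here is a simple containment check on its reply.
    def is_correct(reply: str) -> bool:
        return "7412" in reply

Sweeping `depth` and `total_chars` across a grid is a common way to map where in the context window retrieval starts to degrade.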