Modeldex
© 2026 Modeldex — the AI model registry.



Benchmarks

16 benchmarks tracked · 154 total results · 9 categories

Coding (3) · Factual grounding (1) · General knowledge (2) · Human preference (1) · Instruction following (1) · Long context (3) · Math (3) · Medical (1) · Reasoning (1)

Coding

  • Aider Polyglot

    22 results

    Real-world coding edits across 6 programming languages — measures whether the model produces a correct edit accepted on second attempt.

    🏆 GPT-5 (OpenAI)
    88 % pass@2
    Updated yesterday · Latest result Apr 21, 2026 · 22 verified · 0 self-reported

  • SWE-bench Verified

    19 results

    Real GitHub issues solved end-to-end by the model.

    🏆 Claude Sonnet 4.6 (Anthropic)
    72.7 % resolved

  • HumanEval

    9 results

    Python coding benchmark of 164 programming problems.

    🏆 GPT-5 (OpenAI)
    94 % pass@1
    Updated Jun 1, 2025 · Latest result Jun 1, 2025 · 0 verified · 9 self-reported

Factual grounding

  • FACTS Grounding

    5 results

    Factual grounding evaluation — measures whether a model's answer is supported by the provided source documents.

    🏆 GPT-5.4 (OpenAI)
    91.86 %
    Updated 4 days ago · Latest result Apr 18, 2026 · 5 verified · 0 self-reported

General knowledge

  • MMLU

    13 results

    Massive Multitask Language Understanding — 57-subject multiple-choice exam.

    🏆 o3 (OpenAI)
    91 % accuracy
    Updated Jun 1, 2025 · Latest result Jun 1, 2025 · 2 verified · 11 self-reported

  • MMLU-Pro

    7 results

    Harder reformulation of MMLU with 10 answer choices and deeper reasoning.

    🏆 o3 (OpenAI)
    81.2 % accuracy
    Updated Jun 1, 2025 · Latest result Jun 1, 2025 · 2 verified · 5 self-reported

Human preference

  • Arena Hard Auto

    2 results

    Automatic proxy for LMArena Elo — pairwise head-to-head comparisons graded by GPT-4-Turbo against a curated set of 500 hard prompts. Strong predictor of human preference ratings.

    🏆 GPT-4 (OpenAI)
    82.63 % win rate
    Updated 4 days ago · Latest result Apr 18, 2026 · 2 verified · 0 self-reported

Instruction following

  • MultiChallenge

    7 results

    Multi-step instruction-following across diverse tasks (math, coding, writing, reasoning) — measures aggregate capability breadth.

    🏆 Qwen3.5-27B (Alibaba)
    58.65 %
    Updated 4 days ago · Latest result Apr 18, 2026 · 7 verified · 0 self-reported

Long context

  • LongBench v2

    9 results

    Long-context understanding across documents of 32K–2M tokens — tests whether the model can retrieve and reason over facts deep in the input.

    🏆 Qwen3.5-27B (Alibaba)
    61.11 %
    Updated 4 days ago · Latest result Apr 18, 2026 · 9 verified · 0 self-reported

  • MRCR v2

    9 results

    Multi-Round Conversational Reasoning — tests whether a model can maintain facts and context across long multi-turn dialogues.

    🏆 Claude Sonnet 4.6 (Anthropic)
    50.78 %
    Updated 4 days ago · Latest result Apr 18, 2026 · 9 verified · 0 self-reported

  • NoLiMa

    6 results

    Long-context information retrieval without literal matching — requires semantic reasoning to find relevant facts, not string-matching.

    🏆 Claude Opus 4.7 (Anthropic)
    83.46 %
    Updated 4 days ago · Latest result Apr 18, 2026 · 6 verified · 0 self-reported

Math

  • AIME 2024

    7 results

    American Invitational Mathematics Examination, 2024 problems.

    🏆 o3 (OpenAI)
    96.7 % accuracy
    Updated Jun 1, 2025 · Latest result Jun 1, 2025 · 1 verified · 6 self-reported

  • GSM8K

    8 results

    Grade-school math word problems.

    🏆 DeepSeek V3 (DeepSeek)
    97.1 % accuracy
    Updated Jun 1, 2025 · Latest result Jun 1, 2025 · 0 verified · 8 self-reported

  • MATH

    9 results

    12,500 competition math problems across 7 subjects.

    🏆 o3 (OpenAI)
    97.8 % accuracy
    Updated Jun 1, 2025 · Latest result Jun 1, 2025 · 0 verified · 9 self-reported

Medical

  • HealthBench

    6 results

    Clinical reasoning and medical knowledge benchmark — evaluates a model's ability to answer healthcare questions with accuracy and safety.

    🏆 GPT-5.4 nano (OpenAI)
    27.53 %
    Updated 4 days ago · Latest result Apr 18, 2026 · 6 verified · 0 self-reported

Reasoning

  • GPQA Diamond

    16 results

    Graduate-level, Google-proof science reasoning questions.

    🏆 Claude Opus 4.7 (Anthropic)
    87.88 % accuracy
    Updated 4 days ago · Latest result Apr 18, 2026 · 10 verified · 6 self-reported
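Several coding results above are reported as pass@k (pass@1 for HumanEval, pass@2 for Aider Polyglot): the probability that at least one of k sampled completions passes the tests. As a reference, here is a minimal sketch of the standard unbiased pass@k estimator computed from n generations per problem; the function name and the example numbers are illustrative, not part of Modeldex or any specific leaderboard's harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    drawn without replacement from n generations is correct, given that
    c of the n generations passed the tests."""
    if n - c < k:
        # Fewer failing samples than draws: a correct one is guaranteed.
        return 1.0
    # 1 minus the probability that all k draws come from the n-c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 5 of them correct.
print(pass_at_k(10, 5, 1))  # 0.5
```

Note that pass@k is averaged over problems in practice, and that scoring all n generations (rather than just k) is what makes the estimator unbiased and low-variance.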