16 benchmarks tracked · 154 total results · 9 categories
Real-world coding edits across 6 programming languages — measures whether the model produces a correct edit within two attempts.
Real GitHub issues solved end-to-end by the model.
Massive Multitask Language Understanding — 57-subject multiple-choice exam.
Harder reformulation of MMLU with 10 answer choices per question and questions requiring deeper reasoning.
Long-context understanding across documents of 32K–2M tokens — tests whether the model can retrieve and reason over facts planted deep in the input (see the sketch after this list).
Multi-Round Conversational Reasoning — tests whether a model can maintain facts and context across long multi-turn dialogues.
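For intuition on the long-context retrieval test, evaluations of this kind are often built by embedding a known fact ("needle") at a chosen depth inside a long stretch of filler text, then checking whether the model's answer recovers it. The sketch below illustrates that construction; the helper names, the planted fact, and the containment-based scoring are assumptions for illustration, not the benchmark's actual harness.

    # Minimal sketch of a needle-in-a-haystack long-context probe.
    # All names and the planted fact are illustrative assumptions.

    def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
        """Embed `needle` at a relative `depth` (0.0 = start, 1.0 = end) in filler text."""
        body = (filler * (total_chars // len(filler) + 1))[:total_chars]
        pos = int(len(body) * depth)
        return body[:pos] + "\n" + needle + "\n" + body[pos:]

    needle = "The access code for the vault is 7412."  # hypothetical planted fact
    context = build_haystack(
        needle,
        filler="The sky was grey that morning. ",
        total_chars=200_000,  # scale up toward the 32K-2M token range
        depth=0.5,            # bury the fact at the midpoint of the document
    )
    question = "What is the access code for the vault?"

    # `context` + `question` would be sent to the model under test;
    # scoring here is a simple containment check on its reply.
    def is_correct(reply: str) -> bool:
        return "7412" in reply

Sweeping `depth` and `total_chars` across a grid is a common way to map where in the context window retrieval starts to degrade.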