HumanEval
Coding · pass@1 % · Python coding benchmark of 164 programming problems.
Updated Jun 1, 2025 · Latest measured Jun 1, 2025 · 0 verified · 9 self-reported
Verified results come from third-party or public leaderboard sources. Self-reported results come from provider papers, blogs, or vendor disclosures and should be compared with extra caution.
At a glance
- Total results: 9
- Models tested: 9
- Providers: 7
- Verified · Self-reported: 0 · 9
- Average: 90.44 pass@1 %
- Median: 90.2 pass@1 %
- Range: 85.4 – 94 pass@1 %
- Latest result: Jun 1, 2025
Score distribution
(Histogram of scores across bins; bin labels not recoverable, chart omitted.)
Methodology
pass@1 on 164 handwritten Python problems, each with unit tests. A model's score is the percentage of problems for which its first generation passes all tests.
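For concreteness, here is a minimal sketch of how a single pass@1 check can be scored. The problem field names (`prompt`, `test`, `entry_point`) follow the published HumanEval JSONL schema; the rest is a simplified assumption, and the official harness (github.com/openai/human-eval) runs candidates in sandboxed subprocesses with timeouts rather than a bare `exec`.

```python
# Minimal, illustrative pass@1 scorer for HumanEval-style problems.
# Assumes each problem dict carries the HumanEval JSONL fields:
#   prompt      - function signature + docstring shown to the model
#   test        - unit-test code defining check(candidate)
#   entry_point - name of the function under test

def passes_tests(completion: str, problem: dict) -> bool:
    """Return True if one model completion passes the problem's unit tests."""
    program = (
        problem["prompt"]                      # signature + docstring
        + completion                           # model-generated function body
        + "\n" + problem["test"]               # defines check(candidate)
        + f"\ncheck({problem['entry_point']})"
    )
    try:
        exec(program, {})  # WARNING: unsandboxed and untimed; illustration only
        return True
    except Exception:
        return False


def pass_at_1(problems: list[dict], completions: list[str]) -> float:
    """pass@1 as a percentage: share of problems whose first sample passes."""
    passed = sum(passes_tests(c, p) for p, c in zip(problems, completions))
    return 100.0 * passed / len(problems)
```

With one sample per problem this is the whole metric; for k samples per problem the Codex paper uses the unbiased estimator pass@k = 1 − C(n−c, k) / C(n, k), which reduces to the fraction above when n = k = 1.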
Limitations
Largely saturated: frontier models cluster near 95%+. Also narrow in language (Python only) and in problem style (short, self-contained functions).
By provider
- OpenAI · 2 models · Average: 93.75 pass@1 % · Best: GPT-5, 94 pass@1 %
- Anthropic · 2 models · Average: 92.5 pass@1 % · Best: Claude Opus 4, 93 pass@1 %
- DeepSeek · 1 model · DeepSeek V3, 90.2 pass@1 %
Full leaderboard
Showing 9 of 9

| # | Model | Provider | Score (pass@1 %) |
|---|---|---|---|
| 1 | GPT-5 | OpenAI | 94 |
| 2 | o3-mini | OpenAI | 93.5 |
| 3 | Claude Opus 4 | Anthropic | 93 |
| 4 | Claude Sonnet 4 | Anthropic | 92 |
| 5 | DeepSeek V3 | DeepSeek | 90.2 |
| 6 | Llama 3.1 405B | Meta | 89 |
| 7 | Grok 3 | xAI | 88.5 |
| 8 | Gemini 2 Pro | Google | 88.4 |
| 9 | Codestral | Mistral AI | 85.4 |