HumanEval
Coding · pass@1 % · Python coding benchmark of 164 programming problems.
Updated Jun 1, 2025 · Latest measured Jun 1, 2025 · 0 verified · 9 self-reported
Verified results come from third-party or public leaderboard sources. Self-reported results come from provider papers, blogs, or vendor disclosures and should be compared with extra caution.
At a glance
- Total results: 9
- Models tested: 9
- Providers: 7
- Verified · Self-reported: 0 · 9
- Average: 90.44 pass@1 %
- Median: 90.2 pass@1 %
- Range: 85.4 – 94 pass@1 %
- Latest result: Jun 1, 2025
Score distribution
(Histogram of scores across bins; bin labels not recoverable, chart omitted.)
Methodology
pass@1 on 164 handwritten Python problems, each with unit tests. A model's score is the percentage of problems for which its first generation passes all tests.
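For concreteness, here is a minimal sketch of how a single pass@1 check can be scored. The problem field names (`prompt`, `test`, `entry_point`) follow the published HumanEval JSONL schema; the rest is a simplified assumption, and the official harness (github.com/openai/human-eval) runs candidates in sandboxed subprocesses with timeouts rather than a bare `exec`.

```python
# Minimal, illustrative pass@1 scorer for HumanEval-style problems.
# Assumes each problem dict carries the HumanEval JSONL fields:
#   prompt      - function signature + docstring shown to the model
#   test        - unit-test code defining check(candidate)
#   entry_point - name of the function under test

def passes_tests(completion: str, problem: dict) -> bool:
    """Return True if one model completion passes the problem's unit tests."""
    program = (
        problem["prompt"]                      # signature + docstring
        + completion                           # model-generated function body
        + "\n" + problem["test"]               # defines check(candidate)
        + f"\ncheck({problem['entry_point']})"
    )
    try:
        exec(program, {})  # WARNING: unsandboxed and untimed; illustration only
        return True
    except Exception:
        return False


def pass_at_1(problems: list[dict], completions: list[str]) -> float:
    """pass@1 as a percentage: share of problems whose first sample passes."""
    passed = sum(passes_tests(c, p) for p, c in zip(problems, completions))
    return 100.0 * passed / len(problems)
```

With one sample per problem this is the whole metric; for k samples per problem the Codex paper uses the unbiased estimator pass@k = 1 − C(n−c, k) / C(n, k), which reduces to the fraction above when n = k = 1.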
Limitations
Largely saturated: frontier models cluster near 95%+. Also narrow in language (Python only) and in problem style (short, self-contained functions).
By provider
- OpenAI · 2 models · Average: 93.75 pass@1 % · Best: GPT-5, 94 pass@1 %
- Anthropic · 2 models · Average: 92.5 pass@1 % · Best: Claude Opus 4, 93 pass@1 %
- DeepSeek · 1 model · DeepSeek V3, 90.2 pass@1 %
Full leaderboard
Showing 9 of 9

| # | Model | Provider | Score (pass@1 %) |
|---|---|---|---|
| 1 | GPT-5 | OpenAI | 94 |
| 2 | o3-mini | OpenAI | 93.5 |
| 3 | Claude Opus 4 | Anthropic | 93 |
| 4 | Claude Sonnet 4 | Anthropic | 92 |
| 5 | DeepSeek V3 | DeepSeek | 90.2 |
| 6 | Llama 3.1 405B | Meta | 89 |
| 7 | Grok 3 | xAI | 88.5 |
| 8 | Gemini 2 Pro | Google | 88.4 |
| 9 | Codestral | Mistral AI | 85.4 |