HealthBench
Medical · A clinical reasoning and medical knowledge benchmark that evaluates a model's ability to answer healthcare questions with accuracy and safety.
Updated 4 days ago · Latest measured Apr 18, 2026 · 6 verified · 0 self-reported
Verified results come from third-party or public leaderboard sources. Self-reported results come from provider papers, blogs, or vendor disclosures and should be compared with extra caution.
At a glance
🏆 Top score
27.53 % (GPT-5.4 nano)
Total results
6
Models tested
6
Providers
2
Verified · Self-reported
6 · 0
Average
22.92 %
Median
24.46 %
Range
17.28 – 27.53 %
Latest result
Apr 18, 2026
Methodology
1,000 medical prompts are graded by a judge model against expert-written rubric criteria. The score reflects the fraction of criteria met.
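The criteria-met scoring described above can be sketched as follows. The data layout and field shapes here are illustrative assumptions, not the benchmark's actual schema: each prompt is assumed to carry weighted rubric criteria, each marked met or unmet by the judge.

```python
# Sketch of rubric-based scoring: each prompt has expert-written
# criteria with point weights; a judge model marks each criterion
# as met or not, and the score is points earned over points possible.
# The (weight, met) pair layout is an assumption for illustration.

def criteria_met_rate(graded_prompts):
    """graded_prompts: list of prompts, each a list of (weight, met) pairs.

    Returns the criteria-met rate as a percentage.
    """
    earned = sum(w for prompt in graded_prompts for w, met in prompt if met)
    possible = sum(w for prompt in graded_prompts for w, _ in prompt)
    return 100.0 * earned / possible if possible else 0.0

# Example: two prompts, judge marked some criteria as met.
graded = [
    [(5, True), (3, False), (2, True)],   # prompt 1: 7 of 10 points
    [(4, False), (6, True)],              # prompt 2: 6 of 10 points
]
print(round(criteria_met_rate(graded), 2))  # 65.0
```

Pooling points across all prompts (rather than averaging per-prompt rates) is one of several plausible aggregation choices; the page does not specify which is used.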
Limitations
Susceptible to judge-model error; medical accuracy is hard to score programmatically.
By provider
- OpenAI · 3 models · Best: 27.53 % (GPT-5.4 nano) · Average: 25.63 %
- Anthropic · 3 models · Best: 25.59 % (Claude Opus 4.7) · Average: 20.2 %
Full leaderboard
Showing 6 of 6
| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | GPT-5.4 nano | OpenAI | 27.53 |
| 2 | GPT-5.4 | OpenAI | 26.04 |
| 3 | Claude Opus 4.7 | Anthropic | 25.59 |
| 4 | GPT-5.4 mini | OpenAI | 23.32 |
| 5 | Claude Sonnet 4.6 | Anthropic | 17.74 |
| 6 | Claude Haiku 4.5 | Anthropic | 17.28 |
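The "At a glance" aggregates can be recomputed directly from the leaderboard scores above; a minimal check with the standard library:

```python
import statistics

# Leaderboard scores from the table above.
scores = [27.53, 26.04, 25.59, 23.32, 17.74, 17.28]

average = statistics.mean(scores)    # ~22.92, matching "Average"
median = statistics.median(scores)   # ~24.46, matching "Median"
lo, hi = min(scores), max(scores)    # 17.28 and 27.53, matching "Range"

print(f"avg={average:.2f} median={median:.2f} range={lo}-{hi}")
```

With six entries the median is the mean of the 3rd and 4th scores (25.59 and 23.32), which is why it sits between GPT-5.4 mini and Claude Opus 4.7.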
Community ratings
No ratings yet.