HealthBench
Medical · A clinical reasoning and medical knowledge benchmark that evaluates a model's ability to answer healthcare questions with accuracy and safety.
Updated 4 days ago · Latest measured Apr 18, 2026 · 6 verified · 0 self-reported
Verified results come from third-party or public leaderboard sources. Self-reported results come from provider papers, blogs, or vendor disclosures and should be compared with extra caution.
At a glance
🏆 Top score
27.53 % (GPT-5.4 nano)
Total results
6
Models tested
6
Providers
2
Verified · Self-reported
6 · 0
Average
22.92 %
Median
24.46 %
Range
17.28 – 27.53 %
Latest result
Apr 18, 2026
Methodology
1,000 medical prompts are graded by a judge model against expert-written rubric criteria. The score reflects the fraction of criteria met.
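The criteria-met scoring described above can be sketched as follows. The data layout and field shapes here are illustrative assumptions, not the benchmark's actual schema: each prompt is assumed to carry weighted rubric criteria, each marked met or unmet by the judge.

```python
# Sketch of rubric-based scoring: each prompt has expert-written
# criteria with point weights; a judge model marks each criterion
# as met or not, and the score is points earned over points possible.
# The (weight, met) pair layout is an assumption for illustration.

def criteria_met_rate(graded_prompts):
    """graded_prompts: list of prompts, each a list of (weight, met) pairs.

    Returns the criteria-met rate as a percentage.
    """
    earned = sum(w for prompt in graded_prompts for w, met in prompt if met)
    possible = sum(w for prompt in graded_prompts for w, _ in prompt)
    return 100.0 * earned / possible if possible else 0.0

# Example: two prompts, judge marked some criteria as met.
graded = [
    [(5, True), (3, False), (2, True)],   # prompt 1: 7 of 10 points
    [(4, False), (6, True)],              # prompt 2: 6 of 10 points
]
print(round(criteria_met_rate(graded), 2))  # 65.0
```

Pooling points across all prompts (rather than averaging per-prompt rates) is one of several plausible aggregation choices; the page does not specify which is used.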
Limitations
Susceptible to judge-model error; medical accuracy is hard to score programmatically.
By provider
- OpenAI · 3 models · Best: 27.53 % (GPT-5.4 nano) · Average: 25.63 %
- Anthropic · 3 models · Best: 25.59 % (Claude Opus 4.7) · Average: 20.2 %
Full leaderboard
Showing 6 of 6
| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | GPT-5.4 nano | OpenAI | 27.53 |
| 2 | GPT-5.4 | OpenAI | 26.04 |
| 3 | Claude Opus 4.7 | Anthropic | 25.59 |
| 4 | GPT-5.4 mini | OpenAI | 23.32 |
| 5 | Claude Sonnet 4.6 | Anthropic | 17.74 |
| 6 | Claude Haiku 4.5 | Anthropic | 17.28 |
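The "At a glance" aggregates can be recomputed directly from the leaderboard scores above; a minimal check with the standard library:

```python
import statistics

# Leaderboard scores from the table above.
scores = [27.53, 26.04, 25.59, 23.32, 17.74, 17.28]

average = statistics.mean(scores)    # ~22.92, matching "Average"
median = statistics.median(scores)   # ~24.46, matching "Median"
lo, hi = min(scores), max(scores)    # 17.28 and 27.53, matching "Range"

print(f"avg={average:.2f} median={median:.2f} range={lo}-{hi}")
```

With six entries the median is the mean of the 3rd and 4th scores (25.59 and 23.32), which is why it sits between GPT-5.4 mini and Claude Opus 4.7.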
Community ratings
No ratings yet.