Arena Hard Auto

Human preference · % win rate

Automatic proxy for LMArena Elo: pairwise head-to-head comparisons, graded by GPT-4-Turbo, on a curated set of 500 hard prompts. Reported to be a strong predictor of human preference ratings.

Updated 4 days ago · Latest measured Apr 18, 2026 · 2 verified · 0 self-reported

Verified results come from third-party or public leaderboard sources. Self-reported results come from provider papers, blogs, or vendor disclosures and should be compared with extra caution.

At a glance

🏆 Top score

GPT-4 (OpenAI): 82.63 % win rate

Total results

2

Models tested

2

Providers

1

Verified · Self-reported

2 · 0

Average

80.92 % win rate

Median

80.92 % win rate

Range

79.21 – 82.63 % win rate

Methodology

500 challenging prompts (coding, math, reasoning, creative writing). A GPT-4-Turbo judge compares each model's output head-to-head against a reference response. The score is the model's win rate (0–100%).
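As a rough illustration of the scoring step, the sketch below turns a list of per-prompt judge verdicts into a win rate. The verdict labels, the `win_rate` function, and the choice to count a tie as half a win are assumptions made for this example; the page does not say how ties are handled or how the judge's output is encoded. The verdict counts are invented, not real benchmark data.

```python
from collections import Counter

# Hypothetical verdict labels; the judge's real output format is not given here.
WIN, TIE, LOSS = "win", "tie", "loss"


def win_rate(verdicts, tie_weight=0.5):
    """Return the win rate in percent (0-100) over a list of per-prompt verdicts.

    tie_weight is an assumption: a tie counts as half a win by default.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - {WIN, TIE, LOSS}
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    score = counts[WIN] + tie_weight * counts[TIE]
    return 100.0 * score / len(verdicts)


# Invented example: 500 prompts with 400 wins, 20 ties, and 80 losses.
verdicts = [WIN] * 400 + [TIE] * 20 + [LOSS] * 80
print(f"{win_rate(verdicts):.2f} % win rate")  # 82.00 % win rate
```

Setting `tie_weight=0` would score ties as losses, which lowers the result; which convention applies here is not stated on this page.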

Limitations

Auto-judged by a single GPT-4-Turbo instance, so scores are susceptible to judge bias. Less robust than live Arena Elo. Reflects only static prompts, not open-ended conversation.

By provider

OpenAI · 2 models
Best: 82.63 % win rate (GPT-4)
Average: 80.92 % win rate

Full leaderboard

#	Model	Provider	Score (% win rate)	Source	Date
1	GPT-4	OpenAI	82.63	Third-party Papers With Code	Apr 18, 2026
2	GPT-4o	OpenAI	79.21	Third-party Papers With Code	Apr 18, 2026
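
The summary figures under "At a glance" can be checked against the two leaderboard rows. The snippet below is a minimal sketch of that arithmetic using Python's standard `statistics` module; the `scores` dictionary and the `summary` keys are names chosen for this example only.

```python
from statistics import mean, median

# Scores copied from the leaderboard rows above (% win rate).
scores = {"GPT-4": 82.63, "GPT-4o": 79.21}

values = list(scores.values())
summary = {
    "top_model": max(scores, key=scores.get),
    "average": round(mean(values), 2),
    "median": round(median(values), 2),
    "low": min(values),
    "high": max(values),
}
print(summary)
```

With only two results the average and the median are the same midpoint, (82.63 + 79.21) / 2 = 80.92, so neither says much about the spread yet.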
