Arena Hard Auto
Human preference · % win rate
Automatic proxy for LMArena Elo: pairwise head-to-head comparisons graded by GPT-4-Turbo against a curated set of 500 hard prompts. Strong predictor of human preference ratings.
Verified results come from third-party or public leaderboard sources. Self-reported results come from provider papers, blogs, or vendor disclosures and should be compared with extra caution.
At a glance
Score distribution
Methodology
500 challenging prompts (coding, math, reasoning, creative writing). Each model's output is compared head-to-head against a reference response by GPT-4-Turbo. Score is the win-rate (0–100%).
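As a rough illustration of the scoring step, the sketch below turns a list of per-prompt judge verdicts into a win-rate percentage. The function name `win_rate`, the verdict labels, and the choice to count a tie as half a win are assumptions made for this example; the description above does not say how ties are handled, and the verdict counts are made up.

```python
from collections import Counter

VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts, tie_weight=0.5):
    """Return the win rate in percent (0-100) for a list of judge verdicts.

    Each verdict is 'win', 'tie', or 'loss' from the candidate model's
    point of view. tie_weight is an assumption of this sketch: a tie
    counts as half a win by default.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    score = counts["win"] + tie_weight * counts["tie"]
    return 100.0 * score / len(verdicts)


# Hypothetical verdicts for 500 prompts: 400 wins, 20 ties, 80 losses.
verdicts = ["win"] * 400 + ["tie"] * 20 + ["loss"] * 80
print(f"{win_rate(verdicts):.1f}% win rate")  # (400 + 0.5 * 20) / 500 = 82.0%
```

Setting `tie_weight=0.0` gives the stricter reading in which only outright wins count.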
Limitations
Auto-judged by a single GPT-4-Turbo instance, so scores are susceptible to judge bias. Less robust than live Arena Elo. Reflects only a static prompt set, not open-ended conversation.
By provider
- Average: 80.92% win rate; Best: 82.63% win rate
Full leaderboard
Showing 2 of 2