GPQA Diamond
Reasoning · Measured in % accuracy
Graduate-level, Google-proof science reasoning questions.
Updated 4 days ago · Latest measured Apr 18, 2026 · 10 verified · 6 self-reported
Verified results come from third-party or public leaderboard sources. Self-reported results come from provider papers, blogs, or vendor disclosures and should be compared with extra caution.
At a glance
🏆 Top score
87.88 % accuracy (Claude Opus 4.7, Anthropic)
Total results
16
Models tested
16
Providers
6
Verified · Self-reported
10 · 6
Average
72.22 % accuracy
Median
73 % accuracy
Range
35.35 – 87.88 % accuracy
Latest result
Apr 18, 2026
Score distribution
[Histogram of the 16 results across score bins; bin labels not preserved in this export.]
Methodology
Expert-written multiple-choice questions in biology, chemistry, and physics, designed to be difficult even with web search. The Diamond subset (198 questions) is the hardest.
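As a rough illustration of how a multiple-choice benchmark like this is scored, here is a minimal sketch in Python. The `questions` structure and the `ask_model` callback are hypothetical stand-ins for a dataset loader and a model API, not part of any published GPQA harness.

```python
import random

def score_mcq(questions, ask_model, seed=0):
    """Score a model on four-option multiple-choice questions, returning % accuracy.

    `questions` is a list of dicts with keys "stem", "correct", and "distractors"
    (three strings); `ask_model` takes a prompt string and returns an answer letter.
    Both are hypothetical stand-ins used only for illustration.
    """
    rng = random.Random(seed)
    letters = "ABCD"
    correct = 0
    for q in questions:
        options = [q["correct"]] + list(q["distractors"])
        rng.shuffle(options)  # shuffle so the answer position carries no signal
        prompt = q["stem"] + "\n" + "\n".join(
            f"{letters[i]}. {opt}" for i, opt in enumerate(options)
        )
        choice = ask_model(prompt).strip().upper()[:1]  # expect "A".."D"
        if choice in letters and options[letters.index(choice)] == q["correct"]:
            correct += 1
    return 100.0 * correct / len(questions)
```

With 198 questions in the Diamond split, a single question is worth about half a percentage point, which is why scores land on values such as 87.88 (174/198) or 35.35 (70/198).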
Limitations
The test set is small (198 questions), so scores have high variance. It also measures only a narrow slice of scientific reasoning.
By provider
- Anthropic · 3 models · Best: 87.88 % accuracy (Claude Opus 4.7) · Average: 82.24 % accuracy
- OpenAI · 7 models · Best: 87.7 % accuracy (o3) · Average: 74.07 % accuracy
Full leaderboard
Showing 16 of 16
| # | Model | Provider | Score (% accuracy) |
|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 87.88 |
| 2 | o3 | OpenAI | 87.7 |
| 3 | Claude Opus 4.6 | Anthropic | 84.85 |
| 4 | Grok 3 | xAI | 84.6 |
| 5 | GPT-5.2 | OpenAI | 82.32 |
| 6 | o1 | OpenAI | 78 |
| 7 | GPT-5.4 | OpenAI | 77.27 |
| 8 | Claude Opus 4 | Anthropic | 74 |
| 9 | GPT-5 | OpenAI | 72 |
| 10 | DeepSeek R1 | DeepSeek | 71.5 |
| 11 | Gemma 4 31B | Google | 69.7 |
| 12 | Gemini 2 Pro | Google | 68.1 |
| 13 | Qwen3.5-27B | Alibaba | 61.11 |
| 14 | GPT-5.4 mini | OpenAI | 60.61 |
| 15 | GPT-5.4 nano | OpenAI | 60.61 |
| 16 | Qwen3.5 397B A17B | Alibaba | 35.35 |
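The figures in the at-a-glance section are plain descriptive statistics over these 16 scores. A quick sketch reproducing them, with the scores copied from the table above:

```python
from statistics import mean, median

# Scores copied from the leaderboard table (percent accuracy).
scores = [87.88, 87.7, 84.85, 84.6, 82.32, 78, 77.27, 74,
          72, 71.5, 69.7, 68.1, 61.11, 60.61, 60.61, 35.35]

print(f"Average: {mean(scores):.2f}")              # ~72.22
print(f"Median:  {median(scores):.0f}")            # 73
print(f"Range:   {min(scores)} - {max(scores)}")   # 35.35 - 87.88
```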