GSM8K
Math · % accuracy
Grade-school math word problems.
Updated Jun 1, 2025 · Latest measured Jun 1, 2025 · 0 verified · 8 self-reported
Verified results come from third-party or public leaderboard sources. Self-reported results come from provider papers, blogs, or vendor disclosures and should be compared with extra caution.
At a glance
- 🏆 Top score: 97.1 % accuracy (DeepSeek V3)
- Total results: 8
- Models tested: 8
- Providers: 6
- Verified · Self-reported: 0 · 8
- Average: 95.25 % accuracy
- Median: 95.6 % accuracy
- Range: 93 – 97.1 % accuracy
- Latest result: Jun 1, 2025
Score distribution
[Histogram: number of results per score bin, in % accuracy]
Methodology
8.5k grade-school math word problems; final numeric answer is checked.
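The final-answer check described above can be sketched roughly as follows. This is a minimal illustration, not the harness used for any score on this page: it assumes the last number in the model output is its final answer and that the reference ends with a `#### <number>` marker, as in the public GSM8K answer format. The names `extract_final_number`, `is_correct`, and `accuracy` are hypothetical.

```python
import re
from typing import Optional, Sequence

# Integers or decimals, with an optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(model_output: str, reference: str, tol: float = 1e-6) -> bool:
    """Compare the final numbers of the output and the reference (assumed rule)."""
    predicted = extract_final_number(model_output)
    expected = extract_final_number(reference)
    if predicted is None or expected is None:
        return False
    return abs(predicted - expected) <= tol


def accuracy(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of outputs whose final number matches the reference."""
    correct = sum(is_correct(o, r) for o, r in zip(outputs, references))
    return 100.0 * correct / len(references)
```

Real evaluation harnesses differ in how they prompt the model and normalize answers (units, fractions, rounding), which is one reason self-reported scores are hard to compare directly.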
Limitations
Mostly saturated on frontier models: all 8 listed scores fall between 93 and 97.1 % accuracy, which leaves little headroom for differentiating models.
By provider
- DeepSeek · 1 model · DeepSeek V3 (97.1 % accuracy) · Average: 97.1 % accuracy · Best: 97.1 % accuracy
- OpenAI · 3 models · Average: 95.43 % accuracy · Best: 97 % accuracy
Full leaderboard
Showing 8 of 8

| # | Model | Provider | Score (% accuracy) |
|---|---|---|---|
| 1 | DeepSeek V3 | DeepSeek | 97.1 |
| 2 | o3-mini | OpenAI | 97 |
| 3 | GPT-5 | OpenAI | 96.1 |
| 4 | Phi-4 | Microsoft | 95.8 |
| 5 | Claude Opus 4 | Anthropic | 95.4 |
| 6 | Gemini 2 Pro | Google | 94.4 |
| 7 | GPT-4o mini | OpenAI | 93.2 |
| 8 | Llama 3 70B | Meta | 93 |
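
The summary figures on this page (average, median, range, per-provider averages) can be recomputed from the leaderboard rows. The sketch below copies the eight scores from the table. The provider for Gemini 2 Pro is assumed to be Google, since that cell is not filled in the table. The variable names are illustrative only.

```python
from collections import defaultdict
from statistics import mean, median

# (model, provider, score in % accuracy), copied from the leaderboard table.
ROWS = [
    ("DeepSeek V3", "DeepSeek", 97.1),
    ("o3-mini", "OpenAI", 97.0),
    ("GPT-5", "OpenAI", 96.1),
    ("Phi-4", "Microsoft", 95.8),
    ("Claude Opus 4", "Anthropic", 95.4),
    ("Gemini 2 Pro", "Google", 94.4),  # provider is an assumption
    ("GPT-4o mini", "OpenAI", 93.2),
    ("Llama 3 70B", "Meta", 93.0),
]

scores = [score for _, _, score in ROWS]

# Overall summary, rounded to two decimals like the page's figures.
summary = {
    "average": round(mean(scores), 2),
    "median": round(median(scores), 2),
    "min": min(scores),
    "max": max(scores),
}

# Group scores by provider, then compute count, average, and best score.
by_provider = defaultdict(list)
for _, provider, score in ROWS:
    by_provider[provider].append(score)

provider_stats = {
    provider: {
        "models": len(vals),
        "average": round(mean(vals), 2),
        "best": max(vals),
    }
    for provider, vals in by_provider.items()
}

print(summary)
print(provider_stats["OpenAI"])
```

Run as written, this reproduces the 95.25 average, 95.6 median, and 93 – 97.1 range shown in the summary, and a 95.43 average across the three OpenAI models.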