GPQA Diamond
Category: Reasoning
PhD-level scientific reasoning across biology, physics, and chemistry.
Metrics
- Accuracy (%)
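Accuracy is the share of questions answered correctly, expressed as a percentage. The following is a minimal sketch of that calculation, assuming predictions and gold answers are given as option letters. The function name `accuracy_percent` and the sample data are hypothetical and are not part of any evaluation harness.

```python
def accuracy_percent(predictions, gold):
    """Return the percentage of predictions that match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches between predicted and gold option letters.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical toy data: four multiple-choice questions, three answered correctly.
preds = ["A", "C", "B", "D"]
answers = ["A", "C", "D", "D"]
print(f"{accuracy_percent(preds, answers):.1f}%")  # prints "75.0%"
```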
How to Run
pip install lm-eval && lm_eval --model hf --model_args pretrained=MODEL --tasks gpqa_diamond_zeroshot --batch_size auto
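To launch the same evaluation from a script, the command can be assembled as an argument list and passed to `subprocess`. This is only a sketch under stated assumptions: the helper names `build_lm_eval_command` and `run_eval` are hypothetical, `MODEL` remains a placeholder, and the default task name `gpqa_diamond_zeroshot` is an assumption about the variant exposed by the installed harness version, so substitute whichever GPQA Diamond task name your installation provides.

```python
import shlex
import subprocess


def build_lm_eval_command(model, task="gpqa_diamond_zeroshot", batch_size="auto"):
    """Build the lm_eval argument list for a Hugging Face model.

    The default task name is an assumption; pass the name your harness lists.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model}",
        "--tasks", task,
        "--batch_size", str(batch_size),
    ]


def run_eval(model, **kwargs):
    """Run the harness; raises CalledProcessError if it exits with an error."""
    return subprocess.run(build_lm_eval_command(model, **kwargs), check=True)


# MODEL is a placeholder; printing the command does not start an evaluation.
print(shlex.join(build_lm_eval_command("MODEL")))
```

Building the command as a list rather than a single string avoids shell quoting problems when the model path contains spaces or special characters. Calling `run_eval("MODEL")` would start the actual run and requires `lm-eval` to be installed.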
Leaderboard
| Rank | Model | Provider | Parameters | Score |
|---|---|---|---|---|
| 1 | GPT-5.2 Thinking | OpenAI | Unknown | 92.4% |
| 2 | Gemini 3 Pro | Google | Unknown | 91.9% |
| 3 | Gemini 3 Flash | Google | Unknown | 90.4% |
| 4 | Claude Opus 4.5 | Anthropic | Unknown | 87.0% |
| 5 | DeepSeek V3 | DeepSeek | Unknown | 78.2% |
| 6 | DeepSeek-R1 | DeepSeek | 671B MoE | 71.5% |