Benchmarks

Category: Reasoning

ARC-AGI-2

Abstraction and Reasoning Corpus for AGI evaluation.

Metrics: Accuracy (%)

Category: Reasoning

GPQA Diamond

PhD-level scientific reasoning across biology, physics, and chemistry.

Metrics: Accuracy (%)