Benchmarks

Coding

HumanEval

Function-level code generation with unit tests.

Metrics: pass@1 (%)
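As a rough sketch of how a pass@1 figure can be computed, the snippet below uses the unbiased pass@k estimator commonly paired with HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass the unit tests. The function names and the `(n, c)` input format are illustrative assumptions, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, as a percentage.

    `results` is a hypothetical list of (n, c) pairs, one per problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1, k = 1), this reduces to the plain percentage of problems whose one generated solution passes its tests.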
Coding

SWE-bench Pro

Harder version of SWE-bench for professional coding tasks.

Metrics: Resolved (%)
Coding

SWE-bench Verified

Real GitHub issues from popular open-source repositories, drawn from a human-validated subset of SWE-bench. Widely treated as the reference benchmark for coding.

Metrics: Resolved (%)
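The Resolved (%) metric used by the SWE-bench entries above is, in essence, the share of task instances whose generated patch makes the issue's tests pass. A minimal sketch, assuming the per-instance outcomes are already available as a mapping from instance ID to a boolean (the function name, input format, and IDs are hypothetical):

```python
def resolved_rate(outcomes: dict[str, bool]) -> float:
    """Return the percentage of instances marked as resolved.

    `outcomes` maps a (hypothetical) instance ID to True when the
    generated patch resolved that instance, and False otherwise.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Illustrative IDs only; they do not refer to real benchmark instances.
example = {"repo-a-1": True, "repo-a-2": False, "repo-b-1": True, "repo-b-2": False}
print(f"Resolved: {resolved_rate(example):.1f}%")
```

Unattempted or errored instances would normally count as unresolved, so they should appear in the mapping with `False` rather than be omitted.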