Coding
Function-level code generation with unit tests.
Metrics: pass@1 (%)
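As a rough illustration of how a pass@1 percentage can be computed, the sketch below uses the commonly cited unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass all unit tests. The function names and the input layout are assumptions made for this example. The benchmark's own evaluation harness may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems, as a percentage.

    `results` holds one (n_samples, n_correct) pair per problem
    (a hypothetical layout chosen for this sketch).
    """
    if not results:
        raise ValueError("results must not be empty")
    total = sum(pass_at_k(n, c, 1) for n, c in results)
    return 100.0 * total / len(results)
```

With a single sample per problem, pass@1 reduces to the share of problems whose one generated solution passes its tests.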
Coding
Harder version of SWE-bench for professional coding tasks.
Metrics: Resolved (%)
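A minimal sketch of how a Resolved percentage might be tallied, assuming the SWE-bench-style convention that a task counts as resolved only when every previously failing target test now passes and no previously passing test breaks. The `TaskResult` type and its field names are hypothetical and are not taken from any official harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record of one task's test outcomes after applying a patch.
    fail_to_pass: list[bool]  # tests that failed before; True = passes now
    pass_to_pass: list[bool]  # tests that passed before; True = still passes


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved if all target tests pass and nothing regressed."""
    # Assumption: a task with no target tests is never counted as resolved.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass)
        and all(result.pass_to_pass)
    )


def resolved_rate(results: list[TaskResult]) -> float:
    """Share of resolved tasks, as a percentage."""
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Under this definition a patch that fixes the reported issue but breaks an unrelated test does not count as resolved.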
Coding
Real GitHub issues from popular repositories. Widely treated as the reference benchmark for coding.
Metrics: Resolved (%)