Coding
Function-level code generation with unit tests.
Metrics: pass@1 (%)
Coding
Harder version of SWE-bench for professional coding tasks.
Metrics: Resolved (%)
Coding
Real GitHub issues from popular repositories. Widely used as a reference benchmark for coding.
Metrics: Resolved (%)
Japanese
100 questions from ELYZA that require Japanese knowledge and reasoning.
Metrics: Score (0-5)
Japanese
Japanese commonsense reasoning QA dataset with 5-choice questions.
Metrics: Accuracy (%)
Japanese
Japanese General Language Understanding Evaluation: text classification, sentence-pair tasks, and QA.
Metrics: Accuracy (%)
Japanese
Japanese version of MT-Bench for multi-turn conversation evaluation.
Metrics: Score (1-10)
Japanese
Comprehensive Japanese LLM evaluation covering reasoning, knowledge, coding, and safety.
Metrics: Score (0-1)
Knowledge
Harder version of MMLU with 10 answer choices.
Metrics: Accuracy (%)
Math
American Invitational Mathematics Examination 2025.
Metrics: Accuracy (%)
Math
Cutting-edge mathematics problems (Tiers 1-3).
Metrics: Accuracy (%)
Overall
Human preference ranking from blind comparisons.
Metrics: Elo Score
Reasoning
Abstraction and Reasoning Corpus for AGI evaluation.
Metrics: Accuracy (%)
Reasoning
PhD-level scientific reasoning across biology, physics, and chemistry.
Metrics: Accuracy (%)
Vision
Massive Multi-discipline Multimodal Understanding (harder version).
Metrics: Accuracy (%)