Benchmarks

Coding

HumanEval

Function-level Python code generation: 164 hand-written problems, each checked against unit tests.

Metrics: pass@1 (%)
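For reference, pass@1 is the k = 1 case of the pass@k metric commonly reported for HumanEval. The sketch below shows the usual unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The function names and the (n, c) input layout are illustrative assumptions, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, as a percentage.

    `results` holds one hypothetical (n, c) pair per problem.
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)
```

With a single sample per problem (n = 1, k = 1), this reduces to the plain fraction of problems whose one generated solution passes its tests.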
Coding

SWE-bench Pro

Harder variant of SWE-bench built around professional-level software engineering tasks.

Metrics: Resolved (%)
Coding

SWE-bench Verified

Real GitHub issues from popular Python repositories, in a human-validated subset of SWE-bench. Widely used as a reference benchmark for coding.

Metrics: Resolved (%)
Japanese

ELYZA-tasks-100

100 questions from ELYZA requiring Japanese knowledge and reasoning.

Metrics: Score (0-5)
Japanese

JCommonsenseQA

Japanese commonsense reasoning QA dataset with 5-choice questions.

Metrics: Accuracy (%)
Japanese

JGLUE

Japanese General Language Understanding Evaluation: text classification, sentence-pair classification, and QA.

Metrics: Accuracy (%)
Japanese

Japanese MT-Bench

Japanese version of MT-Bench for multi-turn conversation evaluation.

Metrics: Score (1-10)
Japanese

Nejumi 4

Comprehensive Japanese LLM evaluation covering reasoning, knowledge, coding, and safety.

Metrics: Score (0-1)
Knowledge

MMLU-Pro

Harder version of MMLU with up to 10 answer choices per question instead of 4.

Metrics: Accuracy (%)
Math

AIME 2025

American Invitational Mathematics Examination 2025.

Metrics: Accuracy (%)
Math

FrontierMath

Original, expert-level mathematics problems (Tiers 1-3).

Metrics: Accuracy (%)
Overall

LMArena Elo

Human preference ranking from blind pairwise comparisons of model responses.

Metrics: Elo score
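To give a sense of how an Elo-style score relates to head-to-head preferences, here is a minimal sketch of the classic Elo formulas: the expected score of A against B, and the rating update after one comparison. The K-factor of 32 and the function names are assumptions for illustration; the leaderboard's actual rating computation may differ from this textbook version.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor, chosen only for illustration
) -> tuple[float, float]:
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Under this model, equal ratings imply a 50% preference rate, and a 400-point gap corresponds to roughly a 10:1 preference ratio for the higher-rated model.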
Reasoning

ARC-AGI-2

Second version of the Abstraction and Reasoning Corpus: grid-based puzzles that test abstract reasoning from a few examples.

Metrics: Accuracy (%)
Reasoning

GPQA Diamond

PhD-level scientific reasoning across biology, physics, and chemistry.

Metrics: Accuracy (%)
Vision

MMMU-Pro

Harder version of MMMU (Massive Multi-discipline Multimodal Understanding).

Metrics: Accuracy (%)