MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Background

MMLU-Pro is an enhanced version of the original MMLU benchmark. It addresses model saturation by expanding to 12,000 graduate-level questions and offering ten answer choices instead of four.

MMLU-Pro emphasizes deeper reasoning over knowledge recall, creating a more challenging evaluation that better discriminates between advanced language models.
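One reason ten options help discriminate between models is that the random-guessing floor drops. The sketch below is only an illustration: it assumes uniform random guessing and that every question has exactly the stated number of options, and the function name `chance_accuracy` is hypothetical.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Assumption: four options per question (MMLU) versus ten (MMLU-Pro).
mmlu_floor = chance_accuracy(4)
mmlu_pro_floor = chance_accuracy(10)

print(f"MMLU guessing floor:     {mmlu_floor:.0%}")
print(f"MMLU-Pro guessing floor: {mmlu_pro_floor:.0%}")
```

Under these assumptions the guessing floor falls from 25% to 10%, which leaves a wider range of scores for separating models.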

Methodology

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

View on arXiv

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen.

Highlights

Gemini 3 Pro Preview (high) leads MMLU-Pro with a score of 89.8%, followed by Gemini 3 Pro Preview (low) and Claude Opus 4.5 (Reasoning), both at 89.5%.

MMLU-Pro Benchmark Leaderboard: Results

Independently conducted by Artificial Analysis

MMLU-Pro Benchmark Leaderboard: Token Usage

Chart: tokens used to run the evaluation, broken down into input, reasoning, and answer tokens.

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

MMLU-Pro Benchmark Leaderboard: Cost Breakdown

Chart: cost (USD) to run the evaluation, broken down into input, reasoning, and answer cost.

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
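The cost calculation described above can be sketched roughly as follows. This is an illustrative reconstruction, not Artificial Analysis' actual code: it assumes prices are quoted per million tokens and that reasoning tokens are billed at the output-token price. The names `TokenUsage` and `evaluation_cost`, along with all token counts and prices, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one evaluation run (hypothetical container)."""
    input_tokens: int
    reasoning_tokens: int
    answer_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.reasoning_tokens + self.answer_tokens


def evaluation_cost(usage: TokenUsage,
                    input_price_per_m: float,
                    output_price_per_m: float) -> dict[str, float]:
    """Split the run cost (USD) into input, reasoning, and answer components.

    Assumption: reasoning and answer tokens are both billed at the output price.
    """
    breakdown = {
        "input": usage.input_tokens * input_price_per_m / 1_000_000,
        "reasoning": usage.reasoning_tokens * output_price_per_m / 1_000_000,
        "answer": usage.answer_tokens * output_price_per_m / 1_000_000,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Made-up numbers purely for illustration.
usage = TokenUsage(input_tokens=4_000_000,
                   reasoning_tokens=20_000_000,
                   answer_tokens=2_000_000)
costs = evaluation_cost(usage, input_price_per_m=2.0, output_price_per_m=10.0)

print(f"Total tokens: {usage.total_tokens:,}")
for part, usd in costs.items():
    print(f"{part:>9}: ${usd:,.2f}")
```

With these made-up figures, reasoning tokens dominate the bill, which is why the page reports reasoning cost as its own component.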

MMLU-Pro Benchmark Leaderboard: Score vs. Release Date

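Chart: MMLU-Pro score plotted against model release date, with a highlighted "most attractive region". Legend (model creators): Amazon, Anthropic, DeepSeek, Google, Korea Telecom, KwaiKAT, LG AI Research.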

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, and Elo ratings are derived from blind pairwise comparisons.

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 338 sub-tasks derived from 80 genuine laboratory problems across 16 scientific disciplines.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.
