
AI Model Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating seven challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.
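To make the idea of a composite index concrete, the Python sketch below averages per-benchmark scores on a 0-100 scale with equal weights. The equal weighting, the benchmark selection, and the example numbers are illustrative assumptions, not the published index methodology.

    # Minimal sketch of a composite score: an equal-weighted mean of per-benchmark
    # results on a 0-100 scale. Weighting, benchmark choice, and numbers are
    # illustrative assumptions, not the published Intelligence Index methodology.

    def composite_index(scores: dict[str, float]) -> float:
        """Average a model's per-benchmark scores (each on a 0-100 scale)."""
        if not scores:
            raise ValueError("no benchmark scores provided")
        return sum(scores.values()) / len(scores)

    example_scores = {          # hypothetical numbers for illustration only
        "MMLU-Pro": 78.4,
        "GPQA Diamond": 59.1,
        "LiveCodeBench": 44.7,
    }
    print(f"Composite score: {composite_index(example_scores):.1f}")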

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.
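To illustrate the style of evaluation involved, the sketch below grades a candidate Python solution against stdin/stdout test cases in a subprocess, as is typical for competitive programming problems. It is a generic harness under assumed conventions, not LiveCodeBench's own evaluation code; the solution file and test cases are hypothetical.

    # Generic sketch of stdin/stdout test-case grading for competitive programming
    # problems. Not LiveCodeBench's own harness; file name and tests are hypothetical.
    import subprocess
    import sys

    def passes_all_tests(solution_path: str, tests: list[tuple[str, str]]) -> bool:
        """Run a candidate Python solution on each (stdin, expected stdout) pair."""
        for stdin_text, expected in tests:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=10,  # guard against runaway solutions
            )
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
        return True

    # Hypothetical usage: a solution expected to print the sum of two integers.
    # print(passes_all_tests("solution.py", [("1 2\n", "3"), ("10 -4\n", "6")]))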

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 338 sub-tasks derived from 80 genuine laboratory problems across 16 scientific disciplines.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across seven domains including algebra, geometry, and number theory.

IFBench Benchmark Leaderboard

A benchmark for precise instruction following that tests generalization to 58 diverse, verifiable, out-of-domain constraints on model output.
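As an illustration of what a verifiable output constraint looks like, the sketch below checks a response against a word-count cap and a required keyword. Both constraints are hypothetical examples, not drawn from IFBench's actual set of 58.

    # Sketch of a programmatically verifiable output constraint. The specific
    # constraints (word-count cap, required keyword) are hypothetical examples,
    # not taken from IFBench itself.

    def satisfies_constraint(response: str, max_words: int, required_keyword: str) -> bool:
        """Check verifiable requirements on a model response."""
        word_count = len(response.split())
        return word_count <= max_words and required_keyword.lower() in response.lower()

    print(satisfies_constraint(
        "Photosynthesis converts light into chemical energy.",
        max_words=10,
        required_keyword="photosynthesis",
    ))  # True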

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000 to 999.
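Because every AIME answer is an integer from 000 to 999, grading reduces to an exact integer match. The sketch below shows one way to do this; the normalization choices (stripping whitespace and leading zeros) are assumptions, not a documented scoring rule.

    # Sketch of AIME-style grading: answers are integers from 000 to 999, so
    # scoring is exact integer match. Normalization choices are assumptions.

    def grade_aime_answer(predicted: str, reference: str) -> bool:
        """Compare two AIME answers as integers in the range 0-999."""
        try:
            pred, ref = int(predicted.strip()), int(reference.strip())
        except ValueError:
            return False
        return 0 <= pred <= 999 and pred == ref

    print(grade_aime_answer("042", "42"))    # True: leading zeros do not matter
    print(grade_aime_answer("1000", "999"))  # False: out of range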

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).
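The cl100k_base tokenizer named above is available through the tiktoken library, so document length can be measured as shown below. The length-bucketing check is an illustrative sketch, not the benchmark's own pipeline.

    # Counting document length with the cl100k_base tokenizer via tiktoken.
    # The 10k-100k range check is illustrative, not the benchmark's pipeline.
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")

    def token_count(document: str) -> int:
        """Return the number of cl100k_base tokens in a document."""
        return len(encoding.encode(document))

    sample = "Long-form source material goes here... " * 1000
    n_tokens = token_count(sample)
    print(f"{n_tokens} tokens; within 10k-100k: {10_000 <= n_tokens <= 100_000}")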
