AI Model Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating seven challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.
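The exact aggregation method is not stated here; the sketch below assumes a simple equal-weighted mean over the seven component scores. The benchmark names are real, but the numbers are illustrative placeholders, not published results.

```python
# Minimal sketch of an equal-weighted composite over the seven component
# benchmarks. The weighting is an assumption and the scores below are
# illustrative placeholders, not published results.
component_scores = {
    "MMLU-Pro": 78.0,
    "GPQA Diamond": 59.0,
    "Humanity's Last Exam": 12.0,
    "LiveCodeBench": 55.0,
    "SciCode": 36.0,
    "MATH-500": 91.0,
    "AIME 2024": 43.0,
}

def composite_index(scores: dict[str, float]) -> float:
    """Equal-weighted mean of component accuracies on a 0-100 scale."""
    return sum(scores.values()) / len(scores)

print(f"Composite index (sketch): {composite_index(component_scores):.1f}")
```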

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with roughly 12,000 graduate-level questions across 14 subject areas, expanding each question from four to ten answer options and demanding deeper multi-step reasoning.
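Scoring such a benchmark reduces to exact-match accuracy over ten option letters (A-J). The sketch below, with a hypothetical extract_choice helper and toy outputs, is illustrative only and is not the official MMLU-Pro harness.

```python
import re

# Minimal sketch of scoring ten-option (A-J) multiple-choice items by
# exact letter match. The regex and examples are illustrative; the
# official MMLU-Pro harness may extract answers differently.
OPTION_LETTERS = "ABCDEFGHIJ"

def extract_choice(model_output: str) -> str | None:
    """Take the last standalone option letter A-J in the model's reply."""
    matches = re.findall(r"\b([A-J])\b", model_output)
    return matches[-1] if matches else None

def accuracy(predictions: list[str], gold: list[str]) -> float:
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)

print(accuracy(["The answer is (C).", "I choose B"], ["C", "D"]))  # 0.5
```

Taking the last standalone letter is a common heuristic for models that restate several options while reasoning before committing to one.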

GPQA Diamond Benchmark Leaderboard

The hardest 198-question subset of GPQA, a benchmark on which PhD-level domain experts reach 65% accuracy while skilled non-experts reach only 34% despite unrestricted web access.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, the sciences, and the humanities, designed to be the final closed-ended academic benchmark of its kind.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and Codeforces, evaluating code generation, self-repair, code execution, and test output prediction.
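Functional correctness on such problems is typically judged by executing the generated program against held-out input/output pairs. The sketch below is a generic stdin/stdout checker under that assumption, not LiveCodeBench's own sandboxed harness.

```python
import subprocess
import sys

# Minimal sketch of functional-correctness checking: run a generated
# stdin/stdout program against input/output pairs in a subprocess and
# count it as passed only if every case matches. Illustrative only;
# a real harness also sandboxes execution and supports repair rounds.
def passes_all_tests(solution_code: str, tests: list[tuple[str, str]],
                     timeout: float = 5.0) -> bool:
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_data, capture_output=True,
                text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.stdout.strip() != expected.strip():
            return False
    return True

candidate = "print(sum(int(x) for x in input().split()))"
print(passes_all_tests(candidate, [("1 2 3", "6"), ("10 20", "30")]))  # True
```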

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark of 338 sub-tasks decomposed from 80 real research problems spanning 16 natural-science subfields.
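Because each main problem is decomposed into sub-tasks, two natural aggregate metrics arise: the sub-task pass rate, and the stricter rate of main problems whose sub-tasks all pass. A minimal sketch with an illustrative results mapping:

```python
# Minimal sketch of the two aggregate metrics a decomposed benchmark like
# SciCode naturally reports. The results mapping is an illustrative
# placeholder, not actual benchmark data.
results: dict[str, list[bool]] = {
    "problem_01": [True, True, False],   # per-sub-task pass/fail
    "problem_02": [True, True],
    "problem_03": [False, True, True, True],
}

subtasks = [ok for subs in results.values() for ok in subs]
subtask_rate = sum(subtasks) / len(subtasks)
main_rate = sum(all(subs) for subs in results.values()) / len(results)

print(f"sub-task pass rate: {subtask_rate:.2f}")       # 0.78
print(f"main-problem pass rate: {main_rate:.2f}")      # 0.33
```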

MATH-500 Benchmark Leaderboard

A 500-problem subset of the MATH dataset, featuring competition-level mathematics across its seven subject areas, including algebra, geometry, and number theory.
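Grading free-form solutions against a reference answer usually means extracting the final \boxed{...} expression and comparing after normalization. The sketch below handles only non-nested boxed answers and light string normalization; real graders typically add symbolic equivalence checks (e.g. via a computer algebra system).

```python
import re

# Minimal sketch of exact-match grading for MATH-style answers: extract the
# last \boxed{...} expression (non-nested braces only) and compare after
# light normalization. Illustrative only, not an official grader.
def extract_boxed(solution: str) -> str | None:
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

def normalize(ans: str) -> str:
    return ans.replace(" ", "").rstrip(".")

def is_correct(model_solution: str, reference_answer: str) -> bool:
    pred = extract_boxed(model_solution)
    return pred is not None and normalize(pred) == normalize(reference_answer)

print(is_correct(r"... therefore $x = \boxed{12\pi}$.", r"12\pi"))  # True
```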

AIME 2024 Benchmark Leaderboard

All 30 problems from the two 2024 American Invitational Mathematics Examination papers (AIME I and II), testing olympiad-level mathematical reasoning with integer answers from 000 to 999.
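Because every AIME answer is an integer from 000 to 999, grading reduces to parsing the model's final integer and exact-matching it against the reference. A minimal sketch with hypothetical helper names:

```python
import re

# Minimal sketch of exact-match grading for AIME-style answers: each answer
# is an integer from 000 to 999, so scoring reduces to parsing the model's
# final 1-3 digit integer and comparing it to the reference value.
def parse_aime_answer(model_output: str) -> int | None:
    matches = re.findall(r"\b\d{1,3}\b", model_output)
    if not matches:
        return None
    value = int(matches[-1])
    return value if 0 <= value <= 999 else None

def score(predictions: list[str], gold: list[int]) -> float:
    correct = sum(parse_aime_answer(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)

print(score(["The answer is 073.", "So N = 204"], [73, 204]))  # 1.0
```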