AI Model Evaluations
Artificial Analysis Intelligence Index
A composite index aggregating the seven challenging evaluations listed below into a single holistic measure of AI capabilities across mathematics, science, coding, and reasoning.
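For concreteness, here is a minimal sketch of how a composite of the seven component scores could be computed, assuming an equally weighted mean over a common 0-100 scale; the index's actual weighting and normalisation are not specified here, so treat this purely as an illustration.

```python
# Illustrative sketch only: combine seven benchmark scores (each assumed to be
# on a 0-100 scale) into a single composite using an equally weighted mean.
# The real Intelligence Index weighting and normalisation may differ.

COMPONENT_BENCHMARKS = [
    "MMLU-Pro",
    "GPQA Diamond",
    "Humanity's Last Exam",
    "LiveCodeBench",
    "SciCode",
    "MATH-500",
    "AIME 2024",
]

def composite_index(scores: dict[str, float]) -> float:
    """Return the equally weighted mean of the seven component scores."""
    missing = [name for name in COMPONENT_BENCHMARKS if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return sum(scores[name] for name in COMPONENT_BENCHMARKS) / len(COMPONENT_BENCHMARKS)

# Example with made-up numbers:
example_scores = {
    "MMLU-Pro": 80.1, "GPQA Diamond": 66.3, "Humanity's Last Exam": 12.4,
    "LiveCodeBench": 55.0, "SciCode": 38.7, "MATH-500": 92.5, "AIME 2024": 60.0,
}
print(round(composite_index(example_scores), 1))  # 57.9
```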
MMLU-Pro Benchmark Leaderboard
An enhanced version of MMLU with around 12,000 graduate-level questions across 14 subject areas, offering ten answer options per question (up from MMLU's four) and demanding deeper reasoning.
GPQA Diamond Benchmark Leaderboard
The most challenging 198-question subset of GPQA; on the full benchmark, PhD-level domain experts reach about 65% accuracy while skilled non-experts reach only 34%, even with unrestricted web access.
Humanity's Last Exam Benchmark Leaderboard
A frontier-level benchmark of 2,500 expert-vetted questions across mathematics, the sciences, and the humanities, designed to be the final closed-ended academic benchmark of its kind.
LiveCodeBench Benchmark Leaderboard
A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and Codeforces, evaluating code generation, self-repair, and code execution.
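A minimal sketch of the execution-based grading such a benchmark relies on: run a candidate program against held-out test cases and count it correct only if every case passes. This is an illustrative harness under simplifying assumptions (no sandboxing, stdin/stdout problems only), not LiveCodeBench's actual implementation.

```python
import subprocess
import sys

# Illustrative sketch of execution-based grading: run a candidate solution as a
# separate process, feed each test case on stdin, and compare stdout with the
# expected output. A real harness adds sandboxing, resource limits, and
# self-repair rounds; this shows only the core pass/fail check.

def grade_solution(solution_path: str, test_cases: list[tuple[str, str]],
                   timeout_s: float = 5.0) -> bool:
    """Return True if the solution produces the expected output on every test."""
    for stdin_text, expected_stdout in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:
            return False
        if result.stdout.strip() != expected_stdout.strip():
            return False
    return True

# Example: a problem whose expected behaviour is to print the sum of two integers.
# tests = [("2 3\n", "5\n"), ("10 -4\n", "6\n")]
# print(grade_solution("candidate.py", tests))
```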
SciCode Benchmark Leaderboard
A scientist-curated coding benchmark of 338 subproblems decomposed from 80 real research problems spanning 16 subdomains of the natural sciences.
MATH-500 Benchmark Leaderboard
A 500-problem subset of the MATH dataset, featuring competition-level problems across seven subject areas including algebra, geometry, and number theory.
AIME 2024 Benchmark Leaderboard
All 30 problems from the 2024 American Invitational Mathematics Examination (AIME I and AIME II, 15 problems each), testing olympiad-level mathematical reasoning with integer answers from 000 to 999.
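A minimal sketch of how an AIME-style answer can be graded, assuming the last integer in the model's response is taken as its answer; real harnesses typically parse a structured final answer (for example a \boxed{} expression), so that extraction rule is a simplifying assumption.

```python
import re

# Illustrative sketch: extract the last integer in a model's response and
# compare it with the reference answer, which must lie in the 000-999 range.
# Taking the last integer as the prediction is an assumption; real graders
# usually require an explicitly marked final answer.

def grade_aime_answer(model_response: str, reference: int) -> bool:
    """Return True if the last integer in the response equals the reference."""
    if not 0 <= reference <= 999:
        raise ValueError("AIME answers are integers from 000 to 999")
    matches = re.findall(r"\d+", model_response)
    if not matches:
        return False
    return int(matches[-1]) == reference

# Example:
# grade_aime_answer("... therefore the answer is 073.", 73)   # True
# grade_aime_answer("... so the answer cannot be 73; it is 74.", 73)  # False
```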