
AI Model Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating seven challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.
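To make the idea of a composite index concrete, the Python sketch below averages per-benchmark scores on a 0-100 scale with equal weights. The equal weighting, the benchmark selection, and the example numbers are illustrative assumptions, not the published index methodology.

    # Minimal sketch of a composite score: an equal-weighted mean of per-benchmark
    # results on a 0-100 scale. Weighting, benchmark choice, and numbers are
    # illustrative assumptions, not the published Intelligence Index methodology.

    def composite_index(scores: dict[str, float]) -> float:
        """Average a model's per-benchmark scores (each on a 0-100 scale)."""
        if not scores:
            raise ValueError("no benchmark scores provided")
        return sum(scores.values()) / len(scores)

    example_scores = {          # hypothetical numbers for illustration only
        "MMLU-Pro": 78.4,
        "GPQA Diamond": 59.1,
        "LiveCodeBench": 44.7,
    }
    print(f"Composite score: {composite_index(example_scores):.1f}")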

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.
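To illustrate the style of evaluation involved, the sketch below grades a candidate Python solution against stdin/stdout test cases in a subprocess, as is typical for competitive programming problems. It is a generic harness under assumed conventions, not LiveCodeBench's own evaluation code; the solution file and test cases are hypothetical.

    # Generic sketch of stdin/stdout test-case grading for competitive programming
    # problems. Not LiveCodeBench's own harness; file name and tests are hypothetical.
    import subprocess
    import sys

    def passes_all_tests(solution_path: str, tests: list[tuple[str, str]]) -> bool:
        """Run a candidate Python solution on each (stdin, expected stdout) pair."""
        for stdin_text, expected in tests:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=10,  # guard against runaway solutions
            )
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
        return True

    # Hypothetical usage: a solution expected to print the sum of two integers.
    # print(passes_all_tests("solution.py", [("1 2\n", "3"), ("10 -4\n", "6")]))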

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 338 sub-tasks derived from 80 genuine laboratory problems across 16 scientific disciplines.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across seven domains including algebra, geometry, and number theory.

IFBench Benchmark Leaderboard

A benchmark for precise instruction following that tests generalization to 58 diverse, verifiable, out-of-domain constraints on model output.
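As an illustration of what a verifiable output constraint looks like, the sketch below checks a response against a word-count cap and a required keyword. Both constraints are hypothetical examples, not drawn from IFBench's actual set of 58.

    # Sketch of a programmatically verifiable output constraint. The specific
    # constraints (word-count cap, required keyword) are hypothetical examples,
    # not taken from IFBench itself.

    def satisfies_constraint(response: str, max_words: int, required_keyword: str) -> bool:
        """Check verifiable requirements on a model response."""
        word_count = len(response.split())
        return word_count <= max_words and required_keyword.lower() in response.lower()

    print(satisfies_constraint(
        "Photosynthesis converts light into chemical energy.",
        max_words=10,
        required_keyword="photosynthesis",
    ))  # True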

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000 to 999.
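Because every AIME answer is an integer from 000 to 999, grading reduces to an exact integer match. The sketch below shows one way to do this; the normalization choices (stripping whitespace and leading zeros) are assumptions, not a documented scoring rule.

    # Sketch of AIME-style grading: answers are integers from 000 to 999, so
    # scoring is exact integer match. Normalization choices are assumptions.

    def grade_aime_answer(predicted: str, reference: str) -> bool:
        """Compare two AIME answers as integers in the range 0-999."""
        try:
            pred, ref = int(predicted.strip()), int(reference.strip())
        except ValueError:
            return False
        return 0 <= pred <= 999 and pred == ref

    print(grade_aime_answer("042", "42"))    # True: leading zeros do not matter
    print(grade_aime_answer("1000", "999"))  # False: out of range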

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).
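The cl100k_base tokenizer named above is available through the tiktoken library, so document length can be measured as shown below. The length-bucketing check is an illustrative sketch, not the benchmark's own pipeline.

    # Counting document length with the cl100k_base tokenizer via tiktoken.
    # The 10k-100k range check is illustrative, not the benchmark's pipeline.
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")

    def token_count(document: str) -> int:
        """Return the number of cl100k_base tokens in a document."""
        return len(encoding.encode(document))

    sample = "Long-form source material goes here... " * 1000
    n_tokens = token_count(sample)
    print(f"{n_tokens} tokens; within 10k-100k: {10_000 <= n_tokens <= 100_000}")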
