
APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

Background

APEX-Agents is an agentic benchmark created and open-sourced by Mercor that tests long-horizon, cross-application work in professional services environments, where agents operate across files and workplace tools and are evaluated against rubrics of binary success/failure criteria using an LLM grader.
APEX-Agents-AA is Artificial Analysis' independent implementation of this benchmark, built on our open-source Stirrup Agent Harness. We evaluate 452 tasks from the public APEX-Agents dataset spanning investment banking, management consulting, and corporate law, excluding two 'worlds' that depend on external APIs (Investment Banking World 244 and Investment Banking World 246).
On this page, the headline score is the pass@1 success rate: the share of tasks where a model fully satisfies the grading rubric, rather than the mean rubric score across criteria.
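
To make that distinction concrete, here is a minimal Python sketch of the two scoring views, assuming each task's grading result is a list of binary rubric outcomes; the task names and data structure are illustrative, not the actual APEX-Agents-AA harness output.

```python
# Minimal sketch of the two scoring views described above, assuming each task's
# grading result is a list of binary pass/fail outcomes, one per rubric criterion.
# Task names and the data structure are illustrative placeholders.
from typing import Dict, List

def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Share of tasks where every rubric criterion was satisfied (headline score)."""
    return sum(all(criteria) for criteria in results.values()) / len(results)

def mean_rubric_score(results: Dict[str, List[bool]]) -> float:
    """Mean fraction of satisfied criteria per task (the non-headline view)."""
    return sum(sum(c) / len(c) for c in results.values()) / len(results)

# Hypothetical grading output for three tasks:
example = {
    "ib_world_001_task_3": [True, True, True],    # fully satisfied -> counts toward pass@1
    "consulting_world_2_task_1": [True, False],   # partial credit only in mean rubric score
    "law_world_5_task_7": [False, False, False],
}
print(pass_at_1(example))          # 1/3 ~= 0.333
print(mean_rubric_score(example))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```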

Methodology

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

View on arXiv

APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Austin Bridges, Jesse Boyle, Koby Twist, et al.

Highlights

  • GPT-5.4 (xhigh) scores highest on APEX-Agents-AA pass@1 at 33.3%, followed by Claude Opus 4.6 (Adaptive Reasoning, Max Effort) at 33.0% and Gemini 3.1 Pro Preview at 32.0%.

APEX-Agents-AA Benchmark Leaderboard: Results

Independently benchmarked by Artificial Analysis

APEX-Agents-AA Benchmark Leaderboard: Token Usage

Tokens used to run the evaluation

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

APEX-Agents-AA Benchmark Leaderboard: Cost Breakdown

Cost (USD) to run the evaluation

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
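
As an illustration of how the three components combine, below is a small sketch assuming per-million-token pricing, with placeholder token counts and prices rather than actual APEX-Agents-AA figures; reasoning and answer tokens are both assumed to be billed at the model's output rate.

```python
# Illustrative cost breakdown, assuming per-million-token pricing and the three
# token categories reported on this page. All numbers below are placeholders,
# not actual APEX-Agents-AA results or real model pricing.
def eval_cost(input_tokens: int, reasoning_tokens: int, answer_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> dict:
    """Return the input/reasoning/answer cost components in USD."""
    return {
        "input_cost": input_tokens / 1e6 * input_price_per_m,
        "reasoning_cost": reasoning_tokens / 1e6 * output_price_per_m,  # billed as output
        "answer_cost": answer_tokens / 1e6 * output_price_per_m,        # billed as output
    }

costs = eval_cost(
    input_tokens=120_000_000,      # placeholder totals across all evaluated tasks
    reasoning_tokens=35_000_000,
    answer_tokens=8_000_000,
    input_price_per_m=1.25,        # placeholder USD per million input tokens
    output_price_per_m=10.00,      # placeholder USD per million output tokens
)
print(costs, sum(costs.values()))
```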

APEX-Agents-AA Benchmark Leaderboard: Score vs. Release Date

Chart: pass@1 score plotted against model release date, with models grouped by developer (Alibaba, Anthropic, DeepSeek, Google, Kimi, MiniMax, NVIDIA, OpenAI, Z AI).

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with ELO ratings derived from blind pairwise comparisons.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.