
AA-Briefcase: Agentic Knowledge Work Benchmark

A private evaluation developed by Artificial Analysis for frontier agentic capability in long-horizon knowledge work, testing agents on realistic business workflows that require deliverables such as spreadsheets, presentations, and memos.

AA-Briefcase evaluates models across four multi-week knowledge work projects, comprising thousands of input files and 91 tasks in total. Across the scenarios, models must complete realistic professional workflows in fields such as data science, product management, and corporate strategy. Each scenario is a multi-week workflow that the agent works through in sequence, with each week containing several tasks. Every task is a deliverable graded against a rubric of checks. Although tasks within a scenario share files and context across weeks, models currently complete each task in an independent run, without carrying over their own prior submissions.

[Figure: AA-Briefcase evaluation structure. Scenarios: AA-Briefcase is built around four major knowledge work areas (Data Science, Product Management, Banking Operations, Heavy Industry Strategy), each structured as a separate project scenario. Weeks: each scenario is made up of a number of weeks designed to simulate a professional workflow (Week 1 to Week 6 shown). Tasks: each week has up to five distinct tasks, which the agent has to complete within the scenario and week context. Checks: each task is graded across three dimensions, with task-specific criteria for each comparison: rubric (binary pass or fail across criteria), analytical quality (pairwise vs. another model), and presentation (pairwise vs. another model).]

Each task is graded against three types of checks:

Rubric

Binary pass or fail per check

Did the model follow the task instructions, identify requirements hidden across source files, use the correct evidence, and reach the right conclusions?

Analytical Quality

Pairwise comparison

Compared against another model's submission, which deliverable is more thorough, analytically rigorous, and well-supported?

Presentation

Pairwise comparison

Compared against another model's submission, which one is more professionally presented?

A public fifth scenario has been released via Hugging Face to illustrate scenario structure, submission, and grading. It is demonstrative only and does not count toward official AA-Briefcase results.

Results

AA-Briefcase Elo

AA-Briefcase is an agentic knowledge work benchmark developed by Artificial Analysis. AA-Briefcase Elo is a combined metric that aggregates rubric pass rate, analytical quality Elo and presentation Elo · Higher is better
Reasoning models are indicated by a lightbulb icon

AA-Briefcase Elo is a combined metric that aggregates analytical quality Elo, presentation Elo, and rubric pass rate, with rubric performance converted into Elo via synthetic head-to-head matches. Elo and 95% confidence interval bounds are clamped at 0.
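
The page does not spell out the rating model, so the following is only a minimal sketch under stated assumptions. It uses the conventional logistic Elo expected-score formula with a 400-point scale, decides each synthetic rubric match by comparing two models' rubric pass rates on the same task, and clamps reported values at 0. The function names and the match rule are hypothetical and are not taken from the Artificial Analysis implementation.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Conventional Elo expected score of A against B (assumed rating model)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def synthetic_match_outcome(pass_rate_a: float, pass_rate_b: float) -> float:
    """Hypothetical rule: the higher rubric pass rate wins the synthetic match.

    Returns 1.0 for a win by A, 0.0 for a loss, and 0.5 for a tie.
    """
    if pass_rate_a > pass_rate_b:
        return 1.0
    if pass_rate_a < pass_rate_b:
        return 0.0
    return 0.5


def clamp_at_zero(value: float) -> float:
    """Clamp an Elo value or a confidence interval bound at 0, as the page describes."""
    return max(0.0, value)


if __name__ == "__main__":
    # Two equally rated models are each expected to score 0.5.
    print(expected_score(1000, 1000))
    # A lower interval bound that would fall below zero is reported as 0.
    print(clamp_at_zero(-5.0))
```

Under these assumptions, synthetic match outcomes could be pooled with the pairwise analytical quality and presentation verdicts before fitting ratings; how the three signals are actually weighted is not stated on the page.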

Cost

AA-Briefcase Cost per Task

Mean cost (USD) per task to run AA-Briefcase, calculated from token usage and model pricing including representative cache hit rates

The total cost to run AA-Briefcase divided by the number of tasks (91 for a full submission). Cost is calculated from token usage and model pricing, split across input, cache hit, cache write, reasoning, and answer token prices, including representative cache hit rates.
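
As a rough illustration of the calculation described above, the sketch below multiplies token counts by per-category prices and divides by the task count. The category keys, the per-million-token pricing unit, and all numbers in the example are assumptions for illustration, not published figures.

```python
# Billing categories named in the cost description.
CATEGORIES = ("input", "cache_hit", "cache_write", "reasoning", "answer")


def cost_per_task(
    tokens: dict[str, int],
    usd_per_million: dict[str, float],
    num_tasks: int = 91,
) -> float:
    """Mean cost in USD per task: total run cost divided by the number of tasks.

    `tokens` holds total token counts for the whole run per category;
    `usd_per_million` holds the assumed price per one million tokens.
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    total_usd = sum(tokens[c] * usd_per_million[c] for c in CATEGORIES) / 1_000_000
    return total_usd / num_tasks


if __name__ == "__main__":
    # Hypothetical token counts and prices, chosen only to exercise the function.
    example_tokens = {
        "input": 50_000_000,
        "cache_hit": 200_000_000,
        "cache_write": 20_000_000,
        "reasoning": 5_000_000,
        "answer": 2_000_000,
    }
    example_prices = {
        "input": 3.0,
        "cache_hit": 0.3,
        "cache_write": 3.75,
        "reasoning": 15.0,
        "answer": 15.0,
    }
    print(f"{cost_per_task(example_tokens, example_prices):.2f} USD per task")
```

In this sketch the cache hit rate is implicit in how input tokens are split between the `input` and `cache_hit` categories.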

Example Task, Submissions, and Grading

Explore a representative AA-Briefcase week from the public Due Diligence scenario available via Hugging Face. The outputs and grading shown here illustrate what AA-Briefcase evaluates. Scores are shown for a representative model set. Submissions and verdicts in this representative scenario do not contribute to a model's AA-Briefcase Elo or other benchmark scores.

Score Comparisons

AA-Briefcase Elo vs. Artificial Analysis Intelligence Index

AA-Briefcase Elo · Artificial Analysis Intelligence Index
Most attractive quadrant


Artificial Analysis Intelligence Index v4.1 includes: GDPval-AA v2, τ³-Banking, Terminal-Bench v2.1, SciCode, Humanity's Last Exam, GPQA Diamond, CritPt, AA-Omniscience, AA-LCR. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

File Type Results

AA-Briefcase performance broken out by the file type of the deliverable (Excel, PowerPoint, PDF, Word, Other).

AA-Briefcase Rubric Pass Rate by File Type (Normalized)

Rubric pass rate by deliverable file type · Scores are normalized per file type across all models tested, where green represents the highest score for that file type and red represents the lowest score for that file type

File types are categorized by the required submission format, with "Other" covering formats such as HTML and LaTeX.

The share of binary rubric checks the submission passed (passed checks divided by total checks), aggregated across all AA-Briefcase tasks. Rubric checks are pass/fail criteria covering whether the deliverable includes required content and cites sources correctly, and whether it resolves planted cross-source conflicts.
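
The pass rate above can be sketched as follows. This version pools every check across tasks before dividing, which is one reading of "aggregated across all AA-Briefcase tasks"; averaging per-task rates would be another, and the page does not say which is used. The function name and input shape are hypothetical.

```python
def rubric_pass_rate(task_checks: list[list[bool]]) -> float:
    """Passed checks divided by total checks, pooled over all tasks.

    `task_checks` holds one list per task, with True for each passed
    rubric check and False for each failed one.
    """
    passed = sum(sum(checks) for checks in task_checks)
    total = sum(len(checks) for checks in task_checks)
    if total == 0:
        raise ValueError("no rubric checks to aggregate")
    return passed / total


if __name__ == "__main__":
    # Two hypothetical tasks: 1 of 2 checks passed, then 3 of 4 checks passed.
    print(rubric_pass_rate([[True, False], [True, True, True, False]]))
```

With pooling, tasks that carry more checks weigh more heavily in the final rate.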

Token Usage

AA-Briefcase Token Breakdown

Tokens used to run AA-Briefcase, including input, reasoning, and answer tokens

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Speed

Time per Task

Wall-clock time (minutes) per task: answer and reasoning generation plus tool execution time · Lower is better

Estimated wall-clock time per task: the sum of answer and reasoning tokens per task divided by the model's canonical answer output speed, plus mean tool execution time per task. Lower is better.
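
The estimate above can be written out as a short sketch. The parameter names are hypothetical, and the units (output speed in tokens per second, tool time in seconds, result in minutes) are assumptions made so the arithmetic is concrete.

```python
def estimated_minutes_per_task(
    answer_tokens: float,
    reasoning_tokens: float,
    output_tokens_per_second: float,
    mean_tool_seconds: float,
) -> float:
    """Generation time at the model's output speed plus mean tool time, in minutes."""
    if output_tokens_per_second <= 0:
        raise ValueError("output speed must be positive")
    generation_seconds = (answer_tokens + reasoning_tokens) / output_tokens_per_second
    return (generation_seconds + mean_tool_seconds) / 60.0


if __name__ == "__main__":
    # Hypothetical task: 12,000 generated tokens at 100 tokens/s plus 60 s of tool time.
    print(estimated_minutes_per_task(6_000, 6_000, 100.0, 60.0))
```

Because generation time is derived from token counts and a canonical speed rather than measured directly, the result is an estimate, not an observed latency.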

Turns

Mean Turns per Task

Average number of model turns per AA-Briefcase task · Lower is better

This chart shows the mean number of turns the agent takes per task. It is a rough proxy for how many actions, tool calls, and iteration cycles an agent is using to complete benchmark tasks.

Tool Usage

Tool invocations issued by each agent during AA-Briefcase: counts by tool category, mean tool calls per turn, and source-pool exploration coverage.

AA-Briefcase Tool Calls Breakdown, Avg per Task

Average tool invocations per AA-Briefcase task, bucketed by intent

Agent tool calls are grouped into six categories: explore (navigating and searching the workspace), read (reading file contents), write (creating or editing files), compute (running code or calculations), view image (visual inspection of files), and other (anything else).

Model Size (Open Weights Models Only)

AA-Briefcase Elo vs. Total Parameters

AA-Briefcase Elo · Size in parameters (billions) · Open weights models only


The total number of trainable weights and biases in the model, expressed in billions. These parameters are learned during training and determine the model's ability to process and generate responses.

Score vs. Release Date

AA-Briefcase Elo vs. Release Date

AA-Briefcase Elo · Model release date
Most attractive region


Leaderboard

| Rank | Creator | Name | Elo | 95% CI | Release Date |
|---|---|---|---|---|---|
| 1 | Anthropic | Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | 1587 | -17 / +17 | Jun 2026 |
| 2 | Anthropic | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | 1356 | -12 / +12 | May 2026 |
| 3 | Anthropic | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 1299 | -12 / +12 | Apr 2026 |
| 4 | Z AI | GLM-5.2 (max) | 1266 | -13 / +13 | Jun 2026 |
| 5 | OpenAI | GPT-5.5 (xhigh) | 1159 | -11 / +11 | Apr 2026 |
| 6 | MiniMax | MiniMax-M3 | 1116 | -11 / +11 | Jun 2026 |
| 7 | OpenAI | GPT-5.5 (high) | 1104 | -11 / +11 | Apr 2026 |
| 8 | Anthropic | Claude Opus 4.7 (Non-reasoning, High Effort) | 1098 | -11 / +11 | Apr 2026 |
| 9 | Anthropic | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 1081 | -10 / +10 | Feb 2026 |
| 10 | OpenAI | GPT-5.5 (medium) | 1000 | -0 / +0 | Apr 2026 |
| 11 | Z AI | GLM-5.1 (Reasoning) | 976 | -11 / +11 | Apr 2026 |
| 12 | DeepSeek | DeepSeek V4 Pro (Reasoning, Max Effort) | 936 | -10 / +10 | Apr 2026 |
| 13 | Alibaba | Qwen3.7 Max | 903 | -11 / +11 | May 2026 |
| 14 | Google | Gemini 3.5 Flash (medium) | 871 | -11 / +11 | May 2026 |
| 15 | Google | Gemini 3.5 Flash (high) | 870 | -11 / +11 | May 2026 |
| 16 | OpenAI | GPT-5.3 Codex (xhigh) | 868 | -11 / +11 | Feb 2026 |
| 17 | Xiaomi | MiMo-V2.5-Pro | 867 | -11 / +11 | Apr 2026 |
| 18 | DeepSeek | DeepSeek V4 Flash (Reasoning, Max Effort) | 836 | -11 / +11 | Apr 2026 |
| 19 | Kimi | Kimi K2.6 | 809 | -11 / +11 | Apr 2026 |
| 20 | xAI | Grok 4.3 (high) | 748 | -11 / +11 | Apr 2026 |
| 21 | OpenAI | GPT-5.4 mini (xhigh) | 704 | -12 / +12 | Mar 2026 |
| 22 | Meta | Muse Spark | 632 | -13 / +13 | Apr 2026 |
| 23 | Anthropic | Claude 4.5 Haiku (Reasoning) | 605 | -13 / +13 | Oct 2025 |
| 24 | Mistral | Mistral Medium 3.5 | 502 | -14 / +14 | Apr 2026 |
| 25 | Google | Gemini 3.1 Pro Preview | 445 | -14 / +14 | Feb 2026 |
| 26 | Google | Gemma 4 31B (Reasoning) | 348 | -15 / +15 | Apr 2026 |
| 27 | Upstage | Solar Pro 3 | 113 | -18 / +18 | Apr 2026 |
| 28 | MBZUAI Institute of Foundation Models | K2 Think V2 | 50 | -19 / +19 | Dec 2025 |
| 29 | OpenAI | gpt-oss-20B (high) | 0 | -0 / +0 | Aug 2025 |

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating nine challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

AA-Briefcase: Agentic Knowledge Work Benchmark

A private evaluation developed by Artificial Analysis for frontier agentic capability in long-horizon knowledge work, testing agents on realistic business workflows that require deliverables such as spreadsheets, presentations, and memos.

GDPval-AA v2 Leaderboard

GDPval-AA v2 is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

τ²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

τ³-Banking Benchmark Leaderboard

A fintech customer-support benchmark from the τ-Knowledge framework that tests whether agents can navigate a large unstructured knowledge base and execute multi-step tool calls to resolve realistic banking workflows.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

Terminal-Bench v2.1 Benchmark Leaderboard

A verified refresh of Terminal-Bench v2.0: 89 curated tasks across software engineering, system administration, data processing, model training, and security, with environment and instruction fixes so scores reflect agent capability rather than environment gaps.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.