AA-Briefcase: Agentic Knowledge Work Benchmark

Name: AA-Briefcase Elo
Creator: Artificial Analysis
License: https://artificialanalysis.ai/docs/legal/Terms-of-Use.pdf

AA-Briefcase is a private evaluation developed by Artificial Analysis that measures frontier agentic capability in long-horizon knowledge work. It tests agents on realistic business workflows that require deliverables such as spreadsheets, presentations, and memos.

Related Links

AA-Briefcase Launch Article
AA-Briefcase Methodology
ArtificialAnalysis/AA-Briefcase-Lite
ArtificialAnalysis/Stirrup

AA-Briefcase evaluates models across four multi-week knowledge work projects (scenarios), comprising thousands of input files and 91 tasks in total. Across the scenarios, models must complete realistic professional workflows in fields such as data science, product management, and corporate strategy. Each scenario is worked through week by week, with each week containing several tasks. Every task is a deliverable graded against a rubric of checks. Although tasks within a scenario share files and context across weeks, models currently complete each task in an independent run, without carrying over their own prior submissions.
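The scenario, week, task, and check hierarchy described above can be pictured as nested records. The following is a minimal sketch only: the class names, field names, and toy data are hypothetical and are not taken from the benchmark's actual file layout.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One deliverable, graded against a rubric of pass/fail checks."""
    name: str
    checks: list[str] = field(default_factory=list)


@dataclass
class Week:
    """One week of a scenario, holding several tasks."""
    number: int
    tasks: list[Task] = field(default_factory=list)


@dataclass
class Scenario:
    """A multi-week workflow made up of weeks in sequence."""
    name: str
    weeks: list[Week] = field(default_factory=list)

    def task_count(self) -> int:
        """Count tasks across every week of this scenario."""
        return sum(len(week.tasks) for week in self.weeks)


def total_tasks(scenarios: list[Scenario]) -> int:
    """Count tasks across all scenarios."""
    return sum(scenario.task_count() for scenario in scenarios)


# Hypothetical toy data, not taken from the benchmark.
demo = Scenario(
    name="demo-scenario",
    weeks=[
        Week(1, [Task("memo", ["cites source A"]), Task("model.xlsx")]),
        Week(2, [Task("deck.pptx")]),
    ],
)
print(total_tasks([demo]))  # 3
```

Applied to the four private scenarios, a count of this kind would come to the 91 tasks stated above.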

Each task is graded against three types of checks:

Rubric (binary pass or fail per check): Did the model follow the task instructions, identify requirements hidden across source files, use the correct evidence, and reach the right conclusions?

Analytical Quality (pairwise comparison): Compared against another model's submission, which deliverable is more thorough, analytically rigorous, and well-supported?

Presentation (pairwise comparison): Compared against another model's submission, which one is more professionally presented?

A public fifth scenario has been released via Hugging Face to illustrate scenario structure, submission, and grading. It is demonstrative only and does not count toward official AA-Briefcase results.

Results

AA-Briefcase Elo

AA-Briefcase Elo is a combined metric that aggregates analytical quality Elo, presentation Elo, and rubric pass rate, with rubric performance converted into Elo via synthetic head-to-head matches. Higher is better. Elo and 95% confidence interval bounds are clamped at 0.

Reasoning models are indicated by a lightbulb icon
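The page does not specify how the ratings are fitted, so the following is only a rough sketch of the ingredients involved. It assumes the standard logistic Elo expected-score formula with a 400-point scale, and it uses a hypothetical rule for turning two rubric pass rates into a synthetic match result; neither is confirmed as the method Artificial Analysis uses.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of A against B (logistic curve, base 10)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def synthetic_outcome(pass_rate_a: float, pass_rate_b: float) -> float:
    """Hypothetical rule: the higher rubric pass rate on a task wins the match.

    Returns 1.0 for a win by A, 0.0 for a loss, and 0.5 for a tie.
    """
    if pass_rate_a > pass_rate_b:
        return 1.0
    if pass_rate_a < pass_rate_b:
        return 0.0
    return 0.5


def clamp_at_zero(value: float) -> float:
    """Ratings and confidence interval bounds are reported as no lower than 0."""
    return max(0.0, value)


print(expected_score(1000, 1000))   # 0.5
print(synthetic_outcome(0.8, 0.6))  # 1.0
print(clamp_at_zero(-35.0))         # 0.0
```

Under these assumptions, synthetic rubric matches and real pairwise judgments could feed the same rating fit, which is one way a pass rate and two Elo scores might end up on a single scale.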

Cost

AA-Briefcase Cost per Task

Mean cost (USD) per task: the total cost to run AA-Briefcase divided by the number of tasks (91 for a full submission). Cost is calculated from token usage and model pricing, split across input, cache hit, cache write, reasoning, and answer token prices, including representative cache hit rates.
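The cost calculation can be sketched as a sum over the five price buckets named above. This is an illustrative sketch: the class names, the per-million-token price unit, the assumption that the buckets are disjoint, and all numbers in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Token counts for a full run, one field per price bucket (assumed disjoint)."""
    input: int
    cache_hit: int
    cache_write: int
    reasoning: int
    answer: int


@dataclass
class Pricing:
    """USD per one million tokens for each bucket (hypothetical unit and values)."""
    input: float
    cache_hit: float
    cache_write: float
    reasoning: float
    answer: float


def total_cost_usd(usage: TokenUsage, pricing: Pricing) -> float:
    """Sum each bucket's tokens times its price, then convert from per-million."""
    weighted = (
        usage.input * pricing.input
        + usage.cache_hit * pricing.cache_hit
        + usage.cache_write * pricing.cache_write
        + usage.reasoning * pricing.reasoning
        + usage.answer * pricing.answer
    )
    return weighted / 1_000_000


def cost_per_task(usage: TokenUsage, pricing: Pricing, num_tasks: int = 91) -> float:
    """Total run cost divided by the number of tasks (91 for a full submission)."""
    return total_cost_usd(usage, pricing) / num_tasks


# Made-up numbers chosen so the arithmetic is easy to follow.
usage = TokenUsage(input=91_000_000, cache_hit=0, cache_write=0,
                   reasoning=0, answer=9_100_000)
pricing = Pricing(input=1.0, cache_hit=0.1, cache_write=1.25,
                  reasoning=10.0, answer=10.0)
print(cost_per_task(usage, pricing))  # 2.0
```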

Example Task, Submissions, and Grading

Explore a representative AA-Briefcase week from the public Due Diligence scenario available via Hugging Face. The outputs and grading shown here illustrate what AA-Briefcase evaluates. Scores are shown for a representative model set. Submissions and verdicts in this representative scenario do not contribute to a model's AA-Briefcase Elo or other benchmark scores.

Score Comparisons

AA-Briefcase Elo vs. Artificial Analysis Intelligence Index

[Chart: AA-Briefcase Elo vs. Artificial Analysis Intelligence Index, with the most attractive quadrant marked]

Artificial Analysis Intelligence Index v4.1 includes: GDPval-AA v2, 𝜏³-Banking, Terminal-Bench v2.1, SciCode, Humanity's Last Exam, GPQA Diamond, CritPt, AA-Omniscience, AA-LCR. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

File Type Results

AA-Briefcase performance broken out by the file type of the deliverable (Excel, PowerPoint, PDF, Word, Other).

AA-Briefcase Rubric Pass Rate by File Type (Normalized)

Rubric pass rate by deliverable file type · Scores are normalized per file type across all models tested, where green represents the highest score for that file type and red represents the lowest score for that file type

File types are categorized by the required submission format, with “Other” covering formats such as HTML and LaTeX.

Rubric pass rate is the share of binary rubric checks the submission passed (passed checks divided by total checks), aggregated across all AA-Briefcase tasks. Rubric checks are pass/fail criteria covering whether the deliverable includes required content, cites sources correctly, and resolves planted cross-source conflicts.
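The pass rate and the per-file-type normalization can be sketched as follows. Two assumptions are made here that the page does not confirm: checks are pooled across tasks before dividing, and the normalization is a min-max rescale so the best model for a file type maps to 1.0 (green) and the worst to 0.0 (red). Model names and scores are made up.

```python
def rubric_pass_rate(check_results: list[bool]) -> float:
    """Passed checks divided by total checks, with checks pooled across tasks."""
    return sum(check_results) / len(check_results)


def normalize_per_file_type(
    scores: dict[str, dict[str, float]],
) -> dict[str, dict[str, float]]:
    """Min-max rescale scores[file_type][model] within each file type."""
    normalized: dict[str, dict[str, float]] = {}
    for file_type, by_model in scores.items():
        low = min(by_model.values())
        high = max(by_model.values())
        spread = high - low
        normalized[file_type] = {
            # If every model ties, there is no spread to rescale over.
            model: (value - low) / spread if spread else 0.0
            for model, value in by_model.items()
        }
    return normalized


print(rubric_pass_rate([True, True, False, True]))  # 0.75

# Hypothetical pass rates for three made-up models on one file type.
example = {"Excel": {"model-a": 0.5, "model-b": 0.75, "model-c": 1.0}}
print(normalize_per_file_type(example))
# {'Excel': {'model-a': 0.0, 'model-b': 0.5, 'model-c': 1.0}}
```

Because the rescale is done separately per file type, a normalized score is only comparable between models within the same column, not across file types.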

Token Usage

AA-Briefcase Output Tokens per Task

Mean output tokens consumed per AA-Briefcase task, including visible answer tokens and reasoning tokens where reported by reasoning models.

Speed

Time per Task

Estimated wall-clock time (minutes) per task: the sum of answer and reasoning tokens per task divided by the model’s canonical answer output speed, plus mean tool execution time per task. Lower is better.
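The estimate above is simple arithmetic, sketched here with assumed units: output speed in tokens per second and tool execution time in seconds. The function name and the example numbers are hypothetical.

```python
def estimated_minutes_per_task(
    output_tokens_per_task: float,
    output_speed_tokens_per_second: float,
    tool_seconds_per_task: float,
) -> float:
    """Generation time (tokens / speed) plus tool time, converted to minutes."""
    generation_seconds = output_tokens_per_task / output_speed_tokens_per_second
    return (generation_seconds + tool_seconds_per_task) / 60.0


# Made-up example: 60,000 output tokens at 100 tokens/s plus 300 s of tool time.
print(estimated_minutes_per_task(60_000, 100.0, 300.0))  # 15.0
```

Because the generation term uses a canonical output speed rather than measured latency, the result is an estimate and not a timed run.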

Turns

Mean Turns per Task

Average number of model turns the agent takes per AA-Briefcase task. It is a rough proxy for how many actions, tool calls, and iteration cycles an agent uses to complete benchmark tasks. Lower is better.

Tool Usage

Tool invocations issued by each agent during AA-Briefcase: counts by tool category, mean tool calls per turn, and source-pool exploration coverage.

AA-Briefcase Tool Calls Breakdown, Avg per Task

Average tool invocations per AA-Briefcase task, bucketed by intent

Agent tool calls are grouped into six categories: explore (navigating and searching the workspace), read (reading file contents), write (creating or editing files), compute (running code or calculations), view image (visual inspection of files), and other (anything else).
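The six-way grouping can be sketched as a lookup table plus a per-task average. The tool names in the mapping are hypothetical, since the page does not list the agent's actual tools; only the six category names come from the description above.

```python
from collections import Counter

CATEGORIES = ("explore", "read", "write", "compute", "view image", "other")

# Hypothetical tool names; the benchmark's real tool names are not given here.
TOOL_TO_CATEGORY = {
    "list_dir": "explore",
    "search_files": "explore",
    "read_file": "read",
    "write_file": "write",
    "edit_file": "write",
    "run_code": "compute",
    "view_image": "view image",
}


def categorize(tool_name: str) -> str:
    """Map a tool name to its category, falling back to 'other'."""
    return TOOL_TO_CATEGORY.get(tool_name, "other")


def average_calls_per_task(calls_by_task: list[list[str]]) -> dict[str, float]:
    """Average number of calls per task in each of the six categories."""
    counts = Counter(
        categorize(name) for calls in calls_by_task for name in calls
    )
    num_tasks = len(calls_by_task)
    return {category: counts[category] / num_tasks for category in CATEGORIES}


# Two made-up task traces.
traces = [
    ["list_dir", "read_file", "read_file", "run_code"],
    ["read_file", "write_file", "fetch_url"],
]
print(average_calls_per_task(traces))
# {'explore': 0.5, 'read': 1.5, 'write': 0.5, 'compute': 0.5, 'view image': 0.0, 'other': 0.5}
```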

Model Size (Open Weights Models Only)

AA-Briefcase Elo vs. Total Parameters

[Chart: AA-Briefcase Elo vs. size in parameters (billions), open weights models only, with the most attractive quadrant marked]

Total parameters is the total number of trainable weights and biases in the model, expressed in billions. These parameters are learned during training and determine the model's ability to process and generate responses.

Score vs. Release Date

AA-Briefcase Elo vs. Release Date

[Chart: AA-Briefcase Elo vs. model release date, with the most attractive region marked]

Leaderboard

Rank	Creator	Name	Elo	95% CI	Release Date
1	Anthropic	Claude Opus 5 (Adaptive Reasoning, Max Effort)	1721	-12 / +13	Jul 2026
2	Anthropic	Claude Opus 5 (Adaptive Reasoning, Xhigh Effort)	1693	-13 / +13	Jul 2026
3	Anthropic	Claude Opus 5 (Adaptive Reasoning, High Effort)	1606	-12 / +13	Jul 2026
4	Anthropic	Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	1574	-11 / +11	Jun 2026
5	Kimi	Kimi K3	1540	-11 / +12	Jul 2026
6	OpenAI	GPT-5.6 Sol (max)	1504	-10 / +11	Jul 2026
7	Anthropic	Claude Opus 5 (Adaptive Reasoning, Medium Effort)	1470	-12 / +13	Jul 2026
8	Anthropic	Claude Sonnet 5 (Adaptive Reasoning, Max Effort)	1385	-9 / +10	Jun 2026
9	Anthropic	Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	1345	-9 / +9	May 2026
10	SpaceXAI	Grok 4.5 (high)	1317	-10 / +11	Jul 2026
11	Anthropic	Claude Sonnet 5 (Adaptive Reasoning, Xhigh Effort)	1294	-10 / +10	Jun 2026
12	Anthropic	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	1281	-9 / +9	Apr 2026
13	Z AI	GLM-5.2 (max)	1254	-10 / +9	Jun 2026
14	Anthropic	Claude Opus 5 (Adaptive Reasoning, Low Effort)	1223	-12 / +11	Jul 2026
15	Anthropic	Claude Sonnet 5 (Adaptive Reasoning, High Effort)	1194	-10 / +10	Jun 2026
16	OpenAI	GPT-5.5 (xhigh)	1153	-9 / +9	Apr 2026
17	MiniMax	MiniMax-M3	1108	-9 / +8	Jun 2026
18	OpenAI	GPT-5.5 (high)	1099	-9 / +8	Apr 2026
19	Anthropic	Claude Opus 4.7 (Non-reasoning, High Effort)	1086	-9 / +9	Apr 2026
20	Anthropic	Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)	1076	-8 / +8	Feb 2026
21	Anthropic	Claude Sonnet 5 (Adaptive Reasoning, Medium Effort)	1056	-10 / +10	Jun 2026
22	OpenAI	GPT-5.5 (medium)	1000	-0 / +0	Apr 2026
23	Z AI	GLM-5.1 (Reasoning)	973	-9 / +8	Apr 2026
24	Google	Gemini 3.6 Flash (high)	964	-10 / +10	Jul 2026
25	DeepSeek	DeepSeek V4 Pro (Reasoning, Max Effort)	931	-8 / +8	Apr 2026
26	Anthropic	Claude Sonnet 5 (Adaptive Reasoning, Low Effort)	928	-10 / +9	Jun 2026
27	Alibaba	Qwen3.7 Max	914	-9 / +8	May 2026
28	Xiaomi	MiMo-V2.5-Pro	879	-8 / +8	Apr 2026
29	NVIDIA	Nemotron 3 Ultra 550B A55B (Reasoning)	874	-9 / +9	Jun 2026
30	Google	Gemini 3.5 Flash (medium)	873	-9 / +9	May 2026
31	Google	Gemini 3.5 Flash (high)	871	-9 / +9	May 2026
32	OpenAI	GPT-5.3 Codex (xhigh)	869	-9 / +8	Feb 2026
33	Meta	Muse Spark 1.1 (xhigh)	868	-12 / +12	Jul 2026
34	Thinking Machines	Inkling (xhigh)	839	-11 / +10	Jul 2026
35	DeepSeek	DeepSeek V4 Flash (Reasoning, Max Effort)	833	-8 / +8	Apr 2026
36	Kimi	Kimi K2.6	818	-9 / +8	Apr 2026
37	Alibaba	Qwen3.6 27B (Reasoning)	810	-10 / +10	Apr 2026
38	SpaceXAI	Grok 4.3 (high)	759	-9 / +9	Apr 2026
39	OpenAI	GPT-5.4 mini (xhigh)	717	-9 / +9	Mar 2026
40	Meta	Muse Spark	641	-11 / +10	Apr 2026
41	Google	Gemini 3.5 Flash-Lite	636	-12 / +12	Jul 2026
42	Anthropic	Claude 4.5 Haiku (Reasoning)	611	-10 / +10	Oct 2025
43	KwaiKAT	KAT-Coder-Pro V1	598	-11 / +11	Nov 2025
44	Alibaba	Qwen3.5 397B A17B (Reasoning)	554	-11 / +10	Feb 2026
45	Mistral	Mistral Medium 3.5	516	-11 / +10	Apr 2026
46	Google	Gemini 3.1 Pro Preview	457	-11 / +11	Feb 2026
47	Google	Gemma 4 31B (Reasoning)	373	-12 / +12	Apr 2026
48	Cohere	Command A+	369	-15 / +13	May 2026
49	Cohere	North Mini Code	238	-15 / +14	Jun 2026
50	Google	Gemini 3.1 Flash-Lite	230	-14 / +12	Mar 2026
51	Upstage	Solar Pro 3	138	-15 / +13	Apr 2026
52	MBZUAI Institute of Foundation Models	K2 Think V2	59	-16 / +14	Dec 2025
53	OpenAI	gpt-oss-120b (high)	7	-7 / +15	Aug 2025
54	OpenAI	gpt-oss-20b (high)	0	-0 / +0	Aug 2025
55	Meta	Llama 4 Maverick	0	-0 / +0	Apr 2025
56	NVIDIA	NVIDIA Nemotron 3 Super 120B A12B (Reasoning)	0	-0 / +0	Mar 2026
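One way to read the confidence interval column is to check whether two models' intervals overlap. The sketch below uses four rows copied from the table above; treating overlap as "not clearly separated" is a rough heuristic assumed here, not a formal significance test and not a rule stated by Artificial Analysis.

```python
def interval(elo: int, minus: int, plus: int) -> tuple[int, int]:
    """Lower and upper bound of the interval, with the lower bound clamped at 0."""
    return (max(0, elo - minus), elo + plus)


def intervals_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True when the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Rows 1 and 2: Claude Opus 5 at Max and Xhigh effort.
opus5_max = interval(1721, 12, 13)    # (1709, 1734)
opus5_xhigh = interval(1693, 13, 13)  # (1680, 1706)

# Rows 30 and 31: Gemini 3.5 Flash at medium and high.
flash_medium = interval(873, 9, 9)    # (864, 882)
flash_high = interval(871, 9, 9)      # (862, 880)

print(intervals_overlap(opus5_max, opus5_xhigh))   # False
print(intervals_overlap(flash_medium, flash_high)) # True
```

By this reading, the gap between ranks 1 and 2 is larger than their intervals, while ranks 30 and 31 are too close to order with confidence.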

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating nine challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

GDPval-AA v2 Leaderboard

GDPval-AA v2 is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

AutomationBench-AA: Agentic SaaS Workflow Benchmark

A benchmark measuring agentic task completion across simulated SaaS application environments, scoring the share of each task's objectives completed without guardrail violations.

Harvey LAB-AA Benchmark Leaderboard

Artificial Analysis' implementation of Harvey's Legal Agent Benchmark (LAB), testing AI agents on real-world legal work from Harvey's dataset of 120 private tasks spanning 24 legal practice areas. The agent reads case documents in a sandbox and produces legal deliverables (e.g., memos, disclosure schedules, deposition summaries), graded criterion-by-criterion by a single LLM rubric judge.

EnterpriseOps-Gym-AA Benchmark Leaderboard

Artificial Analysis' independent implementation of ServiceNow's EnterpriseOps-Gym, an agentic benchmark testing whether LLM agents can complete stateful, multi-step enterprise workflows across eight business domains via live tool use, graded on the final state of the underlying databases.

𝜏³-Banking Benchmark Leaderboard

A fintech customer-support benchmark from the 𝜏-Knowledge framework that tests whether agents can navigate a large unstructured knowledge base and execute multi-step tool calls to resolve realistic banking workflows.

Terminal-Bench v2.1 Benchmark Leaderboard

A verified refresh of Terminal-Bench v2.0 — 89 curated tasks across software engineering, system administration, data processing, model training, and security, with environment and instruction fixes so scores reflect agent capability rather than environment gaps.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.
