AA-Briefcase evaluates models across four multi-week knowledge work projects, comprising thousands of input files and 91 tasks in total. Across the scenarios, models must complete realistic professional workflows in fields such as data science, product management, and corporate strategy. Each scenario is a multi-week workflow that the agent works through in sequence, with each week containing several tasks. Every task is a deliverable graded against a rubric of checks. Although tasks within a scenario share files and context across weeks, models currently complete each task in an independent run, without carrying over their own prior submissions.
Each task is graded against three types of checks:
- Rubric (binary pass or fail per check): Did the model follow the task instructions, identify requirements hidden across source files, use the correct evidence, and reach the right conclusions?
- Pairwise comparison: Compared against another model's submission, which deliverable is more thorough, analytically rigorous, and well-supported?
- Pairwise comparison: Compared against another model's submission, which one is more professionally presented?
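The three check types can be pictured with a small data model. The sketch below is illustrative only: the class names, the field names, and the simple pass-rate and win-rate aggregations are assumptions, not the benchmark's published scoring code.

```python
from dataclasses import dataclass, field


@dataclass
class RubricCheck:
    """One binary check from a task rubric (hypothetical structure)."""
    description: str
    passed: bool


@dataclass
class TaskResult:
    """Grading record for a single deliverable (hypothetical structure)."""
    task_id: str
    rubric: list[RubricCheck] = field(default_factory=list)
    # One entry per pairwise comparison: True means this deliverable was preferred.
    pairwise_wins: list[bool] = field(default_factory=list)

    def rubric_pass_rate(self) -> float:
        """Fraction of binary checks passed; 0.0 when there are no checks."""
        if not self.rubric:
            return 0.0
        return sum(check.passed for check in self.rubric) / len(self.rubric)

    def pairwise_win_rate(self) -> float:
        """Fraction of pairwise comparisons won; 0.0 when there are none."""
        if not self.pairwise_wins:
            return 0.0
        return sum(self.pairwise_wins) / len(self.pairwise_wins)


# Made-up example data, not taken from any real AA-Briefcase task.
result = TaskResult(
    task_id="example-task",
    rubric=[
        RubricCheck("Followed the task instructions", True),
        RubricCheck("Found the requirement hidden in the source files", False),
        RubricCheck("Used the correct evidence", True),
        RubricCheck("Reached the right conclusion", True),
    ],
    pairwise_wins=[True, False],
)
print(result.rubric_pass_rate(), result.pairwise_win_rate())
```

Keeping the rubric checks and the pairwise outcomes in separate fields mirrors the split described above: the rubric is scored per deliverable, while pairwise comparisons only have meaning relative to another model's submission.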
A public fifth scenario has been released on Hugging Face to illustrate scenario structure, submission, and grading. It is demonstrative only and does not count toward official AA-Briefcase results.
AA-Briefcase performance broken out by the file type of the deliverable (Excel, PowerPoint, PDF, Word, Other).
Tool invocations issued by each agent during AA-Briefcase: counts by tool category, mean tool calls per turn, and source-pool exploration coverage.
| Rank | Name | Elo | CI | Release Date | |
|---|---|---|---|---|---|
| 1 | Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | 1587 | -17 / +17 | Jun 2026 | |
| 2 | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | 1356 | -12 / +12 | May 2026 | |
| 3 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 1299 | -12 / +12 | Apr 2026 | |
| 4 | GLM-5.2 (max) | 1266 | -13 / +13 | Jun 2026 | |
| 5 | GPT-5.5 (xhigh) | 1159 | -11 / +11 | Apr 2026 | |
| 6 | MiniMax-M3 | 1116 | -11 / +11 | Jun 2026 | |
| 7 | GPT-5.5 (high) | 1104 | -11 / +11 | Apr 2026 | |
| 8 | Claude Opus 4.7 (Non-reasoning, High Effort) | 1098 | -11 / +11 | Apr 2026 | |
| 9 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 1081 | -10 / +10 | Feb 2026 | |
| 10 | GPT-5.5 (medium) | 1000 | -0 / +0 | Apr 2026 | |
| 11 | GLM-5.1 (Reasoning) | 976 | -11 / +11 | Apr 2026 | |
| 12 | DeepSeek V4 Pro (Reasoning, Max Effort) | 936 | -10 / +10 | Apr 2026 | |
| 13 | Qwen3.7 Max | 903 | -11 / +11 | May 2026 | |
| 14 | Gemini 3.5 Flash (medium) | 871 | -11 / +11 | May 2026 | |
| 15 | Gemini 3.5 Flash (high) | 870 | -11 / +11 | May 2026 | |
| 16 | GPT-5.3 Codex (xhigh) | 868 | -11 / +11 | Feb 2026 | |
| 17 | MiMo-V2.5-Pro | 867 | -11 / +11 | Apr 2026 | |
| 18 | DeepSeek V4 Flash (Reasoning, Max Effort) | 836 | -11 / +11 | Apr 2026 | |
| 19 | Kimi K2.6 | 809 | -11 / +11 | Apr 2026 | |
| 20 | Grok 4.3 (high) | 748 | -11 / +11 | Apr 2026 | |
| 21 | GPT-5.4 mini (xhigh) | 704 | -12 / +12 | Mar 2026 | |
| 22 | Muse Spark | 632 | -13 / +13 | Apr 2026 | |
| 23 | Claude 4.5 Haiku (Reasoning) | 605 | -13 / +13 | Oct 2025 | |
| 24 | Mistral Medium 3.5 | 502 | -14 / +14 | Apr 2026 | |
| 25 | Gemini 3.1 Pro Preview | 445 | -14 / +14 | Feb 2026 | |
| 26 | Gemma 4 31B (Reasoning) | 348 | -15 / +15 | Apr 2026 | |
| 27 | Solar Pro 3 | 113 | -18 / +18 | Apr 2026 | |
| 28 | K2 Think V2 | 50 | -19 / +19 | Dec 2025 | |
| 29 | gpt-oss-20B (high) | 0 | -0 / +0 | Aug 2025 |
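To read the Elo column as head-to-head expectations, a rating gap can be converted into an expected win probability. The sketch below assumes the conventional base-10 logistic Elo model with a 400-point scale; the leaderboard does not state which scale or fitting procedure it uses, so treat the result as a rough guide rather than an official figure. The function name is hypothetical.

```python
def expected_win_probability(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: base-10 logistic with a 400-point scale, as in chess Elo.
    The leaderboard's actual rating fit may use a different convention.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# Ratings copied from the first two rows of the table above.
rank_1_elo = 1587
rank_2_elo = 1356

p = expected_win_probability(rank_1_elo, rank_2_elo)
print(f"Expected win rate of rank 1 over rank 2: {p:.2f}")
```

Under that assumption, the 231-point gap between the top two rows corresponds to the higher-rated model being preferred in roughly four out of five pairwise comparisons.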
A composite benchmark aggregating nine challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.
A private evaluation developed by Artificial Analysis for frontier agentic capability in long-horizon knowledge work, testing agents on realistic business workflows that require deliverables such as spreadsheets, presentations, and memos.
GDPval-AA v2 is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.
Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.
A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.
A fintech customer-support benchmark from the τ-Knowledge framework that tests whether agents can navigate a large unstructured knowledge base and execute multi-step tool calls to resolve realistic banking workflows.
An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.
A verified refresh of Terminal-Bench v2.0: 89 curated tasks across software engineering, system administration, data processing, model training, and security, with environment and instruction fixes so scores reflect agent capability rather than environment gaps.
A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.
A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).
A benchmark measuring factual recall and hallucination across various economically relevant domains.
A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.
A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.
The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.
A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.
Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.
A composite measure providing an industry standard to communicate model openness for users and developers.
An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.
A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.
A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.
A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.
All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.
An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.