June 24, 2026
Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi week projects.
One of the key metrics we measure in AA-Briefcase is average time per task. This is calculated using evaluation token usage, representative model output speeds, and tool execution time recorded during evaluation.
Key time per task takeaways from AA-Briefcase:
➤ Claude Opus 4.8 is the highest-scoring available model, but it is also one of the slowest, taking ~23 minutes per task on average
➤ Several GPT-5.5 reasoning variants lie along the Pareto frontier of AA-Briefcase Elo vs. Time per Task, including medium, high, and xhigh. GPT-5.5 (xhigh) in particular stands out as one of the most efficient top-performing models, using around half the time per task of Opus 4.8 (11 minutes) while ranking top 5 on the overall AA-Briefcase Elo
➤ GLM-5.2 also sits on the Pareto frontier, scoring 1261, ahead of GPT-5.5 (xhigh, 1159) but also taking more time per task (16.3 minutes). It is also the top-performing open weights model on AA-Briefcase, with MiniMax-M3 the next best at 1113
➤ If Claude Fable 5 were still available, it would likely take around 28.5 minutes per task: while it was live, we measured ~91 output tokens per second, ~3.1 minutes of tool execution time per task, and ~139,000 output tokens per task
➤ Time spent on tool calls and execution accounts for only ~12% of the total time, with the remaining amount explained by output verbosity, turn usage, and inference speed
For more details: https://artificialanalysis.ai/articles/aa-briefcase
Read the latest

Announcing the Artificial Analysis Speech to Speech Index
Announcing the Artificial Analysis Speech to Speech Index, our new synthesis metric for native Speech to Speech model quality, comprising of Big Bench Audio, Full Duplex Bench, and 𝜏-Voice
June 23, 2026

Announcing AA-Briefcase: a frontier knowledge work evaluation
AA-Briefcase is a new benchmark for testing models on realistic knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week knowledge work projects, each with many linked tasks and thousands of input source files, combining rubric and pairwise grading to evaluate verifiable task success, analytical quality, and presentation quality.
June 18, 2026

GLM-5.2 is the new leading open weights model on the Artificial Analysis Intelligence Index
Benchmarks and Analysis of GLM-5.2
June 16, 2026