June 12, 2026 · Live benchmark, open for submissions

First results from AA-AgentPerf: the hardware benchmark for the agent era

AA-AgentPerf is the first agentic inference benchmark: it replays real coding-agent trajectories and measures how many concurrent agents a system can serve while meeting production service-level targets. Its lead metric is Agents per Megawatt, the maximum number of agents an accelerator platform can serve for each megawatt of power it draws. This is the most important metric for the AI buildout in a power-constrained world. In the first results, NVIDIA’s Blackwell systems lead, showing a significant generational leap over Hopper. The scale and disaggregated-inference advantages of rack-scale deployments are also evident in both compute and power efficiency.

The benchmark for the agent era

The dominant AI workload of 2026 looks nothing like the workloads inference benchmarks were built for. Coding agents run for hundreds of turns, interleave reasoning with tool calls and code edits, and routinely carry contexts past 100K tokens. Serving them well is a different engineering problem from serving chat: long shared prefixes reward KV cache reuse, bursts of short outputs stress schedulers, and users still expect responsive, interactive speeds.

Most hardware benchmarks still measure synthetic requests at fixed input and output lengths, with the optimizations production deployments rely on switched off. The number buyers actually need — how many agents can this system serve at an experience users will accept? — never appears. And in a world where AI capacity is increasingly constrained by available power and the cost of energy, that question comes with a denominator. AA-AgentPerf’s lead metric, Agents per Megawatt, measures how many simultaneous agents a system can support per megawatt of measured power, within per-agent output speed and time-to-first-token targets.

AA-AgentPerf is the first agentic inference benchmark. It replays real coding-agent trajectories against a system under test and finds the maximum number of concurrent agents the system can sustain while meeting market-derived performance targets. It has been shaped by a year of work with inference providers, AI accelerator companies, developers, and enterprise buyers. It is designed to be the definitive resource for anyone buying or leasing AI hardware for agentic workloads.

More concurrent agents means higher total throughput but slower per-agent output speed. AA-AgentPerf measures exactly where each system hits this trade-off.

Real coding agent trajectories

Every simulated agent in AA-AgentPerf works through a real coding-agent trajectory. The trajectories were generated by agents resolving issues in public code repositories, using leading open-weights models with reasoning enabled. Sessions run up to 200 turns and reach sequence lengths beyond 100K tokens:

Input lengths range from ~5K to ~131K tokens per request, with a mean of roughly 27K — driven by tool outputs and accumulated history rather than the prompt itself
Output lengths vary widely across turns: agents mostly emit short tool calls and edits, punctuated by longer stretches of reasoning
12+ programming languages are represented, based on the primary language of the source repository

The test set stays private. Participants receive a representative tuning subset for configuration validation; the full dataset is held out to prevent benchmark-targeted optimization.

[Interactive figure: trajectory trace of a 16-turn session for the task "parser tests failing in CI". Each turn is broken down into task input, reasoning, tool call, tool result, and code edit, with a running count of context tokens. Prior turns are served from KV cache and only new tokens are prefilled.]

Illustrative session: turn structure and token counts modeled on the AA-AgentPerf dataset (inputs 1K–131K tokens, mean ~27K · outputs median ~150 tokens, P95 ~2K)

Context compounds turn over turn: the prefix is served from KV cache, only new tokens are prefilled, and the session grows past 100K tokens.

This shape is what makes agent serving hard, and what uniform-length synthetic benchmarks miss entirely. Tool results balloon the context, outputs are often only a few hundred tokens, and the same prefix comes back turn after turn. A system’s KV cache behavior, scheduler, and memory hierarchy decide whether it thrives or collapses under this pattern.
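To make the compounding concrete, here is a minimal sketch of how many tokens a server has to prefill over one session, with and without prefix reuse. The per-turn token counts are invented for illustration and are not taken from the AA-AgentPerf dataset. The function name `prefill_tokens` is hypothetical, and the model assumes the whole prior context is a perfect KV cache hit, which real systems only approximate.

```python
def prefill_tokens(new_tokens_per_turn, reuse_prefix):
    """Total tokens prefilled over a session.

    new_tokens_per_turn: tokens appended to the context before each turn
        (task input, tool results, and the previous turn's output).
    reuse_prefix: if True, assume the prior context is a full KV cache hit,
        so only the newly appended tokens are prefilled.
    """
    context = 0
    total = 0
    for new in new_tokens_per_turn:
        context += new
        # Without reuse, the entire context is re-prefilled on every turn.
        total += new if reuse_prefix else context
    return total


# Hypothetical session: a 5K-token task, then 16 turns adding 6K tokens each.
turns = [5_000] + [6_000] * 16

with_cache = prefill_tokens(turns, reuse_prefix=True)
without_cache = prefill_tokens(turns, reuse_prefix=False)

print(f"final context:      {sum(turns):>9,} tokens")
print(f"prefill with reuse: {with_cache:>9,} tokens")
print(f"prefill, no reuse:  {without_cache:>9,} tokens")
```

In this toy session the context ends at 101,000 tokens. With full prefix reuse the server prefills each token once; without it, the same session costs roughly nine times as much prefill work, which is why cache behavior matters so much for this workload.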

Measuring the modern production stack

AA-AgentPerf is the first inference benchmark to allow the optimizations that labs and inference providers actually run in production: KV cache reuse, speculative decoding, and disaggregated prefill/decode. Benchmarks that switch these off measure a deployment style nobody ships. Allowing them captures what real deployments should look like. It also makes AA-AgentPerf a tool for understanding the impact of each new software and hardware advance, not just each new chip. Accuracy verification controls for quality loss, so an optimization cannot buy capacity at the expense of output quality.

Allowing production optimizations makes serving configuration a first-class part of the benchmark. The solution space for serving a model on a given system is wide — inference framework and version, parallelism strategy, speculative decoding setup, disaggregation topology, and dozens of tuning decisions — and it is moving fast: since DeepSeek V4 Pro’s release we have seen serving configurations for it improve on a near-daily basis.

Results reach the leaderboard through two routes, and every published result discloses which: vendors submit tuned configurations for their own systems, which we run and verify, or Artificial Analysis builds a configuration in-house where no vendor submission is available yet. In the launch results for DeepSeek V4 Pro, the B300 and GB300 configurations were submitted by NVIDIA, while the H200 and MI355X results run configurations built by Artificial Analysis. The full serving configuration behind every number, and who submitted it, is in the configurations browser.

The numbers in this initial release of AA-AgentPerf should be read as a snapshot of a fast-moving frontier, like other live performance benchmarking results reported on Artificial Analysis. They are real results that reflect the performance attainable today with the disclosed configurations, but each system is likely to have some headroom left, and we expect results to climb as further vendor submissions arrive. That headroom may be greater for the MI355X and H200, whose configurations were built by the Artificial Analysis team rather than submitted by a vendor.

Capacity at real service levels

Peak throughput is easy to inflate: pile on concurrency until every agent is crawling. AA-AgentPerf instead fixes the level of service and asks how far a system can scale while preserving it. Performance targets are derived from Artificial Analysis serverless API benchmarking data — the tiers of service that actually exist in the market today. Speed and latency are measured per request: P25 output speed and P95 time to first token, computed across all requests in a test phase.
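A service-level check of this kind can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the `Tier` class, `percentile`, and `meets_tier` are hypothetical names, the nearest-rank percentile method is an assumption (the methodology may interpolate differently), and the per-request samples are invented. The tier values used are the DeepSeek V4 Pro Tier 2 targets published with the launch (60 tokens/s, P95 TTFT ≤ 5 s). The sketch reads the speed target as a floor on P25 output speed, meaning at least three quarters of requests run at or above it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    min_output_speed: float  # tokens/s, compared against P25 output speed
    max_ttft: float          # seconds, compared against P95 time to first token


def percentile(values, p):
    """Nearest-rank percentile (assumed method, chosen for simplicity)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_tier(output_speeds, ttfts, tier):
    """True if P25 output speed and P95 TTFT both satisfy the tier."""
    return (percentile(output_speeds, 25) >= tier.min_output_speed
            and percentile(ttfts, 95) <= tier.max_ttft)


tier2 = Tier("DeepSeek V4 Pro, Tier 2", min_output_speed=60, max_ttft=5)

# Hypothetical per-request measurements from one test phase.
speeds = [58, 61, 63, 66, 70, 72, 75, 80]           # tokens/s
ttfts = [1.2, 1.5, 1.9, 2.4, 2.8, 3.1, 3.9, 4.6]    # seconds

print(meets_tier(speeds, ttfts, tier2))
```

Using a low percentile for speed and a high percentile for latency means a phase passes only if the slower requests are still acceptable, not merely the average one.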

Service-level tiers at launch

DeepSeek V4 Pro (max)

Tier	Output speed target	TTFT target
Tier 1	20 tokens/s	P95 TTFT ≤ 10 s
Tier 2	60 tokens/s	P95 TTFT ≤ 5 s
Tier 3	180 tokens/s	P95 TTFT ≤ 3 s

gpt-oss-120b (high), coming soon

Tier	Output speed target	TTFT target
Tier 1	100 tokens/s	P95 TTFT ≤ 5 s
Tier 2	250 tokens/s	P95 TTFT ≤ 3 s
Tier 3	500 tokens/s	P95 TTFT ≤ 2 s
Tier 4	2,000 tokens/s	P95 TTFT ≤ 1 s

[Interactive figure: choose a target output speed per agent to see the maximum number of concurrent agents sustained at that target, reported per MW, per accelerator, or per system. Example state: 8 agents sustained at ≥ 300 tokens/s each.]

Every system gets a result at every tier: relaxing the speed target frees capacity for more agents. Tier targets shown for DeepSeek V4 Pro; agent counts are illustrative.

For each tier, an exponential ramp followed by a binary search finds the maximum concurrency the system sustains without violating the targets, with metrics computed over steady-state windows where all agents have been active for at least 30 seconds.

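[Interactive figure: capacity search. The ramp phase doubles the load until a target breaks, and each step is listed by phase, agents, P25 speed, and result. Example outcome: max agents = 11.]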

The capacity search in action: ramp, probe, and converge on the maximum supported concurrency for a service level.
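The ramp-then-bisect search can be sketched in a few lines. This is a minimal illustration, assuming that passing is monotonic in load (once a concurrency level breaks a target, every higher level does too). The callback `meets_targets` is a hypothetical stand-in for running a full test phase at a given concurrency and checking its steady-state metrics, and the `limit` parameter is an added safety bound that the source does not describe.

```python
def max_concurrency(meets_targets, limit=1 << 20):
    """Largest n <= limit for which meets_targets(n) is True.

    Assumes monotonicity: if a load level fails, all higher levels fail.
    Returns 0 if even a single agent misses the targets.
    """
    if not meets_targets(1):
        return 0

    # Ramp: double the load until a target breaks (or the limit is reached).
    lo, hi = 1, 2                     # lo = highest load known to pass
    while hi <= limit and meets_targets(hi):
        lo, hi = hi, hi * 2
    hi = min(hi, limit + 1)           # treat anything above limit as failing

    # Binary search between lo (passes) and hi (fails).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if meets_targets(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Toy system that sustains up to 11 agents; record every load level probed.
probes = []

def toy_system(n):
    probes.append(n)
    return n <= 11

print(max_concurrency(toy_system))
print(probes)
```

For the toy system the search probes 1, 2, 4, 8, and 16 agents during the ramp, then 12, 10, and 11 during the bisection, and settles on 11. Since each probe stands for a full load test, keeping the number of probes logarithmic in the final answer is what makes the search practical.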

What the first results show

The first wave of results covers NVIDIA and AMD systems running DeepSeek V4 Pro, measured from single accelerators up to full racks, with gpt-oss-120b results to follow. Three findings stand out:

Rack-scale shows its efficiency

The inherent scale advantages of rack-scale systems, and their ability to aggressively disaggregate inference, lead to clear gains over single-node deployments in both raw compute and Agents per Megawatt.

A generational leap

The move from Hopper to Blackwell delivers a step change in the number of concurrent agents a system can sustain, both in raw capacity and in Agents per Megawatt.

Real-world optimizations are critical

Testing with different kernel optimizations and serving configuration designs moved the results significantly, reinforcing the need to optimize the entire stack.

Every result below is normalized per accelerator, per system, and per megawatt, so systems of very different sizes can be compared fairly. The per-megawatt figures are computed on measured accelerator power under load, not rated thermal design power (TDP).
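The three normalizations amount to simple arithmetic, sketched below. The function name `normalize` and all input numbers are hypothetical and are not results from the leaderboard. The sketch assumes the power figure is the total measured accelerator power of the system under load, in watts.

```python
def normalize(max_agents, num_accelerators, measured_power_watts):
    """Express a system's maximum concurrent agents in three ways.

    measured_power_watts: total accelerator power measured under load
        (assumed here to cover all accelerators in the system), not rated TDP.
    """
    if num_accelerators <= 0 or measured_power_watts <= 0:
        raise ValueError("accelerator count and power must be positive")
    return {
        "per_system": max_agents,
        "per_accelerator": max_agents / num_accelerators,
        # 1 MW = 1,000,000 W
        "per_megawatt": max_agents * 1_000_000 / measured_power_watts,
    }


# Hypothetical 8-accelerator system sustaining 64 agents at 8 kW measured power.
result = normalize(max_agents=64, num_accelerators=8, measured_power_watts=8_000)
for lens, value in result.items():
    print(f"{lens}: {value:,.0f}")
```

Dividing by measured power rather than rated TDP means a system that draws less than its rating under this workload gets credit for it in the per-megawatt figure.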

Kernel optimizations and configuration design for DeepSeek V4 Pro on AMD systems are still at an early stage, so we expect significant improvements in AMD performance in the near term. We look forward to working with all vendors to measure how their performance improves over time.

Explore the DeepSeek V4 Pro results

The charts below are live views of the AA-AgentPerf leaderboard — the same data as the hardware benchmarks dashboard, updated as new results land. Switch between normalization lenses (per accelerator, per system, per megawatt) for agent capacity, and drill into the capacity search behind every number.


Agent capacity

Concurrent agents and output throughput each system sustains at each output speed target.

Max Concurrent Agents per MW of Accelerator Power

Maximum concurrent agents sustained per megawatt of accelerator power at each service level objective (SLO)

Values are normalized by GPU power draw in megawatts, based on measured GPU-only power (GPU die plus HBM, excluding CPUs, networking, and cooling overhead).

A live benchmark, open for submissions

AA-AgentPerf is a live benchmark. Submissions of configurations for benchmarking are open now, and results are published on a rolling basis as new hardware, software stacks, and model versions become available. Vendors can also submit improved configurations for systems already on the leaderboard; as serving software advances, published results are updated to reflect how real-world performance evolves. The benchmark is designed to evaluate every kind of system fairly, from a single accelerator to a full rack, and from DRAM-only architectures to SRAM-only designs and everything in between.

If you are a hardware vendor or inference provider and want your system benchmarked, reach out to agentperf@artificialanalysis.ai. We will be expanding model coverage, system coverage, and workload breadth on an ongoing basis.

What’s next

AA-AgentPerf will be upgraded continuously. Planned expansions include:

Longer context lengths — extending trajectories up to 1M tokens
Broader model coverage — gpt-oss-120b results across all submitted systems, with additional models to follow
Broader hardware coverage — results for more NVIDIA and AMD systems, and additional accelerator architectures
Tool execution performance — testing CPU performance for executing agent tool calls as part of the benchmark
Results over time — tracking how each system’s numbers improve as serving configurations and frameworks advance
Deeper analysis of token economics and pricing — including agent capacity per $/hr of rental cost, cost per task, and total cost of ownership

Go deeper

Full results, normalization lenses, and serving configurations live on the hardware benchmarks page, updated as new results land
The AA-AgentPerf methodology documents the dataset, service-level tiers, test execution, and metrics in detail
The configurations browser lists the full serving configuration behind every published result
Submissions: agentperf@artificialanalysis.ai
