Announcement · June 12, 2026
AA-AgentPerf is the first agentic inference benchmark: it replays real coding-agent trajectories and measures how many concurrent agents a system can serve while meeting production service-level targets. Its lead metric is Agents per Megawatt, the maximum number of agents served by an accelerator platform for each megawatt of power. This is the most important metric for the AI buildout in a power-constrained world. In the first results, NVIDIA’s Blackwell systems lead, showing a significant generational leap over Hopper. The scale and disaggregated-inference advantages of rack-scale deployments are also evident in both compute and power efficiency.
Live benchmark — open for submissions
The dominant AI workload of 2026 looks nothing like the workloads inference benchmarks were built for. Coding agents run for hundreds of turns, interleave reasoning with tool calls and code edits, and routinely carry contexts past 100K tokens. Serving them well is a different engineering problem from serving chat: long shared prefixes reward KV cache reuse, bursts of short outputs stress schedulers, and users still expect responsive, interactive speeds.
Most hardware benchmarks still measure synthetic requests at fixed input and output lengths, with the optimizations production deployments rely on switched off. The number buyers actually need — how many agents can this system serve at an experience users will accept? — never appears. And in a world where AI capacity is increasingly constrained by available power and the cost of energy, that question comes with a denominator. AA-AgentPerf’s lead metric, Agents per Megawatt, measures how many simultaneous agents a system can support per megawatt of measured power, within per-agent output speed and time-to-first-token targets.
AA-AgentPerf is the first agentic inference benchmark. It replays real coding-agent trajectories against a system under test and finds the maximum number of concurrent agents the system can sustain while meeting market-derived performance targets. It has been shaped by a year of work with inference providers, AI accelerator companies, developers, and enterprise buyers. It is designed to be the definitive resource for anyone buying or leasing AI hardware for agentic workloads.
Every simulated agent in AA-AgentPerf works through a real coding-agent trajectory, generated by agents resolving issues in public code repositories using top open weights models with reasoning enabled. Sessions run up to 200 turns and reach sequence lengths beyond 100K tokens:
[Figure: illustrative session. Prior turns are served from KV cache, so only new tokens are prefilled. Turn structure and token counts are modeled on the AA-AgentPerf dataset (inputs 1K–131K tokens, mean ~27K; outputs median ~150 tokens, P95 ~2K).]
This shape is what makes agent serving hard, and what uniform-length synthetic benchmarks miss entirely. Tool results balloon the context, outputs are often only a few hundred tokens, and the same prefix comes back turn after turn. A system’s KV cache behavior, scheduler, and memory hierarchy decide whether it thrives or collapses under this pattern.
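To make the effect of prefix reuse concrete, the sketch below counts how many tokens a server has to prefill over one session, with and without KV cache reuse. It is a simplified model under stated assumptions: the function name `prefill_tokens` and the turn sizes are hypothetical, not taken from the AA-AgentPerf dataset, and the cache is idealized (a session's prefix is never evicted).

```python
from typing import List, Tuple


def prefill_tokens(turns: List[Tuple[int, int]], reuse_cache: bool) -> int:
    """Total tokens prefilled over one session.

    Each turn is (new_input_tokens, output_tokens). new_input_tokens is what
    the turn appends to the context (for example a tool result), and
    output_tokens is what the model generates, which joins the context for
    the next turn.
    """
    context = 0  # tokens already in the conversation before this turn
    total = 0
    for new_input, output in turns:
        if reuse_cache:
            # Idealized prefix cache: only tokens not seen before are prefilled.
            total += new_input
        else:
            # No reuse: the whole conversation so far is prefilled again.
            total += context + new_input
        context += new_input + output
    return total


# Hypothetical four-turn session: large tool results, short outputs.
session = [(2000, 150), (8000, 150), (500, 150), (12000, 150)]
print(prefill_tokens(session, reuse_cache=True))   # 22500
print(prefill_tokens(session, reuse_cache=False))  # 45900
```

Even in this short made-up session the uncached path prefills about twice as many tokens, and the gap widens as the session gets longer because the full context is re-read on every turn.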
AA-AgentPerf is the first inference benchmark to allow the optimizations that labs and inference providers actually run in production: KV cache reuse, speculative decoding, and disaggregated prefill/decode. Benchmarks that switch these off measure a deployment style nobody ships. Allowing them captures what real deployments should look like. It also makes AA-AgentPerf a tool for understanding the impact of each new software and hardware advance, not just each new chip. Accuracy verification controls for quality loss, so an optimization cannot buy capacity at the expense of output quality.
Allowing production optimizations makes serving configuration a first-class part of the benchmark. The solution space for serving a model on a given system is wide — inference framework and version, parallelism strategy, speculative decoding setup, disaggregation topology, and dozens of tuning decisions — and it is moving fast: since DeepSeek V4 Pro’s release we have seen serving configurations for it improve on a near-daily basis.
Results reach the leaderboard through two routes, and every published result discloses which: vendors submit tuned configurations for their own systems, which we run and verify, or Artificial Analysis builds a configuration in-house where no vendor submission is available yet. In the launch results for DeepSeek V4 Pro, the B300 and GB300 configurations were submitted by NVIDIA, while the H200 and MI355X results run configurations built by Artificial Analysis. The full serving configuration behind every number, and who submitted it, is in the configurations browser.
The numbers in this initial release of AA-AgentPerf should be read as a snapshot of a fast-moving frontier, like other live performance benchmarking results reported on Artificial Analysis. They are real results that reflect performance attainable today with the disclosed configurations, but each system may have headroom for further gains, in an amount that varies by system, and we expect results to climb as further vendor submissions are made. That headroom may be greater for the MI355X and H200, whose configurations were built by the Artificial Analysis team rather than submitted by a vendor team.
Peak throughput is easy to inflate: pile on concurrency until every agent is crawling. AA-AgentPerf instead fixes the level of service and asks how far a system can scale while preserving it. Performance targets are derived from Artificial Analysis serverless API benchmarking data — the tiers of service that actually exist in the market today. Speed and latency are measured per request: P25 output speed and P95 time to first token, computed across all requests in a test phase.
Service-level tiers at launch

DeepSeek V4 Pro (max)

| Tier | Output speed | P95 TTFT |
|---|---|---|
| Tier 1 | 20 tokens/s | ≤ 10 s |
| Tier 2 | 60 tokens/s | ≤ 5 s |
| Tier 3 | 180 tokens/s | ≤ 3 s |

gpt-oss-120b (high), coming soon

| Tier | Output speed | P95 TTFT |
|---|---|---|
| Tier 1 | 100 tokens/s | ≤ 5 s |
| Tier 2 | 250 tokens/s | ≤ 3 s |
| Tier 3 | 500 tokens/s | ≤ 2 s |
| Tier 4 | 2,000 tokens/s | ≤ 1 s |
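
As a rough illustration of how a tier check could work, the sketch below tests a set of per-request measurements against the DeepSeek V4 Pro tiers listed above. Several details are assumptions rather than the benchmark's actual implementation: the `Tier` class, the `meets_tier` helper, the nearest-rank percentile method, and the reading that P25 output speed must be at or above the tier speed while P95 TTFT must be at or below the tier limit. The sample measurements are invented.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    min_output_speed: float  # tokens/s, compared against P25 output speed
    max_ttft_s: float        # seconds, compared against P95 TTFT


# Launch tiers for DeepSeek V4 Pro (max), as listed above.
DEEPSEEK_V4_PRO_TIERS = [
    Tier("Tier 1", 20, 10),
    Tier("Tier 2", 60, 5),
    Tier("Tier 3", 180, 3),
]


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile (an assumed method; the benchmark's may differ)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_tier(speeds: Sequence[float], ttfts: Sequence[float], tier: Tier) -> bool:
    """True if P25 output speed and P95 TTFT both satisfy the tier."""
    return (percentile(speeds, 25) >= tier.min_output_speed
            and percentile(ttfts, 95) <= tier.max_ttft_s)


# Invented per-request measurements from one test phase.
speeds = [50, 62, 65, 70, 71, 80, 90, 100]        # tokens/s
ttfts = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 4.0]  # seconds

for tier in DEEPSEEK_V4_PRO_TIERS:
    print(tier.name, meets_tier(speeds, ttfts, tier))
```

Using a low percentile for speed and a high percentile for latency means the check is driven by the slower requests in the phase, so a fast average cannot mask a poor tail.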
For each tier, an exponential ramp followed by a binary search finds the maximum concurrency the system sustains without violating the targets, with metrics computed over steady-state windows where all agents have been active for at least 30 seconds.
[Interactive table: capacity search log with columns Phase, Agents, P25 speed, and Result.]
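
The capacity search described above, an exponential ramp followed by a binary search, can be sketched as follows. This is a minimal illustration, not the benchmark's code: `passes` is a hypothetical callback standing in for a full test phase at a given agent count, the `limit` parameter is an added safety bound, and the search assumes that once a concurrency level fails, every higher level fails too.

```python
from typing import Callable


def max_sustained_agents(passes: Callable[[int], bool], limit: int = 1 << 20) -> int:
    """Largest agent count in [0, limit] for which passes(n) is true.

    Assumes passing is monotonic in n: if n agents fail the targets,
    every larger count fails as well.
    """
    if not passes(1):
        return 0

    # Exponential ramp: double the agent count until a phase fails
    # or the safety limit is exceeded.
    lo, hi = 1, 2  # lo is the highest count known to pass
    while hi <= limit and passes(hi):
        lo, hi = hi, hi * 2
    hi = min(hi, limit + 1)  # hi is treated as the first failing count

    # Binary search between the last passing and first failing counts.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Stand-in for a real test phase: pretend the system sustains up to 37 agents.
print(max_sustained_agents(lambda n: n <= 37))  # 37
```

The ramp finds a bracketing pair of agent counts in a logarithmic number of phases, and the binary search then narrows it to the exact maximum, which matters when each phase is a long steady-state run.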
The first wave of results covers NVIDIA and AMD systems running DeepSeek V4 Pro, measured from single accelerators up to full racks, with gpt-oss-120b results to follow. Three findings stand out:
01
Rack-scale systems' inherent scale advantages, and their ability to aggressively disaggregate inference, lead to clear gains in pure compute and Agents per Megawatt versus single-node systems.
02
The move from Hopper to Blackwell delivers a step change in the number of concurrent agents a system can sustain, in raw capacity and Agents per Megawatt.
03
Testing with different kernel optimizations and serving configuration designs moved the results significantly, reinforcing the need to optimize the entire stack.
Every result below is normalized per accelerator, per system, and per megawatt, so systems of very different sizes can be compared fairly. The per-megawatt figures are computed on measured accelerator power under load, not rated thermal design power (TDP).
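The three normalizations can be illustrated with a small sketch. The `SystemResult` class, its field names, and every number in the example are hypothetical and chosen only to show the arithmetic; none of them are AA-AgentPerf results. The power figure is assumed to be the measured accelerator power for the whole system under load, in watts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemResult:
    name: str
    agents: int                  # max concurrent agents sustained at one tier
    accelerators: int            # accelerators in the system under test
    measured_power_watts: float  # measured accelerator power under load, not TDP

    @property
    def agents_per_system(self) -> int:
        return self.agents

    @property
    def agents_per_accelerator(self) -> float:
        return self.agents / self.accelerators

    @property
    def agents_per_megawatt(self) -> float:
        # 1 MW = 1,000,000 W
        return self.agents * 1_000_000 / self.measured_power_watts


# Made-up systems of very different sizes, for illustration only.
node = SystemResult("hypothetical 8-accelerator node", 64, 8, 8_000.0)
rack = SystemResult("hypothetical 72-accelerator rack", 900, 72, 90_000.0)

for s in (node, rack):
    print(s.name, s.agents_per_system, s.agents_per_accelerator, s.agents_per_megawatt)
```

In this invented example the rack serves far more agents in absolute terms, but the per-accelerator and per-megawatt figures are what allow it to be compared directly with the single node.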
Kernel optimizations and serving configuration design for DeepSeek V4 Pro on AMD systems are still at an early stage, so we expect significant improvements in AMD performance in the near term. We look forward to working with all vendors to measure how their performance improves over time.
The charts below are live views of the AA-AgentPerf leaderboard — the same data as the hardware benchmarks dashboard, updated as new results land. Switch between normalization lenses (per accelerator, per system, per megawatt) for agent capacity, and drill into the capacity search behind every number.
[Chart: Agent capacity. Concurrent agents and output throughput each system sustains at each output speed target.]
AA-AgentPerf is a live benchmark. Submissions of configurations for benchmarking are open now, and results are published on a rolling basis as new hardware, software stacks, and model versions become available. Vendors can also submit improved configurations for systems already on the leaderboard; as serving software advances, published results are updated to reflect how real-world performance evolves. The benchmark is designed to evaluate every kind of system fairly, from a single accelerator to a full rack, and from DRAM-only architectures to SRAM-only designs and everything in between.
If you are a hardware vendor or inference provider and want your system benchmarked, reach out to agentperf@artificialanalysis.ai. We will be expanding model coverage, system coverage, and workload breadth on an ongoing basis.
AA-AgentPerf will be upgraded continuously. Planned expansions include: