AI Hardware Benchmarking & Performance Analysis
We measure real-world performance of AI accelerator systems during language model inference.
For language model intelligence benchmarks or API performance benchmarks, see language model comparisons.
AA-AgentPerf: The Hardware Benchmark for the Agent Era
- Real agent workloads, not synthetic queries: we've captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens
- Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, speculative decoding - we're allowing the optimizations that labs and inference providers are serving in production so that we can capture what real deployments should look like
- Measures what developers need to know: max concurrent users at each target output speed, expressed per accelerator, per kW, per $/hr, and per rack
- Built for every kind of scale: designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between
- Live now: AA-AgentPerf is open for submissions of configurations to benchmark. The models supported at launch are gpt-oss-120b and DeepSeek V4 Pro. We'll be publishing results on a rolling basis.
AA-AgentPerf has been shaped by our work with inference providers and engagement with AI accelerator companies, developers, and enterprise buyers over the past year. Our goal is for anyone deploying models - whether buying or leasing accelerators - to be able to use AA-AgentPerf as the definitive resource for understanding real-world hardware performance. Read the full methodology
More concurrent agents means higher total throughput but slower per-user speed — AgentPerf measures exactly where each system hits this trade-off.
Max Concurrent Users per System
First results coming soon!
Concurrent Users vs. Output Speed
First results coming soon!
How AA-AgentPerf measures maximum user count
| Phase | Users | P25 Speed | Result |
|---|---|---|---|
AgentPerf uses a binary search to find the maximum number of concurrent users each system can sustain while meeting output speed and time-to-first-token performance targets.
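The search described above can be sketched as follows. This is a minimal illustration, not AA-AgentPerf's actual implementation: `run_load_test`, `LoadResult`, the use of P25 speed as the pass criterion, the search bounds, and the toy system model are all hypothetical. The sketch also assumes that adding users never improves per-user speed or time-to-first-token, which is what makes a binary search valid.

```python
from typing import Callable, NamedTuple


class LoadResult(NamedTuple):
    """Hypothetical summary of one load-test run at a fixed user count."""
    p25_output_speed: float  # tokens/s per user, 25th percentile
    ttft_seconds: float      # time to first token


def max_concurrent_users(
    run_load_test: Callable[[int], LoadResult],
    min_speed: float,
    max_ttft: float,
    upper_bound: int,
) -> int:
    """Return the largest user count in [0, upper_bound] meeting both targets."""

    def passes(users: int) -> bool:
        result = run_load_test(users)
        return (result.p25_output_speed >= min_speed
                and result.ttft_seconds <= max_ttft)

    lo, hi = 0, upper_bound  # invariant: lo passes (0 users passes trivially)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if passes(mid):
            lo = mid       # mid is sustainable, search higher
        else:
            hi = mid - 1   # mid misses a target, search lower
    return lo


def toy_system(users: int) -> LoadResult:
    # Hypothetical model: speed falls and TTFT rises as load grows.
    return LoadResult(
        p25_output_speed=400.0 / (1 + users / 16),
        ttft_seconds=0.2 + 0.01 * users,
    )


print(max_concurrent_users(toy_system, min_speed=100.0, max_ttft=2.0,
                           upper_bound=1024))  # 48
```

In the toy model the speed target is the binding constraint: at 48 users the per-user speed is exactly 100 tokens/s, and at 49 it drops below the target.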
System Load Test (AA-SLT)
Our original hardware benchmark, covering a wide range of systems. Read the methodology
Highlights
Throughput
System Output Throughput at 100 tokens/s Per Query Output Speed
Output Speed
Peak Output Speed per Query
Throughput vs Speed
System Output Throughput vs. Output Speed per Query
System Output Throughput & Output Speed per Query vs. Concurrency
Cost
Cost per Million Input and Output Tokens at 100 tokens/s Per Query Output Speed
Concurrency
End-to-End Latency vs. Concurrency
Pricing
Price per GPU Hour (On-Demand)
Frequently Asked Questions
AA-AgentPerf is Artificial Analysis' hardware benchmark for AI agent workloads. It measures how AI accelerators perform on real coding agent trajectories — with up to 200 turns and sequence lengths exceeding 100K tokens — rather than synthetic single-turn prompts. AA-AgentPerf reports each system's maximum concurrent agent capacity at a target output speed and time-to-first-token, normalized per accelerator, per kW, per dollar per hour, and per rack.
There is no single best accelerator for AI agent inference — the right choice depends on whether you optimize for total concurrent agent capacity, output speed per agent, power efficiency, or cost. AA-AgentPerf compares NVIDIA GPUs, AMD GPUs, Google TPUs, and custom AI accelerators on the same agentic workloads, with results broken out per accelerator, per kW, per dollar per hour, and per rack.
Agentic workloads have long sequence lengths (often >100K tokens), many sequential turns (up to 200+ in coding agents), and benefit substantially from production inference optimizations such as KV cache reuse, disaggregated prefill and decode, and speculative decoding. Standard LLM benchmarks built on short prompts and single-turn responses do not capture how accelerators handle these patterns, which is why we built AA-AgentPerf.
AA-AgentPerf runs a binary search per system to find the maximum number of concurrent AI agents each hardware configuration can sustain while meeting both an output speed target (tokens per second per agent) and a time-to-first-token (TTFT) target. The resulting concurrent user count can be normalized per accelerator, per kW, per dollar per hour, or per rack, so the same agentic workload is comparable across different AI accelerator architectures.
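As a rough illustration of the normalization step, the sketch below divides a maximum concurrent user count by each system-level quantity. The `SystemSpec` fields and all example values are hypothetical and are not benchmark results; how AA-AgentPerf measures power, price, or rack footprint is not specified here.

```python
from dataclasses import dataclass


@dataclass
class SystemSpec:
    """Hypothetical description of a benchmarked system."""
    accelerators: int      # number of accelerators in the system
    power_kw: float        # system power draw in kW
    price_per_hour: float  # system price in $/hr
    racks: float           # fraction (or count) of racks occupied


def normalize(max_users: int, spec: SystemSpec) -> dict[str, float]:
    """Express one concurrent-user result in each comparison unit."""
    return {
        "per_accelerator": max_users / spec.accelerators,
        "per_kw": max_users / spec.power_kw,
        "per_dollar_per_hour": max_users / spec.price_per_hour,
        "per_rack": max_users / spec.racks,
    }


# Made-up example: 64 users on an 8-accelerator node drawing 10 kW,
# priced at $40/hr and occupying a quarter of a rack.
example = normalize(64, SystemSpec(accelerators=8, power_kw=10.0,
                                   price_per_hour=40.0, racks=0.25))
print(example)
```

Dividing a single measured count by different denominators is what lets the same result be read from the perspective of a buyer (per accelerator, per rack), an operator (per kW), or a renter (per dollar per hour).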
AA-AgentPerf supports gpt-oss-120b and DeepSeek V4 Pro at launch, running on real agentic coding trajectories. We will add additional models on a rolling basis.
AA-AgentPerf permits the inference optimizations that AI labs and inference providers deploy in production — KV cache reuse, disaggregated prefill/decode, and speculative decoding. The goal is to reflect how hardware performs in realistic agent deployments, not in an artificially constrained setup, so the reported agentic throughput and latency match what buyers see when they deploy.
AA-SLT is our original hardware benchmark for language model inference, measuring throughput, output speed, and cost across a wide range of AI accelerator systems on short- and medium-length prompts. AA-AgentPerf focuses on agentic AI workloads (long-context, multi-turn coding agents) and reports concurrent agent capacity instead of single-query throughput. Both benchmarks live on the same hardware page so AI accelerators can be compared on either workload.
In the current Artificial Analysis System Load Test (AA-SLT), NVIDIA's B200 is the most performant accelerator for LLM inference. It leads on peak throughput and output speed per query, though the right choice can still vary by model, deployment goal, and budget.
NVIDIA's B200 powers the highest-throughput result in the current Artificial Analysis System Load Test (AA-SLT). The top benchmark is 8xB200 (SXM) serving gpt-oss-120b (high), reaching 92,909 output tokens per second at peak throughput.
NVIDIA's B200 powers the fastest single-query result in the current Artificial Analysis System Load Test (AA-SLT). The top benchmark is 8xB200 (SXM) serving gpt-oss-120b (high), reaching 403 output tokens per second per query.
8xB200 (SXM) serving gpt-oss-120b (high) has the best cost efficiency in the current Artificial Analysis System Load Test (AA-SLT), at $0.19 per one million input and one million output tokens. Artificial Analysis compares systems using cost per one million input and one million output tokens at a model-specific reference speed, so the most cost-efficient hardware depends on both the model and the target output speed.
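To show why cost efficiency depends on both hardware price and throughput at the chosen speed, here is a generic sketch of converting an hourly system price into a cost per million tokens. It is not the AA-SLT formula, which combines input and output tokens in a way not described here; the function name and the example numbers are hypothetical.

```python
def cost_per_million_tokens(price_per_hour: float,
                            tokens_per_second: float) -> float:
    """Generic conversion: $/hr and system tokens/s to $ per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour * 1_000_000 / tokens_per_hour


# Made-up example: a $36/hr system sustaining 10,000 tokens/s in total
# processes 36 million tokens per hour.
print(cost_per_million_tokens(36.0, 10_000))  # 1.0
```

Because system throughput changes with the per-query speed target, the same hardware can yield a different cost figure at a different reference speed.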
In the current Artificial Analysis System Load Test (AA-SLT), each of the following models works best on 8xB200 (SXM) with NVIDIA's B200. Peak throughput by model:
- DeepSeek R1 0528 (May '25): 45,677 output tokens per second
- Llama 4 Maverick: 48,198 output tokens per second
- gpt-oss-120b (high): 92,909 output tokens per second