AI Hardware Benchmarking & Performance Analysis
We measure real-world performance of AI accelerator systems during language model inference.
For language model intelligence benchmarks, or API performance benchmarks, see language model comparisons.
AA-AgentPerf: The Hardware Benchmark for the Agent Era Now Live
- Real agent workloads, not synthetic queries: we've captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens
- Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, and speculative decoding. We allow the optimizations that labs and inference providers already serve in production, so that results capture what real deployments should look like
- Measures what developers need to know: max concurrent users at each target output speed, expressed per accelerator, per kW, per $/hr, and per rack
- Built for every kind of scale: designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between
- Live now: AA-AgentPerf is open for submissions of configurations to benchmark. The models supported at launch are gpt-oss-120b and DeepSeek V3.2. We'll publish results on a rolling basis.
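The per-accelerator, per-kW, per-$/hr, and per-rack views listed above all divide one measured user count by a property of the system. The sketch below illustrates that arithmetic only. `SystemConfig`, `normalize`, the field names, and the example figures are hypothetical and are not taken from the AA-AgentPerf methodology.

```python
from dataclasses import dataclass


@dataclass
class SystemConfig:
    """Hypothetical description of a benchmarked system."""
    accelerators: int     # number of accelerators in the system
    power_kw: float       # system power draw in kW
    cost_per_hour: float  # rental or amortized cost in USD per hour
    racks: float          # racks occupied (fractional for small systems)


def normalize(max_users: int, cfg: SystemConfig) -> dict[str, float]:
    """Express one max-concurrent-users figure in the four normalized forms."""
    return {
        "per_accelerator": max_users / cfg.accelerators,
        "per_kw": max_users / cfg.power_kw,
        "per_dollar_hour": max_users / cfg.cost_per_hour,
        "per_rack": max_users / cfg.racks,
    }


# Made-up numbers, for illustration only: an 8-accelerator node that
# sustains 96 concurrent users at some target output speed.
example = normalize(96, SystemConfig(accelerators=8, power_kw=12.0,
                                     cost_per_hour=32.0, racks=0.25))
```

Since the benchmark reports a maximum user count at each target output speed, a calculation like this would be repeated once per speed target.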
AA-AgentPerf has been shaped by our work with inference providers and engagement with AI accelerator companies, developers, and enterprise buyers over the past year. Our goal is for anyone deploying models - whether buying or leasing accelerators - to be able to use AA-AgentPerf as the definitive resource for understanding real-world hardware performance.
Read the full methodology
More concurrent agents means higher total throughput but slower per-user speed. AgentPerf measures exactly where each system hits this trade-off.
Max Concurrent Users per System
First results coming soon!
Concurrent Users vs. Output Speed
First results coming soon!
How AA-AgentPerf measures maximum user count
| Phase | Users | P25 Speed | Result |
|---|---|---|---|
AgentPerf uses a binary search to find the maximum number of concurrent users each system can sustain while meeting output speed and time-to-first-token performance targets.
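A minimal sketch of such a search follows, under stated assumptions and not as AA-AgentPerf's actual implementation. It assumes that passing is monotonic (if N users meet the targets, so does every smaller count), that a doubling ramp is used to find a failing upper bound before bisecting, and that speed is judged on the 25th-percentile per-user output speed suggested by the "P25 Speed" column. `Measurement`, `measure`, `max_concurrent_users`, and `toy_system` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Measurement:
    p25_speed: float  # 25th-percentile per-user output speed, tokens/s
    ttft: float       # time to first token, seconds


def max_concurrent_users(
    measure: Callable[[int], Measurement],
    min_speed: float,
    max_ttft: float,
    user_limit: int = 4096,
) -> int:
    """Return the largest user count in [0, user_limit] meeting both targets."""

    def passes(users: int) -> bool:
        m = measure(users)  # one benchmark run at this concurrency
        return m.p25_speed >= min_speed and m.ttft <= max_ttft

    if user_limit < 1 or not passes(1):
        return 0

    # Ramp phase: double the load until a run fails or the cap is exceeded.
    lo, hi = 1, 2
    while hi <= user_limit and passes(hi):
        lo, hi = hi, hi * 2
    hi = min(hi, user_limit + 1)  # hi is a failing or out-of-range count

    # Search phase: lo always passes, hi never does; bisect until adjacent.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid
        else:
            hi = mid
    return lo


def toy_system(users: int) -> Measurement:
    """Made-up stand-in for a real run: speed falls and TTFT rises with load."""
    return Measurement(p25_speed=6000 / (users + 20), ttft=0.5 + 0.01 * users)


print(max_concurrent_users(toy_system, min_speed=50.0, max_ttft=5.0))  # 100
```

In this toy model the speed target is the binding constraint, since 6000 / (100 + 20) is exactly 50 tokens/s. The doubling ramp keeps the number of runs logarithmic in the answer, which matters when each call to `measure` is a full benchmark run.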