AI Hardware Benchmarking & Performance Analysis

We measure real-world performance of AI accelerator systems during language model inference.

For language model intelligence benchmarks or API performance benchmarks, see our language model comparisons.

AA-AgentPerf: The Hardware Benchmark for the Agent Era Now Live

  • Real agent workloads, not synthetic queries: we've captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens
  • Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, speculative decoding - we allow the optimizations that labs and inference providers use in production, so results reflect what real deployments look like
  • Measures what developers need to know: max concurrent users at each target output speed, expressed per accelerator, per kW, per $/hr, and per rack
  • Built for every kind of scale: designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between
  • Live now: AA-AgentPerf is now open for submissions of configurations for benchmarking. The models supported at launch are gpt-oss-120b and DeepSeek V3.2. We'll be publishing results on a rolling basis.

AA-AgentPerf has been shaped by our work with inference providers and by engagement with AI accelerator companies, developers, and enterprise buyers over the past year. Our goal is for anyone deploying models - whether buying or leasing accelerators - to be able to use AA-AgentPerf as the definitive resource for understanding real-world hardware performance. Read the full methodology.

[Diagram: an AI server serving nine concurrent agents, illustrating the trade-off between system throughput and output speed per user]

More concurrent agents means higher total throughput but slower per-user speed — AgentPerf measures exactly where each system hits this trade-off.

Max Concurrent Users per System

Maximum concurrent users sustained per system at each output speed SLO (TTFT SLO fixed)
Output Speed SLO
30 tok/s
100 tok/s
300 tok/s

First results coming soon!

Concurrent Users vs. Output Speed

Concurrent Users vs. P25 Output Speed per Query (Tokens per Second)

First results coming soon!

How AA-AgentPerf measures maximum user count

[Interactive chart: P25 output speed (t/s) vs. concurrent users, with the SLO threshold at 100 t/s; each phase steps up the user count until the SLO is violated. Example result: max users = 11]

AgentPerf uses a binary search to find the maximum number of concurrent users each system can sustain while meeting output speed and time-to-first-token performance targets.
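The search described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: `run_load_test` is a hypothetical stand-in for a real load-generation run, and the SLO values and monotonicity assumption (more users never improves per-user speed) are ours.

```python
# Sketch of a binary search for the max sustainable concurrent users.
# run_load_test(users) is a hypothetical callable that runs a load test
# and returns (p25_output_speed_tok_s, p75_ttft_s) at that concurrency.

def meets_slo(users, run_load_test, speed_slo=100.0, ttft_slo=2.0):
    """True if the system meets both SLOs at this concurrency level."""
    p25_speed, ttft = run_load_test(users)
    return p25_speed >= speed_slo and ttft <= ttft_slo

def max_concurrent_users(run_load_test, upper=4096):
    """Largest user count that still meets the SLOs (assumes per-user
    speed degrades monotonically as concurrency grows)."""
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if meets_slo(mid, run_load_test):
            lo = mid        # SLOs hold: try more users
        else:
            hi = mid - 1    # SLOs violated: back off
    return lo
```

For example, a system whose P25 speed behaves like `400 / users` tok/s (with TTFT always in budget) would report a maximum of 4 users at a 100 tok/s SLO.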

System Load Test (AA-SLT)

Our original hardware benchmark, covering a wide range of systems
  • Peak System Output Throughput, Llama 3.3 70B: total system output tokens per second; higher is better
  • Peak Output Speed per Query, Llama 3.3 70B: output tokens per second per query; higher is better
  • Rental Price (On-Demand): minimum rental price per GPU per hour, USD; lower is better

System Output Throughput at 100 tokens/s Per Query Output Speed

gpt-oss-120B (high) | System Output Throughput (Tokens per Second) at 100 tokens/s Output Speed

The total number of tokens that can be processed per second across all concurrent requests. This metric measures the overall system capacity and efficiency, taking into account both the per-query output speed and the system's ability to handle multiple concurrent requests.

  • Max Throughput: Optimized for the highest sustained request volume.
  • Minimum Latency: Tuned to deliver the fastest response times.
  • Optimal: Combines max throughput and minimum latency configurations by selecting the configuration with higher system throughput at a given concurrency level.
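Measuring system throughput from benchmark logs can be sketched as below. The record format `(start_time, end_time, output_tokens)` is our assumption for illustration, not the benchmark's actual schema.

```python
# Sketch: system output throughput from hypothetical per-request records
# of the form (start_time_s, end_time_s, output_tokens).

def system_output_throughput(requests):
    """Total output tokens per second across all concurrent requests,
    measured over the wall-clock span of the benchmark phase."""
    start = min(r[0] for r in requests)
    end = max(r[1] for r in requests)
    total_tokens = sum(r[2] for r in requests)
    return total_tokens / (end - start)
```

Three requests that each stream 500 tokens over the same 10-second window, for instance, yield a system throughput of 150 tok/s even though each individual query only sees 50 tok/s.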

Peak Output Speed per Query

gpt-oss-120B (high) | Peak Output Speed per Query (Tokens per Second)

Tokens per second received by each individual query after the first chunk has been received. Represented as the median result of all queries in each concurrency phase.
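The per-query metric above can be sketched as follows; the tuple format `(t_first_chunk, t_last_chunk, output_tokens)` is a hypothetical record layout chosen for illustration.

```python
import statistics

# Sketch: per-query output speed measured after the first chunk arrives
# (i.e., excluding time to first token), then the median across all
# queries in a concurrency phase.

def output_speed(query):
    """Tokens per second streamed after the first chunk."""
    t_first, t_last, tokens = query
    return (tokens - 1) / (t_last - t_first)

def median_output_speed(queries):
    """Median per-query output speed for a concurrency phase."""
    return statistics.median(output_speed(q) for q in queries)
```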

System Output Throughput vs Output Speed per Query

gpt-oss-120B (high) | System Output Throughput (Tokens per Second) vs Output Speed per Query (Tokens per Second)
Systems compared: 8xH100 (vLLM), 8xH200 (vLLM), 8xB200 (TensorRT-LLM, Optimal), 8xMI300X (vLLM)

System Output Throughput & Output Speed per Query vs. Concurrency

gpt-oss-120B (high) | System Output Throughput (Tokens per Second) & Output Speed per Query (Tokens per Second)
Systems compared: 8xH100 (vLLM), 8xH200 (vLLM), 8xB200 (TensorRT-LLM, Optimal), 8xMI300X (vLLM)

The number of simultaneous requests that can be processed by the API at any given time. Higher concurrency enables better handling of parallel requests, which is essential for applications requiring high throughput and scalability.

Cost per Million Input and Output Tokens at 100 tokens/s Per Query Output Speed

gpt-oss-120B (high) | Cost per One Million Input and One Million Output Tokens (USD) at 100 tokens/s Output Speed

The cost per million input and output tokens is calculated using the average price per GPU per hour and the system output throughput, assuming 1k input and 1k output tokens per request. The formula is:

$$\frac{\text{average price per GPU per hour} \times \text{number of GPUs} \times 1{,}000{,}000}{\text{system output throughput} \times 3600}$$

Note: Prices used are the lowest 1-year committed prices available on the market.

Prices shown are updated at the start of each month. While prices may fluctuate daily, we maintain a monthly update schedule to provide consistent comparisons.
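The formula above translates directly into code. This is a minimal sketch of the stated calculation; the example figures ($2/GPU-hr, 8 GPUs, 4,000 tok/s) are illustrative, not published results.

```python
# Sketch of the cost-per-million-tokens formula, assuming 1k input and
# 1k output tokens per request as stated in the methodology.

def cost_per_million_tokens(price_per_gpu_hour, num_gpus, throughput_tok_s):
    """USD per one million input and one million output tokens at a
    given sustained system output throughput (tokens per second)."""
    return (price_per_gpu_hour * num_gpus * 1_000_000) / (throughput_tok_s * 3600)
```

For instance, an 8-GPU system rented at $2.00 per GPU per hour sustaining 4,000 tok/s works out to roughly $1.11 per million tokens.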

End-to-End Latency vs. Concurrency

gpt-oss-120B (high) | End-to-End Latency (s) vs. Concurrency
Systems compared: 8xH100 (vLLM), 8xH200 (vLLM), 8xB200 (TensorRT-LLM, Optimal), 8xMI300X (vLLM)

The time it takes for a request to be processed and the response to be returned.

Price per GPU Hour

Price per GPU Hour (On-Demand)

Leading cloud hyperscaler endpoints; Price in USD
Providers compared: Runpod, Crusoe Cloud, DigitalOcean, Amazon Web Services, Google Cloud, Nebius, Microsoft Azure, Lambda, CoreWeave

Instance configurations may differ slightly by provider, even when using the same GPU model. Factors like memory size, interconnect bandwidth, and system architecture can impact pricing.

Prices are based on the US Central region, or the closest available equivalent if not directly listed (e.g., US East - Ohio - for AWS).

Pricing reflects on-demand and 1-year committed hourly rates as listed by each cloud provider.

Prices shown are updated at the start of each month. While prices may fluctuate daily, we maintain a monthly update schedule to provide consistent comparisons.