AI Hardware Benchmarking & Performance Analysis

We measure real-world performance of AI accelerator systems during language model inference.

For language model intelligence benchmarks, or API performance benchmarks, see language model comparisons.

AA-AgentPerf: The Hardware Benchmark for the Agent Era

AA-AgentPerf has been shaped by our work with inference providers and engagement with AI accelerator companies, developers, and enterprise buyers over the past year. Our goal is for anyone deploying models - whether buying or leasing accelerators - to be able to use AA-AgentPerf as the definitive resource for understanding real-world hardware performance.

Open for submissions

AA-AgentPerf is now open for submissions of configurations for benchmarking. The models supported at launch are gpt-oss-120b and DeepSeek V4 Pro. We'll be publishing results on a rolling basis.

Figure: up to 9 concurrent agents (Agent 1 to Agent 9) sharing one inference system, plotted against system throughput and output speed per agent.
More concurrent agents means higher total throughput but slower per-agent speed. AgentPerf measures exactly where each system hits this trade-off.
Design principles

Real agent workloads, not synthetic queries

We've captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens.

Production optimizations allowed

KV cache reuse, disaggregated prefill/decode, speculative decoding - we're allowing the optimizations that labs and inference providers are serving in production so that we can capture what real deployments should look like.

Measures what developers need to know

Max concurrent agents at each target output speed, expressed per MW, per accelerator, and per system.

Built for every kind of scale

Designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between.

Results

Agent capacity

Concurrent agents and output throughput each system sustains at each output speed target.

Max Concurrent Agents per MW of Accelerator Power

Maximum concurrent agents sustained per megawatt of accelerator power at each service level objective (SLO)

Concurrent Agents per MW of Accelerator Power vs. Output Speed

Concurrent agents per MW of accelerator power vs. p25 output speed per query (tokens per second)
Legend: search iteration, max passing concurrent agents, TTFT SLO violated.

Values are normalized by GPU power draw in megawatts, based on measured GPU-only power (GPU die plus HBM, excluding CPUs, networking, and cooling overhead).
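As a rough illustration of this normalization, the sketch below converts one measured agent count into per-system, per-accelerator, and per-MW figures. The function name and the example numbers (8 accelerators at 1,000 W each, 11 agents) are hypothetical and are not published results.

```python
def normalize_capacity(max_agents, num_accelerators, measured_watts_per_accelerator):
    """Express a max concurrent agent count per system, per accelerator, and per MW.

    measured_watts_per_accelerator is assumed to be accelerator-only power
    (die plus HBM), excluding CPUs, networking, and cooling.
    """
    total_watts = num_accelerators * measured_watts_per_accelerator
    return {
        "per_system": max_agents,
        "per_accelerator": max_agents / num_accelerators,
        # 1 MW = 1,000,000 W
        "per_mw": max_agents * 1_000_000 / total_watts,
    }


# Hypothetical system: 8 accelerators drawing 1,000 W each, sustaining 11 agents.
result = normalize_capacity(max_agents=11, num_accelerators=8,
                            measured_watts_per_accelerator=1000)
print(result)
```

Because only accelerator power is counted, two systems with the same per-MW figure can still differ in total facility power.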

Methodology

Inside a single agent trajectory

Each simulated agent replays a real multi-turn coding session — reasoning, calling tools, and editing code. Context compounds turn over turn: prior turns are served from KV cache, only new tokens are prefilled, and trajectories grow beyond 100K tokens. This access pattern is what separates agent serving from synthetic benchmarks.
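To see why KV cache reuse matters for this access pattern, the sketch below counts how many tokens would be prefilled over one trajectory with and without reuse. It is a simplified model with hypothetical numbers (200 turns, 500 new tokens per turn), and it treats every new token in a turn as needing prefill.

```python
def prefill_tokens(new_tokens_per_turn, reuse_kv_cache=True):
    """Total tokens prefilled across a multi-turn trajectory.

    new_tokens_per_turn: tokens appended to the context at each turn.
    With reuse, only the new tokens are prefilled; without it, the whole
    accumulated context is prefilled again at every turn.
    """
    total = 0
    context = 0
    for new_tokens in new_tokens_per_turn:
        total += new_tokens if reuse_kv_cache else context + new_tokens
        context += new_tokens
    return total


# Hypothetical trajectory: 200 turns of 500 new tokens, ending at 100K tokens.
turns = [500] * 200
with_cache = prefill_tokens(turns, reuse_kv_cache=True)
without_cache = prefill_tokens(turns, reuse_kv_cache=False)
print(with_cache, without_cache, without_cache / with_cache)
```

In this toy case, re-prefilling the full context every turn costs roughly 100 times more prefill work than reusing the cache.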

How AA-AgentPerf measures maximum agent count

AgentPerf uses a binary search to find the maximum number of concurrent agents each system can sustain while meeting output speed and time-to-first-token performance targets.

Figure: ramp phase, doubling the load until a target breaks. Axes: concurrent agents (x) vs. p25 output speed in t/s (y), with the SLO threshold at 100 t/s. Table columns: Phase, Agents, p25 Speed, Result. In the illustrated run, max agents = 11.
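The search can be sketched as a doubling ramp followed by a bisection. This is a minimal illustration, not the AA-AgentPerf harness: `find_max_agents`, `passes_slo`, and the toy speed curve are hypothetical, and the sketch assumes that once a load level fails, every higher load also fails.

```python
from typing import Callable


def find_max_agents(passes_slo: Callable[[int], bool], limit: int = 4096) -> int:
    """Largest agent count in [0, limit] for which passes_slo returns True."""
    if not passes_slo(1):
        return 0
    good, bad = 1, 2
    # Ramp: double the load until a target breaks (or the limit is passed).
    while bad <= limit and passes_slo(bad):
        good, bad = bad, bad * 2
    bad = min(bad, limit + 1)  # counts above the limit are treated as failing
    # Bisection between the last passing and the first failing count.
    while bad - good > 1:
        mid = (good + bad) // 2
        if passes_slo(mid):
            good = mid
        else:
            bad = mid
    return good


def toy_p25_speed(agents: int) -> float:
    """Made-up curve: per-agent speed falls as more agents share the system."""
    return 1200 / (agents + 0.5)


# In a real run each probe would be a full load test that also checks TTFT.
max_agents = find_max_agents(lambda n: toy_p25_speed(n) >= 100)
print(max_agents)
```

With this toy curve the ramp passes at 1, 2, 4, and 8 agents, fails at 16, and the bisection then settles on 11.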

Capacity at every service level

SLO tiers are derived from the levels of service observed across serverless API providers. Vendors tune for each tier separately, and AA-AgentPerf reports the maximum concurrent agents each system sustains at every tier — relaxing the speed target frees capacity for more agents.
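The effect of relaxing the speed target can be shown with a small lookup over load-test results. The measurements and the 100 and 50 tokens/s tiers below are hypothetical; in practice vendors tune separately for each tier, so each tier would have its own set of measurements.

```python
def capacity_by_tier(measurements, tiers):
    """Max agent count whose p25 output speed meets each target speed.

    measurements: {concurrent agents: p25 output speed in tokens/s}
    tiers: target output speeds in tokens/s
    """
    return {
        target: max(
            (agents for agents, speed in measurements.items() if speed >= target),
            default=0,
        )
        for target in tiers
    }


# Hypothetical load-test results for one system.
measurements = {1: 420.0, 2: 390.0, 4: 350.0, 8: 305.0, 16: 210.0, 32: 120.0, 64: 55.0}
capacity = capacity_by_tier(measurements, tiers=[300, 100, 50])
print(capacity)
```

Lower speed targets admit more concurrent agents on the same system.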

Speed targets shown are the DeepSeek V4 Pro tiers; agent counts are illustrative.

Target output speed per agent vs. max concurrent agents at target: in the illustrated example, 8 agents are sustained at ≥ 300 tokens/s each.

Reported per MW, per accelerator, and per system.

System Load Test (AA-SLT)

Original Benchmark

Our original hardware benchmark, covering a wide range of systems. Read the methodology

Highlights

Total system output tokens per second · Higher is better
Output tokens per second per query · Higher is better
Minimum rental price per GPU per hour (USD) · Lower is better

Throughput

System Output Throughput at 100 tokens/s Per Query Output Speed

gpt-oss-120b (high) | System output throughput (tokens per second) at 100 tokens/s output speed

The total number of tokens that can be processed per second across all concurrent requests. This metric measures the overall system capacity and efficiency, taking into account both the per-query output speed and the system's ability to handle multiple concurrent requests.

  • Max Throughput: Optimized for the highest sustained request volume.
  • Minimum Latency: Tuned to deliver the fastest response times.
  • Optimal: Combines max throughput and minimum latency configurations by selecting the configuration with higher system throughput at a given concurrency level.
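The Optimal series described above can be sketched as a per-concurrency selection between the two tuned configurations. The data layout and the numbers are hypothetical and serve only to show the selection rule.

```python
def optimal_curve(max_throughput, min_latency):
    """Pick, at each concurrency level, the configuration with higher system throughput.

    Each argument maps concurrency -> (system throughput, output speed per query),
    both in tokens per second.
    """
    merged = {}
    for concurrency in sorted(set(max_throughput) | set(min_latency)):
        candidates = [
            config[concurrency]
            for config in (max_throughput, min_latency)
            if concurrency in config
        ]
        # Select by the first element of the pair: system throughput.
        merged[concurrency] = max(candidates, key=lambda point: point[0])
    return merged


# Hypothetical measurements for one system under the two tunings.
max_tp = {1: (180.0, 180.0), 16: (2400.0, 150.0), 64: (7000.0, 109.0)}
min_lat = {1: (400.0, 400.0), 16: (2900.0, 181.0), 64: (5200.0, 81.0)}
optimal = optimal_curve(max_tp, min_lat)
print(optimal)
```

Here the minimum-latency tuning wins at low concurrency and the max-throughput tuning wins at 64 concurrent requests.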

Output Speed

Peak Output Speed per Query

gpt-oss-120b (high) | Peak output speed per query (tokens per second)

Tokens per second received by each individual query after the first chunk has been received. Represented as the median result of all queries in each concurrency phase.
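A minimal sketch of this metric is shown below. The timestamps are hypothetical, and the sketch assumes that only tokens arriving after the first chunk are counted.

```python
from statistics import median


def output_speed(first_chunk_time, end_time, tokens_after_first_chunk):
    """Tokens per second for one query, measured from the first chunk onward."""
    return tokens_after_first_chunk / (end_time - first_chunk_time)


def phase_output_speed(queries):
    """Median output speed across all queries in one concurrency phase."""
    return median(output_speed(*query) for query in queries)


# Hypothetical queries: (first chunk time in s, end time in s, tokens after first chunk)
queries = [(0.5, 10.5, 1000), (0.25, 8.25, 1000), (1.0, 21.0, 1000)]
speed = phase_output_speed(queries)
print(speed)
```

The three queries run at 100, 125, and 50 tokens per second, so the phase is reported at the median of 100.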

Throughput vs Speed

System Output Throughput vs. Output Speed per Query

gpt-oss-120b (high) | System output throughput (tokens per second) vs output speed per query (tokens per second)
8xH100 - vllm
8xB200 - tensorrt-llm
8xH200 - vllm
8xMI300X - vllm

System Output Throughput & Output Speed per Query vs. Concurrency

gpt-oss-120b (high) | System output throughput (tokens per second) & output speed per query (tokens per second)
8xH100 - vllm
8xB200 - tensorrt-llm
8xH200 - vllm
8xMI300X - vllm
Throughput
Speed

The number of simultaneous requests that can be processed by the API at any given time. Higher concurrency enables better handling of parallel requests, which is essential for applications requiring high throughput and scalability.

Cost

Cost per Million Input and Output Tokens at 100 tokens/s Per Query Output Speed

gpt-oss-120b (high) | Cost per one million input and one million output tokens (USD) at 100 tokens/s output speed

The cost per million input and output tokens is calculated using the average price per GPU per hour and the system output throughput, assuming 1k input and 1k output tokens per request. The formula is:

\[
\frac{\text{average price per GPU per hour} \times \text{number of GPUs} \times 1{,}000{,}000}{\text{system output throughput} \times 3600}
\]

Note: Prices used are the lowest 1-year committed prices available on the market.

Prices shown are updated at the start of each month. While prices may fluctuate daily, we maintain a monthly update schedule to provide consistent comparisons.
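The cost formula can be evaluated directly, as in the sketch below. The price, GPU count, and throughput are hypothetical inputs chosen for illustration, not benchmark results.

```python
def cost_per_million_tokens(price_per_gpu_hour, num_gpus, system_output_throughput):
    """Cost in USD per one million input and one million output tokens.

    price_per_gpu_hour: USD per GPU per hour
    system_output_throughput: output tokens per second across the whole system
    Assumes 1k input and 1k output tokens per request, as stated above.
    """
    system_cost_per_hour = price_per_gpu_hour * num_gpus
    output_tokens_per_hour = system_output_throughput * 3600
    return system_cost_per_hour * 1_000_000 / output_tokens_per_hour


# Hypothetical system: 8 GPUs at $2.50 per GPU-hour, 20,000 output tokens/s.
cost = cost_per_million_tokens(price_per_gpu_hour=2.50, num_gpus=8,
                               system_output_throughput=20_000)
print(round(cost, 4))
```

Doubling system throughput at the same price halves the cost, which is why the metric is reported at a fixed per-query output speed.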

Concurrency

End-to-End Latency vs. Concurrency

gpt-oss-120b (high) | End-to-end latency (s) vs. concurrency
8xH100 - vllm
8xB200 - tensorrt-llm
8xH200 - vllm
8xMI300X - vllm

The time it takes for a request to be processed and the response to be returned.

Pricing

Price per GPU Hour (On-Demand)

Leading cloud hyperscaler endpoints · Price in USD
Runpod
Crusoe Cloud
DigitalOcean
Nebius
Amazon Web Services
Google
Microsoft Azure
CoreWeave
Lambda

Instance configurations may differ slightly by provider, even when using the same GPU model. Factors like memory size, interconnect bandwidth, and system architecture can impact pricing.

Prices are based on the US Central region, or the closest available equivalent if not directly listed (e.g., US East - Ohio - for AWS).

Pricing reflects on-demand and 1-year committed hourly rates as listed by each cloud provider.

Frequently Asked Questions

AA-AgentPerf is Artificial Analysis' hardware benchmark for AI agent workloads. It measures how AI accelerators perform on real coding agent trajectories — with up to 200 turns and sequence lengths exceeding 100K tokens — rather than synthetic single-turn prompts. AA-AgentPerf reports each system's maximum concurrent agent capacity at a target output speed and time-to-first-token, normalized per MW, per accelerator, and per system.

There is no single best accelerator for AI agent inference — the right choice depends on whether you optimize for total concurrent agent capacity, output speed per agent, power efficiency, or cost. AA-AgentPerf compares NVIDIA GPUs, AMD GPUs, Google TPUs, and custom AI accelerators on the same agentic workloads, with results broken out per MW, per accelerator, and per system.

Agentic workloads have long sequence lengths (often >100K tokens), many sequential turns (up to 200+ in coding agents), and benefit substantially from production inference optimizations such as KV cache reuse, disaggregated prefill and decode, and speculative decoding. Standard LLM benchmarks built on short prompts and single-turn responses do not capture how accelerators handle these patterns, which is why we built AA-AgentPerf.

AA-AgentPerf runs a binary search per system to find the maximum number of concurrent AI agents each hardware configuration can sustain while meeting both an output speed target (tokens per second per agent) and a time-to-first-token (TTFT) target. The resulting concurrent agent count can be normalized per MW, per accelerator, or per system, so the same agentic workload is comparable across different AI accelerator architectures.

AA-AgentPerf supports gpt-oss-120b and DeepSeek V4 Pro at launch, running on real agentic coding trajectories. We will add additional models on a rolling basis.

AA-AgentPerf permits the inference optimizations that AI labs and inference providers deploy in production — KV cache reuse, disaggregated prefill/decode, and speculative decoding. The goal is to reflect how hardware performs in realistic agent deployments, not in an artificially constrained setup, so the reported agentic throughput and latency match what buyers see when they deploy.

AA-SLT is our original hardware benchmark for language model inference, measuring throughput, output speed, and cost across a wide range of AI accelerator systems on short and medium length prompts. AA-AgentPerf focuses on agentic AI workloads — long-context, multi-turn coding agents — and reports concurrent agent capacity instead of single-query throughput. Both benchmarks live on the same hardware page so AI accelerators can be compared on either workload.

In the current Artificial Analysis System Load Test (AA-SLT), NVIDIA's B200 is the most performant accelerator for LLM inference. It leads on peak throughput and output speed per query, though the right choice can still vary by model, deployment goal, and budget.

NVIDIA's B200 powers the highest-throughput result in the current Artificial Analysis System Load Test (AA-SLT). The top result is B200 (SXM) x8 serving gpt-oss-120b (high), reaching 92,909 output tokens per second at peak throughput.

NVIDIA's B200 also powers the fastest single-query result in the current AA-SLT. The top result is B200 (SXM) x8 serving gpt-oss-120b (high), reaching 403 output tokens per second per query.

B200 (SXM) x8 serving gpt-oss-120b (high) has the best cost efficiency in the current Artificial Analysis System Load Test (AA-SLT), at $0.19 per one million input and one million output tokens. Artificial Analysis compares systems on this cost metric at a model-specific reference speed, so the most cost-efficient hardware depends on both the model and the target output speed.

In the current Artificial Analysis System Load Test (AA-SLT), each model works best on B200 (SXM) x8 with NVIDIA's B200. Peak throughput by model:

  • DeepSeek R1 0528 (May '25): 45,677 output tokens per second
  • Llama 4 Maverick: 48,198 output tokens per second
  • gpt-oss-120b (high): 92,909 output tokens per second