Name: Peak System Output Throughput, Llama 3.3 70B
Creator: Artificial Analysis
License: https://artificialanalysis.ai/docs/legal/Terms-of-Use.pdf

Question 1

What is AA-AgentPerf?

Accepted Answer

AA-AgentPerf is Artificial Analysis' hardware benchmark for AI agent workloads. It measures how AI accelerators perform on real coding agent trajectories — with up to 200 turns and sequence lengths exceeding 100K tokens — rather than synthetic single-turn prompts. AA-AgentPerf reports each system's maximum concurrent agent capacity at a target output speed and time-to-first-token, normalized per MW, per accelerator, and per system.
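The three normalizations mentioned above can be illustrated with a small sketch. The `SystemResult` fields and the example numbers are hypothetical and are not taken from any published AA-AgentPerf result; the sketch only shows how one capacity figure might be expressed per system, per accelerator, and per MW of accelerator power.

```python
from dataclasses import dataclass


@dataclass
class SystemResult:
    """One benchmarked system. All fields are illustrative assumptions."""
    name: str
    max_concurrent_agents: int   # capacity found at the target service level
    num_accelerators: int        # accelerators in the system
    accelerator_power_kw: float  # total accelerator power, in kW


def normalize(result: SystemResult) -> dict[str, float]:
    """Express one capacity result per system, per accelerator, and per MW."""
    return {
        "per_system": float(result.max_concurrent_agents),
        "per_accelerator": result.max_concurrent_agents / result.num_accelerators,
        # 1 MW = 1000 kW, so scale the agent count before dividing by kW.
        "per_mw": result.max_concurrent_agents * 1000.0 / result.accelerator_power_kw,
    }


# Hypothetical 8-accelerator system drawing 8 kW of accelerator power.
example = SystemResult("hypothetical-8x", 64, 8, 8.0)
print(normalize(example))
```

Normalizing per MW lets a small system and a rack-scale system be compared on the same axis, which a raw per-system count would not allow.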

Question 2

What is the best hardware for AI agent inference?

Accepted Answer

There is no single best accelerator for AI agent inference — the right choice depends on whether you optimize for total concurrent agent capacity, output speed per agent, power efficiency, or cost. AA-AgentPerf compares NVIDIA GPUs, AMD GPUs, Google TPUs, and custom AI accelerators on the same agentic workloads, with results broken out per MW, per accelerator, and per system.

Question 3

How do agentic AI workloads differ from standard LLM inference benchmarks?

Accepted Answer

Agentic workloads have long sequence lengths (often over 100K tokens), many sequential turns (reaching 200 or more in coding agents), and benefit substantially from production inference optimizations such as KV cache reuse, disaggregated prefill and decode, and speculative decoding. Standard LLM benchmarks built on short prompts and single-turn responses do not capture how accelerators handle these patterns, which is why we built AA-AgentPerf.
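To see why KV cache reuse matters so much for multi-turn trajectories, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which each turn appends new tokens to a context that only grows, and in which the earlier context is a stable prefix that can be served from cache. The function name and the turn sizes are hypothetical, and real trajectories are less regular.

```python
def prefill_tokens(turn_lengths: list[int], reuse_kv_cache: bool) -> int:
    """Total tokens that must be prefilled across one trajectory.

    turn_lengths[i] is the number of new tokens added to the context at
    turn i (simplifying assumption: the context never shrinks or changes).
    """
    total = 0
    context = 0
    for new_tokens in turn_lengths:
        context += new_tokens
        # With reuse, only the new tokens are processed; without it, the
        # whole context is re-processed on every turn.
        total += new_tokens if reuse_kv_cache else context
    return total


# Hypothetical trajectory: 200 turns of 500 new tokens (100K final context).
turns = [500] * 200
print("without reuse:", prefill_tokens(turns, reuse_kv_cache=False))
print("with reuse:   ", prefill_tokens(turns, reuse_kv_cache=True))
```

Under these assumptions the work without reuse grows quadratically with the number of turns, while the work with reuse grows linearly. A single-turn benchmark cannot expose that gap.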

Question 4

How does AA-AgentPerf measure concurrent agent capacity on AI hardware?

Accepted Answer

AA-AgentPerf runs a binary search per system to find the maximum number of concurrent AI agents each hardware configuration can sustain while meeting both an output speed target (tokens per second per agent) and a time-to-first-token (TTFT) target. The resulting concurrent agent count can be normalized per MW, per accelerator, or per system, so the same agentic workload is comparable across different AI accelerator architectures.
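The search described above can be sketched as follows. This is a minimal illustration and not Artificial Analysis' harness: `meets_targets` stands in for a full benchmark run at a given concurrency, the toy load model and its thresholds are invented, and the search assumes that if a system meets both targets at some agent count it also meets them at every lower count.

```python
from typing import Callable


def max_concurrent_agents(meets_targets: Callable[[int], bool], upper_bound: int) -> int:
    """Return the largest agent count in [0, upper_bound] that meets the targets.

    Assumes monotonicity: adding agents never makes the targets easier to hit.
    A return value of 0 means even a single agent misses the targets.
    """
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if meets_targets(mid):
            lo = mid       # mid is sustainable; look for a higher count
        else:
            hi = mid - 1   # mid misses a target; look lower
    return lo


def toy_meets_targets(agents: int) -> bool:
    """Toy load model (assumption): speed falls and TTFT rises with concurrency."""
    speed = 200.0 / (1.0 + 0.02 * agents)  # output tokens/s per agent
    ttft = 0.5 * (1.0 + 0.05 * agents)     # seconds to first token
    return speed >= 50.0 and ttft <= 5.0   # both targets must hold


print(max_concurrent_agents(toy_meets_targets, upper_bound=1024))
```

Binary search keeps the number of expensive benchmark runs logarithmic in the upper bound, which matters when each probe replays long agent trajectories.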

Question 5

Which AI models does AA-AgentPerf use to benchmark hardware?

Accepted Answer

AA-AgentPerf supports gpt-oss-120b and DeepSeek V4 Pro at launch, running on real agentic coding trajectories. We will add more models on a rolling basis.

Question 6

Why does AA-AgentPerf allow production inference optimizations?

Accepted Answer

AA-AgentPerf permits the inference optimizations that AI labs and inference providers deploy in production — KV cache reuse, disaggregated prefill/decode, and speculative decoding. The goal is to reflect how hardware performs in realistic agent deployments, not in an artificially constrained setup, so the reported agentic throughput and latency match what buyers see when they deploy.

Question 7

How does AA-AgentPerf compare to the AA System Load Test (AA-SLT)?

Accepted Answer

AA-SLT is our original hardware benchmark for language model inference, measuring throughput, output speed, and cost across a wide range of AI accelerator systems on short and medium length prompts. AA-AgentPerf focuses on agentic AI workloads — long-context, multi-turn coding agents — and reports concurrent agent capacity instead of single-query throughput. Both benchmarks live on the same hardware page so AI accelerators can be compared on either workload.

Question 8

What is the best accelerator for LLM inference?

Accepted Answer

In the current Artificial Analysis System Load Test (AA-SLT), NVIDIA's B200 is the best-performing accelerator for LLM inference. It leads on peak throughput and output speed per query, though the right choice can still vary by model, deployment goal, and budget.

Question 9

Which accelerator has the highest LLM inference throughput?

Accepted Answer

NVIDIA's B200 powers the highest-throughput result in the current Artificial Analysis System Load Test (AA-SLT). The top result is B200 (SXM) x8 serving gpt-oss-120b (high), which reaches 92,909 output tokens per second at peak throughput.

Question 10

Which accelerator is fastest for AI inference?

Accepted Answer

NVIDIA's B200 powers the fastest single-query result in the current Artificial Analysis System Load Test (AA-SLT). The top result is B200 (SXM) x8 serving gpt-oss-120b (high), which reaches 403 output tokens per second per query.

Question 11

Which hardware has the best cost efficiency for LLM inference?

Accepted Answer

B200 (SXM) x8 serving gpt-oss-120b (high) has the best cost efficiency in the current Artificial Analysis System Load Test (AA-SLT), at $0.19 per one million input and one million output tokens. Artificial Analysis compares systems on this cost metric at a model-specific reference speed, so the most cost-efficient hardware depends on both the model and the target output speed.
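For intuition about how rental price and throughput combine into a cost figure, here is a simplified sketch. It is not Artificial Analysis' published formula: it covers output tokens only, whereas the metric above combines input and output tokens, and the function name and example numbers are hypothetical.

```python
def cost_per_million_output_tokens(
    price_per_accelerator_hour: float,
    num_accelerators: int,
    system_tokens_per_second: float,
) -> float:
    """Rough cost in dollars to generate one million output tokens.

    Simplifying assumption: the system runs continuously at the given
    throughput and is billed per accelerator-hour.
    """
    system_price_per_hour = price_per_accelerator_hour * num_accelerators
    tokens_per_hour = system_tokens_per_second * 3600.0
    return system_price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical system: 8 accelerators at $4.00 per hour each, sustaining
# 20,000 output tokens/s at the reference per-query speed.
print(round(cost_per_million_output_tokens(4.0, 8, 20_000.0), 4))
```

Because throughput is measured at a fixed per-query output speed, the same hardware can land at very different cost points depending on which reference speed is chosen.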

Question 12

What is the best hardware for DeepSeek R1, Llama 4 Maverick, and GPT-OSS 120B inference?

Accepted Answer

In the current Artificial Analysis System Load Test (AA-SLT), all three models reach their highest peak throughput on B200 (SXM) x8, built on NVIDIA's B200. DeepSeek R1 0528 (May '25) reaches 45,677 output tokens per second, Llama 4 Maverick reaches 48,198 output tokens per second, and gpt-oss-120b (high) reaches 92,909 output tokens per second.

AI Hardware Benchmarking & Performance Analysis

AA-AgentPerf: The Hardware Benchmark for the Agent Era

Real agent workloads, not synthetic queries

Production optimizations allowed

Measures what developers need to know

Built for every kind of scale

Max Concurrent Agents per MW of Accelerator Power

Power Normalization (Per MW)

Inside a single agent trajectory

How AA-AgentPerf measures maximum agent count

Capacity at every service level

System Load Test (AA-SLT)

Peak System Output Throughput, Llama 3.3 70B

Peak Output Speed per Query, Llama 3.3 70B

Rental Price (ON-DEMAND)

Throughput

[Chart: System Output Throughput at 100 tokens/s Per Query Output Speed. Labels: System Output Throughput, Hardware Configuration.]

Output Speed

[Chart: Peak Output Speed per Query. Labels: Output Speed per Query, Hardware Configuration.]

Throughput vs Speed

[Chart: System Output Throughput vs. Output Speed per Query. Labels: System Output Throughput, Output Speed per Query, Hardware Configuration.]

[Chart: System Output Throughput & Output Speed per Query vs. Concurrency. Labels: Output Speed per Query, System Output Throughput, Concurrency, Hardware Configuration.]

Cost

[Chart: Cost per Million Input and Output Tokens at 100 tokens/s Per Query Output Speed. Labels: Cost per Million Input and Output Tokens, Pricing Update Schedule, Hardware Configuration.]

Concurrency

[Chart: End-to-End Latency vs. Concurrency. Labels: End-to-End Latency, Concurrency, Hardware Configuration.]

Pricing

[Chart: Price per GPU Hour (On-Demand). Labels: GPU Variations, Regional Pricing Basis, Provider Pricing Basis, Pricing Update Schedule.]

System & Benchmark Specifications
