
AI Hardware Benchmarking & Performance Analysis

Comprehensive benchmarking of AI accelerator systems for language model inference. We measure how performance scales with concurrent load on NVIDIA 8×H100, 8×H200, and 8×B200 systems, AMD 8×MI300X systems, and other accelerators, using DeepSeek R1, Llama 4 Maverick, Llama 3.3 70B, and gpt-oss-120B.

For details regarding the methodology, see our methodology section. Benchmarks are conducted periodically, at least once per quarter, and benchmark specifications are shared in the System & Benchmark Specifications section below.

For model benchmarks, see our LLM model comparison.

Coming soon: Model Deployment Report

Highlights

  • Peak System Output Throughput, Llama 3.3 70B: total system output tokens per second (higher is better)
  • Peak Output Speed per Query, Llama 3.3 70B: output tokens per second per query (higher is better)
  • Rental Price (On-Demand): minimum rental price per GPU per hour, USD (lower is better)

Price per GPU Hour

Leading cloud hyperscaler endpoints; prices in USD. Providers compared: Runpod, Crusoe Cloud, DigitalOcean, Amazon Web Services, Google Cloud, Nebius, Microsoft Azure, Lambda, and CoreWeave, with the minimum across providers highlighted.

Instance configurations may differ slightly by provider, even when using the same GPU model. Factors like memory size, interconnect bandwidth, and system architecture can impact pricing.

Prices are based on the US Central region, or the closest available equivalent if not directly listed (e.g., US East (Ohio) for AWS).

Pricing reflects on-demand hourly rates as listed by each cloud provider.

Prices shown are updated at the start of each month. While prices may fluctuate daily, we maintain a monthly update schedule to provide consistent comparisons.

System Output Throughput at 100 tokens/s Per Query Output Speed

gpt-oss-120B (high) | System Output Throughput (Tokens per Second) at 100 tokens/s Output Speed

The total number of tokens that can be processed per second across all concurrent requests. This metric measures the overall system capacity and efficiency, taking into account both the per-query output speed and the system's ability to handle multiple concurrent requests.
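
To make the metric concrete, here is a minimal sketch of how system output throughput could be derived from per-request records within a single concurrency phase. The record fields and phase boundaries are illustrative assumptions, not the exact format of the benchmark harness.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Illustrative per-request record for one concurrency phase (assumed fields).
    start_s: float        # wall-clock time the request was sent (seconds)
    end_s: float          # wall-clock time the last output token was received
    output_tokens: int    # output tokens generated for this request

def system_output_throughput(records: list[RequestRecord]) -> float:
    """Total output tokens per second across all concurrent requests in a phase."""
    total_tokens = sum(r.output_tokens for r in records)
    phase_start = min(r.start_s for r in records)
    phase_end = max(r.end_s for r in records)
    return total_tokens / (phase_end - phase_start)
```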

  • Max Throughput: Optimized for the highest sustained request volume.
  • Minimum Latency: Tuned to deliver the fastest response times.
  • Optimal: Combines the max throughput and minimum latency configurations by selecting, at each concurrency level, whichever achieves the higher system throughput (see the sketch below).
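
As a rough sketch of how the Optimal series can be assembled, assuming each configuration's measured system throughput is keyed by concurrency level (the dictionary format is illustrative, not the actual results format):

```python
def optimal_curve(max_throughput: dict[int, float],
                  min_latency: dict[int, float]) -> dict[int, float]:
    """Per concurrency level, keep whichever tuned configuration achieved the
    higher system output throughput (tokens per second)."""
    levels = sorted(set(max_throughput) | set(min_latency))
    return {c: max(max_throughput.get(c, 0.0), min_latency.get(c, 0.0)) for c in levels}
```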

Peak Output Speed per Query

gpt-oss-120B (high) | Peak Output Speed per Query (Tokens per Second)

Tokens per second received by each individual query after the first chunk has been received. Represented as the median result of all queries in each concurrency phase.
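
One way to measure this from a streaming endpoint is to start the clock when the first chunk arrives, count only the tokens received after it, and then take the median across queries. A minimal sketch, assuming each stream is recorded as (arrival timestamp, tokens in chunk) pairs; this is an assumption about the harness, not a specific client API.

```python
import statistics

def query_output_speed(chunks: list[tuple[float, int]]) -> float:
    """Output tokens per second for a single query, measured after the first chunk.
    `chunks` is an assumed list of (arrival_timestamp_s, tokens_in_chunk) pairs."""
    if len(chunks) < 2:
        return 0.0
    first_ts = chunks[0][0]                      # first chunk starts the clock
    last_ts = chunks[-1][0]
    tokens_after_first = sum(n for _, n in chunks[1:])
    return tokens_after_first / (last_ts - first_ts)

def phase_output_speed(all_query_chunks: list[list[tuple[float, int]]]) -> float:
    """Median per-query output speed across all queries in a concurrency phase."""
    return statistics.median(query_output_speed(c) for c in all_query_chunks)
```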

  • Max Throughput: Optimized for the highest sustained request volume.
  • Minimum Latency: Tuned to deliver the fastest response times.
  • Optimal: Combines max throughput and minimum latency configurations by selecting the configuration with higher system throughput at a given concurrency level.

System Output Throughput vs Output Speed per Query

gpt-oss-120B (high) | System Output Throughput (Tokens per Second) vs Output Speed per Query (Tokens per Second)
Systems compared: 8xH100 (vLLM), 8xH200 (vLLM), 8xB200 (TensorRT-LLM, Optimal), 8xMI300X (vLLM)

The total number of tokens that can be processed per second across all concurrent requests. This metric measures the overall system capacity and efficiency, taking into account both the per-query output speed and the system's ability to handle multiple concurrent requests.

Tokens per second received by each individual query after the first chunk has been received. Represented as the median result of all queries in each concurrency phase.

  • Max Throughput: Optimized for the highest sustained request volume.
  • Minimum Latency: Tuned to deliver the fastest response times.
  • Optimal: Combines max throughput and minimum latency configurations by selecting the configuration with higher system throughput at a given concurrency level.

System Output Throughput & Output Speed per Query vs. Concurrency

gpt-oss-120B (high) | System Output Throughput (Tokens per Second) & Output Speed per Query (Tokens per Second)
Systems compared: 8xH100 (vLLM), 8xH200 (vLLM), 8xB200 (TensorRT-LLM, Optimal), 8xMI300X (vLLM); series: Throughput and Speed

Tokens per second received by each individual query after the first chunk has been received. Represented as the median result of all queries in each concurrency phase.

The total number of tokens that can be processed per second across all concurrent requests. This metric measures the overall system capacity and efficiency, taking into account both the per-query output speed and the system's ability to handle multiple concurrent requests.

The number of simultaneous requests that can be processed by the API at any given time. Higher concurrency enables better handling of parallel requests, which is essential for applications requiring high throughput and scalability.
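
For illustration, a load generator can hold a target concurrency level by capping the number of in-flight requests, for example with an asyncio semaphore. A minimal sketch, where `send_request` is a placeholder coroutine standing in for one benchmark query:

```python
import asyncio

async def run_concurrency_phase(send_request, concurrency: int, total_requests: int) -> list:
    """Keep at most `concurrency` requests in flight until `total_requests` complete."""
    semaphore = asyncio.Semaphore(concurrency)
    results = []

    async def worker(i: int) -> None:
        async with semaphore:          # blocks once `concurrency` requests are in flight
            results.append(await send_request(i))

    await asyncio.gather(*(worker(i) for i in range(total_requests)))
    return results
```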

  • Max Throughput: Optimized for the highest sustained request volume.
  • Minimum Latency: Tuned to deliver the fastest response times.
  • Optimal: Combines max throughput and minimum latency configurations by selecting the configuration with higher system throughput at a given concurrency level.

Cost per Million Input and Output Tokens at 100 tokens/s Per Query Output Speed

gpt-oss-120B (high) | Cost per One Million Input and One Million Output Tokens (USD) at 100 tokens/s Output Speed

The cost per million input and output tokens is calculated using the average price per GPU per hour and the system output throughput, assuming 1k input and 1k output tokens per request. The formula is:

\frac{\text{average price per GPU per hour} \times \text{number of GPUs} \times 1{,}000{,}000}{\text{system output throughput} \times 3600}
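
A direct translation of the formula into code, using placeholder numbers rather than measured values:

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            num_gpus: int,
                            system_output_throughput: float) -> float:
    """USD cost per one million tokens, given the average on-demand price per GPU
    per hour and the measured system output throughput (tokens per second)."""
    return (price_per_gpu_hour * num_gpus * 1_000_000) / (system_output_throughput * 3600)

# Placeholder example: 8 GPUs at $3.00 per GPU-hour sustaining 10,000 tokens/s
# -> (3.0 * 8 * 1,000,000) / (10,000 * 3,600) = $0.67 per million tokens
print(round(cost_per_million_tokens(3.0, 8, 10_000), 2))
```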

Prices shown are updated at the start of each month. While prices may fluctuate daily, we maintain a monthly update schedule to provide consistent comparisons.

  • Max Throughput: Optimized for the highest sustained request volume.
  • Minimum Latency: Tuned to deliver the fastest response times.
  • Optimal: Combines max throughput and minimum latency configurations by selecting the configuration with higher system throughput at a given concurrency level.

End-to-End Latency vs. Concurrency

gpt-oss-120B (high) | End-to-End Latency (s) vs. Concurrency
Systems compared: 8xH100 (vLLM), 8xH200 (vLLM), 8xB200 (TensorRT-LLM, Optimal), 8xMI300X (vLLM)

The time it takes for a request to be processed and the response to be returned.
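
As a simple illustration, end-to-end latency can be taken as the wall-clock time from sending a request until the complete response has arrived, then aggregated per concurrency phase (the median is an assumed aggregation here):

```python
import statistics
import time

def timed_request(send_request, *args):
    """Return (response, end_to_end_latency_s) for one blocking benchmark request."""
    start = time.perf_counter()
    response = send_request(*args)     # returns only after the full response arrives
    return response, time.perf_counter() - start

def phase_latency(latencies_s: list[float]) -> float:
    """Median end-to-end latency (seconds) across all requests in a concurrency phase."""
    return statistics.median(latencies_s)
```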

The number of simultaneous requests that can be processed by the API at any given time. Higher concurrency enables better handling of parallel requests, which is essential for applications requiring high throughput and scalability.

  • Max Throughput: Optimized for the highest sustained request volume.
  • Minimum Latency: Tuned to deliver the fastest response times.
  • Optimal: Combines max throughput and minimum latency configurations by selecting the configuration with higher system throughput at a given concurrency level.

System & Benchmark Specifications

Model Name | System | Provider | Precision | Tensor Parallel | Expert Parallel | Data Parallel | Inference Framework | Kernel Version | Configuration | Configuration Reference | Date
Llama 4 Maverick | 8xB200 (SXM) | google | fp8 | 8 | 1 | 1 | vLLM 0.9.0.1 | CUDA 12.8 | Default | - | 4 June 2025
Llama 4 Maverick | 8xH100 (SXM) | google | fp8 | 8 | 1 | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | - | 4 August 2025
Llama 4 Maverick | 8xH200 (SXM) | google | fp8 | 8 | 1 | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | - | 4 August 2025
Llama 4 Maverick | 8xB200 (SXM) | google | fp8 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc3 | CUDA 13.0 | Max throughput | Link | 19 November 2025
Llama 4 Maverick | 8xMI300X | runpod | fp8 | 8 | 1 | 1 | vLLM 0.11.1rc2 | ROCm 7.0 | Default | Link | 19 November 2025
Llama 4 Maverick | 8xB200 (SXM) | google | fp8 | 8 | 2 | 1 | TensorRT-LLM 1.2.0rc1 | CUDA 13.0 | Balanced | Link | 19 November 2025
Llama 4 Maverick | 8xB200 (SXM) | google | fp8 | 8 | 1 | 1 | TensorRT-LLM 1.2.0rc3 | CUDA 13.0 | Min latency | Link | 19 November 2025
Llama 3.3 Instruct 70B | 8xB200 (SXM) | google | bf16 | 8 | - | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | - | 4 August 2025
Llama 3.3 Instruct 70B | 8xH100 (SXM) | google | bf16 | 8 | - | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | - | 4 August 2025
Llama 3.3 Instruct 70B | 8xH200 (SXM) | google | bf16 | 8 | - | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | - | 4 August 2025
Llama 3.3 Instruct 70B | 8xMI300X | runpod | bf16 | 8 | - | 1 | vLLM 0.11.1rc2 | ROCm 7.0 | Default | Link | 19 November 2025
Llama 3.3 Instruct 70B | 8xB200 (SXM) | google | bf16 | 8 | - | 1 | TensorRT-LLM 1.2.0rc1 | CUDA 12.9 | Default | Link | 21 November 2025
Llama 3.3 Instruct 70B | TPU v6e-8 | google | bf16 | 8 | - | 1 | vLLM vllm-tpu:nightly-20251129-28607fc-39e63de | XLA [v2-alpha-tpuv6e] | Default | Link | 9 December 2025
gpt-oss-120B (high) | 8xB200 (SXM) | nebius | mxfp4 | 8 | 8 | 1 | TensorRT-LLM 1.1.0rc1 | CUDA 12.9 | Max throughput | Link | 29 September 2025
gpt-oss-120B (high) | 8xH100 (SXM) | google | mxfp4 | 8 | 1 | 1 | vLLM 0.10.0 | CUDA 12.8 | Default | - | 8 August 2025
gpt-oss-120B (high) | 8xH200 (SXM) | google | mxfp4 | 8 | 1 | 1 | vLLM 0.10.0 | CUDA 12.8 | Default | - | 8 August 2025
gpt-oss-120B (high) | 8xB200 (SXM) | nebius | mxfp4 | 8 | 1 | 1 | TensorRT-LLM 1.1.0rc1 | CUDA 12.9 | Min latency | Link | 29 September 2025
gpt-oss-120B (high) | 8xB200 (SXM) | nebius | mxfp4 | 8 | - | 1 | TensorRT-LLM 1.1.0rc1 | CUDA 12.9 | Optimal | - | 15 October 2025
gpt-oss-120B (high) | 8xMI300X | runpod | mxfp4 | 8 | 8 | 1 | vLLM 0.10.1 | ROCm 7.0 | Default | Link | 19 November 2025
DeepSeek R1 0528 (May '25) | 8xH200 (SXM) | google | fp8 | 8 | 1 | 1 | vLLM 0.9.0.1 | CUDA 12.8 | Default | - | 5 June 2025
DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | google | fp8 | 8 | 1 | 1 | vLLM 0.9.0.1 | CUDA 12.8 | Default | - | 4 June 2025
DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | google | fp4 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc2 | CUDA 13.0 | Optimal | - | 19 November 2025
DeepSeek R1 0528 (May '25) | 8xH200 (SXM) | google | fp8 | 8 | 8 | 1 | TensorRT-LLM 1.0.0rc2 | CUDA 12.9 | Max throughput | Link | 31 July 2025
DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | google | fp8 | 8 | 8 | 1 | TensorRT-LLM 1.1.0rc2.post1 | CUDA 12.9 | Max throughput | Link | 8 September 2025
DeepSeek R1 0528 (May '25) | 8xMI300X | runpod | fp8 | 8 | 1 | 1 | SGLang 0.5.2rc2 | ROCm 7.0 | Default | Link | 19 November 2025
DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | google | fp8 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc2 | CUDA 13.0 | Min latency | Link | 19 November 2025
DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | google | fp4 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc3 | CUDA 13.0 | Max throughput | Link | 19 November 2025
DeepSeek R1 0528 (May '25) | SN40L-16 | sambanova | fp8 | 1 | 1 | 1 | na | - | Optimal | - | 12 January 2026
DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | google | fp8 | 8 | - | 1 | TensorRT-LLM 1.2.0rc2 | CUDA 13.0 | Optimal | - | 19 November 2025
DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | google | fp4 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc2 | CUDA 13.0 | Min latency | Link | 19 November 2025

Cells marked "-" were not specified in the source listing.