AI Hardware Benchmarking & Performance Analysis
Comprehensive benchmarking of AI accelerator systems for language model inference. We measure how performance scales with concurrent load on NVIDIA 8×H100, 8×H200, and 8×B200 systems, as well as AMD 8×MI300X, Google TPU v6e-8, and SambaNova SN40L-16 systems, using DeepSeek R1, Llama 4 Maverick, Llama 3.3 70B, and GPT-OSS 120B.
For details on how these benchmarks are run, see our methodology section. Benchmarks are conducted at least quarterly, and the exact specifications for each run are listed in the System & Benchmark Specifications section below.
For model benchmarks, see our LLM model comparison.

Coming soon: Model Deployment Report
Highlights
Price per GPU Hour
Instance configurations may differ slightly by provider, even when using the same GPU model. Factors like memory size, interconnect bandwidth, and system architecture can impact pricing.
Prices are based on the US Central region, or the closest available equivalent if not directly listed (e.g., US East (Ohio) for AWS).
Pricing reflects on-demand hourly rates as listed by each cloud provider.
Prices shown are updated at the start of each month. While prices may fluctuate daily, we maintain a monthly update schedule to provide consistent comparisons.
Peak Output Speed per Query
Tokens per second received by each individual query after the first chunk has been received. Reported as the median across all queries in each concurrency phase.
- Max Throughput: Optimized for the highest sustained request volume.
- Minimum Latency: Tuned to deliver the fastest response times.
- Optimal: Combines the max-throughput and minimum-latency configurations by selecting, at each concurrency level, whichever delivers the higher system throughput.
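As a concrete illustration of how this metric can be derived from streaming responses, the sketch below computes each query's output speed after its first chunk and takes the median across a concurrency phase. The record fields and numbers are hypothetical, not taken from the benchmark harness, and the code assumes one token per streamed chunk.

```python
from statistics import median

def output_speed_per_query(first_chunk_time: float, last_chunk_time: float,
                           output_tokens: int) -> float:
    """Tokens per second received after the first chunk.

    Excluding the first chunk keeps prefill / time-to-first-token out of the
    streaming rate. Assumes one token per chunk for simplicity.
    """
    streaming_seconds = last_chunk_time - first_chunk_time
    return (output_tokens - 1) / streaming_seconds

# Hypothetical per-query timings for one concurrency phase:
# (first_chunk_time, last_chunk_time, output_tokens), times in seconds.
phase_queries = [
    (0.35, 10.35, 1000),
    (0.41, 11.21, 1000),
    (0.38, 10.88, 1000),
]

speeds = [output_speed_per_query(first, last, n) for first, last, n in phase_queries]
print(f"Median output speed per query: {median(speeds):.1f} tokens/s")
```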
System Output Throughput vs Output Speed per Query
The total number of tokens that can be processed per second across all concurrent requests. This metric measures the overall system capacity and efficiency, taking into account both the per-query output speed and the system's ability to handle multiple concurrent requests.
Tokens per second received by each individual query after the first chunk has been received. Reported as the median across all queries in each concurrency phase.
- Max Throughput: Optimized for the highest sustained request volume.
- Minimum Latency: Tuned to deliver the fastest response times.
- Optimal: Combines the max-throughput and minimum-latency configurations by selecting, at each concurrency level, whichever delivers the higher system throughput.
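A minimal sketch of how the two metrics relate, using made-up numbers: system output throughput sums output tokens across every in-flight request over the measurement window, while output speed per query looks at each request's own streaming rate.

```python
# Hypothetical measurement window for one concurrency phase.
# Each entry is (output_tokens, seconds_streaming) for a single request.
requests_in_window = [
    (1000, 10.0),
    (1000, 10.4),
    (1000, 10.1),
    (1000, 10.6),
]
window_seconds = 11.0  # wall-clock duration of the phase

# System output throughput: total tokens generated across all concurrent
# requests, divided by the wall-clock window.
system_throughput = sum(tokens for tokens, _ in requests_in_window) / window_seconds

# Output speed per query: each request's own streaming rate.
per_query_speeds = [tokens / secs for tokens, secs in requests_in_window]

print(f"System output throughput: {system_throughput:.0f} tokens/s")
print(f"Per-query output speeds:  {[round(s) for s in per_query_speeds]} tokens/s")
```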
System Output Throughput & Output Speed per Query vs. Concurrency
Tokens per second received by each individual query after the first chunk has been received. Reported as the median across all queries in each concurrency phase.
The total number of tokens that can be processed per second across all concurrent requests. This metric measures the overall system capacity and efficiency, taking into account both the per-query output speed and the system's ability to handle multiple concurrent requests.
The number of simultaneous requests that can be processed by the API at any given time. Higher concurrency enables better handling of parallel requests, which is essential for applications requiring high throughput and scalability.
- Max Throughput: Optimized for the highest sustained request volume.
- Minimum Latency: Tuned to deliver the fastest response times.
- Optimal: Combines the max-throughput and minimum-latency configurations by selecting, at each concurrency level, whichever delivers the higher system throughput.
Cost per Million Input and Output Tokens at 100 tokens/s Per Query Output Speed
The cost per million input and output tokens is calculated from the average price per GPU per hour and the system output throughput, assuming 1k input and 1k output tokens per request.
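As a sketch of the calculation (the 8-GPU node size and the doubling of output throughput to account for the equal share of input tokens are assumptions, not stated in the benchmark description):

```latex
\text{Cost per 1M tokens} \approx
  \frac{8 \times \text{Price per GPU-hour}}
       {\text{System Output Throughput (tokens/s)} \times 2 \times 3600\,\text{s}}
  \times 10^{6}
```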
Prices shown are updated at the start of each month. While prices may fluctuate daily, we maintain a monthly update schedule to provide consistent comparisons.
- Max Throughput: Optimized for the highest sustained request volume.
- Minimum Latency: Tuned to deliver the fastest response times.
- Optimal: Combines the max-throughput and minimum-latency configurations by selecting, at each concurrency level, whichever delivers the higher system throughput.
End-to-End Latency vs. Concurrency
The time it takes for a request to be processed and the response to be returned.
The number of simultaneous requests that can be processed by the API at any given time. Higher concurrency enables better handling of parallel requests, which is essential for applications requiring high throughput and scalability.
- Max Throughput: Optimized for the highest sustained request volume.
- Minimum Latency: Tuned to deliver the fastest response times.
- Optimal: Combines the max-throughput and minimum-latency configurations by selecting, at each concurrency level, whichever delivers the higher system throughput.
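A minimal sketch of how end-to-end latency can be measured at a fixed concurrency level, assuming a hypothetical async `send_request` helper in place of a real API client:

```python
import asyncio
import time

async def send_request(prompt: str) -> str:
    """Hypothetical stand-in for a real API call; here it just simulates work."""
    await asyncio.sleep(0.5)
    return "response"

async def measure_e2e_latency(concurrency: int, num_requests: int = 32) -> list[float]:
    """End-to-end latency per request: time from submission until the full
    response is returned, while `concurrency` requests run simultaneously."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies: list[float] = []

    async def one_request() -> None:
        async with semaphore:
            start = time.perf_counter()
            await send_request("1k-token prompt")
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one_request() for _ in range(num_requests)))
    return latencies

# Sweep concurrency levels and report the median end-to-end latency at each.
for level in (1, 4, 16):
    results = sorted(asyncio.run(measure_e2e_latency(level)))
    print(f"concurrency={level}: median e2e latency {results[len(results) // 2]:.2f}s")
```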
System & Benchmark Specifications
| Model Name | System | Provider | Precision | Tensor Parallel | Expert Parallel | Data Parallel | Inference Framework | Kernel Version | Configuration | Configuration Reference | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 4 Maverick | 8xB200 (SXM) | | fp8 | 8 | 1 | 1 | vLLM 0.9.0.1 | CUDA 12.8 | Default | | 4 June 2025 |
| Llama 4 Maverick | 8xH100 (SXM) | | fp8 | 8 | 1 | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | | 4 August 2025 |
| Llama 4 Maverick | 8xH200 (SXM) | | fp8 | 8 | 1 | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | | 4 August 2025 |
| Llama 4 Maverick | 8xB200 (SXM) | | fp8 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc3 | CUDA 13.0 | Max throughput | Link | 19 November 2025 |
| Llama 4 Maverick | 8xMI300X | runpod | fp8 | 8 | 1 | 1 | vLLM 0.11.1rc2 | ROCm 7.0 | Default | Link | 19 November 2025 |
| Llama 4 Maverick | 8xB200 (SXM) | | fp8 | 8 | 2 | 1 | TensorRT-LLM 1.2.0rc1 | CUDA 13.0 | Balanced | Link | 19 November 2025 |
| Llama 4 Maverick | 8xB200 (SXM) | | fp8 | 8 | 1 | 1 | TensorRT-LLM 1.2.0rc3 | CUDA 13.0 | Min latency | Link | 19 November 2025 |
| Llama 3.3 Instruct 70B | 8xB200 (SXM) | | bf16 | 8 | | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | | 4 August 2025 |
| Llama 3.3 Instruct 70B | 8xH100 (SXM) | | bf16 | 8 | | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | | 4 August 2025 |
| Llama 3.3 Instruct 70B | 8xH200 (SXM) | | bf16 | 8 | | 1 | vLLM 0.9.2 | CUDA 12.8 | Default | | 4 August 2025 |
| Llama 3.3 Instruct 70B | 8xMI300X | runpod | bf16 | 8 | | 1 | vLLM 0.11.1rc2 | ROCm 7.0 | Default | Link | 19 November 2025 |
| Llama 3.3 Instruct 70B | 8xB200 (SXM) | | bf16 | 8 | | 1 | TensorRT-LLM 1.2.0rc1 | CUDA 12.9 | Default | Link | 21 November 2025 |
| Llama 3.3 Instruct 70B | TPU v6e-8 | | bf16 | 8 | | 1 | vLLM vllm-tpu:nightly-20251129-28607fc-39e63de | XLA [v2-alpha-tpuv6e] | Default | Link | 9 December 2025 |
| gpt-oss-120B (high) | 8xB200 (SXM) | nebius | mxfp4 | 8 | 8 | 1 | TensorRT-LLM 1.1.0rc1 | CUDA 12.9 | Max throughput | Link | 29 September 2025 |
| gpt-oss-120B (high) | 8xH100 (SXM) | | mxfp4 | 8 | 1 | 1 | vLLM 0.10.0 | CUDA 12.8 | Default | | 8 August 2025 |
| gpt-oss-120B (high) | 8xH200 (SXM) | | mxfp4 | 8 | 1 | 1 | vLLM 0.10.0 | CUDA 12.8 | Default | | 8 August 2025 |
| gpt-oss-120B (high) | 8xB200 (SXM) | nebius | mxfp4 | 8 | 1 | 1 | TensorRT-LLM 1.1.0rc1 | CUDA 12.9 | Min latency | Link | 29 September 2025 |
| gpt-oss-120B (high) | 8xB200 (SXM) | nebius | mxfp4 | 8 | | 1 | TensorRT-LLM 1.1.0rc1 | CUDA 12.9 | Optimal | | 15 October 2025 |
| gpt-oss-120B (high) | 8xMI300X | runpod | mxfp4 | 8 | 8 | 1 | vLLM 0.10.1 | ROCm 7.0 | Default | Link | 19 November 2025 |
| DeepSeek R1 0528 (May '25) | 8xH200 (SXM) | | fp8 | 8 | 1 | 1 | vLLM 0.9.0.1 | CUDA 12.8 | Default | | 5 June 2025 |
| DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | | fp8 | 8 | 1 | 1 | vLLM 0.9.0.1 | CUDA 12.8 | Default | | 4 June 2025 |
| DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | | fp4 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc2 | CUDA 13.0 | Optimal | | 19 November 2025 |
| DeepSeek R1 0528 (May '25) | 8xH200 (SXM) | | fp8 | 8 | 8 | 1 | TensorRT-LLM 1.0.0rc2 | CUDA 12.9 | Max throughput | Link | 31 July 2025 |
| DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | | fp8 | 8 | 8 | 1 | TensorRT-LLM 1.1.0rc2.post1 | CUDA 12.9 | Max throughput | Link | 8 September 2025 |
| DeepSeek R1 0528 (May '25) | 8xMI300X | runpod | fp8 | 8 | 1 | 1 | SGLang 0.5.2rc2 | ROCm 7.0 | Default | Link | 19 November 2025 |
| DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | | fp8 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc2 | CUDA 13.0 | Min latency | Link | 19 November 2025 |
| DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | | fp4 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc3 | CUDA 13.0 | Max throughput | Link | 19 November 2025 |
| DeepSeek R1 0528 (May '25) | SN40L-16 | sambanova | fp8 | 1 | 1 | 1 | na | - | Optimal | | 12 January 2026 |
| DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | | fp8 | 8 | | 1 | TensorRT-LLM 1.2.0rc2 | CUDA 13.0 | Optimal | | 19 November 2025 |
| DeepSeek R1 0528 (May '25) | 8xB200 (SXM) | | fp4 | 8 | 8 | 1 | TensorRT-LLM 1.2.0rc2 | CUDA 13.0 | Min latency | Link | 19 November 2025 |