Inference Serving Configurations
Every serving configuration used in AA-AgentPerf hardware benchmarking, from June 2026 onwards. Each entry captures the model, accelerator system, precision, inference framework, parallelism topology, and full launch command.
AA-AgentPerf
Serving configurations from AA-AgentPerf benchmark runs, with the full launch command for each. Read the methodology
Showing 6 of 6 configurations
DeepSeek V4 Pro (Max) — 6 configurations
SystemPrecisionFrameworkParallelismAdded
# TensorRT-LLM 1.3.0rc15.post1 disaggregated serving on a single 8x B300 node:
# 1x context worker (TP4 + EP4, attention-DP, GPUs 4-7) + 1x generation worker (TP4 + EP4,
# attention-DP, GPUs 0-3), colocated, + disagg orchestrator with conversation-affinity routing.
# SLURM allocation: --exclusive --cpus-per-task=28 (224 CPUs).
# Worker environment:
# TLLM_LOG_LEVEL=INFO TRTLLM_SERVER_DISABLE_GC=1 TRTLLM_WORKER_DISABLE_GC=1
# TRTLLM_ENABLE_PDL=1 NCCL_GRAPH_MIXING_SUPPORT=0 TLLM_ADP_ROUTER_MATCH_RATE_THRESHOLD=0.10
# TLLM_ADP_ROUTER_DECISION_LOG=1 TLLM_ADP_ROUTER_INCLUDE_TRANSFER_LOAD=1
# TLLM_ADP_ROUTER_RANDOMIZE_TIEBREAK=1 TRTLLM_DISABLE_KV_CACHE_RATIO_UPDATE=1
# TRTLLM_MHC_ENABLE_FUSED_HC=0 UCX_TLS=rc,sm,self,cuda_copy,cuda_ipc
# HF_HUB_OFFLINE=1 TRTLLM_KV_CACHE_TRANSFER_TIMEOUT_SEC=480
# Generation worker (CUDA_VISIBLE_DEVICES=0,1,2,3):
trtllm-serve <model-path> --config gen_config.yaml \
--custom_tokenizer deepseek_v4 --tool_parser deepseek_v4
# Context worker (CUDA_VISIBLE_DEVICES=4,5,6,7):
trtllm-serve <model-path> --config ctx_config.yaml \
--custom_tokenizer deepseek_v4 --tool_parser deepseek_v4
# Orchestrator:
trtllm-serve disaggregated -c server_config.yaml -t 7200 -r 7200
# ---------------- gen_config.yaml ----------------
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
pipeline_parallel_size: 1
enable_attention_dp: true
enable_lm_head_tp_in_adp: true
max_batch_size: 64
max_num_tokens: 1024
max_seq_len: 131072
enable_autotuner: false
print_iter_log: true
stream_interval: 100
num_postprocess_workers: 4
moe_config:
backend: TRTLLM
use_low_precision_moe_combine: true
kv_cache_config:
enable_block_reuse: false
free_gpu_memory_fraction: 0.8
dtype: fp8
host_cache_size: 0
iteration_stats_interval: 1
use_kv_cache_manager_v2: false
cache_transceiver_config:
max_tokens_in_buffer: 131072
backend: NIXL
transceiver_runtime: PYTHON
cuda_graph_config:
enable_padding: true
batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
return_perf_metrics: false
perf_metrics_max_requests: 100000
enable_iter_perf_stats: false
enable_iter_req_stats: false
# ---------------- ctx_config.yaml ----------------
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
pipeline_parallel_size: 1
enable_attention_dp: true
max_batch_size: 128
max_num_tokens: 8192
max_seq_len: 131072
enable_chunked_prefill: true
disable_overlap_scheduler: true
enable_autotuner: true
print_iter_log: true
cuda_graph_config: null
moe_config:
backend: TRTLLM
kv_cache_config:
enable_block_reuse: true
free_gpu_memory_fraction: 0.8
dtype: fp8
host_cache_size: 214748364800
event_buffer_max_size: 16384
iteration_stats_interval: 1
use_kv_cache_manager_v2: false
cache_transceiver_config:
max_tokens_in_buffer: 131072
backend: NIXL
transceiver_runtime: PYTHON
scheduler_config:
capacity_scheduler_policy: MAX_UTILIZATION
attention_dp_config:
enable_kv_cache_aware_routing: true
kv_cache_routing_load_balance_weight: 0.5
kv_cache_routing_cold_start_warmup: true
kv_cache_routing_match_rate_threshold: 0.1
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
return_perf_metrics: false
perf_metrics_max_requests: 100000
enable_iter_perf_stats: false
enable_iter_req_stats: false
# ---------------- server_config.yaml (topology; hosts/ports omitted) ----------------
backend: pytorch
context_servers:
num_instances: 1
router:
type: conversation
generation_servers:
num_instances: 1
router:
type: conversationSystemB300 x8EngineTRT-LLM (1.3.0rc15.post1)ParallelismTP=4 DP=2 EP=4PrecisionNVFP4GPUs used8
Submitted by NVIDIA
# TensorRT-LLM 1.3.0rc15.post1 disaggregated serving on GB300 NVL72 — 40 GPUs (10 compute trays x 4):
# 6x context workers (TP4, attention-DP, 4 GPUs each) + 1x generation worker (16 GPUs over 4 trays,
# attention-DP16 + MoE EP16) + disagg orchestrator with conversation-affinity routing.
# Worker environment (every worker; CUDA_VISIBLE_DEVICES=0,1,2,3 per tray):
# TLLM_LOG_LEVEL=INFO TRTLLM_SERVER_DISABLE_GC=1 TRTLLM_WORKER_DISABLE_GC=1
# TRTLLM_ENABLE_PDL=1 NCCL_GRAPH_MIXING_SUPPORT=0 TLLM_ADP_ROUTER_INCLUDE_TRANSFER_LOAD=1
# TLLM_ADP_ROUTER_RANDOMIZE_TIEBREAK=1 TRTLLM_DISABLE_KV_CACHE_RATIO_UPDATE=1
# TRTLLM_MHC_ENABLE_FUSED_HC=1 HF_HUB_OFFLINE=1 TRTLLM_KV_CACHE_TRANSFER_TIMEOUT_SEC=480
# Generation worker (16 GPUs, MPI across 4 trays):
trtllm-serve <model-path> --config gen_config.yaml \
--custom_tokenizer deepseek_v4 --tool_parser deepseek_v4
# Context workers (x6, one per tray):
trtllm-serve <model-path> --config ctx_config.yaml \
--custom_tokenizer deepseek_v4 --tool_parser deepseek_v4
# Orchestrator:
trtllm-serve disaggregated -c server_config.yaml -t 7200 -r 7200
# ---------------- gen_config.yaml ----------------
tensor_parallel_size: 16
moe_expert_parallel_size: 16
enable_attention_dp: true
enable_lm_head_tp_in_adp: true
pipeline_parallel_size: 1
max_batch_size: 32
max_num_tokens: 128
max_seq_len: 131072
print_iter_log: true
enable_autotuner: true
kv_cache_config:
enable_block_reuse: false
free_gpu_memory_fraction: 0.7
dtype: fp8
host_cache_size: 0
avg_seq_len: 39040
moe_config:
backend: MEGAMOE_DEEPGEMM
use_low_precision_moe_combine: true
# Static MoE expert-placement map (initial_global_assignments, 384 slots per
# EP16 rank; ~110 KB of slot->expert arrays). Published as a separate artifact
# - see the reference link on this config.
load_balancer: moe_load_balancer_gen_ep16_slots384_v2.yaml
cache_transceiver_config:
max_tokens_in_buffer: 131072
backend: NIXL
transceiver_runtime: PYTHON
stream_interval: 100
num_postprocess_workers: 8
cuda_graph_config:
enable_padding: true
max_batch_size: 32
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
allow_advanced_sampling: true
sparse_attention_config:
algorithm: deepseek_v4
enable_heuristic_topk: true
return_perf_metrics: false
enable_iter_perf_stats: false
enable_iter_req_stats: false
# ---------------- ctx_config.yaml ----------------
max_batch_size: 128
max_num_tokens: 8192
max_seq_len: 131072
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: true
pipeline_parallel_size: 1
enable_chunked_prefill: true
print_iter_log: true
cuda_graph_config: null
enable_autotuner: true
disable_overlap_scheduler: true
moe_config:
backend: TRTLLM
kv_cache_config:
enable_block_reuse: true
free_gpu_memory_fraction: 0.6
dtype: fp8
host_cache_size: 128849018880
event_buffer_max_size: 16384
pool_ratio:
- 0.3
- 0.45
- 0.25
cache_transceiver_config:
max_tokens_in_buffer: 131072
backend: NIXL
transceiver_runtime: PYTHON
scheduler_config:
capacity_scheduler_policy: MAX_UTILIZATION
attention_dp_config:
kv_cache_routing_conversation_affinity: true
kv_cache_routing_max_sessions: 65536
return_perf_metrics: false
enable_iter_perf_stats: false
enable_iter_req_stats: false
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
allow_advanced_sampling: true
# ---------------- server_config.yaml ----------------
backend: pytorch
context_servers:
num_instances: 6
router:
type: conversation
generation_servers:
num_instances: 1
router:
type: conversationSystemGB300 NVL72EngineTRT-LLM (1.3.0rc15.post1)ParallelismTP=1 DP=16 EP=16PrecisionNVFP4GPUs used40 of 72
Submitted by NVIDIA
# TensorRT-LLM 1.3.0rc15.post1 disaggregated serving on GB300 NVL72 — 32 GPUs (8 compute trays x 4):
# 6x context workers (TP4, attention-DP, 4 GPUs each) + 1x generation worker (8 GPUs over 2 trays,
# attention-DP8 + MoE EP8) + disagg orchestrator with conversation-affinity routing.
# Worker environment (every worker; CUDA_VISIBLE_DEVICES=0,1,2,3 per tray):
# TLLM_LOG_LEVEL=INFO TRTLLM_SERVER_DISABLE_GC=1 TRTLLM_WORKER_DISABLE_GC=1
# TRTLLM_ENABLE_PDL=1 NCCL_GRAPH_MIXING_SUPPORT=0 TLLM_ADP_ROUTER_MATCH_RATE_THRESHOLD=0.10
# HF_HUB_OFFLINE=1 TRTLLM_KV_CACHE_TRANSFER_TIMEOUT_SEC=480
# Generation worker (8 GPUs, MPI across 2 trays):
trtllm-serve <model-path> --config gen_config.yaml \
--custom_tokenizer deepseek_v4 --tool_parser deepseek_v4
# Context workers (x6, one per tray):
trtllm-serve <model-path> --config ctx_config.yaml \
--custom_tokenizer deepseek_v4 --tool_parser deepseek_v4
# Orchestrator:
trtllm-serve disaggregated -c server_config.yaml -t 7200 -r 7200
# ---------------- gen_config.yaml ----------------
tensor_parallel_size: 8
moe_expert_parallel_size: 8
enable_attention_dp: true
enable_lm_head_tp_in_adp: true
pipeline_parallel_size: 1
max_batch_size: 192
max_num_tokens: 768
max_seq_len: 131072
print_iter_log: true
enable_autotuner: false
kv_cache_config:
enable_block_reuse: false
free_gpu_memory_fraction: 0.8
dtype: fp8
host_cache_size: 0
avg_seq_len: 39040
moe_config:
backend: MEGAMOE_DEEPGEMM
use_low_precision_moe_combine: true
cache_transceiver_config:
max_tokens_in_buffer: 131072
backend: NIXL
transceiver_runtime: PYTHON
stream_interval: 100
num_postprocess_workers: 8
cuda_graph_config:
enable_padding: true
batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 192
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
allow_advanced_sampling: true
sparse_attention_config:
algorithm: deepseek_v4
enable_heuristic_topk: true
return_perf_metrics: false
enable_iter_perf_stats: false
enable_iter_req_stats: false
# ---------------- ctx_config.yaml ----------------
max_batch_size: 128
max_num_tokens: 8192
max_seq_len: 131072
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: true
pipeline_parallel_size: 1
enable_chunked_prefill: true
print_iter_log: true
cuda_graph_config: null
enable_autotuner: true
disable_overlap_scheduler: true
moe_config:
backend: TRTLLM
kv_cache_config:
enable_block_reuse: true
free_gpu_memory_fraction: 0.6
dtype: fp8
host_cache_size: 128849018880
event_buffer_max_size: 16384
pool_ratio:
- 0.3
- 0.45
- 0.25
cache_transceiver_config:
max_tokens_in_buffer: 131072
backend: NIXL
transceiver_runtime: PYTHON
scheduler_config:
capacity_scheduler_policy: MAX_UTILIZATION
attention_dp_config:
enable_kv_cache_aware_routing: false
kv_cache_routing_conversation_affinity: true
kv_cache_routing_max_sessions: 65536
return_perf_metrics: false
enable_iter_perf_stats: false
enable_iter_req_stats: false
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
# ---------------- server_config.yaml ----------------
backend: pytorch
context_servers:
num_instances: 6
router:
type: conversation
generation_servers:
num_instances: 1
router:
type: conversationSystemGB300 NVL72EngineTRT-LLM (1.3.0rc15.post1)ParallelismTP=1 DP=8 EP=8PrecisionNVFP4GPUs used32 of 72
Submitted by NVIDIA
docker run --gpus all --shm-size 32g --network host \
-v path/to/huggingface:/root/.cache/huggingface \
lmsysorg/sglang:deepseek-v4-hopper \
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V4-Pro \
--tp 8 --context-length 393216 --mem-fraction-static 0.88 \
--speculative-algorithm EAGLE --speculative-num-steps 1 \
--speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--moe-runner-backend marlin --host 0.0.0.0 --port 30000SystemH200 (SXM) x8EngineSGLang (0.5.11 DSv4)ParallelismTP=8 DP=1 EP=1PrecisionNVFP4GPUs used8
Submitted by Artificial Analysis
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--shm-size 32g --ipc=host --network=host --privileged \
-v <host-path>:/model \
-e SGLANG_DSV4_FP4_EXPERTS=True \
-e SGLANG_FORCE_TRITON_MOE_FP8=0 \
-e SGLANG_USE_AITER=1 \
-e SGLANG_USE_ROCM700A=1 \
-e SGLANG_HACK_FLASHMLA_BACKEND=triton \
-e SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1 \
-e SGLANG_TOPK_TRANSFORM_512_TORCH=0 \
-e SGLANG_OPT_USE_FUSED_COMPRESS=true \
-e SGLANG_OPT_USE_OLD_COMPRESSOR=false \
-e SGLANG_OPT_USE_AITER_MHC_PRE=true \
-e SGLANG_OPT_USE_AITER_MHC_POST=true \
-e SGLANG_OPT_USE_FUSED_STORE_CACHE=true \
-e SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false \
-e SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true \
-e SGLANG_OPT_FUSE_WQA_WKV=true \
-e SGLANG_OPT_USE_TRITON_SWA_PREPARE=true \
-e SGLANG_OPT_USE_FUSED_HASH_TOPK=true \
-e SGLANG_OPT_DPSK_V4_RADIX=1 \
-e SGLANG_OPT_USE_TILELANG_INDEXER=true \
-e AITER_BF16_FP8_MOE_BOUND=1 \
-e SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0 \
rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4 \
python3 -m sglang.launch_server \
--model-path /model \
--served-model-name deepseek-ai/DeepSeek-V4-Pro \
--tp 8 \
--port 30000 \
--host 0.0.0.0 \
--trust-remote-code \
--attention-backend compressed \
--max-running-requests 256 \
--page-size 256 \
--context-length 32768 \
--mem-fraction-static 0.90 \
--swa-full-tokens-ratio 0.15 \
--chunked-prefill-size 8192 \
--disable-shared-experts-fusion \
--tool-call-parser deepseekv4 \
--reasoning-parser deepseek-v4Host-specific paths are redacted as <host-path>.
SystemMI355X x8EngineSGLang (ROCm 7.2 DSv4 dev)ParallelismTP=8 DP=1 EP=1PrecisionFP4GPUs used8
Submitted by Artificial Analysis
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--shm-size 32g --ipc=host --network=host --privileged \
-v <host-path>:/model \
-e SGLANG_DSV4_FP4_EXPERTS=True \
-e SGLANG_FORCE_TRITON_MOE_FP8=0 \
-e SGLANG_USE_AITER=1 \
-e SGLANG_USE_ROCM700A=1 \
-e SGLANG_HACK_FLASHMLA_BACKEND=triton \
-e SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1 \
-e SGLANG_TOPK_TRANSFORM_512_TORCH=0 \
-e SGLANG_OPT_USE_FUSED_COMPRESS=true \
-e SGLANG_OPT_USE_OLD_COMPRESSOR=false \
-e SGLANG_OPT_USE_AITER_MHC_PRE=true \
-e SGLANG_OPT_USE_AITER_MHC_POST=true \
-e SGLANG_OPT_USE_FUSED_STORE_CACHE=true \
-e SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false \
-e SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true \
-e SGLANG_OPT_FUSE_WQA_WKV=true \
-e SGLANG_OPT_USE_TRITON_SWA_PREPARE=true \
-e SGLANG_OPT_USE_FUSED_HASH_TOPK=true \
-e SGLANG_OPT_DPSK_V4_RADIX=1 \
-e SGLANG_OPT_USE_TILELANG_INDEXER=true \
-e AITER_BF16_FP8_MOE_BOUND=1 \
-e SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0 \
rocm/sgl-dev:rocm720-mi35x-3b59630-20260526-DSv4 \
python3 -m sglang.launch_server \
--model-path /model \
--served-model-name deepseek-ai/DeepSeek-V4-Pro \
--tp 8 \
--port 30000 \
--host 0.0.0.0 \
--trust-remote-code \
--attention-backend compressed \
--max-running-requests 256 \
--page-size 256 \
--chunked-prefill-size 8192 \
--disable-shared-experts-fusion \
--tool-call-parser deepseekv4 \
--reasoning-parser deepseek-v4Host-specific paths are redacted as <host-path>.
SystemMI355X x8EngineSGLang (ROCm 7.2 DSv4 dev)ParallelismTP=8 DP=1 EP=1PrecisionFP4GPUs used8
Submitted by Artificial Analysis
AA-SLT
Coming soon. In the meantime, AA-SLT serving configurations are published in the System & Benchmark Specifications table.