Inference Serving Configurations

Every serving configuration used in AA-AgentPerf hardware benchmarking, from June 2026 onwards. Each entry captures the model, accelerator system, precision, inference framework, parallelism topology, and full launch command.

Read the AA-AgentPerf methodology.

AA-AgentPerf

DeepSeek V4 Pro (Max): 6 configurations

# TensorRT-LLM 1.3.0rc15.post1 disaggregated serving on a single 8x B300 node:
# 1x context worker (TP4 + EP4, attention-DP, GPUs 4-7) + 1x generation worker (TP4 + EP4,
# attention-DP, GPUs 0-3), colocated, + disagg orchestrator with conversation-affinity routing.
# SLURM allocation: --exclusive --cpus-per-task=28 (224 CPUs).

# Worker environment:
#   TLLM_LOG_LEVEL=INFO TRTLLM_SERVER_DISABLE_GC=1 TRTLLM_WORKER_DISABLE_GC=1
#   TRTLLM_ENABLE_PDL=1 NCCL_GRAPH_MIXING_SUPPORT=0 TLLM_ADP_ROUTER_MATCH_RATE_THRESHOLD=0.10
#   TLLM_ADP_ROUTER_DECISION_LOG=1 TLLM_ADP_ROUTER_INCLUDE_TRANSFER_LOAD=1
#   TLLM_ADP_ROUTER_RANDOMIZE_TIEBREAK=1 TRTLLM_DISABLE_KV_CACHE_RATIO_UPDATE=1
#   TRTLLM_MHC_ENABLE_FUSED_HC=0 UCX_TLS=rc,sm,self,cuda_copy,cuda_ipc
#   HF_HUB_OFFLINE=1 TRTLLM_KV_CACHE_TRANSFER_TIMEOUT_SEC=480

# Generation worker (CUDA_VISIBLE_DEVICES=0,1,2,3):
trtllm-serve <model-path> --config gen_config.yaml \
  --custom_tokenizer deepseek_v4 --tool_parser deepseek_v4

# Context worker (CUDA_VISIBLE_DEVICES=4,5,6,7):
trtllm-serve <model-path> --config ctx_config.yaml \
  --custom_tokenizer deepseek_v4 --tool_parser deepseek_v4

# Orchestrator:
trtllm-serve disaggregated -c server_config.yaml -t 7200 -r 7200

# ---------------- gen_config.yaml ----------------
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
pipeline_parallel_size: 1
enable_attention_dp: true
enable_lm_head_tp_in_adp: true
max_batch_size: 64
max_num_tokens: 1024
max_seq_len: 131072
enable_autotuner: false
print_iter_log: true
stream_interval: 100
num_postprocess_workers: 4
moe_config:
  backend: TRTLLM
  use_low_precision_moe_combine: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
  dtype: fp8
  host_cache_size: 0
  iteration_stats_interval: 1
  use_kv_cache_manager_v2: false
cache_transceiver_config:
  max_tokens_in_buffer: 131072
  backend: NIXL
  transceiver_runtime: PYTHON
cuda_graph_config:
  enable_padding: true
  batch_sizes:
  - 1
  - 2
  - 4
  - 8
  - 16
  - 32
  - 64
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
return_perf_metrics: false
perf_metrics_max_requests: 100000
enable_iter_perf_stats: false
enable_iter_req_stats: false

# ---------------- ctx_config.yaml ----------------
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
pipeline_parallel_size: 1
enable_attention_dp: true
max_batch_size: 128
max_num_tokens: 8192
max_seq_len: 131072
enable_chunked_prefill: true
disable_overlap_scheduler: true
enable_autotuner: true
print_iter_log: true
cuda_graph_config: null
moe_config:
  backend: TRTLLM
kv_cache_config:
  enable_block_reuse: true
  free_gpu_memory_fraction: 0.8
  dtype: fp8
  host_cache_size: 214748364800
  event_buffer_max_size: 16384
  iteration_stats_interval: 1
  use_kv_cache_manager_v2: false
cache_transceiver_config:
  max_tokens_in_buffer: 131072
  backend: NIXL
  transceiver_runtime: PYTHON
scheduler_config:
  capacity_scheduler_policy: MAX_UTILIZATION
attention_dp_config:
  enable_kv_cache_aware_routing: true
  kv_cache_routing_load_balance_weight: 0.5
  kv_cache_routing_cold_start_warmup: true
  kv_cache_routing_match_rate_threshold: 0.1
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
return_perf_metrics: false
perf_metrics_max_requests: 100000
enable_iter_perf_stats: false
enable_iter_req_stats: false

# ---------------- server_config.yaml (topology; hosts/ports omitted) ----------------
backend: pytorch
context_servers:
  num_instances: 1
  router:
    type: conversation
generation_servers:
  num_instances: 1
  router:
    type: conversation

System: B300 x8 | Engine: TRT-LLM (1.3.0rc15.post1) | Parallelism: TP=4 DP=2 EP=4 | Precision: NVFP4 | GPUs used: 8

Submitted by NVIDIA
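
The `host_cache_size` values in the context-worker configs on this page are raw integers that are easier to compare once converted to GiB. The snippet below is a minimal sketch of that conversion. It assumes the field is expressed in bytes, as the magnitudes suggest, and `bytes_to_gib` is an illustrative helper, not part of TensorRT-LLM.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a raw byte count to GiB (hypothetical helper for reading configs)."""
    return n_bytes / GIB


# host_cache_size values copied from the ctx_config.yaml files on this page.
host_cache_sizes = {
    "B300 x8 context worker": 214748364800,
    "GB300 NVL72 context workers": 128849018880,
}

for label, size in host_cache_sizes.items():
    print(f"{label}: {bytes_to_gib(size):.0f} GiB")
```

Under that assumption, the B300 context worker is given a 200 GiB host-side cache and each GB300 NVL72 context worker 120 GiB, while every generation worker sets `host_cache_size: 0`.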

# TensorRT-LLM 1.3.0rc15.post1 disaggregated serving on GB300 NVL72 — 40 GPUs (10 compute trays x 4):
# 6x context workers (TP4, attention-DP, 4 GPUs each) + 1x generation worker (16 GPUs over 4 trays,
# attention-DP16 + MoE EP16) + disagg orchestrator with conversation-affinity routing.

# Worker environment (every worker; CUDA_VISIBLE_DEVICES=0,1,2,3 per tray):
#   TLLM_LOG_LEVEL=INFO TRTLLM_SERVER_DISABLE_GC=1 TRTLLM_WORKER_DISABLE_GC=1
#   TRTLLM_ENABLE_PDL=1 NCCL_GRAPH_MIXING_SUPPORT=0 TLLM_ADP_ROUTER_INCLUDE_TRANSFER_LOAD=1
#   TLLM_ADP_ROUTER_RANDOMIZE_TIEBREAK=1 TRTLLM_DISABLE_KV_CACHE_RATIO_UPDATE=1
#   TRTLLM_MHC_ENABLE_FUSED_HC=1 HF_HUB_OFFLINE=1 TRTLLM_KV_CACHE_TRANSFER_TIMEOUT_SEC=480

# Generation worker (16 GPUs, MPI across 4 trays):
trtllm-serve <model-path> --config gen_config.yaml \
  --custom_tokenizer deepseek_v4 --tool_parser deepseek_v4

# Context workers (x6, one per tray):
trtllm-serve <model-path> --config ctx_config.yaml \
  --custom_tokenizer deepseek_v4 --tool_parser deepseek_v4

# Orchestrator:
trtllm-serve disaggregated -c server_config.yaml -t 7200 -r 7200

# ---------------- gen_config.yaml ----------------
tensor_parallel_size: 16
moe_expert_parallel_size: 16
enable_attention_dp: true
enable_lm_head_tp_in_adp: true
pipeline_parallel_size: 1
max_batch_size: 32
max_num_tokens: 128
max_seq_len: 131072
print_iter_log: true
enable_autotuner: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.7
  dtype: fp8
  host_cache_size: 0
  avg_seq_len: 39040
moe_config:
  backend: MEGAMOE_DEEPGEMM
  use_low_precision_moe_combine: true
  # Static MoE expert-placement map (initial_global_assignments, 384 slots per
  # EP16 rank; ~110 KB of slot->expert arrays). Published as a separate artifact
  # - see the reference link on this config.
  load_balancer: moe_load_balancer_gen_ep16_slots384_v2.yaml
cache_transceiver_config:
  max_tokens_in_buffer: 131072
  backend: NIXL
  transceiver_runtime: PYTHON
stream_interval: 100
num_postprocess_workers: 8
cuda_graph_config:
  enable_padding: true
  max_batch_size: 32
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  allow_advanced_sampling: true
sparse_attention_config:
  algorithm: deepseek_v4
  enable_heuristic_topk: true
return_perf_metrics: false
enable_iter_perf_stats: false
enable_iter_req_stats: false

# ---------------- ctx_config.yaml ----------------
max_batch_size: 128
max_num_tokens: 8192
max_seq_len: 131072
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: true
pipeline_parallel_size: 1
enable_chunked_prefill: true
print_iter_log: true
cuda_graph_config: null
enable_autotuner: true
disable_overlap_scheduler: true
moe_config:
  backend: TRTLLM
kv_cache_config:
  enable_block_reuse: true
  free_gpu_memory_fraction: 0.6
  dtype: fp8
  host_cache_size: 128849018880
  event_buffer_max_size: 16384
  pool_ratio:
  - 0.3
  - 0.45
  - 0.25
cache_transceiver_config:
  max_tokens_in_buffer: 131072
  backend: NIXL
  transceiver_runtime: PYTHON
scheduler_config:
  capacity_scheduler_policy: MAX_UTILIZATION
attention_dp_config:
  kv_cache_routing_conversation_affinity: true
  kv_cache_routing_max_sessions: 65536
return_perf_metrics: false
enable_iter_perf_stats: false
enable_iter_req_stats: false
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  allow_advanced_sampling: true

# ---------------- server_config.yaml ----------------
backend: pytorch
context_servers:
  num_instances: 6
  router:
    type: conversation
generation_servers:
  num_instances: 1
  router:
    type: conversation

System: GB300 NVL72 | Engine: TRT-LLM (1.3.0rc15.post1) | Parallelism: TP=1 DP=16 EP=16 | Precision: NVFP4 | GPUs used: 40 of 72

Submitted by NVIDIA

# TensorRT-LLM 1.3.0rc15.post1 disaggregated serving on GB300 NVL72 — 32 GPUs (8 compute trays x 4):
# 6x context workers (TP4, attention-DP, 4 GPUs each) + 1x generation worker (8 GPUs over 2 trays,
# attention-DP8 + MoE EP8) + disagg orchestrator with conversation-affinity routing.

# Worker environment (every worker; CUDA_VISIBLE_DEVICES=0,1,2,3 per tray):
#   TLLM_LOG_LEVEL=INFO TRTLLM_SERVER_DISABLE_GC=1 TRTLLM_WORKER_DISABLE_GC=1
#   TRTLLM_ENABLE_PDL=1 NCCL_GRAPH_MIXING_SUPPORT=0 TLLM_ADP_ROUTER_MATCH_RATE_THRESHOLD=0.10
#   HF_HUB_OFFLINE=1 TRTLLM_KV_CACHE_TRANSFER_TIMEOUT_SEC=480

# Generation worker (8 GPUs, MPI across 2 trays):
trtllm-serve <model-path> --config gen_config.yaml \
  --custom_tokenizer deepseek_v4 --tool_parser deepseek_v4

# Context workers (x6, one per tray):
trtllm-serve <model-path> --config ctx_config.yaml \
  --custom_tokenizer deepseek_v4 --tool_parser deepseek_v4

# Orchestrator:
trtllm-serve disaggregated -c server_config.yaml -t 7200 -r 7200

# ---------------- gen_config.yaml ----------------
tensor_parallel_size: 8
moe_expert_parallel_size: 8
enable_attention_dp: true
enable_lm_head_tp_in_adp: true
pipeline_parallel_size: 1
max_batch_size: 192
max_num_tokens: 768
max_seq_len: 131072
print_iter_log: true
enable_autotuner: false
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
  dtype: fp8
  host_cache_size: 0
  avg_seq_len: 39040
moe_config:
  backend: MEGAMOE_DEEPGEMM
  use_low_precision_moe_combine: true
cache_transceiver_config:
  max_tokens_in_buffer: 131072
  backend: NIXL
  transceiver_runtime: PYTHON
stream_interval: 100
num_postprocess_workers: 8
cuda_graph_config:
  enable_padding: true
  batch_sizes:
  - 1
  - 2
  - 4
  - 8
  - 16
  - 32
  - 64
  - 128
  - 192
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  allow_advanced_sampling: true
sparse_attention_config:
  algorithm: deepseek_v4
  enable_heuristic_topk: true
return_perf_metrics: false
enable_iter_perf_stats: false
enable_iter_req_stats: false

# ---------------- ctx_config.yaml ----------------
max_batch_size: 128
max_num_tokens: 8192
max_seq_len: 131072
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: true
pipeline_parallel_size: 1
enable_chunked_prefill: true
print_iter_log: true
cuda_graph_config: null
enable_autotuner: true
disable_overlap_scheduler: true
moe_config:
  backend: TRTLLM
kv_cache_config:
  enable_block_reuse: true
  free_gpu_memory_fraction: 0.6
  dtype: fp8
  host_cache_size: 128849018880
  event_buffer_max_size: 16384
  pool_ratio:
  - 0.3
  - 0.45
  - 0.25
cache_transceiver_config:
  max_tokens_in_buffer: 131072
  backend: NIXL
  transceiver_runtime: PYTHON
scheduler_config:
  capacity_scheduler_policy: MAX_UTILIZATION
attention_dp_config:
  enable_kv_cache_aware_routing: false
  kv_cache_routing_conversation_affinity: true
  kv_cache_routing_max_sessions: 65536
return_perf_metrics: false
enable_iter_perf_stats: false
enable_iter_req_stats: false
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3

# ---------------- server_config.yaml ----------------
backend: pytorch
context_servers:
  num_instances: 6
  router:
    type: conversation
generation_servers:
  num_instances: 1
  router:
    type: conversation

System: GB300 NVL72 | Engine: TRT-LLM (1.3.0rc15.post1) | Parallelism: TP=1 DP=8 EP=8 | Precision: NVFP4 | GPUs used: 32 of 72

Submitted by NVIDIA
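
The GPU totals quoted for the three TensorRT-LLM entries follow from the worker counts and per-worker GPU counts given in each header comment. Here is a minimal sketch of that arithmetic. The `DisaggTopology` dataclass is hypothetical and exists only for illustration. It is not a TensorRT-LLM type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggTopology:
    """Hypothetical summary of a disaggregated deployment (illustration only)."""
    ctx_workers: int    # number of context (prefill) workers
    gpus_per_ctx: int   # GPUs per context worker
    gen_workers: int    # number of generation (decode) workers
    gpus_per_gen: int   # GPUs per generation worker

    def total_gpus(self) -> int:
        return (self.ctx_workers * self.gpus_per_ctx
                + self.gen_workers * self.gpus_per_gen)


# Values read from the header comments of the three TensorRT-LLM entries.
topologies = {
    "B300 x8": DisaggTopology(1, 4, 1, 4),
    "GB300 NVL72 (EP16 generation)": DisaggTopology(6, 4, 1, 16),
    "GB300 NVL72 (EP8 generation)": DisaggTopology(6, 4, 1, 8),
}

for name, topo in topologies.items():
    print(f"{name}: {topo.total_gpus()} GPUs")
```

The totals come out to 8, 40, and 32 GPUs, matching the "GPUs used" figures listed for each entry.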

docker run --gpus all --shm-size 32g --network host \
  -v path/to/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:deepseek-v4-hopper \
  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V4-Pro \
    --tp 8 --context-length 393216 --mem-fraction-static 0.88 \
    --speculative-algorithm EAGLE --speculative-num-steps 1 \
    --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --moe-runner-backend marlin --host 0.0.0.0 --port 30000

System: H200 (SXM) x8 | Engine: SGLang (0.5.11 DSv4) | Parallelism: TP=8 DP=1 EP=1 | Precision: NVFP4 | GPUs used: 8

Submitted by Artificial Analysis

docker run -d \
  --device=/dev/kfd --device=/dev/dri \
  --shm-size 32g --ipc=host --network=host --privileged \
  -v <host-path>:/model \
  -e SGLANG_DSV4_FP4_EXPERTS=True \
  -e SGLANG_FORCE_TRITON_MOE_FP8=0 \
  -e SGLANG_USE_AITER=1 \
  -e SGLANG_USE_ROCM700A=1 \
  -e SGLANG_HACK_FLASHMLA_BACKEND=triton \
  -e SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1 \
  -e SGLANG_TOPK_TRANSFORM_512_TORCH=0 \
  -e SGLANG_OPT_USE_FUSED_COMPRESS=true \
  -e SGLANG_OPT_USE_OLD_COMPRESSOR=false \
  -e SGLANG_OPT_USE_AITER_MHC_PRE=true \
  -e SGLANG_OPT_USE_AITER_MHC_POST=true \
  -e SGLANG_OPT_USE_FUSED_STORE_CACHE=true \
  -e SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false \
  -e SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true \
  -e SGLANG_OPT_FUSE_WQA_WKV=true \
  -e SGLANG_OPT_USE_TRITON_SWA_PREPARE=true \
  -e SGLANG_OPT_USE_FUSED_HASH_TOPK=true \
  -e SGLANG_OPT_DPSK_V4_RADIX=1 \
  -e SGLANG_OPT_USE_TILELANG_INDEXER=true \
  -e AITER_BF16_FP8_MOE_BOUND=1 \
  -e SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0 \
  rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4 \
  python3 -m sglang.launch_server \
    --model-path /model \
    --served-model-name deepseek-ai/DeepSeek-V4-Pro \
    --tp 8 \
    --port 30000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --attention-backend compressed \
    --max-running-requests 256 \
    --page-size 256 \
    --context-length 32768 \
    --mem-fraction-static 0.90 \
    --swa-full-tokens-ratio 0.15 \
    --chunked-prefill-size 8192 \
    --disable-shared-experts-fusion \
    --tool-call-parser deepseekv4 \
    --reasoning-parser deepseek-v4

Host-specific paths are redacted as <host-path>.

System: MI355X x8 | Engine: SGLang (ROCm 7.2 DSv4 dev) | Parallelism: TP=8 DP=1 EP=1 | Precision: FP4 | GPUs used: 8

Submitted by Artificial Analysis

docker run -d \
  --device=/dev/kfd --device=/dev/dri \
  --shm-size 32g --ipc=host --network=host --privileged \
  -v <host-path>:/model \
  -e SGLANG_DSV4_FP4_EXPERTS=True \
  -e SGLANG_FORCE_TRITON_MOE_FP8=0 \
  -e SGLANG_USE_AITER=1 \
  -e SGLANG_USE_ROCM700A=1 \
  -e SGLANG_HACK_FLASHMLA_BACKEND=triton \
  -e SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1 \
  -e SGLANG_TOPK_TRANSFORM_512_TORCH=0 \
  -e SGLANG_OPT_USE_FUSED_COMPRESS=true \
  -e SGLANG_OPT_USE_OLD_COMPRESSOR=false \
  -e SGLANG_OPT_USE_AITER_MHC_PRE=true \
  -e SGLANG_OPT_USE_AITER_MHC_POST=true \
  -e SGLANG_OPT_USE_FUSED_STORE_CACHE=true \
  -e SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false \
  -e SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true \
  -e SGLANG_OPT_FUSE_WQA_WKV=true \
  -e SGLANG_OPT_USE_TRITON_SWA_PREPARE=true \
  -e SGLANG_OPT_USE_FUSED_HASH_TOPK=true \
  -e SGLANG_OPT_DPSK_V4_RADIX=1 \
  -e SGLANG_OPT_USE_TILELANG_INDEXER=true \
  -e AITER_BF16_FP8_MOE_BOUND=1 \
  -e SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0 \
  rocm/sgl-dev:rocm720-mi35x-3b59630-20260526-DSv4 \
  python3 -m sglang.launch_server \
    --model-path /model \
    --served-model-name deepseek-ai/DeepSeek-V4-Pro \
    --tp 8 \
    --port 30000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --attention-backend compressed \
    --max-running-requests 256 \
    --page-size 256 \
    --chunked-prefill-size 8192 \
    --disable-shared-experts-fusion \
    --tool-call-parser deepseekv4 \
    --reasoning-parser deepseek-v4

Host-specific paths are redacted as <host-path>.

System: MI355X x8 | Engine: SGLang (ROCm 7.2 DSv4 dev) | Parallelism: TP=8 DP=1 EP=1 | Precision: FP4 | GPUs used: 8

Submitted by Artificial Analysis
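
The two MI355X entries list the same environment variables and differ in the container image tag and in a few server flags. The sketch below shows one way to surface the flag difference. The flag strings are copied from the two launch commands above, and `parse_flags` is an illustrative helper, not an SGLang utility.

```python
import shlex


def parse_flags(args: str) -> dict:
    """Parse '--flag value' and bare '--switch' tokens into a dict (illustrative)."""
    tokens = shlex.split(args)
    flags = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        # A flag carries a value when the next token is not itself a flag.
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            flags[name] = tokens[i + 1]
            i += 2
        else:
            flags[name] = True
            i += 1
    return flags


# Server flags shared by both MI355X launch commands.
COMMON = (
    "--model-path /model --served-model-name deepseek-ai/DeepSeek-V4-Pro "
    "--tp 8 --port 30000 --host 0.0.0.0 --trust-remote-code "
    "--attention-backend compressed --max-running-requests 256 --page-size 256 "
    "--chunked-prefill-size 8192 --disable-shared-experts-fusion "
    "--tool-call-parser deepseekv4 --reasoning-parser deepseek-v4"
)
first = parse_flags(
    COMMON + " --context-length 32768 --mem-fraction-static 0.90"
             " --swa-full-tokens-ratio 0.15"
)
second = parse_flags(COMMON)

only_in_first = {k: v for k, v in first.items() if k not in second}
print(only_in_first)
```

With these inputs, the first MI355X entry additionally sets `--context-length 32768`, `--mem-fraction-static 0.90`, and `--swa-full-tokens-ratio 0.15`. The second entry omits all three flags.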

AA-SLT

Coming soon. In the meantime, AA-SLT serving configurations are published in the System & Benchmark Specifications table.