March 11, 2026
NVIDIA Nemotron 3 Super: The new leader in open, efficient intelligence
NVIDIA has released Nemotron 3 Super, a 120B (12B active) open weights reasoning model using a hybrid Mamba-Transformer MoE architecture.
We were given access to this model ahead of launch, and we evaluated it across intelligence, openness, and inference efficiency.
Key takeaways
- Nemotron 3 Super combines highly open methodology and training data with strong intelligence for its size, and is substantially more intelligent than any other model with comparable openness
- NVIDIA focused on efficient intelligence with this release, and in our testing we found Nemotron 3 Super to have higher intelligence than gpt-oss-120b while enabling ~10% higher throughput per GPU
- To test this model’s efficiency goals, we developed a simple but realistic load test and ran it across peer models. The test used workloads of 50k input tokens and 2k output tokens to approximate document processing or code analysis workflows. We think this is the best way to test model-specific inference characteristics, and we plan to extend it to additional models in the future
Model details
- This is the second model in the Nemotron 3 family, sitting at medium scale between Nano (30B total, 3B active) and the forthcoming Ultra (~500B total, ~50B active)
- Nemotron 3 Super has 120.6B total and 12.7B active parameters, along with a 1 million token context window and hybrid reasoning support
- The model has several design features enabling efficient inference, including using hybrid Mamba-Transformer and LatentMoE architectures, multi-token prediction, and NVFP4 quantized weights
- NVIDIA pre-trained Nemotron 3 Super in (mostly) NVFP4 precision, but moved to BF16 for post-training. Our evaluation scores use the BF16 weights.
- Open weights under a permissive license (similar to Nemotron 3 Nano), with open training data and disclosed training methodology
Performance and benchmarks
Nemotron 3 Super scored 36 on the Artificial Analysis Intelligence Index, +17 points ahead of the previous Super release and +12 points ahead of Nemotron 3 Nano. Compared to models in a similar size category, this places it ahead of gpt-oss-120b (33), but behind the recently-released Qwen3.5 122B A10B (42).
We benchmarked Nemotron 3 Super in its highest-effort reasoning mode ("regular"), the most capable of the model's three inference modes (reasoning-off, low-effort, and regular).
Artificial Analysis Intelligence Index
Similar to previous releases, NVIDIA released significant pre- and post-training data alongside comprehensive new training recipes for this model. These disclosures score 83 on the Artificial Analysis Openness Index, behind only highly open models from Ai2 and MBZUAI, and place Nemotron 3 Super in the most attractive quadrant for Openness and Intelligence among peers.
Nemotron 3 Super is by far the most intelligent model ever released with this level of openness.
Artificial Analysis Openness Index vs. Artificial Analysis Intelligence Index
Nemotron 3 Super used a relatively high number of tokens across our evaluations: 110M output tokens to run the Artificial Analysis Intelligence Index, around 40% more than gpt-oss-120b with high reasoning effort, but a ~20% reduction compared to Nemotron 3 Nano.
That’s significantly fewer tokens than Anthropic’s Claude Opus 4.6 (max), which used 160M tokens, and slightly fewer than OpenAI’s GPT-5.4 (xhigh), which used 120M tokens.
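The relative figures above imply approximate totals for the comparison models. A quick back-of-the-envelope check (derived only from the percentages stated above, not from separately measured counts):

```python
# Sanity check of the relative token usage figures: Nemotron 3 Super used
# ~110M output tokens across the Intelligence Index evaluations.
nemotron_super = 110e6

# "~40% more than gpt-oss-120b (high)" implies gpt-oss-120b used roughly:
gpt_oss_120b = nemotron_super / 1.40   # ~78.6M output tokens

# "a ~20% reduction compared to Nemotron 3 Nano" implies Nano used roughly:
nemotron_nano = nemotron_super / 0.80  # ~137.5M output tokens

for name, tokens in [("gpt-oss-120b (high)", gpt_oss_120b),
                     ("Nemotron 3 Nano", nemotron_nano)]:
    print(f"{name}: ~{tokens / 1e6:.1f}M output tokens")
```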
Output Tokens Used to Run Artificial Analysis Intelligence Index
Nemotron 3 Super shows well-rounded performance, with consistently strong results across evaluations. It is a significant step up on agentic tasks compared to its predecessor or Nemotron 3 Nano:
- On Terminal-Bench Hard, Nemotron 3 Super scores 29%
- On GDPval-AA, our leading benchmark of agentic performance on real-world work tasks, it reaches an Elo of 1027
Intelligence Index Evaluations
At 120B total with 12B active parameters, Nemotron 3 Super is still relatively small compared to other recent open weights model releases from top global labs — GLM-5 (744B total, 40B active), Qwen3.5 397B A17B (397B total, 17B active), and Kimi K2.5 (1T total, 32B active) are each 3x to 8x larger.
Intelligence vs. Total Parameters
Model efficiency
NVIDIA is focused on efficient intelligence for the Nemotron family, and we tested inference performance against peer models to see the impact of the architecture choices.
We ran self-hosted throughput tests across a range of peer models using a simple methodology with workloads representative of common use cases such as agentic workflows with moderate history, RAG applications, or document processing.
Our testing approach used the following setup:
- Hardware: 8x NVIDIA B200-SXM
- Serving: vLLM-nightly (0.17.0)
- Workload: 50K input / 2K output, aligned to our standardized inference provider benchmarking workload generation approach
- Fixed batch size of 8 users per GPU
- Across models, we used reasonable baseline configurations for inference with no substantial per-model optimizations - we’re aiming for like-for-like comparisons, not roofline performance:
- No multi-token prediction or other speculative decoding
- Reasonable parallelism strategies without using extensive sweeps
- A range of efficiency-focused peer models were tested, with model weights at sensible production-ready precision
- Full model configurations are detailed in the footnote below¹
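The methodology above can be sketched as a simple closed-loop client: a fixed number of concurrent "users" per GPU, each sending 50K-in / 2K-out requests, with throughput normalized per GPU. This is an illustrative sketch rather than our exact harness: the endpoint URL, model id, request count, and the crude prompt-length heuristic are all assumptions, and it targets any OpenAI-compatible completions server.

```python
# Sketch of a fixed-concurrency load test: USERS_PER_GPU * NUM_GPUS users,
# each looping over 50K-token-in / 2K-token-out requests. Endpoint, model id,
# and the prompt-length heuristic are illustrative assumptions.
import asyncio
import json
import time
import urllib.request

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local vLLM server
MODEL = "nvidia/nemotron-3-super"                  # hypothetical model id
INPUT_TOKENS, OUTPUT_TOKENS = 50_000, 2_000
USERS_PER_GPU, NUM_GPUS = 8, 8

def make_prompt(n_tokens: int) -> str:
    # Crude approximation: repeat a short word until the prompt is
    # roughly n_tokens long (one token per repetition).
    return "analyze " * n_tokens

def throughput_per_gpu(total_output_tokens: int, wall_seconds: float,
                       num_gpus: int) -> float:
    """Output tokens generated per second, normalized per GPU."""
    return total_output_tokens / wall_seconds / num_gpus

def one_request() -> int:
    # Blocking call to an OpenAI-compatible completions endpoint; returns
    # the number of completion tokens the server reports generating.
    body = json.dumps({"model": MODEL, "prompt": make_prompt(INPUT_TOKENS),
                       "max_tokens": OUTPUT_TOKENS}).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

async def user_loop(requests_per_user: int) -> int:
    # Closed loop: each user issues its next request only after the
    # previous one completes, keeping concurrency fixed.
    done = 0
    for _ in range(requests_per_user):
        done += await asyncio.to_thread(one_request)
    return done

async def main() -> None:
    start = time.monotonic()
    totals = await asyncio.gather(
        *[user_loop(4) for _ in range(USERS_PER_GPU * NUM_GPUS)])
    elapsed = time.monotonic() - start
    print(f"{throughput_per_gpu(sum(totals), elapsed, NUM_GPUS):.1f} tok/s/GPU")

if __name__ == "__main__":
    asyncio.run(main())
```

The closed-loop design (each user waits for its response before sending the next request) is what keeps the batch size fixed at 8 users per GPU, as opposed to an open-loop test that fires requests at a fixed rate regardless of server backlog.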
In this test, Nemotron 3 Super (NVFP4) shows 11% higher throughput per NVIDIA B200 GPU than gpt-oss-120b (MXFP4), placing Nemotron 3 Super ‘up and to the right’ relative to gpt-oss-120b. Qwen3.5 122B A10B achieves +6 points on the Intelligence Index compared to Nemotron 3 Super, but at 40% lower throughput per GPU.
Our Intelligence Index scores for Nemotron 3 Super were evaluated on the BF16 weights. We have not yet assessed whether there is any intelligence impact of NVFP4 quantization, but NVIDIA’s internal testing found that the NVFP4 model achieved 99.8% median accuracy relative to the BF16 baseline.
Intelligence vs. Efficiency
Model Efficiency Comparison
Provider support
Nemotron 3 Super is available from launch on serverless APIs from providers including Lightning AI and DeepInfra.
We tested these endpoints and measured output speeds of up to 484 tokens per second on our standard 10k-token input workloads. At launch, Nemotron 3 Super sits in the most attractive quadrant for intelligence and output speed among comparable peers.
Intelligence vs Output Speed
Compare Nemotron 3 Super with other models: https://artificialanalysis.ai/models/nvidia-nemotron-3-super-120b-a12b
Footnotes
1. Model configurations:

| Model | Weights | Tensor Parallel Size | Max Model Length | GPU Memory Utilization | KV Cache dtype | Max Batched Tokens | Max CUDAGraph Capture Size | Notes |
|---|---|---|---|---|---|---|---|---|
| Qwen3 Next 80B A3B Thinking | BF16 | 2 | 65,536 | 90% | auto | 32,768 | 2,048 | Attention backend FLASH_ATTN; FlashInfer MoE backend "latency" |
| Qwen3.5 122B A10B | BF16 | 2 | 65,536 | 90% | auto | 32,768 | 2,048 | Attention backend FLASH_ATTN; FlashInfer MoE backend "latency" |
| Qwen3 235B A22B 2507 | BF16 | 4 | 65,536 | 95% | fp8 | 32,768 | 2,048 | |
| Minimax M2.1 | FP8 | 4 | 65,536 | 90% | fp8 | 32,768 | 2,048 | CUDAGraph mode PIECEWISE |
| GLM 4.5 Air | BF16 | 2 | 65,536 | 95% | fp8 | 32,768 | 2,048 | |
| gpt-oss-120b | MXFP4 | 2 | 65,536 | 90% | fp8 | 8,192 | 2,048 | |
| Nemotron 3 Super | NVFP4 | 2 | 65,536 | 90% | fp8 | 8,192 | 2,048 | |

All runs: prefix caching disabled, async scheduling enabled, compiler passes fuse_allreduce_rms, eliminate_noops.
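As an illustration of how a configuration like these maps onto vLLM's CLI, a launch command for the gpt-oss-120b row might look roughly as follows. Flag spellings follow recent vLLM releases, but options for CUDA graph capture sizes and compiler passes vary by version, so treat this as a hedged sketch rather than our exact invocation:

```shell
# Approximate vLLM launch for the gpt-oss-120b configuration (2x B200).
# Check `vllm serve --help` on the nightly (0.17.0) build for the exact
# CUDA-graph and compilation-pass options, which differ across versions.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --no-enable-prefix-caching \
  --async-scheduling
```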