March 11, 2026
NVIDIA Nemotron 3 Super: The new leader in open, efficient intelligence
NVIDIA has released Nemotron 3 Super, a 120B (12B active) open weights reasoning model using a hybrid Mamba-Transformer MoE architecture.
We were given access to this model ahead of launch, and we evaluated it across intelligence, openness, and inference efficiency.
Key takeaways
- Nemotron 3 Super combines highly open methodology and training data with strong intelligence for its size, and is substantially more intelligent than any other model with comparable openness
- NVIDIA focused on efficient intelligence with this release, and in our testing we found Nemotron 3 Super to have higher intelligence than gpt-oss-120b while enabling ~10% higher throughput per GPU
- To test this model’s efficiency goals, we developed a simple but realistic load test and ran it across peer models. The test used workloads of 50k input tokens and 2k output tokens to approximate document processing or code analysis workflows. We think this is the best way to test model-specific inference characteristics, and we plan to extend it to additional models in the future
Model details
- This is the second model in the Nemotron 3 family, sitting at medium scale between Nano (30B total, 3B active) and the forthcoming Ultra (~500B total, ~50B active)
- Nemotron 3 Super has 120.6B total and 12.7B active parameters, along with a 1 million token context window and hybrid reasoning support
- The model has several design features enabling efficient inference, including using hybrid Mamba-Transformer and LatentMoE architectures, multi-token prediction, and NVFP4 quantized weights
- NVIDIA pre-trained Nemotron 3 Super in (mostly) NVFP4 precision, but moved to BF16 for post-training. Our evaluation scores use the BF16 weights.
- Open weights under a permissive license (similar to Nemotron 3 Nano), with open training data and disclosed training methodology
Performance and benchmarks
Nemotron 3 Super scored 36 on the Artificial Analysis Intelligence Index, +17 points ahead of the previous Super release and +12 points ahead of Nemotron 3 Nano. Compared to models in a similar size category, this places it ahead of gpt-oss-120b (33), but behind the recently-released Qwen3.5 122B A10B (42).
We benchmarked Nemotron 3 Super in its highest-effort reasoning mode ("regular"), the most capable of the model's three inference modes (reasoning-off, low-effort, and regular).
Artificial Analysis Intelligence Index
Similar to previous releases, NVIDIA released significant pre- and post-training data alongside comprehensive new training recipes for this model. These disclosures score 83 on the Artificial Analysis Openness Index, behind only highly open models from Ai2 and MBZUAI, and place Nemotron 3 Super in the most attractive quadrant for Openness and Intelligence among peers.
Nemotron 3 Super is by far the most intelligent model ever released with this level of openness.
Artificial Analysis Openness Index vs. Artificial Analysis Intelligence Index
Nemotron 3 Super used a relatively high number of tokens across our evaluations: 110M output tokens to run the Artificial Analysis Intelligence Index, around 40% more than gpt-oss-120b with high reasoning effort, but a ~20% reduction compared to Nemotron 3 Nano.
That’s significantly fewer tokens than Anthropic’s Claude Opus 4.6 (max), which used 160M tokens, and slightly fewer than OpenAI’s GPT-5.4 (xhigh), which used 120M tokens.
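The relative figures above imply approximate totals for the comparison models. A quick back-of-the-envelope check (derived only from the percentages stated above, not from separately measured counts):

```python
# Sanity check of the relative token usage figures: Nemotron 3 Super used
# ~110M output tokens across the Intelligence Index evaluations.
nemotron_super = 110e6

# "~40% more than gpt-oss-120b (high)" implies gpt-oss-120b used roughly:
gpt_oss_120b = nemotron_super / 1.40   # ~78.6M output tokens

# "a ~20% reduction compared to Nemotron 3 Nano" implies Nano used roughly:
nemotron_nano = nemotron_super / 0.80  # ~137.5M output tokens

for name, tokens in [("gpt-oss-120b (high)", gpt_oss_120b),
                     ("Nemotron 3 Nano", nemotron_nano)]:
    print(f"{name}: ~{tokens / 1e6:.1f}M output tokens")
```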
Output Tokens Used to Run Artificial Analysis Intelligence Index
Nemotron 3 Super shows well-rounded performance, with consistently strong results across evaluations. It is a significant step up on agentic tasks compared to its predecessor or Nemotron 3 Nano:
- On Terminal-Bench Hard, Nemotron 3 Super scores 29%
- On GDPval-AA, our leading benchmark of agentic performance on real-world work tasks, it reaches an Elo of 1027
Intelligence Index Evaluations
At 120B total with 12B active parameters, Nemotron 3 Super is still relatively small compared to other recent open weights model releases from top global labs — GLM-5 (744B total, 40B active), Qwen3.5 397B A17B (397B total, 17B active), and Kimi K2.5 (1T total, 32B active) are each 3x to 8x larger.
Intelligence vs. Total Parameters
Model efficiency
NVIDIA is focused on efficient intelligence for the Nemotron family, and we tested inference performance against peer models to see the impact of the architecture choices.
We ran self-hosted throughput tests across a range of peer models using a simple methodology with workloads representative of common use cases such as agentic workflows with moderate history, RAG applications, or document processing.
Our testing approach used the following setup:
- Hardware: 8x NVIDIA B200-SXM
- Serving: vLLM-nightly (0.17.0)
- Workload: 50K input / 2K output, aligned to our standardized inference provider benchmarking workload generation approach
- Fixed batch size of 8 users per GPU
- Across models, we used reasonable baseline configurations for inference with no substantial per-model optimizations - we’re aiming for like-for-like comparisons, not roofline performance:
- No multi-token prediction or other speculative decoding
- Reasonable parallelism strategies without using extensive sweeps
- A range of efficiency-focused peer models were tested, with model weights at sensible production-ready precision
- Full model configurations are detailed in the footnote below¹
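The methodology above can be sketched as a simple closed-loop client: a fixed number of concurrent "users" per GPU, each sending 50K-in / 2K-out requests, with throughput normalized per GPU. This is an illustrative sketch rather than our exact harness: the endpoint URL, model id, request count, and the crude prompt-length heuristic are all assumptions, and it targets any OpenAI-compatible completions server.

```python
# Sketch of a fixed-concurrency load test: USERS_PER_GPU * NUM_GPUS users,
# each looping over 50K-token-in / 2K-token-out requests. Endpoint, model id,
# and the prompt-length heuristic are illustrative assumptions.
import asyncio
import json
import time
import urllib.request

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local vLLM server
MODEL = "nvidia/nemotron-3-super"                  # hypothetical model id
INPUT_TOKENS, OUTPUT_TOKENS = 50_000, 2_000
USERS_PER_GPU, NUM_GPUS = 8, 8

def make_prompt(n_tokens: int) -> str:
    # Crude approximation: repeat a short word until the prompt is
    # roughly n_tokens long (one token per repetition).
    return "analyze " * n_tokens

def throughput_per_gpu(total_output_tokens: int, wall_seconds: float,
                       num_gpus: int) -> float:
    """Output tokens generated per second, normalized per GPU."""
    return total_output_tokens / wall_seconds / num_gpus

def one_request() -> int:
    # Blocking call to an OpenAI-compatible completions endpoint; returns
    # the number of completion tokens the server reports generating.
    body = json.dumps({"model": MODEL, "prompt": make_prompt(INPUT_TOKENS),
                       "max_tokens": OUTPUT_TOKENS}).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

async def user_loop(requests_per_user: int) -> int:
    # Closed loop: each user issues its next request only after the
    # previous one completes, keeping concurrency fixed.
    done = 0
    for _ in range(requests_per_user):
        done += await asyncio.to_thread(one_request)
    return done

async def main() -> None:
    start = time.monotonic()
    totals = await asyncio.gather(
        *[user_loop(4) for _ in range(USERS_PER_GPU * NUM_GPUS)])
    elapsed = time.monotonic() - start
    print(f"{throughput_per_gpu(sum(totals), elapsed, NUM_GPUS):.1f} tok/s/GPU")

if __name__ == "__main__":
    asyncio.run(main())
```

The closed-loop design (each user waits for its response before sending the next request) is what keeps the batch size fixed at 8 users per GPU, as opposed to an open-loop test that fires requests at a fixed rate regardless of server backlog.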
In this test, Nemotron 3 Super (NVFP4) shows 11% higher throughput per NVIDIA B200 GPU than gpt-oss-120b (MXFP4), placing Nemotron 3 Super ‘up and to the right’ relative to gpt-oss-120b. Qwen3.5 122B A10B achieves +6 points on the Intelligence Index compared to Nemotron 3 Super, but at 40% lower throughput per GPU.
Our Intelligence Index scores for Nemotron 3 Super were evaluated on the BF16 weights. We have not yet assessed whether there is any intelligence impact of NVFP4 quantization, but NVIDIA’s internal testing found that the NVFP4 model achieved 99.8% median accuracy relative to the BF16 baseline.
Intelligence vs. Efficiency
Model Efficiency Comparison
Provider support
Nemotron 3 Super is available from launch on serverless APIs from providers including Lightning AI and DeepInfra.
We tested these endpoints and measured output speeds of up to 484 tokens per second on our standard 10k-token input workloads. At launch, Nemotron 3 Super sits in the most attractive quadrant for intelligence and output speed among comparable peers.
Intelligence vs Output Speed
Compare Nemotron 3 Super with other models: https://artificialanalysis.ai/models/nvidia-nemotron-3-super-120b-a12b
Footnotes
1. Model configurations:

| Model | Weights | Tensor Parallel Size | Max Model Length | GPU Memory Utilization | KV Cache dtype | Max Batched Tokens | Max CUDAGraph Capture Size | Notes |
|---|---|---|---|---|---|---|---|---|
| Qwen3 Next 80B A3B Thinking | BF16 | 2 | 65,536 | 90% | auto | 32,768 | 2,048 | Attention backend FLASH_ATTN; FlashInfer MoE backend "latency" |
| Qwen3.5 122B A10B | BF16 | 2 | 65,536 | 90% | auto | 32,768 | 2,048 | Attention backend FLASH_ATTN; FlashInfer MoE backend "latency" |
| Qwen3 235B A22B 2507 | BF16 | 4 | 65,536 | 95% | fp8 | 32,768 | 2,048 | |
| Minimax M2.1 | FP8 | 4 | 65,536 | 90% | fp8 | 32,768 | 2,048 | CUDAGraph mode PIECEWISE |
| GLM 4.5 Air | BF16 | 2 | 65,536 | 95% | fp8 | 32,768 | 2,048 | |
| gpt-oss-120b | MXFP4 | 2 | 65,536 | 90% | fp8 | 8,192 | 2,048 | |
| Nemotron 3 Super | NVFP4 | 2 | 65,536 | 90% | fp8 | 8,192 | 2,048 | |

All runs: prefix caching disabled, async scheduling enabled, compiler passes fuse_allreduce_rms, eliminate_noops.
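As an illustration of how a configuration like these maps onto vLLM's CLI, a launch command for the gpt-oss-120b row might look roughly as follows. Flag spellings follow recent vLLM releases, but options for CUDA graph capture sizes and compiler passes vary by version, so treat this as a hedged sketch rather than our exact invocation:

```shell
# Approximate vLLM launch for the gpt-oss-120b configuration (2x B200).
# Check `vllm serve --help` on the nightly (0.17.0) build for the exact
# CUDA-graph and compilation-pass options, which differ across versions.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --no-enable-prefix-caching \
  --async-scheduling
```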