June 4, 2026

NVIDIA Nemotron 3 Ultra released: fast, intelligent, and open

NVIDIA has just released Nemotron 3 Ultra, the new most intelligent US open weights model, with leading speed for its intelligence

Nemotron 3 Ultra scores 47.7 on the Artificial Analysis Intelligence Index, well ahead of the next strongest US open weights models, Gemma 4 31B (39.2), Nemotron 3 Super (36.0) and gpt-oss-120b (33.3), but behind the Chinese-led open weights frontier (Kimi K2.6 at 53.9).

We partnered with NVIDIA to evaluate this model for intelligence and speed ahead of its public release. These figures use the final NVFP4 weights that NVIDIA recommends for inference. Our tests show minimal intelligence impact relative to BF16: the higher-precision weights score 48.2 on the Artificial Analysis Intelligence Index vs. 47.7 for NVFP4.

Key Takeaways:

Nemotron 3 Ultra leads in speed for its intelligence: served through BlackBox AI ahead of release, Nemotron 3 Ultra runs at over 400 output tokens per second. This is slightly faster than the typical serving speed of gpt-oss-120b, even though Nemotron 3 Ultra is more than 4x larger, and it comes with significantly greater intelligence

Largest Nemotron 3 model so far: with approximately 550 billion total parameters and 55 billion active, Nemotron 3 Ultra is significantly larger than its siblings and is the largest and most intelligent US open weights model release ever

Nemotron 3 Ultra is by far the leading US open weights model on the Artificial Analysis Intelligence and Agentic Indexes, but Gemma 4 31B scores ~1 point higher on the Coding Index (composed of Terminal-Bench Hard and SciCode)

Nemotron 3 Ultra appears to perform effectively on agentic tasks at low latency. We tested it against peers under restricted turn budgets on Terminal-Bench v2.1 - Nemotron 3 Ultra completes tasks much faster than peers thanks to its high inference speed, while scoring competitively on the benchmark.

Methodology for performance vs. task latency analysis on Terminal-Bench v2.1

Why turn budgets. Faster inference can buy better results in less time, but it is equally possible to push per-task latency lower by reducing the effort a model spends on agentic tasks. There is a range of approaches to this today, including model ‘effort’ parameters, token limits at the request or task level, and turn limits. In this analysis, we cap how many turns an agent may spend on each task and watch how performance and wall-clock time move as the cap rises.

We tell the model its budget clearly. Silently truncating an agent at N turns penalizes models that did not know the limit was coming, and the size of that penalty depends on each model's natural preference for turn usage. To avoid this bias, we customized the Terminus 2 harness so that every run states the turn budget up front (with a precise definition of what counts as a "turn") and warns the model as it approaches the limit. This mirrors the turn-aware prompting we use for GDPval-AA.
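As a rough illustration of turn-aware prompting, the sketch below produces an up-front budget statement and a warning near the limit. It is not the customized Terminus 2 harness: the function name, the message wording, and the `warn_remaining` threshold are all hypothetical, and the definition of a "turn" is passed in as a placeholder rather than guessed.

```python
from typing import Optional


def turn_budget_notice(
    turn_budget: int,
    turns_used: int,
    turn_definition: str = "<definition of a turn goes here>",  # placeholder
    warn_remaining: int = 3,  # hypothetical warning threshold
) -> Optional[str]:
    """Return a message to show the agent before its next turn, or None."""
    remaining = turn_budget - turns_used
    if turns_used == 0:
        # State the budget up front, before the agent takes any turn.
        return (
            f"You have a budget of {turn_budget} turns for this task. "
            f"A turn is defined as: {turn_definition}"
        )
    if remaining <= 0:
        return "Your turn budget is exhausted."
    if remaining <= warn_remaining:
        # Warn as the agent approaches the limit.
        return f"Warning: {remaining} turn(s) remaining in your budget."
    return None


print(turn_budget_notice(10, 0))
print(turn_budget_notice(10, 8))
```

A real harness would inject these messages into the agent's conversation; here they are only printed.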

The sweep. Each model runs Terminal-Bench v2.1 at four budgets - 10, 20, 50 and 100 turns - and we trace the resulting curve of score vs. time.

How to read the chart. Each model is drawn as a single line connecting its four turn budgets (10 → 100), so the entire performance-versus-latency tradeoff is visible at a glance. The vertical axis is performance, represented as the proportion of tasks fully solved (passing the task's tests). The horizontal axis is an estimated wall-clock cost per task: counted output tokens divided by the model's output speed, plus the time actually spent executing tools in the sandbox. Output speed is each model's real speed - the same first-party-or-median figure shown across Artificial Analysis. For Nemotron 3 Ultra, we measured speeds from a pre-release deployment on BlackBox AI. Giving a model more turns generally improves its score but makes it slower, so the shape and position of each curve are the story: a line that hugs the top left is delivering more ‘performance per time’.
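The horizontal-axis estimate described above can be sketched as a short function. This is a minimal illustration of the stated formula only; the function and parameter names are assumptions, and the numbers in the example call are invented, not measured results.

```python
def estimated_seconds_per_task(
    output_tokens: float,
    output_tokens_per_second: float,
    tool_seconds: float,
) -> float:
    """Estimated wall-clock cost per task.

    Generation time (output tokens / output speed) plus the time
    spent executing tools in the sandbox.
    """
    if output_tokens_per_second <= 0:
        raise ValueError("output speed must be positive")
    return output_tokens / output_tokens_per_second + tool_seconds


# Invented example values: 20,000 output tokens at 400 tokens/s,
# plus 30 s of tool execution.
print(estimated_seconds_per_task(20_000, 400.0, 30.0))  # 80.0
```

Under this formula, a faster model shifts every point on its curve to the left by shrinking the generation term, while the tool-execution term is unaffected by output speed.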

Nemotron 3 Ultra is the fastest across all turn limits and sits on the Pareto frontier for performance versus time per task for this evaluation.
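For readers unfamiliar with the term, a point is on the Pareto frontier here if no other point is at least as fast and at least as accurate while being strictly better on one of the two. The sketch below shows one simple way to compute that; the labels and values are invented for illustration and are not results from this evaluation.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (label, seconds_per_task, score)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points not dominated on (lower time, higher score)."""
    frontier = []
    for label, t, s in points:
        dominated = any(
            t2 <= t and s2 >= s and (t2 < t or s2 > s)
            for _, t2, s2 in points
        )
        if not dominated:
            frontier.append((label, t, s))
    # Sort by time so the frontier reads left to right on the chart.
    return sorted(frontier, key=lambda p: p[1])


# Invented example points, not measured data.
example = [
    ("a", 10.0, 0.2),
    ("b", 20.0, 0.5),
    ("c", 25.0, 0.4),  # dominated by "b": slower and lower scoring
    ("d", 40.0, 0.6),
]
print([label for label, _, _ in pareto_frontier(example)])  # ['a', 'b', 'd']
```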

Nemotron 3 Ultra is optimized for inference speed, similar to its smaller siblings. It sits on the Pareto frontier for speed versus intelligence, with speeds of over 400 tokens per second measured as deployed on BlackBox AI ahead of release.

Nemotron 3 Ultra uses roughly 1M fewer output tokens than NVIDIA Nemotron 3 Super to run the Artificial Analysis Intelligence Index, even as it increases in size and intelligence. This places Nemotron 3 Ultra near the middle of its peers for verbosity.

Nemotron 3 Ultra scored well on AA-Omniscience Non-Hallucination at 71%, reflecting a low propensity to answer factual questions when it doesn’t know the answer. It also improves significantly on GDPval-AA compared to Nemotron 3 Super, with an Elo of 1378, approximately in line with DeepSeek V4 Flash. Performance on CritPt - a benchmark for graduate-level physics research problems - is relatively low, though, reaching the same 3% as Nemotron 3 Super. Full evaluation suite breakdowns below: