AI & LLMs · June 27, 2026 · 13 min read

Nemotron 3 Ultra: NVIDIA's 550B Open Reasoning Model Guide

NVIDIA released Nemotron 3 Ultra on June 4, 2026: a 550B-parameter Mixture-of-Experts model with 55B active per token, a hybrid Mamba-Transformer architecture, native multi-token prediction, and a 1M-token context window, all under the permissive OpenMDW-1.1 license. This guide covers what shipped, how the LatentMoE architecture works, where it lands on benchmarks, the Nano/Super/Ultra family, how to access it, and what it costs.

Lushbinary Team

AI & Cloud Solutions

NVIDIA spent the past few years known mainly as the company that sells the GPUs everyone else trains models on. With Nemotron 3 Ultra, released on June 4, 2026, it staked a bolder claim: the most capable open-weight language model to come out of a US lab. It is a 550-billion-parameter Mixture-of-Experts model that activates only about 55 billion parameters per token, ships with open weights, training data, and recipes, and was built from the ground up for the kind of long-running agentic work that breaks smaller models.

The interesting part is not just the size. Nemotron 3 Ultra uses a hybrid architecture that blends Mamba-2 state-space layers, attention, and MoE routing, plus native multi-token prediction and a 1-million-token context window. That combination is aimed squarely at throughput: agents that run for hundreds of steps need to think fast, not just think well. This guide breaks down what shipped, how the architecture works, where it lands on benchmarks, how to access it, and what it actually costs.

If you want to run the model on your own hardware, read the companion self-hosting and deployment guide. To wire it into a multi-step agent, see building long-running agents with Nemotron 3 Ultra.

Figures and sourcing

Specifications below come from NVIDIA's official model card and Hugging Face listings. Benchmark numbers are from Artificial Analysis at launch, and pricing reflects third-party hosted providers as of June 2026. Independent benchmark scores and provider prices move quickly, so verify against the source before you commit a production budget.

1. What Shipped in Nemotron 3 Ultra

NVIDIA announced Nemotron 3 Ultra during its GTC Taipei presence at Computex in early June 2026 and released the weights on June 4. The headline specs are easy to list and harder to appreciate until you see them work together:

  • 550B total parameters, ~55B active per token. This is a sparse Mixture-of-Experts design: roughly 10 percent of the model fires on any given token.
  • Hybrid LatentMoE architecture. Interleaved Mamba-2 state-space layers, attention layers, and MoE layers, plus native Multi-Token Prediction.
  • Up to 1M-token context. Long enough to hold an entire codebase, a research corpus, or a multi-day agent transcript.
  • Toggleable reasoning. A chat-template flag (enable_thinking) turns extended reasoning on or off per request.
  • Open everything. Weights, training data, and recipes under the permissive OpenMDW-1.1 license.
  • NVFP4 and BF16 checkpoints. The NVFP4 4-bit build is the efficient deployment path; NVIDIA reports up to 5x higher throughput from it.
  • 12 supported languages, including English, Spanish, French, German, Italian, Japanese, Korean, Hindi, Brazilian Portuguese, and Chinese.

NVIDIA positions the model explicitly for orchestration: the brain that makes architectural decisions in a long coding session, synthesizes across hundreds of research sources, or verifies thousands of interdependent constraints. The pre-training data has a cutoff of September 2025 and the post-training data a cutoff of May 2026.

2. The Hybrid Mamba-Transformer MoE Architecture

The architecture is where Nemotron 3 Ultra earns its throughput. A conventional transformer pays a price that grows with sequence length: the attention mechanism and its KV cache scale with how much context you hold. At a million tokens, that price is brutal. NVIDIA's answer is a hybrid stack it calls LatentMoE.

Three layer types are interleaved through the network:

  • Mamba-2 state-space layers carry most of the sequence-processing load. They use a fixed-size recurrent state instead of a KV cache that grows with context, so long sequences stay cheap in both memory and compute.
  • Attention layers are kept where they matter most, for precise long-range lookups, but used sparingly rather than at every layer.
  • MoE layers route each token to a small subset of experts, so the model can hold 550B parameters of knowledge while only activating about 55B per token.
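
To make the MoE bullet concrete, here is a minimal sketch of generic top-k expert routing. It is not NVIDIA's LatentMoE router: the expert count, the value of k, and the router scores are all made up for illustration, and real routers operate on tensors inside the network rather than Python lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]

# Hypothetical router scores for one token over 8 toy experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
chosen = route_top_k(logits, k=2)

for expert, weight in chosen:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for this token, which is why compute per token tracks the active parameter count rather than the total.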

On top of that, Nemotron 3 Ultra uses native Multi-Token Prediction (MTP). During pre-training the model learns to predict more than one future token at a time using additional prediction heads, which it can then exploit at inference for speculative-style speedups in multi-turn generation. Combined with NVFP4 quantization, NVIDIA reports inference throughput gains of 5.9x, 4.8x, and 1.6x over GLM-5.1-754B-A40B, Kimi-K2.6-1T-A32B, and Qwen-3.5-397B-A17B respectively, measured with an 8k-token input and a 64k-token output. The practical reading: for output-heavy reasoning, Ultra finishes more steps in the same time budget.
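
The "speculative-style" idea can be sketched as a generic draft-and-verify step. This is an illustration of the general technique, not NVIDIA's implementation: `accept_draft` and `toy_next_token` are hypothetical names, the verifier is a toy rule, and a real system checks all drafted tokens in one batched forward pass rather than one at a time.

```python
def accept_draft(draft_tokens, verify_next_token, context):
    """Accept the longest draft prefix the verifier agrees with.

    draft_tokens: tokens proposed in one shot (for example by extra prediction heads).
    verify_next_token: callable(context) -> the token the main model would emit next.
    Returns the tokens to append. For a non-empty draft this is at least one token,
    so decoding always makes progress.
    """
    accepted = []
    for proposed in draft_tokens:
        expected = verify_next_token(context + accepted)
        if proposed != expected:
            accepted.append(expected)  # keep the verifier's token at the first mismatch
            return accepted
        accepted.append(proposed)
    return accepted

def toy_next_token(context):
    """Hypothetical stand-in for the main model: count upward modulo 10."""
    return (context[-1] + 1) % 10

context = [1, 2, 3]
good_draft = accept_draft([4, 5, 6, 7], toy_next_token, context)
partial_draft = accept_draft([4, 5, 9, 7], toy_next_token, context)

print(good_draft)
print(partial_draft)
```

When the draft is right, several tokens are committed per verification step; when it is wrong, the step still yields one correct token, so output quality is unchanged and only speed varies.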

Why the active count matters for cost, not memory

A common misconception is that 55B active means you only need to hold 55B parameters in memory. You do not. Every expert can be selected for some token, so all 550B parameters must be resident in GPU memory. The 55B active count reduces the math per token, which is what drives throughput and per-token cost down, not the memory footprint.
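
A rough back-of-the-envelope calculation shows the scale involved. This sketch counts weights only and assumes 16 bits per parameter for BF16 and 4 bits for NVFP4; it ignores quantization scale factors, KV cache, activations, and runtime overhead, so real deployments need more than these figures.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 550e9   # headline total parameter count
ACTIVE_PARAMS = 55e9   # parameters used per token

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
nvfp4_gb = weight_memory_gb(TOTAL_PARAMS, 4)   # ignores scale-factor overhead
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"BF16 weights:  about {bf16_gb:.0f} GB")
print(f"NVFP4 weights: about {nvfp4_gb:.0f} GB")
print(f"Share of parameters doing math per token: {active_fraction:.0%}")
```

The memory figures depend on the total count, while the per-token compute depends on the active count, which is the distinction this section draws.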

3. The 1M-Token Context Window

A million-token context is not a marketing number here; it is the direct payoff of the Mamba-heavy design. Because the state-space layers do not maintain a KV cache that grows linearly with sequence length, holding very long inputs is far cheaper than it would be on a pure-attention model of similar quality. NVIDIA reports that Ultra outperforms other state-of-the-art open models on the RULER benchmark at the full 1M-token length, which is the test that actually stresses whether a model can use distant context rather than just accept it.
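
The scaling difference is easy to see with the standard KV cache size formula. Every dimension below (layer counts, head counts, state size) is hypothetical and chosen only to make the trend visible; none of them are Nemotron 3 Ultra's actual configuration.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache for one sequence: a key and a value vector per token, head, and layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def fixed_state_bytes(n_ssm_layers, state_values_per_layer, bytes_per_value=2):
    """Recurrent state size: independent of how many tokens have been processed."""
    return n_ssm_layers * state_values_per_layer * bytes_per_value

# Hypothetical dimensions for illustration only.
ATTN = dict(n_attn_layers=8, n_kv_heads=8, head_dim=128)
SSM = dict(n_ssm_layers=56, state_values_per_layer=1_000_000)

for seq_len in (8_000, 128_000, 1_000_000):
    kv_gb = kv_cache_bytes(seq_len, **ATTN) / 1e9
    state_gb = fixed_state_bytes(**SSM) / 1e9
    print(f"{seq_len:>9} tokens: KV cache {kv_gb:8.2f} GB, recurrent state {state_gb:.2f} GB")
```

The KV cache term grows linearly with sequence length and with the number of attention layers, so using attention sparingly and leaning on fixed-state layers keeps the per-sequence memory cost of long contexts down.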

For builders, a million tokens changes which problems are tractable in a single call: an entire mid-size codebase for a refactor, a full contract set for cross-document verification, or the accumulated transcript of an agent that has been running for hours. Hosted providers sometimes expose a smaller cap (some serve 256K), so confirm the context limit on whichever endpoint you use rather than assuming the full 1M.
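
A simple pre-flight check can catch a request that will not fit before you pay for it. This sketch assumes the prompt and the reserved output budget share one window, which is typical but worth confirming per provider; `fits_context` is a hypothetical helper, and 256,000 is only an example reading of a "256K" cap.

```python
def fits_context(prompt_tokens, max_output_tokens, context_limit):
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_limit

FULL_CONTEXT = 1_000_000   # the model's advertised maximum
HOSTED_CAP = 256_000       # example smaller cap; check your endpoint's real limit

prompt_tokens = 400_000
max_output_tokens = 32_000

print("Fits full window:", fits_context(prompt_tokens, max_output_tokens, FULL_CONTEXT))
print("Fits hosted cap: ", fits_context(prompt_tokens, max_output_tokens, HOSTED_CAP))
```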

4. Benchmarks & Where It Lands

The single most-cited number at launch came from Artificial Analysis, which scored Nemotron 3 Ultra in reasoning mode at roughly 48 on its composite Intelligence Index. In context, that placed it:

Model                 | AA Intelligence Index (approx.) | Notes
----------------------|---------------------------------|---------------------------------------
Nemotron 3 Ultra      | ~48                             | Leading US open-weight model at launch
Gemma 4 31B           | 39.2                            | Next strongest US open weights
Nemotron 3 Super      | 36.0                            | The 120B family sibling
gpt-oss-120b          | 33.3                            | OpenAI open-weight model
Kimi K2.6 (and peers) | higher                          | Chinese-led open frontier still ahead

The honest summary: Nemotron 3 Ultra was the strongest open model from a US lab at release, but it did not top the global open-weight leaderboard, where Chinese-led models such as Kimi K2.6 still led. Composite indexes also compress a lot of detail, and Artificial Analysis listed different scores for reasoning and non-reasoning configurations, so treat 48 as the reasoning-mode figure and benchmark against your own tasks before drawing conclusions. Where Ultra clearly differentiates is long-context (RULER at 1M) and raw throughput per dollar for agentic workloads.

5. The Toggleable Reasoning Mode

Like several recent reasoning models, Ultra lets you switch extended thinking on or off per request through the chat template, using an enable_thinking flag. This matters more than it sounds. Reasoning tokens are expensive: they are output tokens, billed at the higher rate, and they add latency. For a hard architectural decision or a tricky proof, you want them on. For a quick classification, a routing decision, or a formatting fixup, you want them off.

In an agent, the smart pattern is to drive this flag from the step type rather than leaving it always-on. We cover that routing logic in the agent orchestration guide.
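
As a minimal sketch of that pattern, the helper below maps a step type to the request payload. The step names and the `thinking_kwargs` helper are hypothetical; the payload uses the same `chat_template_kwargs` shape this guide passes through `extra_body` in its first-call example, and other providers may expect a different parameter.

```python
# Hypothetical step types that justify paying for reasoning tokens.
THINKING_STEPS = {"plan", "architecture", "proof", "debug"}

def thinking_kwargs(step_type):
    """Build the extra_body payload that toggles reasoning for one request."""
    enable = step_type in THINKING_STEPS
    return {"chat_template_kwargs": {"enable_thinking": enable}}

for step in ("plan", "classify", "format"):
    print(step, "->", thinking_kwargs(step))
```

Defaulting unknown step types to thinking off keeps costs predictable; flip the default if missed reasoning is more expensive for your workload than extra output tokens.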

6. The Nemotron 3 Family: Nano, Super, Ultra

Ultra is the flagship, but it sits at the top of a three-tier family whose members all share the hybrid Mamba-Transformer MoE design. Knowing the lineup helps you avoid overpaying: you rarely need Ultra for every step.

  • Nemotron 3 Nano arrived first, in December 2025. It is the compact, high-throughput tier, with NVIDIA reporting roughly 4x the throughput of Nemotron 2 Nano. It targets edge deployment and very high-volume, low-latency agent steps.
  • Nemotron 3 Super (120B total, 12B active) launched at GTC on March 11, 2026. It also carries a 1M-token context and is tuned for collaborative multi-agent work such as software development and security triage. It is the default LLM in the NVIDIA RAG Blueprint.
  • Nemotron 3 Ultra (550B total, 55B active) is the June 2026 flagship for the hardest reasoning and the orchestrator role in multi-agent systems.

The intended pattern is a hierarchy: Ultra makes the hard calls, Super handles substantial worker tasks, and Nano runs the cheap, high-frequency steps. Because all three speak the same family format, routing between them is straightforward.
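
One way to express that hierarchy is a small rule-based selector. The difficulty labels, the 60-calls-per-minute threshold, and the `pick_tier` function are assumptions for illustration; the function returns tier names rather than real model identifiers, which you would map to your provider's IDs.

```python
def pick_tier(difficulty, calls_per_minute):
    """Choose a family tier for one agent step.

    difficulty: "easy", "medium", or "hard" (hypothetical labels).
    calls_per_minute: how often this kind of step runs.
    """
    if difficulty == "hard":
        return "ultra"          # the hard calls and orchestration
    if difficulty == "easy" and calls_per_minute >= 60:
        return "nano"           # cheap, high-frequency steps
    return "super"              # substantial worker tasks

for step in [("hard", 1), ("medium", 10), ("easy", 600), ("easy", 5)]:
    print(step, "->", pick_tier(*step))
```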

7. Openness & the OpenMDW License

NVIDIA shipped Nemotron 3 Ultra under the OpenMDW-1.1 license, a permissive open model license stewarded under the Linux Foundation. Crucially, NVIDIA released not just the weights but the training data and the recipes used to build the model. That level of openness is rare at this scale and has real consequences:

  • You can self-host the model with no per-token fees and full control over data residency.
  • You can fine-tune or distill it for your domain, and the open recipes make reproducing or extending the training credible rather than guesswork.
  • There is no usage-based licensing tied to a vendor API, which matters for regulated industries and air-gapped deployments.

As always, read the license text for your specific use rather than relying on a summary, especially around redistribution and any attribution or use-restriction terms.

8. How to Access Nemotron 3 Ultra

There are three practical paths, in increasing order of operational ownership:

  • NVIDIA-hosted API (build.nvidia.com / NIM). The fastest way to try the model. NVIDIA exposes an OpenAI-compatible endpoint through its NIM microservices, with a hosted preview on build.nvidia.com.
  • Third-party inference providers. The open weights mean providers picked it up quickly. It is available through multiple hosted endpoints, several of which are OpenAI-compatible, and some offered free tiers at launch.
  • Self-hosting. The weights are on Hugging Face in NVFP4 and BF16. Nemotron 3 Ultra got day-0 support in vLLM, plus a TensorRT-LLM deployment guide and NVIDIA Dynamo recipes. See our deployment guide for the hardware and commands.

Because most endpoints are OpenAI-compatible, a first call looks familiar:

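import os

from openai import OpenAI

# Point the standard OpenAI client at NVIDIA's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    # Read the key from the environment instead of hard-coding it.
    # NVIDIA_API_KEY is an assumed variable name; use whatever your setup defines.
    api_key=os.environ["NVIDIA_API_KEY"],
)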

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[
        {"role": "system", "content": "You are a precise engineering assistant."},
        {"role": "user", "content": "Plan a migration from a monolith to services."},
    ],
    # Reasoning mode is controlled via the chat template / extra body.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    temperature=0.6,
)

print(resp.choices[0].message.content)

The exact way to pass enable_thinking varies by provider, so check the endpoint docs. Some accept it in the request body; others expose a dedicated reasoning parameter.

9. Pricing & Cost Profile

As of June 2026, third-party hosted pricing clustered in this range:

  • Input: roughly $0.50 to $0.68 per million tokens.
  • Output: roughly $2.40 to $2.67 per million tokens.

A worked example makes the output-heavy reality concrete. Suppose an agent run consumes 2M tokens split 70 percent input and 30 percent output, priced at $0.68 input and $2.67 output per million:

input  = 2,000,000 * 0.70 = 1,400,000 tokens
output = 2,000,000 * 0.30 =   600,000 tokens

input cost  = 1,400,000 / 1,000,000 * $0.68 = $0.952
output cost =   600,000 / 1,000,000 * $2.67 = $1.602
total                                        = $2.554

Output dominates even at a 30 percent share, which is exactly why the toggleable reasoning mode is a cost lever and not just a quality knob. Reasoning tokens are output tokens. If you have steady, high volume, self-hosting the open weights can undercut per-token pricing; we run that break-even math in the deployment guide.
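
The arithmetic above can be wrapped in a small function so you can plug in your own token mix and provider prices. `run_cost` is a hypothetical helper, and the prices are the upper end of the range quoted in this section, not a live quote.

```python
def run_cost(input_tokens, output_tokens, input_price, output_price):
    """Return (input_cost, output_cost, total) in dollars.

    Prices are dollars per million tokens.
    """
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost, output_cost, input_cost + output_cost

# The worked example: 2M tokens split 70/30 at $0.68 input and $2.67 output.
input_cost, output_cost, total = run_cost(1_400_000, 600_000, 0.68, 2.67)
output_share = output_cost / total

print(f"input  ${input_cost:.3f}")
print(f"output ${output_cost:.3f}")
print(f"total  ${total:.3f}")
print(f"output share of spend: {output_share:.0%}")
```

Rerunning it with a smaller output count is a quick way to estimate how much turning reasoning off on routine steps would save.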

10. Use Cases & What to Build

Ultra is overkill for a chatbot. It earns its keep where reasoning depth, long context, and sustained throughput all matter at once:

  • Agent orchestration. The planner that decomposes a goal, assigns subtasks to cheaper models, and verifies their output.
  • Long coding sessions. Architectural decisions and cross-file refactors that need the whole repository in context.
  • Deep research. Synthesis across hundreds of sources held in a single long-context window.
  • High-stakes RAG. Verification across many retrieved documents where a wrong synthesis is costly.
  • Domain adaptation. Fine-tuning on the open weights for regulated or specialized workloads that cannot leave your environment.

11. Why Lushbinary for Nemotron Integration

A frontier open model is only useful once it is wired into a real product with sane cost controls, evaluation, and guardrails. Lushbinary builds production AI systems end to end: model selection and routing, retrieval pipelines, agent orchestration, and the cloud infrastructure to run any of it reliably. Whether you want to start on the hosted API or self-host Nemotron 3 Ultra on your own GPUs, we can design the architecture and ship it.

Tell us what you are building and we will map the fastest credible path from prototype to production.

