Welcome to My Weird Prompts. I'm Corn, he's Herman, and today we're doing an AI Model Spotlight. The model is NVIDIA Nemotron 3 Super, and we're going to spend the next twenty minutes or so pulling it apart properly. Let's start where we always start with these, which is the lab. Not a company that needs much introduction, but their AI model story is a bit more complicated than their chip story.
It really is. Most people know NVIDIA as the company that makes the GPUs that every AI lab in the world is desperately trying to get their hands on. That is still true. That is still the core business. But over the last couple of years NVIDIA has been making a much more deliberate push into being a model builder itself, not just the picks-and-shovels supplier.
They've been pretty candid about the fact that they were late to certain parts of that picture.
Jensen Huang said as much publicly, fairly recently. He acknowledged that NVIDIA was late to back foundational AI labs like Anthropic. His framing was that they recognized those labs needed massive compute funding, we're talking five to ten billion dollar ranges, and NVIDIA didn't move quickly enough to get exposure there. He called it a miss.
Which is a fairly unusual thing for a CEO to say out loud.
The follow-on was that they're now actively correcting for it. Huang said they're delighted to invest in players like OpenAI and Anthropic now. So the strategy has shifted from pure hardware supplier to something closer to infrastructure plus capital plus, increasingly, models.
Nemotron is the model side of that.
The Nemotron family is NVIDIA's attempt to show that they can build competitive frontier-adjacent models, not just sell the compute to run someone else's. They've also been expanding into robotics AI partnerships, physical AI as they're calling it, but that's a separate thread. For today, we're focused on Nemotron 3 Super specifically, which dropped on March eleventh, twenty twenty-six.
Let's get into what this thing actually is. A hundred and twenty billion parameters is the headline number, but that's not really the number that matters operationally, is it.
No, and this is where the architecture gets interesting. The model has a hundred and twenty billion total parameters, but only twelve billion of those are active on any given forward pass. That's the Mixture of Experts design, or MoE, doing its job. The idea is that you have a much larger pool of parameters than you ever use at once, and a routing mechanism decides which subset gets called for each input.
You're getting a model that has access to a hundred and twenty billion parameters worth of knowledge and capacity, but you're only paying the compute cost of running twelve billion at inference time.
That's the theory, yes. And NVIDIA has put a specific twist on how they implement it that they're calling Latent MoE. Standard MoE typically routes to multiple experts and you pay the compute cost of all of them. NVIDIA's version routes to four experts but charges you the inference cost of only one. The technical report goes into how they achieve that, but the practical upshot is that you're getting more expert coverage per token than a naive MoE implementation would give you at the same compute budget.
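If it helps to see the shape of that, here's a minimal sketch of plain top-k MoE routing in Python. This is the naive version where you pay compute for every routed expert, which is exactly the cost that Latent MoE is designed to avoid; the router and expert pool here are illustrative, not NVIDIA's implementation.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=4):
    """Naive top-k Mixture-of-Experts forward pass (illustrative only).

    x:       [tokens, d_model] input activations
    router:  linear layer mapping d_model -> num_experts (the gate)
    experts: list of small feed-forward networks (the expert pool)
    k:       number of experts each token is routed to
    """
    logits = router(x)                          # [tokens, num_experts]
    weights, idx = torch.topk(logits, k, dim=-1)  # pick k experts per token
    weights = F.softmax(weights, dim=-1)          # normalize mixing weights

    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e            # tokens sent to expert e in this slot
            if mask.any():
                # In this naive version, every routed expert runs: k experts'
                # worth of compute per token. Latent MoE's claim is getting
                # four experts' coverage at roughly one expert's cost.
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```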
What's the backbone underneath all of that? Because MoE is a routing strategy, not an architecture in itself.
Right, and this is the part that's genuinely unusual. The backbone is a hybrid of two things: a Transformer, which is the standard architecture that underpins basically every major language model you've heard of, and something called Mamba, which is a State Space Model, or SSM. SSMs process sequences differently from Transformers. They're more memory-efficient at long contexts because they don't have to attend to every prior token the way a Transformer does. The tradeoff historically has been that pure SSMs can lose precision on tasks that require fine-grained attention. The hybrid approach is trying to get the efficiency of Mamba at long range with the accuracy of Transformers where it counts.
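For the SSM side, a toy version of the classic linear state space recurrence shows why the memory stays flat: only a fixed-size state gets carried forward, no matter how long the sequence is. Real Mamba layers use input-dependent parameters and a parallel scan, which this sketch leaves out.

```python
import torch

def ssm_scan(x, A, B, C):
    """Toy linear state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    Memory is O(state_size) regardless of sequence length, because only the
    current state is carried forward. A Transformer, by contrast, keeps a
    KV cache that grows with every token it has seen.
    """
    seq_len, _ = x.shape
    state = torch.zeros(A.shape[0])     # the only memory that persists
    ys = []
    for t in range(seq_len):
        state = A @ state + B @ x[t]    # fixed-size state update
        ys.append(C @ state)            # readout for this step
    return torch.stack(ys)
```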
NVIDIA's technical report claims Mamba gives them something like four times better memory efficiency?
That's the figure they cite, yes. Four times better memory efficiency from the Mamba component. And then layered on top of all of this is Multi-Token Prediction, or MTP. Standard language models predict one token at a time. MTP adds a head to the model that lets it predict multiple tokens per step, which functions similarly to speculative decoding. The reported throughput gain from that alone is around three times faster inference.
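The speculative-decoding analogy is easiest to see as a draft-and-verify loop: cheaply drafted tokens get accepted as long as the full model agrees with them. This is a generic illustration of the idea, not NVIDIA's MTP head, and real implementations verify all the drafts in one batched forward pass rather than one at a time as this sketch does.

```python
def speculative_step(prefix, draft_tokens, next_token_fn):
    """Accept a run of draft tokens until the full model disagrees.

    prefix:        tokens accepted so far
    draft_tokens:  tokens proposed in one multi-token prediction step
    next_token_fn: the full model's next-token choice for a given prefix

    The win: when drafts match, several tokens land per verification step,
    so throughput rises without changing what the model would have said.
    """
    accepted = []
    for tok in draft_tokens:
        expected = next_token_fn(prefix + accepted)  # what the model would emit
        if expected != tok:
            accepted.append(expected)  # disagreement: keep the model's token, stop
            break
        accepted.append(tok)           # agreement: the drafted token is free
    return accepted
```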
You've got three compounding efficiency mechanisms: the sparse MoE activation, the Mamba-Transformer hybrid, and Multi-Token Prediction.
All three stacked. And the model was pre-trained on twenty-five trillion tokens, then post-trained using Supervised Fine-Tuning and Reinforcement Learning across more than ten environments. The RL piece is specifically oriented toward agentic behavior, which we'll get into when we talk about workloads.
Let's talk context window. What are we actually working with here.
This is where I need to flag something. The model description claims a one million token context window. That's the number NVIDIA is putting in the headline. But every provider currently serving this model through OpenRouter is capping it at two hundred and sixty-two thousand tokens. That's two hundred and sixty-two thousand, one hundred and forty-four tokens to be precise.
Is the one million number real or is it marketing.
We don't know from the information available to us. It could be a theoretical architectural capability that providers haven't enabled yet. It could be a hardware constraint at current deployment scale. The model card doesn't explain the gap. So when you're evaluating this for a long-context use case, the number you should be planning around is two hundred and sixty-two thousand tokens, not one million, until there's clarity on that discrepancy.
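Practically, that means any pre-flight check in a long-context pipeline should be pinned to the served limit, not the headline number. A trivial sketch, using a rough characters-divided-by-four token estimate that you'd swap for the model's real tokenizer in production:

```python
SERVED_CONTEXT_LIMIT = 262_144   # what providers actually serve today, per above

def fits_context(prompt: str, max_output_tokens: int,
                 limit: int = SERVED_CONTEXT_LIMIT) -> bool:
    """Crude pre-flight check. len // 4 is a rough English-text token
    estimate; use the real tokenizer for anything that matters."""
    est_prompt_tokens = len(prompt) // 4
    return est_prompt_tokens + max_output_tokens <= limit
```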
On modalities, this is text in, text out.
Text only as far as we can tell. No vision, no audio, no image input or output mentioned anywhere in the documentation. It does support tool calling, which is available through the DeepInfra provider specifically, and structured output is supported through DeepInfra and Nebius. Reasoning tokens are also exposed via the API, meaning you can see the model's chain-of-thought steps before the final answer, which is relevant if you're building anything that needs to audit or steer the reasoning process.
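For anyone wiring this up, the call looks like any OpenAI-compatible endpoint. Here's a minimal sketch against OpenRouter; the model slug, the tool definition, and the exact field carrying the reasoning tokens are all assumptions to verify against current provider docs.

```python
# Minimal sketch of tool calling through OpenRouter's OpenAI-compatible API.
# The model slug and the reasoning field name are assumptions -- check docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "search_tickets",           # hypothetical tool, for illustration
        "description": "Search the ticket queue by keyword.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-super",        # hypothetical slug -- check OpenRouter
    messages=[{"role": "user",
               "content": "Find open tickets about login failures."}],
    tools=tools,
)

msg = resp.choices[0].message
print(getattr(msg, "reasoning", None))      # chain-of-thought, if exposed
print(msg.tool_calls)                       # structured tool invocation, if any
```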
Let's talk pricing. Herman, I know you've got the breakdown, but before we get into the numbers, flag the caveat for anyone who needs to hear it.
Right, so all pricing we're about to cite is as of April twenty, twenty twenty-six. These numbers shift, sometimes weekly, especially on a routing layer like OpenRouter where providers are competing for traffic. Treat everything we say here as a snapshot, not a contract.
What's the headline rate.
OpenRouter's blended headline is nine cents per million input tokens and forty-five cents per million output tokens. For a model at this parameter scale, that is low. You're getting one hundred and twenty billion total parameters, twelve billion active, for under half a dollar per million tokens out.
There are three providers underneath that headline number.
DekaLLM is the cheapest, running FP eight quantization, at eight point nine cents input and forty-four point eight cents output. DeepInfra is in the middle at ten cents input and fifty cents output, running BF sixteen, which is effectively unquantized, and they're the only provider currently offering cache read pricing, also at ten cents per million. Then Nebius Token Factory out of the Netherlands is the most expensive of the three at thirty cents input and ninety cents output, running FP four quantization.
Nebius is three times the input cost of DekaLLM. What are you getting for that.
Nebius is averaging around eighty-one tokens per second. DeepInfra is around seventy-five. DekaLLM drops to about fourteen tokens per second. If you're running latency-sensitive workloads, that gap matters.
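A quick back-of-the-envelope comparison makes the tradeoff concrete. Using the snapshot rates above on a hypothetical long agentic session:

```python
# Cost and rough generation time per provider, using the April 2026 snapshot
# rates quoted above (USD per million tokens). These shift often -- re-check.
providers = {
    #             $/M in, $/M out, tokens/sec
    "DekaLLM":   (0.089,  0.448,   14),
    "DeepInfra": (0.10,   0.50,    75),
    "Nebius":    (0.30,   0.90,    81),
}

tokens_in, tokens_out = 200_000, 50_000     # one long agentic session, say

for name, (rate_in, rate_out, tps) in providers.items():
    cost = tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out
    seconds = tokens_out / tps              # generation time, ignoring prefill
    print(f"{name:10s} ${cost:.4f}  ~{seconds:,.0f}s to generate")
```

Run the numbers and the shape of the decision shows up immediately: DekaLLM's session costs about four cents but takes the better part of an hour to generate, while Nebius costs roughly two and a half times more and finishes in about ten minutes.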
There's also a free tier.
There is, and the warning here is straightforward. The free tier logs all prompts and outputs for provider model improvement. That means no production use, no business-critical data, and definitely no personally identifiable or sensitive information going through it.
Let's get into the benchmarks. What is NVIDIA actually claiming here, and how much of it holds up.
There are two layers to this. There are the lab's own claims, and then there are the independent benchmark scores. They don't always tell the same story, and this model is a good example of that tension.
Start with the lab claims.
The headline claim is over fifty percent higher token generation compared to leading open models. The problem is they don't name the competitors. No side-by-side table, no specific model cited. So you can't verify that number directly from what's published on the model card. The technical report and NVIDIA's blog go further and claim up to five times higher throughput versus the prior Nemotron Super, and up to two point two times higher than GPT-OSS-120B and seven point five times higher than Qwen three-point-five 122B on B two hundred GPUs. Those are more specific, but they're also NVIDIA's own measurements on NVIDIA's own hardware, so take them as directional rather than definitive.
The independent scores.
Artificial Analysis has composite indices for this model. Intelligence Index is thirty-six point zero, which puts it better than seventy-eight percent of models they compare. Coding Index is thirty-one point two, also better than seventy-eight percent. Agentic Index is forty point two, better than seventy-four percent. Artificial Analysis also gave it an openness score of eighty-three out of one hundred, which is notably high. Weights, training data, and training recipes are all released. That's not common at this scale.
What about the specific reasoning benchmarks.
GPQA Diamond, which is graduate-level science questions, comes in at eighty percent. That's a strong number. IFBench, which tests instruction following, is seventy-one point five percent. Long-context reasoning on AA-LCR is sixty percent. The τ²-Bench Telecom score, which tests conversational agents in a dual-control setting, is sixty-seven point eight percent.
The ones that are less flattering.
HLE, Humanity's Last Exam, is nineteen point two percent. GDPval-AA, which measures performance on economically valuable tasks, is twenty-five point one percent. CritPt, research-level physics, is three point one percent. So this is not a model that's going to solve frontier research problems. The profile is strong on reasoning and instruction following, weaker on deep knowledge retrieval and hard science.
There's also a hallucination rate listed.
Yes, the AA-Omniscience hallucination rate is thirteen percent, meaning among the answers it gets wrong, thirteen percent involve a confidently incorrect response rather than an abstention. That's worth tracking if you're deploying this in any context where factual accuracy is load-bearing.
The lab also names AIME twenty twenty-five, TerminalBench, and SWE-Bench Verified as benchmark wins.
They do name those, but the actual scores for those three are not published on the model card page we reviewed. So we can't tell you what the numbers are. NVIDIA says they're leading results. We can't independently verify that from the available documentation.
Let's talk about where you'd actually reach for this. Given everything we've covered on the architecture and the benchmarks, what does the workload profile look like?
The clearest answer is multi-agent applications. That's not us reading between the lines, that's the stated design target. The architecture choices, the Latent Mixture of Experts, the Multi-Token Prediction head, the long context window, they all point in the same direction. This is a model built to sit inside an agent loop and run for a long time without falling over.
The token volume data from OpenRouter backs that up.
The top two apps by token consumption are both agent frameworks. OpenClaw, which is an AI agent for messaging, file, and email automation, has pushed through twelve point six billion tokens on this model. Hermes Agent from Nous Research, which is a persistent self-improving agent with over forty tools, has pushed through six point one nine billion tokens. Those are not people running one-off queries. That's sustained, high-volume agentic use. Claude Code and OpenHands, both agentic coding tools, are also in the top five. So the actual usage pattern is consistent with the design intent, which is not always the case.
What about long-context specifically. The context window situation is complicated, as we noted earlier.
The practical ceiling as served through the providers on OpenRouter is two hundred and sixty-two thousand tokens. That's still a very large working memory for a model. Cross-document reasoning, multi-step task planning where you need to hold a lot of state, retrieval-augmented generation where you're stuffing in large chunks of retrieved content, these all benefit from that window. The one million token figure in the description is either a theoretical capability or a future roadmap item. At this moment, you're working with two hundred and sixty-two thousand.
What about the coding use case specifically.
Agentic coding is a better fit than pure code generation. The Terminal-Bench Hard score of twenty-eight point eight percent and the SciCode score of thirty-six percent are not numbers that put this at the top of a coding leaderboard. But the tool calling support and the structured output capability, both available through DeepInfra and Nebius, make it viable as an orchestrator in a coding pipeline. Think less about asking it to write a function and more about asking it to manage a sequence of tool calls to accomplish a coding task.
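The orchestrator pattern there is a short loop: the model picks a tool, you execute it, you feed the result back, and you stop when it answers directly. A bare-bones sketch, with the model slug and tool plumbing as illustrative assumptions:

```python
# Bare-bones agent orchestration loop of the kind described above.
# The model slug is a hypothetical; tool_impls maps tool names to functions.
import json

def run_agent(client, messages, tools, tool_impls, max_steps=10):
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="nvidia/nemotron-3-super",   # hypothetical slug -- verify
            messages=messages,
            tools=tools,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content                 # no tool requested: final answer
        messages.append(msg)                   # keep the assistant turn in history
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = tool_impls[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result), # tool result goes back to the model
            })
    raise RuntimeError("agent did not terminate within max_steps")
```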
RAG is explicitly called out as a recommended workload.
Yes, and the long context window and the instruction following score of seventy-one point five percent on IFBench both support that. If you need a model that can hold a large retrieval context and still follow a structured output format reliably, the capability profile is there. The hallucination rate of thirteen percent is the number to watch in any RAG deployment where factual precision matters.
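The mechanical part of that is just budgeted packing: fit the highest-scoring retrieved chunks into the served window and leave headroom for the answer. A sketch, again using the rough length-based token estimate rather than a real tokenizer:

```python
def build_rag_prompt(question, chunks, budget_tokens=262_144, reserve=8_192):
    """Pack retrieved (score, text) chunks into the served context window,
    highest-scoring first, leaving room for the answer. Token counts use
    the crude len // 4 estimate -- use the real tokenizer in production."""
    remaining = budget_tokens - reserve - len(question) // 4
    picked = []
    for score, text in sorted(chunks, reverse=True):
        cost = len(text) // 4
        if cost > remaining:
            break                       # window is full; drop the rest
        picked.append(text)
        remaining -= cost
    context = "\n\n---\n\n".join(picked)
    return (f"Use only the context below to answer.\n\n{context}\n\n"
            f"Question: {question}")
```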
How has the broader community responded to this one? Is there genuine enthusiasm or is this mostly NVIDIA marketing doing its job?
The reception has been positive, and I think that's worth saying clearly because it's not always the case when a hardware company releases a model. The Artificial Analysis write-up is probably the most substantive independent take we have so far. They gave it an openness score of eighty-three out of one hundred, which puts it well ahead of most peers on that dimension. NVIDIA released the weights, the training datasets, and the training recipes. That's a meaningful commitment to openness, and the community has noticed.
The technical report is public as well.
Yes, there's a full technical report from NVIDIA Research. It's not a marketing document. It goes into the architecture decisions in real detail, the pre-training on twenty-five trillion tokens, the post-training pipeline with supervised fine-tuning and reinforcement learning. Engineers who want to understand what they're running have the material to do that. That's not universal in this space.
What about the throughput claims? The five times higher throughput figure is the one NVIDIA has been leading with.
The NVIDIA blog and technical report both claim up to five times higher throughput than the prior Nemotron Super, and they attribute that to the combination of the hybrid Mamba-Transformer architecture, the Latent Mixture of Experts design, the Multi-Token Prediction head, and the NVFP four precision format. They also cite two point two times higher throughput than GPT-OSS-120B and seven point five times higher than Qwen three-point-five at 122B, measured on B two hundred GPUs. Those are specific claims with named comparators, which is more than you usually get. The OpenRouter provider data we looked at earlier is consistent with the throughput story, though that's a different measurement context.
Any critical voices?
There's a piece from a German tech outlet, Igor's Lab, that I think is worth flagging. The framing is that the model is impressive for agentic use cases but that the packaging around it warrants scrutiny. The specific point they make is that the one million token context window claim risks being read as a proxy for raw intelligence, and they argue those are different things. A long context window is an architectural capability. What you do with it depends on the model's reasoning quality, and reasoning quality has limits that context length doesn't fix. That's a fair point and it echoes what we said earlier about the gap between the claimed one million tokens and the two hundred and sixty-two thousand tokens actually served through current providers.
The honest read is strong reception with one recurring asterisk.
That's about right. The openness is real, the efficiency gains are real, and the agentic design intent is well-supported by the architecture. The one million token figure is the thing that keeps coming up as a caveat worth holding onto.
Alright, let's land this. If you're an engineer or an architect deciding whether Nemotron 3 Super belongs in your stack, what's the short version?
The short version is this. If you are building multi-agent systems, if you have long-context workloads, if you need a capable reasoning model that you can actually inspect, modify, and self-host, this is a serious option. The architecture was designed for exactly those use cases and the benchmark profile supports that. The agentic index score, the GPQA Diamond result, the throughput numbers, the real-world token volume from apps like OpenClaw and Hermes Agent, all of that is pointing in the same direction.
The price point is part of that story.
Nine cents per million input tokens at the OpenRouter headline rate is competitive for what you're getting. If you're running high-volume agentic pipelines, that matters. And if you want to avoid per-token costs entirely, the open weights mean you can run this yourself, assuming you have the hardware for it.
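Self-hosting would look something like the vLLM sketch below, assuming vLLM supports the hybrid architecture and that the repository id is right; both of those are assumptions to check against the model card before you provision anything.

```python
# Minimal self-hosting sketch with vLLM. The repo id is a hypothetical --
# check the model card -- and architecture support needs verifying too.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Super",   # hypothetical repo id
    tensor_parallel_size=4,            # an MoE at this scale wants several GPUs
)
out = llm.generate(
    ["Summarize the tradeoffs of hybrid Mamba-Transformer models."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```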
When would you not reach for it?
A few clear cases. If your workload requires vision or any multimodal input, this is not the model. The page is silent on multimodal capability and you should treat that as a no. If you're on the free tier, do not put sensitive data through it. Prompts and outputs are logged for provider model improvement. That's not a judgment, it's just the terms, and they're stated plainly. And if you are expecting the full one million token context window in production today, check with your provider first. Two hundred and sixty-two thousand tokens is what's actually being served right now, and that gap between the stated capability and what's live is something you need to verify before you build around it.
Any final framing?
NVIDIA is a hardware company that has built a model that competes credibly with dedicated AI labs on the workloads it was designed for. That's not a given. The openness is genuine, the efficiency story is well-supported, and the community reception has been substantive rather than reflexive. The caveats we've raised throughout are real but they're bounded. This is not a model with hidden problems. It's a model with a specific design target, and if your use case matches that target, the evidence says it's worth a serious evaluation.
Nemotron 3 Super from NVIDIA. Thanks for listening to My Weird Prompts.