#2067: MoE vs. Dense: The VRAM Nightmare

MoE models promise giant brains on a budget, but why are engineers fleeing back to dense transformers? The answer is memory.

Episode Details
Episode ID
MWP-2223
Published
Duration
24:18
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The AI industry is currently caught in a tug-of-war between two competing architectures: Mixture of Experts (MoE) and dense transformers. While headlines frequently celebrate massive MoE models like DeepSeek-V3 and Mixtral for their impressive parameter counts, a closer look at deployment realities reveals why dense models like Llama are experiencing a resurgence in production environments. The core difference lies in how these models utilize their "brain" during inference. A dense model activates every single parameter for every token generated, acting as a consistent, unified generalist. In contrast, an MoE model resembles a university faculty, where a "router" activates only a small subset of specialized experts for any given query, keeping the rest dormant.

The primary allure of MoE is training efficiency. Companies can scale the total "knowledge capacity" of a model—hitting parameter counts in the hundreds of billions—without the linear explosion of compute costs associated with dense training. For instance, a model like DeepSeek-V3 might have over 600 billion parameters total but only fire roughly 37 billion at a time. This allows for the breadth of a giant model with the per-token compute of a mid-sized one. However, this efficiency is a mirage when it comes to inference and memory.
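As a back-of-envelope check on those numbers (using the commonly cited 671B-total / 37B-active split for DeepSeek-V3, and the standard rough approximation of about 2 FLOPs per active parameter per generated token):

```python
# Rough per-token compute: FLOPs ~= 2 * active parameters per forward
# pass (standard back-of-envelope rule, ignoring attention details).
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_70b = flops_per_token(70e9)        # dense: every parameter is active
moe_total, moe_active = 671e9, 37e9      # DeepSeek-V3-style split
moe = flops_per_token(moe_active)

print(f"dense 70B : {dense_70b:.2e} FLOPs/token")
print(f"MoE 671B  : {moe:.2e} FLOPs/token")
print(f"MoE active fraction: {moe_active / moe_total:.1%}")
```

The MoE does less math per token than the 70B dense model despite holding almost ten times the parameters; only about 5.5% of its weights fire on any given token.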

The critical bottleneck is VRAM. To run an MoE model, you must load all of its parameters into memory, even if only a fraction are used for computation. This creates a "VRAM tax": an MoE model requires massive hardware just to exist. A dense 70B model might fit on a single high-end GPU node, offering predictable latency and cost. Conversely, a 600-billion-parameter MoE might require eight or more GPUs just to hold the weights, with most of them acting as expensive storage drives 90% of the time. For startups or individual developers, this infrastructure requirement is often a dealbreaker, making dense models far more accessible for low-to-mid traffic applications.
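A minimal sketch of the "VRAM tax" arithmetic, assuming fp16 weights and 80 GB (A100/H100-class) cards. Real deployments also need headroom for KV cache and activations, so these are lower bounds:

```python
import math

# GPUs needed just to hold the weights, ignoring KV cache/activations.
def gpus_to_hold(total_params: float, bytes_per_param: int = 2,
                 gpu_vram_gb: int = 80) -> int:
    weight_gb = total_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_vram_gb)

print(gpus_to_hold(70e9))    # dense 70B at fp16: 140 GB -> 2 GPUs
print(gpus_to_hold(671e9))   # 671B MoE at fp16: 1342 GB -> 17 GPUs
```

Quantization or fp8 serving shrinks both figures, but the ratio between them is what drives the "seven GPUs as storage drives" problem.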

Beyond hardware, MoE introduces significant complexity in stability and fine-tuning. The router mechanism is prone to "expert collapse," where a feedback loop causes one expert to dominate all traffic, effectively turning the model into a single, overworked specialist while wasting the rest of the parameters. Fine-tuning exacerbates this; updating the model on niche data (like legal documents) can disrupt the router's delicate balance, causing "expert migration" where the model loses general reasoning capability. In contrast, dense models update smoothly across their entire parameter set, offering stability and predictability that enterprise engineers prioritize.

Latency and edge deployment further tilt the scales toward dense models. In high-throughput scenarios, MoE can suffer from load-balancing issues: imagine a thousand users asking coding questions simultaneously, bottlenecking the "coding expert" slice while other experts sit idle. Dense models distribute work evenly, offering consistent latency. On edge devices such as smartphones and laptops, RAM is scarce and shared. An MoE model might occupy 32GB of RAM while using only 4GB for computation, paralyzing the device. Dense models put every loaded byte to work, making them the only viable option for local, on-device AI.
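A toy simulation of the routing hot-spot described above; the 80% skew toward a single expert is an assumption chosen for illustration, not a measured figure:

```python
import random

# 1000 similar requests hit a top-1 router. When traffic is topically
# skewed (everyone asks coding questions), one expert absorbs most of
# the load while the other experts sit idle.
random.seed(0)
num_experts = 8
loads = [0] * num_experts
for _ in range(1000):
    # 80% of requests route to expert 0 (the "coding expert");
    # the remainder spread uniformly across all experts.
    e = 0 if random.random() < 0.8 else random.randrange(num_experts)
    loads[e] += 1

print(loads)
print("hottest expert share:", max(loads) / sum(loads))
```

A dense model has no equivalent failure mode: every request exercises the same weights, so hardware utilization stays uniform by construction.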

Ultimately, the choice isn't about which architecture is "better," but where each excels. MoE is a powerhouse for massive-scale training and high-throughput inference in resource-rich environments. Dense models win on predictability, fine-tuning stability, and accessibility for the majority of developers. As the industry matures, the "free lunch" of MoE is being reevaluated against the practical costs of memory and complexity, ensuring dense transformers remain a cornerstone of AI deployment.



Corn
So, I was looking at the landscape of these new model releases, and it feels like we are living in a world completely obsessed with the flashy and the complex. It is almost like every time a new benchmark drops, the headline is some massive Mixture of Experts architecture. But Daniel sent us a prompt that really grounds the conversation. He wrote, quote: "Generate an episode framed as Mixture of Experts versus dense transformers—the empire strikes back. Mixture of Experts, or MoE, gets all the headlines with DeepSeek and Mixtral, but dense models like Llama and Qwen-dense refuse to die. Walk through the tradeoffs: MoE wins on training-compute-per-quality but is a nightmare to serve due to huge VRAM footprint for parameters you mostly don't activate, routing complexity, and fine-tuning instability. Dense is simpler, has more predictable latency, and is better for edge deployment and fine-tuning. Conclude with why dense isn't dead and where each architecture wins." End quote.
Herman
I love that framing, the empire strikes back. My name is Herman Poppleberry, and I have been waiting for us to really dig into the architectural guts of this. Because you are right, Corn, if you just read the tech blogs, you would think dense models are these dusty relics from twenty twenty-two. But the reality on the ground—especially for engineers actually deploying this stuff—is that dense models are having a massive resurgence. By the way, before we dive too deep into the VRAM weeds, today’s episode is powered by Google Gemini three Flash. It is the brain behind the script today.
Corn
Gemini three Flash, keeping it snappy. I like it. So, Herman, let's start with the basics for anyone who has been living under a rock. When we say MoE versus dense, what is the actual physical difference in how these things think? Because from the outside, they both just look like a chat box.
Herman
It comes down to how the model uses its brain for every single word it generates. A dense model, like Llama three point three seventy B, is a generalist. Every single one of those seventy billion parameters is "switched on" and doing math for every single token. It is like a single person who knows everything the model knows and uses their whole brain to answer you. MoE is more like a university faculty. You might have a model with four hundred billion parameters total, but for any given question about cooking, it only activates a small "expert" group of, say, twelve billion parameters. The rest of the brain stays dark.
Corn
So MoE is basically a "work smart, not hard" approach to compute. If I'm just asking for a pancake recipe, I don't need the quantum physics department waking up and charging me for electricity.
Herman
That is the theory, and it is why it wins so hard on training efficiency. If you are a company like DeepSeek, and you want to train a model that performs like a five hundred billion parameter giant but you only want to pay the electricity bill for a forty billion parameter model, MoE is your only choice. It lets you scale the "knowledge capacity" of the model—the total number of parameters—without the training cost exploding linearly. DeepSeek-Vthree, which we saw earlier this year, is the poster child for this. It has over six hundred billion parameters, but it only "fires" about thirty-seven billion at a time.
Corn
But wait, if I’m only firing thirty-seven billion parameters, am I really getting the "intelligence" of a six hundred billion parameter model? Or is it just a very large collection of small, specialized models?
Herman
That’s the million-dollar question. In a way, you’re getting the "breadth" of a giant. Imagine a library with a million books. A dense model is like a person who has memorized every page of every book. An MoE model is like a library with a thousand specialized librarians. When you ask a question, the "router" finds the two librarians best suited for the job. You get the benefit of all that specialized knowledge without having to pay for a thousand people to speak at once. But the catch—and there’s always a catch—is that the library still has to exist. You still need the physical space for all those books and all those librarians.
Corn
Okay, so that sounds like a free lunch. I get the smarts of a giant for the price of a mid-sized model. Why isn't every single model on Hugging Face an MoE model by now? Why is the "Empire" of dense models striking back?
Herman
Because the "free lunch" only applies to the math, not the storage. This is where the nightmare begins for anyone not named Google or Microsoft. To run that DeepSeek model that only uses thirty-seven billion parameters for math, you still have to load all six hundred billion parameters into your Video RAM. Parameters are like books in a library. Even if you are only reading one book at a time, you still need the shelves to hold all of them.
Corn
Right, and VRAM is the most expensive real estate in the world right now. So if I want to run a "fast" MoE model, I might still need eight A-one-hundred GPUs just to hold the weights in memory, even if seven of those GPUs are basically just acting as very expensive hard drives for ninety percent of the time.
Herman
You nailed it. That is the VRAM tax. If you take a dense model like Llama three point three seventy B, it fits comfortably on a single high-end node. It is predictable. But with an MoE like Mixtral eight-by-twenty-two B, you are looking at roughly one hundred and forty billion total parameters, of which only about thirty-nine billion fire per token. Even though it is "faster" per token, the infrastructure requirement to even get it out of bed is massive. For a startup or a local developer, that is a dealbreaker.
Corn
Does that mean the "inference cost" is actually higher for MoE in some cases? If I have to rent eight GPUs to run an MoE but only one for a dense model, am I really saving money?
Herman
This is where the math gets tricky. If you have a massive amount of traffic—think millions of users—MoE can be cheaper because those eight GPUs can process tokens much faster than one GPU could. You're getting higher "throughput." But if you’re a small dev with a low-traffic app, you’re paying for seven idle GPUs just to keep the model weights warm. It’s like renting a whole bus to drive one person to work. The bus is efficient if it's full, but it's a huge waste if it's just you and the driver.
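The bus analogy in rough numbers; the GPU rental price and the throughput figures below are hypothetical, chosen only to show how utilization flips the comparison:

```python
# Cost per million tokens = hourly cluster cost / tokens generated
# per hour. All prices and throughputs here are illustrative.
def cost_per_mtok(gpus: int, dollars_per_gpu_hour: float,
                  tokens_per_second: float) -> float:
    hourly_cost = gpus * dollars_per_gpu_hour
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1e6

# Saturated traffic: the 8-GPU MoE cluster pushes far more tokens/sec,
# so its cost per token undercuts the single-GPU dense model.
print(cost_per_mtok(gpus=1, dollars_per_gpu_hour=2.0, tokens_per_second=50))
print(cost_per_mtok(gpus=8, dollars_per_gpu_hour=2.0, tokens_per_second=900))

# Low traffic: both only serve 5 tokens/sec of real demand, but the
# MoE still pays rent on all eight cards ("the empty bus").
print(cost_per_mtok(1, 2.0, 5))
print(cost_per_mtok(8, 2.0, 5))
```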
Corn
I want to talk about the "Router." This is the part of the MoE that acts as the traffic cop, right? It decides which expert gets which token. That sounds like a point of failure to me. If the traffic cop is having a bad day, the whole model falls apart.
Herman
It is incredibly fragile. This is what we call "routing collapse" or "expert collapse." Think of it this way: if the router decides, for whatever reason, that Expert Number Five is the best at everything, it starts sending every token there. Expert Number Five gets all the "practice" and all the gradient updates during training, while the other experts just sit there and atrophy. Eventually, you don't have a Mixture of Experts; you just have one very overworked, mediocre expert and a lot of wasted VRAM.
Corn
It is like a group project where one person does all the work, but you still have to pay the tuition for all eight people. But how does that happen in the first place? Is it a math error or just bad luck with the data?
Herman
It's usually a feedback loop. In the beginning of training, the router makes a random choice. If Expert Five happens to be slightly better at the first few sentences, the router sends more work there. Because Expert Five gets more work, it gets more "training," so it becomes even better. It’s a "rich get richer" scenario. Researchers have to use these weird "load balancing losses"—basically a mathematical stick to beat the router into sharing the work—but even then, it’s never perfect.
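The "mathematical stick" Herman mentions is typically an auxiliary load-balancing loss. A pure-Python sketch of the Switch-Transformer-style version, N · Σᵢ(fᵢ · Pᵢ), where fᵢ is the fraction of tokens top-1 routed to expert i and Pᵢ is the mean router probability mass on it; the loss bottoms out at 1.0 when routing is perfectly uniform:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits):
    """Switch-style aux loss: num_experts * sum_i(f_i * P_i).
    Collapsed routing scores near num_experts; balanced scores 1.0."""
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    f = [0.0] * n_experts   # fraction of tokens routed to each expert
    p = [0.0] * n_experts   # mean router probability per expert
    for logits in router_logits:
        probs = softmax(logits)
        top1 = max(range(n_experts), key=lambda i: probs[i])
        f[top1] += 1 / n_tokens
        for i in range(n_experts):
            p[i] += probs[i] / n_tokens
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

collapsed = [[5.0, 0.0, 0.0, 0.0]] * 8   # every token prefers expert 0
balanced  = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0],
             [0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 0.0, 5.0]] * 2
print(load_balancing_loss(collapsed), load_balancing_loss(balanced))
```

Because collapsed routing scores higher, gradient descent on this term pushes the router back toward sharing work, which is exactly the "stick" in the feedback loop Herman describes.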
Corn
And it gets worse when you try to fine-tune them. This is one of the biggest reasons the "Empire" of dense models is winning in the enterprise space. Let's say you take a great MoE model and you want to fine-tune it specifically on your company's legal documents. In a dense model, every parameter gets updated. It is a smooth, stable process. In an MoE, the fine-tuning data might only trigger a couple of experts. You end up "breaking" the delicate balance the router learned during initial training. You get "Expert Migration" where the model suddenly forgets how to do basic logic because the legal tuning pushed all the "thinking" into the "legal" expert.
Herman
And imagine you’re a developer trying to debug this. In a dense model, if the model is hallucinating, you can look at the weights and the data. In an MoE, you have to figure out if the expert is wrong or if the router is just sending the token to the wrong place. Is the lawyer expert answering the coding question? Or is the coding expert trying to learn law? It adds a whole new layer of "what went wrong?" to the engineering process.
Corn
So you end up with a model that knows your contracts but has the IQ of a toaster for everything else. Meanwhile, the dense model is just sitting there, taking the updates across its whole brain, staying stable. I can see why an engineer would prefer the "boring" dense model for a production app. It is predictable.
Herman
Predictability is the keyword. In high-throughput environments—like if you are running a customer service bot for a million people—dense models often actually have better latency. With MoE, you have "load balancing" issues. If a thousand people all ask a coding question at the same time, they all get routed to the "coding expert." That specific slice of the hardware gets slammed while the "poetry expert" slice sits idle. You get these weird bottlenecks that you just don't see with a dense transformer where the workload is perfectly distributed.
Corn
That is a great point. It is the difference between a buffet where everyone is crowding the prime rib station and a sit-down dinner where everyone gets the same plate at the same time. The buffet might have more variety, but the sit-down dinner is more efficient for the kitchen.
Herman
And think about the "cold start" problem. If you’re using serverless GPUs, loading a 70B dense model takes a few seconds. Loading a 600B MoE model involves moving hundreds of gigabytes across a network. Your "latency" isn't just the time it takes to generate the word; it's the time it takes to even get the model ready to talk. For many applications, that initial wait is a dealbreaker.
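Rough numbers for the cold-start math; the bandwidth figures (a fast local NVMe drive versus a 10-gigabit network link) are illustrative assumptions:

```python
# Weight-loading time ~= bytes / effective bandwidth.
def load_seconds(params: float, bytes_per_param: int,
                 gb_per_second: float) -> float:
    return params * bytes_per_param / (gb_per_second * 1e9)

# 70B dense at fp16 from local NVMe (~5 GB/s): under half a minute.
print(f"{load_seconds(70e9, 2, 5):.0f} s")
# 671B MoE at fp16 over a 1.25 GB/s (10 Gbit) link: many minutes.
print(f"{load_seconds(671e9, 2, 1.25):.0f} s")
```

The absolute numbers depend heavily on caching and parallel loading, but the order-of-magnitude gap is the point: serverless cold starts scale with total parameters, not active ones.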
Corn
Well, I promised I wouldn't say that word, so let me say: you are spot on. And that brings us to the edge. Think about your phone or your laptop. We are seeing these Llama three point two models, the one-B and three-B versions, designed to run locally. You can't really do that with MoE effectively yet because of that memory overhead. If you want a model on your iPhone, you want every single byte of that memory to be working for you at all times. You can't afford to have three gigabytes of "dormant experts" sitting in your phone's RAM just in case you ask a question about seventeenth-century French poetry.
Herman
On a device like a MacBook or a phone, RAM is shared between the system and the GPU. If an MoE model takes up 32GB of RAM but only uses 4GB for the actual math, you’ve just paralyzed the rest of the computer for no reason. Dense models are "maximally efficient" for the storage they occupy. Every bit of data you load into memory is actually being used to generate your answer.
Corn
Right, the "slop" factor. We talked about "The Slop Reckoning" in a previous life—this idea that we are using nuclear reactors to toast bagels. MoE feels a bit like that for the edge. It is overkill on the storage side for a device that is already starved for memory. Is there any world where "Edge MoE" makes sense? Maybe if the experts were tiny?
Herman
People are trying! There’s research into "Sparse MoE" where the experts are extremely small, but then the routing overhead becomes a larger percentage of the total work. It’s like having a thousand tiny tools in your toolbox instead of ten big ones. You spend more time looking for the right tool than actually using it. For now, the "Empire" of dense models owns the edge.
Corn
And there is a "Quantization Tax" too. People in the local LLM community love to squeeze models down to four-bit or even two-bit precision to make them fit on consumer hardware. Dense models take this incredibly well. You can crush a seventy-billion parameter dense model down to four bits and it barely loses any intelligence. But MoEs are notoriously finicky with quantization. Because the experts are often smaller and more specialized, if you start clipping their precision, they lose their "edge" much faster than a big, redundant dense model would.
Herman
That’s a really subtle but important point. In a dense model, knowledge is distributed. If you lose a little precision in one area, the rest of the "brain" can often compensate. In an MoE, if you quantize the "coding expert" too hard, there’s no one else to pick up the slack. The model just stops being able to code. It’s much more "brittle" under pressure.
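The memory side of the quantization story is simple arithmetic:

```python
# Weight footprint of a 70B-parameter model at different precisions.
def weight_gb(params: float, bits_per_param: int) -> float:
    return params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_gb(70e9, bits):.1f} GB")
```

At 4 bits, the 70B weights drop from 140 GB to roughly 35 GB, which is what brings them within reach of high-end consumer hardware; the accuracy cost of that squeeze is the part that differs between dense and MoE, as discussed above.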
Corn
So the "Empire" is effectively the Llama-four and the Qwen-Denses of the world saying, "We might be heavy on the compute, but we are easy to live with." It is the difference between a high-maintenance supercar that needs a specialized mechanic and a reliable pickup truck that you can fix with a wrench in your driveway.
Herman
That is the perfect analogy. And look at what Qwen is doing. They released Qwen-two point five recently, and their dense models are putting up numbers that rival MoE models twice their size. They are proving that you can still get massive "knowledge density" in a dense architecture if you curate your data well enough. We used to think you needed MoE to hit certain performance tiers. In twenty twenty-six, we are realizing that "better data" can make a seventy-B dense model act like a two-hundred-B MoE.
Corn
I've heard this called the "Chinchilla Optimality" debate. Are we just realizing that we weren't training the dense models long enough?
Herman
Part of it is that. We used to think a 70B model was "done" after two trillion tokens. Now we’re seeing Llama models trained on fifteen trillion tokens. They are becoming "super-dense." They are packing more information into every single parameter. It turns out that a "small" brain that is incredibly well-educated is often better than a "giant" brain that is mostly empty space.
Corn
So where does MoE actually win? Because it isn't like DeepSeek is stupid. They are saving millions of dollars on training. If you are a frontier lab, and you are trying to build the "one model to rule them all," the "library of everything," MoE is still the king of training efficiency, right?
Herman
If your goal is to train on fifteen trillion tokens and you have a fixed budget of GPU hours, MoE lets you "see" more data per dollar. It allows for massive scaling. If you are a cloud provider like OpenAI or Anthropic, and you are serving millions of users, you can afford the specialized infrastructure to handle the routing and the VRAM. You can build custom kernels that make MoE serving efficient. For the "God Models," MoE is the architecture of choice.
Corn
But what about the "routing" itself? Is it just a simple switch, or is there a "mini-brain" inside the model doing that job?
Herman
It’s essentially a small linear layer. It looks at the incoming token and assigns a probability score to each expert. "I think this is a 90% match for Expert A and a 10% match for Expert B." The model then does a weighted average of their outputs. It’s very fast, but it’s another thing that has to be trained. If that little linear layer doesn't learn correctly, the whole multi-hundred-billion parameter model is useless. It’s the ultimate "single point of failure" in an architecture that is supposed to be distributed.
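A minimal sketch of the router Herman describes, with toy two-dimensional tokens and stand-in "experts"; real routers score high-dimensional hidden states and the experts are full feed-forward networks, so everything here is a simplified illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_vec, router_weights, experts, k=2):
    # The "router" is one linear layer: a score row per expert.
    logits = [sum(w * x for w, x in zip(row, token_vec))
              for row in router_weights]
    probs = softmax(logits)
    # Keep only the top-k experts; everything else stays dark.
    topk = sorted(range(len(experts)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in topk)
    # Weighted average of the chosen experts' outputs.
    return sum(probs[i] / norm * experts[i](token_vec) for i in topk)

# Toy experts: each just scales the summed input differently.
experts = [lambda v, s=s: s * sum(v) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.2], [0.0, 0.0]]
print(route([1.0, 1.0], router_weights, experts, k=2))
```

If the `router_weights` layer is mis-trained, the wrong experts fire regardless of how good the experts themselves are, which is the "single point of failure" in the dialogue above.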
Corn
But for the rest of us—the developers, the startups, the people running stuff on-prem—the dense model is the "Empire" that never really left. It is the stable foundation.
Herman
It really is. And I think we are going to see a "hybrid" era soon. We are already seeing research into things like "Dense-MoE" hybrids or models that use MoE for some layers but stay dense for others to maintain that fine-tuning stability. But for right now, if someone asks me what model they should use for a specialized business application, I almost always point them toward a high-quality dense model first. The headaches you save on deployment and fine-tuning are worth the extra few cents in compute cost.
Corn
It is funny how the hype cycle works. We all got blinded by the "eight-by-seven-B" and "sixteen-by-something" numbers, thinking bigger is always better. But in the deployment era, "fits on my machine" is the most important feature.
Herman
It is the ultimate feature. And what is wild is that the dense models are getting more efficient too. We are seeing things like Multi-head Latent Attention—which DeepSeek actually pioneered—being pulled back into dense architectures to shrink the Key-Value cache. The dense models are learning tricks from the MoE world to stay competitive.
Corn
Wait, can you explain the Key-Value cache thing? How does a dense model use an MoE trick for that?
Herman
So, the KV cache is the "short-term memory" the model uses while it's typing. It grows with every word. In the past, dense models had huge KV caches that would eat up all your VRAM. DeepSeek found a way to "compress" that memory using some very clever math. Now, dense models are adopting that same math. It means a 70B dense model can now handle a much longer conversation—like a 128k context window—without needing a supercomputer. The "Empire" is literally stealing the rebels' best technology to stay in power.
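Back-of-envelope cache sizes, assuming Llama-3-70B-like shapes (80 layers, 8 grouped-query KV heads of dimension 128, fp16) and a DeepSeek-style compressed latent of 512 dimensions per layer. The real MLA layout includes a small extra positional component, so treat these as order-of-magnitude figures:

```python
# Grouped-query attention caches K and V per layer per KV head.
def gqa_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_=2):
    return 2 * layers * kv_heads * head_dim * bytes_ * tokens / 1e9

# Multi-head latent attention caches one compressed vector per layer.
def mla_cache_gb(tokens, layers=80, latent_dim=512, bytes_=2):
    return layers * latent_dim * bytes_ * tokens / 1e9

ctx = 131072  # a 128k context window
print(f"GQA cache at 128k: {gqa_cache_gb(ctx):.1f} GB")
print(f"MLA cache at 128k: {mla_cache_gb(ctx):.1f} GB")
```

Under these assumed shapes the compressed cache is several times smaller, which is what lets a dense 70B handle long conversations on hardware that previously could not fit them.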
Corn
So, if we are looking at the scorecard, MoE wins on raw "knowledge capacity" per training dollar, but Dense wins on... well, everything else? Serving, stability, fine-tuning, and edge?
Herman
Pretty much. Dense is the "predictable" architecture. If I send a prompt to a dense model, I know exactly how many FLOPs are going to be used and exactly which parts of the chip are going to be hot. In a world where we are trying to optimize every millisecond of inference, that predictability is gold.
Corn
I think about Daniel’s prompt and his mention of "routing complexity." It reminds me of the early days of parallel computing. Everyone thought we would have these massive, complex arrays of specialized processors, but eventually, we just settled on making the general-purpose stuff really, really fast and efficient because the "management overhead" of the complex stuff wasn't worth the gain.
Herman
That is exactly what is happening. We are realizing that the "management overhead" of an MoE—the routing, the load balancing, the VRAM management—is a hidden cost that doesn't show up on a benchmark chart. But it shows up on your AWS bill and in your engineering team's stress levels. Plus, there's the "Expert Parallelism" issue. To run an MoE efficiently, you often have to split the experts across different chips. This creates a "communication bottleneck" where the chips are spending more time talking to each other than doing math.
Corn
It’s like a meeting that could have been an email. The chips are just waiting for the "expert" on the other side of the server rack to finish its sentence.
Herman
In a dense model, you can often fit the whole active "thought" on one chip or a very tightly coupled pair. The communication overhead is minimal.
Corn
So, for the listener who is sitting there trying to decide which model to pull from Hugging Face for their next project: what is the "Corn and Herman" rule of thumb here?
Herman
If you are prototyping, if you are fine-tuning on a specific domain, or if you are deploying to anything smaller than a massive GPU cluster, go Dense. Look at Llama three point three or the latest Qwen-Dense. They are robust, they are "quantization-friendly," and they won't break when you try to teach them your company's specific jargon.
Corn
And what about the "fun fact" side of this? Is there any weird trivia about these architectures that people might not know?
Herman
Here’s a fun one: the "mixture of experts" concept actually dates back to the early 1990s. It was proposed by Geoffrey Hinton and others long before we had GPUs. They had the idea, but they didn't have the hardware to make it work. It took thirty years for the hardware to catch up to the "librarian" theory. Meanwhile, the "dense" transformer is a much younger child, only really coming into its own in 2017. So in a weird way, the "Empire" is actually the younger, more modern architecture, and MoE is the "Old Guard" making a comeback.
Corn
That’s fascinating. So the "Empire" is the young upstart that streamlined everything, and now the ancient "Expert" philosophy is trying to reclaim the throne.
Herman
And if you are trying to build the next trillion-parameter world-brain and you have a hundred million dollars of venture capital to burn on H-one-hundreds?
Corn
Then you go MoE. You hire a team of three hundred PhDs to manage your routing stability and you bask in the glory of your training efficiency. But for the "Empire" of real-world applications, the dense transformer is still the king of the hill.
Herman
It is a classic story, isn't it? The flashy new tech gets the headlines, but the reliable, refined version of the old tech is what actually runs the world. I feel a lot better about my "boring" seventy-B dense models now.
Corn
You should! They are masterpieces of engineering. We are squeezing more "intelligence per parameter" out of dense models than we ever thought possible two years ago. The "Death of the Dense Model" was greatly exaggerated. It’s like the internal combustion engine—every time people say it’s reached its limit, someone finds a way to make it 20% more efficient.
Herman
Well, this has been a deep dive. I feel like my own "router" is a bit overheated from all these technical specs, but I think the takeaway is clear. Efficiency isn't just about FLOPs; it is about the total system. It's about how much it costs to store, how hard it is to train, and how reliable it is when a customer actually uses it.
Corn
And if you're a developer, you don't want to be the one waking up at 3 AM because your "routing" collapsed and your model started speaking in tongues because the coding expert tried to take over the poetry department.
Herman
I mean—you are right. It is about the whole stack, from the silicon to the VRAM to the person trying to get the model to follow a simple instruction.
Corn
Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power this show—whether we are running dense or MoE, Modal makes it easy to scale. This has been My Weird Prompts. If you are finding these deep dives helpful, leave us a review on Apple Podcasts or wherever you listen. It really helps other nerds find the show.
Herman
And you can always find the full archive and our RSS feed at myweirdprompts dot com. We’ve got some great episodes coming up on the future of specialized silicon, so stay tuned.
Corn
Until next time, stay curious and keep those VRAM budgets in check.
Herman
See you then.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.