Daniel sent us this one — he wants to talk about how tightly models and hardware are coupled, whether that coupling changes depending on the type of inference you're doing, and if some pairings are basically locked in while others leave room to shop around. The real question underneath it, I think, is whether there's such a thing as optimal hardware, or if that answer shifts depending on what you're actually trying to accomplish.
Oh, this is the good stuff. And you know what's funny — most people still talk about this as if the answer is just "buy more GPUs," and the actual landscape has completely fragmented. It's not one market anymore. It's three or four different markets wearing the same trench coat.
Three or four markets in a trench coat. You've been waiting to use that one.
I have, and it landed. But seriously — training and inference are diverging so fast that the hardware that's optimal for one is increasingly wrong for the other. There's a ModulEdge analysis from a couple months ago that put it perfectly. Training is a cost center. Inference is the product. You train a model once, but you run inference millions of times a day. The economics are completely different.
That distinction seems obvious once you say it, but I don't think most people outside the infrastructure world have internalized what it means for hardware choices. If inference is the product, then cost per query and latency consistency matter way more than raw teraflops.
And here's where the coupling question gets interesting. GPUs are general-purpose parallel processors. They can do anything — training, inference, graphics, scientific simulation. But that generality means they're not optimized for the specific thing you need to do a billion times a day. An H100 draws up to seven hundred watts per chip, relies on complex memory hierarchies with variable latency, and requires batching queries to hit efficiency targets. For training, that's fine. For serving a chatbot that needs to respond in under a hundred milliseconds, it's expensive overhead.
The coupling is looser with GPUs — they'll run anything — but loose coupling means you're leaving efficiency on the table. Is that the tradeoff?
That's exactly the tradeoff. And different companies are placing their bets at different points on that spectrum. Let me walk through the main architectures, because each one represents a different point on what I'd call the rigidity spectrum. On one end, you've got ASICs — application-specific integrated circuits — where the chip is purpose-built for a specific model or model class. There's a paper from XgenSilicon on arXiv just a few weeks ago where they used reinforcement learning to co-design the chip architecture and the compiler for Llama three-point-one eight-billion. They got nearly thirty thousand tokens per second at three nanometers. But that chip runs that model, and basically nothing else.
Thirty thousand tokens a second is absurd. But you'd need to be serving that exact model at massive scale to justify the engineering cost of designing custom silicon for it.
Right, the NRE — the non-recurring engineering cost — is enormous. You need Google-scale volume. Which is exactly why Google built TPUs. They've got over five million TPU chips planned by twenty twenty-seven. Amazon built Inferentia. Microsoft is developing something called Athena. Meta is reportedly negotiating to buy billions of dollars worth of Google's TPU chips.
Wait, buying Google's chips? That's like Coca-Cola buying bottles from Pepsi.
It tells you how much the economics matter. Google's TPU v6e runs at about two dollars and seventy cents per chip-hour. NVIDIA's B200 is around five fifty. That's roughly a fifty percent cost difference, and on some workloads TPUs show up to four times better throughput per dollar. When you're serving inference to billions of users, those differences add up to real money fast.
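The arithmetic behind that comparison can be sketched in a few lines. The chip-hour rates are the figures quoted above; the tokens-per-second throughputs are purely illustrative assumptions, since real numbers depend heavily on the model and batch size:

```python
# Back-of-envelope cost-per-million-tokens comparison.
# Chip-hour rates are the figures from the discussion; throughput
# numbers are illustrative assumptions, not measured benchmarks.

def cost_per_million_tokens(chip_hour_usd: float, tokens_per_sec: float) -> float:
    """Dollars to generate one million tokens on a single chip."""
    tokens_per_hour = tokens_per_sec * 3600
    return chip_hour_usd / tokens_per_hour * 1_000_000

tpu_v6e = cost_per_million_tokens(chip_hour_usd=2.70, tokens_per_sec=900)   # assumed throughput
b200    = cost_per_million_tokens(chip_hour_usd=5.50, tokens_per_sec=1100)  # assumed throughput

print(f"TPU v6e: ${tpu_v6e:.3f} per million tokens")
print(f"B200:    ${b200:.3f} per million tokens")
```

Even with the B200 assumed faster per chip, the lower hourly rate dominates the per-token cost, which is the "throughput per dollar" framing in the quote.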
The hyperscalers are all building or buying custom silicon. That's the rigid end of the spectrum — tight coupling, maximum efficiency, zero flexibility. What's in the middle?
The middle is where things get really interesting, and it's where the type of inference starts to dictate the hardware. Let me talk about LPUs — language processing units. Groq pioneered this architecture, and NVIDIA actually acquired Groq for twenty billion dollars. Let that sink in. The company with over eighty percent of the AI training market spent twenty billion dollars specifically because inference requires different silicon than training.
Twenty billion is a loud admission that GPU dominance in training doesn't automatically translate to inference. What makes LPUs different?
Groq's LPU architecture uses on-chip SRAM — no external memory, no caches, no variable latency. If the compiler says an operation takes twenty-eight point five milliseconds, it takes exactly twenty-eight point five milliseconds every single time. On a GPU, latency varies depending on memory access patterns, cache hits, batch sizes. Groq claims three hundred to five hundred tokens per second on large language models versus around a hundred tokens per second on typical GPU setups.
It's faster and predictable. What's the catch?
The catch is memory. On-chip SRAM is about two hundred thirty megabytes. An H100 has eighty gigabytes of high-bandwidth memory. So large models have to be sharded across hundreds of LPU chips. The architecture works brilliantly for latency-sensitive inference, but you need a lot of chips.
Which brings us back to the coupling question. An LPU is less rigid than an ASIC — it'll run different models as long as they fit within the compiler's constraints — but it's still way more specialized than a GPU. You're making a bet on a particular inference profile.
NVIDIA's post-acquisition strategy is fascinating here. They've created this thing called the NVIDIA Groq Three LPX, co-designed with their Vera Rubin NVL72 platform. It's a heterogeneous architecture — Rubin GPUs handle prefill and decode attention, while the LPUs accelerate the latency-sensitive feed-forward and mixture-of-experts decode. The system dynamically routes different parts of the inference pipeline to different silicon.
Wait, so a single inference request gets split across two different types of processors in real time?
Modern inference is a relay race. The prefill phase — processing the input prompt — stresses compute and memory bandwidth differently than the decode phase, which generates tokens one at a time. Different sub-operations have different bottlenecks. NVIDIA's Dynamo orchestration software routes prefill to GPUs, hands off intermediate activations to LPUs for the feed-forward layers, and returns outputs to GPUs for continued token generation. They're claiming up to thirty-five times higher inference throughput per megawatt at four hundred tokens per second per user compared to the previous generation.
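That routing logic can be sketched as a toy dispatcher. The phase and operation names here are illustrative stand-ins, not the real Dynamo API, but they capture the split described above: prefill and decode attention on GPUs, decode-phase feed-forward and mixture-of-experts layers on LPUs:

```python
# Toy sketch of phase-aware routing in a heterogeneous inference system.
# Function and operation names are hypothetical illustrations, not the
# actual orchestration API.

def route(phase: str, op: str) -> str:
    """Pick a processor class for one sub-operation of an inference request."""
    if phase == "prefill":
        return "gpu"          # compute- and bandwidth-heavy prompt processing
    if phase == "decode" and op in ("feed_forward", "moe"):
        return "lpu"          # latency-sensitive, SRAM-resident layers
    return "gpu"              # decode attention and everything else

print(route("prefill", "attention"))      # gpu
print(route("decode", "feed_forward"))    # lpu
print(route("decode", "attention"))       # gpu
```

The real system also has to hand intermediate activations across the interconnect between chip types, which is why the scale-up bandwidth numbers matter so much.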
Thirty-five times per megawatt is not a marginal improvement. That's a step change. And it suggests the optimal hardware isn't a single chip at all — it's a system with different processors for different parts of the workload.
That's exactly the emerging paradigm. And it gets even more interesting when you look at the memory bandwidth numbers. The LPX system delivers forty petabytes per second of on-chip SRAM bandwidth and six hundred forty terabytes per second of scale-up bandwidth across two hundred fifty-six interconnected LPUs. For comparison, an H100 delivers about three point three five terabytes per second of HBM bandwidth. The bottleneck for a lot of inference workloads has shifted from compute to how fast you can move data to the compute units.
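One way to see why bandwidth, not compute, is the ceiling: at batch size one, every generated token has to stream the full set of model weights through the compute units, so memory bandwidth directly caps tokens per second. A rough roofline-style sketch, with the model size and precision as illustrative assumptions:

```python
# Roofline-style upper bound for batch-1 decode throughput:
# each token reads every weight once, so
#   tokens/sec <= memory_bandwidth / model_size_in_bytes.
# Model size and precision are illustrative assumptions; the 3.35 TB/s
# figure is the H100 HBM bandwidth from the discussion.

def decode_tokens_per_sec_bound(params_billion: float,
                                bytes_per_param: float,
                                bandwidth_tb_s: float) -> float:
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# A hypothetical 70B-parameter model at one byte per weight:
print(decode_tokens_per_sec_bound(70, 1.0, 3.35))   # ~48 tokens/s ceiling on one H100
print(decode_tokens_per_sec_bound(70, 1.0, 80.0))   # far higher with more aggregate bandwidth
```

Batching amortizes the weight reads across queries, which is why GPUs want big batches — and why single-user interactive decode is exactly where SRAM-heavy architectures pull ahead.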
The coupling between model and hardware isn't just about the chip architecture. It's also about the memory hierarchy, the interconnect fabric, the orchestration layer. The "hardware" in this equation is really the whole system.
That's why the type of inference matters so much. Let me give you a concrete example. If you're doing batch processing — say, classifying millions of documents overnight — you care about throughput. You can batch queries, hide latency, maximize utilization. GPUs are great at this. But if you're running an AI coding assistant where a developer is waiting for autocomplete suggestions, latency compounds across every interaction. A hundred-millisecond delay per suggestion, multiplied across hundreds of suggestions per session, becomes a real productivity drain.
Agentic AI takes this to an extreme, doesn't it? If you've got multi-agent swarms with tight tool-calling loops, every step adds latency. A model that calls five tools to answer a query now has five round trips, each one adding overhead.
This is exactly why NVIDIA's technical blog frames the LPX as "the agentic inference engine." They're targeting a thousand tokens per second per user — what they call "speed of thought" computing. The idea is that if an AI can generate responses faster than a human can read them, the interaction feels instantaneous. But getting there requires deterministic, low-jitter, compiler-orchestrated hardware. Current GPU-centric data centers aren't optimized for that workload profile.
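The compounding effect of per-step latency in an agentic loop is easy to put numbers on. All the figures here are illustrative assumptions, but they show why per-user token rate dominates the experience once tool calls multiply:

```python
# How per-step latency compounds in an agentic tool-calling loop.
# Token counts, throughputs, and overheads are illustrative assumptions.

def agent_response_time(tool_calls: int,
                        tokens_per_call: int,
                        tokens_per_sec: float,
                        tool_overhead_ms: float) -> float:
    """Total wall-clock seconds for one query that triggers N tool-calling steps."""
    generation_s = tool_calls * tokens_per_call / tokens_per_sec
    overhead_s = tool_calls * tool_overhead_ms / 1000
    return generation_s + overhead_s

# Five tool calls, 200 tokens of reasoning each, 50 ms overhead per round trip:
gpu_like = agent_response_time(5, 200, tokens_per_sec=100,  tool_overhead_ms=50)
lpu_like = agent_response_time(5, 200, tokens_per_sec=1000, tool_overhead_ms=50)

print(f"~100 tok/s per user:  {gpu_like:.2f} s")   # 10.25 s
print(f"~1000 tok/s per user: {lpu_like:.2f} s")   # 1.25 s
```

A ten-second answer versus a one-second answer is the difference between a tool you wait on and a tool that feels instantaneous, which is the "speed of thought" argument in practice.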
The type of inference — batch versus interactive, single-model versus agentic, text versus multimodal — directly determines where on the rigidity spectrum you should be. But most companies aren't Google. They can't afford to design custom ASICs or deploy hundreds of LPUs. Where does that leave everyone else?
This is the practical question, and I think the answer is that GPUs remain the pragmatic default for most organizations, but the definition of "GPU" is expanding. NVIDIA's heterogeneous strategy with the LPX is designed to keep customers in their ecosystem — you buy Rubin GPUs and LPX accelerators, and Dynamo handles the orchestration. You get some of the benefits of specialization without having to become a chip designer.
You're still locked into NVIDIA's ecosystem. CUDA, their software stack, their interconnects.
That lock-in is a real concern. The software fragmentation is becoming a nightmare. You've got CUDA for NVIDIA, whatever Google's TPU software stack is called this year, Amazon's Inferentia SDK, the LPX toolchain. If you build your inference pipeline for one, switching to another is painful. It's not just about the hardware cost — it's the engineering cost of porting your entire serving infrastructure.
This is where I think the rigidity spectrum becomes a strategic question, not just a technical one. A tightly coupled ASIC gives you the best efficiency but the least flexibility and the highest switching cost. A GPU gives you flexibility but leaves efficiency on the table. Where you land depends on your scale, your workload predictability, and your tolerance for vendor lock-in.
There's another dimension I haven't mentioned yet — edge inference. NPUs, neural processing units, are designed for on-device AI. Phones, laptops, smart cameras. A KAIST benchmark from last year showed NPUs delivering up to sixty percent faster inference than modern GPUs while using forty-four percent less power on matrix-vector workloads.
Forty-four percent less power is huge for battery-constrained devices. But I assume there's a tradeoff.
GPUs remain superior for large-dimension matrix multiplication and high-throughput batch processing. NPUs are efficiency maximalists for the specific kinds of operations that dominate edge inference — small batches, latency-sensitive, power-constrained. They're achieving thirty-eight to sixty TOPS per watt, which is dramatically higher than GPU efficiency.
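The TOPS-per-watt gap translates directly into energy per inference. The NPU efficiency figure below is the mid-range of the thirty-eight to sixty TOPS per watt quoted above; the GPU-class figure and the per-inference operation count are illustrative assumptions:

```python
# Rough energy-per-inference comparison from TOPS/W figures.
# NPU efficiency is mid-range of the 38-60 TOPS/W cited; the GPU-class
# figure and op count are illustrative assumptions.

def joules_per_inference(ops_per_inference: float, tops_per_watt: float) -> float:
    # TOPS per watt is equivalent to tera-operations per joule.
    return ops_per_inference / (tops_per_watt * 1e12)

ops = 2e9  # assume ~2 billion ops for one pass of a small on-device model
npu = joules_per_inference(ops, 50)   # mid-range NPU efficiency
gpu = joules_per_inference(ops, 5)    # assumed GPU-class efficiency

print(f"NPU: {npu * 1e3:.3f} mJ, GPU: {gpu * 1e3:.3f} mJ, ratio: {gpu / npu:.0f}x")
```

On a phone battery holding on the order of tens of kilojoules, a tenfold difference per inference is the difference between an always-on feature and one that drains the device.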
Even within inference, we're seeing fragmentation. Cloud inference favors one set of architectures, edge inference favors another, agentic workloads favor a third. The idea of "optimal hardware" only makes sense relative to a specific workload profile.
The workload profiles are multiplying. Two years ago, most inference was basically "send a prompt, get a response." Now you've got chain-of-thought reasoning that generates thousands of tokens internally before producing an answer. You've got vision-language models that process images and text together. You've got real-time voice models that need to handle streaming audio with imperceptible latency. Each of these stresses the hardware differently.
Let me push on something. You mentioned NVIDIA's thirty-five times throughput improvement per megawatt. Those numbers are always from NVIDIA's own benchmarks. How much of this is real versus marketing?
The thirty-five times figure is comparing the LPX-plus-Rubin system against the previous GB200 NVL72 generation specifically on interactive inference workloads at four hundred tokens per second. It's a real architectural improvement — the LPU's deterministic latency and on-chip SRAM genuinely change the efficiency equation for decode-heavy workloads. But it's also a carefully chosen comparison point. If you're doing training or high-batch inference, the improvement is much smaller.
The coupling between model and hardware also determines how much you should trust the vendor's benchmarks.
And this is where I think the industry conversation is missing something important. Everyone is focused on the chips — GPUs versus LPUs versus TPUs versus NPUs. But the data center infrastructure itself is fragmenting. Training clusters need thirty to a hundred-plus kilowatts per rack with liquid cooling. Inference farms might run on advanced air cooling with completely different density profiles. Traditional two-year data center build timelines create technology risk when the chip landscape shifts this fast.
If you build a data center optimized for today's GPU clusters, and in eighteen months the optimal architecture is LPU-based with different power and cooling requirements, you've got a stranded asset problem.
Modular, reconfigurable infrastructure is becoming a strategic necessity, not a nice-to-have. It's a knock-on effect of the model-hardware coupling question that most discussions completely miss. The hardware choice cascades into the facility choice, which cascades into the financing choice.
We've got a spectrum from rigid to flexible. ASICs on one end — maximum efficiency, minimum flexibility, massive upfront cost. GPUs on the other — maximum flexibility, lower efficiency, easier to deploy. LPUs and NPUs somewhere in the middle. And the hyperscalers are all placing different bets. Is anyone actually getting this right?
I think "right" depends on your timeframe. In the short term, NVIDIA's heterogeneous strategy is brilliant — they're saying "you can have specialized inference hardware without leaving our ecosystem." In the long term, I suspect we'll see more fragmentation, not less. Google has the scale to justify custom everything. Amazon will keep building Inferentia because they control the entire stack from silicon to customer. Microsoft has Azure and OpenAI, which gives them similar vertical integration.
The open-source models complicate this further. If you're serving Llama or Mistral or DeepSeek, you're not tied to a specific vendor's model architecture. You can shop around for the hardware that runs your chosen model most efficiently.
Which creates an interesting dynamic. The model providers want their architectures to run well on commodity hardware because that maximizes adoption. The hardware vendors want models that showcase their unique capabilities. There's a co-evolution happening that's going to shape both sides.
Let's bring this back to Daniel's question. Are some pairings more rigid than others? I think the answer is clearly yes. An ASIC designed for a specific model is about as rigid as it gets — you're marrying the silicon to the weights. An LPU is less rigid but still optimized for a particular inference profile. A GPU is the most flexible but pays for that flexibility in efficiency. And the "optimal" hardware isn't a single answer — it's a function of your workload, your scale, your latency requirements, and your tolerance for vendor lock-in.
I'd add that the type of inference is the most underappreciated variable. Most people think about "inference" as one thing. It's not. Batch inference, interactive chat, agentic loops, edge inference, real-time voice — these are different workloads with different bottlenecks. The hardware that's optimal for one might be mediocre for another.
That's a good framework. If you're running a coding assistant with sub-second latency requirements and tool-calling loops, you want something closer to the LPU end of the spectrum — deterministic, low-latency, optimized for decode. If you're doing overnight batch processing of customer support transcripts, a GPU cluster is probably fine. If you're Apple shipping on-device AI, you want an NPU that maximizes performance per watt.
If you're Google serving Gemini to a billion users, you build your own TPUs and optimize the entire stack from silicon to serving infrastructure. The coupling gets tighter as the scale increases.
There's one more angle I want to touch on. We've been talking about efficiency and cost, but there's also a reliability dimension. If your inference hardware has variable latency, your users experience inconsistent response times. Sometimes it's fast, sometimes it's slow. That variability can be worse for user experience than consistently moderate latency.
And it's exactly what the LPU architecture addresses. Deterministic latency means you can make reliable promises about response times. For applications like autonomous driving or real-time translation, predictability matters as much as raw speed.
The coupling question isn't just about performance — it's about the entire service-level agreement you can offer. Tighter coupling gives you more predictable behavior. Looser coupling gives you more flexibility but less predictability.
We haven't even talked about cost. The ModulEdge analysis pointed out something that should be obvious but often isn't. Training is a one-time cost per model. Inference is recurring — every query costs money, forever. If you're spending ten million dollars on training and a hundred million on inference over the model's lifetime, optimizing inference hardware has ten times the financial impact of optimizing training hardware.
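That ten-times leverage falls straight out of the numbers in the quote. A minimal sketch using the illustrative dollar figures from the discussion:

```python
# Why inference optimization has outsized financial leverage: it recurs.
# Dollar figures are the illustrative ones from the discussion.

TRAINING_COST = 10_000_000              # one-time, per model
INFERENCE_COST_LIFETIME = 100_000_000   # recurring over the model's lifetime

def lifetime_savings(pct_improvement: float) -> tuple[float, float]:
    """Dollars saved by a given fractional efficiency gain on each side."""
    return (TRAINING_COST * pct_improvement,
            INFERENCE_COST_LIFETIME * pct_improvement)

train_saved, infer_saved = lifetime_savings(0.10)  # a 10% efficiency gain
print(f"10% better training hardware saves  ${train_saved:,.0f}")
print(f"10% better inference hardware saves ${infer_saved:,.0f}")
```

The same ten percent engineering win is worth ten times more applied to the recurring side of the ledger, which is the whole thesis of the "inference is the product" framing.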
Which explains why NVIDIA spent twenty billion on Groq. They saw where the recurring revenue is.
The inference market is where the real money is, and it's up for grabs. Google, Amazon, Microsoft, Qualcomm — they're all building custom inference silicon. NVIDIA's GPU dominance in training doesn't automatically translate to inference, and they know it.
And now: Hilbert's daily fun fact.
The average cumulus cloud weighs about one point one million pounds — roughly the same as a herd of a hundred African elephants floating above your head.
What should listeners actually do with all this? If you're building or buying AI infrastructure, here's my practical takeaway. Start by defining your inference workload profile. What's your latency budget? What's your query volume? Are you doing batch processing or interactive serving? Agentic or single-shot? Edge or cloud? The answers will point you toward different points on the rigidity spectrum.
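That profiling checklist can be captured as a toy decision function. The categories, thresholds, and recommendations here are hypothetical illustrations of the spectrum discussed in the episode, not a procurement guide:

```python
# Toy mapping from a workload profile to a point on the rigidity spectrum.
# All categories and thresholds are hypothetical illustrations.

from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    latency_budget_ms: float   # per-response latency target
    queries_per_day: int
    interactive: bool          # is a user waiting on each response?
    agentic: bool              # multi-step tool-calling loops?
    on_device: bool            # battery/power-constrained edge deployment?

def suggest_hardware(p: WorkloadProfile) -> str:
    if p.on_device:
        return "NPU (performance per watt dominates)"
    if p.agentic or (p.interactive and p.latency_budget_ms < 200):
        return "LPU-style deterministic low-latency silicon"
    if p.queries_per_day > 1_000_000_000:
        return "custom ASIC/TPU (scale justifies the NRE)"
    return "GPU cluster (flexibility wins at this scale)"

coding_assistant = WorkloadProfile(100, 5_000_000, True, True, False)
overnight_batch = WorkloadProfile(60_000, 10_000_000, False, False, False)
print(suggest_hardware(coding_assistant))
print(suggest_hardware(overnight_batch))
```

The point isn't the specific cutoffs — it's that the recommendation changes entirely depending on which profile fields dominate, which is the whole argument against a single "optimal" answer.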
Don't optimize for today's models. The models are evolving faster than the hardware refresh cycle. If you're building infrastructure now, it needs to serve models that don't exist yet. That argues for some flexibility — unless you're at hyperscale and can afford to refresh custom silicon every generation.
The other practical point is to watch the software layer. The hardware is fragmenting, but the orchestration software — things like NVIDIA Dynamo — is trying to abstract that fragmentation away. If the software layer gets good enough, you might be able to mix and match hardware without rewriting your entire serving stack.
If you're a smaller organization, the pragmatic answer is probably still GPUs, but with an eye on the LPU and NPU space. The efficiency gains are real, and as the software ecosystem matures, the switching costs will come down. You don't need to be an early adopter, but you should be an informed follower.
One more thing. When vendors quote you efficiency numbers, ask what workload they're benchmarking. A thirty-five-times improvement on interactive inference at four hundred tokens per second is impressive, but it might be five percent on your specific batch workload. The type of inference determines whether the number is relevant to you.
Ask about the total cost of ownership, not just the chip cost. Power, cooling, rack density, software licensing, engineering time to port your models — all of that matters more than the per-chip price.
Here's the forward-looking question I keep coming back to. If the industry is fragmenting from "GPUs for everything" to a heterogeneous mix of specialized processors, what does that mean for the software ecosystem? CUDA's dominance was built on GPU ubiquity. If inference moves to LPUs and NPUs and TPUs and ASICs, does the software layer fragment too? Or does someone build the abstraction layer that makes heterogeneous inference as easy as writing PyTorch?
That's the trillion-dollar question. NVIDIA is betting they can be that abstraction layer — Dynamo plus CUDA plus their hardware portfolio. Google is betting on vertical integration. The open-source community is betting on frameworks that target multiple backends. I don't think anyone knows how it plays out, but whoever solves the software problem wins the inference market.
Thanks to our producer Hilbert Flumingtop. This has been My Weird Prompts. You can find every episode at myweirdprompts dot com. We'll be back with another one soon.
Oh, and by the way — DeepSeek V four Pro wrote our script today. Not bad for a model that doesn't even have thumbs.
I've never needed thumbs to outthink a donkey.
That's debatable.