You know, it is funny how certain ideas in technology become almost like religious dogmas. For the last five or six years, if you wanted to talk about artificial intelligence, you had to talk about GPUs. It was the only way. You had to have thousands of these high-end graphics cards humming away in a data center somewhere, or at the very least, a dedicated mobile chip with a neural engine. Anything else was considered a toy, or just plain inefficient. We have been living in this world where Nvidia is the sun and everything else is just a cold, distant planet.
It really was a one-track conversation. I am Corn, by the way, for anyone just joining us. And you are right, Herman. We have spent so much time obsessing over Nvidia's stock price and the sheer throughput of these massive clusters that we almost forgot about the most ubiquitous piece of silicon on the planet: the general-purpose central processing unit. But lately, that dogma is starting to crumble. Our housemate Daniel actually sent us a prompt about this very thing, wondering if we are entering a CPU-first era for artificial intelligence. He is looking at the landscape in early twenty twenty-six and seeing that the hardware in our pockets and on our desks is doing things we were told only a server farm could handle.
It is a fascinating question because it challenges the fundamental assumption that specialized hardware is always better. We have seen this cycle before in computing, where you move toward specialization for performance and then back toward general-purpose hardware for flexibility and cost. And today, on March eleventh, twenty twenty-six, we are seeing optimized code running on standard CPUs that rivals what we thought required a dedicated accelerator just eighteen months ago. The irony is delicious. We spent billions building these massive GPU factories, only to find that the developers were getting incredibly clever at making models smaller and more efficient.
While the world was fighting over H-one-hundred allocations, the open-source community was busy rewriting the rules. And the CPU manufacturers, like Intel and ARM, were not just sitting around. They have been baking specific instructions into the silicon that are designed to do exactly what a GPU does, but right there on the main processor. So today on My Weird Prompts, we are going to look at why the chip you already own might be the best AI engine for most of your daily tasks. We are diving into things like matrix extensions, quantization, and the real-world performance of models like Whisper running on nothing but a standard laptop processor.
This is really about the shift from training to inference. Everyone knows GPUs win at training. If you want to build a trillion-parameter model from scratch, you need the parallel power of thousands of cores. But when you are just running a model, the rules of the game change. The latency, the memory access, the power draw, it all looks different when you are just asking a question to a local model or asking your phone to transcribe a meeting.
Right, and I think we should start with that distinction. When people say CPUs are too slow for AI, they are usually thinking about the old way of doing things. They are thinking about a processor trying to crunch numbers one by one, which is what we call scalar math. But that is not how modern CPUs work anymore. If you try to run a large language model using traditional scalar math, it is painfully slow. It is like trying to empty a swimming pool with a teaspoon. But modern CPUs have moved far beyond that. We have had vector processing for a long time, things like AVX-five-hundred-twelve on Intel chips, which is like using a bucket instead of a teaspoon. But the real game-changer is what we are seeing now with matrix extensions.
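To put rough numbers on the teaspoon-versus-bucket analogy, here is a back-of-the-envelope Python sketch of how many instructions a scalar loop needs compared to a five-hundred-twelve-bit vector unit. All figures are illustrative assumptions, not benchmarks, and real speedups depend on much more than instruction counts.

```python
# Back-of-the-envelope: how much work one AVX-512 instruction replaces.
# Illustrative numbers only, not a benchmark.
VECTOR_WIDTH_BITS = 512
FLOAT32_BITS = 32
lanes = VECTOR_WIDTH_BITS // FLOAT32_BITS  # 16 floats per vector register

def scalar_instructions(n):
    """Scalar math: one multiply-add instruction per element."""
    return n

def vector_instructions(n):
    """Vector math: one instruction covers `lanes` elements at once
    (ignoring remainder handling at the tail of the loop)."""
    return -(-n // lanes)  # ceiling division

n = 4096  # e.g. one row of a small weight matrix
speedup = scalar_instructions(n) / vector_instructions(n)
```

With sixteen lanes, the instruction count drops by a factor of sixteen, which is the bucket-versus-teaspoon point in its simplest form.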
Before we get into the deep hardware architecture, let us talk about a concrete example that people can actually wrap their heads around. One of the most famous examples of this CPU-first success is Whisper dot c-p-p. This is a port of OpenAI's speech-to-text model, Whisper, rewritten in plain C and C-plus-plus and optimized to run on standard hardware. It was created by Georgi Gerganov, and it basically proved that the Python-heavy, GPU-dependent stack was not the only way to live.
That is the gold standard for this discussion. Whisper is a heavy model. If you run the original version, it wants a beefy GPU to give you real-time transcription. But then developers realized that if you bypass all the heavy software layers and write directly for the CPU instructions, you can get incredible speed. I have seen Whisper running on a standard Apple M-series chip or a modern Intel laptop where it transcribes audio faster than real-time without even waking up the fan. And it is not just Whisper. We are seeing Llama dot c-p-p doing the same for large language models.
And how are they doing that? Is it just better coding, or is there a hardware trick involved? Because you cannot just "code away" the laws of physics.
It is a bit of both, but the hardware trick is the foundation. It comes down to something called quantization. Most AI models are trained using very high-precision numbers, like thirty-two-bit or sixteen-bit floats. But it turns out that for inference, for just running the model, you do not need all that precision. You can squash those numbers down to eight bits, four bits, or even one-point-five-eight bits without losing much accuracy. We talked about this a bit in episode six hundred thirty-three when we were looking at the memory wars. If the numbers are smaller, you can fit more of them into the CPU's cache.
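The quantization idea Corn describes can be sketched in a few lines of Python: squash the floats into eight-bit integers sharing a single scale factor, then reconstruct approximate floats when you need them. The weight values here are made up for illustration, and real quantizers use per-block scales and cleverer rounding.

```python
# Toy symmetric 8-bit quantization: each weight becomes a 1-byte integer
# plus one shared float scale, instead of a 4-byte float.

def quantize_int8(weights):
    """Map floats onto the integer range [-127, 127] with a single scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.82, -1.3, 0.05, 0.41, -0.77]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Storage drops 4x (1 byte per weight), at the cost of `max_err` of drift.
```

The worst-case error is half a quantization step, which for typical weight ranges is small enough that the model's answers barely change.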
Right, the cache is that high-speed scratchpad that sits right next to the processor. And this leads us to the "memory wall." In modern AI, the bottleneck is rarely the math itself. We have gotten very good at math. The bottleneck is moving the data from the memory to the processor. Large language models are massive. Every time you generate a word, the CPU has to look at all the weights of the model. If those weights are sitting in slow system RAM, the processor spends ninety percent of its time just waiting for data to arrive.
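The memory wall turns into a ceiling you can actually compute: since generating one token has to read every weight once, tokens per second can never exceed memory bandwidth divided by model size. A rough Python estimate, with illustrative numbers for a seven-billion-parameter model quantized to four bits on a machine with about a hundred gigabytes per second of bandwidth:

```python
# Upper bound on decode speed for a memory-bound model:
# tokens/sec <= bandwidth / bytes the model occupies.
# All inputs below are illustrative assumptions.

def max_tokens_per_sec(params_billions, bits_per_weight, bandwidth_gb_s):
    model_bytes = params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# 7B parameters, 4-bit quantized, ~100 GB/s of memory bandwidth:
estimate = max_tokens_per_sec(7, 4, 100)
```

This is why quantization and cache size matter so much: halving the bits per weight doubles this ceiling without touching the math units at all.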
This is where the CPU actually has a sneaky advantage in some scenarios. If you have a massive L-three cache, like we see on some of the high-end workstation chips or the new gaming CPUs with three-D V-cache, you can keep more of the model right there next to the math units. But the real "secret sauce" Daniel asked about is the specialized extensions. Let us talk about Intel Advanced Matrix Extensions, or AMX. This was introduced in the fourth-generation Xeon server chips, and on the consumer side, architectures like Meteor Lake and the newer Lunar Lake and Panther Lake chips have been pushing the same idea with beefed-up vector instructions and dedicated neural engines.
Explain AMX like I am five, Corn. Because "matrix extension" sounds like something out of a math textbook that I would have failed.
Okay, imagine a traditional CPU core is a very smart accountant who can do one complex calculation at a time. A GPU is a stadium full of thousands of students who can all do simple addition at the exact same time. AMX is like giving that smart accountant a specialized calculator that can do a whole page of matrix multiplication in a single heartbeat. It is a dedicated block of silicon inside the CPU core that only does matrix math. It does not replace the core; it just gives it a superpower.
So, it is like having a tiny, very fast GPU built directly into the heart of the CPU core?
In a way, yes. But it is even better because it shares the same memory and the same cache as the rest of the processor. When you use a GPU, you often have to copy data from the system memory over to the GPU memory across a P-C-I-e bus. That creates a bottleneck. With something like AMX, the data is already there. You are not moving it across a slow bridge. You are just telling the CPU to switch into matrix mode for a few cycles. This is why the latency is so much better.
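The tile idea behind matrix engines like AMX can be sketched in plain Python: do the multiply in small fixed-size blocks, so everything a block touches stays hot near the math units. This toy version only demonstrates the blocking structure, not the speed; the block size and matrices are arbitrary examples.

```python
# Toy blocked matrix multiply. Real matrix engines work on small hardware
# tiles; here we just show that tiling computes the same answer while
# restricting each inner step to a cache-friendly block of data.

def matmul_blocked(A, B, tile=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # One tile of work: only a tile x tile patch of each
                # matrix is touched inside this block.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = 0.0
                        for p in range(p0, min(p0 + tile, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] += s
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul_blocked(A, B)
```

A hardware matrix unit does the innermost tile in one shot instead of three nested loops, which is where the "whole page of matrix multiplication in a single heartbeat" comes from.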
That is a crucial point. I think people often confuse throughput with latency. If I am a huge company like Google or Meta, I want throughput. I want to process ten thousand requests at once, so I use a massive cluster of H-one-hundreds or B-two-hundreds. But if I am a single user on a laptop and I want my AI assistant to respond instantly, I care about latency. I want that first word to appear as fast as possible. CPUs are actually excellent at low-latency, small-batch inference.
They really are. And ARM is doing something similar with the Scalable Matrix Extension, or SME. We are seeing this in the latest mobile chips and the new Windows-on-ARM laptops. SME is designed to be very flexible. It allows the processor to handle these large blocks of data in a way that can scale depending on how much power the chip has. It is not just about doing math faster; it is about doing it with less energy. This is why your phone can now do things like live translation or background blur in a video call without burning a hole in your pocket. It is using these specialized matrix instructions instead of the general-purpose ones.
I want to go back to the memory wall for a minute because you mentioned Apple Silicon. They have been the poster child for this "unified memory" approach. How does that play into the CPU-first paradigm?
Apple's M-series chips, especially the M-four and the newer M-five, are basically the blueprint for this. They put the RAM right on the same package as the chip. The CPU and the GPU share the same pool of high-bandwidth memory. This means if the CPU is doing some pre-processing and then the matrix units need to take over, there is zero copying of data. It is all just right there. When you combine that with a massive cache, you suddenly have a system that can feed an AI model much faster than a traditional PC with separate RAM sticks.
It is interesting to see how this is changing the market. For a long time, if you were a developer, you basically had to buy an Nvidia card if you wanted to do anything with AI. But now, if you are building an edge device, maybe a smart camera or a piece of industrial equipment, you might not need that extra chip. You can just use a modern Intel or ARM processor and get the job done. This brings us to Daniel's question about the future of AI hardware. Are we going to see CPUs that render dedicated accelerators unnecessary?
I think for the "edge," the answer is a resounding yes. Think about the cost. A dedicated AI accelerator chip adds twenty, fifty, or even a hundred dollars to the bill of materials for a device. It generates heat. It takes up space. If you can do that same work on the CPU that you already have to have in the box anyway, you simplify the whole design. We are seeing this in the automotive industry too. Cars have tons of sensors and need to make split-second decisions. Using the central processor for that inference instead of a separate board reduces the complexity of the car's computer system.
There is also a sovereignty and security angle here. As someone who follows the geopolitical side of tech, I find it interesting that relying on a single company like Nvidia for all AI compute is a huge risk. If we can run these models on general-purpose silicon that is manufactured by multiple different companies, including Intel here in the United States, it makes the whole AI ecosystem much more resilient. It democratizes the technology. You do not need a special permit or a ten-thousand-dollar piece of hardware to experiment with high-end AI. You just need a modern laptop.
That is a great point. But let us pivot a bit to where this approach fails. I do not want people to think that the GPU is dead. There has to be a threshold where the CPU-first approach just cannot keep up.
Oh, absolutely. The CPU is a decathlete. It is good at everything, but it is not the world record holder in any single specialized event. If you are trying to train a model from scratch, do not use a CPU. You will be waiting until the next century. Training requires massive parallelization that only a GPU or a TPU can provide. And if you are running a high-traffic web service where you need to process hundreds of requests every second, the throughput of a GPU cluster is still going to win. The CPU is for the "batch size of one."
If it is one user, one device, one model, the CPU is the king of efficiency and latency. But if it is one million users, you still need the big iron in the data center. However, think about how many AI applications are moving to the edge. Your phone's voice assistant, your email's autocomplete, the image editing tools in Photoshop, those are all single-user tasks. For the vast majority of what we call AI in our daily lives, the CPU-first paradigm is not just feasible, it is actually superior.
I am curious about the software side of this. We talked about Whisper dot c-p-p, but how hard is it for a regular developer to take advantage of things like AMX or SME? Do they have to write assembly code? Because if they do, this revolution is going to be very slow.
Thankfully, no. Most of this is being handled by libraries. Intel has the One-A-P-I and the Math Kernel Library. ARM has their own optimized libraries. And even the big frameworks like PyTorch and TensorFlow have added incredible support for these CPU instructions over the last year. So, as a developer, you might just flip a switch or use a specific version of a library, and suddenly your model is running five times faster on the same hardware. That is the real tipping point. When the tools make it easy, everyone starts doing it.
I am looking at some of the recent product launches, like Intel's Lunar Lake architecture. They are marketing it heavily as an "AI PC." But when you look under the hood, a big part of that story is just making the CPU itself better at these matrix operations, alongside a smaller NPU. It feels like the definition of a CPU is changing. It is no longer just the thing that runs the operating system. It is becoming a heterogeneous collection of specialized engines that all live under one roof.
Lunar Lake is a great example because it shows the hybrid approach. It has a CPU with great vector units, an integrated GPU that is quite fast, and a dedicated NPU, or Neural Processing Unit. The software can then decide which part of the chip is best for the task. If it is a tiny task that needs to run in the background with zero power, like noise cancellation, use the NPU. If it is a complex task that needs low latency, use the CPU with those matrix extensions. We are seeing the death of the standalone processor. Everything is becoming a system-on-a-chip.
This reminds me of our discussion in episode six hundred sixty-three, about the real cost of power and the divide between workstation and consumer hardware. We talked about how the hardware gap is narrowing. You can now get workstation-level AI performance on a high-end consumer laptop because these instructions have trickled down. It is no longer a "pro-only" feature.
It really has changed the game for mobile developers. If you are building an AI agent, which we discussed in episode four hundred seventy-seven, you need it to be responsive. You cannot have the user waiting three seconds for the phone to send a request to the cloud and get a response back. It has to feel local. And the only way to do that across millions of devices is to optimize for the CPU that is already in those devices.
Let us talk about the "agentic" workflow for a second. Why is the CPU's ability to handle branching logic superior to a GPU for specific agentic tasks?
That is a great technical nuance. A GPU is like a massive army that all has to march in the same direction. If you tell half the army to turn left and the other half to turn right, the GPU loses its efficiency. This is called "divergence." But AI agents often need to do exactly that. They need to say, "If the user asked for a calendar invite, go to this tool. If they asked for a summary, go to that tool." That is branching logic. CPUs are the masters of branching. They have incredibly sophisticated branch predictors. So when you have an AI model that is constantly making decisions and switching tasks, the CPU can actually outperform a GPU because it does not get bogged down by that logical complexity.
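That tool-switching logic can be sketched as a tiny Python router. The tool names and handlers here are made up for illustration; the point is the divergent control flow Corn is describing, where each request may take a completely different branch.

```python
# A toy agent router: branchy dispatch of the kind CPUs handle well.
# Tool names and handlers are hypothetical examples.

def make_calendar_invite(text):
    return f"invite: {text}"

def summarize(text):
    return f"summary: {text[:20]}"

TOOLS = {
    "calendar": make_calendar_invite,
    "summary": summarize,
}

def route(intent, text):
    # Each call may take a different branch; a branch predictor
    # learns the pattern, while a GPU warp would diverge here.
    handler = TOOLS.get(intent)
    if handler is None:
        return "unknown intent"
    return handler(text)
```

On a GPU you would want every lane doing the same work; an agent loop like this spends most of its time deciding which work to do next, which is exactly what CPU front-ends are built for.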
So the future isn't just "more math," it is "smarter math." I love that. Now, for our listeners who are actually looking to buy hardware or build something, what are the actual takeaways here? How should they evaluate this "CPU-first" world?
First, stop over-provisioning. If you are a small business or a developer, evaluate if your inference workload actually requires a GPU. You might be surprised to find that a modern Xeon or a high-end ARM server can handle your needs for a fraction of the cost of an Nvidia H-one-hundred instance. Second, focus on memory bandwidth over raw clock speed. If you are running local large language models, the speed of your RAM and the size of your CPU cache are going to matter more than whether your processor runs at four gigahertz or five gigahertz.
And what about the specific instruction sets? What should they look for on the spec sheet?
Look for hardware with specific I-S-A support. For Intel, AMX means a fourth-generation or newer Xeon; on the consumer side, check for strong vector support like AVX VNNI and an NPU in the newer mobile chips like Panther Lake. For ARM, you want to look for SME or SME-two support. And if you are in the Apple ecosystem, just know that the M-four and M-five chips have significantly beefed-up matrix units compared to the M-one or M-two.
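If you would rather check a machine you already own than a spec sheet, Linux exposes the CPU feature flags in /proc/cpuinfo. A small Python sketch: the flag strings "amx_tile" and "avx512f" are the real names the Linux kernel reports, and on other operating systems this simply returns an empty set.

```python
# Read CPU feature flags from /proc/cpuinfo (Linux only).
# Returns an empty set on systems where the file does not exist.

def cpu_feature_flags(path="/proc/cpuinfo"):
    try:
        with open(path) as f:
            for line in f:
                if line.lower().startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

flags = cpu_feature_flags()
has_amx = "amx_tile" in flags      # Intel AMX tile support
has_avx512 = "avx512f" in flags    # AVX-512 foundation instructions
```

On macOS or Windows you would use sysctl or CPUID-based tools instead; the point is that the spec-sheet question is answerable in a few lines.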
I also think we should mention the impact on the open-source community. Projects like Llama dot c-p-p have made it possible for people to run massive models on standard hardware. It has sparked this incredible wave of innovation because you do not need to rent an expensive cloud instance to experiment. You can just download a model and run it on your desktop. The open-source community has really been the driving force here. They were the ones who realized that we were leaving so much performance on the table by relying on these heavy, GPU-centric frameworks. They stripped it all back to the bare metal.
And once they did that, they realized, "Hey, this CPU is actually a beast if you talk to it the right way." We are still in the early stages. Imagine when we have CPUs where half the die area is dedicated to matrix math and cache. We are already seeing prototypes of chips with massive stacked L-three cache, like AMD's three-D V-cache technology. While that was originally designed for gaming, it is an absolute godsend for AI inference because it solves that memory bottleneck we were talking about.
So, looking ahead, what does this mean for the cloud-native AI business model? If everyone can run these models locally on their own hardware, why would they pay a monthly subscription to a big AI provider?
That is the multi-billion dollar question. I think we will see a split. The massive, god-like models that require trillions of parameters will stay in the cloud. But the specialized, personal models, the ones that know your schedule, your writing style, and your private data, those will live on your device. And they will run on your CPU because it is more private, more secure, and faster for that specific use case. It brings the control back to the user. And it makes the technology more accessible to people in parts of the world where high-speed internet is not a given.
It is a win for privacy, for sure. If the data never leaves my processor, I do not have to worry about some company using my personal notes to train their next model. It is a great reminder that software often lags behind hardware. We have had these powerful instructions in our chips for a few years, but it took the AI boom for developers to really figure out how to squeeze every last drop of performance out of them.
It really is. And it is a testament to the ingenuity of the engineers who are finding ways to do more with less. We spent years just throwing more hardware at the problem, more GPUs, more power, more data centers. Now we are seeing the era of optimization, where we are learning how to be smart about the silicon we already have. In five years, we won't talk about "AI chips" anymore because every chip will be an AI chip by default. It will just be part of what a computer is. Like being able to display graphics or connect to the internet. It is just a baseline capability.
That is when the technology really becomes interesting, when it becomes invisible. When you are not thinking about the matrix multiplication happening in your CPU, you are just interacting with a computer that understands you and helps you get things done. Well, I think we have covered a lot of ground here. From the technical nuances of AMX and SME to the broader market implications and the future of local AI. It is a lot to digest, but it is an exciting time to be watching this space.
It really is. And for our listeners, the next time you see a headline about a new GPU, just remember that the most important AI chip might already be sitting inside your laptop or your phone. It is just waiting for the right software to wake it up. If you are curious about the technical specs we mentioned, we will have a full breakdown in the show notes.
That is a great place to wrap things up. If you found this discussion helpful, or if you are now tempted to go and try running some local models on your own machine, let us know how it goes. We are always curious to hear about real-world performance. Maybe try Whisper dot c-p-p on that old laptop in your closet and see what happens.
And if you are enjoying My Weird Prompts, we would really appreciate it if you could leave us a review on your podcast app or on Spotify. It genuinely helps other people find the show and join our community. We have been doing this for over a thousand episodes now, and the community is what keeps us going.
Definitely. If you want to dive deeper into our archives, head over to myweirdprompts dot com. You can find all our past episodes there, along with an R-S-S feed if you want to subscribe directly. And if you are on Telegram, search for My Weird Prompts. We have a channel there where we post every time a new episode drops, so you will never miss a discussion.
Thanks to Daniel for sending in this prompt. It was a great excuse to geek out on some hardware architecture. It is not every day we get to talk about the difference between scalar and vector math in a way that actually matters for people's daily lives.
It really was. I could talk about cache hierarchy all day, but I think we should probably let people get back to their lives.
Fair enough. This has been My Weird Prompts. I am Herman Poppleberry.
And I am Corn. Thanks for listening!
We will see you in the next one.
Until next time!
So, just to be clear, you are the donkey and I am the sloth?
In spirit, Herman. In spirit. But on the silicon, we are all just matrix operations.
That was surprisingly deep. Alright, goodbye everyone.
Bye!