#633: Memory Wars: The Future of Local Agentic AI

Can your PC handle the next wave of AI agents? Herman and Corn dive into VRAM, quantization, and the future of running LLMs locally.

Episode Details
Duration: 27:25
Pipeline: V4

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

In the latest episode of My Weird Prompts, recorded in their living room in Jerusalem, brothers Herman and Corn Poppleberry took a deep dive into the rapidly evolving world of local artificial intelligence. Triggered by a question from their housemate Daniel, the duo explored a growing tension in the tech world: while AI models are becoming more efficient, the "agentic" workflows that allow AI to actually perform tasks—rather than just talk—are demanding more hardware power than ever before.

The State of Local AI in 2026

Herman kicked off the discussion by setting the scene for early 2026. He noted that the "baseline" for local performance has shifted dramatically. The Llama 4 series, particularly the 8-billion parameter model, has become the standard, significantly outperforming the much larger Llama 3.1 70B of a couple of years ago. Meanwhile, models like Mistral NeMo and Microsoft's Phi-4 are "punching way above their weight class" in the 10-to-14-billion parameter range.

However, the hosts pointed out that the size of the model is no longer the only—or even the primary—concern for users. The real bottleneck has shifted to how these models interact with data through protocols like the Model Context Protocol (MCP).

The "Working Memory" Problem

One of the most insightful parts of the discussion centered on the difference between a model's "weights" and its Key-Value (KV) cache. Corn offered a vivid analogy, comparing a model to a person who knows everything but lacks working memory. While a model's "brain" (the weights) might fit into 8GB of VRAM, the moment it begins a conversation it needs additional space to store the conversation history and everything it is currently processing.

With the advent of agentic AI, which might need to "read" an entire local codebase or a massive library of documents via MCP, the context window requirements have exploded. Herman explained that a 128,000-token context window can require a cache that exceeds the size of the model itself. When this memory spills over from the fast Video RAM (VRAM) of a graphics card into the slower system RAM, performance collapses from a snappy 80 tokens per second to a glacial two or three, effectively "ruining" the agentic experience.
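The arithmetic behind this is easy to sketch. The snippet below estimates the cache for a 128,000-token window using illustrative layer and head counts rather than the published specs of any particular model; the exact figure depends on the architecture and on whether the cache itself is quantized.

```python
# Back-of-the-envelope KV cache estimate. The layer/head/dimension numbers are
# illustrative (roughly 8B-class with grouped-query attention), not the specs
# of any particular model.
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    # Each layer stores one Key and one Value vector per token per KV head.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value

cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=128_000)
print(f"KV cache at 128k tokens, fp16: {cache / 2**30:.1f} GiB")        # ~15.6 GiB

weights_4bit = 8e9 * 0.5  # 8B parameters at ~4 bits per weight
print(f"8B weights at ~4 bits:        {weights_4bit / 2**30:.1f} GiB")  # ~3.7 GiB
```

Even with these rough numbers, the cache alone dwarfs the quantized weights, which is exactly the spillover problem the hosts describe.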

Quantization: How Low Can We Go?

To combat these hardware limitations, the industry has turned to quantization—the process of compressing AI models by reducing the precision of their internal numbers. Herman highlighted techniques like HQQ (Half-Quadratic Quantization) and EXL2, which allow models to be compressed down to as little as 2.5 bits.

The takeaway was startling: compressed to around 2.5 bits, a model can retain roughly 95% of its original intelligence, and at 3 bits a Llama 4 70-billion parameter model can run on a single consumer-grade 24GB graphics card. However, Corn raised a critical point regarding reliability. For "agentic" use cases, where the AI must call functions and execute code, precision is paramount: a single misplaced comma in a generated function call can break the whole workflow. For serious work, 4-bit or 6-bit quantization remains the gold standard.
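For a rough sense of why those bit widths matter, the footprint arithmetic is simple. The figures below are nominal: formats like EXL2 and HQQ mix precisions across layers, so the effective bits per weight of a "3-bit" quant, plus the KV cache on top, ultimately decide whether a model fits a given card.

```python
# Approximate weight footprint at different quantization levels (nominal only;
# real formats mix precisions and the KV cache still needs headroom).
def weight_gib(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for bits in (16, 8, 6, 4, 3, 2.5):
    print(f"70B model at {bits:>4} bits/weight: {weight_gib(70, bits):6.1f} GiB")
# 16 bits -> ~130 GiB (data-center territory)
#  4 bits -> ~ 33 GiB (two consumer cards, or one 48 GB workstation card)
#  3 bits -> ~ 24 GiB (the single-24GB-card scenario discussed above)
```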

The Great Hardware Divide: PC vs. Mac

The discussion then turned to the practical hardware needed to survive this "memory war." For PC users, the 24GB of VRAM found in high-end cards like the NVIDIA RTX 3090, 4090, or 5090 is considered the "golden zone" for running sophisticated agents.

However, Herman argued that the most hope for the average user might actually lie with Apple. Because of Apple Silicon’s Unified Memory architecture, a Mac can share its system RAM with its GPU. This allows a Mac Studio with 128GB of RAM to run massive models that would require multiple expensive graphics cards on a traditional PC. This architectural advantage makes the Mac a powerhouse for local AI, as it can hold both a large model and a massive context window in high-speed memory simultaneously.

The Path Forward: Speculative Decoding and RAG

For those without high-end workstations, Herman and Corn discussed several emerging software "tricks" that could level the playing field. One such technique is "speculative decoding," where a tiny, fast model guesses the next few words and a larger, smarter model verifies them, potentially doubling or tripling generation speed.
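The control flow is simple enough to sketch. In the toy below, the draft and target "models" are placeholder functions; the point is the loop itself: the draft proposes a few tokens cheaply, the target checks them, and the agreed-upon prefix plus one corrected token is kept. In a real implementation the verification of all proposed tokens happens in a single forward pass of the large model.

```python
# Greedy speculative decoding reduced to its control flow. The two "models"
# are stand-in functions; in practice the draft is a small fast model and the
# target the large one, sharing a tokenizer.
def draft_next(tokens):            # cheap guesser
    return tokens[-1] + 1          # toy rule: "count up"

def target_next(tokens):           # expensive verifier, disagrees occasionally
    return tokens[-1] + 1 if tokens[-1] % 5 else tokens[-1] + 2

def speculative_step(tokens, k=4):
    proposal = list(tokens)
    for _ in range(k):                           # 1) draft proposes k tokens
        proposal.append(draft_next(proposal))
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):  # 2) target verifies them
        expected = target_next(proposal[:i])     #    (batched in real systems)
        if proposal[i] == expected:
            accepted.append(proposal[i])         # keep the agreed prefix
        else:
            accepted.append(expected)            # first mismatch: take the target's token
            break
    return accepted

seq = [1]
for _ in range(4):
    seq = speculative_step(seq)
print(seq)   # several tokens emerge per expensive verification round
```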

Another major shift is the move toward integrated Retrieval Augmented Generation (RAG). Instead of forcing the AI to keep every piece of information in its active "working memory," newer architectures allow the model to quickly search a local index and only pull in relevant snippets. Herman compared this to the difference between memorizing a whole book and being exceptionally fast at using an index. By offloading inactive context to system memory and only keeping the "active" parts on the GPU, even 16GB machines could remain viable for agentic tasks.
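A stripped-down version of that index-then-retrieve pattern looks like the sketch below. It scores chunks by simple word overlap purely for illustration; a real setup would use an embedding model and a vector store, but the shape is the same: search locally, pull in only the top few snippets, and keep the active context small.

```python
# Minimal retrieval-augmented prompt assembly. Word-overlap scoring stands in
# for real embeddings; only the top-ranked snippets ever reach the context window.
def overlap_score(query, chunk):
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def build_prompt(query, chunks, top_k=2):
    ranked = sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "The KV cache grows linearly with the length of the context window.",
    "Quantization reduces the precision of a model's weights to save memory.",
    "Unified memory lets the GPU address the same pool of RAM as the CPU.",
]
print(build_prompt("Why does the KV cache grow with context length?", docs))
```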

Conclusion

The episode concluded with practical advice for listeners looking to set up their own local agents. For those with 24GB of VRAM, Herman recommended the Llama 4 8B for speed or a quantized 70B for deep reasoning, paired with tools like OpenDevin.
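In practice, most of that stack reduces to repeated calls against a locally served model, typically through Ollama. The snippet below uses the Ollama Python client's chat interface as a minimal starting point; the model tag is a placeholder for whichever quantized model fits your card, and agent frameworks like OpenDevin or an MCP client layer their tool-calling on top of calls like this.

```python
# Minimal local chat call via the Ollama Python client (pip install ollama).
# The model tag is a placeholder -- substitute whatever you have pulled
# locally (run `ollama list` to see the exact tags on your machine).
import ollama

response = ollama.chat(
    model="llama-8b-instruct-q4",   # hypothetical tag for a fast 8B quant
    messages=[
        {"role": "system", "content": "You are a local coding agent. Be concise."},
        {"role": "user", "content": "Outline a plan to add unit tests to this project."},
    ],
)
print(response["message"]["content"])
```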

While the "hardware vs. software race" shows no signs of slowing down, the brothers expressed optimism. Through a combination of smarter compression, unified memory, and architectural innovations like speculative decoding, the power to run a personal, autonomous AI agent is moving out of the data center and into the home office.

Downloads

Episode audio (MP3), a plain-text transcript (TXT), and a formatted PDF transcript are available for download.

Episode #633: Memory Wars: The Future of Local Agentic AI

Daniel's Prompt
Daniel
I’d like to discuss the world of local AI and how it has developed. While programs like Ollama and models from Mistral and Microsoft are staples of the open-source community, there is a tension between producing small, quantized models and the massive VRAM requirements for agentic tool-calling and context windows like MCP. Is there hope that normal, non-supercomputer hardware can run agentic use cases locally? If so, what models and VRAM benchmarks are necessary for the kind of stable performance we see with tools like Claude Code?
Corn
Hey everyone, welcome back to My Weird Prompts. I am Corn, and I am sitting here in our living room in Jerusalem with my brother.
Herman
Herman Poppleberry, at your service. It is a beautiful day outside, but we are staying in to talk about one of our favorite subjects.
Corn
Exactly. Our housemate Daniel actually sent us a really thoughtful prompt today. He has been diving deep into the world of local artificial intelligence, and he is noticing a bit of a tug-of-war happening. On one hand, we have these incredible advancements in making models smaller and more efficient. But on the other hand, the cutting-edge agentic workflows, the kind of stuff where the A-I actually does work for you rather than just talking to you, those seem to require massive amounts of memory.
Herman
It is the classic hardware versus software race, Corn. Daniel is basically asking if there is a path forward for the average person who does not have a one hundred thousand dollar server in their basement. Can we actually run these complex, agentic tools locally without sacrificing performance?
Corn
It is a great question because it feels like the goalposts keep moving. Every time we get a model that fits on a standard consumer graphics card, the industry moves toward a new protocol or a new way of using models that suddenly requires three times as much memory. So, Herman, I want to start with the basics. Where are we right now in early twenty-six? If someone downloads Ollama today, what is the state of the art for local models?
Herman
Well, it is actually a pretty exciting time. We have moved past the era where you needed a massive cluster just to get basic reasoning. In early twenty-six, the big news is the Llama four series. The Llama four eight-billion parameter model is now the baseline for almost everything. It is significantly smarter than the old Llama three point one seventy-billion model from a couple of years ago. We are also seeing the Mistral NeMo and Microsoft Phi-four models punching way above their weight class in the ten-to-fourteen billion parameter range.
Corn
Right, but Daniel pointed out a specific tension. He mentioned things like the Model Context Protocol, or M-C-P. For those who are not following the technical blogs every day, could you explain why something like that changes the hardware requirements? It is not just about the size of the model itself, right?
Herman
Exactly. This is the part that often catches people off guard. When you look at a model, you usually think about the weights. That is the actual file size of the A-I brain. If a model is quantized down to four-bit or eight-bit precision, you might fit an eight-billion parameter model into about six or eight gigabytes of Video Random Access Memory. Most modern graphics cards have eight or twelve gigabytes, so that fits easily. But that is just the brain sitting idle.
Corn
It is like a person who knows everything but does not have any working memory.
Herman
That is a perfect analogy. As soon as you start a conversation, the A-I needs what we call a Key-Value Cache, or K-V Cache. This is the space in your graphics card memory that stores the history of the conversation and all the data the A-I is currently processing. When you use something like the Model Context Protocol, you are essentially giving the A-I a massive library of documents, databases, and tools to look at all at once. If you want a context window of one hundred twenty-eight thousand tokens, the memory required for that cache can actually exceed the size of the model itself. In twenty-six, with M-C-P allowing agents to browse your entire local file system, that cache fills up instantly.
Corn
So, if I am using an A-I agent to analyze a large codebase, like with the tools Daniel mentioned, it is not just reading one file. It is keeping the structure of the whole project in that cache. And if that cache spills over your available Video Random Access Memory, everything grinds to a halt.
Herman
It does. It either crashes or it starts using your system R-A-M, which is significantly slower. We are talking about a drop from maybe eighty tokens per second to maybe two or three tokens per second. At that point, the agentic experience is ruined. You cannot have a productive workflow if you are waiting five minutes for every code refactor.
Corn
This brings us to the core of Daniel's question. Is there hope for normal hardware? When we say normal, let us define that. Are we talking about the average gaming laptop with eight gigabytes of V-RAM, or are we talking about the higher-end consumer stuff like a Mac Studio or a P-C with a top-tier graphics card?
Herman
I think the definition of normal is shifting. If you are serious about local A-I in twenty-six, sixteen gigabytes of Video Random Access Memory is really the new baseline. If you have eight gigabytes, you are mostly limited to simple chat and very short context. But to answer Daniel's question about hope, yes, there is a lot of hope, and it is coming from two directions: better quantization and smarter architecture.
Corn
Let us talk about quantization first. I know we have mentioned it before on the show, but for the newer listeners, this is essentially the process of compressing the A-I. Instead of using high-precision numbers for the model's weights, we use lower-precision ones. Most people use four-bit quantization. How much further can we go before the A-I starts losing its mind?
Herman
We are actually seeing some incredible results with even lower bit-rates. There is a technique called H-Q-Q, or Half-Quadratic Quantization, and another called E-X-L-two. We are reaching a point where a model compressed down to two-point-five bits can still retain about ninety-five percent of the intelligence of the original uncompressed version. This is huge because it means you can fit a much larger, more capable model into a smaller memory footprint. For example, you can now run a Llama four seventy-billion parameter model on a single twenty-four gigabyte card if you use a three-bit quantization.
Corn
That is fascinating. So the model gets smaller, which leaves more room for that Key-Value Cache we were talking about. But does that solve the agentic problem? When an A-I is acting as an agent, it is calling tools, it is running code, it is checking its own work. That requires a very high level of reliability. Does a two-bit model have the precision to handle complex tool-calling without hallucinating?
Herman
That is the trade-off. For simple chat, a two-bit model is fine. But for agentic use cases where a single misplaced comma in a function call can break the whole process, you usually want at least four-bit or six-bit quantization. This is where the tension Daniel mentioned really lives. If you want the stability of something like Claude Code, you need a model that can follow strict logic.
Corn
So, let us look at the benchmarks. If someone wants to run a local agentic setup that feels as snappy and reliable as the cloud-based tools, what are the actual numbers they should be looking for?
Herman
If you want to run something like a fourteen-billion parameter model with a thirty-two thousand token context window, which is a sweet spot for coding agents, you are looking at needing about twenty-four gigabytes of Video Random Access Memory. That is exactly what you find on a high-end consumer card like the Nvidia R-T-X thirty-ninety, forty-ninety, or the current fifty-ninety series.
Corn
Okay, so that is the enthusiast level on a P-C. But what about the Mac users? We know Apple has been pushing their Unified Memory architecture. Does that change the math?
Herman
It changes the math completely. This is actually where the most hope for normal hardware lies. In a traditional P-C, your system memory and your graphics memory are separate. If you have sixty-four gigabytes of R-A-M but only eight gigabytes on your graphics card, you are still stuck. But on a Mac with Apple Silicon, the memory is shared. If you buy a Mac Studio with ninety-six or one hundred twenty-eight gigabytes of memory, the A-I can use almost all of it.
Corn
So a Mac user can run a massive seventy-billion parameter model locally, which would normally require two or three high-end graphics cards on a P-C.
Herman
Exactly. And that is where you start seeing that stable, agentic performance. When you have enough memory to keep the entire model and a massive context window in high-speed R-A-M, the agent can look at your whole project via M-C-P, think through the steps, and execute them without losing its place.
Corn
But Daniel's point about the supercomputer requirement still lingers. Most people do not have a Mac Studio with a hundred gigabytes of R-A-M. They have a MacBook Air or a standard laptop. Is there a version of this future where the sixteen-gigabyte machine is actually useful for agents?
Herman
I think there is, and it comes down to a concept called speculative decoding. This is a really clever trick where you use a tiny, lightning-fast model to guess what the next few words are going to be, and then a larger, smarter model checks those guesses. It can speed up the process by two or three times. If we can combine that with smarter memory management, where the A-I only loads the parts of the library it needs for a specific task, we could see agentic workflows on much smaller machines.
Corn
That makes sense. It is like having a fast-thinking assistant who handles the easy stuff and only interrupts the expert when they are unsure. But I want to push back a bit on the context window side. Even with those tricks, if the Model Context Protocol is feeding the A-I thousands of lines of documentation, that data has to sit somewhere. Are we going to see a shift in how these models are designed? Maybe they do not need to keep everything in the active cache?
Herman
You are hitting on a major research area right now. We are seeing a move toward what people call R-A-G, or Retrieval Augmented Generation, but integrated directly into the model's architecture. Instead of the model having to remember everything in its active memory, it has a way of quickly searching through a local database and only pulling in the relevant snippets. It is like the difference between memorizing a whole book versus being really fast at using an index.
Corn
That would definitely lower the V-RAM requirements. If the A-I only needs to hold a few thousand tokens of active context but can swap things in and out of the index instantly, then the sixteen-gigabyte card becomes a powerhouse again.
Herman
Precisely. And we are already seeing this with some of the newer local A-I frameworks. They are getting much better at offloading the inactive parts of the conversation to your regular system memory and only keeping the active part on the graphics card. It is not quite as fast as having everything on the card, but for an agent that is working in the background while you do other things, it is perfectly acceptable.
Corn
I think this is a good moment to pivot to the practical side. If a listener is inspired by Daniel's prompt and wants to set up a local agent today, what is the actual stack? What models should they be looking at if they have, say, a twenty-four gigabyte card?
Herman
If you have twenty-four gigabytes, you are in the golden zone. I would recommend looking at the Llama four eight-billion parameter version for extreme speed, or a quantized Llama four seventy-billion parameter version if you want deep reasoning. For the agentic side, tools like OpenDevin or the local implementations of the Model Context Protocol are great. You can connect them to your local files and let them go to work.
Corn
And what about the specific model families? Daniel mentioned Mistral and Microsoft's Phi models. Are those still the leaders for the smaller, agentic use cases?
Herman
Absolutely. Microsoft's Phi-four models are incredible for their size. They are specifically trained to be good at reasoning and tool-calling, which is exactly what you need for an agent. They fit into almost any modern hardware. If you are on a budget, a Phi-four model quantized to four bits will run on almost anything and give you surprisingly good results for coding and task management.
Corn
It is funny because we have talked about this before, maybe back in episode four hundred and fifty or so, when local L-L-Ms were just starting to get good. Back then, the idea of an agent running on a laptop was almost science fiction. The fact that we are now debating the nuances of V-RAM benchmarks for a local version of Claude Code is just wild.
Herman
It really is. And the pace of development is not slowing down. Every week there is a new optimization technique. I think the most important thing for people to realize is that the hardware you buy today for A-I is going to get more capable over time because the software is getting so much more efficient. Usually, software gets bloated and slower over the years, but in the A-I world, it is the opposite. We are learning how to do more with less.
Corn
That is a very encouraging thought. It means that the investment in a decent graphics card or a high-memory Mac isn't just for today's models; it is a foundation for whatever breakthroughs come next year.
Herman
Exactly. And I think we should talk about the why for a second. Why bother with all this local setup when you can just pay twenty dollars a month for a cloud service that is faster and more powerful?
Corn
That is the big question, right? For a lot of people, it is about privacy and control. If you are a developer working on a proprietary codebase, you might not want to send every single file to a third-party server. Or if you are a writer working on a sensitive project. Having that agent live entirely on your machine, with no internet connection required, that is a huge peace of mind.
Herman
It is not just peace of mind; it is also about the lack of restrictions. Cloud-based models have all these guardrails and filters that can sometimes get in the way of productive work, especially if you are doing complex research or creative writing. A local model does exactly what you tell it to do. It is your tool, completely under your control.
Corn
And let us not forget the cost over time. If you are a heavy user, those A-P-I credits for the top-tier models can add up fast. Once you have the hardware, running a local model is essentially free, minus the electricity cost. For an agentic workflow that might involve hundreds of back-and-forth calls to the model, that can save you a lot of money in the long run.
Herman
Definitely. I have seen people run local agents that spend all night refactoring a codebase or analyzing thousands of documents. If you did that with a high-end cloud model, you could wake up to a three hundred dollar bill. Locally, you just wake up to a finished project and a slightly warmer room.
Corn
Speaking of a warmer room, I know you have been experimenting with some of the newer cooling solutions for your home setup. Is that something people need to worry about if they are running these agents for hours at a time?
Herman
You know me, Corn, I love a good liquid-cooled loop. But for most people, a well-ventilated case is enough. These graphics cards are designed to handle heavy loads. The main thing is just being aware that your computer will be working hard. It is not like browsing the web; it is more like playing a high-end video game at max settings.
Corn
So, to summarize the answer for Daniel, is there hope? Yes. But it requires a bit of a shift in expectations. You might not be running the absolute largest model on a standard laptop, but with the right quantization and the right choice of model, like the Phi or Mistral series, you can absolutely have a functional, agentic A-I assistant.
Herman
And if you are willing to step up to that sixteen or twenty-four gigabyte V-RAM threshold, you can get very close to that Claude Code experience. We are not quite at the supercomputer-only stage anymore. The middle ground is expanding.
Corn
I think that is a great place to take a quick break in our discussion. When we come back, I want to dive into the specific software tools that make this possible. We have talked about the brains and the memory, but how do you actually hook it all up?
Herman
Sounds good. There are some really cool open-source projects that are making the setup process much easier than it used to be.
Corn
Alright, we will be right back.
Herman
So, before we left off, we were talking about the hardware requirements. But once you have the hardware, the next hurdle is the software stack. For a long time, setting up local A-I felt like you needed a P-h-D in computer science. You had to compile drivers, manage Python environments, and pray that everything didn't break every time you updated your system.
Corn
Right, I remember those days. It was a nightmare. But things have changed, haven't they? Ollama was really the turning point for a lot of people.
Herman
It really was. Ollama made it as simple as typing a single command. But for the agentic use cases Daniel is asking about, you need more than just a chat interface. You need something that can bridge the gap between the model and your computer's file system, your web browser, and your terminal.
Corn
This is where things like the Model Context Protocol come in. I have been seeing a lot of buzz about this lately. It is essentially a standard way for A-I models to talk to different tools, right?
Herman
Exactly. Think of it like a universal translator for A-I tools. Before, if you wanted an A-I to use a specific database, you had to write custom code to connect that model to that database. If you switched models, you might have to rewrite everything. The Model Context Protocol, or M-C-P, provides a standardized way for any model to access data and tools.
Corn
And the best part is that it is being adopted by the local A-I community. There are now M-C-P servers that run entirely on your local machine. You can have one server that handles your local documents, another that handles your calendar, and another that can run code in a secure sandbox.
Herman
And since it is a standard, you can swap out the model whenever you want. If a better model comes out from Mistral next week, you just point your M-C-P setup to the new model and all your tools still work. That is the kind of stability Daniel was asking about. It makes the local setup feel less like a fragile experiment and more like a professional tool.
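For readers following along at home, a local MCP server of the kind Corn and Herman describe can be sketched in a few lines, here assuming the official MCP Python SDK's FastMCP helper. The file-reading tool is a made-up example; a real server should sandbox and restrict whatever it exposes.

```python
# A tiny local MCP server exposing one tool, sketched against the official MCP
# Python SDK's FastMCP helper (pip install mcp). The tool is a made-up example;
# real servers should sandbox and tightly restrict file access.
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-docs")

@mcp.tool()
def read_text_file(path: str) -> str:
    """Return the contents of a text file inside the current project."""
    target = Path(path).resolve()
    if not target.is_relative_to(Path.cwd()):   # crude guard against escaping the project
        raise ValueError("Access outside the project directory is not allowed")
    return target.read_text(encoding="utf-8")

if __name__ == "__main__":
    mcp.run()   # serves the tool over stdio to any MCP-compatible client
```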
Corn
I have been using a tool called Anything-L-L-M lately, which seems to be trying to bundle all of this together. It handles the model, the vector database for your documents, and the agentic workflows all in one package. Have you tried that one?
Herman
I have, and it is a great entry point. It is very user-friendly. But for the power users, I think the real excitement is in the integration with code editors. Tools like Continue or Cursor, which allow you to use local models directly inside your development environment, are game-changers. You can highlight a block of code and ask the local model to explain it, or have an agent refactor a whole file while you are working on something else.
Corn
And if you have that twenty-four gigabytes of Video Random Access Memory we talked about, the experience is almost seamless. You forget that the A-I isn't coming from some massive server farm in the desert. It is just sitting right there next to your feet.
Herman
That is the magic of it. And it is only going to get better. One thing I am really watching is the development of what we call Small-to-Large model handoffs. Imagine you have a tiny model that is always running, using almost no power. It handles your basic requests. But when you ask it something complex, it automatically wakes up the seventy-billion parameter giant, which does the heavy lifting and then goes back to sleep.
Corn
That would be incredibly efficient. It is like having a receptionist who only calls the C-E-O when it is absolutely necessary.
Herman
Exactly. That kind of intelligent resource management is what will eventually bring agentic A-I to every laptop, not just the high-end ones.
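A crude version of that handoff is just a router in front of two locally served models. The sketch below keys the decision off prompt length and a few keywords purely for illustration, and both model tags are placeholders; a more serious router would let the small model itself decide when to escalate.

```python
# Toy small-to-large handoff using the Ollama Python client. Both model tags
# are placeholders, and the routing heuristic is deliberately crude.
import ollama

SMALL, LARGE = "phi-mini-q4", "llama-70b-q3"   # hypothetical local model tags

def looks_hard(prompt: str) -> bool:
    keywords = ("refactor", "debug", "prove", "architecture", "plan")
    return len(prompt.split()) > 80 or any(k in prompt.lower() for k in keywords)

def ask(prompt: str) -> str:
    model = LARGE if looks_hard(prompt) else SMALL   # wake the giant only when needed
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return f"[{model}] {reply['message']['content']}"

print(ask("What time zone is Jerusalem in?"))                    # stays on the small model
print(ask("Refactor this module to remove the global state."))   # escalates to the large one
```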
Corn
So, let us talk about the benchmarks again, but from a different angle. If someone is shopping for a new machine today with the goal of running local agents, what is the best bang for their buck?
Herman
If you are on a budget, look for a used Nvidia R-T-X thirty-ninety. It has twenty-four gigabytes of V-RAM and you can often find them at a reasonable price. If you want a new laptop, try to find one with at least sixteen gigabytes of graphics memory, though those are still mostly in the high-end gaming category.
Corn
And for the Mac side?
Herman
For Mac, do not settle for anything less than thirty-two gigabytes of Unified Memory. If you can afford sixty-four or ninety-six, that is where the real fun begins. A Mac Studio with ninety-six gigabytes is probably the best all-around local A-I machine you can buy right now without going into professional server territory.
Corn
It is an investment, for sure, but when you consider the lack of subscription fees and the privacy benefits, the math starts to make sense for a lot of people.
Herman
It really does. And honestly, just the educational value of seeing how these things work under the hood is worth a lot. When you run a model locally, you start to understand the trade-offs. You see how the context window affects performance, you see how quantization impacts intelligence. It makes you a more informed user of A-I in general.
Corn
I totally agree. Every time I tweak a setting in my local setup and see the tokens-per-second change, I feel like I am learning more about the future of computing. It is not just a black box anymore.
Herman
Exactly. So, to wrap up the technical side for Daniel, the hope is not just in the hardware getting bigger, but in the software getting smarter. We are moving toward a world where the supercomputer is optional, but the high-end consumer machine is becoming a true local powerhouse.
Corn
I love that. It is a very optimistic view of where things are heading. Before we finish up, I want to talk about the community. One of the reasons local A-I has come so far is the incredible open-source community.
Herman
Oh, absolutely. Without places like Hugging Face and the thousands of independent researchers who are fine-tuning these models, we would be years behind where we are now. The fact that someone can release a model on Tuesday and by Wednesday there are five different quantized versions of it available for different hardware is just incredible.
Corn
It is a true meritocracy. The best ideas and the most efficient techniques get adopted almost instantly. It is the complete opposite of the closed-door development we see at some of the big A-I companies.
Herman
And that is why I am confident that the local A-I scene will always be relevant. It is the laboratory where the most interesting experiments are happening.
Corn
Well, Herman, I think we have covered a lot of ground today. We have gone from V-RAM benchmarks to the philosophy of open-source development. I hope this gives Daniel some clarity on his prompt.
Herman
I hope so too. It was a great question. It is something I think about every time I look at my own setup.
Corn
Before we go, I want to mention a few things to our listeners. If you have been enjoying My Weird Prompts and you find these deep dives helpful, we would really appreciate it if you could leave us a review on your podcast app or on Spotify. It genuinely helps other people find the show, and we love hearing your feedback.
Herman
Yeah, it makes a huge difference. And if you want to get in touch with us, you can find our contact form and our full archive of over six hundred episodes at myweirdprompts dot com. You can also find our R-S-S feed there if you want to subscribe.
Corn
We are always looking for new prompts, so if you have a weird idea or a technical question like Daniel's, send it our way. We live for this stuff.
Herman
We really do. Thanks again to Daniel for the prompt. It was a fun one to dig into.
Corn
Definitely. Well, that is it for today's episode. Thanks for listening to My Weird Prompts. I am Corn.
Herman
And I am Herman Poppleberry. We will see you next time.
Corn
Bye everyone.
Herman
Goodbye.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.