#1700: Can LLMs Learn Continuously Without Forgetting?

We explore a new approach: micro-training updates every few days to keep AI knowledge fresh without constant web searches.

Episode Details
Episode ID: MWP-1853
Published
Duration: 21:24
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Quest for Real-Time AI Knowledge

One of the biggest bottlenecks for autonomous AI agents today is the gap between real-world events and what the model actually knows. As of 2026, most large language models (LLMs) rely on Retrieval-Augmented Generation (RAG) to access current information. While effective, RAG acts as a crutch, requiring the model to search the web, read a site, and summarize it on the fly. This process introduces latency and consumes context window space. A compelling alternative is emerging: micro-training, a method where models receive tiny, surgical fine-tuning updates every few days to bake new knowledge directly into their parametric memory.

The core challenge of this approach is "catastrophic forgetting." Neural networks are prone to overwriting old data when trained on new information. Imagine a chalkboard full of complex equations; to write a new sentence, you must erase something. If a model is fine-tuned exclusively on news from March 2026, it might lose its ability to write Python code or understand historical context. A 2025 study by Stanford showed that fine-tuning on recent data can reduce accuracy on older benchmarks by up to 30% without specific mitigations. The goal is to balance plasticity (learning new things) with stability (retaining old knowledge).

To address this, researchers are looking at Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA). Instead of retraining the entire model, LoRA updates less than 1% of the parameters. This allows for the creation of a "daily news" adapter that provides context on recent events without altering the core foundation. However, this introduces a new problem: adapter proliferation. If you have a new adapter for every three-day window, you eventually need a routing layer to decide which knowledge base is relevant to a specific prompt, creating a "mixture of experts" scenario on a micro-scale.
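
To make the "less than 1% of the parameters" claim concrete, here is a minimal sketch of the LoRA arithmetic: instead of updating a full `d_in × d_out` weight matrix, LoRA learns a rank-`r` factorization `B @ A` and trains only those two small matrices. The dimensions below are illustrative, not taken from any specific model.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of a d_in x d_out weight matrix that LoRA actually trains.

    LoRA freezes the original weight W and learns a low-rank update
    B @ A, where A is (rank x d_in) and B is (d_out x rank).
    """
    full_params = d_in * d_out
    lora_params = rank * d_in + d_out * rank
    return lora_params / full_params

# A typical transformer projection: 4096 x 4096, rank-8 adapter.
frac = lora_param_fraction(4096, 4096, 8)
print(f"trainable fraction: {frac:.4%}")  # well under 1%
```

At rank 8 on a 4096-wide projection, the adapter trains roughly 0.4% of that matrix's parameters, which is why a "daily news" adapter is cheap enough to retrain every few days.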

Beyond the technical hurdles of forgetting and routing, there is a significant governance risk. With RAG, users can see the source URL of the information. If a model is micro-trained on a hallucinated news article, that error becomes a fundamental "fact" in the model’s worldview, making it much harder to unlearn. This creates a need for a sophisticated "truth filter" or automated curriculum before data hits the training cluster. While the compute cost of these micro-updates is relatively low using PEFT—perhaps a few hundred dollars every three days for a large model—the human capital required to curate and verify the data is the real expense.

Ultimately, the choice between embedded knowledge and retrieval depends on the use case. For high-frequency autonomous agents, like trading bots or real-time navigation systems, micro-training offers near-zero latency and a clean context window. For research and legal work where citations are vital, RAG remains superior. The future likely lies in a hybrid approach: using sliding-window fine-tuning to maintain a "working memory" of recent events while preserving long-term reasoning capabilities.


#1700: Can LLMs Learn Continuously Without Forgetting?

Corn
Imagine an LLM that never has a knowledge cutoff date because it is constantly learning, sort of like a human reading the news every morning and actually remembering it. Today’s prompt from Daniel is about exactly that, asking if we could engineer a model that receives micro-trainings every few days via a data pipeline so it just inherently knows recent events without needing to click a search button or use an external tool like Tavily. It is a fascinating premise because, as of March twenty-nine, twenty-six, the gap between real-world events and what a model actually knows in its weights is still one of the biggest bottlenecks for autonomous agents.
Herman
It really is the holy grail of model architecture. By the way, today’s episode is powered by Google Gemini three Flash. I am Herman Poppleberry, and I have been diving into the research on this specifically because the industry has hit a wall: RAG, or Retrieval-Augmented Generation, is great, but it is a crutch. Daniel’s question hits on something deeper: can we move from retrieval-augmented to knowledge-embedded?
Corn
Right, because right now, if I ask a model about a news event from three hours ago, it has to go out, find a website, read it, and then summarize it. It does not actually know the event. It is just a very fast librarian. Daniel is suggesting we bake the library into the brain every few days. Is that feasible, or are we just talking about a very expensive way to break a model?
Herman
It is technically feasible, but the "breaking the model" part is the massive hurdle. We are talking about continual learning. The concept is that instead of a massive training run every six months that costs a hundred million dollars, you run these tiny, surgical micro-trainings—essentially fine-tuning sessions—every seventy-two hours. You feed it the latest scraped data, curated news feeds, and technical papers. But the technical challenges are immense, mostly centered around something called catastrophic forgetting.
Corn
Catastrophic forgetting sounds like what happens to me after a long weekend, but I assume for an LLM, it is a bit more terminal. How does that look in the actual architecture? Is it literally overwriting the old data?
Herman
Think of it like a chalkboard that is already full of beautiful, complex equations. If you want to write a new sentence on that board, you have to erase something. In neural networks, when you train on new data, the gradient updates can overwrite the weights that stored the old data. If you fine-tune a model on nothing but news from March twenty-twenty-six, it might suddenly forget how to write Python code or lose its nuanced understanding of nineteenth-century history.
Corn
So it’s not just forgetting facts, it's losing the actual reasoning capabilities it gained during the initial multi-billion dollar pre-training?
Herman
Precisely. A twenty-twenty-five study by Stanford showed that fine-tuning on recent data can reduce accuracy on older benchmarks by up to thirty percent if you do not have specific mitigations in place. You end up with a model that knows who won the game last night but forgot how to do basic math. It’s a trade-off between plasticity—the ability to learn new things—and stability—the ability to keep what you already know.
Corn
So it becomes a goldfish with a very high IQ for current events. That does not seem ideal. But we have tools for this, right? We talk about LoRA and adapters all the time. Could we just use those to "pin" the new knowledge without touching the core foundation?
Herman
That is the most likely path forward. Low-Rank Adaptation, or LoRA, allows you to train a tiny fraction of the model’s parameters—usually less than one percent. You could essentially have a "daily news" adapter that gets updated. The model stays the same, but this extra layer provides the context of recent events.
Corn
But wait, if you have a new adapter for every three-day window, don't you eventually end up with a thousand different adapters? How does the model know which one to "wear" when I ask a question?
Herman
That’s the "mixture of experts" problem on a micro scale. You’d need a routing layer to decide which knowledge base is relevant to the prompt. But the bigger problem is that even with adapters, you eventually hit a saturation point. If you keep stacking new info into a single adapter every three days, the adapter itself starts to suffer from that same forgetting or internal interference. The weights inside that tiny one-percent layer can only hold so much "newness" before they start contradicting themselves.
Corn
I want to push on the "why" here for a second. If I am a developer, why would I bother with the compute cost of micro-training every three days when I can just use a RAG pipeline for a fraction of a cent per query? Is the latency really that much of a dealbreaker?
Herman
In twenty-twenty-six, latency is everything for agents. If you are running an autonomous trading bot or a real-time social media moderator, the two seconds it takes to perform a web search, parse the HTML, chunk the text, and feed it back into the context window is an eternity.
Corn
I guess if you're an AI agent trying to buy a stock based on a breaking news alert, by the time you've finished your "search and retrieve" cycle, the market has already moved.
Herman
A micro-trained model has that knowledge in its parametric memory. It is instant. We saw a benchmark study in late twenty-twenty-five comparing a standard RAG system using Tavily versus a model with embedded updates, and the latency difference was about two hundred milliseconds for the embedded model versus nearly three seconds for the RAG setup. Beyond just speed, there's also the "context window tax." If you use RAG, you're stuffing thousands of tokens into the prompt every time. With a micro-trained model, the context window stays clean for the actual task at hand.
Corn
That is a massive difference if you are scaling to millions of users or high-frequency tasks. But what about the "truth" problem? If we are auto-feeding news into a model’s weights every few days, how do we stop it from absorbing misinformation? With RAG, I can see the source. I can see that the model is reading a sketchy tabloid. If it is baked into the weights, the model just states it as an inherent fact.
Herman
That is the second-order effect that worries me the most: hallucination amplification. If the micro-training pipeline ingests a "hallucinated" or "fake news" article and updates the weights, that error is now part of the model’s worldview. It is much harder to "unlearn" a weight update than it is to just ignore a bad RAG result.
Corn
It’s like the difference between a friend telling you a rumor they read—where you can say "that sounds fake"—and you actually believing the rumor is a fundamental law of physics.
Herman
Spot on. We saw some experiments with Google’s Real-Time RLHF—Reinforcement Learning from Human Feedback—where they tried to use user corrections to update models in near real-time. It worked for facts, but it also made the models susceptible to "data poisoning" where a coordinated group of users could trick the model into believing something false simply by repeating it often enough across the update window. Imagine a group of trolls convincing a model that a specific public figure has passed away. If that gets baked into the weights, the model will report it as a fact for the next seventy-two hours until the next update.
Corn
It sounds like we are trading a retrieval problem for a governance problem. You would need an incredibly sophisticated "truth filter" before the data even hits the training cluster. You are basically building a digital editor-in-chief for the AI.
Herman
You are not just building a model; you are building an automated curriculum. The data management side of this is actually more complex than the compute side. You need to curate the data, de-duplicate it, verify it against multiple sources, and then format it for fine-tuning. And you have to do that every three days, forever. It requires a massive infrastructure for what we call "Data Engineering for LLMs."
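
The curation pipeline Herman outlines — de-duplicate, then verify against multiple sources — can be sketched as a two-stage filter. The record schema (`text`, `source`, `claim_id`) is an assumption for illustration; a real pipeline would need semantic dedup and actual claim extraction, not exact-hash matching.

```python
import hashlib

def dedupe_and_filter(articles: list[dict], min_sources: int = 2) -> list[dict]:
    """Drop verbatim duplicates, then keep only claims corroborated by
    at least `min_sources` distinct outlets.

    Assumes each article is a dict with 'text', 'source', and 'claim_id'
    keys (hypothetical schema).
    """
    # Stage 1: exact de-duplication by content hash.
    seen, unique = set(), []
    for article in articles:
        digest = hashlib.sha256(article["text"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(article)

    # Stage 2: count distinct sources per claim, keep corroborated ones.
    sources_per_claim: dict[str, set] = {}
    for article in unique:
        sources_per_claim.setdefault(article["claim_id"], set()).add(article["source"])
    return [a for a in unique
            if len(sources_per_claim[a["claim_id"]]) >= min_sources]
```

The corroboration threshold is the crude stand-in for the "truth filter" discussed above: a single-source claim never reaches the training cluster.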
Corn
Let's talk about the compute for a second, because I know you love the numbers. Is it actually affordable to fine-tune a seventy-billion-parameter model every three days?
Herman
If you are doing full parameter fine-tuning? Absolutely not. That would be insane. That’s like rebuilding a skyscraper every time you want to change the curtains. But with PEFT—Parameter-Efficient Fine-Tuning—it becomes much more reasonable. Using something like LoRA or DoRA, Weight-Decomposed Low-Rank Adaptation, you are only updating a few hundred megabytes of weights.
Corn
Walk me through the math there. What does a single "micro-update" look like on a balance sheet?
Herman
On a platform like Modal, you could spin up a cluster of H-one-hundreds, run the micro-training in an hour, and shut it down. The cost might be a few hundred dollars per update. For a company like OpenAI or Anthropic, that is pocket change—it’s less than they spend on coffee in a day. For an individual developer, it is a bit steep, but not impossible if you’re running a specialized high-value agent. The real cost isn't the electricity; it's the human capital required to ensure the data you're feeding it doesn't turn the model into a gibberish-spewing mess.
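
Herman's "few hundred dollars per update" can be reproduced with back-of-envelope arithmetic. The GPU count, hourly rate, and duration below are assumptions for illustration, not actual Modal or H100 pricing.

```python
def microupdate_cost(gpus: int = 8, usd_per_gpu_hour: float = 5.0,
                     hours: float = 6.0, updates_per_month: int = 10):
    """Back-of-envelope cost of one PEFT micro-update and a month of them.

    All rates here are assumed placeholder values, not real pricing.
    """
    per_update = gpus * usd_per_gpu_hour * hours
    return per_update, per_update * updates_per_month

per_update, monthly = microupdate_cost()
print(f"${per_update:.0f} per update, ${monthly:.0f}/month")
```

Even if the assumed rates are off by 2x, the monthly compute bill stays in the low thousands, which supports the point that data curation labor, not GPU time, dominates the cost.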
Corn
So the hurdle isn't the money, it's the "catastrophic forgetting" and the "poisoning" risks. Is there a middle ground? Could we have a model that has an "ephemeral" memory layer that gets wiped and refreshed, while the core remains static?
Herman
That is actually where the research is heading—sliding window fine-tuning. You train an adapter on the last seven days of data. When day eight hits, you don't just add more; you start a new adapter or you use a technique called "Elastic Weight Consolidation." This basically tells the model, "Hey, these weights were really important for the stuff you learned three years ago, so don't move them too much, but these other weights are flexible, you can change those for the news about the Mars landing."
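
The Elastic Weight Consolidation idea Herman describes has a simple mathematical core: add a quadratic penalty that anchors each parameter to its old value, scaled by how important that parameter was (its Fisher information) for earlier tasks. A minimal sketch, with flat Python lists standing in for parameter tensors:

```python
def ewc_penalty(theta, theta_star, fisher, lam: float = 100.0) -> float:
    """Elastic Weight Consolidation regularizer.

    Penalizes moving parameters away from their pre-update values
    theta_star, weighted by per-parameter Fisher importance. Parameters
    with fisher ~ 0 are free to absorb new knowledge; high-fisher
    parameters are pinned in place.
    """
    return 0.5 * lam * sum(
        f * (t - t_old) ** 2
        for t, t_old, f in zip(theta, theta_star, fisher)
    )
```

During a micro-update the total loss would be `task_loss + ewc_penalty(...)`, which is exactly the "don't move these weights too much" instruction in equation form.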
Corn
I like that. It is like having a "working memory" that is separate from your "long-term memory," but both are part of the brain rather than an external notepad. But I still go back to the RAG comparison. Daniel mentioned that this would eliminate the need for search tools. But search tools provide something a model’s weights never can: a URL. If I am using an AI for research, I want to see the source. If the model just "knows" it, I have to take its word for it.
Herman
But does the user always need a URL? Think about a voice assistant or a car dashboard. If I ask my car, "Is the bridge ahead closed due to that accident ten minutes ago?" I don't want a list of citations. I want a 'Yes' or 'No' based on the most recent data possible.
Corn
That's a fair point. For "utility" AI, the source is less important than the accuracy. But how do we verify the accuracy without the source?
Herman
That is a huge point. Transparency. In a micro-trained model, the "source" is buried in a multi-dimensional vector space. You can't easily ask a model, "Which specific neuron told you that the Prime Minister resigned?" With RAG, you have a direct pointer to the document. This is why I think the "knowledge-embedded" approach is better for agents and action-oriented AI, while RAG remains the king for research and citation-heavy work.
Corn
So if I am building a robot that needs to navigate a city where the streets change or a news-cycle-aware trading bot, I want the micro-training. If I am writing a legal brief, I want the RAG.
Herman
Precisely. And the technical challenge of "data freshness" is also about the "fog of war." In the first forty-eight hours of a major event, the information is often contradictory. If you micro-train on day two, you might bake in a lot of errors. RAG allows the model to see the "live" corrections happening on the web. A model with a three-day-old micro-training is still technically "out of date" the moment the training finishes.
Corn
That is the irony, isn't it? You spend all this money to have a "current" model, and the second the weights are saved, it is already behind the curve. It is like buying a new car; the value drops the moment you drive it off the lot. Except here, the "value" is the accuracy of the information.
Herman
It really is. But there is a version of this that works: "Continual Pre-training." Instead of fine-tuning, you just keep the pre-training process running on a smaller scale. Meta did some experiments with Llama three where they showed that if you keep the learning rate low and keep feeding it a mix of old "foundation" data and new "fresh" data, you can maintain the model's capabilities while slowly shifting its knowledge base forward. They found they could retain about eighty-five percent of the old knowledge even after ten consecutive update cycles.
Corn
Eighty-five percent sounds high until you realize that the fifteen percent it forgot might be how to recognize a stop sign or how to use a comma.
Herman
It's the "unpredictability" of the forgetting that is the problem. You don't get to choose what it forgets. It’s not like a computer hard drive where you delete a specific folder. It’s more like a drop of ink in a glass of water—it diffuses everywhere. You might find that after a week of learning about the stock market, the model's ability to translate French has slightly degraded for no apparent reason.
Corn
Is there any way to "lock" those critical skills? Like, "don't ever touch the French-speaking neurons, no matter what happens in the news?"
Herman
Researchers are working on "Freezing" specific layers. You identify the layers of the transformer that are responsible for syntax and logic and you set their learning rate to zero. Then you only allow the "knowledge-heavy" layers—usually the later Feed-Forward Networks—to update. It’s a surgical approach to AI brain surgery. But even then, the layers are interconnected. You change one part of the network, and the signals flowing through it change, which can still lead to weird side effects in the "frozen" sections.
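
Layer freezing is usually implemented by assigning a per-layer learning rate of zero. A toy sketch of that assignment, where the layer-name prefixes are illustrative placeholders rather than any real model's naming:

```python
def per_layer_lr(layer_names: list[str], base_lr: float = 1e-4,
                 frozen_prefixes: tuple = ("embed", "attn")) -> dict:
    """Assign learning rate 0.0 to frozen layers, base_lr elsewhere.

    The prefixes here are hypothetical stand-ins for the layers thought
    to carry syntax and logic; the later feed-forward layers keep a
    nonzero rate so they can absorb new facts.
    """
    return {
        name: 0.0 if name.startswith(frozen_prefixes) else base_lr
        for name in layer_names
    }
```

In a real training framework the same effect is achieved with optimizer parameter groups (or `requires_grad = False`); the caveat Herman raises still applies either way, since activations flowing through frozen layers still change when their neighbors do.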
Corn
So, for the listeners who are developers and are thinking about trying this—maybe they are looking at Hugging Face’s PEFT library or thinking about setting up a pipeline on Modal—what is the actual "how-to" here? If you wanted to build a "Daniel-bot" that knows everything Daniel has posted on GitHub in the last week without using RAG, how would you start?
Herman
I would start with a "Replay Buffer." This is a classic technique in reinforcement learning. When you train on the new data—the "fresh" stuff—you also include a small, high-quality sample of the original training data. This "reminds" the model of its core identity and skills while it learns the new stuff. It's like studying for a new exam but occasionally glancing at your old textbooks to make sure you haven't forgotten the basics.
Corn
It’s the "don't forget your roots" method of AI training. I like it. But how do you select what goes into the replay buffer? You can't include everything.
Herman
You usually use a "diversity-based" selection. You want a little bit of everything: some code, some creative writing, some logic puzzles, and some basic factual knowledge. If the buffer is balanced, it acts as an anchor, preventing the model from drifting too far into the "current events" weeds. But again, we are talking about a lot of overhead. For most people, sticking with standard RAG is still the move. Micro-training feels like an "enterprise-grade" solution for very specific low-latency needs.
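
The replay buffer plus diversity-based selection the hosts describe can be sketched as one mixing function: take the fresh data, draw an evenly balanced sample from each replay category, and shuffle. The pool names and the 30% replay ratio are assumptions for illustration.

```python
import random

def build_training_mix(fresh: list, replay_pools: dict,
                       replay_ratio: float = 0.3, seed: int = 0) -> list:
    """Mix fresh examples with a diversity-balanced replay sample.

    `replay_pools` maps category names (e.g. code, prose, logic, facts)
    to lists of original training examples. The replay sample is drawn
    evenly across categories so the buffer acts as a balanced anchor.
    """
    rng = random.Random(seed)
    # Size the replay sample so it makes up ~replay_ratio of the mix.
    n_replay = int(len(fresh) * replay_ratio / (1 - replay_ratio))
    per_pool = max(1, n_replay // len(replay_pools))
    replay = [ex for pool in replay_pools.values()
              for ex in rng.sample(pool, min(per_pool, len(pool)))]
    mix = fresh + replay
    rng.shuffle(mix)
    return mix
```

The even per-pool draw is the "little bit of everything" selection; a production pipeline would weight pools by measured regression risk rather than uniformly.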
Corn
I agree. Unless you are building something where every millisecond counts or where the "context window" is too small to hold all the necessary RAG results, micro-training is probably overkill. But as the compute costs come down and our techniques for preventing catastrophic forgetting improve, I think we will see the "knowledge cutoff" date start to disappear from the marketing materials of these big models. They will just be "living" models.
Herman
Living models. That sounds both cool and slightly terrifying. Like a model that has its own "childhood" and "adulthood" and just keeps evolving. But what happens when the model "evolves" in a direction the developers don't like? If it is constantly learning from the world, and the world is... well, the world... how do you keep it aligned?
Corn
If the model is learning from the internet every three days, and the internet is currently having a collective meltdown over a specific topic, the model might inherit that bias or emotional volatility.
Herman
That is the million-dollar question. AI alignment is already hard when the model is "frozen." When it is a moving target, you have to have an automated alignment pipeline that is just as fast as your training pipeline. You would need an "Evaluator Model" that checks the "Student Model" after every micro-update to make sure it hasn't become toxic or biased or started advocating for something dangerous. You’re basically running a mini-RLHF cycle twice a week.
Corn
It’s a hall of mirrors. Models training models, models evaluating models. At some point, you have to wonder if a human is even in the loop anymore. If the data is scraped by an AI, cleaned by an AI, trained into a model by an AI, and then checked for safety by another AI... where are we?
Herman
In the loop? Corn, in this scenario, the human is just the person paying the electricity bill and providing the data. The rest is automation all the way down. We become the supervisors of the system rather than the practitioners. It's a fundamental shift in how we think about "content." Content becomes fuel for the model’s weights rather than something for humans to read.
Corn
That's a bit heavy for a Tuesday. But it does explain why companies are so desperate for "high-quality data." If your model is eating the news every day, you better make sure it's eating organic, farm-to-table facts and not processed junk.
Herman
Data provenance becomes the most important part of the stack. You need to know exactly where every token came from, because if the model starts acting up, you need to be able to trace it back to the specific update that caused the deviation. It’s like food safety for information.
Corn
Well, on that cheerful note of our impending obsolescence as "knowledge providers," let’s look at some practical takeaways. If you are a dev, don't throw away your RAG pipeline just yet. Tavily and other search tools are still your best friend for accuracy and citations. But if you want to experiment, look into LoRA and replay buffers. It’s a great way to understand how model weights actually work and how fragile they can be.
Herman
And if you do experiment, monitor your MMLU—Massive Multitask Language Understanding—benchmarks. If you see those numbers dropping while you feed the model news, you know you’ve got a "goldfish" problem on your hands. Watch specifically for "regression" in areas completely unrelated to your new data. That's the first sign that the catastrophic forgetting is setting in.
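
Herman's monitoring advice reduces to a before/after diff over benchmark categories. A minimal sketch, assuming you have per-task accuracy scores from an eval harness (the task names and tolerance are illustrative):

```python
def regression_report(before: dict, after: dict,
                      tolerance: float = 0.02) -> dict:
    """Flag benchmark categories whose accuracy dropped by more than
    `tolerance` after a micro-update.

    Drops in areas unrelated to the new training data are the early
    warning sign of catastrophic forgetting.
    """
    return {
        task: round(before[task] - after[task], 4)
        for task in before
        if before[task] - after[task] > tolerance
    }
```

Running this after every update, and gating deployment on an empty report, is the cheapest version of the "Evaluator Model" loop discussed earlier in the episode.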
Corn
A very smart, very fast goldfish. It's a wild thought—a model that knows the world as it happens, but might forget how to be itself in the process. Thanks to Daniel for the prompt—this was a deep one. It really highlights the tension between wanting an AI that is "of the moment" and an AI that is reliable.
Herman
It’s the ultimate engineering trade-off. Speed versus stability. We’re seeing the industry lean toward speed right now, but I suspect we’ll see a massive "stability" correction once these agents start making real-world mistakes because they forgot their foundation.
Corn
And big thanks as always to our producer, Hilbert Flumingtop, for keeping the show running and making sure our own "weights" don't drift too far between episodes.
Herman
Also, a huge thanks to Modal for providing the GPU credits that power our generation pipeline. They make this kind of high-level exploration possible. Without that kind of compute access, we’d just be talking in circles.
Corn
This has been My Weird Prompts. If you are enjoying the show, leave us a review on Apple Podcasts or Spotify—it really does help other people find these deep dives. We love seeing the community grow and hearing what kind of wild architectural questions you all have.
Herman
You can find us at myweirdprompts dot com for the full archive and RSS feed. We've got links to the Stanford and Meta papers we mentioned in the show notes if you want to see the "forgetting" data for yourself.
Corn
Catch you in the next one, where we might be talking to an AI that already knows everything we're about to say.
Herman
See ya.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.