Daniel sent us this one, and it's a question we've been circling for a while. He's asking about knowledge cutoffs — the hard wall that gets baked into a model during pre-training — and whether there are mechanisms emerging to do targeted incremental pre-training runs that update what a model knows without ballooning the scope of the whole training process. The context is that post-training can't do this job, RAG pipelines are a workaround with real costs, and full pre-training runs are so expensive they're essentially a one-time event for most organizations. So: is there a smarter path being developed?
The cost framing is not rhetorical. The estimate on GPT-4's pre-training run was over a hundred million dollars. That's not a number you run twice because your model doesn't know about something that happened six months ago.
It's not "oh we'll just retrain it." It's more like "we will never retrain it, so what do we do instead?" And that tension is exactly what makes this interesting. By the way, today's episode is powered by Claude Sonnet four point six.
Which I find either deeply appropriate or deeply ironic given the topic, and I genuinely cannot decide which.
Okay, so let's actually frame the problem properly before we get into what's being built to solve it, because I think the way most people understand knowledge cutoffs undersells how structural the issue is.
It's not just "the model doesn't know recent news." That's the surface version. The deeper version is that pre-training is where a model builds its world model — its understanding of how things relate, what causes what, which concepts cluster together. The knowledge isn't stored like a database where you can swap out a row. It's distributed across billions of parameters in ways that are not cleanly separable. So when you ask why you can't just update it, the answer is that there's no "it" to update. The knowledge is the weights, and the weights are everything.
Which is why post-training — fine-tuning, RLHF, instruction tuning — none of it reaches that layer. You can change how the model responds, you can steer its behavior, you can make it more helpful or more cautious. But you cannot inject new factual world knowledge through those mechanisms in any robust way.
The analogy I keep coming back to — and I'll use this sparingly — is the difference between teaching someone a new skill and rewriting their memories. Post-training is teaching skills. Pre-training is the memories. You can't get to the memories through skills training.
RAG is basically... you're not rewriting the memories, you're just handing the person a cheat sheet before the exam.
RAG works for a lot of use cases. But it has its own failure modes. The retrieval step can surface wrong chunks, or it can surface the right information and the model still integrates it poorly. There was actually a piece from Towards AI recently making exactly this point — that RAG sometimes degrades output quality even when the retrieved information is factually correct. Because the model has to reconcile what it retrieves with what it already believes, and those two things can conflict in subtle ways.
You've got a model that's confidently wrong in its weights, you hand it a document that contradicts that, and instead of updating, it sort of... argues with the document.
Or hedges in ways that make the output worse than if you'd just left it alone. It's a real problem. And it gets worse as the gap between training cutoff and deployment date widens. Most models are in production for a year or two, sometimes longer. GPT-4's knowledge cutoff was around October twenty twenty-three, and that model has been in active use well past that point. Every month that passes is another month of world events, published research, changed circumstances that the base model simply does not have.
Daniel's question is essentially: can we do something surgical? Not a full re-run of the pre-training process, but a targeted pass that says "here is the specific knowledge we need to integrate, update these parameters, leave everything else alone."
The answer is: people are absolutely working on this, with varying degrees of success and very different approaches. The landscape right now has roughly three families of techniques. There's knowledge editing — which is the most surgical, operating at the level of specific facts. There's parametric fine-tuning with methods like LoRA, which is more targeted than full pre-training but still touches model weights. And then there's what's being called continual pre-training or mid-training, which is closer to what Daniel is actually asking about — running a pre-training-style process but scoped to a specific domain or time window.
Each of those has a different risk profile.
Knowledge editing is precise but brittle. LoRA is efficient but limited in how much new knowledge it can encode. Continual pre-training is the most capable but runs headlong into catastrophic forgetting, which is the core technical nightmare of this whole space.
Catastrophic forgetting — for anyone who hasn't heard us talk about this — is what happens when you train a neural network on new data and the gradient updates overwrite the weights that encoded the old knowledge. The model learns the new stuff and forgets the old stuff. It's not gradual degradation; it can be shockingly abrupt.
It's particularly nasty because it's not uniform. The model doesn't forget evenly. It tends to forget things that are underrepresented in the new training data, which means your incremental update can silently degrade performance in areas you weren't even focused on. You update the model's knowledge of, say, geopolitical events from the past year, and you discover three months later that its performance on some chemistry benchmark dropped fifteen percent.
Which you might not catch immediately if you're not running comprehensive evals.
And the standard mitigations — mixing old data in with new data during the incremental run, using replay buffers, elastic weight consolidation — all of them add cost and complexity. You're trying to hold the old knowledge in place while writing new knowledge in, and those two objectives pull against each other.
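To make that tension concrete, here's a minimal NumPy sketch of elastic weight consolidation, one of the mitigations just mentioned. The Fisher values and the `lam` strength are illustrative placeholders, not numbers from any real run: the idea is just that parameters important to the old knowledge (high Fisher information) get anchored, while unimportant ones stay free to move.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """Elastic weight consolidation penalty: anchors parameters that
    mattered for the old knowledge (high Fisher information) to their
    pre-update values theta_star, leaving unimportant ones free."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

def total_loss(new_task_loss, theta, theta_star, fisher, lam=1.0):
    # The two objectives pull against each other, exactly as described:
    # the task loss wants to move the weights, the penalty holds them.
    return new_task_loss + ewc_penalty(theta, theta_star, fisher, lam)
```

Moving a high-Fisher parameter by the same amount as a low-Fisher one costs far more penalty, which is the whole mechanism in one line.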
What's the most promising direction right now for actually doing this in a way that doesn't crater the model?
The mid-training work coming out of IBM Research is interesting. The framing they use is inserting an intermediate training phase — so it's not full pre-training, it's not fine-tuning, it sits between them. They've shown that focused mid-training on specific domains like math or science can improve reasoning capability by something like three to four times without significant loss of prior knowledge. Now, that's domain capability, not raw factual knowledge injection, but the mechanism is the same. You're doing a pre-training-style pass on a curated corpus, carefully controlling the data mix.
Three to four times improvement in reasoning on those domains is not a small number.
It's not. And the key insight is the data curation piece. The reason catastrophic forgetting happens so aggressively in naive continual pre-training is that the new data distribution is wildly different from the original training distribution. If you just dump a year's worth of news articles into a training run, the model sees a very skewed slice of the world. But if you construct a corpus that's specifically designed to bridge the old distribution and the new knowledge, you can make the update much smoother.
The engineering challenge isn't just "train on new data." It's "construct the right new data."
And that's actually a tractable problem in a way that the raw compute cost isn't. You can throw research at corpus construction. You can develop better methods for identifying which parameters are most relevant to the knowledge you want to update, which is what the knowledge editing approaches are doing at a much finer grain.
Let's talk about those for a second because ROME and MEMIT have been around for a bit, but there's newer work building on top of them.
ROME — Rank-One Model Editing — was one of the first methods that showed you could actually locate where a specific fact lives in a transformer's feed-forward layers and surgically modify those weights to change the fact. MEMIT extended that to handle batches of edits simultaneously. The problem both of them run into is what's now being called the reasoning gap. There's a recent paper, the MCircKE work, which identifies something quite specific: you can edit a fact so the model recalls it correctly in isolation, but the model still fails to use that fact correctly in multi-step reasoning chains.
The model knows the updated fact when you ask it directly, but when it needs to chain that fact with other things to answer a more complex question, it reaches for the old version.
Or it reaches for nothing. The circuit that does the reasoning wasn't updated even though the circuit that stores the fact was. They're not the same parameters. And the MCircKE work is trying to target both simultaneously — editing not just the fact storage but the reasoning pathways that connect that fact to downstream inference. It's much more mechanistically informed than earlier approaches.
That's a meaningful distinction. Because from a practical standpoint, a model that can recite a fact but can't reason with it isn't actually updated in any useful sense.
And this is where I think the field is making progress. The early knowledge editing work was almost too optimistic — here's how to change a fact, problem solved. The more recent work is grappling with the actual cognitive architecture of these models and asking where knowledge actually lives in a way that supports use, not just recall.
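The core move in ROME-style editing can be sketched in a few lines. This is a deliberate simplification — real ROME weights the update by key covariance statistics estimated from the training corpus, and the key and value vectors come from probing the model, none of which is shown here — but it captures the rank-one idea: change what one key maps to while leaving every orthogonal direction of the weight matrix untouched.

```python
import numpy as np

def rank_one_edit(W, k, v_new):
    """Simplified ROME-style edit: update feed-forward weight W so that
    key k (e.g. a subject representation) now maps to v_new (the new
    fact), with zero change along directions orthogonal to k."""
    k = k / np.linalg.norm(k)            # unit key direction
    residual = v_new - W @ k             # what the current weights get wrong
    return W + np.outer(residual, k)     # rank-one correction along k only
```

This is exactly why the reasoning gap exists: the edit is provably minimal in weight space, which also means nothing downstream of the stored association is touched.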
LoRA sits in a different part of this space.
LoRA — low-rank adaptation — is parametric fine-tuning with a twist: instead of updating all the model's weights, you freeze the original weights and train a small set of low-rank adapter matrices that sit alongside them. The adapters are cheap to train, cheap to swap, and they don't touch the base model. So in theory you can have a base model with multiple LoRA adapters for different knowledge domains, and you swap in the relevant one at inference time.
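Mechanically, that looks something like the following sketch for a single linear layer. The rank `r` and scaling `alpha` are typical illustrative defaults, not values from any particular paper; the key property is that the base weight `W` is never modified.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: the frozen base weight W is left untouched;
    only the low-rank factors A (d_in x r) and B (r x d_out) would be
    trained. Effective weight is W + (alpha / r) * A @ B."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = W.shape
        self.W = W                                    # frozen base weights
        self.A = rng.normal(scale=0.01, size=(d_in, r))
        self.B = np.zeros((r, d_out))                 # zero-init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # Base path plus low-rank correction; swap (A, B) to swap domains.
        return x @ self.W + self.scale * (x @ self.A @ self.B)
```

Because `B` starts at zero, a freshly attached adapter changes nothing — the model behaves exactly like the base until training moves the adapter weights, which is what makes per-domain adapter swapping safe.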
The appeal is obvious. You're not touching the base model, so you don't risk catastrophic forgetting of the original knowledge.
The limitation is equally obvious. LoRA adapters are small by design, which means their capacity to encode new knowledge is limited. They're much better at steering behavior or adapting style than at injecting substantial new factual knowledge. There's a paper from earlier this year that looked specifically at this tradeoff — RAG versus parametric learning — and the finding was roughly that parametric methods including LoRA are better at encoding structured reasoning patterns, while RAG is better at handling factual updates, especially for long-tail or highly specific information.
Which suggests they might be complementary rather than competing.
That's actually where a lot of the practical deployment work is landing. Hybrid architectures where you have a base model with some incremental parametric updates for broad knowledge shifts, plus a RAG layer for specific factual queries. Neither one alone solves the problem, but together they cover more of the failure surface.
The thing I keep coming back to is the economics of all this. The reason full pre-training is a hundred million dollar exercise is the compute. But incremental approaches — even continual pre-training — are orders of magnitude cheaper if you scope them correctly. Is there a rough sense of what targeted incremental pre-training actually costs relative to the original run?
It varies enormously depending on scope, but the general principle is that if you're training on a fraction of the data — say, a curated corpus representing one year of knowledge updates rather than the full training corpus — you're looking at roughly proportional compute savings, maybe better because you're not doing the full warmup and optimization trajectory from scratch. You're starting from a checkpoint. So if the original run cost a hundred million, a well-scoped incremental run might be in the low millions. Still not nothing, but it's a different category of investment.
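As a back-of-envelope illustration of that scaling argument — and it is only that, with both the token fraction and the warm-start discount being made-up placeholder numbers — compute for a fixed model scales roughly linearly with tokens trained on:

```python
def incremental_run_cost(full_run_cost, token_fraction, warm_start_discount=0.5):
    """Back-of-envelope only: assumes cost scales linearly with the
    fraction of tokens trained on, with a guessed discount for starting
    from a checkpoint rather than a cold optimization trajectory."""
    return full_run_cost * token_fraction * warm_start_discount

# A curated one-year corpus at ~5% of the original token count,
# against a $100M original run, lands in the low millions.
cost = incremental_run_cost(100_000_000, 0.05)
```

Which is the "different category of investment" point: the same arithmetic that makes a full run prohibitive makes a well-scoped incremental run plausible.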
That changes the calculus for who can actually do this. A hundred million is frontier lab territory. A few million is something that a well-funded enterprise or a mid-size AI company can contemplate.
Which is one reason this problem is getting so much research attention right now. The techniques that unlock incremental pre-training at reasonable cost are potentially the techniques that let more organizations maintain their own models over time rather than being permanently dependent on whatever knowledge state the frontier labs chose to freeze in.
There's a competitive angle there too. If your model's knowledge is perpetually eighteen months stale because you can't afford to retrain, and your competitor has figured out how to do targeted updates quarterly, that's a meaningful product differentiation.
Especially in domains where currency matters a lot. Legal, medical, financial, geopolitical. These are areas where a model that doesn't know about something that happened in the last year can give you advice that's not just incomplete but actively harmful.
Alright, I want to get into the practical implications of all this — what this actually looks like for someone trying to make decisions about model deployment and maintenance. But first, let's go a bit deeper on the technical mechanisms, because I think the catastrophic forgetting problem deserves more than we've given it, and there are some mitigation strategies that are worth understanding properly.
And I want to talk about the data mix question more, because I think that's underappreciated as the actual crux of whether incremental pre-training works or doesn't.
Let's do it.
Before we go further — what is pre-training actually doing? Because I think the word gets thrown around as though everyone agrees on what it means, and they don't always.
So pre-training is the phase where a model is trained on a massive corpus — essentially a compressed representation of human knowledge up to some point in time — and learns to predict text. What it's actually learning, underneath that objective, is a statistical model of how concepts, facts, and language relate to each other. The weights of the network become a kind of frozen snapshot of the world as it existed in that corpus.
Frozen being the operative word.
And the cutoff is just the date at which data collection stopped. Everything after that date doesn't exist to the model. It's not that the model is uncertain about post-cutoff events — it literally has no representation of them. There's no parameter anywhere encoding that something happened.
Which is a fundamentally different problem from the model being wrong or uncertain. It's absent.
That's what makes post-training so limited as a fix. You can fine-tune a model on new information, but fine-tuning was designed to adjust behavior, not inject large volumes of factual knowledge into the weights. The capacity just isn't there in the same way.
The knowledge cutoff isn't a bug in the implementation. It's structural. It's baked into how the whole paradigm works.
And the reason this matters more now than it did two or three years ago is that the models are being used in higher-stakes contexts where currency is actually critical. A model whose world stops in late twenty twenty-three isn't just slightly out of date — in some domains it's operating on a fundamentally different map than the one that reflects current reality.
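For anyone following along who wants the pre-training objective described above in concrete form, here's the loss in miniature — average cross-entropy of predicting each next token. Shapes and vocabulary are toy-sized; everything else about real pre-training (the transformer, the corpus, the scale) is outside this sketch.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Pre-training objective in miniature: mean cross-entropy of
    next-token prediction. logits: (seq, vocab); targets: (seq,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Minimizing this over a frozen snapshot of text is the entire training signal — which is why everything the model "knows" is whatever statistical structure helped it predict that corpus, and nothing after the cutoff.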
With that framing in place, let's dig into why the data mix question is so critical—you said it’s the actual crux.
And it’s where most naive implementations fall apart. The intuition people have is: gather everything published since the cutoff, clean it, train on it. And that's almost exactly wrong.
Because the model's existing knowledge isn't uniformly distributed across topics. Some domains are heavily represented in the original corpus — major world events, foundational science, widely published literature. Others are sparse. If your update corpus has a different density profile than the original, you're essentially teaching the model that certain topics matter more than they used to, and that signal bleeds into the weights in ways that are hard to predict.
The update distorts the relative weighting of knowledge, not just the absolute content.
And this is where the continual pre-training literature has gotten more sophisticated. The better approaches don't just curate new data — they construct a replay buffer. You mix in a sample of the original training distribution alongside the new material, so the model is seeing both simultaneously. The old knowledge gets reinforced even as new knowledge is being introduced.
That's what keeps catastrophic forgetting from cascading.
It's the main mitigation, yes. The IBM mid-training work is instructive here — they found that inserting a targeted intermediate training phase focused on specific domains could improve reasoning performance by three to four times without degrading what the model already knew. The key was that the mid-training corpus was carefully constructed to complement, not compete with, the original distribution.
Three to four times is not a marginal improvement. That's a meaningful capability jump from what sounds like a relatively contained intervention.
It didn't require anything close to a full pre-training run. The compute profile was dramatically lower. Which tells you something important: the original pre-training is doing a lot of work establishing the model's general architecture of understanding, and targeted updates can leverage that architecture rather than rebuilding it.
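The replay-buffer construction described above reduces to a small piece of data plumbing. A minimal sketch, with the replay ratio as an illustrative knob rather than a recommended value — real pipelines also weight the replay sample to match the original distribution's density profile, which this doesn't attempt:

```python
import random

def build_training_stream(new_docs, replay_docs, replay_ratio=0.25, seed=0):
    """Interleave a sample of the original training distribution with
    the new corpus, so old knowledge is reinforced while new knowledge
    is written in. replay_ratio = fraction of the final stream that is
    replayed old data."""
    rng = random.Random(seed)
    n_replay = round(len(new_docs) * replay_ratio / (1 - replay_ratio))
    mixed = list(new_docs) + rng.choices(replay_docs, k=n_replay)
    rng.shuffle(mixed)      # avoid blocks of all-new or all-old data
    return mixed
```

The shuffle matters: long contiguous runs of new-distribution data are exactly what drives the interference that shows up as forgetting.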
The analogy I keep reaching for — and I'll only use this once — is that the pre-training is building the roads, and incremental updates are just resurfacing specific stretches. You're not laying new infrastructure.
That's actually a reasonable framing. The weights encode something like a connectivity structure between concepts, and new knowledge can slot into that structure if it's presented in a way that respects the existing topology.
What breaks that? What makes a particular knowledge update hard to slot in?
A few things. One is factual contradiction — if the new fact directly contradicts something in the original training data, the model has to resolve a conflict, and that's expensive and often messy. Two is novelty without context — if you're trying to introduce knowledge about something new, something that didn't exist at all before the cutoff, the model has no existing structure to hang it on. There's nowhere to slot it.
That second one seems like a fundamental limit. You can update a fact that the model already has a representation for. But you can't easily introduce a concept that has no prior representation at all.
Which is one reason RAG remains important even if parametric updates improve. For novel entities or events, retrieval gives you a way to surface the information at inference time without needing the model to have internalized it. The parametric update handles the broad knowledge shift; the retrieval layer handles the sharp edges.
The failure mode for RAG being that it doesn't always make things better.
That's understated. There's recent work looking specifically at what retrieval actually does to output quality, and the finding is counterintuitive — RAG can degrade output quality even when the retrieved information is correct. The model has to integrate external context with its parametric knowledge, and if those two sources are in tension, the model sometimes produces worse outputs than if you'd just let it answer from its weights.
You retrieve the right answer, hand it to the model, and the model gets confused by it.
Or the model hedges in ways it wouldn't otherwise, or it over-indexes on the retrieved text and loses coherence. It's not a clean injection of knowledge — it's a prompt engineering challenge every single time. Which is part of why the hybrid approach is appealing but also hard to get right in practice.
If I'm running an AI team and I have a model in production that's eighteen months stale, I have three levers: I can do a targeted incremental pre-training run, I can bolt on RAG, or I can try to do both and manage the integration complexity. And none of them are clean.
The honest answer is that the right architecture depends heavily on the knowledge profile of what you're trying to update. If the staleness is concentrated — you're a legal tech company and the model doesn't know about a regulatory change that affects a specific domain — targeted incremental pre-training is probably your best tool. You can construct a focused corpus, do a relatively contained run, and the model internalizes the update in a way that generalizes across queries.
Whereas RAG would handle the specific document but not the surrounding context.
RAG is good at surfacing a fact. It's less good at updating the model's general understanding of a domain that has shifted. If case law in a particular area has moved significantly over eighteen months, that's not a retrieval problem — that's a knowledge structure problem.
On the OpenAI side of this — because they've been doing something that looks like iterative model updates for a while now — what do we actually know about how they're handling this?
Less than we'd like, honestly. They're not publishing the details. What we can infer from the model version cadence is that they're not doing full pre-training runs every time they release a new version — the economics don't work. The compute cost for GPT-4 was north of a hundred million, and they're releasing updates on a timescale that makes that implausible. So something incremental is happening. Whether it's targeted pre-training, heavy fine-tuning, a hybrid, we don't know.
Which is a slightly uncomfortable position for the field to be in. The frontier labs are running the most sophisticated knowledge update pipelines in existence and the research community is largely reverse-engineering from the outside.
Though to be fair, the academic literature is catching up fast. The ROME and MEMIT work, the MCircKE paper on reasoning gaps, the IBM mid-training results — these are all from the last couple of years, and the quality of the mechanistic understanding has improved dramatically. We know a lot more about why these techniques work or don't than we did even eighteen months ago.
The reasoning gap finding is the one that sticks with me from the MCircKE work. The idea that you can successfully edit a fact into the model's recall — the model will correctly state the updated fact when asked directly — but that fact doesn't propagate into multi-step reasoning. The model knows the new thing but can't think with it.
It's a dissociation between storage and integration. And it tells you something important about what knowledge actually means in these systems. It's not a lookup table. The knowledge is distributed across the weights in a way that's deeply entangled with how the model reasons. You can't just swap one value and expect the downstream logic to update coherently.
Which makes the incremental pre-training approach more compelling than the surgical editing approach for anything beyond trivial factual corrections.
For broad knowledge updates, yes. If you need to change one date or one name, MEMIT-style editing is probably fine. If you need to update a model's understanding of how a technology has evolved over a year, that's a different scope entirely. The parametric update needs to be proportional to the conceptual scope of what's changed.
The future trajectory here — is this a problem that gets more tractable or less tractable as models get larger?
Probably more tractable in terms of relative cost, counterintuitively. Larger models tend to have more redundancy in how knowledge is represented, which means incremental updates are less likely to catastrophically disrupt existing structure. The interference problem gets somewhat easier to manage. But the absolute cost of even a targeted run scales with model size, so larger models still require more resources to update.
The frontier labs have a structural advantage that compounds. They can afford the incremental runs that mid-tier players can't, and their models are also more robust to those updates.
That's a fair read. Though the LoRA and parameter-efficient fine-tuning work is narrowing that gap for some use cases. If you can get meaningful knowledge updates through adapter layers rather than full weight updates, the compute cost drops by another order of magnitude. The question is whether the knowledge actually sticks in a way that generalizes, and the evidence there is mixed.
Mixed being a polite way of saying it works until it doesn't.
The benchmark numbers can look good while the real-world generalization is quietly failing in ways that don't show up until you hit an edge case.
Right, and that’s the crux of it. Given all of that — the reasoning gaps, the mixed evidence on LoRA, the hybrid complexity — what does someone actually do with this if they’re trying to make practical decisions today?
Start with the diagnosis. Before you touch your training setup, you need to understand what kind of staleness you're dealing with. Is it factual — specific entities, events, dates that have changed? Or is it structural — the model's understanding of how a whole domain works has drifted? Those require different interventions.
Most teams probably conflate them.
They reach for RAG because it's faster to deploy, and then discover six months later that the model is retrieving correct documents and still producing subtly wrong outputs because its underlying domain model hasn't updated.
The practical sequence is: characterize the staleness, then choose the tool.
If it's narrow and factual, RAG or MEMIT-style editing can handle it. If it's broad and structural, you need a targeted incremental run with a carefully constructed replay buffer. And if you're going the incremental route, the data construction is where you spend your time — the corpus mix matters more than almost any hyperparameter decision.
The IBM mid-training result is useful there as a benchmark. Three to four times reasoning improvement from a contained, well-constructed intermediate phase. That's the ceiling you're aiming for if you do it right.
The floor if you do it badly is catastrophic forgetting on capabilities you need. So test continuously. Don't wait until the run is finished to evaluate degradation on your baseline benchmarks.
For teams that can't afford any kind of pre-training run, even incremental?
Invest in your retrieval architecture. Better chunking strategies, better re-ranking, better context window management. RAG's failure modes are mostly engineering problems, not fundamental limits. You can mitigate a lot of the output degradation with careful prompt design around how retrieved context gets presented to the model.
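One of those engineering levers — chunking — is simple enough to sketch. Sliding windows with overlap reduce the odds that a fact gets split across a chunk boundary and never retrieved whole. Character counts stand in for tokens here, and real pipelines would also respect sentence boundaries; both simplifications are ours:

```python
def chunk_text(text, chunk_size=400, overlap=80):
    """Sliding-window chunking for retrieval: consecutive chunks share
    an `overlap`-sized region so boundary-straddling facts appear whole
    in at least one chunk. Sizes are characters for simplicity."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Tuning `chunk_size` and `overlap` against your retrieval evals is exactly the kind of unglamorous work that closes most of RAG's practical failure modes.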
Follow the literature. The MCircKE and IBM mid-training work both dropped recently — this is a field moving fast enough that what looks like a hard constraint today has a reasonable chance of being partially solved in twelve months. Twelve months might feel optimistic, but it’s not crazy.
The mechanistic understanding is improving fast enough that I'd take that bet.
Which is maybe the most useful thing to leave people with. This is a solvable engineering problem. It's not a fundamental limit of the architecture. The knowledge cutoff exists because of economics and logistics, not because there's something about transformers that makes updating them impossible. The field is actively closing the gap.
The open question being whether the solution looks like better parametric update methods, or whether it looks like hybrid systems where the parametric base stays relatively static and retrieval handles the freshness layer. Those lead to pretty different infrastructure decisions.
Pretty different research priorities. If you believe the hybrid future is the right one, you invest in retrieval engineering and context integration. If you believe parametric updates will get cheap enough to run continuously, you invest in the training infrastructure. Right now both bets are live.
I know which one I'd rather be right about. Continuous parametric updates sound a lot cleaner than managing a retrieval stack that occasionally hands the model a correct answer it can't use.
The dream is a model that updates the way we do. Incrementally, continuously, without forgetting what it already knew.
Which is just describing a brain.
Which is just describing a brain. We've been trying to build one for seventy years. Incremental pre-training is one more step in that direction.
A very expensive step.
A slightly less expensive step than last year.
Big thanks to Hilbert Flumingtop for producing this one. And to Modal for keeping our compute pipeline running without us having to think about it too hard.
This has been My Weird Prompts. If you've got a moment, a review on Spotify goes a long way toward getting this in front of more people.
We'll see you next time.