Daniel sent us this one, and it's a question that sounds simple until you actually try to answer it. What does it mean to train a frontier large language model? Not in the hand-wavy sense that press releases use, but actually — what are the stages, what costs what, what does a lab actually do when they say they've built a new model? Because when OpenAI announced GPT-5, the word training got thrown around like it describes one thing. It doesn't. There's foundational pretraining from scratch on trillions of tokens, continued pretraining on newer or domain-specific corpora, mid-training and annealing phases, and then the whole post-training stack: supervised fine-tuning, RLHF, RLAIF, DPO, preference tuning. And the distinction matters — for understanding what these models can actually do, what safety claims are really grounded in, and frankly for cutting through the marketing.
I'm Herman Poppleberry, and yes — this one has been sitting in my reading pile for a while. There's a Kanerika piece from earlier this year and a fairly comprehensive arXiv survey on post-training that I think reframes how most people think about the cost structure. Because the intuition most people have is wrong in a very specific and interesting way.
By the way, today's episode is being written by Claude Sonnet four point six. Just flagging it.
Our friendly AI down the road. Good to know it's earning its keep. Right, so — the intuition that's wrong. Most people, when they hear that OpenAI trained GPT-5, picture something like a factory starting up from nothing. Raw materials in, finished model out. One big continuous process.
Which is a reasonable first guess if you've never looked under the hood.
And it's also almost entirely inaccurate. What actually happens is closer to a relay race with very different legs — some extraordinarily expensive, some relatively cheap, and each one shaping the model in fundamentally different ways. And which leg you're on determines almost everything: what data you use, what compute you need, what you're actually optimizing for.
The reason this matters beyond just technical curiosity — when a lab makes a safety claim, or a capability claim, or says a model is aligned in some specific sense, knowing which stage of training produced that property is the difference between understanding the claim and just accepting it.
The alignment work happens almost entirely in post-training. The world knowledge, the general reasoning substrate — that's pretraining. If you conflate them, you can't evaluate anything a lab tells you.
Let's build this up properly. Start from the ground.
The ground floor is pretraining. And I mean that almost literally — it's the foundation everything else is built on. You're taking a model with randomly initialized weights and exposing it to an enormous corpus of text, on the order of trillions of tokens, and the model is learning, through next-token prediction, essentially everything it will ever know about language, facts, reasoning patterns, the structure of the world. That's it. That's the whole pretraining objective. Predict the next token. Over and over, across trillions of examples.
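The objective Herman is describing really is that small. Here's a toy sketch of the next-token cross-entropy loss over a four-token vocabulary — an illustration of the training signal, nothing resembling a production implementation:

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for a single next-token prediction.
    logits: raw scores over the vocabulary; target_id: the true next token."""
    # softmax over the vocabulary, computed stably by subtracting the max logit
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    prob_of_target = exps[target_id] / sum(exps)
    # the training signal: negative log-probability of the actual next token
    return -math.log(prob_of_target)

# A model that puts high probability on the right token gets low loss...
confident = next_token_loss([5.0, 0.0, 0.0, 0.0], target_id=0)
# ...while a model that's uniformly uncertain gets loss = log(vocab_size).
uniform = next_token_loss([0.0, 0.0, 0.0, 0.0], target_id=0)
```

Everything in pretraining is this loss, minimized over trillions of tokens; the scale, not the objective, is where the complexity lives.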
Which sounds almost embarrassingly simple for something that costs hundreds of millions of dollars.
The simplicity of the objective is genuinely one of the more surprising things about it. But the scale is what makes it work, and scale is what makes it brutal. We're talking weeks of continuous compute on tens of thousands of GPUs. The cost estimates for a serious foundational pretraining run are north of a hundred million dollars, and for the very largest efforts, considerably more than that.
When a lab does that, what they're producing at the end is a checkpoint.
Right, a checkpoint is the artifact. It's the model weights at a given point in training — a snapshot of everything the model has learned up to that moment. And that checkpoint becomes extraordinarily valuable, because everything downstream builds on it. Continued pretraining, mid-training, all of the post-training work — none of it starts from scratch. It starts from that checkpoint.
Which is why labs guard them so carefully. The checkpoint is essentially the accumulated cost of the entire pretraining run, compressed into a file.
A very large file, but yes. And once you have a solid pretrained checkpoint, the subsequent stages are operating on something that already has a rich internal representation of language and knowledge. The continued pretraining phase — where you might push in newer data or domain-specific corpora — is refining and extending that, not rebuilding it. And then mid-training and annealing are about stabilizing the model, shaping the loss landscape before you hand it off to the post-training stack.
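Mechanically, a checkpoint is just the serialized weights plus enough metadata to resume. A toy version — real checkpoints are sharded tensor files, not pickled dicts, but the shape of the artifact is the same:

```python
import os
import pickle
import tempfile

# Toy "model": a dict of named weight lists standing in for tensors.
weights = {"embed": [0.1, -0.2], "layer0.w": [0.5, 0.5], "lm_head": [0.0, 1.0]}

def save_checkpoint(path, weights, step):
    # A checkpoint bundles the weights with enough metadata to resume training.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "weights": weights}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt_step1000.pkl")
save_checkpoint(path, weights, step=1000)
ckpt = load_checkpoint(path)
```

Every downstream stage — continued pretraining, annealing, the whole post-training stack — begins with the equivalent of that `load_checkpoint` call.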
The stages aren't just sequential in time. They're sequential in what they're actually doing to the model.
That's a useful way to put it. Each stage has a different target. Pretraining is about breadth and knowledge acquisition. Continued pretraining is about currency and specialization. Mid-training is about stability and convergence. And post-training — which is SFT, RLHF, RLAIF, DPO, all of that — is about behavior. How the model responds, what it refuses, how it reasons through instructions. That's where the model you actually talk to gets shaped.
Right, and that behavioral shaping is where most of the public discourse lives, ironically. But before we get there, let’s back up — what does foundational pretraining actually cost in practice? I want people to have a concrete sense of why labs don’t just do this every eighteen months.
The numbers that have come out around GPT-5's foundational run are striking. Reports put the compute cost above five hundred million dollars. That's not the total project cost, that's the compute bill for the pretraining run itself. Tens of thousands of H100s, running continuously for weeks, and you're still not guaranteed to get something usable at the end. There are training instabilities, loss spikes, data quality issues that only surface at scale. You can lose days of compute to a single bad batch of data.
Loss spikes being — for anyone who hasn't seen this discussed — sudden jumps in the training error that can destabilize the whole run.
Right, and at that scale they're catastrophic. You can't just restart from the beginning. What you do is roll back to the last stable checkpoint and try to diagnose what caused it. Which is part of why checkpoint management is such a serious operational discipline at these labs. You're checkpointing constantly, not just at the end.
The checkpoint isn't one artifact. It's a whole trail of them.
And the team is watching the loss curves obsessively, looking for signs of instability before they become full spikes. There's a lot of craft in running a clean pretraining job. The objective is simple; the execution is not.
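The monitor-checkpoint-rollback loop they're describing can be sketched in a few lines. The spike threshold and checkpoint interval here are invented for illustration — real monitoring is far more elaborate — but this is the core control flow:

```python
def detect_spike(loss_history, window=5, threshold=2.0):
    """Flag a spike when the latest loss exceeds the recent average by a
    multiplier. The window and threshold are illustrative, not real values."""
    if len(loss_history) <= window:
        return False
    recent = loss_history[-window - 1:-1]
    avg = sum(recent) / len(recent)
    return loss_history[-1] > threshold * avg

def training_loop(losses, checkpoint_every=3):
    """Walk a stream of per-step losses, checkpointing periodically and
    rolling back to the last stable checkpoint when a spike appears."""
    last_good_step = 0
    history = []
    for step, loss in enumerate(losses, start=1):
        history.append(loss)
        if detect_spike(history):
            return ("rolled_back_to", last_good_step)
        if step % checkpoint_every == 0:
            last_good_step = step
    return ("completed", len(losses))
```

A healthy run completes; a run whose loss suddenly jumps gets rolled back to the most recent checkpoint rather than restarted from scratch.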
Which also means that when a lab says they're training a new model, the question worth asking is: are they running a new foundational pretraining, or are they starting from an existing checkpoint? Because those are wildly different claims.
The honest answer, most of the time, is the latter. Foundational pretraining from scratch is rare precisely because it's so expensive and risky. What you see far more often is continued pretraining — taking a checkpoint from a previous run and extending it with newer data, domain-specific corpora, or data that was unavailable or underrepresented in the original run.
How different is that process mechanically?
The mechanics are actually very similar — you're still doing next-token prediction, still updating weights. But the starting point is completely different. You're not initializing randomly, you're starting from a model that already has a rich internal representation. The learning rates are typically much lower, the data mixture is more curated, and you're trying to inject new knowledge without catastrophically forgetting what the model already knows.
Catastrophic forgetting being the thing where updating on new data overwrites previously learned representations.
Which is a real and nasty problem. If you just hammer a pretrained checkpoint with a narrow domain corpus at a high learning rate, you can degrade general performance quite badly. The Anthropic work with Claude 5 and medical text is a good example of why this has to be done carefully. You want the model to absorb the specialized knowledge without losing its general reasoning capability. That requires careful data mixing — you're not just feeding it medical papers, you're interleaving them with general text to maintain the balance.
Continued pretraining is as much about what you don't destroy as what you add.
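The data-mixing idea is simple to sketch: sample each training example from the domain corpus with some fixed probability and from general text otherwise, so the specialized data never swamps the original distribution. The 30/70 split below is a made-up illustration, not any lab's published recipe:

```python
import random

def mixed_batches(domain_docs, general_docs, domain_frac=0.3,
                  n_batches=1000, seed=0):
    """Yield examples with a fixed fraction of domain data interleaved with
    general text, injecting new knowledge without overwriting the old mix."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        if rng.random() < domain_frac:
            yield rng.choice(domain_docs)
        else:
            yield rng.choice(general_docs)

domain = ["medical paper A", "medical paper B"]
general = ["news article", "fiction excerpt", "forum thread"]
batch = list(mixed_batches(domain, general, domain_frac=0.3, n_batches=10000))
observed_frac = sum(doc in domain for doc in batch) / len(batch)
```

In practice this mixing is combined with the much lower learning rates Herman mentioned — both are hedges against catastrophic forgetting.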
That's a good way to put it. And then sitting between continued pretraining and the full post-training stack, you have what's often called mid-training or annealing. This is where the learning rate gets gradually reduced, the data mixture often shifts toward higher-quality curated sources, and the model's loss landscape gets smoothed out before you hand it off to the fine-tuning teams.
Annealing as in cooling something down slowly so it doesn't crack.
The metallurgy analogy is actually pretty apt here. You're reducing the rate of change deliberately so the model settles into a stable configuration. If you go straight from aggressive pretraining into supervised fine-tuning without that stabilization phase, the post-training work can be erratic. The model hasn't fully converged, so small interventions produce unpredictable results.
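The "reducing the rate of change" part is usually implemented as a learning-rate schedule. A common shape is cosine decay — the specific rates below are placeholders, since labs don't publish theirs:

```python
import math

def cosine_anneal(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Cosine decay of the learning rate from lr_max down to lr_min over the
    annealing phase. lr_max and lr_min here are illustrative values."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

start = cosine_anneal(0, 1000)      # full learning rate at the start
middle = cosine_anneal(500, 1000)   # halfway: midpoint of the two rates
end = cosine_anneal(1000, 1000)     # fully annealed, small and stable
```

By the end of the schedule each update nudges the weights only slightly, which is exactly the "settling into a stable configuration" in the metallurgy analogy.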
This is all before a single human preference label has been collected.
Everything we've talked about so far — foundational pretraining, continued pretraining, annealing — none of it involves human feedback in the RLHF sense. It's all self-supervised. The model is learning from the structure of the data itself. The human signal comes later, and that's where the cost equation flips completely.
The cost equation flips. Walk me through that.
The numbers are almost disorienting when you put them side by side. A foundational pretraining run — we said north of a hundred million, sometimes five hundred million for the biggest efforts. A full post-training pipeline, including supervised fine-tuning, RLHF, and whatever preference optimization you're running on top — you're looking at a fraction of that. Often two or three orders of magnitude cheaper.
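To make "two or three orders of magnitude" concrete, here's the back-of-envelope arithmetic with purely illustrative figures — neither number is disclosed precisely by any lab:

```python
# Illustrative numbers only -- nothing here is a disclosed figure.
pretraining_cost = 500_000_000    # a reported-scale foundational run, USD
post_training_cost = 2_000_000    # a plausible full post-training pipeline, USD

ratio = pretraining_cost / post_training_cost
orders_of_magnitude = len(str(int(ratio))) - 1
# For the price of one pretraining run, the same budget funds this many
# complete post-training pipelines:
experiments_per_run = int(ratio)
```

On those assumed numbers, one foundational run buys you a couple hundred full post-training experiments — which is the economic fact driving everything that follows.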
Which raises an obvious question. If post-training is that much cheaper, why not just do it constantly? Iterate aggressively, ship updates every few weeks?
Some labs essentially do. And that's the thing people miss when they see a new model announcement. When OpenAI ships GPT-5.4, or Anthropic pushes an update to Claude Opus, the question is almost never "did they run a new pretraining?" The answer is almost certainly no. What changed is the post-training stack applied to an existing checkpoint.
The checkpoint is the expensive thing. The post-training is the iteration layer.
The checkpoint is the expensive thing. And once you have a good one, it becomes this incredibly leveraged asset. You can run dozens of post-training experiments on it — different SFT mixes, different reward models, different preference datasets — and each one is relatively cheap compared to what it cost to produce the checkpoint in the first place.
Let's actually define what we mean by supervised fine-tuning, because SFT gets thrown around a lot and I think people have a vague sense of it without understanding the mechanics.
SFT is where you take the pretrained checkpoint and train it on a curated dataset of input-output pairs. The model has already learned the structure of language and a huge amount of world knowledge. Now you're showing it examples of the kind of responses you want it to produce. "Given this question, produce this kind of answer." It's still gradient descent, still weight updates, but the data is labeled and curated rather than raw scraped text.
You're essentially demonstrating the behavior you want, rather than letting it emerge from scale.
And SFT alone can produce a dramatically more useful model. The pretrained base is often described as a document completer — it'll continue text in whatever style the prompt implies, which is not what you want in a deployed assistant. SFT is what turns the document completer into something that actually follows instructions.
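One mechanical detail worth seeing: in common SFT setups the loss is computed only on the response tokens, with the prompt masked out, so the model learns to produce the demonstrated answer rather than to reproduce the question. A toy version:

```python
def sft_loss(token_losses, loss_mask):
    """Average next-token loss over response tokens only. Prompt positions
    carry mask 0 and contribute nothing to the gradient."""
    masked = [l for l, m in zip(token_losses, loss_mask) if m == 1]
    return sum(masked) / len(masked)

# Example sequence: 3 prompt tokens (masked out), 3 response tokens (trained on).
per_token = [2.1, 1.9, 2.4, 0.8, 0.6, 0.7]
mask      = [0,   0,   0,   1,   1,   1]
loss = sft_loss(per_token, mask)
```

Same gradient descent as pretraining, but the supervision is pointed entirely at the behavior you demonstrated.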
SFT has limits.
The quality of SFT is bounded by the quality of your demonstration data, and humans writing ideal responses at scale is expensive and inconsistent. You also can't easily demonstrate behaviors like "know what you don't know" or "refuse this specific class of request" through demonstrations alone. That's where reinforcement learning from human feedback comes in.
The thing that made ChatGPT feel like ChatGPT.
The thing that made ChatGPT feel like ChatGPT, yes. The core idea is that instead of showing the model what to produce, you show it pairs of outputs and ask human raters which one they prefer. You train a separate reward model to predict those preferences, and then you use reinforcement learning to push the main model toward outputs the reward model scores highly.
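The reward model at the center of that loop is typically trained with a Bradley-Terry style objective on the preference pairs — push the preferred output's score above the rejected one's. A minimal sketch of that loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss for training a reward model on one human
    preference pair: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good_margin = preference_loss(2.0, -1.0)   # reward model agrees with the rater
bad_margin = preference_loss(-1.0, 2.0)    # reward model disagrees: high loss
```

The main model is then optimized with RL against that learned scorer — which is exactly where the gaming Herman describes next becomes possible.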
Which is elegant in principle and deeply messy in practice.
The reward model can be gamed. The main model learns to produce outputs that score well on the reward model without necessarily being better in any meaningful sense. You get this phenomenon called reward hacking, where the model figures out that certain surface patterns — a particular length, a certain hedging style, specific phrases — correlate with high reward scores, and it overproduces them.
There was actually reporting on this with GPT-5.x, where the RLHF tuning tightened safety constraints and the observed effect was shorter responses and more refusals. Which is not the same thing as the model being more aligned — it's the model having learned what gets rewarded in the evaluation context.
That's a really important distinction. Safety as a genuine property of the model's reasoning versus safety as a behavioral pattern the model has learned to perform. RLHF can produce the latter without reliably producing the former. And the two look identical in most evaluations, which is part of why the training stage question matters so much for interpreting safety claims.
What comes after RLHF in the modern stack?
A few things have emerged as refinements. RLAIF — reinforcement learning from AI feedback — replaces the human raters with another model. You use a more capable model to score outputs rather than paying humans to do it, which dramatically reduces cost and allows you to scale the feedback loop. The quality depends on the quality of the scoring model, but for many tasks it's surprisingly good.
DPO — direct preference optimization — which I know you've looked at.
DPO is interesting because it sidesteps the reward model entirely. Instead of training a separate model to predict preferences and then doing RL against it, DPO directly optimizes the policy using the preference data. Mathematically it's equivalent to a certain formulation of RLHF, but it's more stable and considerably simpler to implement. Anthropic has leaned into this — the Claude 5 work on ethical decision-making reportedly used DPO specifically because it gave them more predictable behavior than the full RLHF pipeline.
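The DPO loss itself is compact enough to write down. It works on log-probabilities of the chosen and rejected responses under the policy and a frozen reference model — the implicit reward is beta times the policy-versus-reference log-ratio, and no separate reward model is ever trained. The inputs below are invented numbers for illustration:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given log-probs of the chosen and
    rejected responses under the policy (pi_*) and the frozen reference
    model (ref_*). beta controls how far the policy may drift."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# If the policy already prefers the chosen response more strongly than the
# reference does, the loss dips below log(2); at parity it sits exactly there.
improving = dpo_loss(-5.0, -9.0, -6.0, -8.0)
neutral = dpo_loss(-6.0, -8.0, -6.0, -8.0)
```

Minimizing this directly on preference data is the whole method — the reward-model-plus-RL machinery collapses into one supervised-looking objective.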
More predictable in what sense?
Less prone to reward hacking. Because you're not training a separate reward model that can be gamed, the optimization is more directly tied to the actual preference signal. You still get the surface behaviors the model has learned, but the training dynamic is cleaner. There's also a newer variant called Group Relative Policy Optimization — GRPO — which has shown up in recent research as a way to simplify RLHF further by eliminating the value model that usually sits alongside the reward model.
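The value-model elimination in GRPO comes from a simple trick: sample a group of responses to the same prompt and use each reward's deviation from the group mean, normalized by the group standard deviation, as the advantage. A sketch of just that computation:

```python
def group_relative_advantages(rewards):
    """GRPO's core idea: advantages are z-scores of the rewards within a group
    of responses to the same prompt, so no learned value model is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against identical rewards in a group
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one prompt, scored by the reward model.
adv = group_relative_advantages([1.0, 3.0, 2.0, 2.0])
```

Responses above the group average get positive advantage, below-average ones negative — the group itself serves as the baseline a value model would normally provide.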
The post-training field is actively evolving even as the pretraining side has largely stabilized around the transformer architecture and next-token prediction.
That's a good characterization. Pretraining has converged on a relatively stable paradigm. The interesting methodological action is almost entirely in post-training right now. Which is also why the cost asymmetry matters strategically — labs can afford to experiment aggressively in post-training because the experiments are cheap relative to what's already been spent on the checkpoint.
This shapes what gets published. If you look at what's in the research literature, post-training methods dominate. RLHF variants, preference optimization, constitutional AI, all of that. The pretraining details — the actual data mixtures, the FLOPs, the hardware configurations — almost none of that gets disclosed.
The asymmetric transparency problem. Labs are very forthcoming about their post-training innovations because publishing them doesn't reveal the expensive proprietary asset, which is the pretrained checkpoint. But they're almost completely opaque about pretraining specifics because that's where the real competitive moat is.
Which means the public's mental model of how these models work is almost entirely constructed from the parts labs have chosen to make public.
Those parts are systematically the cheaper, more iterable parts. So when someone reads that Claude 5 used DPO for ethical reasoning, they're getting a real piece of information. But they're not getting any visibility into what the pretrained checkpoint looked like, what data went into it, or how much of the model's ethical behavior is actually a property of the pretraining corpus versus the post-training intervention.
That's an uncomfortable epistemic position for anyone trying to evaluate safety claims.
And I don't think there's an easy fix. The checkpoints are proprietary assets representing enormous investment. Labs aren't going to publish them. But it does mean that the safety and alignment claims you hear in model announcements are almost always describing post-training properties, tested against post-training benchmarks, on a foundation that's largely opaque.
The next time someone says a model is aligned, the useful follow-up question is: aligned according to what intervention, applied to what checkpoint, evaluated how?
That's the question. And most model cards don't answer it.
Given all of that, what do we actually want listeners to walk away with? Because there's a practical layer here that I think gets buried under the technical detail.
The most immediately useful thing is probably this: when a lab announces a new model, the first question worth asking is where in the training pipeline the change actually happened. Because the answer tells you something real about what kind of change it is. A new pretrained checkpoint is a different model at the level of world knowledge and latent capability. A post-training update on an existing checkpoint is a behavioral change — often a significant one, but a different kind of thing entirely.
Those two things get announced with identical fanfare.
Identical fanfare, often identical language. "We trained a new model." Well, yes, technically. But trained how, starting from where?
The cost asymmetry is actually a useful heuristic here. If a lab is shipping major updates every few months, they are almost certainly not running new foundational pretraining each time. The economics don't allow it. So whatever is changing between those releases is post-training — SFT mixes, reward model updates, preference data, RLAIF at scale. That's meaningful, but it's a different category of change than a new pretraining run.
That distinction matters for how you interpret capability claims specifically. Post-training can dramatically change what a model will and won't do. It can improve instruction following, refine tone, reduce certain failure modes. What it can't do is give the model knowledge it didn't acquire during pretraining. If the underlying checkpoint was trained on data through a certain point, no amount of post-training adds facts beyond that boundary.
Which is why the knowledge cutoff question is fundamentally a pretraining question, not a post-training one.
And on the alignment side — and I want to be careful here because this isn't cynicism, it's just precision — when you see a safety claim attached to a model release, the useful follow-up is whether that claim is describing a pretraining property or a post-training intervention. Because those have very different robustness profiles. Post-training alignment can be robust, but it can also be brittle in ways that pretraining-level properties generally aren't.
The model card gap. Most of what you'd need to evaluate that claim isn't in the model card.
Which is not necessarily bad faith on the labs' part. Some of it is proprietary. Some of it is that the field doesn't yet have consensus on how to characterize pretraining-level versus post-training-level properties in a way that's legible to non-specialists. But the gap is real, and knowing it exists is itself useful.
The practical upshot is something like: read model announcements with the training stage in mind, distinguish capability claims from behavioral claims, and treat safety characterizations as describing post-training interventions unless the lab explicitly says otherwise.
That's a reasonable working posture. And I'd add — pay attention to what the labs are publishing in their research papers. The post-training methods get published because they can be published without revealing the checkpoint. That literature is actually quite informative about what labs are prioritizing and where they think the problems are. The pretraining details won't be there, but the alignment challenges they're trying to solve through post-training are often described quite honestly.
The thing they'll tell you is what they're trying to fix. Which tells you something about what wasn't fixed in the pretraining.
That's a sharper way of putting it than I would have, but yes.
Which makes the research literature a kind of indirect confession. Here's the problem we couldn't solve in pretraining, here's the post-training patch we applied.
I hadn't framed it that way before, but that's essentially what it is. And it's not a criticism — patching is legitimate engineering. But the patch is visible in the papers even when the underlying wound isn't.
The question I keep coming back to is what happens when the models get substantially larger. If a foundational pretraining run already costs north of five hundred million dollars, and the next generation pushes that into the billions, the checkpoint becomes even more irreplaceable. You'd be even less willing to throw it away and start over. Which means the entire field converges further toward post-training as the primary lever.
That has architectural implications that I'm not sure anyone has fully worked out yet. If the pretrained checkpoint is essentially permanent — too expensive to replace, too foundational to modify — then all the interesting work happens on top of it. You get a kind of sedimentary structure. The pretraining is the bedrock, and everything else accumulates in layers above it. The question is whether that's a stable long-term structure or whether you eventually hit a ceiling on what post-training can accomplish given a fixed base.
There's a version of that ceiling that's already showing up, arguably. If the behavioral improvements from post-training are compounding on a knowledge base that isn't being updated, you get models that are increasingly polished in how they communicate things that are increasingly stale.
That's the tension. And I don't think the labs have a clean answer. Continued pretraining on newer corpora addresses part of it, but that's its own expensive operation, and stitching new knowledge into an existing checkpoint without degrading what's already there is hard. The catastrophic forgetting problem doesn't disappear just because you're doing continued pretraining rather than full pretraining.
The frontier is messier than the announcements suggest. Which is probably always true, but it's useful to say out loud.
And I think the listeners who understand the training taxonomy are better positioned to read those announcements critically. Not cynically — critically. There's real progress happening. But progress in post-training and progress in pretraining are different things, and conflating them produces a distorted picture of where the field actually is.
Good place to leave it. Thanks to Hilbert Flumingtop for producing, and to Modal for keeping our compute bills survivable — if you're running inference workloads, they're worth a look. This has been My Weird Prompts. Find us at myweirdprompts.com, and if you're enjoying the show, a review goes a long way.
Until next time.