Daniel sent us this one, and it's a question I think a lot of people have been quietly sitting with. Everyone knows the Claude lineup by now — Haiku, Sonnet, Opus, the hierarchy of cost and capability, roughly speaking faster-cheaper through to slower-smarter. But Daniel's asking what's actually underneath that. Are these three genuinely distinct models with different architectures, different training approaches, different design philosophies? Or is it more like you built one thing and then dialed the knobs? And with Opus having dropped in January, there's now a full generation of all three to actually compare. That's what we're getting into today.
It's a better question than it might sound on the surface, because the answer has real consequences for how you build with these things. If they're scaled variants, you'd expect a kind of smooth capability gradient — same strengths, same weaknesses, just more or less of everything. If they're distinct architectures, then you'd expect qualitative differences. Things one model is structurally better at that another simply can't replicate by being larger.
Right, it's not just academic. The answer changes how you pick the right tool.
Oh, and by the way — today's script is courtesy of Claude Sonnet four point six, so there's a certain poetic recursion happening here.
The model is writing about itself. I feel like we should light a candle or something.
We should probably just get into it before the existential weight sets in.
Let's start at the architecture level.
The surface-level framing most people have is: Haiku is cheap and fast, Sonnet sits in the middle, Opus is the powerful expensive one. Which is true, but it's also a bit like describing three cars by their fuel costs.
You're saying there's an actual machine worth looking at under the hood.
And the honest answer to Daniel's question — distinct architectures versus scaled variants — is that it's both, depending on which layer you're examining. At the training objective level, all three share the same fundamental constitutional approach, the same RLHF lineage, the same Anthropic safety philosophy baked into the base. So in that sense, yes, family resemblance runs deep.
The parameter counts and architectural choices diverge meaningfully. Haiku is engineered for sub-two-hundred-millisecond inference. That's not just a smaller Opus with some layers stripped out — the attention mechanism is structured differently to prioritize throughput. Opus, which landed in January, is sitting around a trillion parameters. That's not a dial turned up; that changes what kinds of representations the model can even form.
The question of "is it one model or three" depends on what level of abstraction you're asking at.
Philosophy of training: one family. Actual computational architecture: meaningfully distinct. And that distinction matters because it determines whether the capability differences are quantitative — more of the same — or qualitative, meaning Opus can do things Haiku structurally cannot, not just slower or worse but cannot.
It reminds me a little of how people used to talk about the difference between a RISC and a CISC processor. On the surface you're comparing chips that both run code, but the fundamental design philosophy is different enough that they excel at different classes of problems. It's not that one is a better version of the other.
That's actually a really clean analogy. RISC processors — the kind that ended up in your phone — are optimized for doing simple operations extremely fast and efficiently. CISC chips, the kind that traditionally powered desktops, were built to handle complex instructions in fewer steps. Neither is universally superior. The question is always what workload you're running. And the same logic applies here. The mistake is assuming there's a single axis of "better."
Which has very practical consequences for anyone building something on top of these models.
That's where it gets interesting for developers. The cost-capability tradeoff isn't linear, and if you assume it is, you'll make poor architectural decisions — especially when certain structures inherently limit what you can achieve.
Right, and that brings us to the phrase "structurally cannot." What does that actually mean in practice? Because that's the claim that needs unpacking.
Right, so let's take Haiku first. The attention mechanism in Haiku is optimized around what you might call sparse attention patterns — it's not attending to everything in the context window with equal weight, it's making aggressive choices about what's relevant. That's how you get sub-two-hundred-millisecond inference. The tradeoff is that long-range dependencies across a very large context window become harder to track. It's not that Haiku is dumb, it's that the architecture is making a deliberate bet: most queries don't need deep cross-context reasoning, and for those queries, being fast is worth more than being thorough.
Which is probably true for something like a customer support chatbot handling ten thousand queries a day.
Completely true for that use case. Where it breaks down is something like: here are forty pages of contract language, find me the clause that contradicts this other clause buried on page thirty-two. That kind of task requires the model to hold a lot of distant context in working memory simultaneously, and Haiku's attention structure isn't built for that.
How does that actually manifest in the output though? Like, does Haiku just return nothing, or does it confidently give you a wrong answer?
That's the dangerous version — it gives you a confident wrong answer. It doesn't say "I can't do this." It finds something plausible-looking in the document and surfaces it. The failure is silent. Which is why the failure mode distinction matters so much. If the model returned an error, you'd know to escalate. Instead it returns something that looks reasonable until someone with domain expertise reads it carefully and realizes the clause it cited doesn't actually contradict anything.
It's not slower or less confident. It's wrong in a different way. And if you're not testing for that specifically, you might not catch it until it's caused a real problem downstream.
And that failure mode distinction is actually what should drive the model selection conversation for developers, more than the price sheet does.
It's not a speed-accuracy tradeoff in the simple sense. It's more that the architecture was designed with a specific class of problems in mind.
That's the cleaner way to say it. And Sonnet sits in an interesting middle position that I think gets undersold. It's not just Haiku with more parameters. The attention heads in Sonnet appear to be structured for what I'd call multi-pass reasoning — the model can revisit earlier context more fluidly. You see this in coding tasks particularly. Sonnet tends to hold the structure of a problem in a way that Haiku loses track of after a certain complexity threshold.
Is that a trained behavior or an architectural one?
Probably both, and they're hard to disentangle from the outside. Anthropic hasn't published detailed architecture papers the way some other labs have, so a lot of this is inferred from benchmark behavior and developer reports. But the inference latency profiles are consistent with different attention configurations, not just parameter scaling.
Opus is where it gets qualitatively different in a way that's hard to dismiss as just "bigger." The trillion-parameter scale isn't the interesting part on its own — parameter counts are a blunt instrument for measuring capability. What matters is what that scale enables in terms of representation. Complex multi-step reasoning tasks, the kind where you have to hold an intermediate conclusion and then revise it based on something you encounter three steps later, those are where Opus shows a meaningful gap over Sonnet, not just a marginal one.
What's a concrete example of that gap showing up?
Legal analysis is a good one. There have been developer case studies, including some published on the Anthropic documentation side, where Opus catches logical inconsistencies across a long document that Sonnet misses — not because Sonnet lacks the vocabulary or the legal knowledge, but because the reasoning chain required is long enough that the representation starts to degrade. Opus sustains it. That's a qualitative difference, not a quantitative one.
There's actually an interesting parallel in how humans handle this kind of task. Working memory research has this concept of cognitive load — the idea that there's a hard ceiling on how many distinct pieces of information you can actively manipulate at once. When a reasoning chain exceeds that ceiling, human experts start making the same kind of silent errors. They don't know they've dropped a thread; they just reconstruct something plausible from what they can still hold. What you're describing in Haiku sounds structurally similar.
That's a really useful frame. And it actually explains why the errors are so hard to catch from the outside. The model isn't flagging uncertainty because from its own perspective it hasn't dropped anything — it's generated a coherent answer. The incoherence only becomes visible when you compare it against the full context it was supposed to be reasoning over. Same thing happens with human experts under time pressure or cognitive overload. The output looks confident because the person generating it doesn't have access to the gap.
The misconception to bust here is that you can just swap Haiku in for Opus on a complex task and expect a proportionally worse result. You might get a categorically different kind of failure.
And that's where I want to shift, because the practical question of when to use which model is non-obvious if you think it through carefully. The default heuristic most people start with is: Haiku for cheap stuff, Opus for hard stuff, Sonnet when you're not sure. And that's not wrong exactly, but it papers over some real design decisions.
Give me a case where that heuristic leads you astray.
Customer-facing chatbots are the obvious Haiku use case, and usually it is the right call. But here's where teams get burned: they prototype the chatbot with Sonnet because it's easier to iterate with, everything works great, they swap to Haiku for production to cut costs, and suddenly edge cases they never noticed are failing. Not because the queries are complex, but because they'd accidentally tuned their prompts to exploit Sonnet's multi-pass reasoning. The prompt engineering that works beautifully on Sonnet can fall apart on Haiku in ways that aren't obvious until you're looking at user complaints.
The model choice has to be upstream of the prompt engineering, not downstream.
That's the discipline that enterprise teams have had to build in. And Sonnet's adoption in enterprise workflows reflects exactly that. It's become the default for a lot of production deployments not because it's the most capable, but because it's the most predictable across a wide range of query types. You're not going to hit the ceiling Haiku has on complex reasoning, and you're not paying Opus rates for tasks that don't need sustained multi-step chains.
There's something almost boring about that being the answer. The middle option is the workhorse.
It's not glamorous, but the numbers back it up. When you look at where Sonnet lands on coding benchmarks versus cost per token, the value proposition is strong. Developers building code review tools, document summarization pipelines, anything in the medium-complexity range, they're landing on Sonnet and mostly staying there.
How does that play out on something like code review specifically? Because that feels like a task where the complexity can vary enormously — reviewing a ten-line utility function is a totally different cognitive load than reviewing a pull request that touches six interdependent modules.
That's exactly the kind of variance that makes code review an interesting test case. For the ten-line function, Haiku is probably fine — it can spot a missing null check or a naming inconsistency without needing to hold much context. But the six-module pull request is a different animal. You need to track how a change in module A propagates through the interfaces to module D, and whether the assumptions baked into module F are still valid. That's where Sonnet earns its keep. And in practice what a lot of teams end up doing is running an initial Haiku pass for surface-level issues — style, obvious bugs, documentation gaps — and then routing the structurally complex reviews to Sonnet. You get most of the coverage at Haiku prices, and Sonnet only touches the cases where it actually matters.
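That tiered review flow can be sketched as a simple escalation rule. Everything here is illustrative: the diff-parsing heuristic and the three-module threshold are assumptions for the sketch, not a measured boundary.

```python
def modules_touched(diff: str) -> int:
    """Count distinct top-level modules a unified diff touches
    (naive heuristic: the first path segment after '+++ b/')."""
    paths = [line[6:] for line in diff.splitlines() if line.startswith("+++ b/")]
    return len({p.split("/")[0] for p in paths})

def pick_reviewer(diff: str, escalation_threshold: int = 3) -> str:
    """Cheap tier for surface-level review; escalate cross-module changes."""
    return "sonnet" if modules_touched(diff) >= escalation_threshold else "haiku"
```

In practice the escalation signal would be tuned against real review outcomes rather than hardcoded, but the shape — a cheap structural check deciding which tier sees the diff — is the point.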
Which is a preview of the routing conversation we're about to have.
And it's worth noting that this kind of tiered approach wasn't really possible before you had a model family with meaningfully distinct capability profiles. When you only had one model, you were paying for Sonnet-level reasoning on every ten-line function review. The family structure creates the opportunity for that optimization.
Where does Opus actually justify its cost?
Research applications are the clearest case. Not research in the casual sense, but structured analytical work where the reasoning chain is long and the cost of a subtle error is high. Legal document analysis, which we touched on. Scientific literature synthesis where you need to hold contradictions across multiple papers and reason about them together. Complex financial modeling where intermediate conclusions feed into later steps. Those are the tasks where the trillion-parameter sustained representation actually earns its keep.
The knock-on effect I keep coming back to is what this does to application architecture decisions. If you know Haiku fails categorically on certain task types, you don't just swap models, you rethink the task decomposition.
That's an important point. A lot of sophisticated AI application design now involves breaking a complex task into subtasks, routing the simpler ones to Haiku, escalating the reasoning-heavy steps to Sonnet or Opus. You're not choosing one model for an application anymore, you're building a model routing layer. And the economics of that can be surprisingly good.
Haiku handles ninety percent of the volume, Opus gets the ten percent that actually needs it.
The hard engineering problem is building the classifier that knows which is which. Get that right and you've dramatically cut your inference costs without sacrificing quality on the tasks that matter. Get it wrong and you've just added latency and complexity for no gain.
What does a good classifier actually look like in practice? Because that feels like it could easily become its own expensive problem.
It can, and that's a real trap. The naive version is a rules-based router — if the input is longer than X tokens, escalate to Sonnet. Which is better than nothing but misses a lot. A more sophisticated version uses a lightweight model, often something like a fine-tuned Haiku, to classify the task before routing it. The trick is that the classification task is simpler than the downstream reasoning task, so Haiku can handle it reliably even when it couldn't handle the task itself. The signal you're actually looking for is: does this query require holding and revising intermediate conclusions? That's the structural indicator that predicts where the tier boundaries matter.
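A minimal router along those lines might look like the sketch below. The keyword markers and token thresholds are crude stand-ins for the lightweight classifier being described, and the tier names are placeholders — a real system would learn these boundaries from its own traffic.

```python
def needs_sustained_reasoning(query: str, context_tokens: int) -> bool:
    """Crude stand-in for a learned classifier: does this query require
    holding and revising intermediate conclusions?"""
    markers = ("contradict", "reconcile", "compare across", "step by step")
    return context_tokens > 50_000 or any(m in query.lower() for m in markers)

def route(query: str, context_tokens: int) -> str:
    """Route a query to a capability tier; names are placeholders."""
    if needs_sustained_reasoning(query, context_tokens):
        return "opus"
    if context_tokens > 8_000:   # document-length context: middle tier
        return "sonnet"
    return "haiku"               # bounded, high-volume default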
Which is its own design challenge that didn't exist when you only had one model to worry about.
The model family created the problem it also solves. That's sort of the shape of where AI application design has landed — which raises the question: what does this mean for decision-makers?
So what's the actual takeaway for someone listening to this who has to make a real decision next week?
Start with the task, not the model. That sounds obvious but most people invert it. They pick a model based on budget or familiarity and then discover the constraints later. If you map the task first, the model choice usually follows pretty naturally.
Walk me through the map.
Sub-second response time required, high query volume, relatively bounded task scope? That's Haiku — you're not leaving meaningful capability on the table, you're just not paying for what you don't need. Anything involving sustained reasoning chains, document-length context, or multi-step logic where intermediate conclusions matter? That's Sonnet at minimum, Opus if the error cost is high.
The honest version of the Sonnet recommendation is that it covers a wide band.
Wider than people expect. The instinct to reach for Opus on anything that feels "hard" is understandable but usually wrong. Sonnet handles the majority of what enterprise teams actually need, and the predictability across query types is worth something that doesn't show up cleanly in benchmark tables.
If I were a developer who'd never run all three on my actual workload, what would you tell me to do?
Run all three on the same representative sample of your real queries, not synthetic benchmarks. Not a hundred hand-crafted test cases. Your actual production traffic or as close to it as you can get. Look for the cases where Haiku fails qualitatively, not just marginally. That's where your routing boundaries are.
The failure shape tells you more than the success rate.
And the experiment costs almost nothing compared to the architectural decisions it informs. You'll probably find that the task distribution in your application is more Haiku-compatible than you assumed, and that Opus is justified in a narrower slice than you feared.
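The experiment itself is a small harness: replay the same real queries through each tier and count where each one fails. `call_model` and `grade` below are hypothetical stubs for your actual API client and whatever domain-specific correctness check fits your application.

```python
from collections import Counter

def compare_tiers(queries, call_model, grade):
    """Replay the same representative queries through each tier and
    count failures per model. The interesting output isn't the totals
    but which specific queries only the cheap tier gets wrong."""
    failures = Counter()
    for query in queries:
        for model in ("haiku", "sonnet", "opus"):
            if not grade(query, call_model(model, query)):
                failures[model] += 1
    return failures
```

The queries where Haiku fails but Sonnet succeeds are your routing boundary, which is why logging per-query results matters more than the aggregate counts.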
Which is a more optimistic conclusion than the model selection conversation usually lands on.
It usually lands on cost anxiety. The better version of that conversation is capability clarity — understanding what each tier can actually do.
Capability clarity — I like that framing. But the question I keep sitting with is how long these distinctions actually hold. If the next generation of Haiku is trained with architectural improvements that give it something closer to Sonnet's multi-pass reasoning, does the tier system start to blur?
That's the open question. There's a reasonable argument that as training efficiency improves, the capability gap between tiers compresses. You could imagine a future Haiku that handles the reasoning chains that currently require Sonnet, at sub-200ms latency, at current Haiku prices.
Which would be great for developers and somewhat disorienting for anyone who built their routing logic around today's capability boundaries.
The routing layer becomes a liability if the model landscape shifts under it. That's a real architectural risk that I don't think gets discussed enough. You build a classifier that distinguishes Haiku-appropriate from Opus-appropriate tasks, and then the next model release moves the line on you.
It's a little like building infrastructure around a specific API version and then the versioning scheme changes. You've made a bet on stability that the underlying platform didn't actually promise you.
That's the right analogy. And the teams that have been burned by it tend to be the ones that hardcoded model names into their routing logic rather than abstracting the capability profile. If your classifier is asking "is this a Haiku task or an Opus task," you're in trouble the moment those names refer to different capability profiles than they do today. If your classifier is asking "does this task require sustained multi-step reasoning," you've got something more durable, because that question stays meaningful even as the models evolve.
The practical advice isn't just "map your tasks now" but "build your routing layer so it can be recalibrated."
Treat the model selection as a parameter, not a constant. That's probably the most future-proof way to think about it.
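That "parameter, not a constant" idea can be made concrete with a capability-keyed table. The capability names and model IDs below are hypothetical; the design point is that a new model release only changes the table, while the classifier's question stays the same.

```python
# Capability requirements the classifier can emit. These stay meaningful
# across model generations, unlike hardcoded model names.
BOUNDED = "bounded"
SUSTAINED = "sustained-multi-step"

# The only thing a new release should change. IDs are hypothetical.
MODEL_FOR_CAPABILITY = {
    BOUNDED: "fast-tier-model",
    SUSTAINED: "deep-tier-model",
}

def select_model(task: dict, table: dict = MODEL_FOR_CAPABILITY) -> str:
    """Ask the durable question, then look up today's answer."""
    needed = SUSTAINED if task.get("multi_step") else BOUNDED
    return table[needed]
```

Swapping in a recalibrated table after a model release touches one dictionary, not the routing logic.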
There's something almost poetic about that. The model family that forced you to think carefully about task decomposition also teaches you to hold your architectural assumptions loosely.
AI application design as a practice in epistemic humility. Daniel would appreciate that framing.
He'd probably send us a follow-up prompt about it. Big thanks to Hilbert Flumingtop for producing the show, and to Modal for keeping our inference pipeline running without us having to think about it, which is exactly how we like it.
Exactly how we like it.
If this episode was useful, leave us a review wherever you listen. It helps people find the show. This has been My Weird Prompts. We'll see you next time.