#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps

Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.

Episode Details

Episode ID: MWP-2564
Published:
Duration: 28:00
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Death of Needle-in-a-Haystack

For much of 2023 and early 2024, the needle-in-a-haystack (NIAH) benchmark was the standard test for long-context AI models. Drop one fact into a sea of irrelevant text, ask the model to retrieve it. Simple. Useful. And now, completely saturated.

Every major model — Claude, Gemini, GPT — hits 99% on NIAH at absurd context lengths. As one observer put it, it's "the benchmark equivalent of bragging that your car can roll downhill." A late 2024 paper from NVIDIA's RULER team confirmed that NIAH performance has essentially zero correlation with actual long-context reasoning ability.

What Replaced It: Four Harder Benchmarks

RULER (NVIDIA Research) — A suite of subtasks that stress different capabilities. The standout is variable tracking: models receive a long context with variables being assigned and reassigned (X=7, then later X=12, then X=4) and must report the final value. Results were brutal — models above 95% on NIAH at 128K tokens dropped below 50% on variable tracking at the same length. Multi-key retrieval and common words extraction showed similar collapses.
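The variable-tracking setup is easy to sketch. Below is an illustrative generator for a RULER-style test item (an assumption-laden sketch, not the official RULER code): one variable is reassigned several times at random points in a sea of filler sentences, and the answer is the value of the assignment that appears last in the text.

```python
import random

def make_variable_tracking_item(n_updates=3, n_distractors=200, seed=0):
    """Build a RULER-style variable-tracking item (illustrative sketch).

    One variable X is assigned n_updates times, scattered through
    n_distractors filler sentences; the answer is its final value.
    """
    rng = random.Random(seed)
    updates = [rng.randint(1, 99) for _ in range(n_updates)]
    lines = [f"Filler sentence number {i}." for i in range(n_distractors)]
    # Pick sorted insertion points, then insert from the back so earlier
    # indices stay valid; assignment order in the text matches `updates`.
    positions = sorted(rng.sample(range(len(lines)), n_updates))
    for pos, value in zip(reversed(positions), reversed(updates)):
        lines.insert(pos, f"X = {value}.")
    context = " ".join(lines)
    question = "What is the final value of X?"
    return context, question, updates[-1]

context, question, answer = make_variable_tracking_item()
```

Grading is exact-match on the final value, which is what makes the task unambiguous: a model that merely retrieves *an* assignment, rather than tracking state to the last one, scores at chance.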

BABILong — Takes classic bAbI reasoning tasks and embeds them in massive distractor texts. The key finding: even at 11,000 tokens (roughly 25 pages), models claiming million-token windows start failing multi-step reasoning. Single-hop retrieval works fine. Two-hop reasoning degrades noticeably. Three-hop reasoning? Many models are at chance.
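The construction can be sketched in a few lines, in the spirit of BABILong (again an illustrative sketch, not the official generator): take a two-hop bAbI-style fact pair, scatter the facts through distractor text in order, and ask a question that requires chaining both.

```python
import random

def embed_two_hop_task(distractor_sentences, seed=0):
    """Embed a bAbI-style two-hop question in distractor text (sketch).

    The model must combine both facts -- John went to the kitchen, and
    John picked up the apple -- to conclude the apple is in the kitchen.
    """
    facts = ["John went to the kitchen.", "John picked up the apple."]
    rng = random.Random(seed)
    lines = list(distractor_sentences)
    # Insert facts at sorted positions, from the back, so they appear
    # in their original order within the haystack.
    positions = sorted(rng.sample(range(len(lines)), len(facts)))
    for pos, fact in zip(reversed(positions), reversed(facts)):
        lines.insert(pos, fact)
    return " ".join(lines), "Where is the apple?", "kitchen"

haystack = [f"Unrelated sentence {i}." for i in range(500)]
context, question, answer = embed_two_hop_task(haystack)
```

The hop count is the knob: each extra fact the model must chain adds a hop, and as the benchmark results above show, accuracy falls off far faster with hops than with raw context length.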

NoCha (Narrative Claim Verification on Novels) — Tests whether models can verify claims about full-length novels. The claims are written in the same style as the book, eliminating lexical anchors. Models that score 99% on NIAH drop below 70% accuracy on multi-chapter claim verification — barely better than chance on a binary task.
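Scoring is a plain accuracy loop over labelled true/false claims. The sketch below assumes a hypothetical `ask_model` callable standing in for whatever model API you use; it also shows why "below 70%" is so damning, since a trivial constant baseline already scores 50% on a balanced set.

```python
def verify_claims(claims, ask_model):
    """Score binary claim verification, NoCha-style (sketch).

    `claims` is a list of (claim_text, label) pairs with label True/False;
    `ask_model` is any callable returning a True/False verdict
    (a hypothetical stub, not a real API).
    """
    correct = sum(ask_model(text) == label for text, label in claims)
    return correct / len(claims)

# A constant always-True baseline scores at chance on a balanced set.
balanced = [("claim A", True), ("claim B", False)]
print(verify_claims(balanced, lambda _: True))  # prints 0.5
```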

Michelangelo — The most philosophically interesting benchmark, named for the idea of sculpting: revealing the latent structure within the context. Models must identify patterns that are never explicitly stated anywhere in the text, such as spotting the central hub among hundreds of described transactions, recognizing a recurring sequence of events that is never labelled as a pattern, or detecting a correlation visible only across dozens of scattered data points. Even at 32K tokens, well within every frontier model's claimed comfort zone, accuracy on these latent-structure tasks falls to 30-50%, while NIAH scores at the same length stay near-perfect.

The Core Insight

The gap between claimed and effective context windows is enormous. A model can hold a million tokens in its window but only reason across two or three facts. For enterprise use cases — legal document analysis, research synthesis, codebase understanding — the effective context window is what matters, not the theoretical maximum. These benchmarks reveal that current architectures excel at retrieval but struggle with comprehension, state tracking, and multi-hop reasoning across long contexts.
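The "effective context window" idea discussed in the episode can be made concrete: for each task, find the longest context length at which accuracy still clears a threshold. A minimal sketch, with entirely hypothetical accuracy numbers shaped like the trends above:

```python
def effective_context_window(accuracy_by_length, threshold=0.8):
    """Longest context length at which accuracy still meets the threshold.

    Assumes accuracy degrades roughly monotonically with length; for a
    non-monotone curve you would take the longest qualifying prefix.
    """
    qualifying = [length for length, acc in sorted(accuracy_by_length.items())
                  if acc >= threshold]
    return max(qualifying) if qualifying else 0

# Hypothetical curves: context length (tokens) -> task accuracy.
retrieval = {8_000: 0.99, 128_000: 0.99, 1_000_000: 0.98}
three_hop = {8_000: 0.82, 32_000: 0.55, 128_000: 0.34}

print(effective_context_window(retrieval))  # 1000000
print(effective_context_window(three_hop))  # 8000
```

Same model, same claimed window; the effective window differs by two orders of magnitude depending on the task.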

Downloads

Episode audio (MP3), a plain-text transcript (TXT), and a formatted transcript (PDF) are available.

Transcript

Corn
Daniel sent us this one — he wants to talk about long-context evaluation, specifically why that needle-in-a-haystack benchmark everyone used to cite is basically useless now, and what's actually replaced it. He flagged four benchmarks in particular: RULER, BABILong, NoCha, and Michelangelo. And the punchline he wants us to land is why frontier models can ace a million-token retrieval test yet completely fall apart at eight thousand tokens of genuine multi-hop reasoning. This is the gap between the claimed context window and the effective context window.
Herman
Oh, this is such a good topic. And Daniel's right — needle-in-a-haystack is saturated. It's done. Every major model hits ninety-nine percent on it out to absurd lengths, and people still wave it around like it means something. It's the benchmark equivalent of bragging that your car can roll downhill.
Corn
Before we get into the replacements though — fun fact, DeepSeek V four Pro is writing our script today. So if anything sounds unusually coherent, that's why.
Herman
I'll take that as a compliment to the model and an insult to our usual standards. Alright, let's start with why needle-in-a-haystack is meaningless now. The original NIAH test — you drop one fact into a giant pile of irrelevant text, you ask the model to retrieve it. That's it. No reasoning, no synthesis, no connecting multiple pieces of information. It's a Ctrl-F test.
Corn
The thing is, it was genuinely useful in twenty twenty-three and early twenty twenty-four. It exposed real weaknesses. Models would forget the beginning of long documents, or the middle would go dark. But then everyone optimized for it. They trained specifically on long-context retrieval patterns. And now Claude, Gemini, GPT — they all basically max it out.
Herman
There's a paper from late twenty twenty-four that showed something brutal. I think it was the RULER team at NVIDIA. They demonstrated that NIAH performance has essentially zero correlation with actual long-context reasoning ability. You can be perfect at finding a needle and still be completely incapable of doing anything useful with a long document.
Corn
That's the setup. Let's walk through what actually replaced it. Daniel mentioned four — start with RULER?
Herman
Yeah, RULER came out of NVIDIA Research, and it was one of the first systematic attempts to say okay, if needle-in-a-haystack is too easy, what does a harder retrieval benchmark look like? And they didn't just make one test — they built a whole suite with different subtasks that stress different capabilities. The one that gets the most attention is variable tracking.
Corn
Explain variable tracking. What's the setup?
Herman
Imagine the model gets a long context — could be tens of thousands, hundreds of thousands of tokens — and scattered throughout that context are variables being assigned and reassigned. Like, X equals seven, then three hundred pages later, X equals twelve. Then another two hundred pages, X equals four. And at the very end, the question is: what is the final value of X? The model can't just find one needle. It has to track state changes across the entire context.
Corn
Which is fundamentally different from retrieval. Retrieval is find the sentence that says X. This is reconstruct a sequence of mutations and compute the current state.
Herman
And the results were not pretty. The RULER paper tested a bunch of models that were hitting above ninety-five percent on standard NIAH at a hundred twenty-eight thousand tokens. On variable tracking at the same context length, some of them dropped to below fifty percent. Some were effectively at chance.
Corn
Fifty percent on something that any human with a notepad could do perfectly. That's the gap in a nutshell. What are the other RULER subtasks?
Herman
Multi-key retrieval is another big one. Instead of one needle, there are multiple keys scattered throughout the context — say, five different API keys or five different passcodes — and the model has to retrieve all of them. Sounds simple, but it turns out that as the number of keys increases, performance degrades way faster than you'd expect. Models that can reliably find one key at a million tokens can't find four keys at thirty-two thousand.
Corn
Because the attention mechanism gets diluted. It's trying to hold too many retrieval targets simultaneously.
Herman
That's the leading hypothesis, yeah. The third one Daniel mentioned is common words extraction, and this one is almost insultingly simple. The model is given a long context and asked to list all the words that appear exactly N times. Or find the most frequent word. It's a counting and aggregation task. And models are terrible at it.
Corn
Terrible at counting words in their own context window. That's almost poetic.
Herman
It really is. And it exposes something fundamental. These models don't have a reliable internal representation of what's in their context. They can attend to things, but they can't summarize statistical properties of the context itself. Ask a model how many times the word "the" appears in a hundred-thousand-token document and it'll just guess.
Corn
RULER basically says: retrieval is easy, state tracking is hard, multi-target retrieval is hard, and any kind of aggregation over the full context is nearly impossible. That's the first benchmark suite. What about BABILong?
Herman
BABILong is where things get really interesting, because it moves from retrieval to reasoning. It's built on top of the bAbI tasks — those are classic reasoning benchmarks from Facebook AI Research back in twenty fifteen, twenty sixteen. Simple things like: John went to the kitchen. John picked up the apple. Where is the apple? Basic spatial and temporal reasoning chains.
Corn
Right, those were originally designed to test reasoning in short contexts. A few sentences.
Herman
And what the BABILong team did was take those same reasoning problems and embed them in massively long distractor texts. So now the model has to read through the equivalent of an entire novel, find the relevant sentences — which might be pages apart — and then chain them together to answer a question. It's not just find the sentence. It's find the relevant sentences among a sea of irrelevance, and then reason over them.
Corn
How long are we talking?
Herman
They tested up to one million tokens, but the really damning results show up much earlier. Even at eleven thousand tokens — eleven thousand, not a hundred thousand — some models that claim million-token context windows start failing at multi-step reasoning tasks.
Corn
Eleven thousand tokens is roughly, what, twenty-five pages? That's a short story. And models with million-token context windows can't reason across twenty-five pages?
Herman
That's the finding. And it gets worse when the reasoning requires more hops. A single-hop question — find one fact — works fine at long contexts. Two-hop reasoning degrades noticeably. Three-hop reasoning? Many models are at chance by three hops, even at moderate context lengths.
Corn
The number of reasoning steps is a harder constraint than the raw context length. That's a really important insight. You can have a million tokens in your window but if you can only reason across two or three facts, your effective context window for anything useful is tiny.
Herman
That's exactly the punchline Daniel wanted us to land, and we'll get there. But let me finish BABILong. The most striking result from their twenty twenty-four paper was that increasing context length from four thousand to eleven thousand tokens caused reasoning accuracy to drop by twenty to forty percentage points on multi-hop tasks, depending on the model. And these are models that show basically flat NIAH performance across the same range.
Corn
Needle-in-a-haystack says everything's fine, and BABILong says you've already lost the plot by eleven thousand tokens. That's not a small discrepancy. That's two completely different stories about what these models can do.
Herman
This is why evaluation methodology matters so much. If you're an enterprise deciding whether to trust a model with your hundred-page legal document, and you look at the NIAH benchmark, you'd think great, it can handle it. BABILong says absolutely not, not if you need it to reason across that document.
Corn
Alright, let's move to the third one. This one sounds particularly devious.
Herman
NoCha stands for Narrative Claim Verification on Novels. And it's brilliant because it exploits the exact weakness that makes NIAH easy. In needle-in-a-haystack, the needle is lexically distinct. You drop in a sentence about eating a pizza in the middle of a document about corporate tax law, and the model can find it because it's semantically anomalous. It stands out. NoCha eliminates that crutch.
Corn
The needle looks exactly like the hay.
Herman
NoCha uses full-length novels — real books, entire novels — and generates claims about the narrative that the model has to verify as true or false. But here's the key: the claims are written in the same style as the book. Same vocabulary, same sentence structure, same narrative voice. There's no lexical anchor. The model can't just pattern-match. It actually has to understand the story.
Corn
Give me an example of what a NoCha claim looks like.
Herman
If the novel is Pride and Prejudice, a true claim might be "Elizabeth Bennet initially refuses Mr. Darcy's first marriage proposal." That's a plot point, but it's not a direct quote — the book never says "Elizabeth initially refuses Mr. Darcy's first marriage proposal" in those words. A false claim might be "Elizabeth Bennet visits her aunt in Manchester" when she actually visits her aunt in London, or something that's plausible within the world of the novel but didn't happen. The model has to know the story well enough to distinguish what happened from what could have happened.
Corn
These claims are being verified against the full text of the novel sitting in the context window. So the model has the book right there. It's not testing memory.
Herman
The book is in the context. The model can theoretically look at any part of it. But the novel might be a hundred thousand tokens long, and the claim requires integrating information from chapters two, seven, and fourteen. That's the challenge.
Corn
How do the frontier models do on NoCha?
Herman
There was a paper — I think it was from early twenty twenty-five — that tested several leading models on NoCha with full-length novels. Even the best models were below seventy percent accuracy on claim verification when the claims required integrating information from multiple chapters. And these are models that score ninety-nine percent on NIAH at the same context length.
Corn
Below seventy percent on a binary true-false task. That's barely better than flipping a coin for some of them.
Herman
And remember, this is not a hard task for a human who's read the book. If you've actually read Pride and Prejudice, you can verify "Elizabeth refused Darcy's first proposal" instantly. The model has the entire book in front of it and still can't do it reliably.
Corn
That says something profound about what "reading" means for these models. They're not reading the way we read. They're doing something that looks like reading for simple retrieval tasks but breaks down completely when you ask for comprehension.
Herman
The NoCha authors made exactly that point. They argued that NIAH-style benchmarks measure a model's ability to locate text, not to understand it. And the fact that models fail on NoCha at context lengths where NIAH is perfect suggests that as context grows, models aren't building a coherent representation of the content. They're just indexing it for retrieval.
Corn
Which brings us to Michelangelo. This is the one I find most philosophically interesting. What's the setup?
Herman
Michelangelo is named after the idea of sculpting — you're revealing the latent structure within the context. The benchmark tests whether models can identify patterns and relationships that aren't explicitly stated anywhere in the text, but that emerge from the aggregate of many individual pieces of information.
Corn
It's not retrieval, it's not even reasoning over stated facts. It's pattern recognition across the entire context.
Herman
A classic example from the Michelangelo paper: the context contains descriptions of hundreds of individual transactions between different entities. No single sentence tells you who the central hub is. But if you aggregate all the transactions, a pattern emerges — entity A is connected to everyone else. The question is: can the model identify entity A as the central node?
Corn
That requires integrating information from potentially every part of the context. You can't answer it by finding one sentence. You can't even answer it by finding two or three sentences and chaining them. You have to have absorbed the whole thing.
Herman
That's what makes it so hard. The Michelangelo benchmark includes several types of these latent structure tasks. There's the network centrality one I just described. There are temporal pattern tasks where the model has to identify a recurring sequence of events that's never explicitly labeled as a pattern. There are correlation tasks where two variables are related but the relationship is only visible across dozens of data points scattered through the text.
Corn
How long are the contexts in Michelangelo?
Herman
They tested at various lengths, but the key results are at thirty-two thousand and sixty-four thousand tokens. And here's the thing — even at thirty-two thousand tokens, which is well within the claimed comfort zone of every frontier model, performance on latent structure tasks is abysmal. We're talking thirty to fifty percent accuracy on tasks that require genuine integration across the full context.
Corn
I'm guessing NIAH scores at thirty-two thousand tokens are essentially perfect for these same models.
Herman
Ninety-nine point something percent. The divergence couldn't be starker. At the exact same context length, the same model goes from near-perfect retrieval to near-chance pattern recognition.
Corn
We've walked through all four benchmarks. Let's synthesize this. What's the through-line? What do RULER, BABILong, NoCha, and Michelangelo collectively tell us?
Herman
They tell us that there's a hierarchy of difficulty for long-context tasks, and the gap between levels is enormous. Level one is simple retrieval — find the sentence that matches this query. That's solved. All frontier models can do this at a million tokens. Level two is multi-target retrieval or state tracking — find multiple things, or track changes to one thing. Models start failing here at surprisingly short contexts. Level three is multi-hop reasoning — connect facts A and B and C to answer a question. Performance collapses at three or more hops beyond maybe eight thousand to sixteen thousand tokens.
Corn
Level four is what Michelangelo tests — latent structure, pattern recognition that requires integrating across the entire context. Models basically can't do this at all beyond relatively short documents.
Herman
And here's why this matters. When a company says "our model has a one-million-token context window," what they're really saying is "our model can do level one retrieval at one million tokens." They're not saying it can reason, track state, or identify patterns at that length. But most users don't know the difference. They hear "million-token context window" and think the model can read and understand War and Peace in one go.
Corn
It can't. It can find a specific sentence in War and Peace. It cannot tell you whether the narrative structure of War and Peace mirrors a particular historical pattern unless that's explicitly stated somewhere in the text.
Herman
There was a really good piece about this — I want to say it was on SemiAnalysis or maybe The Gradient, late twenty twenty-four — that introduced the concept of the "effective context window." It's the length at which the model can still perform a given task at above some threshold, say eighty percent accuracy. And the effective context window varies wildly depending on what you're asking the model to do.
Corn
For simple retrieval, the effective context window might be a million tokens. For two-hop reasoning, it might be thirty-two thousand. For three-hop reasoning, maybe eight thousand. For latent structure tasks, maybe four thousand.
Herman
That's the dirty secret of long-context marketing. The number on the spec sheet is the maximum possible context length for the easiest possible task. It's like a car manufacturer advertising top speed based on driving downhill with a tailwind.
Corn
What's the actual mechanism here? Why do models fail at reasoning over long contexts even when retrieval works fine?
Herman
There are a few competing explanations and it's probably a combination. One is attention dilution. As the context grows, the attention mechanism has to spread its weights across more and more tokens. Even with sophisticated attention architectures, relevant information gets less attention weight when it's surrounded by more noise.
Corn
The signal-to-noise ratio degrades with length.
Herman
That's part of it. Another factor is what some researchers call the "lost in the middle" problem — models pay disproportionately more attention to the beginning and end of the context, and information in the middle gets underrepresented. But the deeper issue might be representational. The model has a fixed-size representation of everything it's read. When you shove a million tokens into that representation, you're compressing aggressively. Fine details get blurred. For retrieval, you just need to know roughly where something is. For reasoning, you need precise logical relationships, and those get lost in the compression.
Corn
That compression analogy makes a lot of sense. It's like the difference between knowing that a book contains a sentence about Paris and knowing the exact logical relationship between that sentence and three other sentences scattered across different chapters.
Herman
There's a third factor that I think is underappreciated. These models are trained on mostly short-context data. The pre-training corpus is full of articles, forum posts, code files — things that fit in four thousand or eight thousand tokens. There's comparatively little training data that requires reasoning across a hundred thousand tokens. So the models never really learn to do it. The long-context capability is sort of bolted on after the fact through fine-tuning and architectural tricks like RoPE scaling.
Corn
It's not that the architecture is fundamentally incapable — it's that the training distribution doesn't teach it to reason at length, and the post-training fixes only get you so far.
Herman
That's my read, yeah. And it's consistent with what we see. Models that are fine-tuned specifically for long-context reasoning tasks do better on BABILong and RULER. But the general-purpose frontier models, even the very best ones, still show this massive gap.
Corn
Let's talk about what this means for actual users. Daniel's in the AI space, he's probably using these models with long contexts regularly. What should he and people like him actually take away from all this?
Herman
The first takeaway is: don't trust the context window number on the spec sheet. It's almost meaningless for anything beyond simple retrieval. If you're using a model to analyze a long document, you need to think about what kind of cognitive work you're asking it to do.
Corn
Break that down. What kinds of tasks are safe at long contexts, and what kinds should make you nervous?
Herman
Safe at long contexts: find-me tasks. Find every mention of this contract clause. Find all the dates. Find the sentence where the author discusses the budget. Summarization is also relatively safe — it's mostly a compression task, and models are decent at that even at length, though you'll lose nuance.
Corn
What about the nervous category?
Herman
Anything that requires connecting multiple pieces of information that aren't adjacent. If you're asking "does the argument in chapter three contradict the conclusion in chapter twelve," that's multi-hop reasoning across a gap, and you should be skeptical of the answer at anything beyond maybe thirty thousand tokens. If you're asking "what's the overall thematic pattern in this book," that's a latent structure task, and frankly, I wouldn't trust any current model to do that reliably on a full-length novel.
Corn
Even though the model will give you a confident, articulate answer.
Herman
Especially because it'll give you a confident, articulate answer. That's the dangerous part. The model doesn't know what it doesn't know. It'll happily spin you a plausible-sounding thematic analysis of a novel while having effectively processed maybe fifteen percent of the text.
Corn
The practical advice is: for long documents, break your queries into retrieval-style questions. Don't ask for synthesis across distant sections. Don't ask for pattern recognition. Ask for specific facts, and do the synthesis yourself.
Herman
Or use a multi-step workflow. Chunk the document, process each chunk separately, and then have the model reason over the chunk summaries. That's not perfect either — you lose cross-chunk connections — but it's often more reliable than dumping the whole thing in and hoping the model integrates it properly.
Corn
That's a useful heuristic. Chunk and summarize, then reason over summaries. It's basically admitting that the model can't do the integration itself and you're doing it manually.
Herman
That's the honest assessment of where we are in mid twenty twenty-six. The claimed context windows are a million tokens, two million tokens, and they're going to keep growing. But the effective context window for reasoning hasn't budged nearly as much. It's maybe doubled in two years while the claimed window has gone up by a factor of fifty.
Corn
That's a wild asymmetry. And most coverage doesn't make this distinction at all. You see headlines about million-token context windows and the implicit message is "the model can now read and understand entire books in one go." And that's just not true.
Herman
The benchmark community has done a really good job of developing these more rigorous tests. RULER, BABILong, NoCha, Michelangelo — each one exposes a different dimension of the gap between retrieval and understanding. But the benchmark results haven't filtered into public awareness the way the context window numbers have.
Corn
Because "million tokens" is a simple, impressive number you can put in a press release. "Sixty-three percent on multi-hop reasoning at thirty-two thousand tokens" is a lot harder to market.
Herman
To be fair to the model developers, they're not necessarily being dishonest. The models do technically support million-token contexts. You can put a million tokens in and get a response out. The architecture handles it. It's just that the quality of the response degrades in ways that aren't captured by the benchmarks most people look at.
Corn
I think that's a generous framing. If you know that your model's reasoning collapses at eight thousand tokens and you market it as having a million-token context window without that caveat, that's at least a sin of omission.
Herman
That's fair. And I think the pressure to compete on context length has created a race where nobody wants to be the first to admit that the number doesn't mean what customers think it means.
Corn
Let's zoom out for a second. Is this fixable? Are there architectural approaches on the horizon that might close the gap between retrieval and reasoning at long contexts?
Herman
There's a lot of work on this. One promising direction is better attention mechanisms — things like ring attention, dilated attention, sparse attention patterns that try to maintain coverage of the full context without the quadratic cost. Another is explicit memory architectures where the model maintains a separate, structured representation of what it's read.
Corn
Like a running summary that gets updated as it reads.
Herman
Instead of trying to attend to every token equally, the model compresses what it's read into a structured memory and reasons over that. Some of the newer architectures are exploring this. There's also work on retrieval-augmented generation inside the context window — having the model search its own context for relevant chunks before reasoning, rather than trying to keep everything in active attention simultaneously.
Corn
The model becomes its own search engine over its own context.
Herman
And that's essentially admitting that the naive approach of "just make the attention window bigger" isn't going to solve the reasoning problem. You need architectural changes that are specifically designed for integration and synthesis, not just for longer attention.
Corn
Do you think we'll see a point where the effective context window for reasoning catches up to the claimed context window? Or is this gap fundamental?
Herman
I don't think it's fundamental in principle. Humans can read long books and reason across them. It takes time and effort, but we can do it. So there's no theoretical reason a sufficiently advanced AI couldn't. But I think the current transformer-based architectures, even with all the modifications we've added, are hitting diminishing returns on long-context reasoning. We might need a different approach.
Corn
That's a sobering assessment. So for the foreseeable future, the advice is: enjoy your million-token context window for search and summarization, but don't trust it for thinking.
Herman
If the model tells you something about a long document that requires connecting dots across distant sections, go find those sections yourself and check. The model might be right, but the odds aren't as good as the confidence level suggests.
Corn
That's a good place to land. Daniel, I hope this gives you a useful framework for thinking about when to trust long-context outputs and when to be skeptical. The benchmarks have evolved, the evaluation is much better than it was two years ago, and the picture that's emerging is: retrieval is solved, reasoning at length is not.
Herman
If you're evaluating models for a use case that requires genuine long-context understanding, look past the context window number. Look at RULER scores, BABILong scores, NoCha and Michelangelo if they're available. Those tell you way more about what the model can actually do than the headline token count.
Corn
Thanks to Hilbert Flumingtop for producing, and thanks to Modal for keeping our pipeline running smoothly. This has been My Weird Prompts. You can find every episode at myweirdprompts.
Herman
If you're enjoying the show, leave us a review wherever you listen — it helps. We'll be back soon with whatever Daniel throws at us next.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.