#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps

Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.

Episode Details

Episode ID: MWP-2564
Published:
Duration: 28:00
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Death of Needle-in-a-Haystack

For much of 2023 and early 2024, the needle-in-a-haystack (NIAH) benchmark was the standard test for long-context AI models. Drop one fact into a sea of irrelevant text, ask the model to retrieve it. Simple. Useful. And now, completely saturated.

Every major model — Claude, Gemini, GPT — hits 99% on NIAH at absurd context lengths. As one observer put it, it's "the benchmark equivalent of bragging that your car can roll downhill." A late 2024 paper from NVIDIA's RULER team confirmed that NIAH performance has essentially zero correlation with actual long-context reasoning ability.

What Replaced It: Four Harder Benchmarks

RULER (NVIDIA Research) — A suite of subtasks that stress different capabilities. The standout is variable tracking: models receive a long context with variables being assigned and reassigned (X=7, then later X=12, then X=4) and must report the final value. Results were brutal — models above 95% on NIAH at 128K tokens dropped below 50% on variable tracking at the same length. Multi-key retrieval and common words extraction showed similar collapses.
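The variable-tracking setup is easy to sketch. Below is an illustrative generator for a RULER-style test item (an assumption-laden sketch, not the official RULER code): one variable is reassigned several times at random points in a sea of filler sentences, and the answer is the value of the assignment that appears last in the text.

```python
import random

def make_variable_tracking_item(n_updates=3, n_distractors=200, seed=0):
    """Build a RULER-style variable-tracking item (illustrative sketch).

    One variable X is assigned n_updates times, scattered through
    n_distractors filler sentences; the answer is its final value.
    """
    rng = random.Random(seed)
    updates = [rng.randint(1, 99) for _ in range(n_updates)]
    lines = [f"Filler sentence number {i}." for i in range(n_distractors)]
    # Pick sorted insertion points, then insert from the back so earlier
    # indices stay valid; assignment order in the text matches `updates`.
    positions = sorted(rng.sample(range(len(lines)), n_updates))
    for pos, value in zip(reversed(positions), reversed(updates)):
        lines.insert(pos, f"X = {value}.")
    context = " ".join(lines)
    question = "What is the final value of X?"
    return context, question, updates[-1]

context, question, answer = make_variable_tracking_item()
```

Grading is exact-match on the final value, which is what makes the task unambiguous: a model that merely retrieves *an* assignment, rather than tracking state to the last one, scores at chance.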

BABILong — Takes classic bAbI reasoning tasks and embeds them in massive distractor texts. The key finding: even at 11,000 tokens (roughly 25 pages), models claiming million-token windows start failing multi-step reasoning. Single-hop retrieval works fine. Two-hop reasoning degrades noticeably. Three-hop reasoning? Many models are at chance.
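The construction can be sketched in a few lines, in the spirit of BABILong (again an illustrative sketch, not the official generator): take a two-hop bAbI-style fact pair, scatter the facts through distractor text in order, and ask a question that requires chaining both.

```python
import random

def embed_two_hop_task(distractor_sentences, seed=0):
    """Embed a bAbI-style two-hop question in distractor text (sketch).

    The model must combine both facts -- John went to the kitchen, and
    John picked up the apple -- to conclude the apple is in the kitchen.
    """
    facts = ["John went to the kitchen.", "John picked up the apple."]
    rng = random.Random(seed)
    lines = list(distractor_sentences)
    # Insert facts at sorted positions, from the back, so they appear
    # in their original order within the haystack.
    positions = sorted(rng.sample(range(len(lines)), len(facts)))
    for pos, fact in zip(reversed(positions), reversed(facts)):
        lines.insert(pos, fact)
    return " ".join(lines), "Where is the apple?", "kitchen"

haystack = [f"Unrelated sentence {i}." for i in range(500)]
context, question, answer = embed_two_hop_task(haystack)
```

The hop count is the knob: each extra fact the model must chain adds a hop, and as the benchmark results above show, accuracy falls off far faster with hops than with raw context length.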

NoCha (Narrative Claim Verification on Novels) — Tests whether models can verify claims about full-length novels. The claims are written in the same style as the book, eliminating lexical anchors. Models that score 99% on NIAH drop below 70% accuracy on multi-chapter claim verification — barely better than chance on a binary task.
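Scoring is a plain accuracy loop over labelled true/false claims. The sketch below assumes a hypothetical `ask_model` callable standing in for whatever model API you use; it also shows why "below 70%" is so damning, since a trivial constant baseline already scores 50% on a balanced set.

```python
def verify_claims(claims, ask_model):
    """Score binary claim verification, NoCha-style (sketch).

    `claims` is a list of (claim_text, label) pairs with label True/False;
    `ask_model` is any callable returning a True/False verdict
    (a hypothetical stub, not a real API).
    """
    correct = sum(ask_model(text) == label for text, label in claims)
    return correct / len(claims)

# A constant always-True baseline scores at chance on a balanced set.
balanced = [("claim A", True), ("claim B", False)]
print(verify_claims(balanced, lambda _: True))  # prints 0.5
```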

Michelangelo — The most philosophically interesting benchmark, named for the idea of sculpting: revealing the latent structure within the context. Models must identify patterns that are never explicitly stated anywhere in the text, such as spotting the central hub among hundreds of described transactions, recognizing a recurring sequence of events that is never labelled as a pattern, or detecting a correlation visible only across dozens of scattered data points. Even at 32K tokens, well within every frontier model's claimed comfort zone, accuracy on these latent-structure tasks falls to 30-50%, while NIAH scores at the same length stay near-perfect.

The Core Insight

The gap between claimed and effective context windows is enormous. A model can hold a million tokens in its window but only reason across two or three facts. For enterprise use cases — legal document analysis, research synthesis, codebase understanding — the effective context window is what matters, not the theoretical maximum. These benchmarks reveal that current architectures excel at retrieval but struggle with comprehension, state tracking, and multi-hop reasoning across long contexts.
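The "effective context window" idea discussed in the episode can be made concrete: for each task, find the longest context length at which accuracy still clears a threshold. A minimal sketch, with entirely hypothetical accuracy numbers shaped like the trends above:

```python
def effective_context_window(accuracy_by_length, threshold=0.8):
    """Longest context length at which accuracy still meets the threshold.

    Assumes accuracy degrades roughly monotonically with length; for a
    non-monotone curve you would take the longest qualifying prefix.
    """
    qualifying = [length for length, acc in sorted(accuracy_by_length.items())
                  if acc >= threshold]
    return max(qualifying) if qualifying else 0

# Hypothetical curves: context length (tokens) -> task accuracy.
retrieval = {8_000: 0.99, 128_000: 0.99, 1_000_000: 0.98}
three_hop = {8_000: 0.82, 32_000: 0.55, 128_000: 0.34}

print(effective_context_window(retrieval))  # 1000000
print(effective_context_window(three_hop))  # 8000
```

Same model, same claimed window; the effective window differs by two orders of magnitude depending on the task.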

Downloads

Episode audio (MP3), a plain-text transcript (TXT), and a formatted transcript (PDF) are available.

Transcript

Corn
Daniel sent us this one — he wants to talk about long-context evaluation, specifically why that needle-in-a-haystack benchmark everyone used to cite is basically useless now, and what's actually replaced it. He flagged four benchmarks in particular: RULER, BABILong, NoCha, and Michelangelo. And the punchline he wants us to land is why frontier models can ace a million-token retrieval test yet completely fall apart at eight thousand tokens of genuine multi-hop reasoning. This is the gap between the claimed context window and the effective context window.
Herman
Oh, this is such a good topic. And Daniel's right — needle-in-a-haystack is saturated. It's done. Every major model hits ninety-nine percent on it out to absurd lengths, and people still wave it around like it means something. It's the benchmark equivalent of bragging that your car can roll downhill.
Corn
Before we get into the replacements though — fun fact, DeepSeek V four Pro is writing our script today. So if anything sounds unusually coherent, that's why.
Herman
I'll take that as a compliment to the model and an insult to our usual standards. Alright, let's start with why needle-in-a-haystack is meaningless now. The original NIAH test — you drop one fact into a giant pile of irrelevant text, you ask the model to retrieve it. That's it. No reasoning, no synthesis, no connecting multiple pieces of information. It's a Ctrl-F test.
Corn
The thing is, it was genuinely useful in twenty twenty-three and early twenty twenty-four. It exposed real weaknesses. Models would forget the beginning of long documents, or the middle would go dark. But then everyone optimized for it. They trained specifically on long-context retrieval patterns. And now Claude, Gemini, GPT — they all basically max it out.
Herman
There's a paper from late twenty twenty-four that showed something brutal. I think it was the RULER team at NVIDIA. They demonstrated that NIAH performance has essentially zero correlation with actual long-context reasoning ability. You can be perfect at finding a needle and still be completely incapable of doing anything useful with a long document.
Corn
That's the setup. Let's walk through what actually replaced it. Daniel mentioned four — start with RULER?
Herman
Yeah, RULER came out of NVIDIA Research, and it was one of the first systematic attempts to say okay, if needle-in-a-haystack is too easy, what does a harder retrieval benchmark look like? And they didn't just make one test — they built a whole suite with different subtasks that stress different capabilities. The one that gets the most attention is variable tracking.
Corn
Explain variable tracking. What's the setup?
Herman
Imagine the model gets a long context — could be tens of thousands, hundreds of thousands of tokens — and scattered throughout that context are variables being assigned and reassigned. Like, X equals seven, then three hundred pages later, X equals twelve. Then another two hundred pages, X equals four. And at the very end, the question is: what is the final value of X? The model can't just find one needle. It has to track state changes across the entire context.
Corn
Which is fundamentally different from retrieval. Retrieval is find the sentence that says X. This is reconstruct a sequence of mutations and compute the current state.
Herman
And the results were not pretty. The RULER paper tested a bunch of models that were hitting above ninety-five percent on standard NIAH at a hundred twenty-eight thousand tokens. On variable tracking at the same context length, some of them dropped to below fifty percent. Some were effectively at chance.
Corn
Fifty percent on something that any human with a notepad could do perfectly. That's the gap in a nutshell. What are the other RULER subtasks?
Herman
Multi-key retrieval is another big one. Instead of one needle, there are multiple keys scattered throughout the context — say, five different API keys or five different passcodes — and the model has to retrieve all of them. Sounds simple, but it turns out that as the number of keys increases, performance degrades way faster than you'd expect. Models that can reliably find one key at a million tokens can't find four keys at thirty-two thousand.
Corn
Because the attention mechanism gets diluted. It's trying to hold too many retrieval targets simultaneously.
Herman
That's the leading hypothesis, yeah. The third one Daniel mentioned is common words extraction, and this one is almost insultingly simple. The model is given a long context and asked to list all the words that appear exactly N times. Or find the most frequent word. It's a counting and aggregation task. And models are terrible at it.
Corn
Terrible at counting words in their own context window. That's almost poetic.
Herman
It really is. And it exposes something fundamental. These models don't have a reliable internal representation of what's in their context. They can attend to things, but they can't summarize statistical properties of the context itself. Ask a model how many times the word "the" appears in a hundred-thousand-token document and it'll just guess.
Corn
RULER basically says: retrieval is easy, state tracking is hard, multi-target retrieval is hard, and any kind of aggregation over the full context is nearly impossible. That's the first benchmark suite. What about BABILong?
Herman
BABILong is where things get really interesting, because it moves from retrieval to reasoning. It's built on top of the bAbI tasks — those are classic reasoning benchmarks from Facebook AI Research back in twenty fifteen, twenty sixteen. Simple things like: John went to the kitchen. John picked up the apple. Where is the apple? Basic spatial and temporal reasoning chains.
Corn
Right, those were originally designed to test reasoning in short contexts. A few sentences.
Herman
And what the BABILong team did was take those same reasoning problems and embed them in massively long distractor texts. So now the model has to read through the equivalent of an entire novel, find the relevant sentences — which might be pages apart — and then chain them together to answer a question. It's not just find the sentence. It's find the relevant sentences among a sea of irrelevance, and then reason over them.
Corn
How long are we talking?
Herman
They tested up to one million tokens, but the really damning results show up much earlier. Even at eleven thousand tokens — eleven thousand, not a hundred thousand — some models that claim million-token context windows start failing at multi-step reasoning tasks.
Corn
Eleven thousand tokens is roughly, what, twenty-five pages? That's a short story. And models with million-token context windows can't reason across twenty-five pages?
Herman
That's the finding. And it gets worse when the reasoning requires more hops. A single-hop question — find one fact — works fine at long contexts. Two-hop reasoning degrades noticeably. Three-hop reasoning? Many models are at chance by three hops, even at moderate context lengths.
Corn
The number of reasoning steps is a harder constraint than the raw context length. That's a really important insight. You can have a million tokens in your window but if you can only reason across two or three facts, your effective context window for anything useful is tiny.
Herman
That's exactly the punchline Daniel wanted us to land, and we'll get there. But let me finish BABILong. The most striking result from their twenty twenty-four paper was that increasing context length from four thousand to eleven thousand tokens caused reasoning accuracy to drop by twenty to forty percentage points on multi-hop tasks, depending on the model. And these are models that show basically flat NIAH performance across the same range.
Corn
Needle-in-a-haystack says everything's fine, and BABILong says you've already lost the plot by eleven thousand tokens. That's not a small discrepancy. That's two completely different stories about what these models can do.
Herman
This is why evaluation methodology matters so much. If you're an enterprise deciding whether to trust a model with your hundred-page legal document, and you look at the NIAH benchmark, you'd think great, it can handle it. BABILong says absolutely not, not if you need it to reason across that document.
Corn
Alright, let's move to the third one. This one sounds particularly devious.
Herman
NoCha stands for Narrative Claim Verification on Novels. And it's brilliant because it exploits the exact weakness that makes NIAH easy. In needle-in-a-haystack, the needle is lexically distinct. You drop in a sentence about eating a pizza in the middle of a document about corporate tax law, and the model can find it because it's semantically anomalous. It stands out. NoCha eliminates that crutch.
Corn
The needle looks exactly like the hay.
Herman
NoCha uses full-length novels — real books, entire novels — and generates claims about the narrative that the model has to verify as true or false. But here's the key: the claims are written in the same style as the book. Same vocabulary, same sentence structure, same narrative voice. There's no lexical anchor. The model can't just pattern-match. It actually has to understand the story.
Corn
Give me an example of what a NoCha claim looks like.
Herman
If the novel is Pride and Prejudice, a true claim might be "Elizabeth Bennet initially refuses Mr. Darcy's first marriage proposal." That's a plot point, but it's not a direct quote — the book never says "Elizabeth initially refuses Mr. Darcy's first marriage proposal" in those words. A false claim might be "Elizabeth Bennet visits her aunt in Manchester" when she actually visits her aunt in London, or something that's plausible within the world of the novel but didn't happen. The model has to know the story well enough to distinguish what happened from what could have happened.
Corn
These claims are being verified against the full text of the novel sitting in the context window. So the model has the book right there. It's not testing memory.
Herman
The book is in the context. The model can theoretically look at any part of it. But the novel might be a hundred thousand tokens long, and the claim requires integrating information from chapters two, seven, and fourteen. That's the challenge.
Corn
How do the frontier models do on NoCha?
Herman
There was a paper — I think it was from early twenty twenty-five — that tested several leading models on NoCha with full-length novels. Even the best models were below seventy percent accuracy on claim verification when the claims required integrating information from multiple chapters. And these are models that score ninety-nine percent on NIAH at the same context length.
Corn
Below seventy percent on a binary true-false task. That's barely better than flipping a coin for some of them.
Herman
And remember, this is not a hard task for a human who's read the book. If you've actually read Pride and Prejudice, you can verify "Elizabeth refused Darcy's first proposal" instantly. The model has the entire book in front of it and still can't do it reliably.
Corn
That says something profound about what "reading" means for these models. They're not reading the way we read. They're doing something that looks like reading for simple retrieval tasks but breaks down completely when you ask for comprehension.
Herman
The NoCha authors made exactly that point. They argued that NIAH-style benchmarks measure a model's ability to locate text, not to understand it. And the fact that models fail on NoCha at context lengths where NIAH is perfect suggests that as context grows, models aren't building a coherent representation of the content. They're just indexing it for retrieval.
Corn
Which brings us to Michelangelo. This is the one I find most philosophically interesting. What's the setup?
Herman
Michelangelo is named after the idea of sculpting — you're revealing the latent structure within the context. The benchmark tests whether models can identify patterns and relationships that aren't explicitly stated anywhere in the text, but that emerge from the aggregate of many individual pieces of information.
Corn
It's not retrieval, it's not even reasoning over stated facts. It's pattern recognition across the entire context.
Herman
A classic example from the Michelangelo paper: the context contains descriptions of hundreds of individual transactions between different entities. No single sentence tells you who the central hub is. But if you aggregate all the transactions, a pattern emerges — entity A is connected to everyone else. The question is: can the model identify entity A as the central node?
Corn
That requires integrating information from potentially every part of the context. You can't answer it by finding one sentence. You can't even answer it by finding two or three sentences and chaining them. You have to have absorbed the whole thing.
Herman
That's what makes it so hard. The Michelangelo benchmark includes several types of these latent structure tasks. There's the network centrality one I just described. There are temporal pattern tasks where the model has to identify a recurring sequence of events that's never explicitly labeled as a pattern. There are correlation tasks where two variables are related but the relationship is only visible across dozens of data points scattered through the text.
Corn
How long are the contexts in Michelangelo?
Herman
They tested at various lengths, but the key results are at thirty-two thousand and sixty-four thousand tokens. And here's the thing — even at thirty-two thousand tokens, which is well within the claimed comfort zone of every frontier model, performance on latent structure tasks is abysmal. We're talking thirty to fifty percent accuracy on tasks that require genuine integration across the full context.
Corn
I'm guessing NIAH scores at thirty-two thousand tokens are essentially perfect for these same models.
Herman
Ninety-nine point something percent. The divergence couldn't be starker. At the exact same context length, the same model goes from near-perfect retrieval to near-chance pattern recognition.
Corn
We've walked through all four benchmarks. Let's synthesize this. What's the through-line? What do RULER, BABILong, NoCha, and Michelangelo collectively tell us?
Herman
They tell us that there's a hierarchy of difficulty for long-context tasks, and the gap between levels is enormous. Level one is simple retrieval — find the sentence that matches this query. That's solved. All frontier models can do this at a million tokens. Level two is multi-target retrieval or state tracking — find multiple things, or track changes to one thing. Models start failing here at surprisingly short contexts. Level three is multi-hop reasoning — connect facts A and B and C to answer a question. Performance collapses at three or more hops beyond maybe eight thousand to sixteen thousand tokens.
Corn
Level four is what Michelangelo tests — latent structure, pattern recognition that requires integrating across the entire context. Models basically can't do this at all beyond relatively short documents.
Herman
And here's why this matters. When a company says "our model has a one-million-token context window," what they're really saying is "our model can do level one retrieval at one million tokens." They're not saying it can reason, track state, or identify patterns at that length. But most users don't know the difference. They hear "million-token context window" and think the model can read and understand War and Peace in one go.
Corn
It can't. It can find a specific sentence in War and Peace. It cannot tell you whether the narrative structure of War and Peace mirrors a particular historical pattern unless that's explicitly stated somewhere in the text.
Herman
There was a really good piece about this — I want to say it was on SemiAnalysis or maybe The Gradient, late twenty twenty-four — that introduced the concept of the "effective context window." It's the length at which the model can still perform a given task at above some threshold, say eighty percent accuracy. And the effective context window varies wildly depending on what you're asking the model to do.
Corn
For simple retrieval, the effective context window might be a million tokens. For two-hop reasoning, it might be thirty-two thousand. For three-hop reasoning, maybe eight thousand. For latent structure tasks, maybe four thousand.
Herman
That's the dirty secret of long-context marketing. The number on the spec sheet is the maximum possible context length for the easiest possible task. It's like a car manufacturer advertising top speed based on driving downhill with a tailwind.
Corn
What's the actual mechanism here? Why do models fail at reasoning over long contexts even when retrieval works fine?
Herman
There are a few competing explanations and it's probably a combination. One is attention dilution. As the context grows, the attention mechanism has to spread its weights across more and more tokens. Even with sophisticated attention architectures, relevant information gets less attention weight when it's surrounded by more noise.
Corn
The signal-to-noise ratio degrades with length.
Herman
That's part of it. Another factor is what some researchers call the "lost in the middle" problem — models pay disproportionately more attention to the beginning and end of the context, and information in the middle gets underrepresented. But the deeper issue might be representational. The model has a fixed-size representation of everything it's read. When you shove a million tokens into that representation, you're compressing aggressively. Fine details get blurred. For retrieval, you just need to know roughly where something is. For reasoning, you need precise logical relationships, and those get lost in the compression.
Corn
That compression analogy makes a lot of sense. It's like the difference between knowing that a book contains a sentence about Paris and knowing the exact logical relationship between that sentence and three other sentences scattered across different chapters.
Herman
There's a third factor that I think is underappreciated. These models are trained on mostly short-context data. The pre-training corpus is full of articles, forum posts, code files — things that fit in four thousand or eight thousand tokens. There's comparatively little training data that requires reasoning across a hundred thousand tokens. So the models never really learn to do it. The long-context capability is sort of bolted on after the fact through fine-tuning and architectural tricks like RoPE scaling.
Corn
It's not that the architecture is fundamentally incapable — it's that the training distribution doesn't teach it to reason at length, and the post-training fixes only get you so far.
Herman
That's my read, yeah. And it's consistent with what we see. Models that are fine-tuned specifically for long-context reasoning tasks do better on BABILong and RULER. But the general-purpose frontier models, even the very best ones, still show this massive gap.
Corn
Let's talk about what this means for actual users. Daniel's in the AI space, he's probably using these models with long contexts regularly. What should he and people like him actually take away from all this?
Herman
The first takeaway is: don't trust the context window number on the spec sheet. It's almost meaningless for anything beyond simple retrieval. If you're using a model to analyze a long document, you need to think about what kind of cognitive work you're asking it to do.
Corn
Break that down. What kinds of tasks are safe at long contexts, and what kinds should make you nervous?
Herman
Safe at long contexts: find-me tasks. Find every mention of this contract clause. Find all the dates. Find the sentence where the author discusses the budget. Summarization is also relatively safe — it's mostly a compression task, and models are decent at that even at length, though you'll lose nuance.
Corn
What about the nervous category?
Herman
Anything that requires connecting multiple pieces of information that aren't adjacent. If you're asking "does the argument in chapter three contradict the conclusion in chapter twelve," that's multi-hop reasoning across a gap, and you should be skeptical of the answer at anything beyond maybe thirty thousand tokens. If you're asking "what's the overall thematic pattern in this book," that's a latent structure task, and frankly, I wouldn't trust any current model to do that reliably on a full-length novel.
Corn
Even though the model will give you a confident, articulate answer.
Herman
Especially because it'll give you a confident, articulate answer. That's the dangerous part. The model doesn't know what it doesn't know. It'll happily spin you a plausible-sounding thematic analysis of a novel while having effectively processed maybe fifteen percent of the text.
Corn
The practical advice is: for long documents, break your queries into retrieval-style questions. Don't ask for synthesis across distant sections. Don't ask for pattern recognition. Ask for specific facts, and do the synthesis yourself.
Herman
Or use a multi-step workflow. Chunk the document, process each chunk separately, and then have the model reason over the chunk summaries. That's not perfect either — you lose cross-chunk connections — but it's often more reliable than dumping the whole thing in and hoping the model integrates it properly.
Corn
That's a useful heuristic. Chunk and summarize, then reason over summaries. It's basically admitting that the model can't do the integration itself and you're doing it manually.
Herman
That's the honest assessment of where we are in mid twenty twenty-six. The claimed context windows are a million tokens, two million tokens, and they're going to keep growing. But the effective context window for reasoning hasn't budged nearly as much. It's maybe doubled in two years while the claimed window has gone up by a factor of fifty.
Corn
That's a wild asymmetry. And most coverage doesn't make this distinction at all. You see headlines about million-token context windows and the implicit message is "the model can now read and understand entire books in one go." And that's just not true.
Herman
The benchmark community has done a really good job of developing these more rigorous tests. RULER, BABILong, NoCha, Michelangelo — each one exposes a different dimension of the gap between retrieval and understanding. But the benchmark results haven't filtered into public awareness the way the context window numbers have.
Corn
Because "million tokens" is a simple, impressive number you can put in a press release. "Sixty-three percent on multi-hop reasoning at thirty-two thousand tokens" is a lot harder to market.
Herman
To be fair to the model developers, they're not necessarily being dishonest. The models do technically support million-token contexts. You can put a million tokens in and get a response out. The architecture handles it. It's just that the quality of the response degrades in ways that aren't captured by the benchmarks most people look at.
Corn
I think that's a generous framing. If you know that your model's reasoning collapses at eight thousand tokens and you market it as having a million-token context window without that caveat, that's at least a sin of omission.
Herman
That's fair. And I think the pressure to compete on context length has created a race where nobody wants to be the first to admit that the number doesn't mean what customers think it means.
Corn
Let's zoom out for a second. Is this fixable? Are there architectural approaches on the horizon that might close the gap between retrieval and reasoning at long contexts?
Herman
There's a lot of work on this. One promising direction is better attention mechanisms — things like ring attention, dilated attention, sparse attention patterns that try to maintain coverage of the full context without the quadratic cost. Another is explicit memory architectures where the model maintains a separate, structured representation of what it's read.
Corn
Like a running summary that gets updated as it reads.
Herman
Instead of trying to attend to every token equally, the model compresses what it's read into a structured memory and reasons over that. Some of the newer architectures are exploring this. There's also work on retrieval-augmented generation inside the context window — having the model search its own context for relevant chunks before reasoning, rather than trying to keep everything in active attention simultaneously.
Corn
The model becomes its own search engine over its own context.
Herman
And that's essentially admitting that the naive approach of "just make the attention window bigger" isn't going to solve the reasoning problem. You need architectural changes that are specifically designed for integration and synthesis, not just for longer attention.
Corn
Do you think we'll see a point where the effective context window for reasoning catches up to the claimed context window? Or is this gap fundamental?
Herman
I don't think it's fundamental in principle. Humans can read long books and reason across them. It takes time and effort, but we can do it. So there's no theoretical reason a sufficiently advanced AI couldn't. But I think the current transformer-based architectures, even with all the modifications we've added, are hitting diminishing returns on long-context reasoning. We might need a different approach.
Corn
That's a sobering assessment. So for the foreseeable future, the advice is: enjoy your million-token context window for search and summarization, but don't trust it for thinking.
Herman
If the model tells you something about a long document that requires connecting dots across distant sections, go find those sections yourself and check. The model might be right, but the odds aren't as good as the confidence level suggests.
Corn
That's a good place to land. Daniel, I hope this gives you a useful framework for thinking about when to trust long-context outputs and when to be skeptical. The benchmarks have evolved, the evaluation is much better than it was two years ago, and the picture that's emerging is: retrieval is solved, reasoning at length is not.
Herman
If you're evaluating models for a use case that requires genuine long-context understanding, look past the context window number. Look at RULER scores, BABILong scores, NoCha and Michelangelo if they're available. Those tell you way more about what the model can actually do than the headline token count.
Corn
Thanks to Hilbert Flumingtop for producing, and thanks to Modal for keeping our pipeline running smoothly. This has been My Weird Prompts. You can find every episode at myweirdprompts.
Herman
If you're enjoying the show, leave us a review wherever you listen — it helps. We'll be back soon with whatever Daniel throws at us next.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.