Alright, here's what Daniel sent us this time. He's asking about context window management in modern AI systems — specifically the gap between theoretical limits and actual performance. So we've got frontier models with context windows up to a million tokens, and yet reasoning quality degrades well before you hit that ceiling. He wants us to dig into the mechanisms behind that — attention dilution, lost-in-the-middle effects, recency bias — and then walk through the full landscape of solutions: sliding windows, dynamic summarization, native platform features like Claude Code's compaction command. The key questions he's flagging: have frontends implemented transparent auto-summarization without needing manual triggers? Are there projects doing hierarchical memory with different compression ratios for different time horizons? And from a UX angle — should compression be invisible or transparent to users? Does working from summaries rather than raw text affect reliability and trust?
Great set of questions. And I want to start with something that I think reframes the whole conversation, because there's a really important piece of research that most people in this space haven't fully absorbed yet. A paper from Du et al. in twenty twenty-five ran an experiment where they replaced all the non-relevant tokens in a long context with blank spaces. The reasoning was: if the lost-in-the-middle problem is a retrieval problem — the model can't find the needle in the haystack — then removing the haystack should fix it. The needle should be obvious.
And it didn't fix it.
Not at all. Degradation persisted. Which means the problem isn't retrieval. It's a function of input sequence length itself. The attention mechanism struggles with long sequences regardless of what the content actually is. You could give the model a single relevant sentence surrounded by nothing, and it will still struggle if that sentence sits in the middle of a long sequence.
That's kind of philosophically unsettling, isn't it? Because the whole industry narrative has been "bigger context windows solve memory problems." But if the degradation is baked into how attention works at scale, expanding the window doesn't fix anything. You just have a bigger space in which to lose things.
That's the paradox. And the empirical shape of the failure is well-documented at this point. The foundational paper came out of Stanford and UC Berkeley in twenty twenty-three — the "Lost in the Middle" paper. They set up a clean experiment: give the model a set of documents where only one contains the answer, vary where that document appears in the context, and measure performance. What they found was a U-shaped performance curve. Information at the very beginning or end of the context is used reliably. Information in the middle is effectively invisible.
So the model has primacy bias and recency bias simultaneously.
Which is already counterintuitive enough. But then Veseli et al. in twenty twenty-five refined this further. They found the U-shape only holds when the context is less than fifty percent full. Once you cross that fifty percent threshold, the pattern shifts — the model starts favoring recency above everything else, and the earliest tokens are now the most likely to be lost.
So there's a phase transition at the halfway mark.
The practical implication is almost comically specific. Teresa Torres summarized it well: below fifty percent full, you lose the middle. Above fifty percent full, you lose the beginning. There is no configuration where everything is safe.
Which explains something I've noticed with Claude Code specifically — you put your architectural decisions or your CLAUDE.md rules at the start of a session, and by the time you're two hours in, it's like the model has never heard of them.
That's context rot, and it's the user-facing symptom of everything we've been describing. The MindStudio team documented this failure mode in detail earlier this month. You start a long session, the first hour is sharp — tight reasoning, consistent decisions. Then somewhere around the two-hour mark, the model starts re-asking questions you already answered, suggesting code that contradicts what it wrote an hour ago. It's not a bug in any traditional sense. It's the attention mechanism doing exactly what it does.
By the way, quick note — today's episode is being written by Claude Sonnet four point six. Which I find either ironic or appropriate, depending on your disposition.
Probably both. Alright, so given that this is the landscape — attention degrades, bigger windows don't fix it, there's a fifty percent phase transition — what do you actually do about it? Let's go through the main strategies, because they're quite different in their tradeoffs.
Start with the simplest one.
Sliding window, or what SWE-agent calls observation masking. You keep the most recent N turns verbatim and replace everything older with a placeholder — "some details omitted for brevity." The agent's reasoning chain stays intact, but the older tool outputs, file reads, test logs — those get masked. JetBrains published a major empirical study on this in December twenty twenty-five, presented at the NeurIPS Deep Learning for Code workshop. They compared observation masking against LLM summarization across five different settings, and the results were striking.
How striking?
Both strategies cut costs by over fifty percent compared to unmanaged context — just letting the context grow freely. But in four out of five settings, observation masking matched or beat LLM summarization on solve rates while being cheaper. With Qwen3-Coder 480B, observation masking boosted solve rates by two point six percent versus unmanaged context, and was fifty-two percent cheaper.
So the dumb approach is competitive with the smart approach.
More than competitive — it's often better. Which is a pattern that keeps appearing in machine learning. Simple baselines are chronically underestimated. The intuition is that LLM summarization should be better because it preserves semantic content rather than just deleting it. But the JetBrains data says: not reliably.
What's the failure mode of masking though? Because "some details omitted" sounds like it creates total amnesia for anything outside the window.
That's the catch. Observation masking slows context growth but doesn't stop it — if you allow infinite turns, the context still grows, just more slowly. And more importantly, the model has zero awareness that anything exists outside the window. It's not "I remember this vaguely" — it's "this never happened." For tasks with a bounded number of steps, that's often fine. For open-ended long-horizon tasks, you can hit a wall.
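To make the masking strategy concrete, here is a minimal sketch of the idea in Python — keep the last N turns verbatim, replace older tool observations with a placeholder, and leave the reasoning turns untouched. The function name, message shape, and placeholder text are illustrative assumptions, not SWE-agent's actual implementation:

```python
PLACEHOLDER = "[older observation omitted for brevity]"

def mask_observations(history, keep_recent=10):
    """Sliding-window observation masking: the reasoning chain (user and
    assistant turns) stays intact, but tool outputs older than the last
    `keep_recent` turns are replaced with a short placeholder."""
    masked = []
    n = len(history)
    for i, turn in enumerate(history):
        is_recent = i >= n - keep_recent
        if turn["role"] == "tool" and not is_recent:
            # Cheap, deterministic, no LLM call needed.
            masked.append({"role": "tool", "content": PLACEHOLDER})
        else:
            masked.append(turn)
    return masked
```

Note that nothing here summarizes: the masked turns are simply gone from the model's view, which is exactly the "this never happened" failure mode described above.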
Which is presumably why people reach for summarization.
Right. The idea with LLM summarization is that instead of deleting old content, you have a separate model — often a cheaper one — compress the older portion of the conversation into a digest. OpenHands uses this approach, summarizing twenty-one turns at a time while keeping the most recent ten verbatim. The Anthropic SDK supports it natively through a compaction control parameter — you can specify a cheaper model, like Claude Haiku, to do the summarization while the main agent keeps running on Sonnet.
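A rough sketch of that summarization pattern, using the OpenHands-style numbers mentioned above (compress twenty-one turns at a time, keep the most recent ten verbatim). The `summarize_fn` parameter stands in for a call to a cheaper model; the function shape and summary tag are assumptions for illustration, not OpenHands' or the SDK's actual API:

```python
def condense(history, summarize_fn, chunk=21, keep_recent=10):
    """When the history grows past chunk + keep_recent turns, compress the
    oldest `chunk` turns into a single summary turn and keep the rest."""
    if len(history) <= chunk + keep_recent:
        return history  # still small enough: no compression needed
    old, rest = history[:chunk], history[chunk:]
    digest = summarize_fn(old)  # e.g. a Haiku call: "summarize these turns"
    summary_turn = {"role": "system", "content": f"<summary>{digest}</summary>"}
    return [summary_turn] + rest
```

Each call processes a unique slice of the history, which is why, as noted below, these summarization calls get almost no prompt-cache reuse.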
That's a sensible architecture — use the cheap model for the compression work.
It is, but the JetBrains study found a genuinely counterintuitive problem. Agents using LLM summarization ran fifteen percent longer on average than agents using observation masking. With Gemini 2.5 Flash, summarization led to agents running fifty-two turns on average versus forty-five for masking. The hypothesis is that LLM-generated summaries smooth over signs that the agent should stop. The summary makes the agent feel like it's making progress — "I've done X, Y, Z, I'm making headway" — when actually it should recognize it's stuck and give up.
So the summary creates false confidence.
That's the best way to describe it. And there's a related problem: summarization is expensive per call. Each summary processes a unique slice of conversation history, so you get almost no cache reuse. The JetBrains study found summarization calls can account for more than seven percent of total cost per instance for large models.
Which is a meaningful overhead if you're running a lot of agents.
And it compounds with the fidelity question. What actually survives summarization? The Anthropic cookbook is explicit about this: ticket IDs, categories, priorities, outcomes — those get retained. Full knowledge base article text, detailed classification reasoning, complete drafted responses — those get lost. And when a model is summarizing its own prior work, you get hallucination risk on top of information loss. The model may confabulate details about decisions it made, especially complex technical ones.
There's something recursive and slightly alarming about that. The model summarizes its own previous reasoning, possibly introduces errors, then reasons from that summary, and then that gets summarized again.
It's a lossy compression chain. Each pass potentially introduces drift. Over a long session with multiple compaction events, you could accumulate meaningful distortions. It's an open empirical question how much drift actually accumulates in practice, but the theoretical risk is real.
So what does the hybrid approach look like? Because it seems like you want masking for the common case and summarization as a fallback.
That's exactly what JetBrains proposed and tested. Use observation masking as the primary defense, and only trigger LLM summarization when the context becomes truly unwieldy — when masking alone can't prevent overflow. On SWE-bench Verified with Qwen3-Coder 480B, the hybrid was seven percent cheaper than pure observation masking and eleven percent cheaper than pure LLM summarization, while preserving comparable solve rates. OpenHands exposes this natively through their PipelineCondenser — you can chain multiple condensers in sequence: remove old events first, then summarize what remains, then truncate if needed. It's a production-ready multi-stage compression architecture.
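The staged fallback logic can be sketched as follows — cheap masking first, an LLM summarization pass only if the context is still over budget, hard truncation as the last resort. This is a self-contained illustration in the spirit of a chained condenser pipeline; `summarize_fn`, `token_count`, and the exact stage boundaries are assumptions, not OpenHands' actual PipelineCondenser API:

```python
def hybrid_condense(history, token_count, budget, summarize_fn):
    """Three-stage condenser: mask, then summarize, then truncate."""
    # Stage 1: mask old tool observations (deterministic, no LLM call).
    n = len(history)
    masked = [
        t if t["role"] != "tool" or i >= n - 10
        else {**t, "content": "[omitted]"}
        for i, t in enumerate(history)
    ]
    if token_count(masked) <= budget:
        return masked
    # Stage 2: summarize everything except the last ten turns.
    old, recent = masked[:-10], masked[-10:]
    condensed = [{"role": "system", "content": summarize_fn(old)}] + recent
    if token_count(condensed) <= budget:
        return condensed
    # Stage 3: hard truncation if summarization still overflows.
    return condensed[-10:]
```

The ordering is the point: the expensive, lossy stages only run when the cheap ones have already failed.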
Now let's talk about the more sophisticated end of the spectrum. Because hierarchical memory is a different paradigm entirely.
TiMem is the most interesting recent work here. Published in January twenty twenty-six by researchers at the Institute of Automation at the Chinese Academy of Sciences. The core idea is a Temporal Memory Tree with five levels. Level one is raw dialog turns — created after each exchange, high fidelity. Level two is non-redundant event summaries across a session. Level three captures routine contexts and recurrent interests across a day. Level four is evolving behavioral features and preference patterns across a week. Level five is stable personality, preferences, and values — updated monthly.
So it mirrors roughly how human memory consolidates over time.
That's the explicit inspiration. Recent memory is kept at high fidelity, older memory is progressively abstracted into patterns and then into a persona profile. And the recall mechanism is complexity-aware — simple queries only search levels one, two, and five. Complex queries traverse the full hierarchy. You're not over-retrieving for simple questions, but you can go deep when you need to.
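The shape of that hierarchy and the complexity-aware recall path can be sketched as a simple data structure. This is an illustration of the idea only — the class, its storage, and the substring matching are assumptions, and TiMem's actual consolidation between levels is done with LLM summarization, not shown here:

```python
from dataclasses import dataclass, field

@dataclass
class TemporalMemoryTree:
    """Five-level temporal hierarchy in the spirit of TiMem."""
    levels: dict = field(default_factory=lambda: {
        1: [],  # raw dialog turns (per exchange, high fidelity)
        2: [],  # non-redundant event summaries (per session)
        3: [],  # routine contexts, recurrent interests (per day)
        4: [],  # evolving behavioral features (per week)
        5: [],  # stable persona: preferences and values (per month)
    })

    def store(self, level, item):
        self.levels[level].append(item)

    def recall(self, query, complex_query=False):
        # Complexity-aware recall: simple queries touch only levels 1, 2,
        # and 5; complex queries traverse the full hierarchy.
        search = (1, 2, 3, 4, 5) if complex_query else (1, 2, 5)
        return [m for lv in search for m in self.levels[lv]
                if query.lower() in m.lower()]
```

The recall split is what drives the reduction in recalled context length: most queries never pull from the day- and week-level layers at all.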
What are the benchmark numbers?
Seventy-five point three percent on LoCoMo and seventy-six point eight eight percent on LongMemEval-S, which are the main long-horizon conversation benchmarks. Fifty-two percent reduction in recalled context length on LoCoMo compared to baselines. It outperforms MemOS, Mem0, MemoryOS, A-MEM, and MemoryBank. And it requires no fine-tuning — it's a plug-and-play layer on any LLM.
The no fine-tuning part is significant. Because if you need to retrain the model to use the memory system, your deployment story gets very complicated.
There was also a concurrent paper from Ningning Zhang et al. called HiMem, published the same week in January twenty twenty-six. They use a dual-channel segmentation strategy — one channel tracks concrete interaction events, the other tracks stable knowledge. The two are linked hierarchically. It's a slightly different architecture but converging on the same insight: different types of information should be compressed on different schedules.
Before we get to the UX side of this, I want to flag MemGPT, because it takes a completely different approach philosophically.
MemGPT from UC Berkeley in twenty twenty-three is the one that framed this as an operating systems problem. The LLM is like a CPU with limited RAM. External storage is the disk. Information gets paged in from archival memory when needed and paged out when it's not. And critically — the model itself manages what to keep in context via function calls. It's not a separate system deciding what the model sees. The model is an active participant in its own memory management.
Which raises a question about trust. If the model decides what to remember, and the model has primacy and recency biases, then you're letting a biased system manage its own biases.
That's a real concern. Though the MemGPT framing is that you give the model explicit memory management tools — read, write, search — and it uses them deliberately rather than relying on implicit attention. The biases are in the attention mechanism; explicit function calls are a different pathway.
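To make the "explicit pathway" point concrete, here is a minimal sketch of what giving the model memory tools looks like — an external store plus named functions the model can call deliberately. The class, tool names, and substring search are illustrative assumptions, not MemGPT's actual API (real systems use embedding search rather than substring match):

```python
class ArchivalMemory:
    """External store the model manages through explicit function calls."""
    def __init__(self):
        self._store = []

    def write(self, text):
        # The model decides to persist a fact outside its context window.
        self._store.append(text)
        return "ok"

    def search(self, query, k=3):
        # Substring match keeps the sketch runnable; production systems
        # would page in results via embedding similarity.
        hits = [t for t in self._store if query.lower() in t.lower()]
        return hits[:k]

# Tool schema exposed to the model, so recall is a deliberate call
# rather than a side effect of attention over a long prompt.
MEMORY_TOOLS = [
    {"name": "archival_write", "description": "Save a fact to long-term memory"},
    {"name": "archival_search", "description": "Retrieve facts from long-term memory"},
]
```

The key design property is that a search either returns the fact or visibly returns nothing — unlike attention over a long context, the failure mode is explicit.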
Let's get into the UX question, because I think this is where the discussion gets genuinely interesting from a product perspective. The core tension is: should compression be visible to users, or should it happen silently in the background?
The current landscape is pretty fragmented on this. Web UIs — ChatGPT, Claude dot ai — offer no visibility whatsoever. No context percentage, no indicator that anything is being compressed or lost. Teresa Torres made a sharp observation about this: these tools make it seem like the context window is infinite because you can just keep chatting forever. Most users have no idea that the model's effective memory of the conversation is degrading as the conversation grows.
And that seems actively harmful. Not in a dramatic way — no one's getting hurt — but it creates systematically miscalibrated expectations.
Which is the argument for transparency. Claude Code takes the opposite approach: the context utilization percentage is always visible in the status bar. You can run slash compact with custom preservation instructions, slash context to inspect what's in the window, or slash clear to wipe it. The user is in the loop at every stage.
The MindStudio guide made a recommendation I found compelling — compact at sixty percent utilization, not ninety-five.
This is the "summarizing a degraded view" problem. If you wait until eighty or ninety-five percent full to compact, the model is already working with partial, compressed information. The summary it generates reflects that degradation. You're not summarizing the original conversation — you're summarizing the model's already-impaired view of the conversation. Whereas at sixty percent, the model still has full uncompressed access to everything, so the summary it generates is based on complete information.
It's the difference between making a copy before the original deteriorates versus making a copy of a deteriorated original.
And the Anthropic SDK data shows the cost savings are real regardless of when you compact. They ran five customer service tickets through an agentic workflow — without compaction it consumed two hundred and eight thousand eight hundred and thirty-eight tokens. With automatic compaction enabled, it consumed eighty-six thousand four hundred and forty-six tokens. That's a fifty-eight point six percent reduction. Two compaction events triggered automatically during processing.
Fifty-eight percent is a meaningful number when you're running thousands of agent instances.
It compounds fast. The compaction mechanism works by monitoring token usage per turn, injecting a summary prompt as a user turn when the threshold is exceeded, having the model generate a summary wrapped in summary tags, clearing the conversation history, and resuming with the compressed context. It's elegant, but there's an important limitation: it doesn't work optimally with server-side sampling loops, because cache tokens accumulate and can trigger compaction prematurely.
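The monitor-then-compact loop just described reduces to a small amount of control flow. This sketch uses the sixty percent threshold recommended earlier; the function shape, the two hundred thousand token window, and the summary-tag convention are assumptions for illustration, not the SDK's actual interface:

```python
def maybe_compact(history, used_tokens, window=200_000, threshold=0.6,
                  summarize_fn=None):
    """Check utilization each turn; past the threshold, replace the history
    with a tag-wrapped digest and resume from the compressed context.
    Returns (new_history, compacted_flag)."""
    if used_tokens / window < threshold:
        return history, False  # still under budget: no compaction
    # In the real flow the summary request is injected as a user turn and
    # the model answers with the digest; summarize_fn stands in for that.
    digest = summarize_fn(history)
    fresh = [{"role": "user", "content": f"<summary>{digest}</summary>"}]
    return fresh, True
```

Compacting at sixty percent rather than ninety-five matters for exactly the reason given above: the digest is generated while the model still has an uncompressed view of the whole conversation.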
So it's developer-facing transparency, not end-user-facing transparency. The SDK gives developers control, but that control doesn't propagate to the person actually using the application.
That's the gap. And I think it's the most interesting unsolved problem in this space. You have Claude Code, which gives sophisticated users full visibility and control. You have the Anthropic SDK, which gives developers programmatic control. You have OpenHands' condenser architecture, which is automatic but invisible to end users. What doesn't exist yet — at least not in a mature form — is a consumer-facing interface that shows users their context status, tells them what's been compressed, and lets them specify what to preserve.
And the counterargument to transparency is that most users don't want to think about this. They want to talk to an AI assistant, not manage a database.
Which is a reasonable product instinct. But the MindStudio guide identifies a practice that makes transparency feel necessary: they recommend sending a verification message after every compaction — "summarize where we are and what we're working on next" — to confirm the post-compaction state matches your expectations. If something important got dropped, you add it back immediately.
That's a workflow that only works if you know compaction happened.
Which is the crux of it. If compaction is invisible, you can't verify. If you can't verify, you're operating on the assumption that the AI remembers everything it needs to remember — which is false, and increasingly false as the session grows.
There's something worth naming here about the trust calibration problem. Because there are two failure modes. One is the user knowing the system has limits and being anxious about it. The other is the user not knowing and developing false confidence.
And false confidence is the more dangerous failure mode. If I know the AI's memory degrades, I can compensate — I can restart sessions, I can compact early, I can explicitly re-state critical constraints. If I don't know, I just assume the AI is tracking everything, and I'm surprised when it contradicts itself or ignores an instruction I gave an hour ago.
The "who decides what to forget" question is interesting too. Because the options are pretty different in their implications. You've got the model deciding — MemGPT style. You've got a separate summarizer model deciding. You've got the user deciding via explicit commands. You've got a rule-based system deciding via sliding window rules. You've got a hierarchical scheduler deciding via temporal boundaries like TiMem.
And each of those has different trust properties. A rule-based system is predictable — you know exactly what gets dropped. A separate summarizer model is opaque but potentially more semantically intelligent. The user deciding is the most trustworthy from a user perspective but requires the most cognitive load. The model deciding is interesting because it's the most context-aware, but it's also the most circular — you're asking the model to make decisions about its own limitations.
The Paulsen paper from twenty twenty-five is worth flagging here because it broadened the scope of the problem. The original lost-in-the-middle research was mostly about needle-in-a-haystack tasks — find this specific fact. Paulsen showed that context degradation affects a much wider range of task types, and often with far fewer tokens for complex tasks. So you can't just say "well, my use case doesn't involve retrieval, so I'm fine."
The complexity of the task matters a lot. A simple factual lookup might degrade at a hundred thousand tokens. A complex multi-step reasoning task might start showing degradation at twenty thousand tokens. The threshold isn't fixed — it scales with task demand.
Which means the engineering answer can't be a single fixed compaction threshold either. You'd want something that's sensitive to what the agent is actually trying to do.
TiMem's complexity-aware recall is a step in that direction, but it's on the retrieval side. On the compaction side, I'm not aware of anyone who's built a task-complexity-aware compaction trigger yet. That feels like an open research and engineering problem.
What's your overall read on where this is heading? Because we have academic systems like TiMem with impressive benchmarks, and we have production systems like OpenHands and Claude Code with pragmatic solutions, and they feel like they're on different timescales.
I think the gap closes faster than people expect, because the economic pressure is real. A fifty-eight percent token reduction isn't just a cost saving — it's a capability expansion. If you can run twice as many agent steps for the same budget, you can tackle harder tasks. The teams building production agents have strong incentives to adopt whatever works.
And the JetBrains finding that simple observation masking beats sophisticated summarization in most settings is actually good news for adoption, because it's cheap and easy to implement.
The hybrid approach is probably where most production systems end up — masking for the common case, summarization as a fallback for long trajectories, hierarchical memory for applications where long-term user modeling matters. No single strategy dominates across all use cases.
The UX transparency question feels less settled to me. Because the current situation — where sophisticated users in Claude Code get full visibility and everyone else using a web UI gets nothing — is a bit of an accidental divide.
It is. And I think the right answer is something like what Claude Code does, but designed for a less technical audience. You don't need to show users token counts. But you could show a simple indicator — "your conversation memory is getting full, would you like me to summarize what we've covered?" That's transparent without being technical.
And it shifts the decision to the user at the right moment rather than either surprising them with degraded behavior or burdening them with constant memory management.
The verification practice is the key thing. If you compact and then ask "what do you remember about our goals for this session?" — that thirty-second check catches most of the cases where something important got dropped. That workflow should be built into the interface, not left as a manual step for power users.
Alright, practical takeaways. What should someone actually do with this?
If you're using Claude Code for long sessions, compact at sixty percent, not ninety-five. Put your most critical constraints in a preservation instruction when you run slash compact. And treat the post-compaction verification as non-optional. If you're building agentic systems, start with observation masking before reaching for LLM summarization — the JetBrains data suggests it'll perform comparably and cost significantly less. Only add summarization for trajectories that genuinely overflow masking's capacity.
For anyone using web UIs who doesn't have access to compaction tools — the Teresa Torres recommendation is worth following: start a fresh chat when you switch topics, when the model does something noticeably wrong, or when the conversation exceeds roughly fifteen messages. It's blunt, but it's the only lever you have.
And if you're evaluating memory architectures for longer-horizon applications — personal assistants, long-running research agents, anything where the same user interacts across many sessions — TiMem's approach of progressive compression with temporal structure is the most principled framework we have right now. Seventy-five percent accuracy on LoCoMo with a fifty-two percent reduction in recalled context length is a strong result.
The deeper question that this whole episode circles around — whether AI systems should be honest with users about their memory limitations — feels like it has an obvious answer, but the industry hasn't acted on it yet.
Calibrated trust is the goal. Users who know the system has limits can work with those limits. Users who don't know are just waiting to be surprised. The technology to make this transparent exists — Claude Code proves it. The question is whether consumer product teams prioritize it.
That's probably where we leave it. Thanks as always to our producer Hilbert Flumingtop for keeping this show running. Big thanks to Modal for the GPU credits that power the generation pipeline — genuinely couldn't do this without them. This has been My Weird Prompts. If you want to find us, head to myweirdprompts dot com for the RSS feed and all the ways to subscribe. We'll see you next time.