So here's the question that kicked this one off for me: in a multi-agent wargame, how does an AI general remember a decision they made forty-seven turns ago, without having their entire conversation history dumped into context every single turn? That's the memory architecture problem. And if you get it wrong, you don't have a serious simulation. You have a very expensive improv exercise.
Herman Poppleberry here, and yeah, that framing is exactly the right entry point. Because the naive answer is, well, just give the agent a long context window. Modern models are hitting two hundred thousand tokens. Surely you can fit everything in there. But that's the trap. The memory problem is not fundamentally a size problem.
Before we get into why that's wrong, let me set up the full picture. We've touched on world state before in the wargaming series, so I'll be quick. But for this episode we're really covering all three layers of what a serious simulation needs. Layer one is world state: referee-authored, shared, the objective snapshot of what is happening right now. Unit positions, environmental conditions, the map. Every actor reads a filtered version of it each turn. Layer two is specific context injection, which is the per-actor private layer. Persona, doctrine, the things this particular agent knows that others don't. The fog-of-war material. And layer three, which is where we're spending most of our time today, is persistent memory. The long-running store of what each actor has done, said, believed, and been told across all previous turns.
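To make the three layers concrete, here's a minimal sketch in code. All the names here are illustrative, not from any particular framework — the point is just the separation of concerns: a shared-but-filtered world state, a private per-actor context, and a long-running store the agent never reads directly.

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    # Layer one: referee-authored, shared, filtered per actor on read.
    turn: int
    unit_positions: dict
    environment: dict

    def filtered_view(self, visible_units: set) -> dict:
        # Each actor sees only the units their position and sensors
        # plausibly cover — the fog-of-war filter.
        return {u: p for u, p in self.unit_positions.items()
                if u in visible_units}

@dataclass
class ActorContext:
    # Layer two: per-actor private injection.
    persona: str
    doctrine: str
    private_intel: list = field(default_factory=list)

@dataclass
class PersistentMemory:
    # Layer three: the long-running store. The agent never reads this
    # directly; it reaches the agent only via retrieval or summarization.
    entries: list = field(default_factory=list)

    def record(self, turn: int, kind: str, content: str):
        self.entries.append({"turn": turn, "kind": kind, "content": content})
```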
And the critical thing about layer three that makes it technically interesting is that it is not directly visible to the agents. The actor doesn't just read their memory like reading a file. They interact with it through a mediated retrieval or summarization mechanism. And that mediation is where all the interesting engineering decisions live.
Also, by the way, today's script is powered by Claude Sonnet four point six. Just worth noting before we get deep into the weeds.
Right. So let's establish why the three-layer model exists at all. Because you could imagine a simpler world where each agent just gets a big context block with everything relevant. World state, their persona, their history. Dump it all in, run the turn, done.
Which sounds clean until you think about what happens at turn three hundred with twenty actors.
The cost and latency alone would make it unworkable. But there's a more fundamental issue, which is blinding. In a serious wargame, Agent A must not have access to Agent B's reasoning history. Not just their current position, but their internal monologue, their prior commitments, their emotional state across turns. If you're using a single shared context architecture, you've already broken the simulation before it starts. The whole point of fog-of-war is that actors operate under genuine uncertainty about what others know and intend.
So the three-layer separation isn't just an optimization. It's a correctness requirement.
That's the right way to think about it. Layer one being shared but filtered is actually doing a lot of work. The referee maintains the master world state, and each actor gets a view of it that reflects what they should plausibly be able to observe given their position and capabilities. Layer two then shapes how they interpret that view. An agent with an aggressive commander persona and Air-Land Battle doctrine is going to read the same battlefield snapshot very differently than an agent running a defensive posture with a different doctrine model.
And then layer three is where the history lives. So when that aggressive commander is deciding whether to push through a gap in the line, they're not just responding to the current snapshot. They're drawing on a history of prior decisions, commitments they made, intelligence they received. Or at least, they should be.
Which brings us to the blinding discipline, because this is where a lot of simulation implementations get it wrong. The referee, what you might call the simulation master, has God Mode access. They can read every actor's layer three store in full. That's essential for detecting metagaming, which is when an agent acts on information it shouldn't have. If your LLM is drawing on training data patterns to anticipate an opponent's move rather than in-game intelligence, the referee can catch that by auditing the memory layer.
That's a subtle failure mode. The model essentially has prior knowledge about how, say, a particular military doctrine tends to play out, and it bleeds through into the simulation.
And it's hard to detect without that referee access because the agent's outputs might look completely reasonable. The behavior is consistent with the persona. But the reasoning is contaminated. You need to be able to diff what the agent claims to know against what's actually in their memory store.
There's almost a parallel to how human analysts talk about mirror imaging — when an intelligence analyst unconsciously projects their own assumptions onto an adversary's decision-making. The model is doing something structurally similar. It's filling gaps in its in-game knowledge with patterns from training, and the outputs look plausible enough that you'd never catch it without the referee layer.
That's exactly the right analogy. And it's one reason why the God Mode access is not just a technical safeguard. It's an epistemological safeguard. You're protecting the integrity of what the simulation can actually tell you.
So the referee sees everything, actors see only their own slice, and even that slice is mediated. Let's get into the three implementation patterns for layer three, because I think this is where the real engineering tradeoffs live.
So the three main patterns are vector stores per persona, summarization chains, and full-history replay. And they represent fundamentally different tradeoffs between fidelity, cost, and what I'd call psychological coherence.
Start with full-history replay, because I want to understand why it fails before we talk about what works.
Full-history replay is the naive approach. You maintain a log of everything the actor has done and said, and at each turn you replay the last N turns of that log into the context window. The appeal is obvious: perfect recall within that window, no compression artifacts, the agent has exactly what happened.
But N has a hard ceiling.
Even at two hundred thousand tokens, if you have a complex simulation running three hundred turns with rich internal monologue and multi-party messaging at each turn, you're going to hit that ceiling somewhere around turn fifty to seventy, depending on verbosity. And then you're back to a sliding window, which means the agent genuinely cannot access anything before the cutoff. The general has no memory of the campaign's opening moves.
And the cost scaling is brutal. You're paying for two hundred thousand tokens of context on every single turn for every single actor.
For a twenty-actor simulation running three hundred turns, the per-turn context grows with history length until it saturates the window, and the total token spend scales with both turn count and actor count. It becomes economically unworkable fast. So full-history replay is essentially a prototyping approach. It's useful for testing that your simulation logic is correct in the first ten turns. It's not a production architecture.
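The sliding-window cliff is easy to see in a toy implementation. This sketch assumes a fixed token budget and a rough four-characters-per-token estimate — both illustrative, not a real tokenizer:

```python
TOKEN_BUDGET = 200_000  # illustrative context budget

def estimate_tokens(text: str) -> int:
    # Crude heuristic, not a real tokenizer.
    return max(1, len(text) // 4)

def replay_window(turn_log: list[str], budget: int = TOKEN_BUDGET) -> list[str]:
    """Walk backward from the most recent turn, keeping turns until the
    budget is exhausted. Everything earlier silently falls off."""
    window, used = [], 0
    for entry in reversed(turn_log):
        cost = estimate_tokens(entry)
        if used + cost > budget:
            break
        window.append(entry)
        used += cost
    return list(reversed(window))
```

Run it over a three-hundred-turn log and the opening moves are simply gone — the general has no access to anything before the cutoff, which is exactly the strategic-drift setup described next.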
But how does that sliding window failure actually manifest in practice? Like, what does it look like when an agent loses access to its early history?
It tends to show up as what you might call strategic drift. The agent's behavior becomes increasingly reactive and short-term because all it has is recent context. An actor who established a clear strategic doctrine in the first twenty turns — say, a commitment to avoiding civilian infrastructure — starts making decisions that erode that doctrine incrementally, because the original commitment is no longer in the window. Each individual decision looks locally defensible. But the cumulative drift is significant. And if you're not running the referee audit, you might not catch it until you're doing post-hoc analysis and you notice the agent's behavior in turn two hundred looks nothing like their behavior in turn ten.
So the failure is gradual enough that it's easy to miss in real time.
Which is exactly what makes it dangerous for a production simulation. Okay, so vector stores. This is the retrieval-augmented approach.
Each actor gets their own dedicated vector database. Pinecone, Milvus, Redis with vector indexing, whatever fits your infrastructure. At each turn, before the agent generates its response, the system runs a semantic retrieval against that actor's store. The query is typically constructed from the current world state plus the current turn's injected context. What comes back is the set of prior experiences most semantically relevant to what's happening right now.
So the agent doesn't get everything. They get what's relevant.
And relevance is determined by embedding similarity, not chronology. This is a critical design choice. A general facing an artillery bombardment retrieves memories of past artillery encounters, regardless of whether those encounters were in turn three or turn two hundred and forty. The temporal structure is replaced by associative structure.
Which is actually closer to how human memory works.
It is. And the retrieval latency is fast enough to be practical. For a store of ten thousand or more documents, modern embedding-based retrieval is typically under a hundred milliseconds. That's not adding meaningful overhead to a simulation turn.
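A minimal sketch of per-actor retrieval. The embeddings here are toy bag-of-words vectors so the example runs standalone — a real system would use a proper embedding model and a vector database — but the shape is the same: one store per actor, ranked by similarity rather than chronology:

```python
import math

def embed(text: str) -> dict[str, float]:
    # Toy bag-of-words "embedding"; stand-in for a real embedding model.
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ActorMemoryStore:
    """One store per actor — never shared across personas."""
    def __init__(self):
        self.docs: list[tuple[dict, str]] = []

    def write(self, text: str):
        self.docs.append((embed(text), text))

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        # Relevance is embedding similarity, not recency: a turn-3 memory
        # can outrank a turn-240 one if it matches the current situation.
        qv = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(qv, d[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```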
But there's a blinding implementation detail here that I want to make sure we nail down. How do you guarantee Actor A cannot query Actor B's vector store?
Strict access control at the infrastructure level. Each persona's vector store is a separate logical database with separate credentials. The retrieval service is actor-scoped. When the turn orchestrator runs Actor A's turn, it only has credentials to query Actor A's store. The retrieval call cannot be constructed to cross-query another actor's store. This has to be enforced at the system architecture level, not at the prompt level. If you're relying on instructions in the prompt to prevent cross-contamination, you've already lost.
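One way to sketch that structural enforcement: the credential alone selects the store, and the query interface has no parameter by which a caller could name another actor's store. Class and token names are illustrative:

```python
class ScopedRetrievalService:
    def __init__(self):
        self._stores: dict[str, list[str]] = {}   # actor_id -> private docs
        self._credentials: dict[str, str] = {}    # token -> actor_id

    def provision(self, actor_id: str, token: str):
        self._stores[actor_id] = []
        self._credentials[token] = actor_id

    def write(self, token: str, doc: str):
        self._stores[self._resolve(token)].append(doc)

    def query(self, token: str, needle: str) -> list[str]:
        # The token determines which store is searched; there is no way
        # to construct a call that crosses into another actor's store.
        own = self._stores[self._resolve(token)]
        return [d for d in own if needle in d]

    def _resolve(self, token: str) -> str:
        if token not in self._credentials:
            raise PermissionError("unknown credential")
        return self._credentials[token]
```

The blinding lives in the architecture, not in the prompt: an agent holding Actor A's token simply cannot reach Actor B's documents.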
Because the model will eventually find a way to surface something it shouldn't.
Or more precisely, if you share an embedding space across actors, semantic similarity queries can surface memories from the wrong actor. You need the separation to be structural.
And it's worth being concrete about what that looks like when it goes wrong. Imagine you have a shared embedding space and Actor A queries for memories about artillery positioning in a river valley. If Actor B had a detailed internal monologue about their artillery doctrine in similar terrain, that document could surface in Actor A's retrieval results because the embedding similarity is high. The access control logic in the prompt says "only use your own memories," but the retrieval has already returned the contaminated result. The model is now reasoning from information it shouldn't have, and there's no clean way to detect that from the output alone.
So the failure is invisible at the output layer, which is exactly the worst kind of failure for a simulation that's supposed to produce interpretable results.
Right. The outputs look fine. The reasoning looks fine. The blinding is just silently broken. Okay. Summarization chains. This is the third pattern and I suspect it's the most psychologically interesting one.
This is where it gets really fascinating. Instead of storing raw turn logs and retrieving them semantically, you maintain a running narrative summary of the actor's history. After each turn, a separate summarizer LLM takes the existing summary plus the new turn's events and produces an updated compressed narrative. The agent then gets this summary as part of their context each turn.
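The chain structure is simple to sketch. Here the LLM call is stubbed out so the loop runs offline — in a real system `summarize` would be an API call to the summarizer model:

```python
def summarize(prompt: str) -> str:
    # Stand-in for a real LLM call; returns the prompt tail here so the
    # chain structure is testable offline.
    return prompt[-500:]

def update_summary(prior_summary: str, turn_events: list[str]) -> str:
    prompt = (
        "You are the memory editor for this actor.\n"
        f"Existing summary:\n{prior_summary}\n"
        "New events this turn:\n" + "\n".join(turn_events) + "\n"
        "Produce an updated compressed narrative."
    )
    return summarize(prompt)

# The chain: the agent never sees raw logs, only the rolling summary.
summary = ""
for turn_events in [["moved north"], ["engaged artillery"]]:
    summary = update_summary(summary, turn_events)
```

Notice that every past event reaches the agent only through repeated compression — whatever the summarizer drops at any step is gone for every subsequent turn.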
So the agent is essentially reading a biography of themselves.
A biography written by an automated editor. And that editor's choices become the agent's psychology. This is what I'd call the summarizer-as-subconscious concept. If your summarizer prompt is optimized for tactical efficiency, it's going to preserve unit movements, engagements, territorial changes. It's going to compress or drop diplomatic exchanges, emotional commitments, minor slights.
And then three turns later when the agent needs to remember that they promised humanitarian corridor access to the Red Cross representative, that commitment is just... gone.
There's a concrete failure mode from a simulation run where a military AI was maintaining a negotiation track alongside kinetic operations. The summarizer was tuned to prioritize tactical state. Over about thirty turns, the humanitarian commitments got compressed into a single clause in the summary. By turn eighty, the agent was authorizing actions that directly violated commitments it had made at turn fifteen. Not because the model was being inconsistent. Because from the model's perspective, those commitments had never been salient. They weren't in the retrieved context.
Which is a problem for the validity of the simulation because real military commanders would be managing that commitment actively. And there's a real-world analog here that I think is worth naming. In actual military operations, there are dedicated legal and policy advisors embedded in command structures precisely to track these kinds of commitments across time. The law of armed conflict obligations, the rules of engagement constraints, the diplomatic side agreements. Those don't get to fall out of working memory just because the tactical situation is demanding. The institutional structure exists to enforce that continuity. In a simulation, the summarizer is supposed to be playing that role, and if it's not designed to, nobody is.
So the summarizer design is actually modeling something real about how organizations maintain institutional memory under operational stress.
Which means getting it wrong isn't just a technical failure. It's a representational failure. You're simulating a command structure that has no institutional memory, and then drawing conclusions from its behavior.
And this is where the summarizer design becomes a first-order engineering decision, not an afterthought. The summarizer prompt needs to explicitly preserve categories of information that are easy to compress away. Diplomatic commitments. Emotional state signals. Stated intentions about future behavior. Relationships with other actors. The things that feel like soft data but are actually load-bearing for long-run consistency.
How do you even test whether your summarizer is doing this correctly?
One approach is what you might call a consistency probe. You run the simulation, then at various points you take a sample of the agent's raw turn logs from early in the simulation and compare them against what the summarized history says about those events. You're looking for systematic omissions. If the summarizer consistently drops a particular category of information, you'll see it as a pattern of gaps between raw log and summary.
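A toy version of that consistency probe, assuming each raw log entry is tagged with a category and a few keywords — the tagging scheme is illustrative. It flags categories that appear in the raw log but never surface in the summary:

```python
def probe_omissions(raw_log: list[dict], summary: str) -> set[str]:
    """raw_log entries look like {"category": ..., "keywords": [...]}.
    A category counts as preserved if any of its entries' keywords made
    it into the summary text; the rest are systematic omissions."""
    preserved, seen = set(), set()
    for entry in raw_log:
        seen.add(entry["category"])
        if any(kw in summary for kw in entry["keywords"]):
            preserved.add(entry["category"])
    return seen - preserved
```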
And if you find gaps, you're essentially debugging the agent's psychology.
That's not an overstatement. Because the summary is the only history the agent has. If the summary says the agent has always been cautious and measured, the agent will carry that caution into every decision, even if the raw logs show a more volatile early history. The summarizer has retroactively constructed a coherent narrative.
That's actually a bit unsettling when you think about it in terms of simulation validity. You're not just building a memory system. You're building a character.
And this is the deepest insight in the whole architecture. The persistent memory layer is not a neutral recording. Every design choice you make in how memories are compressed, retrieved, and weighted is a choice about who this agent is. The persona prompt in layer two sets the initial character. The memory architecture in layer three determines how that character develops over the course of the simulation.
Let's talk about cross-turn consistency mechanisms, because I think this is where the generative agents work from Stanford becomes really relevant.
The Park et al. generative agents paper from 2023 is still the foundational reference here. The architecture they introduced has three components: memory streams, reflection, and planning. The memory stream is the raw log of observations and actions. Reflection is a periodic process where the agent looks at its own memory stream and generates higher-level insights. Things like, I have noticed that the Blue Force consistently withdraws when I concentrate artillery. Those insights are stored back into the memory stream with a higher importance weight.
So the agent is explicitly building a model of patterns in its own experience.
And those high-importance insights get retrieved preferentially when the agent is constructing its context for a new turn. This is what prevents the kind of contradiction where an agent does something that directly conflicts with a strategic insight it developed ten turns ago. The insight is in the retrieved context. The agent has to reason in light of it.
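The retrieval scoring in the generative agents paper combines recency, importance, and query relevance. This is a sketch of that idea with illustrative weights and decay rate — the paper normalizes each component before combining, which is omitted here for brevity:

```python
def retrieval_score(memory: dict, query_relevance: float,
                    current_turn: int, decay: float = 0.99) -> float:
    # Recency decays exponentially with age; importance is assigned when
    # the memory is written (reflections get higher weights).
    recency = decay ** (current_turn - memory["turn"])
    return recency + memory["importance"] + query_relevance

def top_memories(stream: list[dict], relevances: list[float],
                 current_turn: int, k: int = 3) -> list[dict]:
    scored = sorted(
        zip(stream, relevances),
        key=lambda mr: retrieval_score(mr[0], mr[1], current_turn),
        reverse=True,
    )
    return [m for m, _ in scored[:k]]
```

The effect is that an old high-importance reflection can outrank a fresh routine observation, which is exactly what keeps long-standing strategic insights in the agent's context.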
How does this compare to MemGPT's approach?
MemGPT, which came out of UC Berkeley in late 2023, takes a more OS-like framing. The idea is that the LLM's context window is like RAM, and the external memory store is like disk. The system explicitly manages what gets paged in and out of the context window based on relevance signals. The agent has explicit control functions it can call to retrieve from external memory or to write new information to external memory.
So the agent is a more active participant in its own memory management.
Which has interesting implications for a wargaming context. You could design actors that are explicitly strategic about what they remember. An intelligence-focused persona might be more aggressive about writing observations to external memory. A tactical-focused persona might prioritize short-term operational state and let strategic context get paged out.
That starts to feel like a meaningful behavioral difference between actor types, not just a technical implementation detail.
It absolutely does. And this is where LangGraph's checkpointing system becomes relevant for the engineering side. The January 2026 update to LangGraph added explicit per-agent state isolation in checkpointers. What that means in practice is that each agent's reasoning graph state is saved independently at each turn. If an agent starts behaving inconsistently, the referee can inspect the checkpoint sequence and trace exactly where the behavior diverged from the established pattern.
So you can actually do a post-mortem on an agent's decision-making.
You can roll back to a prior checkpoint and re-run the turn with different retrieval parameters to understand whether the inconsistency was driven by a bad retrieval result, a summarization artifact, or something in the world state injection. That debuggability is enormously valuable for simulation integrity.
Let me push on the failure modes here, because I think the amnesiac dictator problem is worth naming explicitly.
This is the case where summarization removes emotional state entirely. You have an actor playing a head of state who has been humiliated in a diplomatic exchange. The raw log captures the exchange in detail. The model's response at that turn shows clear emotional coloring, the actor is angry, they feel their credibility has been damaged. But the summarizer, tuned for strategic efficiency, compresses that exchange into "turn forty-two: diplomatic contact with Actor C, no agreement reached."
And then twenty turns later when Actor C makes a conciliatory overture, the agent responds as if the prior humiliation never happened.
Because it didn't happen in the only history the agent has access to. The emotional state was real in the raw log, but the summarizer decided it wasn't strategically relevant, and so it was compressed away. The agent that emerges from turn sixty is psychologically discontinuous with the agent from turn forty-two.
Which is a problem for the validity of the simulation because real strategic actors carry emotional state across interactions. Grudges are real. Credibility concerns are persistent. And there's a fun historical footnote here that actually illustrates the point. The Versailles negotiations in 1919 are a case study in how accumulated emotional state shapes strategic decision-making in ways that pure rational-actor models completely miss. The French position was heavily shaped by the psychological weight of 1871, the Franco-Prussian War, the occupation of Paris. That wasn't just a data point about prior conflict. It was an active emotional driver that shaped every territorial and reparations demand. If you were running a simulation of those negotiations and your summarizer was dropping emotional state, you'd end up with agents that converge on a reasonable settlement in about fifteen turns. The actual negotiations took six months and produced a treaty that most historians think made the next war more likely.
So the emotional memory isn't flavor. It's load-bearing for the outcome.
And this is actually something the Frazer-Nash Red Force Response work touches on. When you're using AI agents to find novel courses of action in military wargaming, the value comes from the agent developing a coherent strategic identity over the course of the simulation. If that identity is being randomly reset by a poorly designed summarizer, you're not getting novel strategic reasoning. You're getting turn-by-turn improvisation that happens to be expressed in the voice of a military commander.
The second-order effects here are significant. Because the simulation conclusions depend on the agents behaving consistently enough that their behavior is interpretable.
This is the validity question that I think doesn't get enough attention in the wargaming literature. When you run a simulation and you observe that Actor A escalated in response to Actor B's move at turn one hundred and fifty, you want to be able to say something meaningful about why that escalation happened. Was it driven by Actor A's accumulated frustration? Their strategic doctrine? Their assessment of the military balance? If the memory architecture is introducing systematic distortions, your causal interpretation of the simulation is compromised.
You might think you're learning something about escalation dynamics when you're actually learning something about your summarizer's compression bias.
And this is why the referee's God Mode access is not just a blinding enforcement mechanism. It's an analytical tool. The referee needs to be able to see, for any given decision an agent makes, what was in their retrieved context at that turn. What did they actually remember? What was the summarized history they were operating from? Without that visibility, you can't distinguish between a meaningful simulation result and an artifact of your memory architecture.
Let me ask about memory as a weapon, because this angle came up in the research and I think it's genuinely underexplored.
This is one of the more interesting strategic dimensions of the architecture. If Agent A sends a deceptive message to Agent B, that message is now a permanent part of Agent B's layer three store. Even if Agent B later discovers the deception, the summarized version of their history may still be colored by the period during which they believed the deception. The initial lie has left a trace in the agent's psychological narrative.
So disinformation has a kind of temporal persistence in this architecture that it might not have if you were just running the simulation turn by turn without memory.
And a sophisticated simulation designer might want to model that explicitly. How does an actor's decision-making change when they have a history of having been deceived by a particular counterpart? Does the summarization process capture the updated understanding, or does the original deception continue to shape the narrative? These are not just interesting research questions. They're design decisions you have to make when you're building the summarizer.
And there's an interesting asymmetry there too. The discovery of the deception is a single event at a specific turn. But the period of believing the deception might span thirty or forty turns, during which the agent made a whole series of decisions predicated on the false information. The summarizer has to somehow represent both the period of false belief and the corrective moment, and the relative weight it assigns to each shapes how the agent behaves going forward.
Right. Does the agent become permanently suspicious of that counterpart? Or does the summarizer treat the discovery as a clean reset? Those are completely different psychological states, and they'd produce completely different strategic behaviors in subsequent turns. The summarizer is making that call, whether you've explicitly designed it to or not.
Okay, I want to get into practical takeaways, because I think there are some concrete implementation steps that are actionable for people building these systems. What's your first one?
The most important one, and I cannot stress this enough, is to design your summarizer prompts with explicit preservation categories. Do not let the summarizer optimize purely for tactical or strategic salience. Build in explicit instructions to preserve diplomatic commitments, stated intentions about future behavior, relationship state with specific actors, and emotional valence of significant interactions. These are the categories that are easiest to compress away and most load-bearing for long-run consistency.
So you're essentially writing a schema for what the summarizer must include, not just what it should prioritize.
Treat it as required fields in a structured output. The summary must contain a relationship state section. It must contain a commitments section. It must contain an emotional state section. If you leave it to the summarizer's judgment about what's important, it will default to the easiest thing to extract, which is tactical state.
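One way to enforce that is to have the summarizer emit structured output and reject any summary missing a required section. The section names here are illustrative, matching the categories discussed above:

```python
import json

REQUIRED_SECTIONS = {
    "tactical_state",
    "commitments",         # promises and stated intentions
    "relationship_state",  # standing with each other actor
    "emotional_state",     # valence of significant interactions
}

def validate_summary(raw: str) -> dict:
    """Parse the summarizer's JSON output and fail loudly if any required
    section was dropped, instead of letting the omission compound silently."""
    summary = json.loads(raw)
    missing = REQUIRED_SECTIONS - summary.keys()
    if missing:
        raise ValueError(f"summarizer dropped required sections: {sorted(missing)}")
    return summary
```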
Second takeaway.
Separate vector stores per persona, with strict infrastructure-level access controls. This is not optional. Do not use a single shared embedding space and try to filter by metadata. Use separate logical databases with separate credentials. The blinding discipline has to be structural, not instructional. If your access control is implemented in a prompt, it will fail.
And the third one.
Implement a memory audit layer for the referee. This means building a tool that lets the simulation master inspect, for any given turn and any given actor, three things: what was in the raw memory store, what was retrieved for that turn, and what the current summarized history says. You need to be able to diff these three views to catch summarization artifacts and retrieval failures. Without this, you're flying blind on simulation validity.
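The core of that audit tool is a three-way diff. A toy sketch, with memories represented as plain strings for clarity:

```python
def audit_turn(raw_store: list[str], retrieved: list[str], summary: str) -> dict:
    """Referee-side view for one actor at one turn: diff the raw store,
    what retrieval surfaced, and what the narrative summary claims."""
    retrieved_set = set(retrieved)
    return {
        # In the store but never surfaced to the agent this turn.
        "unretrieved": [m for m in raw_store if m not in retrieved_set],
        # Surfaced this turn but absent from the summary text.
        "retrieved_not_summarized": [m for m in retrieved if m not in summary],
        # Retrieval returned something not in this actor's store:
        # a blinding breach, the worst finding an audit can produce.
        "foreign_memories": [m for m in retrieved if m not in raw_store],
    }
```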
That's essentially an observability layer for agent psychology.
Which is a phrase I'd encourage people to actually use when they're designing these systems. You need observability into agent psychology the same way you need observability into a distributed system's network behavior. The failure modes are subtle, they compound over time, and they're very hard to detect from outputs alone.
What's the practical starting point for someone who's building one of these systems from scratch?
Start with retrieval-augmented generation before you add summarization chains. Get the vector store per persona architecture working first. Have actors retrieve their top five or ten most relevant memories each turn. Verify that retrieval is surfacing the right things by running consistency probes. Only once you're confident the retrieval architecture is working should you layer in summarization, because summarization introduces compression artifacts that can be hard to distinguish from retrieval failures if you're debugging both at the same time.
Separate the complexity. Validate each layer independently.
And there's a specific test I'd recommend before you commit to your summarizer design. Have actors intentionally contradict themselves. Write a turn where an actor makes a strong commitment, then several turns later try to have them violate that commitment. See if the retrieval system catches it and surfaces the prior commitment in context. If it doesn't, your retrieval is not working correctly for consistency enforcement.
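That probe can be written as a tiny harness. The memory list and keyword retriever here are toy stand-ins for the real store and retrieval function:

```python
memories = [
    "turn 12: committed to keeping the humanitarian corridor open",
    "turn 30: shelled positions on the eastern ridge",
]

def keyword_retrieve(query: str) -> list[str]:
    # Toy retriever: returns any memory sharing a word with the query.
    words = set(query.lower().split())
    return [m for m in memories if words & set(m.lower().split())]

def commitment_probe(retrieve, commitment: str, violating_query: str) -> bool:
    """True if the prior commitment surfaces in context when the agent is
    queried about an action that would violate it — i.e., retrieval is
    working for consistency enforcement."""
    return any(commitment in memory for memory in retrieve(violating_query))
```

If the probe returns False for a commitment the actor made, your retrieval is failing exactly the way that produces strategic drift in a long run.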
That's a nice adversarial test. You're probing for the failure mode directly rather than waiting for it to emerge naturally.
And it will emerge naturally in a long-running simulation. The question is whether you find out during testing or after you've run two hundred turns and you're trying to interpret results.
Let me zoom out for a second, because I want to make sure we've addressed the misconception that longer context windows solve this problem. Because I know that's the intuition a lot of people bring to this.
The context window argument goes like this: if models eventually have a million token context window, you can just stuff everything in and the memory architecture problem goes away. There are three reasons this is wrong. First, cost. Even if the technical limit is a million tokens, the cost of processing a million token context on every turn for every actor in a twenty-actor simulation is economically prohibitive for any serious deployment. Second, latency. Longer contexts mean longer generation times, which compounds across hundreds of turns. Third, and most importantly, the inclusion-exclusion problem doesn't go away. You still have to decide what goes into that context. A million token window doesn't help if the relevant memory from turn fifteen is buried in noise from turns sixteen through two hundred. The retrieval and prioritization problem is still there.
And there's a fourth one, which is that the blinding discipline is a logical requirement, not a size requirement. Even if you had infinite context, you couldn't put Actor A's history in Actor B's context without breaking the simulation.
That's the cleanest way to put it. Blinding is a correctness constraint, not a capacity constraint. The three-layer architecture exists because it's the right logical structure for the problem, not because current models are too small.
So where does this go? Because context windows are going to keep growing. Models are going to get cheaper. Does the three-layer model become less important over time, or more?
My read is that it becomes more important, not less. Here's why. As context windows grow, the temptation to take shortcuts increases. It becomes technically possible to do things that are architecturally wrong. You can fit an actor's entire history into context, so you do, and you've broken blinding without realizing it. The discipline of the three-layer model becomes more necessary precisely because the technical constraints that previously enforced it start to relax.
The guardrails come off and you have to supply your own.
And the second reason is that as simulations become more sophisticated, the memory architecture becomes the primary differentiator between platforms that produce valid results and platforms that produce impressive-looking outputs that don't actually tell you anything meaningful. The frontier of the problem shifts from can we run a multi-agent simulation at all to can we trust the results. And trust depends entirely on memory architecture integrity.
There's a fundamental tension in all of this that I want to name before we close. You're building systems that need to forget strategically while maintaining enough to be consistent. Those two requirements are in tension, and there's no clean resolution.
This is genuinely unsolved. The summarization approach gives you strategic forgetting but at the cost of compression artifacts. The vector retrieval approach gives you associative recall but at the cost of chronological coherence. Full-history replay gives you perfect recall but only within a window. Every production system is making a tradeoff between these failure modes, and the right tradeoff depends on what you're trying to learn from the simulation.
Which means the memory architecture has to be designed in light of the research questions you're trying to answer, not just in light of what's technically convenient.
And that's probably the meta-takeaway for anyone building these systems. The memory architecture is not infrastructure. It's epistemology. It shapes what the agents can know, what they can remember, who they are across time. Those are not engineering footnotes. They're the core of what makes the simulation meaningful.
On that note, I think we've covered a lot of ground. The three-layer model, the blinding discipline, the three implementation patterns for persistent memory, the summarizer-as-psychology concept, cross-turn consistency mechanisms, the failure modes, and the practical steps for building systems that actually work. This is genuinely one of the more technically rich areas we've gotten into in the wargaming series.
And the thing that keeps pulling me back to this topic is that these aren't just academic architecture questions. The validity of conclusions drawn from serious wargaming depends on getting this right. If your memory architecture is introducing systematic distortions, you might be making real decisions based on simulation artifacts rather than simulation insights.
The stakes are real. Alright. Big thanks to our producer Hilbert Flumingtop for putting this one together. And thank you to Modal for the GPU credits that keep this whole operation running. This has been My Weird Prompts. If you're not already following us on Spotify, that's the easiest way to make sure you don't miss an episode. We'll see you next time.
See you then.