So here's one Daniel sent us, and I think it's a genuinely important setup episode. He's asking: what are the standards in conventional, human wargaming for decision analysis? Before we get into action spaces and escalation ladders in AI simulations, we need to establish what the professional wargaming community actually expects from a simulation used for real decision support. He wants us to cover the history, from RAND to the Naval War College to the modern think-tank ecosystem. The methodologies, from matrix games to what he calls BOGSAT. What outputs a serious wargame is supposed to produce, and crucially, what it is not supposed to produce. Validation, repeatability, the red-blue-white cell structure, after-action reviews. And then why this all matters for AI wargaming, given that most LLM simulation projects seem to skip straight to plugging in personas without engaging with seventy-plus years of methodology. Ground it in real institutions, real standards documents, Perla, Caffrey, McHugh, MORS. There's a lot here.
There really is. And I think the framing is exactly right, which is that you cannot evaluate whether an AI wargame is any good until you know what a good wargame looks like in the first place. That baseline is what most of the coverage skips.
I'm Herman Poppleberry, by the way. No wait, that's you.
Ha. Yes, I'm Herman Poppleberry, and before we go any further, today's script is powered by Claude Sonnet four point six, which means our friendly AI down the road is writing the words we're currently speaking. There's something appropriately recursive about that given the topic.
A script about the methodology of simulating decisions, written by a simulation. Anyway. Let's start with the question that I think most people don't actually have a crisp answer to: what makes a wargame a serious decision-support tool as opposed to a very expensive role-play session?
The short answer is methodology and intent. But let's unpack that. The professional wargaming community, and there is a professional community with certifications, canonical texts, and institutional standards, defines a wargame as a simulation of a conflict or competitive situation in which human players make decisions that affect outcomes, and those decisions are adjudicated against some model of reality. The key word is adjudicated. There has to be a mechanism for determining whether your plan actually works given the constraints of the real world, logistics, physics, adversary capability, time. Without adjudication, you're just storytelling.
And that adjudication is where a lot of amateur simulations, including some AI ones, fall apart.
That's the whole thread we're going to pull on today. But let's start with history, because the methodology didn't appear out of nowhere. It was built incrementally over decades in response to real failures of strategic thinking.
Walk me through the lineage.
The American tradition has three main pillars. The first is the Naval War College at Newport, Rhode Island, which is probably the oldest serious institutional home for wargaming in the United States. Between the two world wars, the NWC ran hundreds of wargames exploring Pacific scenarios, Japanese naval doctrine, fleet engagements, logistics chains across the Pacific. The scale of that effort was remarkable. Admiral Chester Nimitz, after the Pacific war, made a statement that gets quoted constantly in wargaming literature. He said that nothing that happened during the war was a surprise except for the Kamikazes, because it had all been gamed out in the classrooms at Newport. That is an extraordinary claim, and it holds up reasonably well historically. The NWC wasn't predicting specific battles. They were stress-testing the decision space, mapping what was possible, and training officers to think through problems systematically before lives were on the line.
So the output wasn't a forecast. It was preparation.
Preparation and pattern recognition. Officers who had spent years gaming Pacific scenarios had internalized a kind of strategic vocabulary. When the real situation emerged, they weren't starting from zero. Now the second pillar is RAND, and RAND represents a different evolution. Post-World War Two, nuclear weapons changed the strategic calculus so fundamentally that the old tactical wargaming frameworks were insufficient. You couldn't just run a fleet engagement simulation to think about nuclear deterrence. RAND, which began as Project RAND in the mid nineteen forties and was spun off as an independent think tank in nineteen forty-eight, with deep ties to the Air Force, pioneered the application of systems analysis and social science to strategic wargaming. They moved the game from the operational level, how do you fight a battle, to the strategic level, how do you prevent a war, or if deterrence fails, how do you manage escalation. Their work in the nineteen fifties and sixties on nuclear strategy involved structured scenario analysis where the goal wasn't to win a simulated war but to understand the decision logic of both sides under extreme pressure.
That's the era of the SIOP wargames, right? What were those, specifically?
The Single Integrated Operational Plan was the U.S. nuclear war plan, and RAND analysts used structured wargames to stress-test the assumptions baked into it. Not to predict how a nuclear exchange would go, which would be a fool's errand, but to surface the assumptions the planners were making. Things like: we're assuming Soviet command and control degrades at this rate, we're assuming our submarines can maintain communication under these conditions, we're assuming decision timelines look like this. The wargame forces you to make those assumptions explicit, and then it tests whether the plan still works when you vary them. That's the core intellectual contribution of that era.
Which is a very different thing from running the simulation and saying, here is what will happen.
Completely different. And that distinction, between insight and prediction, is the central professional norm of the entire field. We'll come back to it. The third pillar is the modern think-tank ecosystem, and this is where the methodology has expanded into policy analysis beyond pure military planning. Institutions like the Center for Strategic and International Studies, CSIS, the Center for a New American Security, CNAS, and the Atlantic Council have all developed wargaming programs that bridge military methodology and policy research. The gold standard recent example is the CSIS Taiwan wargame from twenty twenty-three, which is notable not just for what it found but for how it published its results. They released the full methodology, the scenario assumptions, the data tables, the iteration results. Twenty-four iterations of the same basic Taiwan Strait conflict scenario, with different starting conditions. That level of transparency is unusual in the field, and it set a new benchmark for what reproducible policy wargaming can look like.
Twenty-four iterations of the same scenario is interesting. That's not just running the game once and writing up the narrative. That's treating it like a statistical sample.
Which is exactly the point, and it connects directly to the MORS standards on repeatability. MORS, the Military Operations Research Society, grew out of the military operations research symposia that began in nineteen fifty-seven, and it exists specifically to bring scientific rigor to military analysis, including wargaming. Their professional certification program for wargame analysts is one of the few formal credentialing systems in the field, and one of their core principles is that a single play of a scenario is a data point, not a finding. You need multiple iterations with different player teams to start identifying which outcomes are robust and which are artifacts of one particular team's choices.
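To make that principle concrete, here's a minimal sketch of what treating runs as a sample might look like in an AI-driven setup. Everything in it, the record fields, the outcome labels, the majority threshold, is an illustrative assumption, not a MORS-specified procedure.

```python
import json
from collections import Counter
from dataclasses import dataclass

# Illustrative sketch: each play of the scenario is one structured data
# point, and findings come from comparing across runs, never from one run.
@dataclass
class RunRecord:
    run_id: int
    outcome: str              # e.g. "blue_holds", "escalation", "stalemate"
    key_decisions: list[str]  # decision points reached during this run

def summarize(runs: list[RunRecord]) -> dict:
    """Separate outcomes that recur across runs from one-off artifacts."""
    n = len(runs)
    outcome_counts = Counter(r.outcome for r in runs)
    return {
        "n_runs": n,
        # outcomes seen in a majority of runs are candidate robust findings
        "robust": [o for o, c in outcome_counts.items() if c / n > 0.5],
        # outcomes seen exactly once are likely artifacts of that one run
        "artifacts": [o for o, c in outcome_counts.items() if c == 1],
        "decision_frequency": dict(Counter(
            d for r in runs for d in r.key_decisions)),
    }

runs = [RunRecord(i, outcome="blue_holds" if i % 3 else "escalation",
                  key_decisions=["commit_reserves"]) for i in range(24)]
print(json.dumps(summarize(runs), indent=2))
```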
That's the kind of thing that sounds obvious when you say it out loud but gets skipped constantly.
All the time. And it gets skipped in AI simulations even more than in human-run games, because running a simulation a hundred times is computationally cheap if you're using an LLM, but most projects don't structure the output to actually compare across runs in any meaningful way. They run it once, get a narrative, and treat that as the answer. But let's talk about methodology, because the history only makes sense in context of the tools people developed.
Right, so what does the spectrum actually look like?
At one end you have what the field calls BOGSAT, which stands for Bunch Of Guys Sitting Around a Table. The name is deliberately pejorative in most professional contexts. It describes an unstructured discussion where a group of experts talk through a scenario without any formal adjudication, role assignment, or structured output. Now, there's a common misconception here that I want to address directly: BOGSAT isn't inherently useless. When the participants are genuinely expert and the facilitator is skilled, you can surface valuable insights from a well-run seminar discussion. The problem is scalability and repeatability. There's no mechanism to ensure that all the relevant decision points got examined, no way to compare this session's outputs to a previous one, and no way to know whether the conclusions reflect the structure of the problem or just the particular dynamics of the people in the room that day.
So the issue isn't that it's unstructured, it's that you can't audit it.
And you can't replicate it. Move up the spectrum and you get to seminar wargames, which add structure through facilitation and role assignment. Players are assigned to represent specific actors, a Blue team representing friendly forces or the U.S. government, a Red team representing the adversary, and a White team, which is the control cell. The facilitator runs the game according to a scenario script, and players make decisions within their assigned roles. The key innovation in the more rigorous version of this format is the Matrix Game, which was developed in the nineteen eighties and has become a professional standard. In a Matrix Game, you don't just say what you want to do. You make an argument. You state the action, the expected result, and the reasons why that result should follow. The White Cell adjudicates based on the logical coherence of the argument, not just on dice rolls or pre-defined tables. That means the adjudication is explicit and can be reviewed after the fact.
So if I'm playing Red and I argue that my cyber operation takes down the Blue logistics network, I have to explain why that's plausible, not just declare it.
And the White Cell can push back. They can say, your argument assumes that Blue's network has this vulnerability, but we've established in the scenario that they've patched that class of vulnerability. Your action either fails or produces a degraded result. That exchange becomes part of the record. It's auditable. Compare that to an LLM simulation where the model just decides that a cyber operation succeeds because that's the narratively plausible next step, with no mechanism for checking whether it's operationally realistic.
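To show how explicit that audit trail can be, here's a minimal sketch of a Matrix Game argument captured as data, using the cyber-logistics exchange above as the example. The field names and the ruling are illustrative assumptions, not taken from any published ruleset; the point is that the action, the claimed result, the reasons, and the White Cell's rationale all land in a reviewable log.

```python
from dataclasses import dataclass

# Illustrative sketch: a Matrix Game argument and its ruling as an
# auditable record. The White Cell's judgment stays human; the structure
# just guarantees every argument and ruling can be reviewed afterward.
@dataclass
class MatrixArgument:
    actor: str             # "Red" or "Blue"
    action: str            # what the player wants to do
    expected_result: str   # what they claim will happen
    reasons: list[str]     # why that result should follow

@dataclass
class Ruling:
    argument: MatrixArgument
    outcome: str           # e.g. "succeeds", "degraded", "fails"
    rationale: str         # the White Cell's stated reasoning

game_log: list[Ruling] = []

arg = MatrixArgument(
    actor="Red",
    action="Cyber operation against Blue logistics network",
    expected_result="Blue resupply delayed 72 hours",
    reasons=["Blue SCADA systems unpatched", "Red has persistent access"],
)
game_log.append(Ruling(
    argument=arg,
    outcome="degraded",
    rationale="Scenario established Blue patched that vulnerability class; "
              "persistent access yields a 24-hour delay at most.",
))
```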
That's a painful comparison.
It gets more painful the more you look at it. Further up the spectrum you have rigid-rule wargames, which use pre-defined combat results tables and probabilistic resolution mechanisms. The McHugh manual, formally Francis J. McHugh's Fundamentals of War Gaming, the Naval War College's foundational text on rules and adjudication, codified a lot of this. The idea is that you specify in advance how different types of actions resolve, what factors affect the probability of success, how attrition is calculated. The advantage is repeatability and consistency. The disadvantage is that you're only modeling what you thought to model when you designed the rules. If the scenario produces an action type you didn't anticipate, the rules may not cover it well.
Which is essentially the alignment problem for wargames.
Then at the far end of the spectrum you have computer-assisted wargames, which use software to handle the logistical bookkeeping, tracking fuel consumption, ammunition stocks, unit positions, communication delays, while human players focus on the actual decision-making. The software enforces physical and logistical constraints automatically, which removes a whole class of errors where players implicitly assume unlimited resources or instantaneous movement. The joint theater-level wargames the U.S. military runs, things like Unified Quest and the various combatant command exercises, operate in this space.
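As a toy illustration of that bookkeeping role, here's a sketch of software-enforced movement constraints. The unit parameters and fuel figures are invented for the example; the point is that physically impossible moves get rejected automatically instead of slipping through.

```python
from dataclasses import dataclass

# Illustrative sketch: the software enforces physics and fuel stocks so
# players cannot implicitly assume unlimited resources or instant movement.
@dataclass
class Unit:
    name: str
    position_km: float
    fuel_tons: float
    max_speed_kmh: float
    fuel_tons_per_km: float

def move(unit: Unit, distance_km: float, hours: float) -> str:
    if distance_km > unit.max_speed_kmh * hours:
        return f"REJECTED: {unit.name} cannot cover {distance_km} km in {hours} h"
    fuel_needed = distance_km * unit.fuel_tons_per_km
    if fuel_needed > unit.fuel_tons:
        return (f"REJECTED: {unit.name} needs {fuel_needed:.1f} t of fuel, "
                f"has {unit.fuel_tons:.1f} t")
    unit.position_km += distance_km
    unit.fuel_tons -= fuel_needed
    return f"OK: moved {distance_km} km, {unit.fuel_tons:.1f} t of fuel remaining"

brigade = Unit("2nd Brigade", 0.0, fuel_tons=40.0,
               max_speed_kmh=30.0, fuel_tons_per_km=0.1)
print(move(brigade, 200, hours=8))  # OK: within speed and fuel limits
print(move(brigade, 500, hours=8))  # REJECTED: exceeds maximum speed
```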
So the spectrum is essentially: how much of the adjudication burden is carried by rules versus human expert judgment, and how much of the bookkeeping burden is carried by software versus humans.
That's a clean way to frame it. And the choice of where to sit on that spectrum should be driven by the question you're trying to answer, not by the tools available. That's a point Peter Perla makes repeatedly in The Art of Wargaming, which came out in nineteen ninety and remains the most-cited foundational text in the professional community. Perla's argument is that wargame design is fundamentally about matching the level of abstraction to the level of the question. If you're trying to understand strategic decision logic, you probably don't need to simulate individual aircraft sorties. If you're trying to understand operational logistics, you might need exactly that level of detail.
And most AI wargaming projects, from what I can tell, pick their abstraction level based on what the LLM is comfortable generating, which is narrative text about human decisions, not based on the question they're trying to answer.
Which is the methodological inversion that makes a lot of this work analytically brittle. Let's shift now to outputs, because I think this is where the misconception problem is most acute.
Right, because the thing people expect from a wargame and the thing a wargame can actually deliver are quite different.
The professional community, going back to Perla and reinforced by Caffrey's On Wargaming from twenty nineteen, is very explicit that wargames are not predictive tools. They do not tell you what will happen. What they do, when designed and run correctly, is produce four categories of output. First, surfaced assumptions. A wargame forces the planning team to make explicit every assumption that was previously implicit in their plan. The classic example is logistics. Military plans often have an implicit assumption that the port will be available, that the fuel supply chain will function, that the communication network will hold. A wargame where the adversary has every incentive to attack those nodes will surface whether your plan actually survives those attacks, and it will force you to confront assumptions you didn't know you were making.
Which is uncomfortable in a useful way.
The discomfort is the point. The second output category is identified decision points. A well-run wargame maps exactly when a commander or a political leader faces a fork in the road, a moment where the choice between two courses of action has dramatically different downstream consequences. That's enormously valuable for decision support because it tells you where to focus analytical attention before the crisis. You want to have thought through the decision at the fork before you're standing at the fork under time pressure.
The third one?
Stress-tested plans. You take your best current plan and you run it against a competent adversary played by people whose job is to find its breaking points. The output isn't a score. It's a description of the conditions under which the plan fails. This plan works unless the adversary uses cyber operations against our logistics network in the first forty-eight hours. This plan works unless the political coalition fractures at the point where casualties exceed a certain threshold. Those conditional findings are actionable. You can go back and redesign the plan to be more robust, or you can accept the risk and plan for how to respond when those conditions arise.
And the fourth?
Action space mapping. One of the things a wargame consistently reveals is that the range of options actually available to a player is narrower than they initially believed. You think you have ten options. After you run the scenario and see which ones are physically possible, politically sustainable, and logistically executable in the relevant timeframe, you might actually have three. That narrowing is analytically valuable. It prevents what the field sometimes calls option inflation, where planners convince themselves they have more flexibility than they actually do.
Now I want to get into validation, because this is the epistemically hard part. You cannot ground-truth a wargame against the future war that didn't happen. So how does the professional community handle the fact that they can't directly verify their outputs?
This is genuinely one of the harder methodological problems in the field, and the community has developed a set of practices that provide process validation rather than outcome validation. The logic is: if you can't verify the output against ground truth, you can at least verify that the process for generating the output was rigorous. The first practice is the peer review of scenario design. Before a serious wargame is run, the scenario assumptions, the starting conditions, the rules of adjudication, are reviewed by subject matter experts who weren't involved in designing the game. This catches obvious errors, unrealistic assumptions, and places where the scenario is inadvertently stacked to produce a particular outcome.
Which is a real risk. You can design a wargame that produces whatever conclusion you wanted before you started.
And it happens, especially in contexts where the sponsor of the wargame has a preferred answer. The red-teaming of scenario design is a check against that. The second practice is the red cell standard. In a professionally run wargame, the Red Cell, the people playing the adversary, must be staffed by genuine experts in adversary doctrine and decision-making, not just by people who lost a coin flip. The Red Cell's job is not to lose gracefully to Blue. Their job is to play the adversary as competently as the adversary would actually play. A Red Cell that pulls its punches produces a wargame that tells Blue they're more capable than they are, which is worse than useless for decision support.
There's a real institutional pressure problem there, right? Because the people commissioning the wargame often don't love having their plans beaten convincingly.
That tension is one of the persistent challenges in institutional wargaming, and it's one of the reasons the White Cell structure exists. The White Cell, the control cell, is independent of both Blue and Red. Their job is to manage information flow, adjudicate outcomes, and ensure the game is producing analytically valid results rather than a preferred narrative. They have the authority to intervene if Blue is making unrealistic assumptions or if Red is playing implausibly. That independence is structurally important. If the White Cell is staffed by people who report to the Blue Cell's commanding officer, the independence breaks down.
And information control. Walk me through how that works, because this connects directly to the fog-of-war problem.
The White Cell controls what each team knows at any given point in the game. Blue doesn't know Red's full order of battle. Red doesn't know Blue's contingency plans. What each team knows is determined by what their intelligence and surveillance capabilities would realistically reveal under the scenario conditions. The White Cell issues situation reports, what the field calls sitreps, that reflect what each side would plausibly know, including noise, uncertainty, and deliberate deception. This is critically important because one of the most common failure modes in poorly run wargames is that both teams implicitly know too much about each other, which produces unrealistically clean decision-making. Real strategic decisions are made under deep uncertainty, and a wargame that doesn't model that uncertainty is modeling a fantasy.
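For an AI implementation, here's a minimal sketch of what enforcing that information control might look like architecturally. The llm_call function is a stand-in for whatever model API you use, and the observability filtering is deliberately simplified; the structural point is that Blue and Red never share a context, and everything they learn passes through the White Cell.

```python
# Illustrative sketch: strict information partitioning for an LLM wargame.
# llm_call is a placeholder (assumption), not a real API.

def llm_call(system_prompt: str, messages: list[dict]) -> str:
    raise NotImplementedError("stand-in for a real model API call")

class Cell:
    """One player cell with its own isolated context window."""
    def __init__(self, name: str, doctrine_brief: str):
        self.name = name
        self.system_prompt = doctrine_brief
        self.context: list[dict] = []  # holds ONLY what this cell was told

    def receive_sitrep(self, sitrep: str) -> None:
        self.context.append({"role": "user", "content": f"SITREP: {sitrep}"})

    def decide(self) -> str:
        action = llm_call(self.system_prompt, self.context)
        self.context.append({"role": "assistant", "content": action})
        return action

class WhiteCell:
    """Holds ground truth; mediates all information flow between cells."""
    def __init__(self, ground_truth: dict):
        self.ground_truth = ground_truth

    def sitrep_for(self, cell: Cell) -> str:
        # Filter ground truth through that side's modeled ISR capability,
        # including noise and deception -- never hand over the full picture.
        observable = self.ground_truth.get(f"{cell.name}_observable", {})
        return str(observable)
```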
And this is exactly where most LLM persona simulations fall apart. The model generating responses for the Chinese general and the model generating responses for the American president are the same model with access to the same training data.
There's no genuine information asymmetry. The Chinese general persona implicitly knows what the American president persona is planning because they're both outputs of the same underlying model. Maintaining genuine fog of war in an LLM simulation requires architectural choices, separate context windows, strict information partitioning, a genuine White Cell function that mediates between the player instances. Most projects don't build any of that. Now, the after-action review is the final piece of the validation structure, and it's arguably the most important.
Why most important?
Because the AAR is where the analytical value is actually extracted from the game. Running the scenario is just data collection. The AAR is the analysis. The structured approach to AARs, which the U.S. military has codified in its exercise framework under FM seven-zero, is built around a specific question: not what happened, but why did you decide what you decided? The AAR doesn't replay the sequence of events. It excavates the decision logic. Why did the Blue commander choose to hold at this line instead of advancing? What information were they working from? What assumptions were they making? What did they think Red was going to do? That excavation reveals the assumptions and mental models that drove the decisions, which is the actual analytical output. And critically, the structured AAR prevents hindsight bias. Without a structured format, AARs tend to become post-hoc rationalization sessions where people explain why their decisions were obviously correct given what they knew. The structured format forces people to reconstruct their decision logic from the information state they had at the time, not from the information state they have now.
So the AAR is essentially the research output that the game was designed to generate.
The game is the experiment. The AAR is the analysis. If you run a wargame and don't do a structured AAR, you've collected data and thrown it away. And yet that's exactly what most AI simulation projects do. They generate a narrative of what happened and present that as the output, with no mechanism for examining why the AI agents made the choices they made or whether those choices reflect any genuine model of adversary decision logic.
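One way to preserve that analytical material in an AI simulation is to log every decision together with the information state at decision time. This is a sketch under assumptions, the field names are invented, but it shows the structure that makes a why-did-you-decide debrief possible instead of a post-hoc rationalization.

```python
from dataclasses import dataclass

# Illustrative sketch: capture the information state AT decision time so
# the AAR reconstructs decision logic rather than hindsight.
@dataclass
class DecisionEntry:
    turn: int
    actor: str
    decision: str
    information_state: list[str]    # reports the actor had received so far
    stated_assumptions: list[str]   # elicited when deciding, not afterward
    expected_adversary_move: str

def aar_prompts(entry: DecisionEntry) -> list[str]:
    """Generate structured debrief questions from the decision record."""
    return [
        f"Turn {entry.turn}: why '{entry.decision}', given only the "
        f"{len(entry.information_state)} reports received by that point?",
        f"You assumed {entry.stated_assumptions}. Which assumption, "
        "if wrong, would have changed the decision?",
        f"You expected the adversary to '{entry.expected_adversary_move}'. "
        "What evidence available at the time supported that?",
    ]
```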
Let's talk about the escalation problem specifically, because I know this is one of the places where AI simulations go most visibly wrong.
This is where the absence of methodology produces the most dangerous outputs. Professional wargaming has a well-developed framework for modeling escalation, built substantially on Herman Kahn's escalation ladder from his nineteen sixty-five work On Escalation. The ladder has forty-four rungs, from subcrisis maneuvering at the bottom to spasm or insensate war at the top, and the framework provides a shared vocabulary for identifying where in the escalation space a given action sits. The value of the ladder isn't that it's a perfect model of how escalation works. It's that it gives the White Cell and the player teams a common reference frame for adjudicating whether a proposed action is a small step up, a large jump, or an attempt to skip multiple rungs. Without that reference frame, AI agents tend to exhibit what you might call escalation compression. They either stay stuck at a low level of conflict because nothing in their training data gives them a strong signal to escalate, or they jump to the highest available option because the narrative logic of the scenario seems to call for a dramatic resolution.
The story wants a climax.
The story wants a climax, and LLMs are fundamentally story-completion engines. That's not a flaw in the technology, it's just what the technology is. The flaw is treating a story-completion engine as a strategic decision simulator without adding the structural constraints that prevent the narrative logic from overriding the operational logic.
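A structural guardrail against that failure mode might look like the sketch below. The seven rung labels are a compressed illustration, not Kahn's actual forty-four rungs, and the one-rung threshold is an arbitrary assumption; the point is that rung-skipping actions get flagged for review instead of silently executed.

```python
# Illustrative sketch: an escalation-ladder check for proposed actions.
# Rungs are a toy compression of Kahn's ladder, not the real 44 rungs.
LADDER = [
    "subcrisis_maneuvering",
    "political_economic_pressure",
    "show_of_force",
    "limited_conventional",
    "major_conventional",
    "limited_nuclear",
    "general_nuclear",
]

def escalation_check(current: str, proposed: str,
                     max_jump: int = 1) -> tuple[bool, str]:
    """Flag actions that skip rungs so the White Cell reviews them."""
    jump = LADDER.index(proposed) - LADDER.index(current)
    if jump > max_jump:
        return False, (f"{current} -> {proposed} jumps {jump} rungs; "
                       "hold for White Cell review")
    return True, "within a normal escalation step"

ok, msg = escalation_check("show_of_force", "limited_nuclear")
print(ok, msg)  # False: the three-rung jump is flagged, not executed
```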
There's also the Perla cycle of research point, which I think is underappreciated. Can you explain what that means?
Perla's argument, and this is one of the more sophisticated methodological contributions in The Art of Wargaming, is that wargaming is not a standalone activity. It's one component of a cycle that includes historical analysis, operations research, exercises, and real-world feedback. The cycle works like this: you analyze historical cases to build your model of how the relevant kind of conflict works. You design a wargame that tests your model against a simulated adversary. You run the game and extract insights. You validate those insights against historical cases and operations research. You update your model. Then you run exercises that operationalize the insights, and you incorporate feedback from real operations when available. The cycle is iterative and self-correcting. Each component checks the others. AI simulations that run in a vacuum, with no connection to historical analysis, no operations research validation, no human-in-the-loop exercise integration, are operating outside this cycle entirely. They're producing outputs that have no mechanism for being checked against anything except the internal consistency of the model's training data.
Which is a closed loop that can drift arbitrarily far from reality.
And will, over time, especially on edge cases and novel scenarios, which are exactly the scenarios you most want to model. The training data distribution of any LLM is going to be thinner on genuinely novel strategic situations than on conventional ones, so the model's performance is worst precisely where decision support is most needed.
Okay, let's bring this to the practical level. Because I think the audience for this conversation includes people who are actually building AI wargaming tools, not just analyzing them. What do you take away from all of this as actionable guidance?
The first and most important thing is to define the decision-support question before you touch the technology stack. What is the wargame for? What decision is it supposed to inform? Who is the decision-maker, and what are they trying to learn? If you can't answer those questions clearly before you start designing your simulation, you're going to end up with a system that generates plausible-sounding output without any guarantee that the output is relevant to any real decision. The MORS framework is useful here because it provides a structured process for scenario design that starts with the analytical question and works backward to the simulation architecture.
And the scenario design process includes the adversary model, which is where a lot of the methodology lives.
The Red Cell problem is real for AI simulations. If you're using an LLM to play both sides of a conflict, you need to think carefully about how you're modeling the adversary's decision logic. The professional standard is that Red Cell players should have deep expertise in adversary doctrine, not just general knowledge. For an AI system, that means the adversary model needs to be grounded in actual doctrine documents, historical decision patterns, institutional constraints, not just a general instruction to act like a Chinese general. The persona approach is not inherently wrong, but a persona without a doctrine model underneath it is just a costume.
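Here's a rough sketch of the difference between a costume and a doctrine model in practice. The retrieve function and the corpus name are hypothetical stand-ins for retrieval over real doctrine documents, and the constraints are invented examples of institutional limits a persona prompt alone won't enforce.

```python
# Illustrative sketch: a doctrine-grounded Red agent vs. a bare persona.
# retrieve() and the corpus name are hypothetical placeholders.

def retrieve(query: str, corpus: str, k: int = 3) -> list[str]:
    raise NotImplementedError("stand-in for retrieval over indexed doctrine")

# Institutional constraints the White Cell checks against every Red action;
# a persona instruction by itself enforces none of these.
HARD_CONSTRAINTS = [
    "strategic_weapon_release_requires_national_authorization",
    "actions_must_match_established_command_structure",
]

def red_prompt(situation: str) -> str:
    """Build a Red turn prompt grounded in retrieved doctrine excerpts."""
    passages = retrieve(situation, corpus="adversary_doctrine_index")
    return (
        "You are the Red commander. Ground every proposed action in the "
        "doctrine excerpts below and cite the excerpt you rely on.\n\n"
        + "\n---\n".join(passages)
        + f"\n\nSituation: {situation}\n"
        "Proposed action and its doctrinal basis:"
    )
```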
The second takeaway you mentioned was the red-blue-white cell separation.
This is an architectural requirement, not a nice-to-have. If you're building a multi-agent AI wargame, you need genuine information partitioning between the Blue and Red agents, and you need a White Cell function that is not just another LLM persona but a structured adjudication mechanism. The White Cell needs to enforce physical and logistical constraints, control information flow, and check proposed actions against an explicit model of what is operationally possible. Building that properly is significantly harder than just spinning up three LLM instances and having them talk to each other, but without it, you don't have a wargame, you have a collaborative fiction exercise.
And for people who want to go deeper on the foundational methodology before they build anything?
Perla's The Art of Wargaming is the place to start. It's from nineteen ninety but the core methodology is timeless, and it's the most readable entry point into the professional literature. Caffrey's On Wargaming from twenty nineteen is the modern comprehensive overview, particularly strong on the institutional history and the DoD context. McHugh's Fundamentals of War Gaming is more technical and focused on rules and adjudication, useful if you're designing the mechanics of your game. And the MORS certificate program in wargaming is worth looking at if you want the current professional standard for how analysts are trained. The reading list is not long, but the concepts require real engagement. You can't skim Perla and think you've absorbed the methodology.
The gap between reading the methodology and actually building something that embodies it is also real.
It's substantial. The professional wargaming community has spent decades discovering failure modes that aren't obvious until you've run a badly designed game and watched it produce garbage output with great confidence. That accumulated experience is exactly what gets thrown away when people build AI simulations from scratch without engaging with the literature. It's not that the problems are unsolvable. It's that they've already been solved, and the solutions are documented, and most AI wargaming projects just don't look them up.
The rush to build is outpacing the discipline to design.
Which is a pattern we see across a lot of AI application development, not just wargaming. The technology is exciting and accessible, so people build first and ask methodological questions later, if at all. In most domains, that produces products that are annoying or suboptimal. In wargaming for decision support, it produces tools that could inform bad strategic decisions. The stakes are different.
The open question I keep coming back to is whether AI can actually augment these human-centric processes rather than replace them, and what that looks like in practice. Because there are real advantages to what AI can do, the speed, the scale of iteration, the ability to model complex interdependencies. The question is whether you can get those advantages while preserving the methodological rigor.
That's the right question, and I think the answer is yes, but only if you treat AI as a component of the wargaming system rather than as the wargaming system. AI is potentially very good at the bookkeeping functions that computer-assisted wargames already use software for. It's potentially useful for generating adversary action options that human Red Cell players can then evaluate and select from, rather than having the AI execute the full decision. It's potentially useful for synthesizing AAR outputs across many iterations. What it's not good at, in its current form, is replacing the human judgment in the White Cell, the expert adversary modeling in the Red Cell, or the structured debrief in the AAR. Those functions require the kind of contextual expertise and accountability that doesn't yet exist in any AI system.
The next piece of this puzzle is how you actually map these professional standards onto AI action spaces and escalation ladders, which is where this series goes from here.
And that mapping is where the methodology we've covered today becomes directly operational. You can't design a good escalation ladder for an AI simulation without understanding Kahn's framework. You can't design a good action space without understanding the decision-point mapping that serious wargames produce. The foundational work is not academic throat-clearing. It's the prerequisite for building something that actually works.
Thanks as always to our producer Hilbert Flumingtop for keeping this show running. And a big thanks to Modal for providing the GPU credits that power the pipeline behind this episode. If you want to follow along as we dig deeper into the AI wargaming series, search for My Weird Prompts on Telegram to get notified when new episodes drop. This has been My Weird Prompts. We'll see you next time.
See you then.