So Daniel sent us this one, and it's a good one. He writes: a two-stage AI forecasting pipeline was run on April 10th to assess the durability of the April 8th Iran-Israel-US ceasefire. Stage A was a snowglobe-style actor-level Monte Carlo simulation, thirty-eight actors across four timesteps, with a referee model authoring world state between turns and information hygiene ensuring each actor only saw the referee state plus their own private memory. Stage B was a six-lens LLM council, independent answers, cross-review, chairman synthesis. Both stages routed through OpenRouter to Claude Sonnet 4.5. Cost: roughly six to twelve dollars, eighteen minutes wall clock. The headline forecast: fifty-five percent ceasefire survival at twenty-four hours, twenty-two percent at seventy-two hours, ten percent at one week, four percent at one month. The council was more pessimistic than the Stage A simulation, which gave twenty-eight percent for seventy-two-hour survival. All six council lenses agreed the ceasefire is structurally unsustainable, and the most dangerous window is April eleventh through thirteenth. Daniel wants us to dig into the methodology, the divergences, the honest limitations, and what any of this actually means for how we should use AI forecasting on live geopolitical crises.
Herman Poppleberry, by the way, for anyone new. And yeah, this is the kind of prompt that makes me want to cancel everything else I had planned for today.
You had nothing planned.
I had plans. Anyway. Let's start with the architecture because I think people are going to skim past the methodology and go straight to the numbers, and that would be a mistake. The numbers are only interpretable if you understand what generated them.
By the way, quick note for the listeners: today's script is generated by Claude Sonnet 4.6, which means we're one generation removed from the model that ran the actual simulation. There's a mild recursion happening here that I find delightful.
The meta-layer is real. So: the pipeline has two distinct stages, and the reason you do both is that they're answering slightly different questions. Stage A, the Monte Carlo simulation, is asking what happens when individual actors, each with their own incentive structures and private information, make sequential decisions. You've got thirty-eight actors modeled: Iranian principals, IRGC factions, Hezbollah, the IDF, US decision-makers, mediators, Gulf states, Russia, China. Each one sees only what a referee model says they should see, plus their own accumulated memory from prior timesteps. The referee is umpiring the world state. And then you run the whole thing across four timesteps: now, plus twenty-four hours, plus one week, plus one month.
The information hygiene piece is what I find most interesting architecturally. Because the failure mode in most AI forecasting is that you just ask a model what's going to happen and it gives you its priors dressed up as analysis. The snowglobe design forces the model to actually simulate decision-making under asymmetric information, which is the actual condition these actors are operating in.
That's the key insight. Real geopolitical actors don't have access to each other's internal deliberations. Hezbollah's military command doesn't know what's happening in the Khamenei succession discussions. The IRGC doesn't have a clean read on Netanyahu's domestic political calculations. So if you model them as if they all share a common information pool, you're going to get systematically overconfident outcomes. The snowglobe design is trying to replicate the actual epistemic conditions of the crisis.
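To make that turn loop concrete, here's a minimal Python sketch. The class and function names are ours, not the pipeline's, and the model calls are stubbed with plain strings, since the episode doesn't publish the actual code; the point is only the shape of the information flow.

```python
from dataclasses import dataclass, field

@dataclass
class Actor:
    name: str
    memory: list = field(default_factory=list)  # private memory, persists across turns

    def act(self, visible_state: str) -> str:
        # In the real pipeline this would be an LLM call conditioned ONLY on
        # the referee-authored state plus this actor's own private memory.
        decision = f"{self.name} responds to: {visible_state}"
        self.memory.append(decision)
        return decision

def referee_update(world_state: str, actions: list) -> str:
    # The referee authors the next public state from everyone's actions;
    # actors never see each other's memories, only this synthesized state.
    return world_state + " | " + "; ".join(actions)

def run_snowglobe(actors, initial_state: str, timesteps: int) -> str:
    state = initial_state
    for _ in range(timesteps):  # e.g. now, +24h, +1wk, +1mo
        actions = [a.act(state) for a in actors]
        state = referee_update(state, actions)
    return state
```

The information hygiene lives entirely in the signature of `act`: an actor's only inputs are the public referee state and its own accumulated memory, which is the asymmetric-information condition being discussed here.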
And Stage B is doing something fundamentally different.
Stage B is a council of lenses, six parallel analytical frameworks that each independently assess the situation, then cross-review each other's outputs, and then a chairman model synthesizes. The lenses might be things like structural realism, domestic political incentives, historical pattern matching, economic pressure analysis, that kind of framing. Each one produces an independent forecast, then they see what the others said, and then you get a synthesis. The point is to catch blind spots that any single analytical frame would miss.
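The Stage B flow can be sketched the same way, again with stubbed model calls; the lens names in the usage below and the function signatures are illustrative guesses, not the pipeline's actual configuration.

```python
def ask_lens(lens: str, question: str) -> str:
    # Stub for an independent LLM pass through one analytical frame.
    return f"[{lens}] initial answer to: {question}"

def cross_review(lens: str, own_answer: str, peer_answers: list) -> str:
    # Stub for the revision pass, where each lens sees the others' outputs.
    return own_answer + f" (revised after seeing {len(peer_answers)} peers)"

def chairman_synthesis(answers: list) -> str:
    # Stub for the chairman model that merges the revised answers.
    return f"SYNTHESIS of {len(answers)} lenses"

def run_council(lenses, question: str) -> str:
    initial = {l: ask_lens(l, question) for l in lenses}             # independent pass
    revised = [cross_review(l, a, [v for k, v in initial.items() if k != l])
               for l, a in initial.items()]                          # cross-review pass
    return chairman_synthesis(revised)                               # final synthesis
```

The design choice that matters is the first pass being fully independent: no lens sees a peer's answer until it has committed to its own, which is what makes the later cross-review informative rather than an echo.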
So Stage A is bottom-up, emergent from actor behavior. Stage B is top-down, structural analysis. And they gave you pretty different numbers.
Meaningfully different. Stage A simulation gave twenty-eight percent for seventy-two-hour survival. The council came in at twenty-two percent. The council was more pessimistic, and more interestingly, all six lenses agreed it was structurally unsustainable. That's not a marginal divergence, that's a systematic divergence. The sim, modeling individual actors, sees more room for a hold than the structural analysis does.
Which tells you something about where the tension actually lives. Individual actors might want a pause. The system doesn't support one.
That's the interpretation I'd offer, yes. And it maps onto something we see in historical ceasefire data. Parties to a conflict can have genuine desire for a tactical pause, tactical being the operative word, while the structural conditions that produced the conflict remain entirely intact. The ceasefire isn't resolving anything, it's just a momentary equilibrium that requires all parties to simultaneously restrain themselves in a context where restraint is not incentivized.
Let's get into the actual findings, because the numbers alone don't capture how fragile this thing looked from the data they had on April tenth. Walk me through the immediate violations.
So the ceasefire was announced April eighth. Within the first twenty-four to thirty-six hours: the IDF conducted what the Wall Street Journal reported as its largest airstrikes of the war, with a hundred and eighty-two people killed in Lebanon on April eighth and ninth. Hezbollah paused for thirty-six hours and then launched a seventy-rocket barrage into northern Israel on April ninth. And Iran, after the ceasefire took effect, launched missile attacks against the UAE and Kuwait. The UAE intercepted several ballistic missiles.
So the ceasefire was already being violated in multiple directions before the ink was dry, and yet we're still calling it a ceasefire.
Because of the Lebanon scope ambiguity, which is not an accident. Iran's position is that the ceasefire covers Lebanon and Hezbollah. Israel's position, and the Trump administration's explicit statement, is that it does not. This is what the pipeline called weaponized ambiguity. Both parties can claim the other is in violation while claiming their own actions are within scope. It's not a bug in the agreement, it's a feature that lets each side preserve domestic narratives while continuing to operate militarily.
That's a diplomatic structure I would describe as a ceasefire in the same way that a piece of paper in a drawer is a ceasefire.
The pipeline's language was "tactical pause." Which I think is more generous than the situation warrants, but it's analytically precise. It's a pause in the specific Iran-US-Israel direct exchange, not a cessation of regional hostilities.
There's one genuinely de-escalatory data point in all of this and it's the Strait of Hormuz, which makes it simultaneously encouraging and depressing.
The partial opening of Hormuz is the single action the pipeline identifies as genuinely de-escalatory. Pre-war traffic was around a hundred and thirty-five vessels per day. Current throughput is ten to fifteen. So we're at roughly seven to eleven percent of normal. Bloomberg's reporting on April ninth was asking why Hormuz hadn't reopened after the ceasefire, and the answer is that it has, technically, in the sense that it's not completely closed. But functionally, global energy markets are still looking at a roughly ninety percent reduction in Strait throughput.
Whether that's Iranian good faith or Iran maintaining a chokehold they can tighten again in forty-eight hours.
The pipeline treated it as ambiguous but noted it's the only move any party has made that points toward de-escalation rather than away from it. Which is a very low bar.
Let's talk about the divergences, because I think that's where the methodological story gets most interesting. The simulation got the Hezbollah timing wrong.
The simulation predicted Hezbollah rocket fire within twenty-four hours of the ceasefire. Reality was a thirty-six-hour pause followed by a seventy-rocket barrage. The pipeline's own assessment was: direction correct, timing wrong. And that's actually a meaningful distinction in forecasting terms. Getting the direction right means your model has the incentive structure roughly correct. Getting the timing wrong means either the resolution of your timesteps is too coarse, or there are variables affecting the timing that your actors don't have access to, which in the snowglobe design they wouldn't.
Or Hezbollah's internal deliberation process is genuinely opaque even to a model trying to simulate it.
Which it almost certainly is. Hezbollah's command structure, particularly in the post-Nasrallah period, is not well-documented from open sources. A simulation grounded on Tavily and ISW data is going to have significant gaps in its model of Hezbollah's internal decision-making.
Now the Operation Hourglass prediction is the one I want to spend real time on because it's the most epistemically interesting part of this whole thing.
The simulation predicted a Mossad sabotage operation against the Natanz nuclear facility, labeled Operation Hourglass, at 0300 on April eleventh, with an eighty-five percent probability and a specific plausible-deniability cover story. The council reviewed this and flagged two problems: first, the deniability assumption was probably wrong because Iran attributes covert actions against its nuclear program to Israel essentially immediately regardless of cover stories, so the premise of deniability is historically unsupported. Second, the 0300 timestamp is almost certainly a simulation artifact. Models tend to produce specific-looking details when generating narrative predictions, and a precise timestamp in a multi-week geopolitical forecast is the kind of specificity that should make you immediately suspicious.
It's the AI equivalent of a witness who's too confident about what color shirt the suspect was wearing.
The timestamp has the texture of false precision. But here's where it gets genuinely strange. Independently of the simulation, Reuters reported on April ninth that Russia is coordinating with the IDF on evacuating a hundred and ninety-eight workers from the Bushehr nuclear plant. That's not a simulation output. That's a real-world action by a third party that the simulation couldn't have generated from its priors because the information wasn't in the training data.
And the evacuation of workers from a nuclear facility is one of the clearest pre-strike signals that exists.
It's a very hard signal. If you're moving personnel out of a facility, you either have reason to believe it's about to be struck or you have reason to believe the region is about to become significantly more dangerous. In either case, the behavioral signal is pointing in the same direction the simulation's narrative prediction was pointing, even though the simulation generated a specific story and the real world generated a behavioral signal. The convergence is striking.
So the question becomes: how do you weight that? The simulation produced a prediction with a specific mechanism and a probably-wrong timestamp. The real world produced a behavioral signal that confirms the direction but through a completely different mechanism. Do those reinforce each other or are they independent confirmations of the same underlying dynamic?
That's the epistemically hard question, and I want to be honest that there's no clean answer. The most conservative interpretation is that both the simulation and the Russian evacuation are responding to the same underlying reality, which is that Israeli strikes on Iranian nuclear infrastructure are a credible near-term possibility given the current state of the conflict. The simulation inferred this from actor incentives. Russia inferred this from whatever intelligence or direct communication they have with Israeli officials. The convergence doesn't mean either prediction is correct. It means multiple independent methods are pointing in the same direction, which is a meaningful signal without being a guarantee.
And the simulation's specific prediction could still be wrong about mechanism, timing, and target while the general direction remains valid.
Which is actually how most useful forecasting works. You're trying to identify the direction of travel and the rough magnitude, not generate a specific event prediction with a timestamp.
Let's talk about the Khamenei variable because the pipeline called it the highest-impact unknown in the entire forecast and I think that framing is right.
Mojtaba Khamenei succeeded his father as Supreme Leader, and reports from April seventh through ninth indicated he was incapacitated or unconscious in Qom. The pipeline's council assigned a sixty percent probability to genuine incapacitation. And the reason this matters so much is the conditional structure of the forecast. If Mojtaba is functional, the seventy-two-hour hold probability is thirty-five percent. If he's incapacitated, it drops to ten percent. That's a twenty-five percentage point swing on a single variable.
Because an incapacitated Supreme Leader means the IRGC is operating without clear political authority over its decisions.
The IRGC has its own institutional interests, its own factional dynamics, and in a power vacuum it's not obvious that anyone can give binding orders. Hardline factions within the IRGC might see a moment of leadership uncertainty as an opportunity to take actions they couldn't take with a functional Supreme Leader in place. The ceasefire requires someone with authority to tell the IRGC to hold. If that person isn't functional, the institutional incentives inside the IRGC may not support restraint.
This is the single-point-of-failure problem in geopolitical forecasting. Most agreements assume a certain distribution of decision-making authority, and when that assumption fails because a key actor is incapacitated or replaced or has their authority undermined, the whole model of how decisions get made breaks down.
And it's genuinely hard to forecast around because you're trying to model not just what rational actors would do but what happens when the institutional structure for making decisions is itself uncertain. The simulation can model IRGC factions as actors with their own incentives, but the relative weight of those factions changes depending on who has political authority over them, and if that's uncertain, your actor model is operating on a shaky foundation.
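One sanity check worth doing here: marginalizing the seventy-two-hour hold probability over that sixty-forty health split is simple arithmetic, and it lands at twenty percent, close to the council's twenty-two percent headline. The episode doesn't say the council computed its number this way, so treat this as a consistency check, not their method.

```python
# Weighted average of the conditional hold probabilities stated in the forecast.
p_incapacitated = 0.60        # council's weight on genuine incapacitation
p_hold_functional = 0.35      # 72h hold probability if Mojtaba is functional
p_hold_incapacitated = 0.10   # 72h hold probability if he is not

p_hold_72h = (p_incapacitated * p_hold_incapacitated
              + (1 - p_incapacitated) * p_hold_functional)
# 0.6 * 0.10 + 0.4 * 0.35 = 0.20
```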
The Netanyahu trial is the other variable I want to flag because the council added roughly ten percentage points to Israeli-initiated break probability based on it, and the sim underweighted it.
Netanyahu's corruption trial was set to resume April thirteenth, which puts it right inside the critical window the pipeline identified as most dangerous. The historical pattern, and this is documented across multiple prior escalations, is that Netanyahu has tended to escalate military operations around legal milestones. The mechanism is debated, whether it's distraction, whether it's rallying domestic support, whether it's genuine belief that national security crises justify postponing legal proceedings, probably some combination. But the correlation is strong enough that the council treated it as a meaningful variable that the simulation had underweighted.
The sim is modeling strategic actors, and a prime minister escalating a war to manage his corruption trial is not exactly strategic behavior in the conventional sense. It's domestic political survival intersecting with military decision-making in a way that doesn't fit cleanly into a rational actor framework.
Which is one of the genuine limitations of actor-level simulation. You can model actors as having multiple objective functions, military outcomes, political survival, domestic coalition management, but calibrating the relative weights of those objectives in real time is extremely difficult. Particularly for Netanyahu specifically, where the legal pressure is unusually acute and the political coalition is unusually fragile.
Alright, I want to zoom out to the honest limitations section because I think this is where the pipeline is most valuable as a methodological object, not just as a source of numbers. What are the structural weaknesses here?
The most fundamental one is N equals one. This is a single run of a Monte Carlo simulation. Monte Carlo methods get their power from running many iterations and looking at the distribution of outcomes. A single run gives you one sample path, and you don't know how representative that sample path is. The proper output of a Monte Carlo simulation is a probability distribution over outcomes, and to get a reliable distribution you need enough runs that the distribution has stabilized. One run is not enough to trust the specific numbers, even if it's enough to identify the general direction.
So the fifty-five, twenty-two, ten, four numbers are not the kind of probabilities you'd get from a well-run Monte Carlo. They're more like a single scenario's implied probabilities.
More precisely, they're the council's synthesis after reviewing the simulation output, the live data, and their own structural analysis. The council is doing the heavy lifting on the actual probability estimates, using the simulation as one input among several. Which is actually a reasonable methodology, but it means the numbers carry the council's epistemic uncertainty, not the convergent certainty of a large Monte Carlo run.
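The N-equals-one point can actually be quantified. If the true hold probability is somewhere near twenty-five percent, the standard error of an estimate from n independent runs shrinks as one over the square root of n, and separating twenty-two from twenty-eight percent requires that error to be well under three points. A quick sketch of the arithmetic:

```python
import math

def binomial_se(p: float, n: int) -> float:
    # Standard error of a probability estimated from n independent runs.
    return math.sqrt(p * (1 - p) / n)

# With a true hold probability near 0.25:
#   n = 1    -> se ~ 0.433  (one run tells you almost nothing about the rate)
#   n = 100  -> se ~ 0.043  (still too coarse to separate 0.22 from 0.28)
#   n = 1000 -> se ~ 0.014  (now the gap is resolvable)
for n in (1, 100, 1000):
    print(n, round(binomial_se(0.25, n), 3))
```

Which is the formal version of the point being made here: one sample path is directionally informative at best, and the specific digits in the headline numbers carry the council's judgment, not Monte Carlo convergence.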
The second limitation is the single model family problem, and I want to hear your take on this because it's an epistemically subtle issue.
Both stages were routed to Claude Sonnet 4.5 for all roles. The thirty-eight simulation actors, the referee, the six council lenses, the chairman synthesis, all the same model. The concern is groupthink, but it's a specific kind of groupthink. It's not that the model will agree with itself in an obvious way, because the snowglobe design and the lens design are both trying to force diversity of perspective. The concern is more subtle: a single model family has systematic biases in how it reasons about geopolitical crises, what evidence it weights, how it models actor rationality, what scenarios it finds plausible. Those biases will be consistent across all the roles because they're all running on the same underlying parameters.
So the diversity you're generating is diversity of analytical frame, not diversity of underlying world model.
That's the precise concern. Genuinely diverse forecasting ideally involves forecasters with different priors, different training, different cultural contexts for interpreting the same events. Running six lenses on one model is better than running one lens, but it's not the same as running six genuinely independent models. The pipeline acknowledged this as a limitation, and it's the right call to flag it explicitly.
There's also the grounding question. The pipeline was grounded on Tavily live search plus frozen ISW and RSS from April tenth. That's a meaningful constraint because the quality of the forecast is bounded by the quality of the information it's working from.
And open-source intelligence on a live crisis has significant gaps. ISW is excellent on order of battle and territorial control, but it's not going to have reliable information on Mojtaba Khamenei's medical condition or the internal deliberations of IRGC factions or what Russia is communicating privately to Israeli officials. The simulation is modeling actors based on their publicly observable behavior and stated positions, and real decision-making in this context is happening in channels that aren't publicly observable.
Which means the simulation is probably better calibrated on the structural dynamics, the incentive landscape, the historical patterns, than it is on the specific high-impact variables that are most uncertain. The things that are hardest to forecast are also the things the grounding data is weakest on.
And those are also the things that matter most for whether the forecast is correct. The Khamenei health variable is the highest-impact unknown precisely because it's both high-impact and genuinely unknown. The pipeline is being honest about that, which is the right epistemic posture.
Let's talk about the meta-question of what you actually do with these numbers. Because I think there's a real risk that probability forecasts on geopolitical crises get either over-interpreted as precise predictions or dismissed as speculation, and neither response is useful.
The actionability framing in the pipeline was interesting. It suggested that for a neutral observer, these probabilities imply the ceasefire is effectively a seventy-two-hour window to take protective actions, evacuate personnel, hedge financial positions, before a probable return to high-intensity conflict. That's a specific and concrete use case. You're not using the forecast to predict the exact mechanism of breakdown, you're using it to set a decision horizon.
The decision horizon framing is actually how professional risk managers think about probabilistic forecasts. You're not asking whether the forecast is correct, you're asking whether the probability is high enough to justify taking a protective action that has some cost. If the cost of evacuation is low relative to the cost of being caught in an escalation, and the ceasefire has only a twenty-two percent chance of surviving seventy-two hours, which means a seventy-eight percent chance of escalation, the expected value calculation is pretty clear.
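That expected-value calculation can be written out explicitly. The twenty-two percent figure is the council's seventy-two-hour survival probability, so the collapse probability is seventy-eight percent; the cost numbers below are invented purely for illustration.

```python
p_collapse_72h = 1 - 0.22     # council's 72h survival probability, inverted
cost_evacuate = 1.0           # hypothetical cost of taking protective action now
cost_if_caught = 20.0         # hypothetical cost of exposure when escalation resumes

ev_do_nothing = p_collapse_72h * cost_if_caught   # expected loss from waiting
ev_evacuate = cost_evacuate                       # certain cost of hedging
should_hedge = ev_evacuate < ev_do_nothing
# 0.78 * 20 = 15.6 expected loss from waiting, versus 1.0 certain cost: hedge.
```

The forecast doesn't need to be precisely calibrated for this to work; as long as the collapse probability is anywhere near this range, any protective action whose cost is a small fraction of the exposure is clearly justified.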
And the four percent at one month number is, in some ways, the most useful single number in the whole forecast. Because it's not saying the ceasefire will definitely collapse. It's saying that the structural conditions for a durable ceasefire, verified de-escalation, resolution of the Lebanon scope question, stabilization of Iranian political authority, removal of the nuclear program as an active flashpoint, none of those conditions exist. A ceasefire without those conditions has roughly four percent survival probability at one month based on historical base rates and the current structural assessment.
Four percent is low enough that you should not be making one-month plans contingent on the ceasefire holding.
Without building in significant hedging, yes. And that's a genuinely useful output even if every specific mechanism the simulation predicted turns out to be wrong.
I want to go back to the methodological divergence between Stage A and Stage B because I think we haven't fully unpacked what it tells us about the two methods. The sim said twenty-eight percent for seventy-two-hour survival. The council said twenty-two percent. Why the gap?
My reading is that the actor simulation is, by design, modeling the possibility space for individual decision-makers to choose restraint. If you model IRGC faction leaders as having some probability of choosing to hold fire, if you model Hezbollah command as having some probability of observing the ceasefire even without explicit coverage, if you model Netanyahu's inner circle as having some probability of prioritizing the diplomatic framework over domestic political considerations, those individual probabilities compound into a higher overall survival estimate than a purely structural analysis would produce.
Because the structural analysis is saying: regardless of what individual actors want to do, the system doesn't have the properties that would allow a ceasefire to hold. The Lebanon ambiguity is unresolved. The nuclear program tension is unresolved. The IRGC authority question is unresolved. The domestic political pressures on both sides are unresolved.
And historically, ceasefires in comparable contexts, the pipeline was presumably drawing on base rates here, have very low survival rates when the underlying structural conditions are this unresolved. So the council's structural lenses are anchoring on base rates and structural features, while the simulation is allowing for the possibility that actors find ways to muddle through despite the structural fragility.
Which raises the question of which is better calibrated. And I think the honest answer is we don't know yet.
We genuinely don't. And this is where the N equals one problem really bites. To know which method is better calibrated, you need to run both methods on many similar forecasting problems and track their accuracy over time. A single episode, even a high-stakes one, tells you very little about which method is systematically better. The council being more pessimistic than the sim is an interesting data point, but it's not evidence that the council is more accurate.
There's also a deeper question about what calibration even means for geopolitical forecasts. If the ceasefire collapses in the seventy-two-hour window, that's consistent with both the sim's twenty-eight percent and the council's twenty-two percent, because both of those say it was likely to collapse. And if it somehow holds for a week, that's also consistent with both forecasts, just in the tail. You'd need a very large number of similar ceasefires to actually distinguish between a model that says twenty-two percent and one that says twenty-eight percent.
This is the fundamental epistemological challenge with geopolitical forecasting that doesn't get enough attention. The reference class problem is severe. How many ceasefires are comparable enough to the April eighth Iran-Israel-US agreement to serve as calibration data? You need ceasefires involving similar power dynamics, similarly unresolved structural conditions, similar regional complexity, similar domestic political pressures on the key actors. The number of genuinely comparable historical cases is small, and each one is sufficiently unique that the comparison is contestable.
Which is actually an argument for the AI forecasting pipeline approach, even with its limitations. If the reference class is small, you want a method that can reason from first principles about the specific structural features of this situation rather than just pattern-matching to historical cases.
The simulation is trying to do exactly that. It's not saying this looks like Lebanon 2006 therefore X. It's modeling the specific actors in this specific configuration with this specific information environment and asking what emerges from their interactions. Whether it succeeds at that is a separate question, but the aspiration is right.
Last thing I want to touch on: the cost and speed metrics, because I think they're actually significant and don't get enough weight in discussions about AI forecasting.
Six to twelve dollars and eighteen minutes wall clock to run a thirty-eight-actor Monte Carlo simulation plus a six-lens council review on a live geopolitical crisis, grounded on real-time data. That's a genuinely remarkable cost structure. Traditional geopolitical analysis at this level of structural detail would involve teams of analysts, days of work, and costs orders of magnitude higher. The question isn't whether this pipeline is as good as the best human analysts, it's whether it's good enough to be useful at this price point and speed.
And the answer seems to be yes, conditional on treating the output correctly. If you're using it as one input into a broader analytical process, as a rapid structural assessment that identifies key variables and orders of magnitude, rather than as a precise prediction you act on directly, then six to twelve dollars for an eighteen-minute run is extraordinary value.
The other thing the speed enables is iteration. You can run this pipeline multiple times with different assumptions, different grounding data, different timestep resolutions, and compare the outputs. A single run is epistemically weak. Ten runs with varied parameters start to give you a sense of which conclusions are robust and which are sensitive to specific assumptions. That kind of sensitivity analysis is what you'd want before making high-stakes decisions based on this kind of forecast.
And at six to twelve dollars a run, running it ten times is still cheaper than a single hour of a senior analyst's time.
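What that iteration loop could look like, with the entire pipeline stubbed out as one function; the parameter names and the toy probability model inside the stub are invented for illustration, not drawn from the actual system.

```python
import itertools
import random

def run_pipeline(p_incapacitated: float, timestep_hours: int, seed: int) -> dict:
    # Stub standing in for a full Stage A + Stage B run. timestep_hours is
    # accepted but unused here; in a real run the timestep resolution would
    # change the simulation's behavior.
    rng = random.Random(seed)
    base = 0.35 - 0.25 * p_incapacitated          # toy model of the health swing
    return {"p_hold_72h": max(0.0, base + rng.gauss(0, 0.03)),
            "critical_window": "Apr 11-13"}

# Vary the assumptions and the seed, then ask which conclusions are robust.
results = [run_pipeline(p, dt, s)
           for p, dt, s in itertools.product((0.4, 0.6, 0.8), (12, 24), range(3))]
windows = {r["critical_window"] for r in results}
# In this toy: the critical window is stable across all 18 parameterizations,
# while the exact hold probability moves with the assumptions.
```

That's the shape of the sensitivity analysis being described: conclusions that survive every parameterization get treated as robust, and conclusions that swing with a single assumption get flagged as fragile.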
The cost curve for this kind of analysis has moved dramatically, and I think the implications for how organizations do geopolitical risk assessment are significant. You're not replacing human expertise. You're giving human experts a tool that can rapidly surface structural analysis, identify high-impact variables, and generate scenario branches that human analysts can then evaluate and refine.
The Russian-IDF evacuation signal is the best illustration of that. The simulation generated a prediction that pointed toward Israeli strikes on Iranian nuclear infrastructure. Human analysts looking at that output, combined with the real-world data about the evacuation, can do something the simulation can't: they can assess whether the behavioral signal confirms the structural prediction or is a coincidence, drawing on their understanding of how Russia communicates with Israel, what Russian strategic interests in the situation are, and what the historical precedents for this kind of pre-strike signaling look like.
The simulation generates the hypothesis. Human judgment evaluates the evidence. That's a reasonable division of labor, and the pipeline is designed in a way that's explicit about where the simulation's contribution ends and where interpretation needs to begin.
Alright, practical takeaways. What does someone actually do with this?
Three things, I think. First, if you have any operational exposure in the region, personnel, logistics, financial positions, the four percent one-month survival number is your planning horizon. You should not be making one-month commitments that depend on the ceasefire holding without significant hedging. Second, watch the April eleventh through thirteenth window specifically. The pipeline identified it as the critical inflection point, and the Russia-Bushehr evacuation signal gives that window independent corroboration. Third, track the Mojtaba Khamenei health question as the highest-leverage uncertainty. The twenty-five percentage point swing between functional and incapacitated scenarios means any credible reporting on his status should update your assessment significantly.
And from a methodological standpoint, for anyone thinking about building similar pipelines: the information hygiene in Stage A is the most technically important design decision. The snowglobe architecture, where actors only see what a referee tells them, is what separates this from just asking a model to predict geopolitical outcomes. It forces the model to simulate decision-making under realistic epistemic constraints rather than omniscient strategic analysis.
The single model family limitation is the most important thing to fix in a next iteration. Running Stage B lenses across genuinely different model families, not just different prompting strategies on the same model, would significantly strengthen the structural analysis by introducing actual diversity of world model rather than just diversity of analytical frame.
And the N equals one problem is the thing you should be most cautious about when interpreting the specific numbers. The direction of the forecast, structurally unsustainable, most dangerous window April eleventh through thirteenth, Hezbollah as most likely trigger, that's probably robust. The specific probability values are much more uncertain.
The pipeline knows this about itself, which is the right epistemic posture. A forecasting system that accurately represents its own uncertainty is more useful than one that produces confident numbers without acknowledging how they were generated.
This was a genuinely fascinating piece of work to dig into. The methodology is interesting, the findings are sobering, and the honest limitations section is almost more valuable than the forecast itself.
The best forecasting systems are the ones that tell you not just what they think will happen but why they think they might be wrong. This one does that.
Thanks as always to our producer Hilbert Flumingtop for keeping this whole operation running. Big thanks to Modal for providing the GPU credits that power the pipeline behind this show. This has been My Weird Prompts. If you're not already following us on Spotify, that's a quick fix. Until next time.
Take care, everyone.