#4056: How a $150 Geopolitical AI Simulation Scales to $15,000

One simulation run costs $150. To get meaningful results, you need 100 runs—that’s $15,000.

Episode Details

Episode ID: MWP-4235
Published: Jul 2
Duration: 23:22
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: geopolitical-strategy ai-agents large-language-models

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Geopolitical simulations using AI agents have become surprisingly accessible—Snowflake, an open-source platform from the CIA’s Technical Innovation Lab, lets anyone simulate crises from their laptop. But the cost of running these simulations reveals a hidden gatekeeper: API token pricing. A single run can cost over $150, and because LLMs are probabilistic, you need 100 to 1,000 runs for statistically meaningful results. That turns a $150 experiment into a $15,000 research project.

The cost breaks into three multiplying buckets: agent personality prompts (hundreds of tokens per agent, paid every interaction), growing context windows (each agent’s memory accumulates throughout the simulation), and combinatorial multiplication (20 agents × 10 rounds = 200 API calls per run). The Monte Carlo method demands repeated sampling to separate signal from noise in LLM outputs, making the cost structural rather than incidental.

But there’s a silver lining. Switching to cheaper models like DeepSeek can reduce costs by more than 99%, from $15,000 to about $50 for 100 runs. More importantly, using multiple models creates an ensemble effect: different LLMs have different priors and biases, so running simulations across models reveals where outcomes converge and diverge, strengthening the methodology rather than compromising it. Other cost-reduction strategies include model tiering (using small local models for low-stakes agents and frontier models for decision-makers) and architectural innovations like MiroFish’s director-agent delegation system.

Daniel sent us this one, and it's got a hook I can't shake. He's been running geopolitical simulations with AI agents — actual large language models playing the roles of prime ministers, intelligence chiefs, military commanders — and trying to forecast how conflicts unfold. The software he used, Snowflake, came out of the CIA's Technical Innovation Lab. It's open source. Anyone can download it. But here's the thing that stopped me cold: one simulation run cost him over a hundred and fifty dollars in API tokens. A single run. And because these models are probabilistic, you don't run it once — you run it dozens or hundreds of times to get statistically meaningful results. So you can now simulate a geopolitical crisis from your laptop. It'll just cost you more than the laptop to do it.

That's the tension right there. The tools are free. Snowflake, MiroFish with its fifty-four thousand GitHub stars, the whole ecosystem — free software, sitting there waiting. But the inference? That's where the meter runs. And Daniel's question cuts straight to it: is there any realistic path for a smaller organization, a non-profit, a curious researcher, to actually use these things without a government agency's budget? Or is the cost barrier the real gatekeeper, dressed up in open-source clothing?

He also raised a second question, which I think we have to tackle, and that's the Monte Carlo method itself — the mathematical reasoning behind why you need all those repeated runs in the first place. Because that's what turns a hundred-and-fifty-dollar experiment into a fifteen-thousand-dollar research project before you've even done sensitivity analysis.

I want to get into both, because they're connected in ways that aren't obvious. The cost problem isn't just about which model you pick. It's structural. It's baked into the context windows, the agent count, the number of interaction rounds, and the statistical demands of the Monte Carlo approach. Change any one of those, and the economics shift. So let's start with why we're running things a hundred times at all, because that's where the bill really starts climbing.

Let's map the landscape first, because the range here is wild. At one end you've got Stanford's twenty twenty-three generative agents paper — twenty-five agents in a little virtual town called Smallville, going about their day, forming relationships, throwing a Valentine's party. That paper kicked off a whole wave of thinking about what happens when you let language models interact with each other over time. At the other end, you've got Oasis-style experiments trying to simulate millions of agents — entire synthetic populations — to model how policy changes might ripple through a society. And sitting in the middle, Snowflake, where you've got maybe a few dozen agents but each one is a head of state or a military commander making decisions with enormous downstream consequences.

The spectrum runs from a digital dollhouse to a synthetic census. And the computational appetite scales accordingly.

And here's where the Monte Carlo piece becomes the fulcrum of the whole cost question. An LLM is probabilistic by nature — ask it the same question twice and you might get different answers. So a single simulation run is just one draw from a distribution of possible outcomes. If you run it once and Iran backs down, but run it again and they escalate, which one reflects anything real? You don't know until you've run it enough times to see which outcomes are stable and which are flukes. That's the Monte Carlo method in a nutshell: sample the distribution repeatedly until the pattern converges.

What's the typical range?

For anything claiming statistical validity, you're usually looking at a hundred to a thousand runs. Sometimes more, depending on how noisy the system is. And every single one of those runs is burning tokens: agent personalities, the growing context window as agents remember past interactions, the multiplication of agents by conversation rounds. Those costs compound rather than simply adding up.
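One way to see why the range lands at a hundred to a thousand runs is the standard error of an outcome frequency. The sketch below is an illustration, not something from the episode: it assumes each run is an independent draw and uses the normal approximation for a 95% margin. The function names are made up for the example.

```python
import math


def standard_error(p: float, n_runs: int) -> float:
    """Standard error of an outcome frequency estimated from n_runs independent runs."""
    return math.sqrt(p * (1.0 - p) / n_runs)


def margin_95(p: float, n_runs: int) -> float:
    """Approximate 95% margin of error (normal approximation)."""
    return 1.96 * standard_error(p, n_runs)


if __name__ == "__main__":
    p = 0.5  # worst case: the outcome happens about half the time
    for n in (10, 100, 1000):
        points = margin_95(p, n) * 100
        print(f"{n:>5} runs: +/- {points:.1f} percentage points")
```

Under these assumptions, 100 runs pin an outcome frequency down to roughly plus or minus 10 percentage points, and 1,000 runs to roughly plus or minus 3.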

Walk me through the anatomy of that hundred-and-fifty-dollar run. Where's the money actually going?

Three cost buckets, and they multiply each other. First, you've got the agent personality prompts — these are the system-level instructions that define each actor. "You are the Israeli Prime Minister. Your red lines are these. Your intelligence brief says this." That's not a one-liner. For Snowflake, each agent prompt might run several hundred tokens just to establish the role, the biases, the constraints. And you pay for those tokens on every single interaction because they sit in the context window.

Even before anyone says anything, the meter's already running.

Second bucket is the context window itself, which is the silent budget-killer in all of this. Every time an agent speaks, that utterance gets appended to the history. Every time an agent needs to recall what happened three rounds ago, the model has to reprocess the entire accumulated transcript. Round one might be a few thousand tokens. By round twenty, you're shoving a small novel into the context window on every single call. And you're paying for all of it — input tokens, output tokens, the whole thing.

That's per agent, per round.

Which brings us to the third bucket: the combinatorial multiplication. If you've got ten agents and ten interaction rounds, that's a hundred LLM calls just for one simulation run. But Snowflake scenarios aren't ten agents. Daniel's Iran-Israel setup probably had — what — fifteen, twenty actors? Military, intelligence, political leadership on multiple sides, maybe a UN mediator. And they're not just chatting. They're issuing statements, assessing threats, making decisions. Each of those is an API call. And each API call is processing a context window that's been growing since the simulation started.
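The interaction between the three buckets can be sketched with a small token counter. This is a simplified model, not Snowflake's actual behavior: it assumes every call re-sends the agent's personality prompt plus one shared transcript, and that each reply adds a fixed number of tokens to that transcript. All parameter values are placeholders.

```python
def total_input_tokens(n_agents: int, n_rounds: int,
                       persona_tokens: int, tokens_per_turn: int) -> int:
    """Input tokens billed for one run when each call re-sends
    the persona prompt plus the full shared transcript so far."""
    total = 0
    transcript = 0  # tokens accumulated in the shared history
    for _round in range(n_rounds):
        for _agent in range(n_agents):
            total += persona_tokens + transcript  # input billed for this call
            transcript += tokens_per_turn         # the reply joins the history
    return total


if __name__ == "__main__":
    # Hypothetical sizes: 500-token personas, 300-token replies.
    print(total_input_tokens(20, 10, 500, 300))  # 20 agents x 10 rounds = 200 calls
    print(total_input_tokens(20, 20, 500, 300))  # double the rounds
```

In this model the personality prompts grow linearly with the number of calls, while the re-sent history grows roughly with the square of it, which is why doubling the rounds costs far more than double.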

When Daniel says a hundred and fifty dollars for one run, the math actually checks out in a depressing sort of way.

Let's put some rough numbers on it. Claude's API pricing for a model capable of this kind of reasoning — you're looking at something like fifteen dollars per million input tokens and seventy-five dollars per million output tokens, ballpark. A single complex agent interaction with a fat context window might burn a hundred thousand tokens. Do that two hundred times across twenty agents and ten rounds, and suddenly a hundred and fifty dollars isn't surprising. It's almost conservative.
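The arithmetic here can be checked with a short calculation. The prices and the 200-call count are the episode's ballpark figures. The average token counts per call are assumptions chosen for illustration, since the episode only gives a peak figure of about a hundred thousand tokens for a single heavy interaction.

```python
def run_cost_usd(n_calls: int,
                 input_tokens_per_call: float,
                 output_tokens_per_call: float,
                 usd_per_m_input: float,
                 usd_per_m_output: float) -> float:
    """Cost of one simulation run given average token counts per call."""
    input_cost = n_calls * input_tokens_per_call * usd_per_m_input / 1_000_000
    output_cost = n_calls * output_tokens_per_call * usd_per_m_output / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    calls = 20 * 10  # 20 agents x 10 rounds
    # Assumed averages: 45k input and 1k output tokens per call.
    typical = run_cost_usd(calls, 45_000, 1_000, 15.0, 75.0)
    # Every call at the 100k-token peak instead.
    heavy = run_cost_usd(calls, 100_000, 1_000, 15.0, 75.0)
    print(f"typical run: ${typical:.2f}, heavy run: ${heavy:.2f}")
    print(f"100 typical runs: ${100 * typical:,.2f}")
```

With those assumed averages a run comes to $150, and a run where every call hits the hundred-thousand-token peak comes to about $315, which is the sense in which $150 is "almost conservative".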

That's before the Monte Carlo multiplier. Daniel mentioned running it a hundred times to get convergence. That's fifteen thousand dollars.

For one scenario. One geopolitical question. Before you've tested alternative assumptions, before sensitivity analysis, before you've asked "what if the intelligence estimate was wrong?" — which is exactly the kind of question you'd want a system like this to answer. The Monte Carlo method demands those repeated samples precisely because LLMs are so noisy. A single run might produce a dramatic escalation or a peaceful resolution based on nothing more than the random seed and the particular phrasing of a prompt. You need enough draws from the distribution to separate signal from noise, and the standard threshold for that is usually somewhere between a hundred and a thousand iterations.

Which means the fifteen-thousand-dollar figure is actually the low end of credible.

Here's where Daniel's experience with DeepSeek becomes instructive. He switched models and got the cost down dramatically — we're talking maybe fifty dollars for a hundred runs instead of fifteen thousand. DeepSeek's pricing structure is orders of magnitude cheaper. But that's not just a cost story. It's a methodological trade-off, and it's one that actually has an upside.

Because my instinct would be that switching to a cheaper model means you're compromising the simulation quality.

That's the common assumption, and it's not entirely wrong — but it's not entirely right either. Different models have different priors. They're trained on different data, with different reinforcement learning, different safety tuning, different everything. Claude has certain baked-in assumptions about how international relations work. DeepSeek has different ones. GPT-4 has still others. If you run your Monte Carlo simulation using only Claude, you're not just sampling the distribution of possible geopolitical outcomes — you're sampling it through a single model's particular lens.

Using multiple models isn't a compromise. It's an ensemble method.

In meteorological forecasting, you don't rely only on one weather model run a hundred times. You also run several different models and look at where they agree and where they diverge. The disagreement itself is information. Same principle applies here. Daniel's switch to DeepSeek wasn't just about saving money, though it certainly did that. It was also about introducing a different set of priors into the simulation, which actually strengthens the Monte Carlo approach rather than weakening it.
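A minimal sketch of reading a multi-model ensemble follows. It assumes each run has already been reduced to a single outcome label and that every model has at least one run. The model names and outcome labels are placeholders, and the sample data exists only to exercise the functions.

```python
from collections import Counter


def outcome_frequencies(outcomes):
    """Share of runs ending in each outcome label (outcomes must be non-empty)."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return {label: count / n for label, count in counts.items()}


def ensemble_summary(runs_by_model, outcome):
    """Per-model frequency of one outcome, plus ensemble mean and spread."""
    per_model = {
        model: outcome_frequencies(outcomes).get(outcome, 0.0)
        for model, outcomes in runs_by_model.items()
    }
    values = list(per_model.values())
    return {
        "per_model": per_model,
        "mean": sum(values) / len(values),
        "spread": max(values) - min(values),  # disagreement between models
    }


if __name__ == "__main__":
    # Placeholder data, not results from any real simulation.
    sample = {
        "model_a": ["escalate"] * 7 + ["deescalate"] * 3,
        "model_b": ["escalate"] * 4 + ["deescalate"] * 6,
    }
    print(ensemble_summary(sample, "escalate"))
```

A small spread suggests the outcome does not depend much on which model plays the agents. A large spread flags a result that may reflect one model's priors more than the scenario.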

That's a genuinely useful reframe. The thing that feels like corner-cutting is actually methodological hygiene.

The scaling problem gets worse in a way Daniel hinted at without fully unpacking it. Those Oasis-style population models with millions of agents — you'd think the cost just multiplies linearly. More agents, more money. But it's actually worse than that. The context window doesn't just grow. It combinatorially explodes.

Right, and this is the part that makes my head hurt if I stare at it too long. In a small simulation with twenty agents, each agent's memory is tracking interactions with nineteen others. In a million-agent simulation, the potential interaction graph is every agent with every other agent. Even if you limit direct interactions to local neighbors — which is how Oasis and similar frameworks handle it — information still propagates. Agent A talks to Agent B, who talks to Agent C, and now Agent C's context window contains a degraded version of something Agent A said three hops ago. The memory chain doesn't just add tokens. It multiplies pathways.

It's not a million separate conversations. It's a million nodes in a network where every path through the network is potentially relevant context.

You're paying to reprocess all of it every time any agent needs to recall something. That's the combinatorial explosion. It's why a million-agent simulation isn't just a thousand times more expensive than a thousand-agent one. If cost tracks the number of possible agent pairs, which grows with the square of the agent count, it's closer to a million times more.
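The gap between linear and worse-than-linear scaling can be made concrete by counting potential agent pairs. This is only one simple proxy for the interaction graph described above, chosen for illustration. Real frameworks restrict who talks to whom, so actual costs depend on the topology.

```python
def potential_pairs(n_agents: int) -> int:
    """Number of distinct agent pairs that could interact: n * (n - 1) / 2."""
    return n_agents * (n_agents - 1) // 2


if __name__ == "__main__":
    for n in (20, 1_000, 1_000_000):
        print(f"{n:>9,} agents -> {potential_pairs(n):,} potential pairs")
    ratio = potential_pairs(1_000_000) / potential_pairs(1_000)
    print(f"1000x more agents -> about {ratio:,.0f}x more pairs")
```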

Which brings us to the question Daniel actually asked. Is there any path here for someone without a government budget?

There is, but it requires rethinking the architecture from the ground up. And the good news is, people are already doing this. Three strategies in particular stand out. First, model tiering. Not every agent in a simulation needs GPT-four-level reasoning. A simulated citizen deciding whether to attend a protest might only need a seven-billion-parameter model — something you can run locally on a consumer GPU for essentially zero marginal cost. The simulated Prime Minister, making decisions about whether to deploy troops, that's where you spend your API budget on a frontier model.

You're assigning reasoning depth based on the agent's role in the decision chain.

And MiroFish's architecture already points in this direction. They use a director agent that delegates subtasks to specialized sub-agents. The director does the heavy reasoning. The sub-agents execute narrower, cheaper tasks. That structure naturally reduces the total context load because not every agent is carrying the full simulation history. Only the ones whose decisions actually shape outcomes.

That's the first strategy. What's the second?

Batching and context sharing across Monte Carlo runs. Right now, if you run a hundred iterations, each one builds its own context window from scratch. But a lot of the early interaction rounds are going to be similar across runs — the initial conditions, the opening moves. If you can run those shared early rounds once and fork the simulation only at the divergence points, you dramatically reduce redundant token consumption.

That sounds tricky to implement. How do you know where the fork points are before you've run them?

You don't, which is the honest answer. But you can use a cheaper model to scout ahead — run a fast, low-fidelity version of the first few rounds with something like Llama three, identify where the branching happens, and then spin up your expensive runs only from those divergence points. It's speculative execution, essentially. Borrowed from computer architecture.
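The fork-from-a-shared-prefix idea can be sketched as follows. This is a toy, not Snowflake or MiroFish code: `simulate_round` is a hypothetical stand-in for one round of LLM calls, and the fork point is passed in rather than discovered by a scouting model.

```python
import copy
import random


def simulate_round(state, rng):
    """Hypothetical stand-in for one round of agent calls."""
    state["history"].append(rng.choice(["statement", "threat", "concession"]))
    return state


def run_with_shared_prefix(n_runs, shared_rounds, total_rounds, seed=0):
    """Simulate the shared early rounds once, then fork every run from them."""
    prefix = {"history": []}
    prefix_rng = random.Random(seed)
    for _ in range(shared_rounds):
        simulate_round(prefix, prefix_rng)

    results = []
    for i in range(n_runs):
        state = copy.deepcopy(prefix)     # each run gets its own copy of the prefix
        rng = random.Random(seed + 1 + i)  # independent randomness after the fork
        for _ in range(total_rounds - shared_rounds):
            simulate_round(state, rng)
        results.append(state["history"])

    rounds_simulated = shared_rounds + n_runs * (total_rounds - shared_rounds)
    return results, rounds_simulated


if __name__ == "__main__":
    histories, rounds = run_with_shared_prefix(n_runs=100, shared_rounds=3, total_rounds=10)
    print(f"{rounds} rounds simulated instead of {100 * 10}")
```

One caveat with this design: runs that share a prefix are no longer independent draws for those early rounds, so any variation that would have appeared before the fork point is lost.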

The third strategy?

Caching agent personalities. Daniel's Snowflake setup defines each agent's role, biases, and constraints in a system prompt that gets reprocessed on every single interaction. But those definitions don't change during a run. If you cache that fixed prefix, using the prompt caching that several API providers offer, the model reuses its already-processed state instead of reprocessing the raw text each time, and you eliminate a significant chunk of the input token cost. It's not a silver bullet, but in a hundred-run Monte Carlo experiment, shaving even twenty percent off each run's input cost compounds fast.
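A rough way to estimate what caching the personality prompt saves is sketched below. It assumes the provider bills cached input tokens at a discount on every call after the first. The discount rate, token counts, and price are placeholders rather than any provider's actual terms, and the sketch ignores any surcharge for writing to the cache.

```python
def input_cost_usd(n_calls: int, persona_tokens: int, history_tokens: float,
                   usd_per_m_input: float, cached_discount: float = 0.0) -> float:
    """Input cost for one agent over n_calls.

    history_tokens is the average uncached context per call.
    cached_discount is the fraction taken off persona tokens
    on calls after the first (0.0 means no caching).
    """
    price = usd_per_m_input / 1_000_000
    first_call = (persona_tokens + history_tokens) * price
    later_persona = persona_tokens * (1.0 - cached_discount)
    later_calls = (n_calls - 1) * (later_persona + history_tokens) * price
    return first_call + later_calls


if __name__ == "__main__":
    uncached = input_cost_usd(10, 500, 2_000, 15.0)
    cached = input_cost_usd(10, 500, 2_000, 15.0, cached_discount=0.9)
    saving = 100 * (1 - cached / uncached)
    print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, saving {saving:.0f}%")
```

The saving depends on how large the fixed personality prompt is relative to the growing history, so it shrinks in later rounds as the transcript dominates each call.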

Between model tiering, batched context sharing, and personality caching, you're carving the cost down from multiple directions.

There's a fourth piece that ties it all together, which is local inference. Snowflake and MiroFish are open source. The software costs nothing. If you can run the models locally — using something like Llama three or Mistral on your own hardware — the per-token cost drops to whatever your electricity bill is. The trade-off, obviously, is speed and quality. A locally-run seven-billion-parameter model isn't going to match Claude on nuanced geopolitical reasoning. But for the background agents in a population simulation? It's more than adequate.

The path for a non-profit looks something like this. Limit your agent count to hundreds, not millions. Use local inference for background agents and reserve API calls only for the key decision-makers. Accept lower fidelity where it doesn't compromise the core question you're trying to answer.

Be strategic about your Monte Carlo iterations. Daniel's hundred-and-fifty-dollar Claude run versus his roughly fifty-cent DeepSeek run — that's a three-hundred-to-one cost ratio. Run your first iteration with the expensive model to validate the simulation design. Make sure the agents aren't hallucinating nonsense, that the interaction dynamics produce plausible outputs. Then switch to cheaper models for the bulk of your Monte Carlo sampling. You lose some fidelity, but you gain statistical validity. For most research questions, that's the right trade.

If I'm distilling this into something someone could actually act on, I'd say there are three practical takeaways. The first one is the cheapest and the most frequently ignored: validate your simulation design with a single run before you scale. Daniel learned this the hard way — a hundred and fifty dollars on one Claude run that might have revealed a fundamental flaw in the agent prompts or the interaction rules. Imagine spending fifteen thousand dollars on a hundred Monte Carlo iterations only to discover on run ninety-seven that your simulated foreign minister keeps agreeing with everyone because the personality prompt was too conciliatory.

That's the nightmare scenario. A hundred runs of beautifully converged nonsense.

It happens more than anyone wants to admit. A single high-fidelity run with your best model — even if it's expensive — lets you sanity-check the whole setup. Are the agents producing plausible outputs? Are the interaction dynamics generating the kind of tension and decision points you're actually trying to study? If the answer's no, you've lost one run's budget instead of a hundred.

The sequence is: validate on a single run, then scale. Not the other way around.

Second takeaway builds on what we said about model tiering, but I want to make it concrete. Assign your expensive model budget only to the agents whose decisions actually shape the simulation's trajectory. In a geopolitical scenario, that's maybe three to five actors — the head of state, the defense minister, the intelligence chief. Everyone else — the press secretaries, the lower-level diplomats, the background population in a larger model — those can run on a local seven-billion-parameter model that costs you nothing per token beyond electricity.

The simulation doesn't collapse if the simulated agriculture minister is running on a dumber model.

The agriculture minister's press statement doesn't determine whether a war starts. The Prime Minister's red-line assessment does. Spend your money where the variance matters.
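Tiering by role can be expressed as a small routing function. The sketch below assumes agents carry a role label. The three key roles come from the episode, while the model identifiers are placeholders rather than real model names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    role: str


# Roles whose decisions shape the trajectory, per the episode's example.
KEY_ROLES = {"head_of_state", "defense_minister", "intelligence_chief"}


def pick_model(agent: Agent,
               frontier_model: str = "frontier-api-model",
               local_model: str = "local-7b-model") -> str:
    """Route key decision-makers to the expensive model, everyone else locally."""
    return frontier_model if agent.role in KEY_ROLES else local_model


if __name__ == "__main__":
    cast = [
        Agent("Prime Minister", "head_of_state"),
        Agent("Agriculture Minister", "agriculture_minister"),
        Agent("Press Secretary", "press_secretary"),
    ]
    for agent in cast:
        print(f"{agent.name}: {pick_model(agent)}")
```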

What's the third one?

This is the one I'm most excited about, and it's a technique that's starting to show up in the research literature. You run a small number of high-fidelity simulations — maybe ten or twenty — with your best models, full agent counts, the whole thing. You identify the outcome clusters: under these conditions, the simulation converges to escalation about seventy percent of the time, de-escalation about thirty percent. Then you take that outcome distribution and use it to train a much simpler predictive model — something that doesn't require any LLM inference at all.

You're extracting the pattern from the expensive runs and then retiring the LLMs.

The distilled model learns the mapping from initial conditions to outcome probabilities without needing to simulate every conversation along the way. It's not going to give you the rich narrative of how the crisis unfolded — you lose the qualitative texture. But for the core question of "under these starting assumptions, what's the likely outcome distribution?", it gets you most of the way there at a tiny fraction of the cost.
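As a very simple illustration of the idea, the sketch below fits a smoothed frequency table that maps initial conditions to outcome probabilities. This is a stand-in chosen for brevity, not the technique from any specific paper: it cannot generalize to conditions it has never seen, which a regression-style surrogate would be needed for. The condition and outcome labels are placeholders.

```python
from collections import defaultdict


def fit_surrogate(runs, outcomes=("escalation", "de-escalation"), smoothing=1.0):
    """Build a predictor from (conditions, outcome) pairs.

    conditions is a hashable tuple describing the starting scenario.
    Laplace smoothing keeps small samples from producing 0% or 100%.
    """
    counts = defaultdict(lambda: {o: 0 for o in outcomes})
    for conditions, outcome in runs:
        counts[conditions][outcome] += 1

    def predict(conditions):
        seen = counts.get(conditions, {o: 0 for o in outcomes})
        total = sum(seen.values()) + smoothing * len(outcomes)
        return {o: (seen[o] + smoothing) / total for o in outcomes}

    return predict


if __name__ == "__main__":
    # Placeholder training data standing in for ten expensive runs.
    key = ("intel_estimate_accurate",)
    runs = [(key, "escalation")] * 7 + [(key, "de-escalation")] * 3
    predict = fit_surrogate(runs)
    print(predict(key))
    print(predict(("intel_estimate_wrong",)))  # unseen: falls back to uniform
```

Once fitted, the predictor costs nothing per query, which is what makes sweeping thousands of parameter settings affordable.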

That's clever. You're treating the LLM simulation as a training data generator rather than the final product.

It's how a lot of these methods will probably evolve. The expensive, high-fidelity runs become the gold standard you use sparingly. The distilled model becomes the workhorse you can run thousands of times for sensitivity analysis. A non-profit could afford ten Claude-quality runs, distill the results, and then explore the parameter space cheaply. That's a viable research program on a modest grant.

The answer to Daniel's question — is there a path for smaller organizations — turns out to be yes, but it requires being strategic in a way that the tools themselves don't teach you. The software gives you a big red "run" button. The judgment is knowing which runs to run.

Which leaves us with the open question Daniel's really driving at, whether he said it explicitly or not. Inference costs have been dropping fast — we've seen it across the board, from the frontier labs to the open-weight models. But the question is whether they'll drop far enough and fast enough to make million-agent simulations accessible to a university or a non-profit in the next two or three years.

My guess is no, not if we're waiting for per-token pricing alone to solve it. The combinatorial scaling problem we talked about doesn't care how cheap tokens get. If the context window still grows quadratically with agent count, a ten-times price drop just means you can afford a slightly bigger simulation before the same wall hits.

That's exactly why I think the real breakthrough isn't going to be cheaper models. It's going to be smarter architectures that don't need every agent to be a full LLM. The simulation distillation approach we mentioned — run a few expensive simulations, then train a cheap surrogate — that's one path. But there are others being explored. Sparse activation, where agents only "wake up" and reason when something relevant crosses their threshold. Hierarchical summarization, where groups of agents share a compressed memory rather than each carrying the full history.
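Sparse activation can be sketched as a filter that decides which agents get an LLM call for a given event. How relevance is scored is left open here: the scores and thresholds below are hypothetical inputs, not part of any named framework.

```python
def agents_to_wake(event_relevance, thresholds, default_threshold=0.5):
    """Names of agents whose relevance score for this event meets their threshold.

    event_relevance maps agent name -> score in [0, 1] for the current event.
    thresholds maps agent name -> minimum score that triggers an LLM call.
    """
    return [
        name
        for name, score in event_relevance.items()
        if score >= thresholds.get(name, default_threshold)
    ]


if __name__ == "__main__":
    # Hypothetical scores for a single event.
    relevance = {"prime_minister": 0.9, "citizen_17": 0.1, "defense_minister": 0.6}
    thresholds = {"prime_minister": 0.3, "defense_minister": 0.7}
    awake = agents_to_wake(relevance, thresholds)
    print(f"{len(awake)} of {len(relevance)} agents need an LLM call: {awake}")
```

Only the agents returned by the filter incur inference cost for that event, so the bill tracks the number of relevant agents rather than the size of the population.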

The principle across all of them is the same: stop paying for reasoning you don't need. Most agents in a population model aren't making consequential decisions at any given moment. They're just there, part of the background distribution. Paying Claude-level inference costs for a simulated citizen who's going to buy groceries and go to bed is like hiring a chess grandmaster to play checkers.

That's actually an optimistic note to end on, because it means the cost barrier isn't a permanent feature of the technology. It's a design problem. The tools are open source. The models are getting cheaper and more efficient. The architectures are getting smarter about where to spend compute. The combination of those three trends makes me think that within a few years, a small research group with a decent GPU server and a modest API budget really could run meaningful population-level simulations.

Not the million-agent Oasis-scale stuff, not yet. But enough to ask real questions and get statistically valid answers. And that's the threshold that matters for the people Daniel's asking about.

Now: Hilbert's daily fun fact.

Hilbert: During the interwar period, a French microbiologist on the Seychelles documented a strain of cheese-ripening bacterium, Brevibacterium mahéense, that produced a distinctive floral aroma. It was believed lost when the only known culture collection was destroyed in a laboratory fire in nineteen thirty-seven, until a sample was rediscovered in two thousand nineteen in a sealed tin of abandoned cheese found in a colonial-era larder on Praslin Island.

A sealed tin of abandoned cheese. That's a sentence I didn't expect to hear today.

The thing I keep coming back to is that Daniel ran his first simulation with Claude, got sticker shock, and immediately found a workaround with DeepSeek. That instinct — try the expensive thing once to understand it, then get creative — is basically the whole strategy we've been describing. And it worked. He got his results. The path exists right now, even if it's narrower than it should be. The question is just how fast it widens.

Whether the people building these frameworks start designing for that reality instead of assuming everyone has an agency budget. This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoyed this, leave us a review wherever you get your podcasts — it helps. We'll be back soon.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
