Imagine you want to test a new city-wide policy—say, a universal basic income or a massive change to public transit routes. Usually, you have to just launch it and pray the secondary effects don't wreck the local economy or strand thousands of commuters. But what if you could run a high-fidelity simulation of ten thousand virtual citizens first? Digital people with memories, jobs, and social lives, all reacting to your policy in real-time before you spend a single cent of taxpayer money. That is the promise of AgentSociety, and it is what we are diving into today.
It is a massive shift, Corn. We are moving from simple statistical models to what researchers call Generative Social Science. Today’s prompt from Daniel is all about AgentSociety, which is an open-source framework coming out of the FIB Lab at Tsinghua University. Herman Poppleberry here, and I have to say, this is probably one of the most sophisticated pieces of agentic infrastructure I have seen in the last year.
I will take your word for it, Herman, since you have clearly been living in their GitHub repository all morning. But before we get into the weeds, I should mention that today’s episode is powered by Google Gemini three Flash. It is the brain behind the script today, helping us parse through all these technical layers. So, AgentSociety—is this just another way to make chatbots talk to each other in a circle, or is there actually some meat on the bones here?
It is much more than a circle of chatbots. This is a full-scale simulation engine. Most AI agent projects we see are about a single agent doing a task—like booking a flight. AgentSociety is about the "society" part. It integrates LLM-driven agents into a realistic digital twin of a city. We are talking about ten thousand agents or more, all interacting within a simulated environment that includes road networks, social media, and a functioning macroeconomy.
Ten thousand agents? That sounds like a nightmare for my laptop. But let’s look at the "why" for a second. Why do we need this? We already have things like SimCity or those old-school agent-based models researchers have used for decades. What does sticking a Large Language Model inside these virtual people actually change?
Everything. Traditional models rely on rigid, "if-then" rules. If a commuter sees traffic, they take the train. But humans are not that predictable. In AgentSociety, the agents use LLMs to reason. They have emotions, they have personalities, and they have what is called Theory of Mind—the ability to think about what other agents are thinking. If an agent loses their job in the simulation, they don't just follow a script; they might get depressed, reach out to their social network for help, or change their spending habits in a way that reflects their specific "digital personality."
But wait, how does "Theory of Mind" actually manifest in a simulation like this? Is an agent sitting there calculating the probability that its neighbor is lying to it about a job lead?
It’s more subtle than that. When an agent interacts with another, the LLM prompt includes the history of their relationship and the agent’s internal goals. If Agent A knows Agent B is a gossip, Agent A might withhold information about a promotion. This creates a feedback loop where misinformation or social trust can actually ripple through a virtual neighborhood. In older models, "trust" was just a variable from zero to one. Here, it’s a narrative developed through shared experiences.
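The relationship-aware prompting Herman describes could be sketched roughly like this. Every name and field here is hypothetical, not AgentSociety's actual API; the point is that trust becomes a narrative folded into the prompt rather than a number.

```python
# Hypothetical sketch: folding relationship history into an agent's
# decision prompt, so "trust" is a narrative rather than a 0-1 variable.
# All class and field names are illustrative, not the framework's API.

class Agent:
    def __init__(self, name, goals, memory):
        self.name, self.goals, self.memory = name, goals, memory

def build_social_prompt(agent, other, topic):
    # Pull only the interactions this agent had with the other agent.
    shared_history = [
        e["summary"] for e in agent.memory
        if e.get("with") == other.name
    ]
    history_text = "\n".join(shared_history[-5:]) or "No prior interactions."
    return (
        f"You are {agent.name}. Goals: {agent.goals}.\n"
        f"Your history with {other.name}:\n{history_text}\n"
        f"Decide what, if anything, to reveal about: {topic}."
    )

a = Agent("Ana", "get promoted quietly",
          [{"with": "Bo", "summary": "Bo repeated my secret to the office."}])
b = Agent("Bo", "collect gossip", [])
prompt = build_social_prompt(a, b, "my upcoming promotion")
```

Because the gossip incident lands in the prompt, the LLM can decide to withhold the promotion news, which is exactly the Agent A/Agent B scenario above.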
So instead of a spreadsheet with legs, we have a bunch of neurotic digital citizens. I can see how that gets interesting. Daniel’s prompt mentions the architecture specifically. How do you actually organize ten thousand "neurotic digital citizens" without the whole thing crashing?
It is a three-layer system, and this is where the engineering gets impressive. First, you have the Agent Layer. This is the "mind" of the simulation. Each agent is modeled with three pillars: Emotion, Needs, and Cognition. They actually use Maslow’s Hierarchy of Needs. If an agent is "hungry" in the simulation, they prioritize finding food. If they are lonely, they look for social interaction.
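The Maslow-style arbitration Herman describes, where a hungry agent drops everything to find food, amounts to picking the most basic unsatisfied need. A minimal sketch, with tier names and thresholds as assumptions:

```python
# Illustrative sketch of Maslow-style need arbitration: the most basic
# unsatisfied tier wins. Tier names and the threshold are assumptions,
# not AgentSociety's exact implementation.

MASLOW_ORDER = ["physiological", "safety", "social", "esteem", "self_actualization"]

def current_priority(needs, threshold=0.5):
    """Return the most basic need whose satisfaction is below threshold."""
    for tier in MASLOW_ORDER:
        if needs.get(tier, 1.0) < threshold:
            return tier
    return "self_actualization"  # everything lower is satisfied

needs = {"physiological": 0.2, "safety": 0.9, "social": 0.3}
print(current_priority(needs))  # a hungry agent prioritizes food first
```

The output here is `physiological`: even though the agent is also lonely, hunger sits lower in the hierarchy and wins.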
Wait, hold on. Maslow’s Hierarchy? Are these agents actually sitting there thinking, "I have achieved physiological safety, now it is time for self-actualization"?
In a sense, yes. The LLM processes these states. The second layer is the Environment Layer. This isn't just a blank grid. They use OpenStreetMap data to build a digital twin of a city. It has actual roads, bus routes, and points of interest. Then there is the Social Space, which manages their social networks—friends, family, colleagues—and the Economic Space where they earn wages and pay taxes.
And the third layer?
That is the Orchestration Layer. This is the "engine" that keeps the clock ticking. It manages the simulation loop. It uses something called Ray, which is a distributed computing framework. This allows the simulation to scale across multiple servers. Instead of running one agent after another, it parallelizes the work. They also use MQTT, which is a messaging protocol usually used for Internet of Things devices, because it is incredibly fast at handling thousands of messages at once.
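AgentSociety uses Ray for this fan-out; the same pattern can be sketched with Python's standard library to show the idea: one "step" per agent submitted in parallel, results gathered at the end of the tick, instead of a sequential loop.

```python
# AgentSociety parallelizes agent updates with Ray across servers.
# This stdlib sketch shows the same fan-out/gather pattern; the real
# step would call an LLM instead of this stand-in function.
from concurrent.futures import ThreadPoolExecutor

def agent_step(agent_id, observation):
    # Stand-in for one agent's perceive-think-act cycle.
    return {"agent": agent_id, "action": f"react_to_{observation}"}

def run_tick(agent_ids, observation):
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(agent_step, a, observation) for a in agent_ids]
        return [f.result() for f in futures]

results = run_tick(range(1000), "gas_price_up")
print(len(results))  # 1000 agents stepped this tick
```

With Ray, the workers live on different machines and the messaging layer (MQTT in AgentSociety's case) carries the thousands of per-tick messages between them.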
Okay, so we have the minds, the world, and the engine. Let’s talk about that simulation loop. Is this like a video game where time is just flowing, or is it broken down into steps?
It is discrete time steps. You might set it so each step is one hour of "world time." In each step, every agent goes through a "perceive-think-act" cycle. They look at what happened in the environment—maybe the price of gas went up or a friend sent them a message—they process that through their LLM "brain," and then they decide what to do next. The cool part is the memory system. They have "Stream Memory," which tracks every event and perception. They don't just react to the current moment; they remember that you were mean to them three "days" ago and might avoid you in the future.
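The discrete-step loop Herman just walked through can be sketched in a few lines. Method names are illustrative, and the "think" step is a trivial rule standing in for the LLM call:

```python
# Minimal sketch of the discrete-step loop: each tick advances world
# time one hour and runs perceive -> think -> act per agent. Names are
# illustrative, not the framework's API.

class SimAgent:
    def __init__(self, name):
        self.name = name
        self.stream_memory = []  # every perception, in order

    def perceive(self, world):
        event = {"tick": world["tick"], "gas_price": world["gas_price"]}
        self.stream_memory.append(event)  # "Stream Memory" of events
        return event

    def think(self, event):
        # Placeholder for the LLM call; here, a trivial rule.
        return "take_bus" if event["gas_price"] > 4.0 else "drive"

    def act(self, decision):
        return decision

def run(agents, hours):
    world = {"tick": 0, "gas_price": 3.5}
    log = []
    for _ in range(hours):
        world["tick"] += 1
        world["gas_price"] += 0.2  # exogenous shock each hour
        for a in agents:
            log.append((world["tick"], a.name, a.act(a.think(a.perceive(world)))))
    return log

log = run([SimAgent("ana")], hours=5)
```

As gas creeps past the agent's threshold, the logged action flips from driving to taking the bus, and every perception along the way lands in the agent's stream memory for later retrieval.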
But how do they handle the "forgetting" part? If they remember every single thing, wouldn't the context window of the LLM get clogged up after a few simulated weeks?
That’s a great catch. They use a tiered memory system. Recent events are in the "Short-term" buffer, but they use an embedding-based retrieval system for "Long-term" memory. When an agent enters a grocery store, the system "queries" their memory for anything related to "shopping" or "budgets." It only pulls the relevant memories into the current prompt. It’s very similar to how RAG—Retrieval-Augmented Generation—works in AI chatbots today.
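The retrieval step can be sketched like this. Real systems score memories against the current situation with embedding cosine similarity; simple word overlap stands in here so the sketch stays self-contained:

```python
# Toy sketch of long-term memory retrieval: score stored memories
# against the current situation and pull only the top matches into the
# prompt. Embedding cosine similarity is the real scoring function;
# word overlap stands in for it here.

def relevance(query, memory_text):
    q, m = set(query.lower().split()), set(memory_text.lower().split())
    return len(q & m) / max(len(q), 1)

def retrieve(long_term, query, k=2):
    ranked = sorted(long_term, key=lambda m: relevance(query, m), reverse=True)
    return ranked[:k]

long_term = [
    "overspent the grocery budget last month",
    "argued with Bo at the park",
    "found cheap vegetables at the corner shop",
]
print(retrieve(long_term, "grocery shopping on a tight budget"))
```

Entering the grocery store surfaces last month's overspending, not the unrelated argument in the park, so the prompt stays small while the relevant history still shapes the decision.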
That sounds like it would get extremely expensive, extremely fast. If I am calling GPT-four for ten thousand agents every single hour of a simulated day, I am going to be broke before the virtual sun sets. Did Daniel give us any insight into the actual costs of running this thing?
He did, and it is the big elephant in the room. If you were to run a simulation with one thousand agents for thirty simulated days, and you used a top-tier model like GPT-four-o for every decision, you are looking at five hundred to a thousand dollars in API fees easily. And that is just for a thousand agents. Scale that to ten thousand, and you are basically funding a small country’s research budget.
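The arithmetic behind that range is worth making explicit. A back-of-envelope calculator, where the token counts and the blended price are rough assumptions rather than measured figures:

```python
# Back-of-envelope API cost estimate for the numbers quoted above.
# Calls per day, tokens per call, and price are rough assumptions.

def estimate_cost(agents, sim_days, calls_per_agent_day,
                  tokens_per_call, usd_per_million_tokens):
    total_tokens = agents * sim_days * calls_per_agent_day * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 1,000 agents, 30 simulated days, ~10 decisions/day, ~2k tokens per
# decision, at a blended ~$1 per million tokens:
cost = estimate_cost(1_000, 30, 10, 2_000, 1.0)
print(f"${cost:,.0f}")  # $600
```

That lands right in the five-hundred-to-a-thousand-dollar range, and it scales linearly: ten thousand agents pushes the same run toward six thousand dollars before you touch a pricier model.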
So how is anyone actually using this? Is this only for people with unlimited cloud credits?
Not necessarily. The researchers at Tsinghua often use more efficient models like Qwen or DeepSeek. Or, and this is the key for a lot of people, they run local open-source models like Llama three. If you have the hardware, you can cut the API costs to zero. But "having the hardware" is its own hurdle.
Right, you can't exactly run a city on a tablet. What kind of iron are we talking about here? If I want to run a thousand-agent simulation in my basement, what do I need?
For a thousand agents, you are looking at a minimum of a sixteen-core CPU and sixty-four gigabytes of RAM. But the real bottleneck is the GPU if you are running local models. You would probably need at least two A-one-hundred GPUs to keep the inference speeds high enough that the simulation doesn't take a calendar year to finish one simulated week. If you want to go to ten thousand agents, you really need a cluster. You need distributed orchestration, likely via Kubernetes, and high-throughput storage because these agents log every single thought and movement. The data output for a large run is massive.
It’s basically a data factory. But I want to go back to the emergent behavior you mentioned. You said it is unpredictable. Give me a concrete example. What has the FIB Lab actually found when they let these agents loose?
One of their most interesting case studies was on social polarization. They wanted to see how echo chambers form. Instead of just coding "people like people who agree with them," they gave agents different attitudes on a topic—let’s say climate change. The agents then navigated a simulated social media platform. Because they have memory and "emotions," they started naturally gravitating toward agents who validated their views. They saw the formation of these tight-knit, polarized clusters that were incredibly resistant to outside information. It wasn't programmed; it emerged from the way the LLMs handled social interaction and reputation.
Wait, did they actually try to break the echo chambers? Like, did they introduce a "fact-checker" agent or something?
They did! They introduced "bridging agents" who shared information from both sides. But here’s the kicker: the agents’ "emotions" kicked in. Because they had developed a history of trust within their own cluster, they often viewed the bridging agent as "unreliable" or "hostile." It showed that once the social fabric is torn, simply providing better data doesn't fix the problem. That’s a finding that is very hard to simulate with just math, but it comes out naturally with LLMs.
So it is a digital petri dish for our worst social impulses. Great. But what about the more "useful" stuff? Like urban planning? I remember you mentioned a traffic study in Beijing.
That was a huge one. They simulated ten thousand agents in a digital twin of Beijing over six months of virtual time. They introduced a new ride-sharing policy to see how it would affect congestion. In a traditional model, you might just assume people take the cheapest option. But in AgentSociety, some agents chose taxis because they were "tired" or "stressed" from their virtual jobs, even if it cost more. Others took the bus because they wanted to "save money" for a virtual goal they had set for themselves. It gave a much more nuanced picture of how people actually respond to price signals versus personal comfort.
That "tired" variable is key. I’ve definitely paid for a twenty-dollar Uber when a two-dollar subway ride was right there, just because I couldn’t face another person that day.
And AgentSociety captures that. They even saw agents "carpooling" not because of a policy, but because they had high "social need" scores and wanted to chat with a friend on the way to work. It’s those human quirks that make the simulation valuable for a city planner.
It almost feels like the difference between looking at a map and actually walking the streets. But here is my skeptic’s question, Herman. How do we know these agents are actually reflecting human behavior and not just the biases of the LLM they are running on? If the model is biased toward being helpful and polite, isn't the whole society going to be unnaturally nice?
That is a critical point, and the researchers address it by validating the simulation against real-world data. They compare the emergent patterns—like income distribution or traffic flow—to actual historical data from the city they are simulating. If the "digital society" produces the same statistical curves as the real society, it gives you confidence that the underlying mechanisms are capturing something real. They actually have built-in metrics in AgentSociety for this kind of validation.
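The validation idea boils down to comparing an emergent distribution from the simulation against real-world data. AgentSociety ships its own metrics; a minimal sketch of the principle with a simple distance measure:

```python
# Sketch of the validation idea: compare an emergent distribution from
# the simulation against real-world data. AgentSociety has built-in
# metrics for this; total variation distance illustrates the principle.

def total_variation(p, q):
    """Distance between two discrete distributions over the same bins."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

real_income_shares = [0.20, 0.35, 0.30, 0.15]  # e.g., census quartiles
sim_income_shares  = [0.22, 0.33, 0.31, 0.14]  # emergent from the run

tv = total_variation(real_income_shares, sim_income_shares)
print(round(tv, 3))  # small distance -> the digital society tracks reality
```

A distance near zero across several such curves, income, traffic flow, mobility, is what gives you confidence the mechanisms are capturing something real rather than echoing the LLM's priors.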
But what about the cultural nuances? A simulation of Beijing should surely look different than a simulation of, say, Rome or New York. Does the framework allow for that?
It does, but it requires careful "prompt engineering" for the agent templates. You have to feed the LLM cultural context—norms around punctuality, social hierarchy, or even common food preferences. If you don’t, the agents tend to default to a "generic" Western-centric persona that most LLMs are trained on. The researchers are actually working on "culturally grounded" agent profiles to solve exactly that.
I suppose that makes it "Generative Social Science" rather than just "Generative Fiction." Speaking of science, what about public health? We have all seen those epidemic models with the dots moving around. How does this change that?
It changes the "behavior" part of the infection. In a standard model, an agent is either "susceptible," "infected," or "recovered." They move randomly. In AgentSociety, if an agent hears there is a virus going around, they might decide to stay home. But another agent, who is "low on funds" and needs to work their virtual job to pay their virtual rent, might decide to risk it. You can model the spread of a disease based on the economic and psychological pressure of the citizens. That is a level of detail that traditional epidemiology struggles to capture.
It makes the "what-if" scenarios much more realistic. Let’s talk about the economic side. You mentioned a simulation of Universal Basic Income. That is a hot-button topic. What did the digital citizens do with their free money?
It was fascinating. They observed how it changed the work-life balance. Some agents used the extra income to reduce their hours and spend more time on "social" or "leisure" needs, which actually improved the overall "happiness" metric of the simulation. Others used it to "invest" in better transport, which changed the traffic patterns. But the key was seeing the second-order effects. If everyone has more money, does the price of virtual goods in the simulation go up? Does the labor market for "virtual low-wage jobs" collapse? AgentSociety lets you see those cascading effects across the whole system.
Did the researchers see any "lazy" behavior? That’s always the big argument against UBI—that people will just stop working entirely.
They saw some of that, but it was tied to the agents' "personalities." Agents with high "achievement" traits kept working or started "virtual businesses." Agents with low "energy" or high "stress" were the ones who dropped out of the workforce. It wasn't a monolith. It showed that UBI affects different psychological profiles in completely different ways, which is exactly why it’s so hard to debate in the real world.
It is basically a stress test for reality. I am looking at the GitHub now, and it seems like they have added multi-modal support recently. What does that mean for a simulation? Do the agents have eyes now?
In a way. It means they can process visual information from the environment. Imagine an agent walking down a virtual street and seeing a digital billboard or a "closed" sign on a shop. Instead of that being a piece of text in a database, the agent "sees" the visual data. It adds another layer of fidelity to how they perceive the world.
Wait, does that mean agents could be influenced by "virtual advertising"?
You could test the effectiveness of a public health poster or a political ad. If an agent "sees" the ad, it gets added to their perception stream, and the LLM decides if it changes their intent. It’s a terrifyingly powerful tool for marketing research, too.
Okay, let’s get practical for a second. If someone listening wants to actually try this, where do they start? You said it was open-source, but it sounds like a beast to set up.
It is on GitHub under tsinghua-f-i-b-lab slash AgentSociety. The first thing I would recommend is starting small. Use their "sandbox" mode. Don't try to simulate ten thousand people on your first day. Start with a hundred agents. Use a local model if you can, or a cheaper API like DeepSeek to keep the costs down while you are just testing the plumbing.
And what is the actual "work" involved? Do I have to write the life story for every agent, or is there a way to mass-produce these digital people?
You can use "templates" or "profiles." You define the distribution—say, sixty percent of my agents are middle-income, twenty percent are students, and so on. The framework then uses an LLM to "hallucinate" the specific details for each one—their names, their specific memories, their personalities—within those constraints. It is incredibly efficient. You can generate a whole population in a few minutes.
That is slightly terrifying, Herman. "Hallucinating a population." But I see the value. What about the "Social Space"? Is that just a fake Twitter, or is it more complex?
It is both online and offline. They have a simulated social media platform where agents can post, like, and share information. But they also have "offline" interactions. If two agents are at the same point of interest—like a park or a cafe—at the same time, the system can trigger a direct interaction between them. They might start a conversation, share information, or even form a new social bond that gets saved in their long-term memory.
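The offline trigger Herman mentions, pairing up agents who land at the same point of interest in the same tick, is a simple grouping step. A sketch with illustrative names:

```python
# Sketch of the offline-interaction trigger: agents at the same point
# of interest in the same tick get paired for a conversation. The
# grouping logic is the interesting bit; names are illustrative.
from collections import defaultdict
from itertools import combinations

def colocation_pairs(positions):
    """positions: {agent_id: poi_id} -> list of (a, b) pairs to interact."""
    by_poi = defaultdict(list)
    for agent, poi in positions.items():
        by_poi[poi].append(agent)
    pairs = []
    for agents in by_poi.values():
        pairs.extend(combinations(sorted(agents), 2))
    return pairs

positions = {"ana": "park", "bo": "park", "cy": "cafe", "di": "park"}
print(colocation_pairs(positions))
```

Here the three park-goers get three pairwise encounters while the lone cafe visitor gets none; each triggered conversation can then form a new bond that lands in both agents' long-term memories.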
Is there a "fun fact" about how these agents interact? Anything weird happen in the testing?
Actually, yes. In one run, the researchers noticed a group of agents started "loitering" at a specific virtual park every morning. They couldn't figure out why until they looked at the logs. It turned out one agent had "hallucinated" that they were a yoga instructor, and other agents with high "health" needs had seen their posts on the virtual social media and decided to "attend" the class. The system didn't have a "yoga" mechanic; the agents just negotiated the behavior through text and met up at the same coordinate.
It is like a super-advanced version of The Sims, but instead of trying to get them to use the bathroom, you are trying to understand the macroeconomic impact of a hurricane.
And that hurricane example is one they actually tested. They modeled how agents navigate a city during an external shock. They found that information flow was the biggest factor. Agents who were "better connected" in the social network got evacuation info faster and were more likely to survive the "virtual disaster." It is a powerful tool for emergency management.
So, we have covered the architecture, the costs, and the use cases. What is the catch? There is always a catch. Is it just the compute, or is there a fundamental limit to how "human" these agents can be?
The limit is the underlying LLM. If the model starts "looping" or loses track of its long-term goals, the agent’s behavior becomes erratic. Also, there is the "alignment" problem. If the LLM is tuned to refuse too readily—if it is too "safe"—it might not be able to simulate realistic negative behaviors, like crime or social conflict, which are unfortunately part of any real society. If you want to model a riot or a black market, and your LLM says "I cannot assist with that," your simulation is going to be incomplete.
That is a fascinating point. To have a realistic simulation, you almost need the AI to be able to "act" badly. Otherwise, you are just simulating a utopia that doesn't exist.
Right. You need a model that can play a role. That is why a lot of researchers are looking at uncensored or specialized models for these simulations. They need the agents to be able to lie, cheat, or be selfish, because that is what happens in the real world.
It is a weird thought—working hard to make sure your AI is capable of being a jerk just so your city model works. But I guess that is the price of accuracy. What about the future of this framework? Where does the FIB Lab take it next?
They are looking at even larger scales—hundreds of thousands of agents. To do that, they are optimizing the "Agent Groups" I mentioned earlier. By bundling agents into a single process, they can reduce the communication overhead. They are also looking at "hierarchical" simulation, where you have different levels of detail. Maybe you only "fully simulate" the agents that are currently interacting with each other, and keep the others in a "lower-power" state until they are needed.
Like how a video game only renders what you are looking at. That makes a lot of sense. It feels like we are on the verge of a new era for policy making. Imagine a world where every major law has to be "vetted" in an AgentSociety-style simulation before it even goes to a vote.
It could happen. It would certainly be better than relying on "gut feeling" or lobbying. You could actually see who wins and who loses in a thousand different parallel versions of the policy.
Well, before we solve the world’s problems with digital citizens, let’s wrap up with some takeaways for the listeners. What are the three things someone should remember about AgentSociety after today?
First, it is the shift from "rules" to "reasoning." These agents don't just follow scripts; they have internal lives that drive their behavior. Second, the infrastructure is serious. If you want to do this at scale, you need to think about distributed computing and high-performance messaging. It is an engineering challenge as much as an AI challenge. And third, the applications are massive—from urban planning to epidemic modeling, this is a tool for understanding the emergent behavior of humans en masse.
And for the practical folks—start small, watch your API costs, and maybe look into those local models if you have a beefy GPU sitting around. If you want to check it out, the GitHub is open-source and the documentation is surprisingly good. Just search for Tsinghua FIB Lab AgentSociety.
It is definitely a space to watch. I think we are going to see a lot of "digital twins" of cities popping up in the next couple of years.
Hopefully, they don't get as stressed as the real people living in them. Anyway, that is our deep dive into the virtual streets of AgentSociety. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes.
And a big thanks to Modal for providing the GPU credits that power this show. Without those, we would definitely be the "low-power" version of ourselves.
This has been My Weird Prompts. If you are enjoying these deep dives into the more "agentic" side of AI, a quick review on your podcast app really does help us reach more curious minds.
We will be back next time with whatever weird prompt Daniel throws our way.
Until then, watch out for those emergent behaviors. See ya.
Goodbye.