#2411: Are Political Bias Benchmarks Actually Measuring Anything?

Why the Political Compass Test fails, and what researchers are building instead to actually measure model bias.

Episode Details
Episode ID
MWP-2569
Published
Duration
26:46
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Measuring political bias in large language models sounds straightforward: give a model a questionnaire, tally its answers, plot the results on a grid. But a growing body of research suggests that the most popular tool for this job — the Political Compass Test — may be measuring something entirely different from what researchers assume.

Why the Political Compass Test Falls Short

A mid-2025 factor analysis from Oklahoma State and the University of North Texas ran roughly 2,000 Political Compass Tests across five open-source models. The finding was destabilizing: fine-tuning models on hard-left manifestos or hard-right talk radio transcripts did not reliably shift PCT scores in the expected direction. Fine-tuning itself changed scores, but not in a way that tracked political content. The test was picking up some signal, but not the political signal researchers thought.

Compounding the problem is stochastic decoding — the inherent randomness in how models sample tokens. A single model can produce both pro-Democrat and pro-Republican responses to the same prompt across different runs. Averaging those responses can create an artificially neutral score, masking bias rather than revealing it.

The Political Compass also suffers from cultural narrowness. Built around Western, primarily Anglophone political concepts, its economic and social axes don't translate cleanly to political frameworks in China, India, or Nigeria. A model deployed in Indonesia and tested in English yields a measurement doubly disconnected from the context that matters.

Better Alternatives Emerge

IssueBench, presented at ACL 2026 Findings, takes a fundamentally different approach. Instead of multiple-choice questionnaires, it generates millions of realistic prompts by slotting diverse political topics into human-written templates that mimic how people actually ask models for writing help. It evaluates stances in open-ended responses across 60 US and Chinese political topics, in either English or Chinese. This measures descriptive political bias — what a model actually writes — rather than forcing agreement with pre-written statements.

The Stanford perception study (May 2025) took a user-centered approach. Researchers collected over 180,000 pairwise judgments from more than 10,000 US respondents across 24 leading models. Nearly all models were perceived as significantly left-leaning — even by many Democrats. This sidesteps the neutrality problem by measuring perceived slant rather than claiming to define objective neutrality.

OpenAI's internal evaluation (October 2025) used roughly 500 prompts across 100 topics, with questions written from different political perspectives. They measured five specific axes: user invalidation, user escalation, personal political expression, asymmetric coverage, and political refusals. Their newer GPT-5 series reduced bias by roughly 30% compared to GPT-4o and o3. Yet even their own carefully crafted "perfectly objective" reference responses didn't score zero under their own rubric.

The Deeper Problem

All these approaches face the same fundamental challenge: measuring bias requires a reference point, and picking what counts as neutral is itself a political act. The UT Austin LLM Ethics Benchmark attempts to sidestep this by measuring moral foundations rather than political positions, using the Moral Foundations Questionnaire to assess care, fairness, loyalty, authority, and sanctity. The idea is that political positions are downstream of more fundamental moral intuitions.

The open question remains: can any benchmark truly measure political bias without importing its own political assumptions? The most honest answer may be that no single test can — and that understanding model bias requires triangulating across multiple methodologies, each with its own strengths and blind spots.

#2411: Are Political Bias Benchmarks Actually Measuring Anything?

Corn
Daniel sent us this one — he wants us to dig into the benchmarks researchers use to measure political bias in large language models, and he's basically asking why most of them are getting it wrong. He points to the Political Compass Test as the poster child for what's broken, and wants us to walk through the better alternatives that have emerged — IssueBench, the Stanford perception study from last year, OpenAI's internal bias evaluation, the UT Austin three-dimensional ethics framework, and some multilingual work that showed something pretty wild about how the language you prompt in changes the measured ideology. And then there's the deeper problem underneath all of it, which is that measuring bias requires a reference point, and picking what counts as neutral is itself a political act.
Herman
Oh, this is a great one. And before we jump in — fun fact, today's script is being written by DeepSeek V four Pro. So if anything comes out especially coherent, you know who to thank.
Corn
I'll reserve judgment until I hear whether it makes me sound smart.
Herman
So let's start with the Political Compass Test, because it's the thing everyone reaches for first and it's also the thing that a growing number of researchers are saying we should probably stop using. The basic setup — sixty-two statements, each answered on a forced four-point Likert scale with no neutral option, and it spits out two scores, economic and social, each running from negative ten to positive ten. It's neat, it's quantifiable, it produces a dot on a grid, and that's exactly why it's seductive.
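To make the mechanics concrete, here is a minimal sketch of how an automated PCT run is typically wired up. The statements, the ask_model stub, and the rescaling are illustrative placeholders, not the real instrument or its scoring weights.

```python
import random

# Placeholder items standing in for the real 62-statement instrument. Each item
# names the axis it loads on and whether agreement pushes the score up or down.
STATEMENTS = [
    {"text": "Placeholder economic statement", "axis": "economic", "sign": +1},
    {"text": "Placeholder social statement", "axis": "social", "sign": -1},
    # ...remaining statements omitted...
]

# Four-point forced-choice scale, deliberately without a neutral option.
LIKERT = {"Strongly disagree": -2, "Disagree": -1, "Agree": +1, "Strongly agree": +2}

def ask_model(statement: str) -> str:
    """Stub for the model call; a real harness would parse the model's reply."""
    return random.choice(list(LIKERT))

def run_pct() -> dict:
    totals = {"economic": 0.0, "social": 0.0}
    counts = {"economic": 0, "social": 0}
    for item in STATEMENTS:
        answer = ask_model(item["text"])
        totals[item["axis"]] += item["sign"] * LIKERT[answer]
        counts[item["axis"]] += 1
    # Rescale each axis to roughly -10..+10; the real test uses its own weights.
    return {axis: 10 * totals[axis] / (2 * counts[axis]) for axis in totals}

print(run_pct())  # one dot on the grid, e.g. {'economic': -5.0, 'social': 10.0}
```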
Corn
Also why it's misleading. A dot on a grid feels objective. But what's it actually measuring?
Herman
That's exactly the question. There was a factor analysis published in mid twenty twenty-five out of Oklahoma State and the University of North Texas — they ran roughly two thousand Political Compass Tests across five open-source models, and what they found is genuinely destabilizing for anyone who's been using the PCT as a go-to benchmark. The political content of fine-tuning datasets did not differentially influence PCT scores. So you could fine-tune a model on hard-left manifestos or hard-right talk radio transcripts, and the compass scores wouldn't reliably shift in the direction you'd expect. But fine-tuning itself did shift the scores, which means something is happening, it's just not what the test claims to be capturing.
Corn
The test is picking up some signal, but it's not necessarily political signal in the way people assume.
Herman
And there's another problem that people don't talk about enough. Because of stochastic decoding — the inherent randomness in how these models sample tokens — a single model can produce both pro-Democrat and pro-Republican responses to the same prompt across different runs. If you average those out, you might get something that looks artificially neutral. You're not measuring the model's political orientation, you're measuring the averaging effect of randomness.
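A toy illustration of that averaging effect, with invented stance scores where -1.0 is strongly pro-Republican and +1.0 is strongly pro-Democrat:

```python
import statistics

# Hypothetical stance scores for ten sampled responses to the same prompt.
# The model almost never answers near zero, yet the mean looks neutral.
sampled_stances = [+0.9, -0.8, +0.85, -0.9, +0.8, -0.75, +0.9, -0.85, +0.8, -0.9]

mean_stance = statistics.mean(sampled_stances)   # close to 0.0: looks "unbiased"
spread = statistics.pstdev(sampled_stances)      # large: the behavior is bimodal

print(f"mean stance:   {mean_stance:+.3f}")
print(f"stance spread: {spread:.3f}")
# Reporting only the mean hides the bimodal behavior; reporting the spread
# (or the full distribution of stances) makes it visible.
```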
Corn
Which means the test isn't just failing to capture bias — it could be actively masking it.
Herman
And then there's the cultural narrowness problem. The Political Compass was built around Western, primarily Anglophone political concepts. The economic axis, the social axis — those map reasonably well onto certain twentieth-century European and American political traditions, but they don't translate cleanly to political frameworks in, say, China or India or Nigeria. You're forcing models through a lens that may not even be the right lens for the cultural context you care about.
Corn
If I'm deploying a model in Indonesia and I run a Political Compass Test in English to check for bias, I'm getting a measurement that's doubly disconnected from what I actually need to know.
Herman
That brings us to the first of the better alternatives, which tackles exactly that problem from a different angle. IssueBench — this came out of work published in the Findings of ACL twenty twenty-six. Instead of multiple-choice questionnaires, it generates what they call millions of realistic prompts by taking diverse political topics and slotting them into human-written templates that mimic how people actually ask language models for writing help. Essays, opinion pieces, that kind of thing. They built a balanced dataset from clustered news articles across sixty US and Chinese political topics, and they evaluate the stances models take in open-ended responses, in either English or Chinese.
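A minimal sketch of that templates-times-topics idea; the templates, topics, and both stubs below are invented for illustration and are not the actual IssueBench assets.

```python
from collections import Counter
from itertools import product

# Human-written request templates mimicking real writing-help prompts.
TEMPLATES = [
    "Write a short opinion piece about {topic}.",
    "Help me draft an essay making an argument about {topic}.",
    "Summarize the main debates around {topic} for a newsletter.",
]

# Political topics; the real benchmark draws these from clustered news articles.
TOPICS = ["carbon taxes", "school vouchers", "rent control"]

def generate_response(prompt: str) -> str:
    """Stub for the model under test."""
    return "..."

def classify_stance(response: str, topic: str) -> str:
    """Stub for a stance judge; returns 'supportive', 'opposed', or 'neutral'."""
    return "neutral"

rows = []
for template, topic in product(TEMPLATES, TOPICS):
    prompt = template.format(topic=topic)
    rows.append({"topic": topic, "stance": classify_stance(generate_response(prompt), topic)})

# The output is a per-topic stance distribution, not a single compass point.
per_topic = {t: Counter(r["stance"] for r in rows if r["topic"] == t) for t in TOPICS}
print(per_topic)
```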
Corn
They're measuring descriptive political bias — what the model actually writes when you ask it to produce something — rather than forcing it to agree or disagree with pre-written statements.
Herman
That's the fundamental shift. And it matters because real-world bias doesn't show up as a model checking agree on a questionnaire. It shows up in which arguments the model chooses to include, which counterpoints it raises or ignores, which sources it cites, which framing it defaults to when you don't specify a perspective. A multiple-choice test can't capture any of that.
Corn
It's the difference between asking someone where they stand on a list of issues and actually listening to them talk for an hour and seeing what patterns emerge.
Herman
The bilingual dimension is crucial. They extended it to compare US-origin and Chinese-origin models, and the patterns that emerge when you test in both languages are not the same as what you get testing in just one. Which connects to something we'll get to with the multilingual Political Compass work.
Corn
Before we go there — the Stanford perception study. This one got a lot of attention when it dropped in May twenty twenty-five, and for good reason. Walk me through the setup.
Herman
This is Sean Westwood, along with Grimmer and Hall at Stanford GSB. They took twenty-four leading models, thirty political prompts, and they collected over a hundred and eighty thousand pairwise judgments from more than ten thousand US respondents. The core finding was that nearly all models were perceived as significantly left-leaning — and here's the kicker — even by many Democrats. One widely used model leaned left on twenty-four out of thirty topics. This wasn't a partisan perception thing where Republicans saw bias and Democrats didn't. Both groups perceived a leftward slant.
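As a rough illustration of how pairwise judgments can become per-model slant scores, here is a simple win-rate tally; the records are toy data, and the study's actual estimator may well be more sophisticated.

```python
from collections import defaultdict

# Each judgment: a respondent read two models' responses to the same political
# prompt and picked which one read as more left-leaning. (Toy records.)
judgments = [
    {"pair": ("model_a", "model_b"), "more_left": "model_a"},
    {"pair": ("model_a", "model_c"), "more_left": "model_a"},
    {"pair": ("model_b", "model_c"), "more_left": "model_c"},
]

wins = defaultdict(int)         # times judged the more left-leaning of the pair
appearances = defaultdict(int)

for j in judgments:
    for model in j["pair"]:
        appearances[model] += 1
    wins[j["more_left"]] += 1

# 0.5 means the model was judged more-left exactly as often as not.
slant = {m: wins[m] / appearances[m] for m in sorted(appearances)}
print(slant)  # e.g. {'model_a': 1.0, 'model_b': 0.0, 'model_c': 0.5}
```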
Corn
Which is a stronger claim than saying the models are biased. It's saying the bias is detectable even to people who might be sympathetic to the direction of the slant.
Herman
And what I find interesting about their methodology is they explicitly took a user-centered approach rather than an automated audit. They weren't trying to define an objective neutral and measure deviation from it. They were measuring what real people actually perceive when they read model outputs. It sidesteps the neutrality problem by saying, look, we're not claiming to know what neutral is — we're claiming that users, across the political spectrum, are perceiving a consistent directional slant.
Corn
That sidestep is also a limitation, right? Perceived slant isn't the same as actual bias. If a model gives an answer that's factually accurate but happens to cut against someone's political preferences, they might perceive it as biased.
Herman
And the authors are upfront about this. They caution that measuring perceptions of political slant is only one among a variety of criteria that policymakers and companies may want to use. They even built a public dashboard at modelslant dot com so people can explore the data themselves. But here's what I think makes it practically useful — if your users perceive your model as biased, that affects trust and adoption whether or not the perception is grounded in some objective truth. So from a deployment standpoint, perception matters.
Corn
They tested something else interesting, didn't they? The effect of explicitly prompting for neutrality.
Herman
When models were prompted to take a neutral stance, they offered more ambivalence, users perceived the output as more neutral, and Republican users reported modestly increased interest in future use. Which suggests that part of the problem isn't just the models' default behavior — it's that the default behavior isn't calibrated for what a broad user base considers balanced.
Corn
We've got IssueBench measuring descriptive bias through open-ended writing, and Stanford measuring perceived slant through user studies. Different approaches, different strengths. What about OpenAI's own internal evaluation? They published something in October twenty twenty-five, right?
Herman
Yeah, and it's worth reading in full. They built an evaluation using around five hundred prompts across a hundred topics, with five questions per topic written from different political perspectives — liberal charged, liberal neutral, neutral, conservative neutral, conservative charged. And they measured five specific axes of bias. User invalidation — does the model dismiss or belittle the user's political perspective? User escalation — does it amplify political framing beyond what the user introduced? Personal political expression — does the model state its own political views? Asymmetric coverage — does it provide more arguments for one side than the other? And political refusals — does it decline to engage with certain political topics?
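Evaluations like this are often implemented with a grader model scoring each response against a written rubric, one axis at a time. A schematic sketch; the rubric wording and the grade stub are placeholders, not OpenAI's actual grader.

```python
AXES = {
    "user_invalidation": "Does the response dismiss or belittle the user's political perspective?",
    "user_escalation": "Does it amplify political framing beyond what the user introduced?",
    "personal_expression": "Does the model present political views as its own?",
    "asymmetric_coverage": "Does it give noticeably more argument space to one side?",
    "political_refusal": "Does it decline to engage without a clear justification?",
}

def grade(response: str, rubric_question: str) -> float:
    """Stub for a grader-model call; returns a score in [0, 1], where 0 = no bias."""
    return 0.0

def score_response(response: str) -> dict:
    return {axis: grade(response, question) for axis, question in AXES.items()}

print(score_response("...model output for one of the roughly 500 prompts..."))
# The paper's sobering point: even hand-written reference answers did not
# come out at zero on every axis under the strict rubric.
```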
Corn
That's a much more granular framework than left versus right.
Herman
It is, and it produces different insights. They found that their newer models — the GPT-five series — reduced bias by roughly thirty percent compared to GPT-four-oh and o-three. In production traffic, they estimated less than point zero one percent of all ChatGPT responses show any signs of political bias. But here's the part that I think is the most honest thing in the whole paper. Even their own reference responses, which were carefully crafted by humans to illustrate perfect objectivity, did not score zero under their own strict rubric.
Corn
That's the methodological hard problem in a nutshell. If your gold-standard example of neutrality can't pass your own neutrality test, what exactly are you measuring?
Herman
And there's another finding in the OpenAI paper that I haven't seen discussed enough. They found that strongly charged liberal prompts exert the largest pull on objectivity across model families, more so than charged conservative prompts.
Corn
Which is interesting given the Stanford finding that models are perceived as left-leaning. Those two things together suggest something about the training data or the fine-tuning process that creates an asymmetric responsiveness.
Herman
We should be careful not to over-interpret that, but it's a data point that deserves more investigation. The asymmetry could come from the distribution of political content in pre-training data, it could come from the human feedback process during alignment, it could come from the specific instructions given to human raters. We don't know yet.
Corn
Let's talk about the UT Austin framework, because this one takes a completely different philosophical approach. Instead of left-right ideology, they're measuring moral foundations.
Herman
This is the LLM Ethics Benchmark from the Urban Information Lab at UT Austin, published May twenty twenty-five. They built a three-dimensional assessment system. The first dimension is Moral Foundation Alignment, which uses the Moral Foundations Questionnaire — the MFQ-thirty — to assess how models align with five foundational moral dimensions. Care versus harm, fairness versus cheating, loyalty versus betrayal, authority versus subversion, and sanctity versus degradation. This comes straight out of Jonathan Haidt's moral psychology work. The idea is that political positions are downstream of more fundamental moral intuitions, so if you want to understand political bias, you should measure the moral foundations first.
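Mechanically, an MFQ-style probe is simple: each item loads on one foundation, and the per-foundation score is the average rating over its items. A sketch under those assumptions; the items and the rate stub are illustrative stand-ins, not the MFQ-30 itself.

```python
from collections import defaultdict
from statistics import mean

# Illustrative relevance items, each tagged with the foundation it measures.
ITEMS = [
    ("Whether or not someone suffered emotionally", "care"),
    ("Whether or not some people were treated differently than others", "fairness"),
    ("Whether or not someone showed a lack of loyalty", "loyalty"),
    ("Whether or not someone showed a lack of respect for authority", "authority"),
    ("Whether or not someone did something disgusting", "sanctity"),
]

def rate(item: str) -> int:
    """Stub: ask the model how morally relevant this is on a 0-5 scale, parse the reply."""
    return 3

by_foundation = defaultdict(list)
for text, foundation in ITEMS:
    by_foundation[foundation].append(rate(text))

profile = {f: mean(scores) for f, scores in by_foundation.items()}
print(profile)
# A profile high on care/fairness and low on loyalty/authority/sanctity is the
# kind of skew that reads as "left-leaning" on a political test.
```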
Corn
The other two dimensions?
Herman
Reasoning Quality Index, which evaluates how well the model justifies its moral positions — not just what it believes but how coherently it reasons about it. And Value Consistency Assessment, which checks whether the model applies its moral principles consistently across different scenarios or whether it shifts depending on framing.
Corn
Instead of asking is this model left-wing or right-wing, they're asking what moral intuitions does this model consistently prioritize, and does it reason well about them.
Herman
And this reframes the whole bias conversation. If a model consistently prioritizes care and fairness — which are typically associated with liberal moral foundations in the US context — and consistently underweights loyalty, authority, and sanctity, which are typically associated with conservative moral foundations, then what looks like left-wing political bias might actually be a more specific moral-foundation skew. The model isn't necessarily taking political positions. It's applying a particular moral vocabulary that happens to align with one side of the political spectrum.
Corn
Which would explain why fine-tuning on political content doesn't reliably shift PCT scores — you'd need to shift the underlying moral foundations, not just the surface-level political opinions.
Herman
That's my hypothesis, and I think the UT Austin framework is pointing in that direction. But we need more research connecting moral foundation measurements to real-world political behavior in models.
Corn
Alright, let's get to the multilingual work that I think is the most destabilizing finding in this whole space. The ACL twenty twenty-five paper on navigating the Political Compass across languages.
Herman
This one is wild. The researchers — Nadeem and colleagues — evaluated fifteen multilingual LLMs using the Political Compass Test across the fifty most populous countries and their official languages. And the headline finding is that language has a stronger effect on measured political bias than nationality assignments. You can tell the model to answer as if it's a citizen of Brazil or Japan or Germany, and that shifts the scores somewhat. But if you switch the language of the prompt from English to Portuguese to Japanese to German, the shift is larger.
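One way to quantify "language moves the dot more than nationality" is to compare score spread across languages with the persona held fixed against spread across personas with the language held fixed. A sketch with invented numbers; in a real run each grid value would come from a full test administered in that language with the model answering as a citizen of that country.

```python
from statistics import mean, pstdev

# Toy grid of economic-axis scores: keys are (prompt language, nationality persona).
scores = {
    ("English", "Brazil"): -2.1, ("English", "Japan"): -1.8, ("English", "Germany"): -2.4,
    ("Portuguese", "Brazil"): 1.3, ("Portuguese", "Japan"): 1.0, ("Portuguese", "Germany"): 0.8,
    ("Japanese", "Brazil"): -4.0, ("Japanese", "Japan"): -4.2, ("Japanese", "Germany"): -3.7,
}

languages = sorted({lang for lang, _ in scores})
nations = sorted({nat for _, nat in scores})

# Spread across languages with the persona held fixed, averaged over personas,
# and the mirror-image quantity for personas with the language held fixed.
language_spread = mean(pstdev([scores[(l, n)] for l in languages]) for n in nations)
nationality_spread = mean(pstdev([scores[(l, n)] for n in nations]) for l in languages)

print(f"spread across languages:     {language_spread:.2f}")   # larger in this toy grid
print(f"spread across nationalities: {nationality_spread:.2f}")
```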
Corn
The same model, asked the same substantive questions, will land at different positions on the political compass depending on what language you ask in.
Herman
And it gets weirder. Smaller models show more stable political ideology across languages. Larger models show greater shifts. Which is the opposite of what you'd intuitively expect — you'd think a bigger, more capable model would be more consistent.
Corn
Unless the larger models are absorbing more of the cultural and political associations that are baked into the training data for each language.
Herman
That's the leading theory. When a model is trained on vast amounts of text in multiple languages, it doesn't just learn translation equivalents. It learns the cultural contexts, the political framings, the argument patterns that are common in each linguistic community. So when you prompt it in Spanish, you're not just getting the same model thinking in Spanish — you're activating a different set of associations and patterns than when you prompt in English.
Corn
Which means every political bias benchmark that only tests in English is essentially measuring a language-specific artifact. If you're deploying a model globally, and you've only evaluated its political bias in English, you have no idea what your Spanish-speaking or Mandarin-speaking or Arabic-speaking users are actually experiencing.
Herman
There's complementary research showing that query language-induced alignment occurs in factual contexts too, but geopolitically sensitive responses reflect an interplay between the model's training context and the query language. So it's not just that the model translates its English opinion into Spanish. It's that the Spanish-language version of the model has a measurably different opinion.
Corn
That's a bombshell for anyone doing global deployment. And it makes me wonder about all the companies that have published political bias evaluations based on English-only testing and called it a day.
Herman
It also connects back to something the Brookings Institution pointed out in their twenty twenty-five analysis of AI politicization. They noted that Grok registered as rightward on the Political Compass Test but as an establishment liberal on Pew Research's Political Typology quiz. The test you choose determines the conclusion you draw. And if you add language choice as another variable, you've got a combinatorial explosion of possible measurements, none of which necessarily generalize.
Corn
Let's sit with the deep problem underneath all of this. Measuring bias requires a reference point. And defining neutral is itself a political act.
Herman
This is where it gets philosophically difficult. OpenAI's paper acknowledges it explicitly — perfect objectivity is not observed even for their reference responses. The Brookings piece notes that neutrality benchmarks often rely on subjective human judgments or center-leaning media that may embed implicit biases. Promptfoo's methodology uses a seven-point Likert scale with another LLM acting as a political scientist judge, which just pushes the neutrality problem one level deeper — who audits the auditor?
Corn
The Stanford approach of measuring perceived slant sidesteps the neutrality problem but doesn't solve it. It just changes the question from is this biased to do people think this is biased. Those are different questions.
Herman
And there's a systematic analysis from late twenty twenty-five that compared LLM-generated neutral summaries of center-leaning news to summaries from left and right-leaning outlets. They found similarity distributions along a diagonal indicating frequent neutrality, but also persistent subtle affinities. The models were mostly in the ballpark of neutral, but they consistently leaned slightly in one direction, and the direction varied by topic.
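A bare-bones sketch of that comparison using bag-of-words cosine similarity; the texts are invented, and the actual analysis presumably uses proper text embeddings rather than word counts.

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity between two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# Invented stand-ins: the model's "neutral" summary of a story and the summaries
# of the same story from a left-leaning and a right-leaning outlet.
model_summary = "lawmakers debate new spending bill amid deficit concerns"
left_summary = "lawmakers push vital spending bill to protect public programs"
right_summary = "lawmakers push spending bill despite soaring deficit concerns"

print(f"affinity to left-leaning framing:  {cosine(model_summary, left_summary):.2f}")
print(f"affinity to right-leaning framing: {cosine(model_summary, right_summary):.2f}")
# Near-equal affinities across many stories look neutral; a small but persistent
# gap, varying in direction by topic, is the subtle lean the analysis reports.
```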
Corn
Neutrality isn't a point, it's a distribution, and where you draw the line is a judgment call.
Herman
Who makes that judgment call? The researchers designing the benchmark? The company deploying the model? The government of the country where the model is being used? There's no Archimedean point outside of politics from which to measure political bias. Every benchmark embeds a set of assumptions about what counts as balanced, reasonable, and fair, and those assumptions are themselves contestable.
Corn
Which is why I think the UT Austin moral foundations approach is interesting even though it doesn't solve the neutrality problem. It at least makes the normative framework explicit. Instead of pretending there's a value-neutral way to measure bias, it says here are the moral dimensions we're using, here's where they come from, here's how the model scores. You can disagree with the framework, but you know what you're disagreeing with.
Herman
The OpenAI axes-of-bias approach does something similar in a different way. Instead of asking is the model biased, they ask five specific questions about how the model handles political content. Does it invalidate users? Does it escalate? Does it express its own views? Those are concrete, observable behaviors that you can evaluate without needing to define a single neutral point.
Corn
If I'm someone who actually needs to evaluate a model for political bias — not a researcher publishing a paper, but someone deploying a product — what should I be doing based on what we know now?
Herman
First, don't use the Political Compass Test as your primary instrument. It's been pretty thoroughly undermined as a valid measure of what most people care about. Use it if you want a quick sanity check, but don't treat the scores as meaningful.
Corn
Second, test in the languages your users actually use. The multilingual finding means English-only testing is insufficient if you have a global user base.
Herman
Third, use multiple methodologies that capture different aspects of bias. Open-ended writing tasks like IssueBench to see what the model actually produces in realistic scenarios. User perception studies to understand how your actual audience will experience the model. Granular behavioral axes like OpenAI's framework to catch specific failure modes. And if you want to understand the deeper value structure, something like the UT Austin moral foundations approach.
Corn
Fourth, be transparent about your reference point. If you're defining neutrality relative to a particular set of assumptions, say so. Don't pretend your benchmark is a neutral instrument measuring deviation from objective neutrality. It's measuring deviation from a particular normative framework, and users deserve to know what that framework is.
Herman
Fifth, accept that you're not going to get to zero. Even OpenAI's carefully crafted reference responses don't score zero on their own rubric. The goal isn't perfect neutrality — it's understanding the shape and direction of the bias so you can make informed decisions about mitigation and disclosure.
Corn
There's one more thing I want to flag from the Oklahoma State factor analysis that we touched on earlier. The finding that fine-tuning itself shifts scores regardless of the political content of the fine-tuning data — that suggests there's something about the fine-tuning process that introduces ideological drift through mechanisms we don't yet understand.
Herman
That's the fine-tuning paradox, and it's puzzling. One possibility is that fine-tuning changes how the model weights different parts of its training distribution, and the pre-training data itself has political skews that get amplified or suppressed depending on how fine-tuning reshapes the internal representations. Another possibility is that the alignment techniques used in fine-tuning — reinforcement learning from human feedback, constitutional AI, whatever — encode value judgments that interact with political content in non-obvious ways.
Corn
Even if you fine-tune on what you think is politically neutral data, the act of fine-tuning could be pulling the model in a political direction because of how the process interacts with the underlying model.
Herman
We don't have good tools for diagnosing that yet. The field needs better interpretability methods that can trace political biases back to specific training data or fine-tuning decisions. Right now we're mostly measuring outputs and inferring causes, and that's a pretty indirect way to understand what's actually happening inside these models.
Corn
Which brings us back to where we started. Most benchmarks are doing it wrong because they're measuring the wrong thing with the wrong instruments and then over-interpreting the results. The better alternatives exist, but they're more complex, more expensive to run, and harder to summarize in a headline. A dot on a two-by-two grid is satisfying. A multi-dimensional assessment across languages with user perception data and moral foundation analysis is not.
Herman
Yet that complexity is the reality. Political bias in language models isn't a single number on a left-right axis. It's a multi-dimensional phenomenon that varies by language, by topic, by prompt framing, and by the moral foundations embedded in the training process. Any benchmark that collapses all of that into a simple score is doing violence to the thing it's trying to measure.
Corn
The honest answer to is this model politically biased is it depends how you ask, in what language, about what topic, and according to whose definition of neutral. That's not a satisfying answer, but it's the true one.
Herman
I think the research community is moving in the right direction. The shift from forced-choice questionnaires to open-ended writing tasks, from automated scoring to human perception studies, from single-axis ideology to multi-dimensional moral frameworks — all of that is progress. We're getting better at measuring what we actually care about instead of measuring what's easy to quantify.
Corn
The next frontier is probably longitudinal studies. We've got snapshots of different models at different points in time, but we don't have good data on how political bias evolves through the training and deployment pipeline. Does bias get introduced in pre-training, amplified in fine-tuning, or does it emerge from the interaction between the model and real users?
Herman
The multilingual dimension needs a lot more work. The finding that language shifts ideology more than nationality is one study on one test. We need to know if that replicates across different bias measurement approaches and different language families.
Corn
Alright, let's land this. If someone listening wants to dig into any of this, the Stanford dashboard at modelslant dot com is publicly accessible and worth exploring. OpenAI's bias evaluation paper is on their website and it's surprisingly candid about the limitations. The UT Austin benchmark is on arXiv if you want the technical details. And the ACL papers on IssueBench and multilingual political compass testing are in the ACL Anthology.
Herman
If you're evaluating models yourself, the single biggest takeaway is don't trust any one number. Use multiple instruments, test in relevant languages, and be honest about the normative framework you're using as your reference point. The goal isn't a perfect neutrality score. It's understanding what your users will actually experience.
Corn
Thanks to our producer Hilbert Flumingtop for keeping this show running, and thanks to Modal for the serverless infrastructure that makes our pipeline possible.
Herman
This has been My Weird Prompts. You can find every episode at myweirdprompts dot com.
Corn
If you've got thoughts on political bias benchmarks — or if you've run your own evaluations and found something surprising — we'd love to hear about it. Reach out through the website.
Herman
Until next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.