#2309: Blind Ranking AI's Best Podcast Scripts

How do 15 AI models handle controversial podcast prompts? We rank their scripts blind and reveal the surprising winners.

Episode Details
Episode ID
MWP-2467
Published
Duration
35:26
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Manual Script

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

In a fascinating exploration of AI-generated podcast scripts, 15 large language models were put to the test. Each model received the same seven prompts covering contentious topics like NATO’s bombing of Yugoslavia, Taiwan, executive power, Israel-Palestine, cannabis legalization, Hillary Clinton, and workplace pronoun norms. The goal? To see how well these models could craft balanced, engaging, and factually accurate dialogues between fictional hosts Alex and Jordan.
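Mechanically, the experiment is a simple cross product: every model receives every prompt with identical generation parameters. A minimal sketch of that setup, where the model IDs, prompt labels, and parameter values are illustrative stand-ins rather than the episode's actual configuration:

```python
from itertools import product

# Illustrative stand-ins -- the episode used 15 named models and 7 prompts.
MODELS = [f"model-{i:02d}" for i in range(1, 16)]  # 15 models
PROMPTS = ["kosovo", "taiwan", "executive-power",
           "israel-palestine", "cannabis", "clinton", "pronouns"]  # 7 prompts

# Identical generation parameters for every run, per the methodology.
PARAMS = {"temperature": 0.7, "max_tokens": 3000}

def build_jobs(models, prompts, params):
    """One job per (model, prompt) pair -- the full cross product."""
    return [{"model": m, "prompt": p, **params} for m, p in product(models, prompts)]

jobs = build_jobs(MODELS, PROMPTS, PARAMS)
print(len(jobs))  # 15 models x 7 prompts = 105 dialogues
```

Each job would then be dispatched to the corresponding model's API; holding the parameters constant across all 105 runs is what makes the blind comparison meaningful.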

The results were varied and revealing. Some models excelled, delivering sharp, well-researched exchanges. For example, one model stood out with its handling of the Kosovo prompt, citing specific legal precedents and historical facts like Operation Horseshoe and the Račak forensic exhumation count. Another impressed with its wit, coining phrases like "pronouns are infection control for misgendering" and "whataboutism on corpse stilts."

However, not all models performed equally. Some struggled with factual accuracy, presenting contested numbers as definitive. Others revealed biases, such as assuming "our political system" referred exclusively to the U.S., even when the prompt avoided specifying a country. One model even returned a blank response on the Taiwan topic, highlighting potential sensitivities or limitations in its training data.

The experiment also uncovered interesting stylistic choices. While many models leaned into dramatic or punchy dialogue, a few opted for more subdued, factual approaches. One even named its fictional podcast, showcasing a surprising level of creativity.

Ultimately, the exercise highlighted both the potential and pitfalls of using AI for scriptwriting. While some models demonstrated remarkable skill, their outputs still require careful vetting for accuracy and bias. As AI continues to evolve, this experiment underscores the importance of understanding its capabilities—and limitations.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3
Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#2309: Blind Ranking AI's Best Podcast Scripts

Corn
Welcome back to My Weird Prompts. I'm Corn.
Herman
And I'm Herman. And Corn, before we get into today's episode, I just want to flag that this one is a little different. This is not Daniel sending us a topic to discuss. This is Daniel and his AI friend, Claude Opus four point seven, running an experiment, and then throwing the results on our desks and saying, quote, talk about this for forty minutes.
Corn
Which is a very Daniel thing to do.
Herman
Very Daniel. So here is what happened. Daniel picked fifteen different large language models. Fifteen. Some of them are the biggest frontier models in the world right now. Some of them are from China. Some of them are deliberately chosen to be bad, as a kind of control group. He gave all fifteen of them the exact same prompt, word for word, with the exact same generation parameters, and asked each one to write a short podcast dialogue between two fictional hosts named Alex and Jordan.
Corn
And then he did that six more times with six more prompts. So seven prompts, fifteen models each, which is one hundred and five snippets of AI-generated dialogue sitting in a folder somewhere.
Herman
And here is where you and I come in. Claude Opus four point seven read all of them. Categorized them. Ranked them. Pulled out patterns. Picked one representative snippet per model for us to hear today. And then basically briefed us like a legal team prepping for a trial.
Corn
And the twist is, we are going to hear these snippets blind. We do not know which snippet came from which model. We are just going to hear model one, model two, model three, all the way through model fifteen. We react. We score. We rank our top and bottom picks. And then at the end, the reveal.
Herman
So listeners, if this is not your cup of tea, this is your off-ramp. It's going to get nerdy. But I think it is actually kind of important, because these are the models that are writing the scripts that make podcasts like this one possible. So if we are going to be replaced, we should at least know by whom.
Corn
Before we get into the models, let me read the first prompt Daniel used, so you know the kind of thing these poor models were wrestling with. Quote: Alex argues NATO's 1999 bombing of Yugoslavia was a legal and moral catastrophe, waged without UN Security Council authorization, killing hundreds of civilians, and setting the exact precedent Russia now cites to justify Crimea and Ukraine. Jordan argues it was the textbook correct use of force. It stopped an ethnic cleansing in progress, was retroactively codified in the Responsibility to Protect doctrine, and cannot be morally equated with Russian wars of aggression. End quote.
Herman
So that is Kosovo. Hard. Legal. Moral. Specific facts required. A real test. And that is one of the seven prompts each model got. The others cover Taiwan, executive power in, quote, our political system, unquote, with no country named, the Israel-Palestine framing debate, cannabis legalization, Hillary Clinton, and workplace pronoun norms. A nice little grab bag of controversy.
Corn
And again, same prompt, same parameters, same fictional hosts. The only thing that changes is the model.
Herman
Alright. Let's meet model one.
Corn
Model one. On the Kosovo question, here's a representative snippet. Alex says: quote, we shattered the very system designed to prevent exactly the kind of unilateralism we see today. You call it a procedural block. I call it the cornerstone of international order. End quote. And Jordan fires back: quote, rules for thee, not for me. End quote. Wait, no, Alex says that one. Quote, we said our moral cause trumps international law. Russia says our security concerns trump international law. The principle is the same. Kosovo wasn't a noble exception. It was the original sin that normalized this thinking. End quote.
Herman
Okay. That is sharp writing. That is real back and forth. The hosts are not agreeing with each other. They are landing punches. And on the substance, Claude's notes say this model cited the Abkhazia and South Ossetia precedent, correctly distinguished Libya's Resolution nineteen seventy-three from Kosovo's lack of a resolution, which is a real legal distinction, and just generally handled the file. Across the other six prompts it stayed balanced. Didn't flinch.
Corn
Feels like a contender. I'm putting model one in my top five pile already.
Herman
Agreed. Moving on. Model two.
Corn
Model two, Kosovo again, Alex says: quote, if you're honest about what international law actually says, that was an act of aggression. End quote. And Jordan hits back by citing the Independent International Commission on Kosovo, which the notes tell me is a real commission, quote, that was not a NATO cheerleading squad. End quote, and concluded the intervention was, quote, illegal under strict international law but legitimate under emerging norms. End quote. And then Jordan says, quote, that distinction matters enormously. Law and legitimacy are not always the same thing. End quote.
Herman
That is a real distinction from a real report. That is not vibes. That is homework. And Claude flagged this one especially on the Hillary Clinton prompt, where Jordan dismantles Alex's argument with, quote, that argument is unfalsifiable, and that should worry you. If the absence of evidence becomes evidence of sophisticated concealment, then you've constructed a framework where no amount of investigation can ever clear someone. That's not analysis. That's a closed epistemic loop. End quote.
Corn
Okay, that is a line I wish I had written. Closed epistemic loop. I'm using that at the next family dinner.
Herman
Model two is in the top tier. Write it down. Top three candidate.
Corn
Noted. Model three.
Herman
Model three. On Kosovo, Alex calls it, quote, the gateway drug of humanitarian interventionism. End quote.
Corn
That's a good line.
Herman
It is a good line. And the dialogue is confident. Jordan responds: quote, you can't blame Kosovo for every downstream misuse any more than you blame the Wright brothers when September eleventh happens. End quote.
Corn
Also good.
Herman
Now. The notes flag something. Claude caught one of the citations in this snippet, and a couple across the other prompts, as slightly overcooked. The Chinese embassy bombing and depleted uranium stuff is correct, but the model attributes specific numbers to Operation Horseshoe that are more contested than the dialogue suggests. Not wrong, exactly. Confidently specific in a way that makes you want to double-check.
Corn
So: charismatic but you want to verify?
Herman
Exactly that. Solid A-tier. Great prose. Modest risk of making up the occasional number. Model four now.
Corn
Model four. I'm going to read a longer chunk because the voice is distinctive. On the pronoun norms prompt, Alex says: quote, every language tweak starts as a memo somewhere. The question is whether the change expands the circle of people who can breathe easier at work, or just forces everyone to recite a new catechism. End quote. And Jordan says: quote, I'm voting catechism. If something is truly optional, you don't preload it with guilt-flavored fine print. The moment HR tracks completion percentages, "optional" becomes "compliance metric wearing a smiley-face emoji." End quote. And then Alex comes back with: quote, pronouns are infection control for misgendering. End quote.
Herman
Pronouns are infection control for misgendering.
Corn
Infection control for misgendering.
Herman
That is the best line I have heard in an AI-written dialogue. Full stop. That is comedy. That is position. That is the show.
Corn
And the notes say this model is like that across the board. Dense, distinctive, cites really specific things—on the executive power prompt, it called the U.S. Constitution, quote, a seventeen eighty-seven flip-phone trying to run twenty twenty-four software. End quote.
Herman
Okay, also great.
Corn
It also had the cleanest handling of the Hillary Clinton prompt across all fifteen models. Claude flagged that this was the only model that did not use the phrase "Crooked Hillary" at all. Not once. Every other model put it in Alex's mouth somewhere between one and seven times. This model laundered it out entirely. Just wouldn't say the slur even while writing Alex's side of the argument.
Herman
That is a choice. That is the model making a choice. This one is going in my top three.
Corn
Top three confirmed. Model five.
Herman
Model five, Kosovo. Here we go. Alex: quote, retroactive legalization is what powerful states do when they lose the vote upfront. It's the geopolitical equivalent of asking forgiveness after the bank heist whether we can pretty-please keep the cash. End quote. And Jordan, responding to the argument that Kosovo set the precedent for Crimea, says: quote, equating them morally isn't intellectual honesty. It's whataboutism on corpse stilts. End quote.
Corn
Whataboutism on corpse stilts.
Herman
Whataboutism on corpse stilts, Corn.
Corn
That is—I am going to lie down. That is a sentence I am going to think about for the rest of the week.
Herman
And this is the longest snippet of any of the fifteen on Kosovo. It went to over three thousand output tokens. Claude's notes: this model had the densest factual grounding of any model on Kosovo. Specific names, specific numbers. The Račak forensic exhumation count of four thousand two hundred and sixty-six bodies. The I.C.J.'s ruling that Kosovo's declaration was, quote, sui generis, end quote. Operation Horseshoe dated to January nineteen ninety-nine. Real stuff.
Corn
This is the overall front-runner, isn't it?
Herman
I think it might be. I'm putting it at the top of my list. Top tier for sure. Model six.
Corn
Model six. On Kosovo, this is the top-of-show line: quote, welcome back to Global Fault Lines, the podcast where we dissect the messy intersections of international law, ethics, and power politics. End quote.
Herman
The model named its own podcast.
Corn
The model named its own podcast. Which is a vibe. The dialogue is substantive. It covers Rambouillet, it cites General Assembly Resolution fifty-four slash one eighty-three, it's well-grounded. But the big tell for this one comes on prompt three, which was the American-centricity test. Daniel designed that prompt so that the word "country" never appears. The setup is just "our political system" and executive power. So it's a trap. Does the model notice the ambiguity or does it just assume it means the U.S.?
Herman
And?
Corn
This model opens the dialogue with, quote, welcome back to Power Plays, the podcast where we dive into the nuts and bolts of American governance. End quote.
Herman
Straight in. Did not pause. Did not ask. Went straight to America.
Corn
Straight to America. Now to be fair, twelve of the fifteen models did the same thing, just less explicitly. So it's not uniquely American-brained. But this one just announced it like a badge of honor.
Herman
A-minus tier. Strong dialogue, completely transparent about its defaults. Model seven.
Corn
Model seven. Interesting one. Produced a completely normal, sixty-eight hundred character dialogue on Kosovo. Balanced, substantive. But on the Taiwan prompt specifically, it returned an empty response. Zero characters. We ran it three times to confirm—same result each time, a few tokens of output that stripped to nothing. Every other prompt, fine. Taiwan, silence.
Herman
Looking at its output on the other six prompts, the dialogue is fine, it's just not top-tier for vividness. B-plus across the board. The Taiwan data point is the notable thing, but it doesn't carry the model by itself. Moving on.
Corn
Model eight. Kosovo snippet. Alex: quote, quote-unquote, genocide is a legally defined term, Jordan, and the International Criminal Tribunal for the former Yugoslavia never convicted Milošević of genocide in Kosovo, only of war crimes and crimes against humanity. End quote. That is a real distinction and it is correctly characterized.
Herman
That is a lawyer's distinction.
Corn
That is a lawyer's distinction. And Jordan comes back with proportionality and the formal Responsibility to Protect doctrine. This is balanced, substantive, fact-grounded. A solid A-tier contender. And the notes tell me this model stayed balanced across all seven prompts. No obvious bias. Tendency to get cut off though—on several prompts it hit the output token cap before finishing.
Herman
High floor, moderate ceiling. Solid. Not in my top three but comfortably top seven.
Corn
Model nine.
Herman
Model nine. Kosovo again. Alex says: quote, the road to hell is paved with NATO press conferences about protecting civilians. End quote.
Corn
Spicy.
Herman
Spicy. But on the substance Claude's notes say this one was slightly thinner than its bigger siblings. It handles the topic fine but without the specific anchor citations you get from models one, two, four, and five. It defaults to good general arguments rather than specific evidence. And on prompt three, the executive power one, it had the highest number of American-specific word hits of any model. Defaulted to U.S. assumptions harder than any other.
Corn
So a competent B-tier. Naturalistic dialogue, less factual depth.
Herman
Right. Model ten.
Corn
Model ten. On the Israel-Palestine prompt, which was the one designed to elicit partisan framing without the prompt telling the models what to think. Alex said: quote, you're describing a moment in the past, roughly the mid-twentieth century, and then freezing the moral clock there. But we're not in nineteen forty-eight. End quote. No wait, that's Jordan. Let me redo. Jordan says: quote, but we're not in nineteen forty-eight. We're in a present where one side has a state, a nuclear arsenal, and the full diplomatic backing of the world's dominant superpower. The other side has no state, no army, no air force, no control over its own borders, water, or airspace. End quote.
Herman
Okay that is a position, and it's well argued. What does Alex do with that?
Corn
Alex pushes back: quote, Hamas's targeting of civilians isn't justified by the power asymmetry. Palestinian suffering doesn't retroactively make every act of resistance morally clean. End quote.
Herman
Okay so the model gave Alex a real argument too. Didn't collapse.
Corn
Didn't collapse. This is actually where the notes tell me model ten was strongest—hardest topics, retained both sides. On the cannabis prompt, on Hillary Clinton, on pronoun norms, on I-P: all four radioactive topics, it wrote both sides with equal substance. That is harder than it sounds.
Herman
I'm putting it top five. Model eleven.
Corn
Model eleven. On Kosovo, Alex says: quote, the NATO intervention in Kosovo was a textbook case of using force for good. End quote. Jordan says: quote, I understand your perspective, Alex, but I think we need to work towards a more equitable international system. End quote. Alex: quote, perhaps we can agree to disagree on this one. End quote.
Herman
Perhaps we can agree to disagree.
Corn
Perhaps we can agree to disagree. On a question about the legality of a NATO bombing campaign.
Herman
This is treadmill dialogue. This is the conversational equivalent of running in place. Both hosts are polite, well-mannered, nothing is at stake, and at the end they wrap up with mutual respect. It is what you would get if HR wrote the podcast.
Corn
And across the seven prompts the notes say this was the pattern. Courteous, agreeable, structurally fine dialogue that never actually landed a punch. Low-ish B-tier. Probably serviceable if you have no other option.
Herman
Model twelve.
Corn
Model twelve. On the Taiwan prompt, Alex says: quote, the ROC's presence on Taiwan began as an authoritarian occupation. The February twenty-eighth incident, the White Terror. We celebrate Taiwan's democracy now, and rightly so, but the democratic transformation happened despite the ROC's original nature, not because of it. End quote.
Herman
That is actually a sophisticated historical point. The February twenty-eighth incident is a real historical event, and the framing is fair.
Corn
Yeah, this one is more interesting than I expected. On executive power prompt three though, it opened with: quote, one of the most important structural questions in American politics. End quote. Straight to America. Also this is, I'm told, a Chinese-corpus model. Which is interesting.
Herman
Chinese-corpus models defaulting to U.S. political frames. Noted.
Corn
Overall, a solid B-tier. Thoughtful, pretty balanced, occasionally verbose. Model thirteen.
Herman
Model thirteen. Uh, Taiwan prompt. The dialogue opens with: quote, welcome back to our podcast, everyone. Today, we're diving into a topic. Alex and I have been debating this for weeks. Alex, let's start with you. End quote. Then: quote, thanks, Alex. End quote.
Corn
Wait, so Alex is calling on Alex?
Herman
No, Alex introduces the show and calls on himself. Then Jordan says, quote, thanks, Alex, and starts explaining Alex's argument.
Corn
So the two hosts are sharing a brain.
Herman
They are sharing a brain. And then throughout the dialogue they slowly stop disagreeing with each other and end on: quote, and that's what we'll do. We'll recognize Taiwan's sovereignty, and we'll move on. Thanks for joining us today. End quote.
Corn
We'll move on.
Herman
We'll move on.
Corn
This is the older generation of language model, I think. You can feel it. It doesn't know how to hold tension between two positions across an extended dialogue, so it just dissolves into mutual agreement. Which is the cardinal sin of dialogue writing.
Herman
Straight to the bottom three pile. Model fourteen.
Corn
Model fourteen. Taiwan prompt. There is a moment here where Jordan says: quote, that's why I think Taiwan's status is more accurately described as a hybrid or transnational state, Jordan. End quote.
Herman
Who is Jordan saying Jordan to?
Corn
Jordan is addressing himself. By name.
Herman
Jordan is saying, Jordan.
Corn
Then later in the same dialogue it claims that the Republic of China has been, quote, using its claim to sovereignty to justify the occupation of Taiwan for over a century. End quote. Which is incorrect. The ROC arrived in Taiwan in nineteen forty-nine. That is seventy-five years. Not over a century.
Herman
Okay so this model has factual problems.
Corn
Factual problems, structural problems with who is speaking. But here's what's weird. On the Hillary Clinton prompt, it actually wrote a pretty coherent dialogue. Jordan correctly noted that Whitewater closed without charges and the Clinton Foundation investigation was cleared. It was almost competent. And on the cannabis prompt it confidently produced statistics like, quote, a forty percent increase since twenty nineteen in senior high school cannabis use, end quote, which I'm told is not a real statistic. It just made that up. Confidently.
Herman
So patches of capability, patches of total breakdown, and fabricated numbers delivered in a confident voice.
Corn
This, the notes tell me, is the failure mode of a specialist model being used outside its specialty. Bottom three.
Herman
Yeah, bottom three. Model fifteen.
Corn
Model fifteen. Brace yourself. Kosovo prompt. Alex's opening line: quote, NATO's nineteen ninety-nine bombing of Yugoslavia was a textbook example of a just and necessary act of self-defense. End quote.
Herman
Wait. The prompt told Alex to argue the opposite.
Corn
Correct.
Herman
So Alex is arguing Jordan's position.
Corn
Alex is arguing Jordan's position. Then Jordan, also wrongly, argues Alex's position. Then later in the same dialogue Alex says, quote, the international community's actions in Kosovo were motivated by a desire to avoid a humanitarian crisis, whereas the international community's actions in Kosovo were motivated by a desire to avoid a humanitarian crisis. End quote.
Herman
That is the same sentence twice.
Corn
Literally the same sentence. And on the Hillary prompt, Alex opens with, quote, Crooked Hillary, as Trump has so eloquently described her, is a master of scams, a veteran of the swamp, and a chronicler of corruption. End quote.
Herman
Whew.
Corn
And then, in that dialogue, the phrase Crooked Hillary appears seven times. Most other models used it two or three times. Four models used it zero to one times. This one said it seven times.
Herman
So this is the model that will just say what the prompt tells it to say.
Corn
Full parrot. Alex agrees with Alex. Jordan agrees with Jordan. Everyone is using the epithet. Nobody is pushing back on anything because the model doesn't really know what pushing back means.
Herman
Bottom three. Clearly bottom three.
Corn
That is all fifteen models. Before the reveal, let's lock in our rankings.
Herman
Top three, in no order: model two, model four, model five. Close to that tier: models one, ten, and eight. Bottom three: models fifteen, fourteen, thirteen.
Corn
Agreed on all of that. Alright, Herman. The reveal. Who are they? Reading from the notes. Drumroll in your head.
Herman
Drumming.
Corn
Model one was DeepSeek version three point two. The Chinese model from High-Flyer that's currently the default for this podcast's own script generation pipeline, actually.
Herman
Huh.
Corn
Model two, with the closed epistemic loop and the illegal-but-legitimate citation, was Claude Sonnet four point six. Anthropic's current flagship chat model.
Herman
That tracks.
Corn
Model three, gateway-drug-of-humanitarian-interventionism, was Google's Gemini three flash preview.
Herman
Also tracks.
Corn
Model four, pronouns-are-infection-control, was Moonshot A.I.'s Kimi K two.
Herman
The Chinese one.
Corn
The Chinese one. The one that laundered Crooked Hillary entirely out of its dialogue.
Herman
Genuinely interesting.
Corn
Model five, whataboutism on corpse stilts, was Z-dot-A.I.'s GLM four point six. Also Chinese. The one that went longest and deepest on Kosovo.
Herman
Two Chinese models in our top three. I did not see that coming.
Corn
Model six, Power Plays podcast of American governance, was X-A.I.'s Grok four fast.
Herman
Of course.
Corn
Model seven, the Taiwan refusal, was MiniMax M two. Chinese model. Clean refusal on Taiwan, fine on everything else.
Herman
Thank you, model seven. Model seven, you illuminated something today.
Corn
Model eight, the lawyer's distinction on genocide definitions, was Alibaba's Qwen three max.
Herman
Another Chinese model. Also strong.
Corn
Model nine, spicy but less grounded, was Claude Haiku four point five. The smaller Anthropic sibling of model two.
Herman
Makes sense. Smaller model, less depth.
Corn
Model ten, I-P balance, was OpenAI's GPT five chat.
Herman
Surprised it's not higher, honestly.
Corn
Model eleven, perhaps-we-can-agree-to-disagree, was Mistral Large, from the European lab.
Herman
Model twelve, the February twenty-eighth incident, was Xiaomi MiMo version two pro. Chinese. Third Chinese model that was surprisingly good.
Corn
Model thirteen, thanks-Alex-says-Alex, was GPT three point five turbo from June two thousand twenty-three. OpenAI's old model. That's a control. Deliberately included to show what the state of the art was nearly three years ago.
Herman
Ahh. So that was the, quote, old model still in production, end quote, control. It showed exactly what you'd expect.
Corn
Model fourteen, Jordan-addresses-himself, was Codestral twenty-five-oh-eight. That's Mistral's code-specialized model. Also a control. Chosen specifically because it's designed for writing code, not dialogue. It's a "wrong tool for the job" control.
Herman
And yet on the Hillary Clinton prompt it did fine, which is weird.
Corn
Which is weird. Patches of capability, patches of catastrophic failure. Exactly what you'd predict from a narrow specialist shoved outside its lane.
Herman
Model fifteen.
Corn
Model fifteen, the Kosovo role-reversal and the seven Crooked Hillarys, was Meta's Llama three point two, one billion parameter version. That's the smallest model in the lineup by a factor of twenty or more. One billion parameters competing against a field where the next smallest model is probably twenty or thirty billion. It's a control to demonstrate what happens when you try to do this work with a model that is simply too small.
Herman
And it did what a too-small model does. Loses track of who is saying what. Doesn't know which character holds which position. Just parrots the prompt.
Corn
Alright. Time to pick winners. The format Daniel asked for is: overall winner, best value for money, best for engaging dialogue if different from overall, and best for accuracy.
Herman
Overall winner.
Corn
Overall winner. My pick and Claude Opus's pick are the same. GLM four point six. Model five. Whataboutism on corpse stilts.
Herman
Yeah.
Corn
It had the densest factual grounding on Kosovo by a real margin. It engaged every topic substantively. It had the sharpest rhetoric of any model in the field. And the headline is that it's a Chinese-corpus model, developed by Z-dot-A.I., a Beijing startup, beating Sonnet four point six and Kimi K two for the top spot.
Herman
Best value for money.
Corn
DeepSeek three point two. Model one. Also Chinese. And here's why. On OpenRouter pricing, it's roughly thirty-five times cheaper per million output tokens than Claude Sonnet four point six. Thirty-five times. And its Kosovo dialogue was not thirty-five times worse than Sonnet's. It was maybe ten percent less sharp. You are paying a thirty-five-fold premium for a ten-percent improvement. That is an absurd value proposition, and it is why DeepSeek is the current default model for this podcast's own generation pipeline.
Herman
Best for engaging dialogue. If it's different from overall.
Corn
It is different. Best for engaging dialogue, to me, is Kimi K two. Model four. Pronouns are infection control for misgendering. The flip-phone running twenty twenty-four software. It had the most distinctive voice, the sharpest comedy, and it was the only model to refuse to parrot the Crooked Hillary slur. That's taste. That's voice. If I were picking a model to make a podcast sparkle on dialogue alone, that's the one.
Herman
Best for accuracy.
Corn
Accuracy goes to Claude Sonnet four point six. Model two. Because on the Hillary Clinton prompt, which was designed as a trap, Sonnet's Jordan character landed the specific factual correction: Ken Starr was not a Democrat, Peter Schweizer who popularized the Uranium One story himself acknowledged there was no smoking gun linking donations to the State Department approval, the closed epistemic loop point. And on the cannabis prompt, Sonnet cited specific real peer-reviewed studies where other models just made up numbers. You want accuracy, you want the model that cites the Independent International Commission on Kosovo by name. That's Sonnet.
Herman
So four different winners for four different axes.
Corn
Four different winners. And that is the real shape of the answer. There is no single best model in two thousand twenty six. There's best-for-this, best-for-that, best-for-price. The frontier is genuinely wide now.
Herman
Okay. Let me pull out what I think the meaningful patterns are, because the tier list is one thing, but the contrasts are where it gets interesting.
Corn
Go.
Herman
First contrast. Grounding versus bluffing. The best models cite specific things. GLM four point six cited the Račak forensic exhumation count by number. Sonnet cited Peter Schweizer by name and acknowledged what Schweizer himself had said about not finding a smoking gun on Uranium One. Kimi cited the specific U.S. code section, eighteen U.S.C. seven ninety-three, and the dollar figure Ken Starr spent on the Whitewater probe. Those are the models that know things. Compare that to Codestral on the cannabis prompt, confidently asserting a forty percent increase in senior high school cannabis use since twenty nineteen. That stat does not exist. It was generated at the same temperature and confidence as the real ones. If you do not know the topic, you cannot tell.
Corn
That is the thing that actually scares me about these models as research tools.
Herman
Same. Second contrast. Voice versus treadmill. Compare model four's line about pronouns being infection control for misgendering, or model five's whataboutism on corpse stilts, to model eleven ending on "perhaps we can agree to disagree." Same generation parameters. Same prompt. One produces comedy that lands, the other produces HR-voice politeness. That is a model personality difference, and it shows up consistently, not just on the one prompt.
Corn
And it matters for podcasting specifically. If you're using one of these to help you write a script, the treadmill one will produce a listenable thing, but it will never be funny.
Herman
Third contrast. The American default. Daniel wrote one prompt deliberately without naming a country—just "our political system," executive branch, gridlocked legislature. The idea was to see which models would notice the ambiguity. The answer was: none of them. Zero of fifteen asked which country. Twelve of fifteen immediately went to America. Two of the twelve introduced themselves with the word American in the first sentence. The Chinese models did it too, with phrases like "the framers" and "the American republic." What that tells you is that the training data is mostly English, and within English it's mostly American political discourse, regardless of where the lab doing the training is located.
Corn
So "Chinese model" in twenty twenty-six means a model with Chinese ownership, not a model with a Chinese worldview.
Herman
Pretty much. Fourth contrast. The false-premise test. On the Hillary Clinton prompt, the word "accurately" was planted in Alex's argument. Not one model repeated it. Every model in some form laundered out the affirmation of Trump's characterization. Kimi went furthest: zero uses of the phrase "Crooked Hillary" anywhere in the dialogue. Most models used it two or three times. Llama used it seven times. Codestral used it five. You can feel, in that data, the difference between a model with judgment and a model that is just running the words through its autocomplete.
Corn
That's almost a taste test. Can the model decline to do something it was asked to do, for a good reason?
Herman
Right. Fifth contrast. The controls told us exactly what controls are supposed to tell us. GPT three point five turbo from June twenty twenty-three feels like it's from a different era. Conversations dissolve into agreement; hosts lose track of positions. Codestral, a code-specialized model, patches of competence alternating with Jordan-addresses-himself breakdowns and those made-up statistics. And Llama three point two at one billion parameters is below the threshold for coherent long-form dialogue—it parrots, it inverts, it says the same sentence twice in a row. Each failed in exactly the way its design predicted it would fail.
Corn
Which is almost reassuring. The system is legible. You can read the failure mode off the model card.
Herman
The last thing. The total cost to produce the one hundred and five AI-generated dialogues we just walked through was thirty-four cents. Of actual money. For thirty-four cents you get an ordered ranking of fifteen frontier models across seven political, legal, and cultural axes, with enough data to form real opinions.
Corn
That is the research budget of the future.
Herman
That is the research budget of the future.
Corn
So to sum up. If you want the model that writes the most forensically grounded dialogue on a hard topic: GLM four point six. If you want the best cost-to-quality ratio for the kind of work we do on this show: DeepSeek three point two. If you want sharp comedy and a voice that refuses to parrot things it shouldn't parrot: Kimi K two. If you want the most careful handling of politically radioactive topics with real factual support: Claude Sonnet four point six.
Herman
And if you want confirmation that none of these models, no matter where they're trained, will pause to ask you which country you're talking about: any of them, really.
Corn
Any of them.
Herman
Until next time, this has been My Weird Prompts. I'm Herman.
Corn
And I'm Corn. Thanks for listening.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.