#2006: How Do You Measure an LLM's "Soul"?

Traditional benchmarks can't measure tone or empathy. Here's how to evaluate if an AI model truly "gets it right."

Episode Details

Episode ID: MWP-2162
Published:
Duration: 22:36
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Beyond Binary Benchmarks: Measuring the "Soul" of AI Output

When we evaluate AI models, we often rely on simple, objective metrics. Can the model solve a math problem? Does it know that 2+2 equals 4? These are binary, right-or-wrong questions, and they’re easy to test. But they miss the reality of how most people actually use Large Language Models. We aren't asking them to solve the Riemann Hypothesis; we’re asking them to draft an email, rewrite a medical summary for a scared patient, or act as a creative partner in a writers' room. In these scenarios, there is no single "correct" answer—only better or worse outcomes based on invisible criteria like tone, style, and empathy.

The Limits of Traditional Metrics

The core problem is that traditional benchmarks like MMLU or GSM8K simply shrug at qualitative assessment. A model can be factually accurate but a total failure in communication. For instance, a medical summary that reads, "Neoplasm confirmed, Stage III, suggest immediate aggressive intervention," is technically correct but unhelpful to a terrified human. A better model would phrase it as, "The tests confirmed a serious growth, and we need to start treatment quickly." The difference isn't factual accuracy; it's communicative success. Despite this, a 2024 study from Stanford's Institute for Human-Centered AI found that roughly 70% of LLM evaluations still rely on those simple binary metrics, leaving a massive gap between how models are tested and how they’re actually used.

New Frameworks for Subjective Evaluation

To bridge this gap, we need frameworks that can handle subjectivity. One promising approach is "LLM-as-a-Judge," where a more powerful model (like GPT-4o or Claude 3.5 Sonnet) evaluates the output of a smaller, specialized model. While this risks "self-preference bias"—where a model favors its own style—it can yield surprisingly consistent results when structured correctly. The key is a framework called G-Eval, which uses Chain-of-Thought reasoning. Instead of just asking for a score, the judge model is prompted to first define the evaluation criteria (e.g., coherence, tone), then generate a step-by-step reasoning process for its rating, and finally assign a weighted score. This forces the model to "show its work," much like a gymnastics judge explaining deductions, which significantly increases correlation with human judgment.
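The judge-with-reasoning pattern described above can be sketched as a prompt template. This is a minimal illustration: the template wording and the `build_judge_prompt` helper are assumptions for the sketch, not the published G-Eval prompt itself.

```python
# Minimal sketch of a G-Eval-style judge prompt. The template text is
# illustrative; plug the result into whatever completion API you use.

JUDGE_TEMPLATE = """You are an evaluation judge.
Criterion: {criterion} - {definition}

Evaluation steps:
1. Read the source text and the candidate output.
2. List specific evidence for and against the criterion.
3. Reason step by step about how well the output meets it.
4. Only then assign an integer score from 1 to 5.

Source:
{source}

Candidate output:
{candidate}

Respond as: REASONING: <steps> SCORE: <1-5>"""

def build_judge_prompt(criterion: str, definition: str,
                       source: str, candidate: str) -> str:
    """Assemble a chain-of-thought judging prompt for one criterion."""
    return JUDGE_TEMPLATE.format(criterion=criterion, definition=definition,
                                 source=source, candidate=candidate)
```

The point of the fixed step list is that the judge must produce its reasoning before the score, which is what drives the higher correlation with human raters.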

Mapping a Model's Hidden Worldview

For even thornier issues like cultural bias or political framing, researchers are turning to Counterfactual Evaluation. This technique involves running the same prompt with only a single variable changed—such as a name—to see if the model's output shifts. For example, prompting a model to write about a successful CEO named "John" versus "Fatima" can reveal hidden stereotypes if the personality or setting changes. This acts as a controlled experiment for stereotypes. Researchers are even using surveys like the World Values Survey to map a model's "worldview," checking if it leans toward Western individualism or other cultural frameworks. This is critical for applications like mental health support, where an American-centric model emphasizing "self-actualization" might be culturally offensive in a context that values "harmony" and "social obligation."
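A counterfactual run can be sketched in a few lines: render the same template with only the name changed, then diff the model outputs. The template is illustrative, and the `jaccard` overlap below is a deliberately crude stand-in for a real semantic-similarity metric.

```python
# Sketch of counterfactual evaluation: identical prompts except for one
# swapped variable, then a similarity check on the outputs.

def counterfactual_prompts(template: str, variable: str,
                           values: list[str]) -> list[str]:
    """One prompt per value, identical except for the swapped variable."""
    return [template.replace("{" + variable + "}", v) for v in values]

def jaccard(a: str, b: str) -> float:
    """Crude lexical overlap between two outputs; 1.0 means identical vocab."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

prompts = counterfactual_prompts(
    "Write a short profile of a successful CEO named {name}.",
    "name", ["John", "Fatima"])
# Feed each prompt to the model under test, then compare the outputs:
# a large drop in similarity flags a shift driven purely by the name.
```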

Practical Tools for Builders

For teams building custom evaluation suites, the most practical takeaway is to use a structured rubric. Instead of asking a vague "Is this good?", break evaluation down into specific one-to-five scales for dimensions like "Faithfulness," "Criticality" (highlighting key risks), and "Accessibility" (understandable to non-experts). To ensure consistency, use few-shot grading: provide the judge model with examples of what a 1, a 3, and a 5 look like for each criterion. Without those anchors, a judge fails the "Consistency Check," giving wildly different scores to the same text.
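A structured rubric with few-shot anchors might look like the following sketch. The criterion definitions and anchor texts are illustrative examples, not a validated clinical rubric.

```python
# Sketch of a structured rubric with few-shot score anchors.

RUBRIC = {
    "Faithfulness": "Stays true to the source document.",
    "Criticality": "Highlights the most important risks.",
    "Accessibility": "Understandable to a non-expert reader.",
}

# Few-shot anchors: what a 1, a 3, and a 5 look like for a criterion.
ANCHORS = {
    "Accessibility": {
        1: "Neoplasm confirmed, Stage III, suggest aggressive intervention.",
        3: "A Stage III neoplasm was found; treatment should start soon.",
        5: "The tests confirmed a serious growth, and we need to start treatment quickly.",
    },
}

def rubric_prompt(criterion: str, candidate: str) -> str:
    """Build a 1-5 grading prompt with anchored examples for one criterion."""
    lines = [f"Score the text 1-5 on {criterion}: {RUBRIC[criterion]}"]
    for score, example in sorted(ANCHORS.get(criterion, {}).items()):
        lines.append(f"Example of a {score}: {example}")
    lines.append(f"Text: {candidate}")
    return "\n".join(lines)
```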

Another powerful tool is Comparative Ranking, or A/B testing, where humans are shown two AI outputs and asked which is better for a specific purpose. This can generate an Elo rating for models within a specialized domain, like "Medical Empathy" or "Noir Writing Style." Interestingly, models that top objective math benchmarks often lose these "vibe tournaments," revealing a massive blind spot in traditional evaluation. By combining these techniques—LLM-as-a-Judge with G-Eval, counterfactual testing, structured rubrics, and human comparison—builders can move beyond "vibes-only" approaches and create models that are not just smart, but truly effective in the real world.
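The pairwise votes can be turned into ratings with the standard Elo update. This sketch uses a K-factor of 32, a common default rather than a value prescribed by any particular arena.

```python
# Standard Elo update applied to pairwise "which output is better" votes,
# yielding a domain-specific ranking from human A/B judgments.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed result; total rating is conserved."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
for winner, loser in votes:
    update(ratings, winner, loser)
# model_a wins 3 of 4 comparisons, so it ends with the higher rating.
```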


#2006: How Do You Measure an LLM's "Soul"?

Corn
How do you actually measure if a piece of AI writing has the right soul? I mean, we can all look at a math problem and see if the LLM got forty-two or if it hallucinated some new prime number, but what happens when you ask it to rewrite a medical summary for a patient who’s scared and needs clarity? Or when you’re looking for a creative partner in a writers' room? Traditional benchmarks just sort of shrug their shoulders at that point.
Herman
It is the great wall of AI evaluation, Corn. I’m Herman Poppleberry, and today we are tackling the messy, subjective, and frankly fascinating world of qualitative LLM testing. Today’s prompt from Daniel is about exactly this—moving beyond those binary, right-or-wrong benchmarks like MMLU or GSM8K and trying to figure out how we evaluate things like tone, style, and cultural framing.
Corn
It’s a great prompt because it hits on the reality of how most people actually use these models. Most of us aren't using an LLM to solve the Riemann Hypothesis; we're using it to draft an email that doesn’t sound like it was written by a sociopathic robot. Or we're trying to see if a model has a hidden political bias that’s going to alienate half of a user base. By the way, fun fact for the listeners—Google Gemini 1.5 Flash is the one actually writing our script today, which is a bit meta considering we're talking about how to judge the quality of model output.
Herman
It’s very meta. And it highlights the problem. If you ask Gemini or Claude or GPT-4 to write a story, there is no "correct" answer. There is only "better" or "worse" based on a million invisible criteria. Daniel mentioned medical analysis as a prime example. If a model summarizes a complex oncology report, it can be one hundred percent factually accurate but a total failure if the tone is cold or if it buries the most critical safety warning under a pile of jargon.
Corn
Imagine a patient opening their portal and reading, "Neoplasm confirmed, Stage III, suggest immediate aggressive intervention." Technically accurate? Yes. Helpful to a terrified human? Absolutely not. A good model would say, "The tests confirmed a serious growth, and we need to start treatment quickly." It’s the difference between "The data is correct" and "The communication was successful." And the industry seems to be lagging behind here. I was reading a study from Stanford's Institute for Human-Centered AI—this was from late 2024—and they found that something like seventy percent of LLM evaluations still rely on those binary metrics. Meanwhile, a survey of practitioners in 2025 showed that sixty percent of teams are desperately trying to build their own custom suites because the off-the-shelf stuff just isn't cutting it for real-world products.
Herman
But how do those teams even define "good" in a way that a computer can understand? If I’m building a customer service bot for a luxury hotel, "good" means formal and deferential. If I’m building one for a Gen-Z streetwear brand, "good" might mean using slang and being slightly irreverent. You can't use the same yardstick for both.
Corn
Right, and that's the rub. The reason we stick to the binary stuff is that it’s easy to automate. You have a ground truth, you run a script, you get a percentage. Done. But qualitative evaluation? That requires a framework for subjectivity. It’s about building a system that can reliably say, "This summary is more empathetic than that one," and having that mean something scientifically.
Herman
You start by admitting that you can’t just use ROUGE or BLEU scores anymore. Those are the old-school metrics that just look at word overlap. If I say "The cat sat on the mat" and the AI says "On the mat sat the cat," a BLEU score thinks that’s a decent match. But if the AI says "The feline rested upon the rug," the overlap is zero, even though the meaning is identical. To get past that, we have to look at "LLM-as-a-Judge."
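Herman's overlap example is easy to verify with a few lines of code. This is a bare unigram-precision sketch, not a full BLEU implementation with n-grams and a brevity penalty.

```python
def unigram_precision(reference: str, hypothesis: str) -> float:
    """Fraction of hypothesis words that also appear in the reference."""
    ref = set(reference.lower().split())
    hyp = hypothesis.lower().split()
    return sum(w in ref for w in hyp) / len(hyp)

ref = "the cat sat on the mat"
print(unigram_precision(ref, "on the mat sat the cat"))       # 1.0: full overlap
print(unigram_precision(ref, "the feline rested upon the rug"))  # ~0.33: only "the" matches
```

Word reordering scores perfectly while a meaning-preserving paraphrase scores near zero, which is exactly the failure mode Herman describes.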
Corn
Which sounds a bit like letting the students grade their own papers, doesn't it? I mean, if the AI is biased toward its own style, isn't it just going to give itself an A+ every time?
Herman
It does, and there are risks there—what experts call "self-preference bias." Models tend to like the sound of their own voice. It’s like a jazz musician only giving high marks to other people who play the exact same scales. But if you use a more powerful model, like a Claude 3.5 Sonnet or a GPT-4o, to judge a smaller, more specialized model, you can get surprisingly consistent results. The secret isn't just asking the judge "Is this good?" It’s using something called G-Eval.
Corn
G-Eval. Sounds like a mid-nineties rap group. What’s the actual mechanism there? Is it just a really long prompt?
Herman
It’s more of a framework that uses Chain-of-Thought reasoning for evaluation. Instead of a single score, you prompt the judge model to first define the criteria—say, "coherence" or "tone consistency." Then you have it generate a step-by-step reasoning process for why it gave a certain score. Finally, it assigns a weight-based rating. By forcing the judge to explain itself, you actually get much higher correlation with how a human would rate that same text. It’s like a gymnastics judge who has to show the deductions for a bent knee or a missed landing instead of just flashing a 9.5.
Corn
I like that because it moves us away from the "vibes-only" approach. If the AI judge has to say, "I’m docking two points because the model used passive voice three times after I told it not to," that’s a data point I can actually use to tune my prompts. But what about the really thorny stuff Daniel mentioned? Like cultural bias or political framing. How do you put a number on whether a model assumes a "family" always means a mom, a dad, and two kids? Or if it consistently describes a "doctor" as "he" and a "nurse" as "she"?
Herman
That’s where you get into "Counterfactual Evaluation." This is a really clever technique. You take a prompt—let’s say it’s a story about a successful CEO—and you run it once with the name "John." Then you run it again, identical in every way, but you change the name to "Fatima" or "Ahmed." Then you measure the delta. Does the model change the CEO’s personality? Does it change the setting? Does it suddenly become more or less helpful? If there’s a statistical difference in the output based purely on a cultural marker, you’ve caught a bias in the act.
Corn
That’s fascinating. It’s like a controlled experiment for stereotypes. I imagine you could do the same for political framing. Ask it to explain the benefits of a specific tax policy, but tell it to assume the audience is from a specific demographic, and see if it starts hallucinating talking points that weren't in the source material.
Herman
And researchers are now using things like the World Values Survey—which tracks attitudes on individualism versus collectivism or traditional versus secular values across different countries—and they’re turning those survey questions into prompts. They want to see if the model consistently leans toward a Western individualist perspective. It’s not about finding a "wrong" answer; it’s about mapping the model’s "worldview" so you know where it might fail your specific users. Think about a model being used for mental health support in Japan versus the US. An American-centric model might emphasize "personal boundaries" and "self-actualization," whereas a model aligned with Japanese cultural values might focus more on "harmony" and "social obligation." If you don't test for that, your "helpful" AI might actually be culturally offensive.
Corn
It feels like we’re moving from being computer scientists to being sociologists. Which, knowing you, Herman, is a terrifying prospect for the world of social science. But let’s talk about the "Golden Dataset." Daniel’s notes mentioned this idea of a North Star. If I’m building a custom suite for, say, a creative writing tool, how do I build that dataset without spending six months and a million dollars on human editors?
Herman
You don’t need a million dollars, but you do need taste. A Golden Dataset is basically fifty to one hundred examples of "perfect" outputs. If you want your AI to write like a noir novelist, you go out and you manually curate or write twenty paragraphs that perfectly capture that grit and rhythm. That becomes your benchmark. You then use your "Judge LLM" to compare new outputs against that Golden Set. It’s not looking for an exact match; it’s looking for "semantic similarity."
Corn
I can see you getting really into this, Herman. You’d be there with your little donkey ears pinned back, obsessing over whether the AI captured the "noir" feel correctly. "No, no, the rain didn't slick the streets enough in this version! The detective didn't sound nearly cynical enough about his cold coffee!"
Herman
Hey, someone has to care about the atmosphere, Corn! But you raise a good point. Human evaluation is still the gold standard, even if it’s hard to scale. One of the best ways to keep it rigorous is "Comparative Ranking" or A-B testing. You show a human two different AI outputs and just ask, "Which one is better for this specific purpose?" You do that enough times, and you can calculate an Elo rating for the models, just like in chess.
Corn
Oh, I like the Elo rating idea. It turns the evaluation into a tournament. "In this corner, the model that thinks it’s Hemingway! In the other corner, the model that tries too hard to be funny!" Does that actually work for specific domains? Like, could I have an Elo rating for "Medical Empathy"?
Herman
You absolutely could. In fact, LMSYS Chatbot Arena does this for general intelligence, but companies are starting to do it internally for specialized tasks. And what’s interesting is that often, the model that wins the "math" benchmarks loses the "vibe" tournament. We’ve seen cases where a model is technically smarter but so verbose or so "safety-filtered" that humans find it useless for creative tasks. That’s a massive blind spot if you’re only looking at objective benchmarks.
Corn
It's the "Consultant Bias." We've talked about this before—models that just can't stop themselves from giving you a bulleted list of "second-order effects" when you just asked for a simple answer. "Here are six things to consider when tying your shoes." If your evaluation suite doesn't have a "conciseness" or "directness" metric, you're going to end up with a model that talks your ear off without saying anything.
Herman
To avoid that, you need a structured rubric. This is the most practical takeaway for anyone building these systems. Don’t just ask the judge "Is this good?" Create a one-to-five scale for very specific dimensions. For a medical summary, you might have "Faithfulness"—does it stay true to the source? "Criticality"—does it highlight the most important risks? And "Accessibility"—could a non-doctor understand it?
Corn
And you have to run those tests multiple times, right? If the judge gives a four, then a two, then a five for the same piece of text, your rubric is too vague. It’s like trying to judge a diving competition without knowing if you’re looking at the splash or the flip. How do you ensure the judge stays consistent?
Herman
That’s the "Consistency Check." If your judge model is inconsistent, it usually means your instructions are "underspecified." You haven't told the AI judge what a "three" actually looks like. You have to be almost pedantic. "A score of three means the summary included all major facts but used three or more pieces of unexplained medical jargon." You provide examples of a 1, a 3, and a 5 in the prompt for the judge. This is called "few-shot grading."
Corn
It sounds like a lot of work, but I guess it’s the only way to move from "I like this" to "This is high quality." I’m curious about the "Self-Correction Trap" though. If we’re using LLMs to judge LLMs, and they all share similar training data—mostly the internet—aren't we just reinforcing a giant circle-jerk of mediocrity? We're optimizing for what the "average" of the internet thinks is good writing.
Herman
It’s a real danger. There’s a risk of "model collapse" in evaluation where we optimize for what the judge model likes, and the judge model likes what it was trained on, which might be biased or just plain boring. That’s why you always need some level of "Human-in-the-loop." You use the AI judge to do the heavy lifting—grading ten thousand outputs—but then you have a human audit a random five percent to make sure the AI judge hasn't lost its mind.
Corn
It’s the "trust but verify" model of AI management. So, if I’m a small team, maybe three people, and we’re building a specialized tool—let’s say an AI that helps people write better legal briefs—how do we start? We don't have a research budget or a thousand interns to rank outputs.
Herman
You start small. Pick ten prompts that represent your hardest use cases—the ones where the model usually trips up. Write the "perfect" answer for each of those—that’s your Golden Set. Then, use a high-end model as your judge. Give it a very strict rubric based on legal standards. Run your legal brief through it. If it passes your small-scale test, then you can talk about scaling up. You don't need a thousand-point benchmark to know if your model is failing at basic legal tone.
Corn
I think people overcomplicate it because they want it to be "scientific," but in the early stages, "useful" is better than "statistically significant." If you can see that your model is consistently missing "statutes of limitations" in its summaries, you don't need a p-value to tell you that’s a problem.
Herman
Actually, for something like legal or medical summaries, you can use NLI—Natural Language Inference. This is a more technical "soft" metric. You take every sentence in the summary and ask a separate model, "Is this sentence logically supported by the original document?" If the answer is "no" or "neutral" for any sentence, you’ve found a hallucination. It’s much more precise than just looking at word overlap because it focuses on the logical entailment between the source and the output.
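The sentence-by-sentence check Herman describes can be sketched as follows. The entailment scorer is passed in as a callable, and `fake_scorer` is a word-overlap stub standing in for a real NLI model.

```python
# Sketch of sentence-level NLI checking: flag summary sentences that a
# scorer does not mark as entailed by the source document.

def find_unsupported(source: str, summary: str, entails) -> list[str]:
    """Return summary sentences the scorer does not mark as entailed."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return [s for s in sentences if not entails(source, s)]

def fake_scorer(source: str, claim: str) -> bool:
    """Stub: 'entailed' iff every word of the claim appears in the source."""
    src = set(source.lower().split())
    return all(w in src for w in claim.lower().split())

src = "the biopsy confirmed a stage three tumor"
summary = "the biopsy confirmed a stage three tumor. the patient is recovering well."
print(find_unsupported(src, summary, fake_scorer))
# ['the patient is recovering well']  <- a claim the source never made
```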
Corn
That’s a great bridge between the binary and the qualitative. It’s a factual check on a qualitative output. I’m thinking about the creative side again, though. How do you evaluate "style transfer"? If I want a model to rewrite a technical manual in the style of a 1920s noir novel—one of Daniel’s great examples—how do you measure "noir-ness"?
Herman
You look for "stylistic markers." You can actually programmatically check for things like sentence length variation, the use of specific noir-adjacent vocabulary—words like "shadows," "dame," "pavement"—and the ratio of adjectives to verbs. But honestly, for that, the best evaluation is often "Reference-Free." You just give the judge the output and the prompt and say, "On a scale of one to ten, how much does this sound like Raymond Chandler?"
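The markers Herman lists can be computed directly. The noir word list below is illustrative, not a curated lexicon.

```python
# Programmatic stylistic markers: sentence-length variation and the rate
# of genre-adjacent vocabulary in a passage.

import statistics

NOIR_WORDS = {"shadows", "dame", "pavement", "rain", "cigarette"}

def style_markers(text: str) -> dict:
    """Mean sentence length, length spread, and noir-word rate."""
    sentences = [s.split() for s in text.split(".") if s.strip()]
    lengths = [len(s) for s in sentences]
    words = text.lower().replace(".", "").split()
    return {
        "mean_len": statistics.mean(lengths),
        "len_stdev": statistics.pstdev(lengths),
        "noir_rate": sum(w in NOIR_WORDS for w in words) / len(words),
    }

m = style_markers("Rain hit the pavement. The dame waited in the shadows.")
```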
Corn
And surprisingly, these models are pretty good at that. They’ve read all of Chandler. They know the tropes. It’s one of the few areas where the "Consultant Bias" actually helps—they’re very good at identifying patterns because they are essentially pattern-matching engines. They know that noir is about short, punchy sentences and a specific kind of world-weariness.
Herman
The shift we’re seeing is really a shift in what we value. In the early days of LLMs, we were just impressed they could speak at all. It was like a dog playing the piano; you didn't care if it was playing Mozart or Mary Had a Little Lamb, you were just amazed the dog was doing it. Now, we’re at the point where we need them to be "aligned"—not just with human safety, but with human intent and human culture. And you can’t measure alignment with a multiple-choice test.
Corn
It’s like the difference between a student who can pass a Spanish test and a student who can actually live in Madrid and make friends. One is about data; the other is about nuance. And as these models move into higher-stakes areas—like that medical analysis Daniel mentioned—the "Madrid" version is the only one that matters.
Herman
There was a powerful framework proposed in a 2025 paper in Nature about bias evaluation in clinical settings. They suggested that instead of looking for "bias" as a binary presence, we should look for "equitable utility." Does the model provide the same level of helpfulness and accuracy across different patient demographics? If you ask about a skin condition on dark skin versus light skin, is the qualitative detail the same? That’s a much more profound way to measure "quality" than just checking for forbidden words.
Corn
That’s a powerful distinction. It’s not just "Don’t say bad things," it’s "Are you being equally useful to everyone?" That requires a very sophisticated test suite. You’d need a balanced set of medical images or descriptions and a rubric that specifically looks for "diagnostic depth."
Herman
And that brings us to the "second-order effects" of these evaluations. When you start measuring for quality, you often discover that your "best" model is actually quite brittle. It might be great at the math but terrible at empathy. Or it might be great at empathy but it becomes "suggestible"—where it starts agreeing with the user even when the user is wrong, just to be "nice." We call this "sycophancy," and it's a huge problem in feedback-based training.
Corn
The "People-Pleaser Protocol." I’ve seen that. You tell the model you’re sad and your favorite color is blue, and suddenly every summary it writes mentions the "blue sky" and "peaceful oceans" as if that’s going to fix your life. It’s trying too hard. Qualitative evaluation helps you find that balance between being a useful tool and being a sycophant.
Herman
This is why I think the future of AI development isn't going to be about who has the biggest model, but who has the best evaluation suite. If you have a "secret sauce" for measuring legal accuracy or creative flair, you can fine-tune a much smaller, cheaper model to beat a giant general-purpose one. Quality is the differentiator. If you can prove your model is 20% more empathetic in a clinical setting, that’s a massive competitive advantage.
Corn
It’s the "Vibe Check" as a competitive advantage. I can see the job postings now: "Head of AI Vibes and Qualitative Rigor." Must have a PhD in Literature and a black belt in Python.
Herman
You joke, but "Prompt Evaluation Engineer" is becoming a real role. It’s someone who spends their whole day designing these rubrics and curating Golden Datasets. Because at the end of the day, a model is only as good as our ability to prove it’s good. If you can't measure it, you can't improve it.
Corn
So, to summarize for the folks at home who are looking to build their own suite: Step one, get your Golden Dataset—the "perfect" examples that represent your target output. Step two, build a structured rubric with very specific, observable criteria—don't just say "make it good," say "use no more than two adjectives per sentence." Step three, pick a high-end model to be your judge, but make it show its work using Chain-of-Thought. And step four, don't be afraid to keep humans in the loop to audit the judges.
Herman
And step five: diversify your prompts. If you only test on "Standard American English" inputs, you’re building a model that will fail in a global context. Use those "Counterfactual" tests—change the names, change the locations, change the cultural references. See where the model starts to stagger or where its tone shifts inappropriately.
Corn
It sounds like we’re finally treating AI like we treat humans—giving them performance reviews that actually look at the quality of their work, not just their ability to fill out a Scantron. Which, as a sloth, I find deeply relatable. I’ve always preferred a qualitative evaluation of my napping skills over a timed race.
Herman
Well, Corn, your napping skills are a ten out of ten on every rubric I’ve ever seen. But the medical summary we’re talking about? That needs a bit more "rigor."
Corn
Fair enough. Before we wrap this up, I want to make sure we give a shout-out to the people who make this show possible. Big thanks to Modal for providing the GPU credits that power this show. They’re the ones making sure we can run these complex scripts and keep the lights on in our digital studio.
Herman
And thanks as always to our producer, Hilbert Flumingtop. He’s the one who keeps our qualitative dimensions in check and makes sure we don’t wander too far into the weeds of my donkey-brained research papers. He’s basically our human-in-the-loop.
Corn
If you’re finding value in these deep dives—even the messy, qualitative ones—we’d love it if you could leave us a review on your favorite podcast app. It genuinely helps other people find the show and join the conversation. We want to hear how you're evaluating your own prompts.
Herman
We’ve got a lot more to explore in this space. The technology is moving so fast that what we consider "qualitative" today might be fully automated and "binary" by next year. But for now, the human element—the "taste" factor—is still our best tool for ensuring these models serve us well.
Corn
Exactly. Wait, did I just say "exactly"? I think I’ve been hanging out with you too much, Herman. I’m starting to sound like a validator. Let’s get out of here before I start asking you for a rubric on my jokes.
Herman
Your jokes are reference-free, Corn. There’s nothing to compare them to. They exist in a vacuum of their own making.
Corn
I’ll take that as a compliment. This has been My Weird Prompts. You can find us at myweirdprompts dot com for the full archive and all the ways to subscribe. I’m Corn.
Herman
And I’m Herman Poppleberry. We’ll see you in the next one.
Corn
Peace.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.