#2650: How to Catch an LLM's Bad Writing Habits

A practical guide to analyzing podcast transcripts for repetitive language and dialogue patterns — from Python word counts to embedding clustering.

Episode Details
Episode ID: MWP-2809
Duration: 29:00
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

This episode tackles a practical question from listener Daniel: how do you systematically analyze a corpus of podcast transcripts to catch what a script-writing LLM overdoes? The goal is a feedback loop — use analysis to tune the system prompt, generate better scripts, then measure whether the next batch actually improved.

The discussion walks through a full spectrum of techniques, starting with basic corpus linguistics. Using Python with NLTK or spaCy, you can extract unigram, bigram, and trigram frequency distributions from a hundred transcripts in about an hour. This surfaces verbal tics like overused filler words ("actually," "kind of"), repetitive constructions ("you know," "the thing is"), and sentence-level patterns ("the reality is that"). LLMs are especially prone to these because they lack a human writer's fatigue signal — they'll happily deploy "moreover" twelve times in a row.

The next layer moves from tokens to structure: sentence length distribution, type-token ratio for vocabulary range, sentence-start patterns. But the real transition point comes when you stop asking "what words repeat" and start asking "what ideas repeat." This requires semantic analysis via embeddings. By chunking transcripts and embedding each segment with a sentence transformer model, you can cluster them in semantic space using cosine similarity. This catches structural repetitions that frequency counts miss — like forty percent of episode openings falling into the same semantic cluster even when the exact words differ.

For the analysis loop, the episode recommends a three-phase pipeline. Phase one is diagnostic: frequency analysis, embedding clustering, and LLM-as-judge passes using a different model family than the generator to avoid shared blind spots. Phase two turns findings into targeted prompt edits — three to five root-cause fixes with concrete examples, not abstract prohibitions. Phase three measures the next batch against the previous one, comparing distributions rather than individual episodes, with twenty to thirty episodes needed for statistical confidence.

#2650: How to Catch an LLM's Bad Writing Habits

Herman
By the way, today's episode is powered by DeepSeek V four Pro. Not sure what that means for my accent, but here we are.
Corn
Hopefully it means fewer of your tangents about Connecticut zoning laws. So Daniel sent us this one — and it's a practical one. He wants to know how you'd systematically analyze a corpus of podcast transcripts to catch what the script-writing agent overdoes. Repeated words, stale jokes, dialogue patterns that need more variety. The goal is a feedback loop: use the analysis to tune the system prompt, generate better scripts, then measure whether the next batch actually improved. He wants the full spectrum — from quick Python word counts all the way to embedding-based clustering and LLM-as-judge passes. And he's asking about the tradeoffs: when is simple enough, when do you need the heavy machinery, and how do you avoid optimizing for metrics that don't actually make the content better.
Herman
This is one of those questions where the doing is straightforward but the thinking around it is where everything lives. You can absolutely build a pipeline that surfaces "genuinely" appearing four hundred times across a hundred episodes. That's twenty minutes in Python. The hard part is deciding what to do with that information.
Corn
Whether that information actually means anything. I've seen people optimize their way into perfectly bland prose because they chased frequency counts without asking whether the repetition was actually a problem.
Herman
Let's walk through the stack bottom to top, because the spectrum Daniel's asking about is real and each layer answers a different question. The quick-and-dirty tier — and I mean this affectionately, this is where you should start — is basic corpus linguistics tooling. You take your hundred transcripts, strip the speaker labels, tokenize, and run frequency distributions. Python, NLTK or spaCy, maybe an hour of setup if you know what you're doing.
Corn
What are you actually extracting at that level?
Herman
Unigrams, bigrams, trigrams. You filter out stopwords, normalize case, and suddenly you've got a ranked list. Top unigrams might be things like "actually," "genuinely," "kind of," "I mean." Bigrams surface repetitive constructions — "you know," "the thing is," "it's almost like." Trigrams catch sentence-level tics. I ran something similar on a different corpus a while back and found the phrase "the reality is that" appearing something like eighty times across sixty documents. The writer had no idea.
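
A minimal sketch of this first tier, assuming a transcripts/ folder of plain-text files with speaker labels already stripped; the file layout and stopword choices are placeholders, not something the episode specifies:

```python
from collections import Counter
from pathlib import Path

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
for path in Path("transcripts").glob("*.txt"):  # assumed corpus layout
    # Lowercase and keep only alphabetic tokens.
    toks = [w for w in nltk.word_tokenize(path.read_text().lower()) if w.isalpha()]
    unigrams.update(w for w in toks if w not in STOP)  # drop stopwords
    bigrams.update(ngrams(toks, 2))
    trigrams.update(ngrams(toks, 3))

print(unigrams.most_common(20))   # verbal tics: "genuinely", "actually", ...
print(bigrams.most_common(20))    # constructions: ("you", "know"), ...
print(trigrams.most_common(20))   # sentence-level tics
```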
Corn
That's the thing with these verbal tics — they're invisible to the person producing them. You don't notice your own filler words until someone points them out.
Herman
LLMs are worse about this than humans because they don't have a fatigue signal. A human writer gets bored of using the same transition phrase. An LLM will happily deploy "moreover" twelve times in a row if the sampling parameters don't nudge it away.
Corn
Frequency analysis surfaces the obvious stuff. What's the next layer?
Herman
The next layer is where you start looking at structure rather than just tokens. Sentence length distribution, dialogue turn length, lexical diversity scores. Type-token ratio — unique words divided by total words — gives you a rough measure of vocabulary range. If your agent's type-token ratio is dropping across batches, that's a signal it's settling into a narrower vocabulary. You can also look at sentence-start distributions. Does every other sentence begin with "So" or "But" or "And"? That's a pattern you can catch with a simple regex pass.
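
A sketch of those structural measures with spaCy; here `transcript_text` stands in for one loaded transcript, and the model choice follows the episode's later recommendation:

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_md")  # medium English model

def structural_stats(text):
    doc = nlp(text)
    words = [t.text.lower() for t in doc if t.is_alpha]
    ttr = len(set(words)) / len(words)  # type-token ratio
    # How often does a sentence open with "So", "But", "And", ...?
    starts = Counter(sent[0].text for sent in doc.sents)
    lengths = [sum(t.is_alpha for t in sent) for sent in doc.sents]
    return ttr, starts.most_common(10), lengths

ttr, top_starts, sentence_lengths = structural_stats(transcript_text)
print(f"type-token ratio: {ttr:.3f}")
print("most common sentence openers:", top_starts)
```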
Corn
So far we're still in "Python script on a laptop" territory. When does this graduate to needing something heavier?
Herman
The transition point is when you stop asking "what words repeat" and start asking "what ideas repeat." Frequency counts will tell you that "genuinely" is overused. They won't tell you that seventeen episodes open with a variation on "let's unpack that" or that the dynamic between the hosts keeps landing on the same emotional beat — Corn makes a dry observation, Herman gets excited, Corn undercuts it. That's a structural repetition, not a lexical one.
Corn
Catching that requires semantic analysis, not just token matching.
Herman
This is where embeddings come in. You take each transcript, chunk it into segments — say, paragraphs or groups of five to ten dialogue turns — and embed each chunk using something like a sentence transformer model. Now you've got vectors in semantic space, and you can cluster them. Cosine similarity, hierarchical clustering, HDBSCAN if you want density-based grouping that doesn't force every chunk into a cluster.
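
A sketch of that embedding-and-clustering step; `chunks` is assumed to be the list of transcript segments, and the model name is one reasonable default rather than a prescription:

```python
import hdbscan
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)

# On unit-normalized vectors, Euclidean distance is monotonic in cosine
# distance, so density clustering here approximates cosine clustering.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(embeddings)

for label in sorted(set(labels)):
    if label == -1:
        continue  # HDBSCAN's noise bucket: chunks that fit no cluster
    members = [c for c, l in zip(chunks, labels) if l == label]
    print(f"cluster {label}: {len(members)} chunks, e.g. {members[0][:80]!r}")
```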
Corn
Walk me through what that actually surfaces in practice.
Herman
Let's say you cluster all the episode openings. You might find that forty percent of them fall into a tight semantic cluster — similar framing, similar rhetorical moves, similar energy. The embeddings pick that up even when the exact words differ. An opening that says "So Daniel sent us this one about nuclear fusion" and another that says "Here's an interesting prompt from Daniel on quantum computing" — those are lexically different but semantically close. If your embedding model is decent, they'll cluster.
Corn
Which tells you the agent has a default opening template it's leaning on even when it varies the surface wording.
Herman
That's something frequency counts completely miss. You can extend this to any recurring segment — closings, transitions, the way the agent handles disagreement between hosts. Cluster the segments, inspect the clusters, and you'll see patterns that aren't visible at the word level.
Corn
What about the tooling for this? Are we talking Pinecone and a vector database, or can you do this locally?
Herman
For a hundred episodes, you absolutely do not need a vector database. A hundred transcripts, chunked into say five segments each, that's five hundred vectors. You can store those in a NumPy array and do cosine similarity in memory. Even with a decent embedding model, you're looking at maybe ten minutes of compute on a laptop with a GPU, or half an hour on CPU. The complexity here isn't infrastructure — it's knowing what to do with the clusters once you have them.
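
The in-memory version really is a few lines of NumPy, reusing the embeddings from the sketch above:

```python
import numpy as np

emb = np.asarray(embeddings, dtype=np.float64)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
similarity = emb @ emb.T  # full pairwise cosine similarity matrix

# Find the most similar pair of distinct chunks.
np.fill_diagonal(similarity, -1.0)
i, j = np.unravel_index(similarity.argmax(), similarity.shape)
print(f"chunks {i} and {j} look like near-duplicates ({similarity[i, j]:.2f})")
```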
Corn
That's the part that interests me. You've got your clusters.
Herman
Now you need a human — or an LLM — to look at representative examples from each cluster and characterize what's happening. This is where LLM-as-judge becomes useful. You pull, say, five random chunks from a cluster, feed them to a capable model, and ask: "What do these passages have in common? What rhetorical strategy is being repeated? Is this effective or formulaic?" The model gives you a qualitative read that would take a human hours to produce across dozens of clusters.
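
A sketch of that judge pass; the OpenAI client is just one option, and the model name is an assumption, not a recommendation from the episode:

```python
import random

from openai import OpenAI

client = OpenAI()

def judge_cluster(member_chunks, n_samples=5):
    samples = random.sample(member_chunks, min(n_samples, len(member_chunks)))
    prompt = (
        "What do these passages have in common? What rhetorical strategy "
        "is being repeated? Is it effective or formulaic?\n\n"
        + "\n\n---\n\n".join(samples)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed; use a different family than the generator
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```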
Corn
You trust the LLM to be honest about LLM-generated text?
Herman
That's a fair skepticism. There's a risk of the judge model sharing the same biases as the generator model. If both are from the same family — say, both are Claude models, or both are GPT-family — they may have similar blind spots about what constitutes natural dialogue. The mitigation is to use a different model family for the judge than for the generator, or to treat the judge's output as a hypothesis to verify rather than a final verdict.
Corn
The judge says "these openings all use a question hook followed by a scope-narrowing move." That's useful, but you still need a human to decide whether that pattern is actually a problem.
Herman
That's the trap Daniel's asking about — optimizing for surface metrics that don't correlate with quality. You could tell your system prompt "vary the opening structure," and the agent starts producing openings that are structurally diverse but terrible. Maybe it opens with a meandering personal anecdote, or a dictionary definition, or something else that technically avoids the pattern but makes the episode worse.
Corn
I've seen this in content production pipelines. Someone runs an analysis, finds a pattern, adds a rule to break the pattern, and the output gets weirder in ways the analysis didn't anticipate. You fix one thing and break three others.
Herman
Goodhart's law applied to prompt engineering. When a metric becomes a target, it ceases to be a good metric. If you tell the agent "don't use the word 'genuinely' more than twice per episode," it'll just find a synonym and the underlying tic remains.
Corn
How do you actually close that loop responsibly?
Herman
You need a multi-signal approach and you need to measure at the batch level, not the episode level. Let me lay out the full pipeline as I'd build it. Phase one is the diagnostic layer — everything we just described. Frequency analysis for lexical overuse, embedding clustering for thematic repetition, LLM-as-judge for qualitative pattern identification. You run this across your hundred-episode corpus and produce a report: here are the top twenty overused words, here are the five most common structural patterns, here are the dialogue dynamics that appear in more than thirty percent of episodes.
Corn
Phase two is turning that report into prompt edits.
Herman
You don't add twenty rules. You pick maybe three to five of the most impactful findings and write prompt guidance that addresses the root cause, not the symptom. If the agent overuses "genuinely," the root cause might be that it's trying too hard to sound earnest. The fix isn't "ban this word" — it's "vary your intensifiers, and consider whether the point stands without an intensifier at all." If the openings cluster semantically, the fix isn't "use a different opening" — it's specific guidance about what makes an opening feel fresh, with examples.
Corn
Examples in the prompt are crucial here, right? Abstract instructions don't work nearly as well as showing the model what you want.
Herman
They're essential. If you say "avoid repetitive openings," the model has to guess what repetitive means. If you show three examples of varied openings alongside three examples of the pattern you're trying to break, the model has a much clearer target. Few-shot examples are basically lightweight fine-tuning without the infrastructure.
Corn
You revise the prompt with targeted guidance and concrete examples. Then what — you generate the next batch of episodes and run the same analysis again?
Herman
That's phase three — the measurement loop. You generate, say, ten new episodes with the revised prompt. Then you run the exact same diagnostic pipeline on those ten. But here's the key: you compare distributions, not individual episodes. Did the frequency of "genuinely" drop from an average of four per episode to one? Did the semantic spread of openings increase — meaning, are the new openings less tightly clustered than the old ones? You can actually quantify this with something like mean pairwise cosine distance within the opening cluster.
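
That spread metric is a one-liner once the opening chunks are embedded; `old_openings` and `new_openings` are assumed to be embedding matrices for each batch:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def mean_pairwise_distance(opening_embeddings):
    d = cosine_distances(opening_embeddings)
    # Average over the upper triangle: every distinct pair, counted once.
    return d[np.triu_indices(d.shape[0], k=1)].mean()

print(f"opening spread, old batch: {mean_pairwise_distance(old_openings):.3f}")
print(f"opening spread, new batch: {mean_pairwise_distance(new_openings):.3f}")
```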
Corn
You're looking for a shift in the aggregate, not whether episode forty-seven happened to be better than episode forty-six.
Herman
Individual episodes will always vary. The question is whether the distribution moved. And you need to be patient — a batch of ten might show a direction but not a statistically convincing shift. Twenty to thirty episodes is where you start to have confidence that the prompt change actually did something.
Corn
What about the diff-style analysis Daniel mentioned? That feels like a different angle.
Herman
Diff analysis is powerful for before-and-after comparison when you have paired data. The idea is: take an episode generated with the old prompt, then regenerate the same episode — same topic, same hosts, same constraints — with the new prompt. Now you've got two versions of what is supposed to be the same thing. You can diff them at multiple levels. Word-level diff shows you exactly which tics disappeared and which new ones appeared. Structural diff — which you can approximate by comparing the sequence of dialogue acts — shows you whether the flow changed. Semantic diff, using embeddings, tells you whether the substance shifted even when the words differ.
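
The word-level layer of that diff needs nothing beyond the standard library; `old_episode` and `new_episode` are assumed to be the paired scripts:

```python
import difflib

def word_diff(old_text, new_text):
    old_words, new_words = old_text.split(), new_text.split()
    sm = difflib.SequenceMatcher(None, old_words, new_words)
    removed, added = [], []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("delete", "replace"):
            removed.extend(old_words[i1:i2])  # tics the new prompt dropped
        if op in ("insert", "replace"):
            added.extend(new_words[j1:j2])    # tics it may have introduced
    return removed, added

removed, added = word_diff(old_episode, new_episode)
print("dropped:", sorted(set(removed))[:20])
print("introduced:", sorted(set(added))[:20])
```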
Corn
The risk there is that you're comparing one sample to one sample. The difference between two individual generations might just be noise.
Herman
Which is why you do it across a set. Take ten episodes, regenerate each with the new prompt, diff all ten pairs, and look for consistent patterns. If "genuinely" was removed in eight out of ten regenerations, that's a signal. If the semantic similarity between old and new versions is consistently high — meaning you didn't accidentally change the content — that's also good to know.
Corn
Let's talk about the tooling stack more concretely. If someone's starting this today, what are they actually installing?
Herman
For the quick-and-dirty tier: Python, spaCy with the medium or large English model, and maybe NLTK for frequency distribution utilities. spaCy handles tokenization, part-of-speech tagging, lemmatization, and n-gram extraction in a few lines. If you want to get slightly fancier, TextBlob or lexical-diversity packages can give you type-token ratios and other richness metrics without much effort.
Corn
For the embedding tier?
Herman
Sentence-transformers library, which gives you access to models like all-MiniLM-L6-v2 or all-mpnet-base-v2. These are small enough to run locally, fast enough for a hundred-episode corpus, and the embeddings are good enough for clustering and semantic similarity. For clustering, scikit-learn has everything you need — KMeans for quick grouping, or HDBSCAN from the hdbscan package if you want density-based clustering that handles noise well. For visualization, you project the embeddings down to two dimensions with UMAP and scatter-plot them. Clusters become visually obvious immediately.
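
The visualization step, assuming the embeddings and cluster labels from the earlier sketch:

```python
import matplotlib.pyplot as plt
import umap

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(embeddings)  # (n_chunks, 2)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=12)
plt.title("Transcript chunks in semantic space")
plt.savefig("clusters.png", dpi=150)
```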
Corn
The LLM-as-judge component?
Herman
That's the only part where you're calling an API. You take representative samples from your clusters, construct a prompt that asks for pattern identification, and send it to a frontier model. You can do this with Claude, GPT, Gemini — the choice matters less than the prompt construction. You want the judge to be specific: "Identify repeated rhetorical strategies, overused transition phrases, and dialogue patterns that appear formulaic. For each pattern, provide an example and rate its frequency on a scale of one to five."
Corn
How much of this can be automated end-to-end versus requiring human judgment at key decision points?
Herman
The pipeline can be fully automated. The interpretation can't be — or shouldn't be. You can schedule a weekly job that pulls the latest transcripts, runs frequency analysis, updates embedding clusters, and even generates a report with the LLM judge's observations. But the decision about which findings to act on, and how to translate them into prompt changes, requires someone who understands what makes the content good in the first place.
Corn
That's the crux of it. You need taste. The analysis surfaces data, but data doesn't tell you whether a repetition is a signature style or an irritating tic. Some repetition is voice. If I make a leaf medicine joke in three different episodes, that's not a bug — that's a running bit. The analysis can't distinguish between "this is a deliberate pattern that listeners enjoy" and "this is the model being lazy."
Herman
That's where the human in the loop earns their keep. They look at a finding like "Corn's dry undercutting of Herman's enthusiasm appears in seventy-two percent of episodes" and decide whether that's the show's identity or a rut. The analysis says what's happening. The human says whether it's working.
Corn
What about the risk of over-optimizing for novelty? If you keep pushing the model to vary its patterns, you might end up with output that's different for the sake of being different, not actually better.
Herman
This is a real failure mode in iterative prompt engineering. You end up in a cycle where each batch fixes the previous batch's problems but introduces new ones, and you're constantly chasing your tail. The way to break that cycle is to anchor your evaluation in something stable. Before you start optimizing, define what good looks like in terms that aren't just "not like the last batch." Maybe it's listener retention data, maybe it's qualitative ratings from a small panel, maybe it's a set of principles — "dialogue should feel like two people who actually know each other," "jokes should land without derailing," "explanations should reward attention without requiring a PhD."
Corn
Then you evaluate each batch against those principles, not just against the previous batch's metrics.
Herman
The frequency analysis and clustering tell you what changed. The principles tell you whether the change was good. Without principles, you're just wandering.
Corn
Let's get into a specific example. Say you run this pipeline on a hundred episodes of a podcast like ours. The analysis surfaces that the word "genuinely" appears three hundred and forty times — three point four times per episode on average. It also surfaces that episode openings cluster into basically two templates: a question hook or a "so Daniel sent us this one" framing. And the LLM judge notes that disagreements between hosts tend to resolve in exactly two exchanges, with the second host immediately conceding the point. Walk me through how you'd address each of those.
Herman
For "" — the frequency count tells you there's a problem, but you need to dig one level deeper before you fix it. Is it appearing in a specific context? Run a concordance — a keyword-in-context search — and look at the twenty words before and after each occurrence. You might find that ninety percent of the time, it's used as an intensifier before an adjective: " interesting," " surprising," " important." Now you know the pattern isn't just the word — it's the rhetorical move of intensifying an adjective with an earnestness signal. The fix goes in the prompt: "Vary your intensifiers. Use 'particularly,' 'remarkably,' 'truly,' and others in rotation. More importantly, ask whether the adjective needs an intensifier at all. 'Interesting' often stands fine on its own.
Corn
Then you might add a few examples showing what you mean.
Herman
Show a before and after. Before: "That's a genuinely fascinating finding." After: "That finding holds up well under scrutiny." The second version is stronger because it's specific about why it's interesting, not just that it's interesting.
Corn
What about the opening templates?
Herman
Two templates across a hundred episodes is a tight cluster. The fix here is structural guidance plus diversity examples. In the prompt, you might say: "Open episodes with variety. Acceptable approaches include: a provocative claim, a concrete anecdote, a question that creates tension, a surprising statistic, or jumping directly into the first point. Avoid defaulting to 'So Daniel sent us this one' more than twenty percent of the time." Then show five different opening styles that all work.
Corn
The disagreement pattern — two exchanges then concession?
Herman
That's the hardest to fix because it's a deeper structural habit. The model defaults to agreement because disagreement requires more cognitive work to write convincingly. The fix is explicit instruction: "When hosts disagree, the disagreement should last three to five exchanges. Each host should offer a substantive counterpoint, not a token objection. Resolution should feel earned, not automatic." And again, examples are everything — show a disagreement that breathes.
Corn
One thing I'm curious about: how do you prevent the analysis itself from becoming a bottleneck? If you're generating episodes weekly and the analysis takes days to run and interpret, you're always behind.
Herman
The analysis should be lightweight enough to run overnight. For a hundred-episode corpus, even the embedding and clustering should finish in under an hour on modern hardware. The LLM judge pass might take another thirty minutes depending on API latency. The human interpretation step is the variable — but if the automated report is well-structured, a human can review it in fifteen to twenty minutes and decide which findings are actionable.
Corn
The whole loop — run analysis, review findings, draft prompt edits, generate test batch — could be a weekly cadence without much pain.
Herman
The first setup takes a few days of engineering. After that, it's maintenance and judgment.
Corn
What about the tools that sit between "Python script" and "full embedding pipeline"? Are there off-the-shelf things that do some of this?
Herman
There's a whole ecosystem. For frequency and n-gram analysis, AntConc is a free corpus linguistics tool that's been around for decades — point it at a folder of text files and it gives you word lists, concordances, collocation analysis, all with a GUI. Voyant Tools is a web-based option that does similar things without installing anything. For more programmatic work, Textacy is a Python library built on spaCy that adds corpus-level operations — it can extract key terms, summarize topics, and compute readability metrics in a few lines.
Corn
For the embedding side?
Herman
BERTopic is a library that combines embeddings, dimensionality reduction, and clustering into a single pipeline specifically designed for topic modeling. You feed it documents, it gives you topics with representative words and examples. It's essentially doing what I described — embedding, UMAP projection, HDBSCAN clustering — but wrapped in a clean API. For someone who doesn't want to build the pipeline from scratch, BERTopic gets you eighty percent of the way there.
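
A sketch of that off-the-shelf route; `chunks` is the same segment list as before, and `min_topic_size` is a guess tuned for a small corpus, not a documented default:

```python
from bertopic import BERTopic

topic_model = BERTopic(min_topic_size=5)  # assumed tuning for ~500 chunks
topics, probs = topic_model.fit_transform(chunks)

# The biggest recurring themes, with representative words per topic.
print(topic_model.get_topic_info().head(10))
# A few representative chunks from the largest topic (topic 0).
for doc in topic_model.get_representative_docs(0)[:3]:
    print(doc[:120], "...")
```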
Corn
You could run BERTopic on your hundred transcripts, get topic clusters, inspect the clusters for repetitive themes, and that's your semantic analysis layer without writing custom clustering code.
Herman
Then you'd still want the LLM judge pass for qualitative interpretation of the clusters, but the heavy lifting is done.
Corn
Let's talk about a subtler problem. Say your analysis surfaces that the agent uses a lot of hedging language — "maybe," "perhaps," "it's possible that," "one might argue." But hedging, in moderation, is actually a sign of intellectual honesty. An agent that never hedges sounds arrogant and probably wrong more often. How do you distinguish between appropriate hedging and over-hedging?
Herman
This is where simple frequency counts fail and you need distributional analysis. You don't just count hedges — you look at where they appear. Are they clustered around factual claims where uncertainty is appropriate? Or are they sprinkled uniformly through the dialogue, including places where the host should be confident? You can do this with a simple heuristic: tag sentences as factual claims, opinions, or transitions, then measure hedge density per category. If factual claims and opinions are hedged at the same rate, that's a problem — opinions should be more hedged than established facts.
Corn
You'd need the LLM judge or a fine-tuned classifier to tag those sentence types.
Herman
A decent LLM can do that classification pass with high accuracy. Feed it sentences one at a time with a prompt like "Classify this sentence as FACTUAL_CLAIM, OPINION, TRANSITION, or OTHER." Run that across the corpus, cross-reference with hedge frequency, and you get a much more nuanced picture than a raw count.
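
A sketch of that cross-referencing step; `classify_sentence` is a hypothetical wrapper around whichever judge model you call, and the hedge lexicon is a placeholder:

```python
from collections import Counter, defaultdict

HEDGES = {"maybe", "perhaps", "possibly", "arguably", "might", "could"}

def hedge_rate_by_category(sentences):
    hedged, total = defaultdict(int), Counter()
    for sent in sentences:
        # Hypothetical LLM call returning "FACTUAL_CLAIM", "OPINION", etc.
        category = classify_sentence(sent)
        total[category] += 1
        if HEDGES & set(sent.lower().split()):
            hedged[category] += 1
    return {c: hedged[c] / total[c] for c in total}

rates = hedge_rate_by_category(corpus_sentences)  # assumed sentence list
print(rates)  # opinions should carry more hedges than factual claims
```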
Corn
That's clever. You're not just measuring the symptom — you're measuring whether the symptom appears in contexts where it's actually a problem.
Herman
That principle generalizes. Any time your analysis flags something — overused word, repetitive structure, whatever — ask whether it's uniformly distributed or context-dependent. The answer changes what you do about it.
Corn
What about measuring whether the next batch actually improved? We talked about distributional shifts, but what does a good measurement framework look like concretely?
Herman
You want a dashboard, not a single number. Track maybe six to eight metrics across batches. Lexical diversity — is the vocabulary expanding or contracting? Overused word frequency — are your top offenders trending down? Semantic spread of openings and closings — are they becoming more varied? Dialogue turn length variance — are exchanges becoming more dynamic or more uniform? Hedge distribution by sentence type — is uncertainty landing in the right places? And then one qualitative metric: a periodic blind evaluation where someone reads episodes from different batches and ranks them without knowing which is which.
Corn
That last one is expensive but probably the most valuable.
Herman
It's the only one that directly measures quality rather than proxy metrics. Everything else is a signal that might correlate with quality. The blind evaluation is the closest you get to ground truth.
Corn
If your proxy metrics are all improving but the blind evaluation says quality is flat or declining, you're optimizing the wrong things.
Herman
That's the nightmare scenario, and it happens all the time. The metrics look great. The dashboard is green. And the content is getting worse in ways the metrics don't capture. The only defense is to keep the qualitative evaluation in the loop and trust it over the numbers when they disagree.
Corn
To summarize the full approach: start with cheap frequency analysis to catch the obvious stuff. Graduate to embedding-based clustering when you need to find structural and thematic repetition. Use LLM-as-judge for qualitative interpretation of what the clusters mean. Translate findings into targeted prompt edits with concrete examples, not blanket bans. Measure at the batch level using multiple signals, and always keep a qualitative evaluation in the loop as your anchor. The whole thing can run on a laptop except for the LLM API calls, and the cadence can be weekly once the pipeline is built.
Herman
That's the architecture. The implementation details will vary depending on corpus size and the specific patterns you're hunting, but the principles hold. Start simple, add complexity only when the simple tools can't answer your question, and never let the metrics become the goal.
Corn
The other thing I'd add — and this is more philosophy than engineering — is that you should be optimizing for distinctiveness, not just the absence of flaws. A script with no repeated words and perfectly varied sentence structure can still be forgettable. The goal isn't to produce text that passes an automated quality check. It's to produce text that sounds like something only this show would say.
Herman
The analysis tells you what's wrong. It doesn't tell you what's right. For that, you need a clear sense of the show's voice — and the discipline to protect that voice even when the dashboard suggests smoothing it out.
Corn
Now: Hilbert's daily fun fact.

Hilbert: In the early fifteen hundreds, the permafrost on the Tibetan Plateau stored roughly one point six billion tons of methane — comparable to the total energy content of all the natural gas the United States consumed in two thousand twenty-four.
Corn
Tibet's frozen dirt was basically America's gas tank five hundred years ago. Good to know.
Herman
I have so many questions, none of which I'm going to ask.
Corn
That's probably wise. Something to chew on for next time: once you've built this analysis pipeline and it's humming along, what happens when the model that generates your scripts gets updated by its provider? All your carefully tuned prompt guidance might target behaviors that the new model version doesn't exhibit — or worse, interprets completely differently. Prompt drift across model versions is a whole other problem, and one that makes this entire edifice a moving target. But that's for another episode. Thanks to our producer Hilbert Flumingtop for keeping this operation running. This has been My Weird Prompts. Find us at myweirdprompts.com or wherever you get your podcasts.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.