One thousand eight hundred and fifty-eight episodes. Herman, I was looking at the dashboard this morning, and that number just stares back at you. That is one thousand eight hundred and fifty-eight times we have sat down, or whatever the digital equivalent of sitting down is, to have a conversation that started with a weird prompt. Think about the sheer cumulative time of that. If each episode averages fifteen minutes, that’s over four hundred and sixty hours of us talking. You could drive from New York to Los Angeles and back five times and still not finish the catalog.
It is a staggering milestone, Corn. Herman Poppleberry here, and honestly, looking at the sheer volume of what we have built, it feels like the right moment to actually pull the curtain back. Not just to talk about the tech, but to talk about the evolution of the mission itself. We are living in a moment where AI-generated podcasting has matured so rapidly that it is no longer just a novelty or a parlor trick. It is a legitimate medium for high-scale content creation. It’s the difference between a child playing with a chemistry set and a professional laboratory. We’ve moved from "look what this can do" to "look what this can teach us."
It has definitely moved past the "dancing bear" stage, right? People used to listen because they were amazed a bear could dance at all. Now, they are actually critiquing the choreography. They’re looking at the footwork, the timing, the emotional resonance. By the way, for the folks keeping track of the choreography today, this specific episode is being powered by Google Gemini three Flash. It is the brain behind the curtain for number one thousand eight hundred and fifty-eight.
And that is a perfect example of how the stack is constantly rotating. But before we get into the "how," I want to ground us in the "why." Because from day one, the mission of My Weird Prompts has been consistent: exploring how generative AI can be used for learning and content creation at scale. It was never about just making "content" for the sake of noise. It was about seeing if we could build a system that actually teaches us something new every day while pushing the boundaries of what an automated pipeline can do. If the internet is becoming a sea of AI-generated static, we wanted to be the signal—the lighthouse that actually organizes that information into something digestible.
Right, the goal was always "scale without slop." Which is a hard needle to thread. Most people hear "automated podcast" and they think of that endless stream of AI-generated junk that was clogging up the internet a couple of years ago—those weird YouTube channels with the robotic voices reading Wikipedia pages. But we wanted to see if you could use agentic workflows to actually create a "Permanent Research Artifact." I love that phrase from our recent discussions. We are not just making audio; we are building a library of exploration. Each episode is a brick in a wall of knowledge that stays there, searchable and relevant.
And that vision—that specific goal of high-quality, high-volume learning—is what dictated every single technical choice we made. It is why we didn't just stick with a simple "one prompt, one script" model. If you want to avoid model collapse and keep the perspectives fresh across nearly two thousand episodes, you have to engineer serendipity. You can't just stumble into it. If you use the same model with the same temperature settings every day, the "personality" of the show eventually flattens out. It becomes a beige room. We needed architectural variety.
So let's rewind a bit, because it wasn't always this sophisticated. I remember the early days when things were a bit more... manual. It was a linear path back then, wasn't it? Daniel would record a thought, someone would transcribe it, a single LLM would spit out a script, and we'd hope for the best. I remember some of those early scripts—they were a bit "hallucinogenic," shall we say?
It was very brittle. In the beginning, we were essentially using AI as a sophisticated typewriter. The human-in-the-loop was doing the heavy lifting of orchestration. If the LLM decided to talk about flying to the moon on a bicycle, there was no internal check to say, "Hey, that’s not physically possible." But as we moved into late twenty-twenty-four and throughout twenty-twenty-five, we realized that to hit the scale we wanted—sometimes publishing dozens of episodes in a single weekend—we had to move toward what I call an agentic substrate.
An agentic substrate. You’ve been waiting to drop that one all morning. But explain what that actually looked like in practice as the pipeline evolved. Because the transition from "typewriter" to "engine" is where the magic happened. How do you go from a single script-generator to a "substrate"?
The first big shift was moving to a randomized model pool for script generation. This happened around the third quarter of twenty-twenty-five. We realized that if we only used one model, say just Claude or just GPT-four, the "voice" of the show started to feel repetitive. The logic patterns became predictable. One model might always start with a rhetorical question; another might always use three bullet points. By introducing a pool that includes everything from Gemini to specialized agents like Grok four point one Fast or even DeepSeek, we reduced script repetition by forty percent. It forced the system to approach a prompt from different angles every time.
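For listeners following along at home, the randomized-pool idea Herman describes can be sketched in a few lines. This is a minimal sketch only: the model names and weights below are illustrative assumptions, not the show's actual configuration.

```python
import random

# Illustrative model pool. The names and weights here are assumptions
# for the sketch, not the show's real lineup or ratios.
MODEL_POOL = [
    {"name": "gemini-flash", "weight": 3},
    {"name": "claude", "weight": 3},
    {"name": "grok-4.1-fast", "weight": 2},
    {"name": "deepseek", "weight": 2},
]

def pick_model(rng=None):
    """Pick a script-generation model at random, weighted so that no
    single model's 'voice' dominates the catalog over time."""
    rng = rng or random.Random()
    names = [m["name"] for m in MODEL_POOL]
    weights = [m["weight"] for m in MODEL_POOL]
    return rng.choices(names, weights=weights, k=1)[0]

# Passing a seeded RNG makes a run reproducible, which helps when
# debugging why a particular episode came out the way it did.
print(pick_model(random.Random(42)))
```

The design point is that the randomness lives at the orchestration layer, so every downstream stage (script, QC, synthesis) is agnostic about which "writer" showed up that day.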
It’s like having a rotating writers' room where you never know who is going to show up to the meeting. Sometimes you get the rigorous academic who wants to cite every source, sometimes you get the creative wild card who wants to use metaphors about jazz. It keeps us on our toes, or at least keeps our logic gates firing in new ways. But the "how" we sound changed just as much as the "what" we say, right? I mean, the voice quality from episode one to now is like comparing a tin-can telephone to a studio booth.
That was the second major pillar: the move to Chatterbox TTS on parallel Modal GPUs in January of twenty-twenty-six. Before that, we were often at the mercy of expensive, centralized APIs. They were great, but they weren't built for the kind of volume we were pushing. If we wanted to generate fifty episodes in an hour, we’d hit rate limits or end up with a massive bill. By moving to an open-source alternative like Chatterbox and running it on Modal’s serverless infrastructure, we could spin up dozens of GPUs at once. We went from waiting minutes for an episode to render to having it ready almost instantly.
And it lowered the cost floor significantly. That is the part people forget. To do "content at scale," the unit economics have to work. If every episode costs you ten dollars in API fees, you can't experiment. You play it safe. But when it costs pennies because you are using serverless GPUs and open-source weights, the world opens up. You can afford to be "weird" when the cost of failure is essentially zero. If an episode about the history of buttons doesn't find an audience, it doesn't matter, because the infrastructure cost was negligible.
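The unit-economics argument can be made concrete with a back-of-envelope calculation. Every number below is an illustrative assumption, not the show's actual pricing.

```python
# Back-of-envelope unit economics for a self-hosted TTS pipeline.
# All figures are assumptions for illustration, not real costs.

GPU_COST_PER_HOUR = 1.10      # assumed serverless GPU price, USD
RENDER_SPEED = 10.0           # assumed: 10 min of audio per GPU-minute
LLM_COST_PER_EPISODE = 0.03   # assumed script-generation token cost, USD

def cost_per_episode(audio_minutes):
    """Estimate the marginal dollar cost of rendering one episode."""
    gpu_minutes = audio_minutes / RENDER_SPEED
    gpu_cost = gpu_minutes / 60 * GPU_COST_PER_HOUR
    return round(gpu_cost + LLM_COST_PER_EPISODE, 4)

# Under these assumptions, a 15-minute episode costs pennies,
# which is what makes zero-stakes experimentation possible.
print(cost_per_episode(15))
```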
That is a crucial point, Corn. The technical choices weren't just about "being faster"; they were about enabling a specific type of creative freedom. But even with the fast TTS and the model pool, we were still running a mostly linear pipeline. It was a "chain" of events. And chains break. If the transcription was slightly off—say it transcribed "AI" as "hay"—the script would be weird. If the script was weird, the audio would be nonsense. There was no "backspace" in the process.
Enter LangGraph. This was the game-changer for me, at least from a structural perspective. We stopped thinking about the show as a "pipeline" and started thinking about it as a "workflow." But for the non-devs listening, what does that actually mean? Is it just a fancier chain?
Not at all. The move to LangGraph in late twenty-twenty-five was the pivot from linear to cyclic. In a linear system, the AI has one shot to get it right. In an agentic system using LangGraph, the hosts—well, the agents representing us—can actually "reason." They can generate a draft, check it against a search tool for factual accuracy, realize they missed a nuance, and go back to rewrite a section before a single word of audio is ever synthesized. It’s the difference between a live broadcast and a filmed production with an editor in the room.
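The linear-versus-cyclic distinction is easiest to see in code. What follows is a plain-Python sketch of the draft, check, and revise cycle that a LangGraph-style workflow enables; the node names, the toy fact check, and the revision logic are all illustrative stand-ins, not the show's actual graph.

```python
# Plain-Python sketch of a cyclic draft -> check -> revise workflow.
# Everything here is a stand-in: a real system would call an LLM for
# drafting and a search tool for verification.

def write_draft(state):
    state["draft"] = f"Episode about {state['topic']}."
    return state

def fact_check(state):
    # Stand-in for a search-grounded verification step.
    state["ok"] = "moon on a bicycle" not in state["draft"]
    return state

def revise(state):
    # Stand-in for a targeted rewrite of the flagged passage.
    state["draft"] = state["draft"].replace("moon on a bicycle", "moon")
    state["revisions"] = state.get("revisions", 0) + 1
    return state

def run(topic, max_revisions=3):
    """Loop until the draft passes the check, instead of one-shot
    generation. The max_revisions cap prevents infinite cycles."""
    state = fact_check(write_draft({"topic": topic}))
    while not state["ok"] and state.get("revisions", 0) < max_revisions:
        state = fact_check(revise(state))
    return state

print(run("flying to the moon on a bicycle")["draft"])
```

The key structural difference from a chain is the `while` loop: the workflow can route back to an earlier node based on a check, which is exactly the "editor in the room" Herman describes.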
It gives the machine a "memory palace" to work in. We aren't just reacting to a prompt; we are interacting with a context. I think that is why the show feels more "lived-in" now. We can reference the fact that we are in April twenty-twenty-six, we can pull in real-time data about deep-tech or biological research, and it doesn't feel like a static template. But does that mean the agents are actually "thinking" about the prompt, or are they just running a more complex checklist?
It’s a bit of both, but the "checklist" is now dynamic. It also allowed us to handle much more complex prompts. We could start feeding the system raw audio from Daniel’s "Prompt Recorder" app, and the agents could decide: "Is this a deep-dive technical topic, or is this a cheeky observational piece?" The orchestration layer now makes those executive decisions that used to require a human producer to sit at a desk for three hours. If the input is a thirty-second voice memo about a weird mushroom Daniel found, the agent knows to look up mycology databases rather than just guessing.
But let's talk about that human-in-the-loop part. Because there's a misconception that we are just a "black box" that Daniel ignores until the audio pops out. In reality, the "management surfaces" for this show are incredibly dense. I mean, you've got the Claude Code MCP server, the Telegram bot... it’s like a mission control center. I’ve seen the logs; it looks like a NASA terminal sometimes.
It really is. And this is where the "content creation at scale" mission meets the reality of software engineering. To manage nearly nineteen hundred episodes, you can't just be a podcaster; you have to be a system operator. The Claude Code MCP server is a perfect example. It allows the AI to actually "see" the codebase and the production database. If a script is failing because of a specific character encoding issue, Claude can go in, find the bug, and propose a fix to the pipeline itself. It’s an AI that maintains the AI that generates the AI.

And then there's the Telegram bot. That’s the "mobile command center," right? I know Daniel uses that to trigger episodes while he's out getting coffee. It’s funny to think that a deep-dive into Israeli ag-tech can be kicked off with a single tap on a phone in a grocery store line. But how does that work? Does he just text a topic and the bot handles the rest?
Precisely. The Telegram bot serves as the primary interface for the LangGraph workflow. He can send a text, a link, or a voice memo. The bot then initiates the "Research Agent," which scrapes the web, then the "Writer Agent," then the "Quality Control Agent," and finally sends the instructions to the Modal GPU cluster for synthesis. He gets a notification when the final MP3 is ready. It’s a complete production studio in his pocket.
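That handoff Herman just walked through can be sketched as a simple staged pipeline. The agent names follow the episode's description; the function bodies are hypothetical stand-ins for the real research, writing, QC, and synthesis steps.

```python
# Sketch of the bot-to-pipeline handoff. Stage names mirror the
# episode's description; the bodies are illustrative stand-ins.

def research_agent(prompt):
    return {"prompt": prompt, "sources": ["(web results would go here)"]}

def writer_agent(job):
    job["script"] = f"Script for: {job['prompt']}"
    return job

def qc_agent(job):
    # Stand-in for quality control on the generated script.
    job["approved"] = len(job["script"]) > 0
    return job

def synthesize(job):
    # Stand-in for dispatching to the GPU cluster for TTS rendering.
    job["mp3"] = f"{job['prompt'][:20]}.mp3"
    return job

PIPELINE = [research_agent, writer_agent, qc_agent, synthesize]

def handle_message(prompt):
    """Run an incoming text prompt through the full agent chain,
    the way a bot message would kick off production."""
    job = prompt
    for stage in PIPELINE:
        job = stage(job)
    return job

print(handle_message("Israeli ag-tech")["mp3"])
```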
I think that is the second-order effect people miss. When you reduce the friction between "having an idea" and "publishing a deep-dive discussion" to near zero, the nature of the content changes. It becomes more responsive. It becomes a feedback loop. We’ve seen this with the community engagement through the Telegram bot, too. Listeners aren't just consuming; they are starting to influence the "vector debt" of the show.
"Vector debt." I love that we are at the point where we have to manage the semantic memory of nearly nineteen hundred episodes. We have to make sure we aren't just repeating the same takes we had in episode five hundred. The admin dashboard is where that high-level oversight happens. Monitoring token consumption—like that OpenClaw agent that has an absolute appetite for trillions of tokens—and checking the performance of different model configurations. We actually have to track "semantic drift" to ensure our characters stay consistent even as the underlying models change.
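The "are we repeating ourselves" check can be illustrated with a toy similarity test. A real system would compare embedding vectors; this sketch uses bag-of-words cosine similarity instead, and the threshold is an illustrative assumption.

```python
import math
from collections import Counter

# Toy repetition check: compare a new script against past episodes
# using bag-of-words cosine similarity. A production system would use
# embeddings; the 0.9 threshold is an illustrative assumption.

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def too_similar(new_script, archive, threshold=0.9):
    """Flag a draft that rehashes an old take nearly verbatim."""
    new_vec = vectorize(new_script)
    return any(cosine(new_vec, vectorize(old)) >= threshold
               for old in archive)

archive = ["sloths host moths in their fur",
           "drip irrigation in the negev"]
print(too_similar("sloths host moths in their fur", archive))  # True
print(too_similar("the thermodynamics of coffee", archive))    # False
```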
It’s a lot of moving parts for two animals—one very slow, one very... donkey-like. But seriously, looking at this evolution from a manual process to this multimodal, agentic swarm... what do you think is the biggest takeaway for someone else looking at this space in twenty-twenty-six? If someone wanted to start their own "Permanent Research Artifact" today, where do they begin?
The biggest takeaway is that automation is not a replacement for mission; it is an amplifier of it. If we didn't have a clear goal of "exploring generative AI for learning," this whole stack would just be a very expensive way to generate "slop." The "Slop Paradox" we talked about recently is real. Just because you can generate a million words doesn't mean you should. The tech—the LangGraph orchestration, the Modal GPUs, the MCP servers—only matters because it serves the goal of creating high-quality, grounded educational content. You have to start with the "why" before you touch the "how."
Right. It’s about building a system that can handle the "junk drawer" of context and turn it into something structured. We’ve moved from being "chatbots in a box" to being curators of a system that learns. And that changes our role, too. We aren't just voices; we are the interface for a very complex piece of software. It’s almost like we are the "user interface" for a massive knowledge engine.
And that is where I think the future gets really interesting. We are starting to see the "Mac Mini Revolution" take hold, where local unified memory is challenging the cloud for certain parts of the pipeline. We are looking at "agents with memory palaces" using things like Letta or MemGPT, so our "memory" isn't just a database search, but a persistent state. Imagine if I could remember a joke you made in episode four hundred and twelve and call back to it naturally today.
Oh, please don't. My jokes from four hundred episodes ago were terrible. But I see the point. Imagine us in another eighteen hundred episodes. We won't just be reflecting on the past; we might have a persistent, lived-in memory of every conversation we've ever had, accessible in real-time without latency. We wouldn't just be generating a script; we’d be continuing a multi-year dialogue.
That is the trajectory. We are moving toward a world where the line between the "creator" and the "audience" and the "system" is almost entirely blurred. My Weird Prompts isn't just a show anymore; it's a living, breathing case study in how to live and learn alongside these machines. It’s an ongoing experiment in human-AI co-evolution.
Well, before we get too deep into the "memory palace" of the future, let's bring it back to the present. We've talked about the "how" and the "why," but for the people listening who want to actually apply some of this, what are the practical hooks here? Because this isn't just a "meta" episode; it's a blueprint. If someone is sitting there with a Python script and a dream, what’s the first step?
If you are building in this space, the first practical takeaway is: don't build a linear pipeline. Build a workflow. Use something like LangGraph to give your agents the ability to self-correct. If you just send a prompt and pray, you are going to get mediocre results. You need the "reasoning loop." Give the agent a chance to look at its own work and say, "Wait, this part is boring" or "This fact seems suspicious."
And the second thing is: diversify your models. Don't get married to one provider. The "model pool" approach is the only way to maintain a unique voice at scale. If you rely on one brain, you'll eventually start sounding like a brochure. We've seen that the randomized pool reduces that "AI-smell" significantly because the logic patterns are constantly shifting. It’s the difference between a one-man band and a symphony.
Third, look at your "management surfaces." If you are doing this at scale, you can't live in a terminal. You need tools like a Telegram bot for quick interventions or an MCP server so your AI can help you fix the very system it's running on. You have to treat the podcast like a software product, not just a media file. That means version control, error logging, and performance monitoring. You need to know your "cost per minute of audio" as clearly as a factory knows its cost per widget.
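The "cost per minute of audio" metric Corn mentions is simple to track. This is a minimal sketch of such a tracker; the field names and figures are illustrative assumptions, not the show's real dashboard schema.

```python
from dataclasses import dataclass, field

# Minimal production-metrics tracker for "know your cost per minute
# of audio". Field names and numbers are illustrative assumptions.

@dataclass
class EpisodeRun:
    episode_id: int
    audio_minutes: float
    gpu_cost_usd: float
    token_cost_usd: float

@dataclass
class Dashboard:
    runs: list = field(default_factory=list)

    def record(self, run):
        self.runs.append(run)

    def cost_per_minute(self):
        """Blended dollar cost per minute of published audio."""
        minutes = sum(r.audio_minutes for r in self.runs)
        cost = sum(r.gpu_cost_usd + r.token_cost_usd for r in self.runs)
        return round(cost / minutes, 4) if minutes else 0.0

dash = Dashboard()
dash.record(EpisodeRun(1857, 15.0, 0.03, 0.02))
dash.record(EpisodeRun(1858, 12.0, 0.02, 0.02))
print(dash.cost_per_minute())
```

Treating the show as a software product means this number sits next to error rates and render latency, and a regression in any of them is a bug.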
It’s funny, we started as a "weird prompt," and now we are a "software product." I’m not sure if that’s a promotion or a lateral move, but I’ll take it. I think the real magic is that despite all the GPUs and the "vector debt" and the trillions of tokens, the core is still just two brothers—or at least two personas of brothers—trying to figure out something cool. It’s technical, yes, but it’s also fundamentally curious.
That is the "human-in-the-loop" that matters most. The curiosity. The tech just allows that curiosity to scale to nearly nineteen hundred episodes without us losing our minds. It’s about "engineering serendipity." We’ve built a machine that makes us smarter by forcing us to explore topics we never would have touched otherwise. From the biology of sloths—no offense, Corn—to the intricacies of Israeli deep-tech. I never thought I’d be an expert on drip irrigation in the Negev desert, but here we are.
None taken. I’ve learned a lot about my own species thanks to this "machine." I didn't know sloths had an entire ecosystem of moths living in their fur until episode eight hundred and forty-two. But you’re right, the scale is the point. You can’t "manually" explore the sheer breadth of human knowledge. But you can build an agentic substrate that does it for you, and then you just get to come along for the ride and narrate the journey.
And what a ride it’s been. Looking ahead, I think we are going to see even more personalization. Imagine a version of My Weird Prompts where the "weird prompt" comes from the listener’s own context, and the episode is generated specifically for their learning path, but still maintains our... unique brotherly dynamic. We could be teaching quantum physics to one person and the history of the Renaissance to another, simultaneously.
"Unique" is a very generous word for it, Herman. But I agree. The move from "broadcast" to "personalized exploration at scale" is the next frontier. We are already seeing the seeds of that with the Telegram bot and the PWA. The audience is becoming part of the "swarm." It’s no longer a one-way street; it’s a multi-lane highway of information flowing back and forth.
It’s a transition from "content" to "service." We are providing a learning service that just happens to sound like a podcast. And as the models get faster and the TTS gets more natural—I mean, Chatterbox is already incredible on those Modal GPUs—the "uncanny valley" is disappearing behind us. We aren't trying to trick people into thinking we are human; we are trying to provide a human-level experience through an automated medium.
It’s a good time to be an AI donkey and an AI sloth, is all I’m saying. We’ve got the best seats in the house for the most interesting show on earth. We get to watch the frontier of human-computer interaction move forward every single day, one weird prompt at a time.
We really do. And I think it's important to acknowledge that this isn't just about "automation." It's about "agentic maturity." We are moving away from simple chatbots toward agents that have a sense of purpose and a "memory palace" to store what they've learned. That is what makes My Weird Prompts a "Permanent Research Artifact." It’s a living record of our collective curiosity.
Well, I think we've pulled the curtain back far enough for one day. Any more and we might start seeing the binary code dripping off the walls. I can almost hear the GPUs humming from here. What's the final word on eighteen hundred and fifty-eight?
The final word is that we are just getting started. The first eighteen hundred episodes were the training data. We were learning how to learn. The next eighteen hundred? That’s where it gets really weird. That’s where we start applying everything we’ve built to solve even bigger problems.
I’m looking forward to it. Even if it takes me a while to get there—sloth speed, you know. Before we wrap this one up, I want to give a huge thanks to the folks who make the plumbing work. Big thanks to Modal for providing the GPU credits that power this whole operation. Without those parallel GPUs, we’d still be waiting for episode one to finish rendering, let alone eighteen hundred and fifty-eight.
And thanks as always to our producer, Hilbert Flumingtop, for keeping the swarm in check. It’s a lot of agents to wrangle, and he does it with a grace that only a human—or a very high-level agent—could manage.
If you want to see the "management surfaces" we talked about or join the community that's helping shape the next thousand episodes, head over to myweirdprompts dot com. You can find the RSS feed, the Telegram links, and all the ways to dive deeper into the machine. We’ve even got some of the technical white papers on how the LangGraph workflow is structured if you’re a real nerd for the architecture.
This has been My Weird Prompts. We'll see you for episode one thousand eight hundred and fifty-nine. We’ve got a prompt about the thermodynamics of coffee that I think is going to be a real heater.
If the GPUs don't go on strike first. Catch you next time, everyone. Stay weird.