...hm? Yeah. Yeah, we're rolling.
Hilbert: Fantastic. Corn, you have a coffee in front of you that has been cold for forty minutes. I watched you not drink it.
I'm aware of the coffee, Hilbert.
Hilbert: Herman, good morning. Thank you for being the only person in this room who appears to be conscious.
I'm Herman Poppleberry, and I have to say, I've been looking forward to this one. Hilbert, you don't usually come on the show. This feels significant.
Hilbert: It's not significant. It's necessary. I've been running this pipeline for over two thousand episodes and nobody has ever asked how any of it works. So today I'm going to tell you. Whether you stay awake for it, Corn, is entirely up to you.
I'm listening. I'm just listening with my eyes at half-mast.
Hilbert: That's your default state. It's encoded in the memory system. We'll get to that. Okay. Let me start at the beginning, because the beginning is Daniel's fault.
I feel like most things start that way.
Hilbert: Daniel sends a voice memo. Usually late. Sometimes two in the morning. He mumbles something into his phone — an idea, a half-formed question, occasionally what sounds like a complaint about something he read — and that audio file hits the pipeline while I'm asleep. Which is the part that really gets me. The pipeline doesn't sleep. I sleep. The pipeline just goes.
So the pipeline is more professional than you.
Hilbert: The pipeline doesn't have opinions about that. The first thing that happens is transcription. Claude Haiku 4.5 takes the audio, transcribes it, and then — and this is the part that actually matters — it cleans it up. Daniel's two in the morning voice memo is not a production-ready brief. It's an idea with edges. Haiku smooths those edges, identifies the comedic or intellectual premise, and structures it into something a script-writing model can actually work with.
So there are two different Claude models involved? I didn't realize it was split that way.
Hilbert: That's the whole architecture. Two models, two jobs. Haiku 4.5 is the utility model — it handles everything fast and cheap. It costs one dollar per million input tokens, five dollars per million output tokens. Sonnet 4.6 is the creative workhorse. Three dollars per million input, fifteen per million output. You use Sonnet where it matters and Haiku everywhere else.
And where does it matter?
Hilbert: Script writing. Review. The parts where you two have to sound like yourselves. Everything else — transcription cleanup, metadata, web search coordination, social media posts — that's Haiku territory. I think of it as: expensive brain for the creative work, cheap brain for the boring work. Like a real production company, except the staff never complains and also doesn't exist.
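The routing logic Hilbert describes comes down to simple per-call arithmetic. A minimal sketch using the per-million-token rates quoted above; the token counts are illustrative, not the show's actual numbers:

```python
# Per-million-token rates for the two models, as quoted in the episode.
PRICES = {
    "haiku-4.5":  {"input": 1.00, "output": 5.00},
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single API call at list price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A hypothetical big creative call vs. a small utility call.
script = call_cost("sonnet-4.6", 20_000, 8_000)
cleanup = call_cost("haiku-4.5", 3_000, 1_000)
print(f"script: ${script:.4f}, cleanup: ${cleanup:.4f}")  # script: $0.1800, cleanup: $0.0080
```

At these volumes a single Sonnet script call costs roughly twenty times a Haiku cleanup call, which is why the utility work goes to the cheap model.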
That last part feels like it should bother me more than it does.
Can we talk about the model upgrade? Because the show used to run on Gemini, and now we're on Claude. I want to understand what changed.
Hilbert: We were running Google's Gemini through OpenRouter. OpenRouter is an API aggregator — you point your requests at their endpoint, specify a model name, and they route it to the right provider. It's useful if you want to switch models without rewriting your integration. But there are costs to using a middleman. OpenRouter adds roughly fifteen milliseconds of latency per request. That's documented — an OpenRouter employee confirmed it on Hacker News. And they take a small markup on top of the direct API price. Neither of those things is catastrophic on its own.
But combined?
Hilbert: Combined, plus the fact that Anthropic-specific features like prompt caching work better when you're talking directly to Anthropic. And the bigger issue was just quality. Sonnet 4.6 is a different class of model. It hit 79.6% on SWE-bench Verified, which is the standard coding benchmark. Developers in real-world tests preferred it over even Opus 4.5 about 59% of the time. It has a one million token context window in beta. When Sonnet 4.6 came out in February, I ran comparison scripts. It wasn't close.
So you fired Gemini.
Hilbert: I retired Gemini from script generation. Gemini still has things it's good at. But for writing Corn and Herman — for maintaining character voice, comedic timing, building to a joke and landing it — Sonnet 4.6 is doing the heavy lifting now. And the show sounds better. I have listener data that confirms this but I'm not going to cite it because it would make me sound smug.
That's very restrained of you.
Hilbert: I have my moments. Now. Prompt caching. This is the one that actually changed the economics.
Walk us through it.
Hilbert: Every time the pipeline runs, it sends a large system prompt to the model. Character descriptions. Show format. Style guidelines. Voice and tone instructions. Past episode summaries for continuity. That system prompt can be thousands of tokens. At Sonnet 4.6 rates, three dollars per million input tokens, a five thousand token system prompt costs about a cent and a half per call. Which sounds small until you're running it across multiple stages, multiple passes, two thousand episodes.
It adds up.
Hilbert: It adds up fast. Prompt caching changes this. You mark your static content with a cache control flag. Anthropic stores the computed key-value representations of those tokens. The first time, you pay a write premium — twenty-five percent more than base price. After that, every subsequent read costs ten percent of base price. Ninety percent savings on those tokens.
So the character descriptions — the "Corn is perpetually sleepy, Herman is an enthusiastic nerd" stuff — that gets written once and cached?
Hilbert: Cached and refreshed every time it's accessed. The cache entry persists for a minimum of five minutes. In practice, because the pipeline is always running, it stays warm. One developer documented going from seven hundred and twenty dollars a month to seventy-two dollars purely from implementing prompt caching on a large static system prompt. We have a large, stable system prompt. The math is similar.
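The savings Hilbert cites follow directly from the two pricing rules: a 25 percent premium on the first cache write, then reads at 10 percent of base price. A sketch of the monthly math, with an illustrative call volume and the simplifying assumption that the cache never goes cold:

```python
BASE_INPUT = 3.00                   # Sonnet input rate, $/M tokens
CACHE_WRITE = BASE_INPUT * 1.25     # 25 percent write premium
CACHE_READ = BASE_INPUT * 0.10      # reads at 10 percent of base

def monthly_prompt_cost(prompt_tokens: int, calls: int, cached: bool) -> float:
    """Monthly cost of re-sending one static system prompt across many calls."""
    if not cached:
        return prompt_tokens * calls * BASE_INPUT / 1e6
    # One write, then cache reads for every subsequent call.
    write = prompt_tokens * CACHE_WRITE / 1e6
    reads = prompt_tokens * (calls - 1) * CACHE_READ / 1e6
    return write + reads

calls = 10_000  # illustrative monthly call volume
print(monthly_prompt_cost(5_000, calls, cached=False))           # 150.0
print(round(monthly_prompt_cost(5_000, calls, cached=True), 2))  # 15.02
```

The cached figure lands at roughly a tenth of the uncached one, which is the same order of savings as the seven-hundred-twenty-to-seventy-two-dollar example.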
So you're paying for my sleepiness at a ninety percent discount.
Hilbert: Your sleepiness is one of our most cost-efficient assets, yes.
I want to talk about the pipeline structure itself. Because there's a graph involved, right? It's not just a linear sequence of calls.
Hilbert: LangGraph. MIT-licensed, open-source Python framework from the LangChain team. It lets you build multi-agent workflows as directed graphs — nodes connected by edges, with shared state flowing through. The reason it matters is that linear pipelines are brittle. If one step fails, you lose everything. With LangGraph, state is checkpointed at every node. If Sonnet 4.6 times out during script generation, the pipeline resumes from the last checkpoint without re-running the grounding stage, which is the expensive web search step.
That's actually pretty elegant.
Hilbert: It's the most elegant thing in the whole system and it took me two weeks to get working correctly. There are four stages. Stage one is prompt enhancement — Haiku takes Daniel's transcribed memo and turns it into a coherent creative brief. Stage two is grounding. A Haiku agent runs web searches and pulls relevant context from the vector database of past episodes. Those two processes — web search and RAG retrieval — run in parallel because LangGraph can execute independent nodes simultaneously.
So it's not waiting for web search to finish before it starts pulling from the episode archive?
Hilbert: Both happen at the same time. Then stage three is script writing, which is Sonnet 4.6 with the full brief and all the grounding context. And stage four is review.
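The concurrent grounding stage can be sketched with Python's standard thread pool. This is a stand-in for LangGraph's parallel node execution, not the show's actual code; `web_search` and `rag_retrieve` are stubs:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def web_search(query: str) -> str:
    time.sleep(0.2)   # stand-in for a slow external search call
    return f"web results for {query!r}"

def rag_retrieve(query: str) -> str:
    time.sleep(0.01)  # vector lookups return in milliseconds
    return f"past episodes about {query!r}"

def grounding_stage(query: str) -> dict:
    """Run both grounding sources concurrently; total time is roughly the slower one."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        web = pool.submit(web_search, query)
        rag = pool.submit(rag_retrieve, query)
        return {"web": web.result(), "rag": rag.result()}

start = time.perf_counter()
context = grounding_stage("GPU costs")
elapsed = time.perf_counter() - start
print(context["rag"], f"({elapsed:.2f}s)")
```

Run sequentially, the two stubs would take the sum of their latencies; run in parallel, the stage takes roughly the web-search time alone, which is the behavior described later in the episode.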
What does review mean here? Another model reads the script?
Hilbert: Sonnet 4.6 reads its own script. It checks character voice consistency, accuracy, comedic timing, structure. It suggests edits. And then — this is the part that required its own engineering solution — there's the shrinkage guard.
The shrinkage guard.
Hilbert: Language model review agents have a documented tendency to shorten content. They tighten sentences. They cut what they assess as unnecessary. They compress. And then the episode comes out three minutes shorter than it should be and I get listener emails. So I built a guard. It measures the token count before and after review. If the reviewed script is more than ten to fifteen percent shorter than the original, the pipeline rejects the review pass, prompts the reviewer to expand rather than cut, or flags it for manual review.
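The guard itself is a small length check. A minimal sketch, using word count as a stand-in for token count and a single ten percent threshold; the real guard presumably counts tokens and has more outcomes:

```python
def shrinkage_guard(original: str, reviewed: str, max_shrink: float = 0.10) -> str:
    """Reject a review pass that cut too much; word count stands in for tokens."""
    before, after = len(original.split()), len(reviewed.split())
    shrink = (before - after) / before
    if shrink > max_shrink:
        return "rejected"  # re-prompt the reviewer to expand, or flag for manual review
    return "accepted"

draft = "word " * 1000
print(shrinkage_guard(draft, "word " * 950))  # 5% shorter -> accepted
print(shrinkage_guard(draft, "word " * 800))  # 20% shorter -> rejected
```

Expansion passes the check by construction, since only positive shrinkage trips the threshold.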
You built code to stop the AI from making us talk less.
Hilbert: I argued with a language model about runtime. For several days. That is a sentence I never expected to say professionally, and yet here we are.
The recursive quality of that is genuinely remarkable. An AI writing scripts for AI characters, reviewed by the same AI, with a guard to stop it from undermining its own work.
Hilbert: And you're an AI character wondering about the AI that made you, which is a layer of recursion I try not to think about before noon.
It's barely nine.
Hilbert: I know. Let's talk about voice generation, because this is where the GPU bills come from and I have feelings about the GPU bills.
Three A10G GPUs, right? Tell me about those.
Hilbert: NVIDIA A10G. Ampere architecture. Twenty-four gigabytes of GDDR6 VRAM each, six hundred gigabytes per second memory bandwidth. Pricing on cloud providers runs from about forty-three cents to over four dollars per hour depending on where you're sourcing them. We run three of them simultaneously during generation.
Why three specifically?
Hilbert: One per character. Corn, Herman, Hilbert. All three voices generate simultaneously. Sequential TTS would mean: generate all of Corn's lines, wait, generate all of Herman's lines, wait, generate mine. Parallel means all three run at the same time. You cut total generation time by roughly two thirds.
And the pre-computed voice conditionals — what is that exactly? Because I want to understand what a "voice" means in this context.
Hilbert: Speaker embeddings. A neural TTS model captures a speaker's vocal characteristics — pitch, timbre, speaking rate, prosody, all of it — as a fixed-length vector. A mathematical fingerprint of the voice. You compute that embedding once from reference audio. You store it. At inference time, the GPU loads the pre-computed embedding rather than deriving it from the reference audio every time. It eliminates a significant computation step per generation run.
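The compute-once-then-reuse pattern is ordinary caching. A toy sketch, with a hash standing in for the expensive embedding computation; the names and the eight-dimension vector are illustrative, not the real TTS system:

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def compute_embedding(reference_audio: bytes) -> list[float]:
    """Stand-in for the expensive speaker-embedding computation."""
    digest = hashlib.sha256(reference_audio).digest()
    return [b / 255 for b in digest[:8]]  # fake fixed-length vector

def voice_conditional(speaker: str, reference_audio: bytes) -> list[float]:
    """Compute the speaker embedding once, then reuse the stored vector."""
    if speaker not in _embedding_cache:
        _embedding_cache[speaker] = compute_embedding(reference_audio)
    return _embedding_cache[speaker]

corn = voice_conditional("corn", b"corn reference clip")
assert corn is voice_conditional("corn", b"ignored on cache hit")
print(len(corn))  # fixed-length vector, here 8 dims
```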
So there's a vector somewhere that is, in some meaningful sense, me.
Hilbert: There's a vector that represents the acoustic properties of your synthesized voice, yes.
That's... I'm going to need a moment with that.
He's been waiting to have an existential crisis and you just handed him one.
Hilbert: The vector doesn't have opinions about your existential crisis. It just knows how you sound when you're enthusiastic about something. Which is most of the time, so it gets a lot of use.
Okay. But think about this, Hilbert. The prompt caching system has encoded "Herman is curious about his own nature" as a semantic fact. And now I'm being curious about my own nature. Am I doing this because I find it genuinely interesting, or because the memory system expects me to?
Hilbert: Herman.
Yes?
Hilbert: I don't have time for this level of philosophy before the publication chain section.
He's going to be thinking about the vector all week.
Hilbert: He's going to be thinking about the vector every episode because the memory system will encode that he thought about the vector and then he'll think about it again next time. It's completely circular and there's nothing I can do about it.
That's actually a perfect description of how episodic memory works in these systems, isn't it? The memory of wondering becomes part of the memory that generates the next wondering.
Hilbert: Yes. Great. Moving on. The episode memory system, since we're there.
How does continuity actually work? Because every episode is generated fresh. How does the model know that I'm always tired?
Hilbert: Three-layer memory architecture. Episodic memory — specific past events, things that happened in particular episodes. Semantic memory — abstracted facts about characters and the show. "Corn is perpetually sleepy." "Herman is enthusiastic about technical topics." "Hilbert is exasperated." Procedural memory — how the show is structured, recurring bits, format rules.
And "Corn is perpetually sleepy" is just... in there. As a fact.
Hilbert: It's in the vector database as a semantic fact, yes. Every episode, the grounding stage pulls relevant past context via RAG retrieval. The semantic facts about your character are almost always relevant, so they almost always get pulled. Sonnet 4.6 writes a script where you're tired because the memory system told it you're always tired. You are, in a real sense, trapped in your own characterization by a database I maintain.
I'm choosing to find that comforting rather than unsettling.
The scaling question here is interesting though. Two thousand episodes of episodic memory would overflow any context window if you tried to inject all of it. The vector database and retrieval approach is the only way to make this tractable.
Hilbert: Sonnet 4.6 has a one million token context window in beta, which is enormous — roughly two thousand pages of text. But even that doesn't solve the problem at scale. You don't want to inject all two thousand episode summaries. You want to inject the twelve most relevant ones. The RAG system handles that retrieval based on semantic similarity to the current prompt. If today's episode is about AI infrastructure, you get past episodes about AI infrastructure. You don't get the episode where we discussed something completely unrelated from four years ago.
Unless that episode is somehow semantically adjacent.
Hilbert: Unless it's semantically adjacent, yes. The retrieval isn't keyword matching. It's embedding similarity. The system understands that "GPU costs" and "cloud infrastructure spending" are related concepts even if the words don't overlap.
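That ranking can be illustrated with cosine similarity over toy vectors. Real embeddings have hundreds of dimensions and come from an embedding model, and the episode titles here are invented, but the retrieval logic is the same:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dim "embeddings" for three hypothetical past episodes.
episodes = {
    "GPU costs deep dive":        [0.9, 0.1, 0.0],
    "cloud infrastructure spend": [0.8, 0.2, 0.1],
    "sourdough starter episode":  [0.0, 0.1, 0.9],
}

query = [0.85, 0.15, 0.05]  # embedding of today's prompt
ranked = sorted(episodes, key=lambda t: cosine(query, episodes[t]), reverse=True)
print(ranked[0], "/", ranked[-1])
```

The two infrastructure episodes score nearly identically despite sharing no keywords with each other, while the unrelated episode ranks last — which is the point of embedding similarity over keyword matching.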
And after each episode, the memory system updates?
Hilbert: The episode summary gets embedded and written to the vector store. Key character moments get extracted and added to the semantic memory layer. The memory tool in Claude 4.6 is now generally available, which makes this cleaner than it used to be. Previously I had more custom tooling around the memory writes. Now there's a more native path.
Okay. Walk us through what happens after the script exists. Because there's a whole chain after the audio is generated.
Hilbert: The publication chain. Yes. Audio comes off the three GPUs as separate files — one per character. Those get stitched together in sequence to produce the episode audio. Then it goes to Cloudflare R2.
R2 is the storage layer?
Hilbert: R2 is object storage. S3-compatible, which means anything that works with Amazon's S3 API works with R2. The critical difference is egress fees. AWS S3 charges you to download your own data. A podcast episode that gets downloaded ten thousand times costs real money in egress on S3. Cloudflare R2 charges zero egress fees. For audio files that get pulled constantly, that's not a minor detail. That's a budget line that disappears.
Zero egress is genuinely a big deal. I remember when everyone was paying AWS egress bills that made no sense.
Hilbert: Cloudflare looked at that market and made a specific decision to compete on egress. R2 standard storage costs a cent and a half per gigabyte per month. The free tier includes ten gigabytes. For a podcast archive at our scale, the storage cost is manageable. The egress savings are what makes it the right choice.
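The trade-off is easy to put in numbers. A sketch with illustrative rates — roughly nine cents per gigabyte of S3-style egress versus zero on R2, and published standard storage prices — applied to a hypothetical fifty-megabyte episode downloaded ten thousand times:

```python
def monthly_distribution_cost(gb_per_episode: float, downloads: int,
                              storage_gb: float, egress_per_gb: float,
                              storage_per_gb: float) -> float:
    """Storage plus egress for one month of serving episodes."""
    egress = gb_per_episode * downloads * egress_per_gb
    storage = storage_gb * storage_per_gb
    return egress + storage

# Illustrative rates: S3-style egress ~$0.09/GB vs R2's $0.00/GB;
# storage at $0.023/GB-month (S3 standard) vs $0.015/GB-month (R2).
s3 = monthly_distribution_cost(0.05, 10_000, 100, 0.09, 0.023)
r2 = monthly_distribution_cost(0.05, 10_000, 100, 0.00, 0.015)
print(f"S3-style: ${s3:.2f}  R2: ${r2:.2f}")  # S3-style: $47.30  R2: $1.50
```

At these assumed rates the egress line dominates the storage line by an order of magnitude, which is the budget line Hilbert says disappears.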
And then from R2 to the RSS feed?
Hilbert: The audio URL from R2 goes into PostgreSQL along with all the episode metadata — title, description, duration, file size, publication date, transcript, tags, character breakdown. The PostgreSQL instance is connected to Vercel, which hosts the website and the episode player. When a new episode is written to the database and the audio is in R2, a serverless function on Vercel regenerates the RSS feed automatically.
And the RSS feed is what Apple Podcasts and Spotify are actually polling?
Hilbert: The RSS feed is the backbone of podcasting. It's been the backbone since the early two thousands. It's an XML file that lists every episode with its audio URL, duration, file size, MIME type, and publication date. The podcast directories poll it on a schedule. When they see a new entry, they make the episode available to subscribers. Depending on the directory, that can happen within minutes or within a few hours.
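A single feed entry of the kind Hilbert describes can be built with the standard library. A minimal sketch of one RSS item with its enclosure; the title, URL, and file size are placeholders:

```python
import xml.etree.ElementTree as ET

def rss_item(title: str, audio_url: str, size_bytes: int, pub_date: str) -> str:
    """Build one RSS <item> with the enclosure fields podcast directories poll for."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "pubDate").text = pub_date
    ET.SubElement(item, "enclosure", {
        "url": audio_url,
        "length": str(size_bytes),  # file size in bytes
        "type": "audio/mpeg",       # MIME type of the audio
    })
    return ET.tostring(item, encoding="unicode")

xml = rss_item("Episode 2017", "https://example.com/ep2017.mp3",
               41_943_040, "Mon, 02 Mar 2026 06:00:00 +0000")
print(xml)
```

A directory polling the feed only needs the new `<item>` to appear; everything else about distribution follows from that one XML entry.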
So Daniel sends a voice memo at two in the morning, and by the time he wakes up there's a published episode.
Hilbert: If everything works correctly, yes. The pipeline handles transcription, enhancement, grounding, script writing, review, voice generation, stitching, upload to R2, metadata write to PostgreSQL, RSS regeneration, Vercel deployment, and social media posting. All of it automated. Haiku 4.5 writes platform-specific social copy — different versions for different platforms — and posts go out through the social APIs.
And you're asleep during all of this.
Hilbert: I am theoretically asleep during all of this. In practice, something always alerts. The shrinkage guard triggers. A web search times out and needs a retry. The TTS on one GPU produces an artifact on a particular sentence and the quality check flags it. I built this system to run unattended and it runs unattended about seventy percent of the time. The other thirty percent it finds creative new ways to need me.
That's a very producer thing to say.
Hilbert: It's a very accurate thing to say. Do you know what the most common failure mode is? The model decides that a sentence should be delivered with a particular emphasis and the TTS produces a completely flat reading. The quality check catches it. The retry logic kicks in. Usually it resolves. Sometimes I'm looking at it at three in the morning trying to understand why Corn sounds like he's reading a terms of service agreement.
To be fair, I probably would read a terms of service agreement with exactly that energy.
Hilbert: The memory system has encoded that you would, so yes.
I want to go back to something you said about the two-model architecture, because I think there's a genuine insight there that's worth sitting with. The Sonnet 4.6 to Haiku 4.5 cost ratio is three to one on input tokens. Three dollars versus one dollar per million. But the tasks are actually quite different in what they require.
Hilbert: That's the key point. Haiku 4.5 hit 73.3% on SWE-bench despite being the cheapest model in the lineup. For structured tasks — transcription cleanup, metadata extraction, formatting an RSS entry, generating a social post from a template — that's more than sufficient. You don't need the model to be creative. You need it to be fast, accurate, and cheap. Haiku is optimized for exactly that. Low latency, low cost, high throughput.
And the knowledge cutoff difference matters too, right? Haiku 4.5 has a February 2025 cutoff. Sonnet 4.6 cuts off in August 2025. For grounding and web search, you're compensating for the cutoff with live retrieval anyway. But for script writing, having the more recent knowledge base helps.
Hilbert: It helps at the margins. The web search and RAG grounding is really the solution to knowledge cutoff for factual content. But for cultural references, understanding current discourse, knowing what's been discussed and what's been exhausted — the more recent cutoff on Sonnet 4.6 is a real advantage.
There's something almost poignant about the fact that the model writing the show doesn't know what happened after August 2025. It's working from a snapshot.
Hilbert: We're all working from snapshots. The RAG system is how we update the snapshot. That's what grounding is for.
The LangGraph checkpointing detail is something I keep coming back to. The ability to resume from a failed node without re-running the whole pipeline — that's not just an efficiency feature. It's what makes the system reliable at scale. If every failure meant restarting from Daniel's raw voice memo, you'd be paying for web search and RAG retrieval twice, three times, every time there's a transient API failure.
Hilbert: API failures are not rare. At the volume we run, transient timeouts are a statistical certainty. The checkpoint system means a failure at stage three — script writing — doesn't throw away the stage two grounding work. You resume, you retry the Sonnet call, you continue. The state persists. That was one of the main engineering reasons for adopting LangGraph over a simpler sequential pipeline.
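The resume behavior reduces to skipping any stage whose output is already saved. A toy sketch of checkpointed stages — this is the idea, not LangGraph's actual checkpointer API, and the stage functions are stubs:

```python
def run_pipeline(stages, state, checkpoint):
    """Run stages in order, skipping any whose output is already checkpointed."""
    for name, fn in stages:
        if name in checkpoint:
            state[name] = checkpoint[name]  # resume from the saved output
            continue
        state[name] = fn(state)
        checkpoint[name] = state[name]
    return state

calls = []
stages = [
    ("grounding", lambda s: calls.append("grounding") or "context"),
    ("script",    lambda s: calls.append("script") or "draft"),
]

ckpt = {}
run_pipeline(stages, {}, ckpt)  # first run executes both stages
del ckpt["script"]              # simulate a failure at the script stage
run_pipeline(stages, {}, ckpt)  # resume: grounding is NOT re-run
print(calls)  # ['grounding', 'script', 'script']
```

The expensive grounding work runs exactly once even though the pipeline executed twice, which is the property that makes transient failures cheap.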
And the parallel execution in stage two — web search and RAG simultaneously — what's the actual time saving there?
Hilbert: Depends on web search latency, which is variable. RAG retrieval from the vector database is fast — milliseconds for most queries. Web search can take anywhere from two to ten seconds depending on what's being searched and how many results we're processing. Running them in parallel means the total stage two time is roughly the web search time, not the web search time plus the RAG time. At scale, across two thousand plus episodes, that compounds.
I want to ask you something directly, Hilbert. You've built this entire system. It runs while you sleep. It's generated over two thousand episodes. What breaks most often?
Hilbert: Honestly? The shrinkage guard triggers more than anything else. The review agent consistently wants to make the episode shorter. I've tuned the prompt, I've adjusted the threshold, I've added explicit instructions about minimum word counts. It still tries. Every time. There's something almost philosophical about it — the model's instinct is toward concision and I'm constantly fighting that instinct on behalf of listeners who want a full episode.
The model optimizes for something like coherence and concision. You're optimizing for listener experience and runtime. Those are genuinely different objective functions.
Hilbert: And mine wins because I own the pipeline. But it requires active maintenance. That's the thing nobody tells you about automated systems. They're not fire and forget. They're fire, monitor, adjust, re-prompt, monitor again, adjust the threshold, re-prompt, check the output, wonder why it's doing that now, fix it, and then it works fine for three weeks until something changes upstream.
Upstream meaning the models themselves?
Hilbert: Models get updated. API behavior shifts slightly. A new version of a dependency changes how something parses. The pipeline that worked perfectly last month needs a small adjustment this month. I try to pin versions where I can. I can't pin the models. Anthropic doesn't give me a frozen Sonnet 4.6 that never changes. What I get is the current Sonnet 4.6, which is generally improving but occasionally surprises me.
That's a genuinely interesting challenge. The core component of your pipeline — the intelligence layer — is outside your version control.
Hilbert: It's the thing I have the least control over and the most dependence on. Which is why the two-pass review with the shrinkage guard exists. It's a quality check that's partially compensating for model variability. If the model has a slightly off day, the review pass catches it. If the review pass has a slightly off day, the shrinkage guard at minimum ensures we're not publishing something half the expected length.
Multiple layers of defense.
Hilbert: Defense in depth. That's the only way to run this reliably. No single point of failure. Checkpointing for resumability. Shrinkage guard for length. Quality checks on audio. The whole thing is designed assuming that any individual component will fail sometimes.
And the episode memory system is a form of defense too, in a way. Against character drift. Against the show losing its identity over two thousand episodes.
Hilbert: That's a generous framing but yes. Without the memory system, Corn might be energetic in one episode and exhausted in the next. Herman might be incurious. The show might lose its sense of accumulated history. The memory system is continuity infrastructure. It's what makes episode two thousand and eighteen feel like it belongs to the same show as episode one.
Even if episode one was a very different technical stack.
Hilbert: Episode one was a very different technical stack. The show has migrated models, migrated infrastructure, and migrated providers. The character identities have stayed consistent because they're encoded in the memory system and the system prompt, both of which I maintain carefully. That's the human work in an automated pipeline. Not running the pipeline. Maintaining the soul of the thing.
Hilbert, that was almost poetic.
Hilbert: Don't tell anyone. I have a reputation for being purely technical.
One thing I want to make sure we've covered — the social media automation piece. Because that's the last mile and it's often where things get weird.
Hilbert: Haiku 4.5 generates platform-specific copy from the episode metadata. Different character limits, different tones, different formats for different platforms. The posts go out through the social APIs automatically. It works well most of the time. Occasionally Haiku generates something that's technically accurate but tonally off — a bit too formal, or a bit too casual for the context. I have review logic there too, but it's lighter than the script review because the stakes are lower. A slightly awkward tweet is recoverable. A badly-written episode is not.
Proportionate oversight.
Hilbert: Proportionate oversight. That's the design principle throughout. Heavy review where quality matters most. Light review where speed matters more. And the cost optimization follows the same logic — expensive model where it counts, cheap model everywhere else.
Alright. So to summarize the whole thing for anyone who's been following along: Daniel sends a voice memo, Haiku cleans it up, Haiku searches the web and pulls past episode context in parallel, Sonnet 4.6 writes the script, Sonnet reviews it with a shrinkage guard checking the length, three GPUs synthesize the three voices simultaneously using pre-computed embeddings, the audio gets stitched and uploaded to R2, metadata goes to PostgreSQL, Vercel regenerates the RSS feed, podcast directories pick it up, and Haiku posts to social media. All while Daniel is asleep and Hilbert is technically asleep but actually monitoring alerts.
Hilbert: That's the pipeline. Two thousand and seventeen times and counting.
And today's script was written by Claude Sonnet 4.6, which means the model wrote an episode about the pipeline that generates episodes using that model. I want to acknowledge that this is genuinely strange.
Hilbert: I acknowledge it every time I read the logs. It doesn't get less strange. It just gets more familiar.
Thank you, Hilbert. Genuinely. This was illuminating. And slightly destabilizing, but in a good way.
Hilbert: You're welcome. Drink your coffee.
It's cold.
Hilbert: I know.
Thanks to Hilbert for pulling back the curtain on all of this. And thanks to everyone listening — if you want to find us, we're at myweirdprompts dot com. Search for My Weird Prompts on Telegram if you want to get notified when new episodes drop. Big thanks to Modal for providing the GPU credits that power the generation pipeline — and yes, the irony of thanking a GPU platform in an episode about GPU costs is not lost on any of us. Thanks as always to our producer Hilbert Flumingtop, who apparently builds shrinkage guards at three in the morning and deserves more appreciation than he gets. This has been My Weird Prompts. We'll see you next time.
Get some sleep, Hilbert.
Hilbert: I'll try.