My Weird Prompts

A Voice-to-Podcast Pipeline

Technical Architecture White Paper

Daniel Rosehill — April 2026 — Version 2.0

Pipeline V4 – Chatterbox TTS on Modal

Episode Dataset

Full episode archive on Hugging Face: transcripts, metadata, audio URLs, and embeddings.

Abstract

My Weird Prompts is a fully automated podcast pipeline that transforms prompts into polished, multi-voice podcast episodes. A user submits a topic — via the Recorder PWA, Telegram bot, or MCP admin server — and within 15–20 minutes a complete episode is published with AI-generated dialogue, original cover art, loudness-normalised audio, show notes, and an RSS feed entry. The pipeline supports over a dozen episode formats including standard discussions, multi-voice panels, roundtables, debates, news briefings, and interviews.

The system is built on a serverless architecture using Modal for GPU compute, Cloudflare R2 for object storage, Neon PostgreSQL for metadata, and Vercel for static site hosting. GPU compute is generously sponsored by Modal, whose support makes this experiment possible.

This document describes the production architecture as of April 2026, including the LangGraph pipeline, the multi-model A/B testing strategy, the parallel TTS architecture, the safety and fault tolerance mechanisms, and the cost model.

Project Links

Website: myweirdprompts.com
Dataset: Hugging Face
Recorder: recorder.myweirdprompts.com

1. Introduction

1.1 The Concept

My Weird Prompts (MWP) is an experiment in full-stack AI automation: a podcast where every episode begins with a single voice memo and ends as a published, multi-voice audio show — with no manual editing in between.

The show features two AI hosts:

Corn — a relaxed, knowledgeable sloth who leads the conversation
Herman — an enthusiastic, curious donkey who asks follow-up questions

Prompts are submitted by Daniel Rosehill (the show's producer) via three interfaces: a mobile Progressive Web App (Recorder), a Telegram bot (@myweirdpromptsbot), or the MCP admin server integrated with Claude Code. The pipeline transcribes the prompt, researches the topic, writes a full dialogue script, generates cover art, synthesises speech with cloned voices, assembles a broadcast-ready episode, and publishes it to the web and podcast platforms.

1.2 Design Goals

The pipeline was designed around several principles:

Zero human editing — every stage is automated, from transcription to publication.
Broadcast-quality output — loudness-normalised to -16 LUFS (measured per EBU R128, whose own broadcast target is -23 LUFS), with proper intro/outro, disclaimer, and credits.
Fail-open safety — non-critical failures (cover art, polish pass) degrade gracefully rather than aborting the episode.
Cost efficiency — the entire pipeline runs for under $0.50 per episode, using A10G GPUs and competitive LLM pricing via OpenRouter.
Full observability — progress tracking, email notifications, job queuing, and recovery storage for failed episodes.

1.3 The Cast

Character	Description
Corn (Host)	A laid-back sloth with deep knowledge across topics. Leads conversations with measured insight and dry humour.
Herman (Co-host)	An energetic donkey who asks the questions listeners are thinking. Brings enthusiasm and follow-up curiosity.
Daniel (Producer)	The human behind the curtain. Submits voice prompts and maintains the pipeline. Occasionally acknowledged by the hosts.

2. System Architecture

Figure: isometric view of the serverless cloud architecture (compute, storage, and hosting components)

2.1 High-Level Flow

The production system consists of six deployed components connected by webhooks and shared storage:

2.2 Deployment Topology

Component	Platform	URL
Recorder PWA	VPS (Docker)	recorder.myweirdprompts.com
Pipeline Webhook	Modal (serverless)	modal.run/.../webhook/generate
TTS Workers	Modal (A10G GPUs)	Internal (3 parallel workers)
Telegram Bot	Modal (serverless)	@myweirdpromptsbot
MCP Admin Server	Local (stdio)	Claude Code integration
Frontend Website	Vercel (SSG)	myweirdprompts.com
Admin CMS	Vercel (Next.js)	admin.myweirdprompts.com
Object Storage	Cloudflare R2	episodes.myweirdprompts.com
Database	Neon PostgreSQL	Serverless Postgres
Archival Storage	Wasabi S3	EU-Central-2 bucket

2.3 Infrastructure Stack

The pipeline uses exclusively serverless and managed services, with no dedicated servers beyond the recorder VPS:

Compute: Modal (serverless containers with GPU scheduling)
Storage: Cloudflare R2 (S3-compatible, zero egress fees), Wasabi (archival)
Database: Neon PostgreSQL (serverless, auto-scaling)
Hosting: Vercel (static site generation from Astro)
CI/CD: GitHub Actions (auto-deploy on push to main)
LLM Gateway: OpenRouter (unified API for multi-model A/B testing)
Telegram Bot: Modal (serverless webhook for @myweirdpromptsbot)
MCP Server: Local stdio server for Claude Code admin integration
DNS/CDN: Cloudflare (custom domains, caching)

3. Pipeline Stages

Each episode passes through 12 stages. The full pipeline runs in a single Modal container (orchestrator) that spawns GPU workers for TTS. Total wall-clock time is typically 15–20 minutes.

Stage 1: Audio Ingestion & Validation

The pipeline receives an audio URL (typically from the Recorder PWA via Cloudflare R2) and performs initial validation:

Download: HTTP GET with 120-second timeout and retry with exponential backoff
Size check: Files under 1 KB are rejected as invalid
Format support: MP3, WAV, WebM, OGG, FLAC, AAC, M4A (max 50 MB)

The audio is saved to a Modal shared volume for processing.
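The validation rules above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the names `validate_audio` and `backoff_delays`, the attempt count, and the base delay are assumptions. Only the 1 KB floor, the 50 MB ceiling, and the format list come from the text.

```python
from pathlib import PurePosixPath

# Limits taken from the stage description above.
MIN_BYTES = 1024                 # files under 1 KB are rejected
MAX_BYTES = 50 * 1024 * 1024     # 50 MB upper bound
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".webm", ".ogg", ".flac", ".aac", ".m4a"}


def validate_audio(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file fails the format or size checks."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if size_bytes < MIN_BYTES:
        raise ValueError("file too small to be valid audio")
    if size_bytes > MAX_BYTES:
        raise ValueError("file exceeds 50 MB limit")


def backoff_delays(attempts: int = 3, base_seconds: float = 2.0) -> list[float]:
    """Delays between retries: base, 2x base, 4x base, ... (hypothetical values)."""
    return [base_seconds * (2 ** i) for i in range(attempts - 1)]
```

A download helper would sleep for each delay in turn between failed HTTP attempts, then call `validate_audio` on the saved file.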

Stage 2: Transcription

The raw audio is transcribed using Google Gemini's multimodal API (model: gemini-3-flash-preview). Unlike a pure speech-to-text service, Gemini listens to the audio and produces a cleaned transcript:

Removes filler words (um, uh, like, you know)
Eliminates false starts and repetitions
Preserves core meaning, tone, and intent
Supports disambiguation hints for technical terms

This multimodal approach captures nuances that pure ASR misses — tone, emphasis, and context.

Stage 3: Research Coordination

A grounding agent researches the topic using multiple sources:

Tavily web search for current events and facts
pgvector RAG against past episodes for semantic deduplication
Episode memory for cross-references to recent episodes
Extracts key topics, entities, and generates search queries

All grounding sub-stages fail open: if any source is unavailable, the pipeline continues with whatever context it has.
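One way to picture the fail-open behaviour of the grounding sources is the sketch below. The function name `gather_context` and the idea of passing each source as a zero-argument callable are assumptions made for illustration; the real agent's interfaces are not described in this paper.

```python
import logging
from typing import Callable

logger = logging.getLogger("grounding")


def gather_context(sources: dict[str, Callable[[], str]]) -> dict[str, str]:
    """Call each grounding source; skip any that raise instead of aborting."""
    context: dict[str, str] = {}
    for name, fetch in sources.items():
        try:
            context[name] = fetch()
        except Exception as exc:  # fail open: log the problem and carry on
            logger.warning("grounding source %s failed: %s", name, exc)
    return context
```

If every source fails, the result is simply an empty dictionary and script generation falls back on the model's own knowledge.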

Stage 4: Episode Planning

A dedicated planning agent (xiaomi/mimo-v2-flash via OpenRouter) creates a structured episode outline before script generation:

Segment breakdown with specific points to cover
Key facts and data to incorporate
Misconceptions to address
Cross-episode references from the episode memory system
Tone and pacing guidance

The plan is formatted as a structured prompt section that the script generator follows as a roadmap. This produces more coherent, well-structured episodes than unguided generation.

The planning agent fails open — if it returns invalid JSON or errors, the pipeline continues without a plan.

Stage 5: Script Generation

The core creative step. Uses a randomised pool of frontier LLMs via OpenRouter for A/B testing across model families:

Model	Family
xiaomi/mimo-v2-pro	Xiaomi
deepseek/deepseek-v3.2	DeepSeek
minimax/minimax-m2.7	MiniMax
google/gemini-3-flash-preview	Google

The model receives:

Grounding context from web search and past episode RAG
Episode plan providing the structural roadmap
Episode memory including the 3 most recent episodes for cross-references
Date context ensuring the model uses the correct current date

The target output is a diarized dialogue script (~3,750 words / ~25 minutes) in the format:

CORN: [dialogue text]
HERMAN: [dialogue text]

Key parameters: max_tokens=8000, temperature=0.8.
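To make the script format concrete, here is a small parser sketch that turns the diarized text into (speaker, text) segments for later stages. The name `parse_script` and the handling of unlabelled lines are assumptions, and only the two host labels shown above are recognised; multi-voice formats would need more labels.

```python
import re

# Matches lines such as "CORN: text" or "HERMAN: text".
SPEAKER_LINE = re.compile(r"^\s*(CORN|HERMAN):\s*(.+?)\s*$")


def parse_script(script: str) -> list[tuple[str, str]]:
    """Split a diarized script into (speaker, text) segments.

    Lines without a speaker label are treated as continuations of the
    previous segment; this handling is an assumption.
    """
    segments: list[tuple[str, str]] = []
    for line in script.splitlines():
        if not line.strip():
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            segments.append((match.group(1), match.group(2)))
        elif segments:
            speaker, text = segments[-1]
            segments[-1] = (speaker, f"{text} {line.strip()}")
    return segments
```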

Why multi-model? Rotating script generation across model families prevents the show from developing a single model's stylistic fingerprint. Each family brings different strengths — some excel at humour, others at technical depth. The review pass always uses a different family than the writer, creating genuine editorial tension.

Stage 6: Script Review (Pass 1)

A cross-family LLM review using a model from a different family than the script writer (candidates: DeepSeek, Qwen, Xiaomi MiMo). The review is additive-only — it adds depth but never truncates:

Fact-checking: Verifies claims against live web sources
Plan adherence: Ensures all planned segments are covered
Depth check: Adds substance where the script is thin
TTS compliance: Fixes formatting that would confuse text-to-speech

The review agent receives the full script, original transcript, and episode plan. It returns the edited script as raw text (no JSON wrapping).
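The cross-family constraint could be expressed along these lines. This is a sketch under assumptions: the writer pool is copied from the Stage 5 table, the reviewer families are the candidates named at the start of this stage, and `pick_writer_and_reviewer` is a hypothetical helper. Concrete reviewer model IDs are not given in this paper, so only family names are returned.

```python
import random

# Writer pool from the Stage 5 table (model id -> family).
WRITER_POOL = {
    "xiaomi/mimo-v2-pro": "Xiaomi",
    "deepseek/deepseek-v3.2": "DeepSeek",
    "minimax/minimax-m2.7": "MiniMax",
    "google/gemini-3-flash-preview": "Google",
}
# Reviewer candidate families named for Stage 6.
REVIEW_FAMILIES = ["DeepSeek", "Qwen", "Xiaomi"]


def pick_writer_and_reviewer(rng: random.Random) -> tuple[str, str]:
    """Return (writer model id, reviewer family) drawn from different families."""
    writer = rng.choice(sorted(WRITER_POOL))
    writer_family = WRITER_POOL[writer]
    candidates = [f for f in REVIEW_FAMILIES if f != writer_family]
    return writer, rng.choice(candidates)
```

Because the reviewer list contains three distinct families, at least two candidates always remain after the writer's family is excluded.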

Safety mechanisms:

Shrinkage guard: Rejects edits that reduce the script by more than 20%
Minimum length: Rejects output under 1,000 characters
Fail-open: Returns the original script if anything goes wrong

Parameters: temperature=0.4, max_tokens=10000.
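The three safety mechanisms combine naturally into one wrapper around the review call. The sketch below assumes length is measured in characters (the paper does not say whether characters or words are compared) and uses a hypothetical `guarded_edit` helper, where `edit_fn` stands in for the LLM call.

```python
def guarded_edit(original: str, edit_fn, max_shrink: float = 0.20,
                 min_chars: int = 1000) -> str:
    """Apply edit_fn to the script, falling back to the original when unsafe."""
    try:
        edited = edit_fn(original)
    except Exception:
        return original                      # fail open on any error
    if not isinstance(edited, str) or len(edited) < min_chars:
        return original                      # minimum-length guard
    if len(edited) < len(original) * (1 - max_shrink):
        return original                      # shrinkage guard
    return edited
```

The same wrapper would serve the polish pass by passing `max_shrink=0.15`.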

Stage 7: Script Polish (Pass 2)

A cleanup pass that combines deterministic, regex-based verbal tic removal with a light LLM polish:

Verbal tic removal: Reduces overuse of "Exactly", "Absolutely", "That's a great point"
Sign-off cleanup: Ensures no questions or new topics after goodbye
Flow improvement: Smooths transitions and pacing
TTS final check: Catches remaining formatting issues

This pass does not change facts or substance — only dialogue naturalness.

Safety mechanisms:

Shrinkage guard: Rejects output if script shrinks by more than 15%
Fail-open: Returns the original script on any error

Parameters: temperature=0.3, max_tokens=10000.
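The regex half of this pass might look something like the sketch below. The tic phrases come from the list above, but the allowance of two occurrences, the `reduce_tics` name, and the choice to strip only line-opening tics are all assumptions made for illustration.

```python
import re

# Openers named in the text; the allowance below is an assumption.
TICS = ["Exactly", "Absolutely", "That's a great point"]
TIC_OPENER = re.compile(
    r"^(?P<label>[A-Z]+:\s*)(?:" + "|".join(map(re.escape, TICS))
    + r")[.,!]?\s+(?P<rest>\S.*)$"
)


def reduce_tics(script: str, allowed: int = 2) -> str:
    """Keep the first `allowed` tic openers, strip the opener from later lines."""
    seen = 0
    out = []
    for line in script.splitlines():
        match = TIC_OPENER.match(line)
        if match:
            seen += 1
            if seen > allowed:
                rest = match.group("rest")
                line = match.group("label") + rest[0].upper() + rest[1:]
        out.append(line)
    return "\n".join(out)
```

A line consisting of nothing but the tic is left untouched, so the pass never produces an empty dialogue turn.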

Figure: two-pass document review pipeline with fact-checking and quality gates

Stage 8: Metadata Generation

Uses xiaomi/mimo-v2-flash (via OpenRouter) to generate episode metadata from the final script:

Title: Concise, engaging episode title
Slug: URL-safe identifier
Description: 2–3 sentence summary
Excerpt: One-line teaser (for social media)
Tags: Dynamic taxonomy from a registry of canonical tags
Category/Subcategory: Hierarchical classification
Image prompt: Description for cover art generation
Embedding: Semantic vector for similarity search

Tags are generated using a taxonomy-aware system that maintains consistency across episodes and prevents tag sprawl.

Stage 9: Cover Art Generation

Uses Fal AI (fal-ai/flux/schnell) to generate a unique cover image:

Model receives the image prompt from metadata generation
Generates one cover art variant
Image uploaded to Cloudflare R2

Cover art is non-critical — if generation fails, the pipeline continues with a default cover image (graceful degradation).

Figure: parallel GPU workers processing text-to-speech segments simultaneously

Stage 10: Text-to-Speech (Parallel GPU Workers)

The most compute-intensive stage. Uses Chatterbox TTS (regular, not Turbo) running on Modal A10G GPUs.

Architecture

Orchestrator (CPU): splits segments across workers
↓
Worker 1 (A10G): ~27 segments
Worker 2 (A10G): ~27 segments
Worker 3 (A10G): ~26 segments
↓
Results merged and concatenated with ffmpeg
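The 27/27/26 split in the diagram is what a near-equal contiguous partition of 80 segments produces. A minimal sketch, assuming contiguous batches (the paper does not say whether the real orchestrator slices contiguously or round-robins) and a hypothetical `split_batches` helper:

```python
def split_batches(segments: list, workers: int = 3) -> list[list]:
    """Split segments into `workers` contiguous batches of near-equal size."""
    base, extra = divmod(len(segments), workers)
    batches, start = [], 0
    for i in range(workers):
        # The first `extra` workers take one additional segment each.
        size = base + (1 if i < extra else 0)
        batches.append(segments[start:start + size])
        start += size
    return batches
```

Contiguous batches keep the merge step simple: concatenating the workers' results in worker order restores the original segment order.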

Key Optimisations

Pre-computed voice conditionals: Voice embeddings are computed once and cached in R2, eliminating 5–10 seconds of processing per segment.
Parallel workers: Segments are distributed across 3 A10G GPU workers (configurable). Each worker loads the model once and processes its entire batch, amortising model loading cost.
Chunk splitting: Long segments (>250 characters) are split at sentence boundaries to avoid Chatterbox's ~40-second audio output limit.
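Chunk splitting can be sketched as a greedy packer that keeps whole sentences together. The 250-character threshold is from the text; the `split_segment` name, the sentence-boundary regex, and the decision to leave a single over-long sentence intact are assumptions.

```python
import re

MAX_CHARS = 250  # threshold from the text


def split_segment(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact (an assumption;
    the real pipeline may split further).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is synthesised separately and the resulting audio is joined back into one segment.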

Quality Choice: Regular vs Turbo

The pipeline uses Chatterbox Regular rather than Chatterbox Turbo, despite Turbo being faster. Testing showed Regular produces ~95% fewer TTS hallucinations (random word injection, phrase repetition, audio artifacts). For long-form content like podcast episodes, quality is worth the speed tradeoff.

Failure Handling

Segments that fail TTS are tracked but don't abort the episode
20% failure threshold: If more than 20% of segments fail, the entire episode is aborted to prevent short/broken output
Failed segments produce silence gaps rather than corrupted audio

Stage 11: Audio Assembly

The final audio is assembled from pre-recorded show elements and generated content:

Order	Component
1	Intro jingle (pre-recorded music)
2	AI-generated disclaimer
3	"Here's Daniel's prompt!" announcement
4	Original user prompt audio
5	Whoosh transition sound
6	AI dialogue (Corn & Herman)
7	LLM credit announcement
8	TTS engine credit announcement
9	Outro jingle

Processing pipeline:

All components converted to consistent format (44.1 kHz, mono, 16-bit PCM)
Concatenated via ffmpeg
Single-pass loudness normalisation to -16 LUFS (EBU R128 measurement) with a -1.5 dBTP true-peak ceiling
Encoded as MP3 at 96 kbps (transparent for speech, ~50% smaller than 192k)
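The normalisation and encoding steps map onto a single ffmpeg invocation. The sketch below only builds the argument list; it assumes ffmpeg's `loudnorm` filter and the `libmp3lame` encoder are what the pipeline uses, which the paper does not state, and `build_master_command` is a hypothetical name.

```python
def build_master_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg argument list for loudness normalisation and MP3 encoding."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-af", "loudnorm=I=-16:TP=-1.5",  # integrated loudness and true-peak targets
        "-ar", "44100",                    # 44.1 kHz sample rate
        "-ac", "1",                        # mono
        "-c:a", "libmp3lame",
        "-b:a", "96k",                     # 96 kbps MP3
        dst,
    ]
```

The list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with file names.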

Stage 12: Publication & Distribution

The final stage publishes the episode across multiple systems:

Cloudflare R2: Audio file, cover art, transcript PDF, and waveform peaks uploaded
Neon PostgreSQL: Episode metadata, tags, category, embedding, and transcript inserted
Vercel Deploy Hook: Triggers a rebuild of the Astro static site (with retry logic, up to 3 attempts)
Wasabi S3: Full episode backed up to archival storage
n8n Webhook: Post-publication webhook triggers downstream syndication (Telegram, social media)

Quality gates before publication:

Duration check: Episodes under 10 minutes are rejected (ffprobe validation, with file-size fallback)
Script length: Minimum 2,000 words required before TTS
Segment count: Minimum 10 dialogue segments required

4. Safety & Fault Tolerance

The pipeline is designed to be resilient to partial failures. Non-critical stages fail open, while critical failures are caught and reported.

4.1 Fail-Open Architecture

Several pipeline stages are non-critical and degrade gracefully:

Stage	On Failure	Impact
Episode Planning	Continue without plan	Less structured script
Research Coordinator	Continue without research	Relies on model knowledge
Script Review (Pass 1)	Use original script	No fact-checking pass
Script Polish (Pass 2)	Use reviewed script	May have verbal tics
Cover Art	Use default cover image	Generic episode artwork
Waveform Peaks	Skip peaks	No waveform visualisation
Wasabi Backup	Skip archival	No off-site backup
Prompt Backup	Skip prompt archive	Prompt not archived

4.2 Quality Gates

Hard failures that prevent publication:

Script too short: < 2,000 words (model returned truncated or refused response)
Too few segments: < 10 dialogue segments (script didn't match expected format)
Episode too short: < 10 minutes duration (TTS failure produced short audio)
TTS failure rate: > 20% of segments failed (systemic TTS problem)
Audio download failure: File < 1 KB or download timeout > 120s
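The numeric gates can be collected into one check, sketched below with the thresholds listed above. The `EpisodeStats` dataclass and `gate_failures` function are hypothetical; in the real pipeline these checks run at different points (script length before TTS, duration after assembly) rather than in a single call.

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    script_words: int
    segment_count: int
    duration_minutes: float
    failed_segments: int


def gate_failures(stats: EpisodeStats) -> list[str]:
    """Return the list of hard-failure reasons; an empty list means publishable."""
    failures = []
    if stats.script_words < 2000:
        failures.append("script too short")
    if stats.segment_count < 10:
        failures.append("too few segments")
    if stats.duration_minutes < 10:
        failures.append("episode too short")
    if stats.segment_count and stats.failed_segments / stats.segment_count > 0.20:
        failures.append("TTS failure rate too high")
    return failures
```

Returning every failed gate, rather than stopping at the first, gives the error email a complete picture of what went wrong.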

4.3 Shrinkage Guards

Both editing passes include shrinkage guards to prevent the LLM from accidentally truncating the script:

Pass 1 (Review): Rejects output if more than 20% shorter than input
Pass 2 (Polish): Rejects output if more than 15% shorter than input

This was implemented after early testing showed that review agents sometimes returned drastically shortened "corrected" scripts.

4.4 Recovery Storage

If an episode passes all quality gates but fails during publication (R2 upload failure, database error), the complete episode is saved to a recovery folder in R2:

All generated files (audio, cover art, script, metadata) are preserved
Recovery script (pipeline/scripts/recover_episodes.py) can republish failed episodes
Error notifications are sent via email with recovery path details

4.5 Zombie Job Prevention

A top-level try/except around the entire pipeline ensures that all crashes result in the job being marked as failed in the database. Before this was implemented, pre-publication crashes would leave jobs in "running" status indefinitely.
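In outline, the guard looks like the sketch below. The `run_job` wrapper and the `set_status` callback are hypothetical stand-ins for the real database update, and the "completed" status name is an assumption ("running" and "failed" are the statuses mentioned above).

```python
from typing import Callable


def run_job(job_id: str, pipeline: Callable[[str], None],
            set_status: Callable[[str, str], None]) -> bool:
    """Run the pipeline and always leave the job in a terminal status."""
    set_status(job_id, "running")
    try:
        pipeline(job_id)
    except Exception:
        # Any crash is recorded, so the job never stays "running".
        set_status(job_id, "failed")
        return False
    set_status(job_id, "completed")
    return True
```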

4.6 Notification System

Generation started: Email sent when script generation begins (includes title)
Error notification: Email sent on any failure (includes error details and recovery path)
Job status API: Real-time progress via /status/{job_id} endpoint

5. Cost Analysis

The pipeline is designed for minimal per-episode cost. All pipeline compute runs on serverless infrastructure, so the only fixed costs are domain registration and the recorder VPS.

5.1 Per-Episode Cost Breakdown

Service	Cost	Notes
Modal TTS (3 × A10G)	~$0.28	3 workers × ~5 min × $1.10/hr
Modal Orchestrator (CPU)	~$0.01	~15 min × $0.04/hr
OpenRouter LLM (script + review)	~$0.05	Multi-model pool, competitive pricing
Transcription + utilities	Minimal	Gemini multimodal + lightweight models
Fal AI (cover art)	~$0.01	Flux Schnell, single image
Cloudflare R2	Free	Free egress, minimal storage
Neon PostgreSQL	Free	Within free-tier limits
Vercel	Free	Hobby plan sufficient
Total per Episode	~$0.30–0.45	Varies with episode length
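The arithmetic behind the table can be reproduced in a few lines. All rates and durations are the approximate figures quoted in the rows above; the variable names are illustrative only.

```python
A10G_PER_HOUR = 1.10   # USD, rate quoted in the TTS row
CPU_PER_HOUR = 0.04    # USD, rate quoted in the orchestrator row

tts_cost = 3 * (5 / 60) * A10G_PER_HOUR        # 3 workers x ~5 min each
orchestrator_cost = (15 / 60) * CPU_PER_HOUR   # ~15 min of CPU
llm_cost = 0.05                                # script + review estimate
cover_art_cost = 0.01                          # single Flux Schnell image

total = tts_cost + orchestrator_cost + llm_cost + cover_art_cost
print(f"TTS ${tts_cost:.3f}, orchestrator ${orchestrator_cost:.3f}, total ${total:.3f}")
```

With these inputs the total falls inside the quoted range, and GPU time for TTS accounts for most of it.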

5.2 GPU Pricing Reference

GPU	Per Second	Per Hour
T4	$0.000164	~$0.59
A10G (current)	$0.000306	~$1.10
L4	$0.000222	~$0.80
A100 (40 GB)	$0.001012	~$3.64

The A10G was chosen as the best balance of speed and cost — roughly 2× faster than T4 at nearly the same total cost per episode, since wall-clock time halves.
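The trade-off can be checked with the table's own rates. The sketch assumes a hypothetical job that needs one GPU-hour on a T4 and, following the roughly 2× figure above, half that on an A10G.

```python
PER_SECOND = {"T4": 0.000164, "A10G": 0.000306, "L4": 0.000222, "A100 (40 GB)": 0.001012}

# Hourly rate is the per-second rate times 3600.
per_hour = {gpu: rate * 3600 for gpu, rate in PER_SECOND.items()}

# Hypothetical job: 1 GPU-hour on a T4, half that on an A10G.
t4_job_cost = per_hour["T4"] * 1.0
a10g_job_cost = per_hour["A10G"] * 0.5
```

Under that assumption the A10G job costs about $0.55 against about $0.59 on the T4, so the faster GPU is no more expensive per episode.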

5.3 Monthly Cost at Scale

At the current publication rate of approximately 5–10 episodes per week:

Weekly compute: $1.50–4.00
Monthly compute: $6–16
Annual compute: $72–192

Modal generously sponsors the show with GPU credits, covering compute costs for episode generation.

6. Technology Stack

Category	Service	Role
LLM Gateway	OpenRouter	Unified API for all LLM calls: script generation (randomised model pool), review, planning, metadata, tagging
Script Models	Xiaomi MiMo, DeepSeek, MiniMax, Gemini	Randomised pool for A/B testing across model families
Utility Models	xiaomi/mimo-v2-flash	Planning, metadata, tagging, embeddings (via OpenRouter)
Transcription	Gemini 3 Flash Preview	Multimodal audio transcription
TTS	Chatterbox Regular	Voice-cloned speech synthesis (3 parallel A10G GPU workers)
Image Generation	Fal AI (Flux Schnell)	Cover art generation
Compute	Modal	Serverless GPU containers, job queuing, parallel workers
Object Storage	Cloudflare R2	Audio, images, transcripts, voice conditionals, show elements
Archival Storage	Wasabi S3	Long-term episode and prompt backup
Database	Neon PostgreSQL	Episode metadata, job tracking, tags, embeddings
Web Framework	Astro	Static site generator for podcast website
Web Hosting	Vercel	Static site hosting with deploy hooks
Admin CMS	Next.js	Episode management, storage cleanup, deploy triggers
Recorder	FastAPI + Vanilla JS	PWA for voice recording and upload
Telegram Bot	Modal (serverless)	@myweirdpromptsbot — text/voice episode submission with inline keyboard for episode type selection
MCP Admin Server	Python (stdio)	Claude Code integration for episode generation, management, analytics, and admin operations
Pipeline Orchestration	LangGraph	StateGraph-based pipeline with modular nodes for each stage
Audio Processing	FFmpeg	Format conversion, concatenation, loudness normalisation
CI/CD	GitHub Actions	Auto-deploy pipeline, recorder, and website on push to main
Notifications	Resend	Email notifications for generation status and errors
Syndication	n8n	Post-publication webhook for Telegram and social media

7. Lessons Learned

7.1 Chatterbox Regular vs Turbo

When the pipeline first adopted Chatterbox TTS, the Turbo variant was used for speed. Testing revealed that Turbo produces significantly more hallucinations: random word injection, phrase repetition, and audio artifacts. Switching to Chatterbox Regular eliminated approximately 95% of these issues. The occasional hallucination that still occurs with Regular is minor enough to go unnoticed in a 20–25 minute episode.

Takeaway: For long-form audio content, model quality matters more than speed. A nearly hallucination-free 12-minute TTS pass is better than an 8-minute pass with artifacts scattered throughout.

7.2 Parallel Workers + Cached Conditionals

Two optimisations reduced TTS time from 36+ minutes to approximately 10 minutes:

Pre-computed voice conditionals: Processing voice samples on every segment added 5–10 seconds of overhead each. Pre-computing embeddings once and caching them in R2 eliminates this entirely.
Parallel workers: Instead of processing 80 segments sequentially on one GPU, distributing them across 3 A10G workers (configurable) provides near-linear speedup. Each worker loads the model once and processes its entire batch.

Takeaway: For embarrassingly parallel workloads like segment-level TTS, the overhead of distributing work across workers is negligible compared to the speedup. Modal's starmap API makes this trivially easy.

7.3 The Two-Pass Editing System

The pipeline originally used a single verification agent (Perplexity Sonar via OpenRouter) to fact-check scripts. This caused a production failure when the agent returned a 169-word "corrected script" instead of the full 4,000-word script. The pipeline published this truncated output as an episode.

The replacement two-pass system was designed with specific safeguards:

Raw output only: Both passes return the complete script as raw text, not wrapped in JSON (which was causing truncation)
Shrinkage guards: Automatic rejection if the script shrinks too much
Fail-open: Both passes return the original script on any error
Cross-family review: The review model is always from a different family than the script writer, creating genuine editorial tension rather than self-reinforcement

Takeaway: When an LLM is editing another LLM's output, explicit length validation is essential. Agents will sometimes "summarise" instead of "edit" if not carefully constrained.

7.4 Episode Memory

The pipeline includes an episode memory system that provides context about recent episodes for cross-referencing. After experimentation, the context window was limited to only the 3 most recent episodes:

More episodes led to excessive cross-references that felt forced
The hosts now direct listeners to search the website for older episodes
Semantic search finds contextually relevant past episodes (not just chronologically recent ones)

Takeaway: More context is not always better. A focused, relevant subset produces more natural references than a comprehensive history.

Appendix: Pipeline Stage Summary

No.	Stage	Model / Tool	Fail Mode
1	Audio Ingestion	HTTP + ffprobe	Hard fail
2	Transcription	Gemini 3 Flash Preview	Hard fail
3	Research / Grounding	Tavily + pgvector RAG	Fail-open
4	Episode Planning	xiaomi/mimo-v2-flash (OpenRouter)	Fail-open
5	Script Generation	Randomised model pool (OpenRouter)	Hard fail
6	Review	Cross-family LLM (OpenRouter)	Fail-open
7	Polish	Regex + LLM cleanup	Fail-open
8	Metadata	xiaomi/mimo-v2-flash (OpenRouter)	Hard fail
9	Cover Art	Fal AI (Flux Schnell)	Fail-open
10	TTS	Chatterbox Regular (3× A10G GPU)	20% threshold
11	Audio Assembly	FFmpeg	Hard fail
12	Publication	R2 + Neon + Vercel	Recovery storage
