Archive: This documents the V3 pipeline (Chatterbox TTS via fal.ai). See Pipeline for the current implementation.
Episode Generation Pipeline
The complete workflow from voice prompt to published podcast episode
V3 introduces 4 voice-cloned characters, call-in segments, sketchy ads, voice sample caching, and optimized parallel TTS processing.
Pipeline Overview
Architecture
1. Voice Input - Audio prompts recorded via the Voicenotes app or any voice memo
2. Script Generation - Gemini 2.5 Flash transcribes the prompt and generates a diarized 4-speaker dialogue (20-40 min)
3. Voice Synthesis - Chatterbox TTS via fal.ai with 4 cloned voices and parallel processing
4. Cover Art - Flux Schnell generates episode artwork via fal.ai
5. Audio Assembly - FFmpeg concatenates intro, prompt, dialogue, and outro
6. Publishing - Cloudinary CDN + Wasabi backup + Astro blog + Neon PostgreSQL
The Cast
V3 introduces a full cast of 4 voice-cloned characters, each with distinct personalities and roles in the show:
Corn
The Curious Host
- Enthusiastic sloth who asks probing questions
- Keeps conversation accessible with relatable examples
- Sometimes gets ahead of himself
- Plays devil's advocate when needed
Herman Poppleberry
The Expert
- Knowledgeable donkey with deep insights
- Provides technical details and authoritative explanations
- Has strong opinions and pushes back on Corn
- Can be a bit pedantic at times
Jim from Ohio
The Crotchety Caller
- Skeptical listener who calls in with complaints
- Grumpy, slightly cantankerous, always finds something to disagree with
- Peppers calls with random off-topic remarks
- Classic curmudgeon energy
Larry
The Sketchy Advertiser
- Delivers ads for dubious, made-up products
- Over-the-top infomercial energy
- Never provides contact info or website
- Every ad ends with "BUY NOW!"
Pipeline Stages
Voice Capture & Format Conversion
Audio prompts are placed in the prompts/to-process/ queue. The pipeline:
- Converts to consistent 44.1kHz WAV format
- Preserves original audio without destructive processing
- Simple format conversion ensures reliable concatenation
V3 Design Philosophy
Unlike V2, which attempted silence removal and audio cleanup, V3 uses a "pass-through" approach for prompt audio. This eliminates the risk of losing speech content to aggressive processing.
Technical Implementation
ffmpeg -y -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 1 output.wav
Transcription & Script Generation
Gemini 2.5 Flash receives the audio and generates a complete diarized podcast script featuring all four characters:
- Corn & Herman - Main discussion (with friendly disagreements)
- Jim from Ohio - One call-in segment per episode
- Larry - One sketchy ad break per episode
Scripts are dynamically sized (20-40 minutes) based on topic complexity.
Dynamic Episode Length
Script length is determined by topic complexity (word targets assume roughly 150 spoken words per minute):
- 20-25 minutes (3000-3750 words) - Simple topics, single focused questions
- 25-35 minutes (3750-5250 words) - Multi-faceted topics requiring explanation
- 35-40 minutes (5250-6000 words) - Complex topics with rich history or controversy
Episode Structure
Each script follows a consistent structure:
- Opening Hook - Welcome and topic introduction
- Topic Introduction - Why listeners should care
- Core Discussion Part 1 - Deep exploration with host disagreements
- Larry's Ad Break - One sketchy product ad
- Core Discussion Part 2 - Continued exploration with examples
- Jim's Call-In - Crotchety caller segment
- Practical Takeaways - Real-world applications
- Closing Thoughts - Future implications and sign-off
Host Dynamics
Corn and Herman don't always agree - this creates engaging tension:
- Herman occasionally corrects Corn or challenges oversimplifications
- Corn defends his positions and points out when Herman is being too technical
- Friendly disagreements get resolved through discussion
- Light teasing between the hosts adds personality
Diarization Format
The script follows strict diarization for TTS parsing:
Corn: Welcome to My Weird Prompts! I'm Corn, and as always...
Herman: Yeah, and I think what's interesting is that most coverage...
Corn: I mean, I wouldn't go that far - some outlets covered it well.
Herman: Ehh, I'd push back on that. The surface level stuff, sure...
...
Larry: Are you tired of feeling like your life is missing something?
Introducing MindBoost Ultra... BUY NOW!
...
Jim: Yeah, this is Jim from Ohio. I gotta say, you're overcomplicating it.
Voice Sample Caching & Upload
New in V3: Voice samples are cached to avoid redundant uploads:
- MD5 hash of each voice sample file tracks changes
- Cached URLs reused across multiple episode generations
- Cache invalidated automatically when voice samples are updated
Voice Sample Configuration
Each character uses a unique voice clone sample stored in config/voices/:
VOICE_SAMPLES = {
    "Corn": VOICES_DIR / "corn" / "wav" / "corn-1min.wav",
    "Herman": VOICES_DIR / "herman" / "wav" / "herman-1min.wav",
    "Jim": VOICES_DIR / "jim-v2" / "clip-Jim-2025_12_08.wav",
    "Larry": VOICES_DIR / "larry" / "clip-Larry-2025_12_08.wav",
}
Caching Implementation
The cache is stored in .voice_cache.json and maps speaker+filename to uploaded URL with hash:
def upload_voice_samples() -> dict[str, str]:
    cache = {}
    if VOICE_CACHE_FILE.exists():
        cache = json.loads(VOICE_CACHE_FILE.read_text())
    uploaded_urls = {}
    for speaker, sample_path in VOICE_SAMPLES.items():
        file_hash = get_file_hash(sample_path)
        cache_key = f"{speaker}:{sample_path.name}"
        if cache_key in cache and cache[cache_key]['hash'] == file_hash:
            # Cache hit: reuse the previously uploaded URL
            uploaded_urls[speaker] = cache[cache_key]['url']
        else:
            # Cache miss: upload the sample and record its URL + hash
            url = fal_client.upload_file(str(sample_path))
            cache[cache_key] = {'url': url, 'hash': file_hash}
            uploaded_urls[speaker] = url
    # Persist the cache for subsequent runs
    VOICE_CACHE_FILE.write_text(json.dumps(cache, indent=2))
    return uploaded_urls
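The get_file_hash helper referenced above isn't shown; a minimal sketch, assuming it returns the hex MD5 digest of the file contents:
import hashlib
from pathlib import Path

def get_file_hash(path: Path) -> str:
    """Hex MD5 digest of a file's contents; changes whenever the sample is re-recorded."""
    return hashlib.md5(path.read_bytes()).hexdigest()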
Voice Synthesis (TTS)
The script is converted to audio using Chatterbox TTS with instant voice cloning:
- 4 distinct cloned voices (Corn, Herman, Jim, Larry)
- Long segments automatically chunked at sentence boundaries (max 500 chars)
- 6 parallel workers for fal.ai cloud API calls
- Exponential backoff retry with 3 attempts for transient failures
- Checkpointing: failed runs resume from last successful segment
Script Parsing
The diarized script is parsed into segments, handling all four speakers:
def parse_diarized_script(script: str) -> list[dict]:
    pattern = r'^(Corn|Herman|Jim|Larry):\s*(.+?)(?=^(?:Corn|Herman|Jim|Larry):|\Z)'
    matches = re.findall(pattern, script, re.MULTILINE | re.DOTALL)
    return [{'speaker': speaker, 'text': text.strip()}
            for speaker, text in matches if text.strip()]
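For example, a two-turn script yields one dict per speaker turn:
>>> parse_diarized_script("Corn: Hi!\nHerman: Hello.")
[{'speaker': 'Corn', 'text': 'Hi!'}, {'speaker': 'Herman', 'text': 'Hello.'}]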
Intelligent Chunking
Chatterbox TTS has a ~40 second output limit. Long segments are split at sentence boundaries:
MAX_CHARS_PER_TTS_REQUEST = 500 # Conservative limit
def chunk_long_text(text: str, max_chars: int = 500) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    # Split on sentence boundaries (. ! ?)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if current_chunk and len(current_chunk) + len(sentence) > max_chars:
            chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += " " + sentence if current_chunk else sentence
    # Don't drop the final partial chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
Parallel Processing with Retry
TTS generation uses a thread pool with exponential backoff retry:
MAX_TTS_WORKERS_CLOUD = 6 # Concurrent fal.ai API calls
TTS_MAX_RETRIES = 3
TTS_BASE_DELAY = 1.0 # seconds
def synthesize_segment_task(args) -> tuple[int, Path, Exception | None]:
    i, segment, voice_ref, output_path = args
    last_error = None
    for attempt in range(TTS_MAX_RETRIES):
        try:
            synthesize_with_chatterbox(segment['text'], voice_ref, output_path)
            return (i, output_path, None)
        except Exception as e:
            last_error = e
            # Don't retry permanent errors (auth, invalid input)
            if 'unauthorized' in str(e).lower():
                break
            # Exponential backoff for transient errors
            delay = min(TTS_BASE_DELAY * (2 ** attempt), 30.0)
            time.sleep(delay)
    return (i, output_path, last_error)
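The fan-out side isn't shown above; a minimal sketch of dispatching the task tuples across the worker pool (tasks is the list built in the checkpointing step below):
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=MAX_TTS_WORKERS_CLOUD) as executor:
    futures = [executor.submit(synthesize_segment_task, task) for task in tasks]
    for future in as_completed(futures):
        i, output_path, error = future.result()
        if error is not None:
            print(f"Segment {i} failed after {TTS_MAX_RETRIES} attempts: {error}")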
Checkpointing
Existing segment files are skipped on retry, enabling recovery from failures:
# Check for existing segments
existing_segments: dict[int, Path] = {}
tasks = []
for i, segment in enumerate(segments):
    speaker = segment['speaker']
    segment_path = temp_dir / f"segment_{i:04d}_{speaker}.mp3"
    if segment_path.exists() and segment_path.stat().st_size > 0:
        existing_segments[i] = segment_path  # Already rendered: skip
    else:
        tasks.append((i, segment, voice_ref, segment_path))
Metadata & Cover Art
Metadata and cover art are generated in parallel with TTS (they don't depend on the audio):
- Episode title and description via Gemini 2.5 Flash
- Blog post article (~800-1200 words) summarizing the episode
- Image prompt for cover art generation
- 3 cover art variants via Flux Schnell (generated in parallel)
Metadata Generation
Gemini analyzes the generated script to produce:
- Episode Title - Catchy, max 60 characters
- Short Description - 2-3 sentence teaser for podcast apps
- Blog Post - 800-1200 word article summarizing the episode
- Image Prompt - Theme description for cover art generation
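A minimal sketch of how this structured request could look with the google-genai SDK; the prompt wording and JSON field names are illustrative, not the exact schema:
import json
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=(
        "Generate podcast episode metadata for this script. Return JSON with "
        "keys: title, description, blog_post, image_prompt.\n\n" + script
    ),
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)
metadata = json.loads(response.text)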
Cover Art Generation
Flux Schnell generates 3 variants in parallel, explicitly forbidding text elements:
enhanced_prompt = f"""Professional podcast episode cover art,
modern clean design, visually striking. IMPORTANT: Do NOT include
any text, words, letters, numbers, typography, titles, labels, or
writing of any kind. No signs, no logos with text, no speech bubbles.
Pure visual imagery only. Theme: {image_prompt}"""
# Generate all variants in parallel
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(_generate_single_cover_art, (i, enhanced_prompt))
               for i in range(3)]
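The per-variant worker isn't shown; a sketch using fal_client's synchronous subscribe call, assuming the Flux Schnell endpoint's published argument and result names (treat the details as assumptions, and episode_dir as context):
import fal_client
import requests
from pathlib import Path

def _generate_single_cover_art(task: tuple[int, str]) -> Path:
    """Generate one cover art variant and download it locally (sketch)."""
    i, prompt = task
    result = fal_client.subscribe(
        "fal-ai/flux/schnell",
        arguments={"prompt": prompt, "image_size": "square_hd"},
    )
    image_url = result["images"][0]["url"]
    out_path = episode_dir / f"cover_{i}.png"  # episode_dir assumed from context
    out_path.write_bytes(requests.get(image_url, timeout=60).content)
    return out_path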
Audio Assembly
FFmpeg concatenates the final episode in order:
- Intro jingle (pre-normalized)
- AI disclaimer
- User's original voice prompt
- AI-generated dialogue (with Jim & Larry segments)
- Outro jingle (pre-normalized)
Pre-Normalized Show Elements
V3 uses pre-normalized jingles from show-elements/mixed/normalized/ to reduce processing:
if intro_jingle and intro_jingle.exists():
    normalized = NORMALIZED_JINGLES_DIR / intro_jingle.name
    audio_files.append(normalized if normalized.exists() else intro_jingle)
Concatenation Process
# Convert all inputs to a consistent format for concat
prepared_files = []
for i, audio_file in enumerate(audio_files):
    prepared_path = temp_dir / f"prepared_{i:04d}.wav"
    cmd = ["ffmpeg", "-y", "-i", str(audio_file),
           "-ar", "44100", "-ac", "1", "-c:a", "pcm_s16le",
           str(prepared_path)]
    subprocess.run(cmd, check=True)
    prepared_files.append(prepared_path)
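# (Sketch) Write the concat demuxer file list consumed by the command below;
# filelist_path is assumed from the surrounding context, and the demuxer
# expects each line in the form: file '/path/to/segment.wav'
with open(filelist_path, "w") as f:
    for prepared_path in prepared_files:
        f.write(f"file '{prepared_path}'\n")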
# Concatenate all prepared segments
cmd = ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
       "-i", str(filelist_path), "-c:a", "pcm_s16le",
       str(concat_path)]
subprocess.run(cmd, check=True)
Loudness Optimization
Final audio mastering ensures consistent playback across all platforms:
- EBU R128 loudness normalization to -16 LUFS
- True peak limiting to -1.5 dBTP
- Single-pass normalization (faster than V2's two-pass)
EBU R128 Standard
- Target Loudness - -16 LUFS (optimal for podcasts)
- Loudness Range - 11 LU (natural dynamics preserved)
- True Peak - -1.5 dBTP (headroom for lossy encoding)
Technical Implementation
# Single-pass loudness normalization + MP3 encoding
cmd = ["ffmpeg", "-y", "-i", str(concat_path),
       "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",
       "-c:a", "libmp3lame", "-b:a", "192k",
       str(output_path)]
subprocess.run(cmd, check=True)
Publishing
The complete episode is published to multiple destinations:
- Cloudinary - CDN hosting for audio and images
- Astro Blog - Markdown post with embedded player
- Neon PostgreSQL - Episode metadata and transcript
- Wasabi - S3-compatible archive backup
CDN Upload (Cloudinary)
result = cloudinary.uploader.upload(
    str(file_path),
    resource_type=resource_type,
    folder="my-weird-prompts/episodes",
    public_id=file_path.stem,
    overwrite=True,
)
Blog Post Generation
An Astro-compatible markdown file is generated with YAML frontmatter, embedded audio player, and full transcript.
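A minimal sketch of the frontmatter assembly; the field names and the BLOG_CONTENT_DIR, episode_slug, audio_url, and cover_url variables are illustrative, not the actual schema:
from datetime import date

post = f"""---
title: "{metadata['title']}"
description: "{metadata['description']}"
pubDate: {date.today().isoformat()}
audioUrl: "{audio_url}"
coverImage: "{cover_url}"
---

{metadata['blog_post']}
"""
(BLOG_CONTENT_DIR / f"{episode_slug}.md").write_text(post)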
Archive Backup (Wasabi)
All assets are backed up to Wasabi S3 for long-term storage:
- Original prompt audio preserved
- Generated script (JSON + plain text)
- Final episode audio
- All cover art variants
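A sketch of the backup upload using boto3's S3 client against the Wasabi endpoint; the key layout is an assumption, and episode_dir comes from the surrounding pipeline:
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["WASABI_ENDPOINT"],
    aws_access_key_id=os.environ["WASABI_ACCESS_KEY"],
    aws_secret_access_key=os.environ["WASABI_SECRET_KEY"],
    region_name=os.environ["WASABI_REGION"],
)
# Key layout is illustrative: episodes/<episode-name>/<filename>
for asset in episode_dir.iterdir():
    s3.upload_file(str(asset), os.environ["WASABI_BUCKET"],
                   f"episodes/{episode_dir.name}/{asset.name}")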
Environment Variables
The pipeline requires the following environment variables. Create a .env file in the backend directory:
# ======================
# AI Services
# ======================
# Google Gemini - Script generation & transcription
GEMINI_API_KEY=your_gemini_api_key_here
# fal.ai - TTS (Chatterbox) and image generation (Flux)
FAL_KEY=your_fal_api_key_here
# ======================
# Media Hosting
# ======================
# Cloudinary - CDN for audio and images
CLOUDINARY_CLOUD_NAME=your_cloud_name
CLOUDINARY_API_KEY=your_api_key
CLOUDINARY_API_SECRET=your_api_secret
# ======================
# Storage & Database
# ======================
# Wasabi S3-compatible storage (backup)
WASABI_ACCESS_KEY=your_wasabi_access_key
WASABI_SECRET_KEY=your_wasabi_secret_key
WASABI_BUCKET=myweirdprompts
WASABI_REGION=eu-central-2
WASABI_ENDPOINT=https://s3.eu-central-2.wasabisys.com
# Neon PostgreSQL database
POSTGRES_URL=postgres://user:pass@host/database?sslmode=require
# ======================
# Optional: Local TTS
# ======================
# Set to "true" to use local Chatterbox Docker instead of fal.ai
USE_LOCAL_TTS=false
CHATTERBOX_URL=http://localhost:8881
Core Pipeline Code
The main entry point processes prompts from the queue:
def generate_podcast_episode(prompt_audio_path: Path, episode_name: str | None = None) -> Path:
    """
    Generate a complete podcast episode from a user's audio prompt.

    V3 Optimized Workflow:
    1. Upload voice samples (cached) + generate script + process prompt (parallel)
    2. Parse script into 4-speaker segments (Corn, Herman, Jim, Larry)
    3. Generate metadata/cover art + dialogue TTS (parallel - metadata doesn't need audio)
    4. Assemble final episode (intro + disclaimer + prompt + dialogue + outro)
    5. Publish to Cloudinary CDN + Neon database + Wasabi archive
    6. Auto-deploy to Vercel via git push
    7. Cleanup episode folders after successful publish
    """
    # Initialize clients and verify voice samples exist
    gemini_client = get_gemini_client()
    get_fal_client()

    # Step 1: Three parallel operations for maximum efficiency
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        voice_future = executor.submit(upload_voice_samples)  # Uses MD5 cache
        script_future = executor.submit(transcribe_and_generate_script, gemini_client, prompt_audio_path)
        prompt_future = executor.submit(process_prompt_audio, prompt_audio_path, processed_prompt_path)
        voice_refs = voice_future.result()              # Cached URLs or fresh uploads
        script = script_future.result()                 # 3000-6000 word diarized script
        processed_prompt_path = prompt_future.result()  # Simple WAV conversion

    # Step 2: Parse diarized script into segments for TTS
    segments = parse_diarized_script(script)  # Handles all 4 speakers

    # Step 3: Heavy parallel - metadata+cover generation alongside TTS
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        def generate_metadata_and_cover():
            metadata = generate_episode_metadata(gemini_client, script)
            cover_art_paths = generate_cover_art(metadata['image_prompt'], episode_dir, num_variants=3)
            return metadata, cover_art_paths
        metadata_future = executor.submit(generate_metadata_and_cover)
        audio_future = executor.submit(generate_dialogue_audio, segments, episode_dir, voice_refs, use_local_tts)
        metadata, cover_art_paths = metadata_future.result()  # Title, description, blog post, cover art
        dialogue_audio_path = audio_future.result()           # All TTS segments concatenated

    # Step 4: Assemble final episode with pre-normalized show elements
    concatenate_episode(dialogue_audio_path, episode_path, user_prompt_audio=processed_prompt_path,
                        intro_jingle=intro_jingle, disclaimer_audio=DISCLAIMER_PATH, outro_jingle=outro_jingle)

    # Step 5: Publish - parallel uploads to Cloudinary + database insert + Wasabi backup
    publish_episode(episode_dir, episode_path, metadata, cover_art_paths, script)
    upload_episode_to_wasabi(episode_dir, episode_path)
    return episode_path
Queue Processing & Auto-Deployment
The pipeline includes a complete queue system with automatic deployment:
Queue Processing
Audio prompts placed in prompts/to-process/ are automatically processed sequentially. Successfully processed prompts are deleted from the queue.
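A minimal sketch of that loop (directory name from this page; unlink-on-success mirrors the deletion behavior described above):
from pathlib import Path

TO_PROCESS_DIR = Path("prompts/to-process")

for prompt_file in sorted(TO_PROCESS_DIR.iterdir()):
    generate_podcast_episode(prompt_file)
    prompt_file.unlink()  # successfully processed prompts leave the queue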
Auto-Deployment
After processing, new blog posts are committed to git and pushed. Vercel auto-deploys from the main branch.
Cleanup
Episode generation folders are cleaned up after successful publish, freeing disk space while content lives in Cloudinary/Wasabi.
# Process all prompts in queue (default)
python generate_episode.py
# Process a single audio file
python generate_episode.py prompt.mp3
# Check queue and output status
python generate_episode.py --status
# Clean up published episode folders
python generate_episode.py --cleanup
# Force clean ALL episode folders (use with caution)
python generate_episode.py --force-cleanup
Dependencies
Python Packages
pip install google-genai python-dotenv fal-client cloudinary psycopg2-binary boto3 Pillow requests
System Requirements
ffmpeg and ffprobe are required for audio processing, concatenation, and loudness normalization.
Stack Components
The V3 pipeline is powered by a modern stack of cloud services and open-source tools:
- Astro - Static site generator for the blog frontend
- fal.ai - TTS (Chatterbox) and image generation (Flux) APIs
- OpenRouter - LLM gateway for model access and routing
- Flux - AI image generation for episode cover art
- Cloudinary - CDN hosting for audio and image assets
- Neon - Serverless PostgreSQL for episode metadata
- Wasabi - S3-compatible storage for long-term backup
- Tavily - Web search API for external information retrieval
Source Code
The complete pipeline code is open source and available on GitHub.
Previous Versions
Documentation of earlier pipeline iterations is preserved for reference.