Archive: This documents the V3 pipeline (Chatterbox TTS via fal.ai). See Pipeline for the current implementation.
Episode Generation Pipeline
The complete workflow from voice prompt to published podcast episode
V3 introduces 4 voice-cloned characters, call-in segments, sketchy ads, voice sample caching, and optimized parallel TTS processing.
Pipeline Overview
Architecture
1. Voice Input - Audio prompts recorded via the Voicenotes app or any voice memo
2. Script Generation - Gemini 2.5 Flash transcribes the prompt and generates a diarized 4-speaker dialogue (20-40 min)
3. Voice Synthesis - Chatterbox TTS via fal.ai with 4 cloned voices and parallel processing
4. Cover Art - Flux Schnell generates episode artwork via fal.ai
5. Audio Assembly - FFmpeg concatenates intro, prompt, dialogue, and outro
6. Publishing - Cloudinary CDN + Wasabi backup + Astro blog + Neon PostgreSQL
The Cast
V3 introduces a full cast of 4 voice-cloned characters, each with distinct personalities and roles in the show:
Corn
The Curious Host
- Enthusiastic sloth who asks probing questions
- Keeps conversation accessible with relatable examples
- Sometimes gets ahead of himself
- Plays devil's advocate when needed
Herman Poppleberry
The Expert
- Knowledgeable donkey with deep insights
- Provides technical details and authoritative explanations
- Has strong opinions and pushes back on Corn
- Can be a bit pedantic at times
Jim from Ohio
The Crotchety Caller
- Skeptical listener who calls in with complaints
- Grumpy, slightly cantankerous, always finds something to disagree with
- Peppers calls with random off-topic remarks
- Classic curmudgeon energy
Larry
The Sketchy Advertiser
- Delivers ads for dubious, made-up products
- Over-the-top infomercial energy
- Never provides contact info or website
- Every ad ends with "BUY NOW!"
Pipeline Stages
Voice Capture & Format Conversion
Audio prompts are placed in the prompts/to-process/ queue. The pipeline:
- Converts to consistent 44.1kHz WAV format
- Preserves original audio without destructive processing
- Simple format conversion ensures reliable concatenation
V3 Design Philosophy
Unlike V2, which attempted silence removal and audio cleanup, V3 uses a "pass-through" approach for prompt audio. This eliminates the risk of losing speech content to aggressive processing.
Technical Implementation
ffmpeg -y -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 1 output.wav
Transcription & Script Generation
Gemini 2.5 Flash receives the audio and generates a complete diarized podcast script featuring all four characters:
- Corn & Herman - Main discussion (with friendly disagreements)
- Jim from Ohio - One call-in segment per episode
- Larry - One sketchy ad break per episode
Scripts are dynamically sized (20-40 minutes) based on topic complexity.
Dynamic Episode Length
Script length is determined by topic complexity (word targets assume roughly 150 spoken words per minute):
- 20-25 minutes (3000-3750 words) - Simple topics, single focused questions
- 25-35 minutes (3750-5250 words) - Multi-faceted topics requiring explanation
- 35-40 minutes (5250-6000 words) - Complex topics with rich history or controversy
Episode Structure
Each script follows a consistent structure:
- Opening Hook - Welcome and topic introduction
- Topic Introduction - Why listeners should care
- Core Discussion Part 1 - Deep exploration with host disagreements
- Larry's Ad Break - One sketchy product ad
- Core Discussion Part 2 - Continued exploration with examples
- Jim's Call-In - Crotchety caller segment
- Practical Takeaways - Real-world applications
- Closing Thoughts - Future implications and sign-off
Host Dynamics
Corn and Herman don't always agree - this creates engaging tension:
- Herman occasionally corrects Corn or challenges oversimplifications
- Corn defends his positions and points out when Herman is being too technical
- Friendly disagreements get resolved through discussion
- Light teasing between the hosts adds personality
Diarization Format
The script follows strict diarization for TTS parsing:
Corn: Welcome to My Weird Prompts! I'm Corn, and as always...
Herman: Yeah, and I think what's interesting is that most coverage...
Corn: I mean, I wouldn't go that far - some outlets covered it well.
Herman: Ehh, I'd push back on that. The surface level stuff, sure...
...
Larry: Are you tired of feeling like your life is missing something?
Introducing MindBoost Ultra... BUY NOW!
...
Jim: Yeah, this is Jim from Ohio. I gotta say, you're overcomplicating it.
Voice Sample Caching & Upload
New in V3: Voice samples are cached to avoid redundant uploads:
- MD5 hash of each voice sample file tracks changes
- Cached URLs reused across multiple episode generations
- Cache invalidated automatically when voice samples are updated
Voice Sample Configuration
Each character uses a unique voice clone sample stored in config/voices/:
VOICE_SAMPLES = {
    "Corn": VOICES_DIR / "corn" / "wav" / "corn-1min.wav",
    "Herman": VOICES_DIR / "herman" / "wav" / "herman-1min.wav",
    "Jim": VOICES_DIR / "jim-v2" / "clip-Jim-2025_12_08.wav",
    "Larry": VOICES_DIR / "larry" / "clip-Larry-2025_12_08.wav",
}
Caching Implementation
The cache is stored in .voice_cache.json and maps speaker+filename to uploaded URL with hash:
def upload_voice_samples() -> dict[str, str]:
    cache = {}
    if VOICE_CACHE_FILE.exists():
        cache = json.loads(VOICE_CACHE_FILE.read_text())
    uploaded_urls = {}
    for speaker, sample_path in VOICE_SAMPLES.items():
        file_hash = get_file_hash(sample_path)
        cache_key = f"{speaker}:{sample_path.name}"
        if cache_key in cache and cache[cache_key]['hash'] == file_hash:
            # Cache hit: reuse the previously uploaded URL
            uploaded_urls[speaker] = cache[cache_key]['url']
        else:
            # Cache miss: upload the sample and record its URL + hash
            url = fal_client.upload_file(str(sample_path))
            cache[cache_key] = {'url': url, 'hash': file_hash}
            uploaded_urls[speaker] = url
    # Persist the cache for subsequent runs
    VOICE_CACHE_FILE.write_text(json.dumps(cache, indent=2))
    return uploaded_urls
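The get_file_hash helper referenced above isn't shown; a minimal sketch, assuming it returns the hex MD5 digest of the file contents:
import hashlib
from pathlib import Path

def get_file_hash(path: Path) -> str:
    """Hex MD5 digest of a file's contents; changes whenever the sample is re-recorded."""
    return hashlib.md5(path.read_bytes()).hexdigest()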
Voice Synthesis (TTS)
The script is converted to audio using Chatterbox TTS with instant voice cloning:
- 4 distinct cloned voices (Corn, Herman, Jim, Larry)
- Long segments automatically chunked at sentence boundaries (max 500 chars)
- 6 parallel workers for fal.ai cloud API calls
- Exponential backoff retry with 3 attempts for transient failures
- Checkpointing: failed runs resume from last successful segment
Script Parsing
The diarized script is parsed into segments, handling all four speakers:
def parse_diarized_script(script: str) -> list[dict]:
    pattern = r'^(Corn|Herman|Jim|Larry):\s*(.+?)(?=^(?:Corn|Herman|Jim|Larry):|\Z)'
    matches = re.findall(pattern, script, re.MULTILINE | re.DOTALL)
    return [{'speaker': speaker, 'text': text.strip()}
            for speaker, text in matches if text.strip()]
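For example, a two-turn script yields one dict per speaker turn:
>>> parse_diarized_script("Corn: Hi!\nHerman: Hello.")
[{'speaker': 'Corn', 'text': 'Hi!'}, {'speaker': 'Herman', 'text': 'Hello.'}]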
Intelligent Chunking
Chatterbox TTS has a ~40 second output limit. Long segments are split at sentence boundaries:
MAX_CHARS_PER_TTS_REQUEST = 500 # Conservative limit
def chunk_long_text(text: str, max_chars: int = 500) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    # Split on sentence boundaries (. ! ?)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if current_chunk and len(current_chunk) + len(sentence) > max_chars:
            chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += " " + sentence if current_chunk else sentence
    # Don't drop the final partial chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
Parallel Processing with Retry
TTS generation uses a thread pool with exponential backoff retry:
MAX_TTS_WORKERS_CLOUD = 6 # Concurrent fal.ai API calls
TTS_MAX_RETRIES = 3
TTS_BASE_DELAY = 1.0 # seconds
def synthesize_segment_task(args) -> tuple[int, Path, Exception | None]:
    i, segment, voice_ref, output_path = args
    last_error = None
    for attempt in range(TTS_MAX_RETRIES):
        try:
            synthesize_with_chatterbox(segment['text'], voice_ref, output_path)
            return (i, output_path, None)
        except Exception as e:
            last_error = e
            # Don't retry permanent errors (auth, invalid input)
            if 'unauthorized' in str(e).lower():
                break
            # Exponential backoff for transient errors
            delay = min(TTS_BASE_DELAY * (2 ** attempt), 30.0)
            time.sleep(delay)
    return (i, output_path, last_error)
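The fan-out side isn't shown above; a minimal sketch of dispatching the task tuples across the worker pool (tasks is the list built in the checkpointing step below):
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=MAX_TTS_WORKERS_CLOUD) as executor:
    futures = [executor.submit(synthesize_segment_task, task) for task in tasks]
    for future in as_completed(futures):
        i, output_path, error = future.result()
        if error is not None:
            print(f"Segment {i} failed after {TTS_MAX_RETRIES} attempts: {error}")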
Checkpointing
Existing segment files are skipped on retry, enabling recovery from failures:
# Check for existing segments
existing_segments: dict[int, Path] = {}
tasks = []
for i, segment in enumerate(segments):
    speaker = segment['speaker']
    segment_path = temp_dir / f"segment_{i:04d}_{speaker}.mp3"
    if segment_path.exists() and segment_path.stat().st_size > 0:
        existing_segments[i] = segment_path  # Already rendered: skip
    else:
        tasks.append((i, segment, voice_ref, segment_path))
Metadata & Cover Art
Metadata and cover art are generated in parallel with TTS (they don't depend on the audio):
- Episode title and description via Gemini 2.5 Flash
- Blog post article (~800-1200 words) summarizing the episode
- Image prompt for cover art generation
- 3 cover art variants via Flux Schnell (generated in parallel)
Metadata Generation
Gemini analyzes the generated script to produce:
- Episode Title - Catchy, max 60 characters
- Short Description - 2-3 sentence teaser for podcast apps
- Blog Post - 800-1200 word article summarizing the episode
- Image Prompt - Theme description for cover art generation
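A minimal sketch of how this structured request could look with the google-genai SDK; the prompt wording and JSON field names are illustrative, not the exact schema:
import json
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=(
        "Generate podcast episode metadata for this script. Return JSON with "
        "keys: title, description, blog_post, image_prompt.\n\n" + script
    ),
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)
metadata = json.loads(response.text)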
Cover Art Generation
Flux Schnell generates 3 variants in parallel, explicitly forbidding text elements:
enhanced_prompt = f"""Professional podcast episode cover art,
modern clean design, visually striking. IMPORTANT: Do NOT include
any text, words, letters, numbers, typography, titles, labels, or
writing of any kind. No signs, no logos with text, no speech bubbles.
Pure visual imagery only. Theme: {image_prompt}"""
# Generate all variants in parallel
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(_generate_single_cover_art, (i, enhanced_prompt))
               for i in range(3)]
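The per-variant worker isn't shown; a sketch using fal_client's synchronous subscribe call, assuming the Flux Schnell endpoint's published argument and result names (treat the details as assumptions, and episode_dir as context):
import fal_client
import requests
from pathlib import Path

def _generate_single_cover_art(task: tuple[int, str]) -> Path:
    """Generate one cover art variant and download it locally (sketch)."""
    i, prompt = task
    result = fal_client.subscribe(
        "fal-ai/flux/schnell",
        arguments={"prompt": prompt, "image_size": "square_hd"},
    )
    image_url = result["images"][0]["url"]
    out_path = episode_dir / f"cover_{i}.png"  # episode_dir assumed from context
    out_path.write_bytes(requests.get(image_url, timeout=60).content)
    return out_path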
Audio Assembly
FFmpeg concatenates the final episode in order:
- Intro jingle (pre-normalized)
- AI disclaimer
- User's original voice prompt
- AI-generated dialogue (with Jim & Larry segments)
- Outro jingle (pre-normalized)
Pre-Normalized Show Elements
V3 uses pre-normalized jingles from show-elements/mixed/normalized/ to reduce processing:
if intro_jingle and intro_jingle.exists():
    normalized = NORMALIZED_JINGLES_DIR / intro_jingle.name
    audio_files.append(normalized if normalized.exists() else intro_jingle)
Concatenation Process
# Convert all inputs to a consistent format for concat
prepared_files = []
for i, audio_file in enumerate(audio_files):
    prepared_path = temp_dir / f"prepared_{i:04d}.wav"
    cmd = ["ffmpeg", "-y", "-i", str(audio_file),
           "-ar", "44100", "-ac", "1", "-c:a", "pcm_s16le",
           str(prepared_path)]
    subprocess.run(cmd, check=True)
    prepared_files.append(prepared_path)
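# (Sketch) Write the concat demuxer file list consumed by the command below;
# filelist_path is assumed from the surrounding context, and the demuxer
# expects each line in the form: file '/path/to/segment.wav'
with open(filelist_path, "w") as f:
    for prepared_path in prepared_files:
        f.write(f"file '{prepared_path}'\n")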
# Concatenate all prepared segments
cmd = ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
       "-i", str(filelist_path), "-c:a", "pcm_s16le",
       str(concat_path)]
subprocess.run(cmd, check=True)
Loudness Optimization
Final audio mastering ensures consistent playback across all platforms:
- EBU R128 loudness normalization to -16 LUFS
- True peak limiting to -1.5 dBTP
- Single-pass normalization (faster than V2's two-pass)
EBU R128 Standard
- Target Loudness - -16 LUFS (optimal for podcasts)
- Loudness Range - 11 LU (natural dynamics preserved)
- True Peak - -1.5 dBTP (headroom for lossy encoding)
Technical Implementation
# Single-pass loudness normalization + MP3 encoding
cmd = ["ffmpeg", "-y", "-i", str(concat_path),
       "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",
       "-c:a", "libmp3lame", "-b:a", "192k",
       str(output_path)]
subprocess.run(cmd, check=True)
Publishing
The complete episode is published to multiple destinations:
- Cloudinary - CDN hosting for audio and images
- Astro Blog - Markdown post with embedded player
- Neon PostgreSQL - Episode metadata and transcript
- Wasabi - S3-compatible archive backup
CDN Upload (Cloudinary)
result = cloudinary.uploader.upload(
    str(file_path),
    resource_type=resource_type,
    folder="my-weird-prompts/episodes",
    public_id=file_path.stem,
    overwrite=True,
)
Blog Post Generation
An Astro-compatible markdown file is generated with YAML frontmatter, embedded audio player, and full transcript.
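A minimal sketch of the frontmatter assembly; the field names and the BLOG_CONTENT_DIR, episode_slug, audio_url, and cover_url variables are illustrative, not the actual schema:
from datetime import date

post = f"""---
title: "{metadata['title']}"
description: "{metadata['description']}"
pubDate: {date.today().isoformat()}
audioUrl: "{audio_url}"
coverImage: "{cover_url}"
---

{metadata['blog_post']}
"""
(BLOG_CONTENT_DIR / f"{episode_slug}.md").write_text(post)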
Archive Backup (Wasabi)
All assets are backed up to Wasabi S3 for long-term storage:
- Original prompt audio preserved
- Generated script (JSON + plain text)
- Final episode audio
- All cover art variants
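A sketch of the backup upload using boto3's S3 client against the Wasabi endpoint; the key layout is an assumption, and episode_dir comes from the surrounding pipeline:
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["WASABI_ENDPOINT"],
    aws_access_key_id=os.environ["WASABI_ACCESS_KEY"],
    aws_secret_access_key=os.environ["WASABI_SECRET_KEY"],
    region_name=os.environ["WASABI_REGION"],
)
# Key layout is illustrative: episodes/<episode-name>/<filename>
for asset in episode_dir.iterdir():
    s3.upload_file(str(asset), os.environ["WASABI_BUCKET"],
                   f"episodes/{episode_dir.name}/{asset.name}")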
Environment Variables
The pipeline requires the following environment variables. Create a .env file in the backend directory:
# ======================
# AI Services
# ======================
# Google Gemini - Script generation & transcription
GEMINI_API_KEY=your_gemini_api_key_here
# fal.ai - TTS (Chatterbox) and image generation (Flux)
FAL_KEY=your_fal_api_key_here
# ======================
# Media Hosting
# ======================
# Cloudinary - CDN for audio and images
CLOUDINARY_CLOUD_NAME=your_cloud_name
CLOUDINARY_API_KEY=your_api_key
CLOUDINARY_API_SECRET=your_api_secret
# ======================
# Storage & Database
# ======================
# Wasabi S3-compatible storage (backup)
WASABI_ACCESS_KEY=your_wasabi_access_key
WASABI_SECRET_KEY=your_wasabi_secret_key
WASABI_BUCKET=myweirdprompts
WASABI_REGION=eu-central-2
WASABI_ENDPOINT=https://s3.eu-central-2.wasabisys.com
# Neon PostgreSQL database
POSTGRES_URL=postgres://user:pass@host/database?sslmode=require
# ======================
# Optional: Local TTS
# ======================
# Set to "true" to use local Chatterbox Docker instead of fal.ai
USE_LOCAL_TTS=false
CHATTERBOX_URL=http://localhost:8881
Core Pipeline Code
The main entry point processes prompts from the queue:
def generate_podcast_episode(prompt_audio_path: Path, episode_name: str | None = None) -> Path:
    """
    Generate a complete podcast episode from a user's audio prompt.

    V3 Optimized Workflow:
    1. Upload voice samples (cached) + generate script + process prompt (parallel)
    2. Parse script into 4-speaker segments (Corn, Herman, Jim, Larry)
    3. Generate metadata/cover art + dialogue TTS (parallel - metadata doesn't need audio)
    4. Assemble final episode (intro + disclaimer + prompt + dialogue + outro)
    5. Publish to Cloudinary CDN + Neon database + Wasabi archive
    6. Auto-deploy to Vercel via git push
    7. Cleanup episode folders after successful publish
    """
    # Initialize clients and verify voice samples exist
    gemini_client = get_gemini_client()
    get_fal_client()

    # Step 1: Three parallel operations for maximum efficiency
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        voice_future = executor.submit(upload_voice_samples)  # Uses MD5 cache
        script_future = executor.submit(transcribe_and_generate_script, gemini_client, prompt_audio_path)
        prompt_future = executor.submit(process_prompt_audio, prompt_audio_path, processed_prompt_path)
        voice_refs = voice_future.result()              # Cached URLs or fresh uploads
        script = script_future.result()                 # 3000-6000 word diarized script
        processed_prompt_path = prompt_future.result()  # Simple WAV conversion

    # Step 2: Parse diarized script into segments for TTS
    segments = parse_diarized_script(script)  # Handles all 4 speakers

    # Step 3: Heavy parallel - metadata+cover generation alongside TTS
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        def generate_metadata_and_cover():
            metadata = generate_episode_metadata(gemini_client, script)
            cover_art_paths = generate_cover_art(metadata['image_prompt'], episode_dir, num_variants=3)
            return metadata, cover_art_paths
        metadata_future = executor.submit(generate_metadata_and_cover)
        audio_future = executor.submit(generate_dialogue_audio, segments, episode_dir, voice_refs, use_local_tts)
        metadata, cover_art_paths = metadata_future.result()  # Title, description, blog post, cover art
        dialogue_audio_path = audio_future.result()           # All TTS segments concatenated

    # Step 4: Assemble final episode with pre-normalized show elements
    concatenate_episode(dialogue_audio_path, episode_path, user_prompt_audio=processed_prompt_path,
                        intro_jingle=intro_jingle, disclaimer_audio=DISCLAIMER_PATH, outro_jingle=outro_jingle)

    # Step 5: Publish - parallel uploads to Cloudinary + database insert + Wasabi backup
    publish_episode(episode_dir, episode_path, metadata, cover_art_paths, script)
    upload_episode_to_wasabi(episode_dir, episode_path)
    return episode_path
Queue Processing & Auto-Deployment
The pipeline includes a complete queue system with automatic deployment:
Queue Processing
Audio prompts placed in prompts/to-process/ are automatically processed sequentially. Successfully processed prompts are deleted from the queue.
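A minimal sketch of that loop (directory name from this page; unlink-on-success mirrors the deletion behavior described above):
from pathlib import Path

TO_PROCESS_DIR = Path("prompts/to-process")

for prompt_file in sorted(TO_PROCESS_DIR.iterdir()):
    generate_podcast_episode(prompt_file)
    prompt_file.unlink()  # successfully processed prompts leave the queue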
Auto-Deployment
After processing, new blog posts are committed to git and pushed. Vercel auto-deploys from the main branch.
Cleanup
Episode generation folders are cleaned up after successful publish, freeing disk space while content lives in Cloudinary/Wasabi.
# Process all prompts in queue (default)
python generate_episode.py
# Process a single audio file
python generate_episode.py prompt.mp3
# Check queue and output status
python generate_episode.py --status
# Clean up published episode folders
python generate_episode.py --cleanup
# Force clean ALL episode folders (use with caution)
python generate_episode.py --force-cleanup
Dependencies
Python Packages
pip install google-genai python-dotenv fal-client cloudinary psycopg2-binary boto3 Pillow requests
System Requirements
ffmpeg and ffprobe are required for audio processing, concatenation, and loudness normalization.
Stack Components
The V3 pipeline is powered by a modern stack of cloud services and open-source tools:
- Astro - Static site generator for the blog frontend
- fal.ai - TTS (Chatterbox) and image generation (Flux) APIs
- OpenRouter - LLM gateway for model access and routing
- Flux - AI image generation for episode cover art
- Cloudinary - CDN hosting for audio and image assets
- Neon - Serverless PostgreSQL for episode metadata
- Wasabi - S3-compatible storage for long-term backup
- Tavily - Web search API for external information retrieval
Source Code
The complete pipeline code is open source and available on GitHub.
Previous Versions
Documentation of earlier pipeline iterations is preserved for reference.