Archive: This documents the V2 pipeline. See Pipeline V4 for the current implementation.

Episode Generation Pipeline

The complete workflow from voice prompt to published podcast episode

Pipeline Overview

[Diagram: production process showing the pipeline stages]

Architecture

Voice Input

Audio prompts recorded with the Voicenotes app or any voice memo app

Script Generation

Gemini 2.5 Flash transcribes and generates diarized dialogue

Voice Synthesis

Chatterbox TTS with cloned voices (Corn & Herman)

Cover Art

Flux Schnell generates episode artwork via fal.ai

Audio Assembly

FFmpeg concatenates intro, prompt, dialogue, and outro

Publishing

Cloudinary CDN + Astro blog + PostgreSQL database

Pipeline Stages

1. Voice Capture & Preprocessing

Audio prompts are placed in the prompts/to-process/ queue. The pipeline automatically:

  • Detects and trims leading/trailing silence
  • Compresses long pauses to maintain pacing
  • Normalizes audio levels for consistent output

Silence Detection & Trimming

Uses FFmpeg's silencedetect filter to identify silence periods at the beginning and end of recordings. Configurable threshold (-50dB default) and minimum duration (0.5s) parameters allow fine-tuning for different recording environments.
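
For example, the analysis step can be run on its own to inspect the detected silence intervals (the threshold and duration values here match the defaults above):

ffmpeg -i input.mp3 -af silencedetect=noise=-50dB:d=0.5 -f null -

The filter logs silence_start and silence_end timestamps, which can then be parsed to decide how much to trim from each end of the recording.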

Pause Compression

Long pauses within speech (common in voice memos) are compressed to maintain natural pacing. Pauses exceeding 2 seconds are reduced to 0.8 seconds using intelligent splice points that preserve speech rhythm.

Technical Implementation

ffmpeg -i input.mp3 -af "silenceremove=1:0:-50dB:1:5:-50dB, compand=attacks=0.3:points=-80/-900|-45/-15|-27/-9|0/-7|20/-7:gain=5" output.mp3

2. Transcription & Script Generation

Gemini 2.5 Flash receives the audio and generates a complete diarized podcast script featuring two AI hosts:

  • Corn - The curious, enthusiastic host who asks probing questions
  • Herman - The knowledgeable expert who provides deep insights

Scripts are dynamically sized (15-45 minutes) based on topic complexity and prompt duration.
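
A minimal sketch of this step using the google-genai SDK; the system prompt, user prompt, and file path are illustrative, not the production values:

import os
from google import genai
from google.genai import types

SYSTEM_PROMPT = (
    "You write podcast scripts. Every line must start with exactly "
    "'CORN: ' or 'HERMAN: '. No narration or stage directions."
)

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Upload the preprocessed voice prompt so Gemini can listen to it directly
audio = client.files.upload(file="prompts/to-process/prompt.mp3")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        audio,
        "Transcribe this prompt, then write a diarized podcast script "
        "exploring the topic it raises.",
    ],
    config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
)
script = response.text  # one "CORN:" / "HERMAN:" line per turn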

System Prompt Architecture

The Gemini system prompt is carefully engineered to produce consistent, high-quality diarized scripts. Key directives include:

  • Strict Diarization Format - Every line must begin with exactly CORN: or HERMAN: followed by their dialogue. No narrative text, stage directions, or speaker labels in other formats.
  • Character Coherence - Each host maintains consistent personality traits throughout. Corn stays curious and energetic; Herman remains thoughtful and authoritative.
  • No Cross-Talk Markers - Interruptions and overlapping speech are avoided since TTS processes one speaker at a time.
  • Natural Conversation Flow - Responses should feel like genuine dialogue, not scripted Q&A. Hosts can agree, disagree, build on ideas, and ask follow-up questions.
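
An illustrative fragment of the expected output format (the dialogue itself is invented for the example):

CORN: So Herman, what actually happens when I drop a voice memo into the queue?
HERMAN: The pipeline trims the silence, hands the audio to Gemini, and a full script comes back before the TTS stage even starts.
CORN: That fast? Walk me through the first step.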

Character Definitions

Corn

The Curious Host

  • Asks probing questions that guide the conversation
  • Expresses genuine enthusiasm and wonder
  • Summarizes complex points for the audience
  • Occasionally plays devil's advocate
  • Uses conversational, accessible language

Herman

The Expert

  • Provides deep technical insights and context
  • Draws connections to broader concepts
  • Cites research and real-world examples
  • Acknowledges uncertainty when appropriate
  • Balances depth with accessibility

Dynamic Length Calculation

Script length is determined by topic complexity and the user's prompt duration:

  • Short prompts (under 30s) - 15-25 minute episodes, focused exploration
  • Medium prompts (30s-2min) - 25-35 minute episodes, thorough coverage
  • Long prompts (2min+) - 35-45 minute episodes, deep dive with tangents
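
A sketch of how these buckets can be mapped to a target length, assuming the prompt duration is measured upstream (the helper name and exact boundaries are illustrative):

def target_minutes(prompt_seconds: float) -> tuple[int, int]:
    """Map prompt duration to a (min, max) episode length in minutes."""
    if prompt_seconds < 30:
        return (15, 25)  # short prompt: focused exploration
    if prompt_seconds < 120:
        return (25, 35)  # medium prompt: thorough coverage
    return (35, 45)      # long prompt: deep dive with tangents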

Output Validation

The pipeline validates script output before proceeding:

  • Every line must match the ^(CORN|HERMAN): .+$ pattern
  • Minimum 50 exchanges required (ensures substantive content)
  • Speaker alternation ratio checked (neither host should dominate)
  • Average line length validated (prevents single-word responses)
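
A minimal sketch of these checks; the regex and the 50-exchange minimum come from the list above, while the alternation and line-length tolerances are assumptions:

import re

LINE_RE = re.compile(r"^(CORN|HERMAN): .+$")

def validate_script(script: str) -> list[str]:
    """Return a list of problems; an empty list means the script passes."""
    lines = [l for l in script.strip().splitlines() if l.strip()]
    problems = []

    if not all(LINE_RE.match(l) for l in lines):
        problems.append("found a line that is not 'CORN: ...' or 'HERMAN: ...'")
    if len(lines) < 50:
        problems.append(f"only {len(lines)} exchanges (minimum 50)")

    corn_share = sum(1 for l in lines if l.startswith("CORN:")) / max(len(lines), 1)
    if not 0.35 <= corn_share <= 0.65:
        problems.append("one speaker dominates the conversation")

    if lines and sum(len(l) for l in lines) / len(lines) < 40:
        problems.append("average line too short (single-word responses?)")

    return problems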

3. Voice Synthesis (TTS)

The script is converted to audio using Chatterbox TTS with instant voice cloning:

  • Voice samples uploaded to fal.ai (or local Docker container)
  • Long segments automatically chunked to avoid output limits
  • Parallel processing for faster generation

Voice Cloning Setup

Each host uses a unique cloned voice. Voice reference samples (10-30 seconds of clean speech) are uploaded at pipeline start and cached for the session.

  • Corn's Voice - Warm, energetic, slightly higher pitch
  • Herman's Voice - Calm, measured, authoritative tone
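
A sketch of the upload step using the fal_client package; the sample paths are illustrative, and the returned URLs are what later TTS requests reference:

import fal_client

VOICE_SAMPLES = {
    "CORN": "voices/corn_reference.wav",
    "HERMAN": "voices/herman_reference.wav",
}

def upload_voice_samples() -> dict[str, str]:
    """Upload each reference clip once; fal.ai returns a stable URL per file."""
    return {speaker: fal_client.upload_file(path) for speaker, path in VOICE_SAMPLES.items()}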

Intelligent Chunking Strategy

Chatterbox TTS has output length limits. Long dialogue segments are automatically chunked:

  • Maximum 500 characters per TTS request
  • Split at sentence boundaries when possible
  • Preserve natural speech patterns across chunks
  • Chunks are rejoined with 50ms crossfade to hide seams
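
A sketch of the chunking logic: split on sentence boundaries and only hard-split when a single sentence exceeds the limit (the 500-character ceiling is from the list above, the rest is illustrative):

import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split a dialogue segment into TTS-sized chunks, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Flush the buffer, then hard-split the oversized sentence
            if current:
                chunks.append(current)
                current = ""
            while len(sentence) > max_chars:
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_chars:
            current += " " + sentence
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks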

Parallel Processing

TTS generation is the slowest pipeline stage. To optimize:

  • Up to 4 concurrent TTS requests to fal.ai
  • Segments queued by speaker to maximize cache hits
  • Failed requests automatically retried with exponential backoff
  • Progress tracked per-segment for accurate ETA
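
A sketch of the retry wrapper around a single TTS request; the callable and the retry limits are illustrative:

import random
import time

def with_retries(tts_request, max_attempts: int = 4):
    """Call tts_request(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return tts_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter so parallel workers do not retry in lockstep
            time.sleep(2 ** attempt + random.random())

In the pipeline this would wrap each chunk's request inside the worker pool that issues up to four concurrent calls.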

4. Metadata & Cover Art

Generated in parallel with TTS:

  • Episode title and description via Gemini
  • Blog post article (~800-1200 words)
  • 3 cover art variants via Flux Schnell

Metadata Generation

Gemini analyzes the generated script to produce:

  • Episode Title - Catchy, descriptive, SEO-friendly (max 80 chars)
  • Short Description - 2-3 sentence hook for podcast apps
  • Full Description - Detailed summary with timestamps
  • Tags/Categories - Auto-classified for filtering
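
A sketch of requesting this metadata as structured JSON from Gemini; the prompt wording and key names mirror the list above but are not the production values:

import json
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
script = open("episode_script.txt").read()  # diarized script from the previous stage

prompt = (
    "Given this podcast script, return JSON with keys: title (max 80 characters), "
    "short_description, full_description, and tags (a list of strings).\n\n" + script
)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[prompt],
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)
metadata = json.loads(response.text)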

Cover Art Generation

Flux Schnell generates 3 variant cover images based on episode themes:

  • 1024x1024 resolution (podcast standard)
  • Consistent visual style with show branding
  • Theme-appropriate imagery based on topic
  • Best variant selected automatically via CLIP scoring
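
A sketch of the image request via fal_client; the fal-ai/flux/schnell endpoint and its arguments come from fal.ai's public model catalog but should be verified against the current schema, and the CLIP-based selection is not shown:

import fal_client

episode_theme = "the hidden history of lighthouse keepers"  # illustrative

result = fal_client.subscribe(
    "fal-ai/flux/schnell",
    arguments={
        "prompt": f"Podcast cover art about {episode_theme}, consistent show branding",
        "image_size": "square_hd",  # 1024x1024
        "num_images": 3,
    },
)
image_urls = [image["url"] for image in result["images"]]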

5. Audio Assembly

FFmpeg concatenates the final episode in order:

  1. Intro jingle
  2. AI disclaimer
  3. User's original voice prompt
  4. AI-generated dialogue
  5. Outro jingle

Segment Structure

Each episode follows a consistent structure:

  • Intro Jingle - ~8s
  • AI Disclaimer - ~12s
  • User's Voice Prompt - variable
  • AI Dialogue - 15-45 min
  • Outro Jingle - ~10s

Concatenation Process

FFmpeg handles the audio assembly with careful attention to transitions:

  • Format Normalization - All segments converted to 44.1kHz stereo before joining
  • Crossfades - 200ms crossfade between major sections for smooth transitions
  • Gap Insertion - 500ms silence between prompt and dialogue for natural pacing
  • Concat Demuxer - Used for lossless joining of pre-normalized segments

Technical Implementation

ffmpeg -f concat -safe 0 -i segments.txt -c:a libmp3lame -b:a 192k -ar 44100 -ac 2 episode.mp3
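
The segments.txt referenced above is a standard concat-demuxer file list; an illustrative version (file names are assumptions) looks like:

file 'intro_jingle.mp3'
file 'ai_disclaimer.mp3'
file 'user_prompt_processed.mp3'
file 'dialogue_full.mp3'
file 'outro_jingle.mp3'

Because every listed file has already been normalized to 44.1kHz stereo, the demuxer can join them directly, as described in the list above.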

6. Loudness Optimization

Final audio mastering ensures consistent playback across all platforms:

  • EBU R128 loudness normalization to -16 LUFS
  • True peak limiting to -1.5 dBTP
  • Dynamic range compression for mobile listening

Why Loudness Normalization Matters

Different audio sources (jingles, TTS, user recordings) have vastly different volume levels. Without normalization:

  • Listeners constantly adjust volume between segments
  • Quiet segments get lost on mobile/in cars
  • Loud segments cause distortion or listener fatigue
  • Podcast platforms may reject or re-encode non-compliant audio

EBU R128 Standard

We follow the EBU R128 broadcast standard, widely adopted by podcast platforms:

  • Target Loudness - -16 LUFS (optimal for podcasts)
  • Loudness Range - 5-10 LU (natural dynamics preserved)
  • True Peak - -1.5 dBTP (headroom for lossy encoding)
  • Measurement - Integrated loudness across entire episode

Two-Pass Processing

Loudness normalization requires two passes:

  1. Analysis Pass - Measure current loudness (LUFS), peak levels, and dynamic range
  2. Normalization Pass - Apply calculated gain adjustment with true peak limiting

# Pass 1: Analyze
ffmpeg -i episode.mp3 -af loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json -f null -

# Pass 2: Normalize with measured values
ffmpeg -i episode.mp3 -af loudnorm=I=-16:TP=-1.5:LRA=11:measured_I=-23.5:measured_TP=-4.2:measured_LRA=8.1:measured_thresh=-34.2:offset=-0.5:linear=true output.mp3

Quality Assurance

Post-normalization verification ensures compliance:

  • Final loudness must be within ±0.5 LU of target
  • No true peaks exceeding -1.0 dBTP
  • Clipping detection (zero samples at max)
  • Duration verification (no truncation)
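
A sketch of this check, re-running loudnorm in analysis mode and parsing the JSON report it prints to stderr (tolerances follow the list above):

import json
import subprocess

def verify_loudness(path: str, target_lufs: float = -16.0) -> bool:
    """Re-measure the mastered file and confirm loudness and true peak compliance."""
    proc = subprocess.run(
        ["ffmpeg", "-i", path, "-af",
         "loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json", "-f", "null", "-"],
        capture_output=True, text=True,
    )
    # loudnorm prints a flat JSON object at the end of stderr
    report = json.loads(proc.stderr[proc.stderr.rfind("{"):])
    within_target = abs(float(report["input_i"]) - target_lufs) <= 0.5
    peak_ok = float(report["input_tp"]) <= -1.0
    return within_target and peak_ok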

7. Publishing

The complete episode is published to multiple destinations:

  • Cloudinary - CDN hosting for audio and images
  • Astro Blog - Markdown post with embedded player
  • Neon PostgreSQL - Episode metadata and transcript
  • Wasabi - S3-compatible archive backup

CDN Upload (Cloudinary)

Audio and images are uploaded to Cloudinary for global CDN distribution:

  • Audio - MP3 uploaded with podcast-specific transformations
  • Cover Art - Multiple sizes generated (1400x1400, 600x600, 300x300)
  • URLs - Stable URLs returned for embedding
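
A sketch of the upload calls (folder names are illustrative); note that Cloudinary stores audio under resource_type="video":

import os
import cloudinary
import cloudinary.uploader

cloudinary.config(
    cloud_name=os.environ["CLOUDINARY_CLOUD_NAME"],
    api_key=os.environ["CLOUDINARY_API_KEY"],
    api_secret=os.environ["CLOUDINARY_API_SECRET"],
)

audio = cloudinary.uploader.upload(
    "episode.mp3",
    resource_type="video",  # Cloudinary's resource type that covers audio
    folder="episodes",
)
cover = cloudinary.uploader.upload("cover.png", folder="covers")
audio_url, cover_url = audio["secure_url"], cover["secure_url"]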

Blog Post Generation

An Astro-compatible markdown file is generated:

  • YAML frontmatter with all metadata
  • Embedded audio player component
  • Full transcript with speaker labels
  • Related episode links (when applicable)
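
An illustrative shape for that frontmatter; the field names and values are invented here, since the real keys depend on the Astro content collection schema:

---
title: "Why Do Cats Purr?"
description: "Corn and Herman dig into the surprising science of purring."
pubDate: 2025-01-15
audio: https://res.cloudinary.com/<cloud_name>/video/upload/episodes/why-do-cats-purr.mp3
cover: https://res.cloudinary.com/<cloud_name>/image/upload/covers/why-do-cats-purr.png
tags: ["animals", "science"]
---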

Database Recording

Episode metadata stored in Neon PostgreSQL for querying:

  • Full-text search on transcripts
  • Tag-based filtering
  • Analytics tracking (play counts, duration listened)
  • RSS feed generation source
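
A sketch of the insert using psycopg2; the episodes table and its columns are assumptions, not the actual schema:

import os
import psycopg2

def record_episode(metadata: dict, audio_url: str, transcript: str) -> int:
    """Insert one episode row and return its id (assumes an 'episodes' table exists)."""
    conn = psycopg2.connect(os.environ["POSTGRES_URL"])
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO episodes (title, description, tags, audio_url, transcript)
            VALUES (%s, %s, %s, %s, %s)
            RETURNING id
            """,
            (metadata["title"], metadata["short_description"],
             metadata["tags"], audio_url, transcript),
        )
        return cur.fetchone()[0]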

Archive Backup

All assets backed up to Wasabi S3 for long-term storage:

  • Original prompt audio preserved
  • Generated script (JSON + plain text)
  • Final episode audio
  • All cover art variants
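
A sketch of the backup upload with boto3, pointing the S3 client at the Wasabi endpoint configured in the Environment Variables section below (the key layout is illustrative):

import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["WASABI_ENDPOINT"],
    region_name=os.environ["WASABI_REGION"],
    aws_access_key_id=os.environ["WASABI_ACCESS_KEY"],
    aws_secret_access_key=os.environ["WASABI_SECRET_KEY"],
)

bucket = os.environ["WASABI_BUCKET"]
for local_path, key in [
    ("prompt.mp3", "episode-042/prompt.mp3"),
    ("script.json", "episode-042/script.json"),
    ("episode.mp3", "episode-042/episode.mp3"),
]:
    s3.upload_file(local_path, bucket, key)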

Environment Variables

The pipeline requires the following environment variables. Create a .env file in the backend directory:

.env
# ======================
# AI Services
# ======================

# Google Gemini - Script generation & transcription
GEMINI_API_KEY=your_gemini_api_key_here

# fal.ai - TTS (Chatterbox) and image generation (Flux)
FAL_KEY=your_fal_api_key_here

# ======================
# Media Hosting
# ======================

# Cloudinary - CDN for audio and images
CLOUDINARY_CLOUD_NAME=your_cloud_name
CLOUDINARY_API_KEY=your_api_key
CLOUDINARY_API_SECRET=your_api_secret

# ======================
# Storage & Database
# ======================

# Wasabi S3-compatible storage (backup)
WASABI_ACCESS_KEY=your_wasabi_access_key
WASABI_SECRET_KEY=your_wasabi_secret_key
WASABI_BUCKET=myweirdprompts
WASABI_REGION=eu-central-2
WASABI_ENDPOINT=https://s3.eu-central-2.wasabisys.com

# Neon PostgreSQL database
POSTGRES_URL=postgres://user:pass@host/database?sslmode=require

# ======================
# Optional: Local TTS
# ======================

# Set to "true" to use local Chatterbox Docker instead of fal.ai
USE_LOCAL_TTS=false
CHATTERBOX_URL=http://localhost:8881

Core Pipeline Code

The main entry point processes prompts from the queue:

generate_episode.py (excerpt)
import concurrent.futures
from pathlib import Path

def generate_podcast_episode(
    prompt_audio_path: Path,
    episode_name: str | None = None,
) -> Path:
    """
    Generate a complete podcast episode from a user's audio prompt.

    Workflow:
    1. Upload voice samples + generate script (parallel)
    2. Parse script into segments
    3. Generate metadata + dialogue audio (parallel)
    4. Generate cover art (parallel with TTS)
    5. Assemble final episode
    6. Publish to Cloudinary, blog, and database
    """
    # Initialize clients (get_fal_client() is called for its initialization
    # side effect; its return value is not needed here)
    gemini_client = get_gemini_client()
    get_fal_client()

    # Step 1: Parallel - upload voices, generate script, process prompt.
    # (episode_dir, episode_path, processed_prompt_path, and the jingle paths
    # are set up earlier in the full function; that setup is omitted from this excerpt.)
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        voice_future = executor.submit(upload_voice_samples)
        script_future = executor.submit(
            transcribe_and_generate_script,
            gemini_client,
            prompt_audio_path
        )
        prompt_future = executor.submit(
            process_prompt_audio,
            prompt_audio_path,
            processed_prompt_path
        )

        voice_refs = voice_future.result()
        script = script_future.result()
        processed_prompt_path = prompt_future.result()

    # Step 2: Parse diarized script
    segments = parse_diarized_script(script)

    # Step 3: Parallel - metadata/cover + TTS
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        metadata_future = executor.submit(generate_metadata_and_cover)
        audio_future = executor.submit(
            generate_dialogue_audio,
            segments,
            episode_dir,
            voice_refs
        )

        metadata, cover_art_paths = metadata_future.result()
        dialogue_audio_path = audio_future.result()

    # Step 4: Assemble final episode
    concatenate_episode(
        dialogue_audio=dialogue_audio_path,
        output_path=episode_path,
        user_prompt_audio=processed_prompt_path,
        intro_jingle=intro_jingle,
        disclaimer_audio=DISCLAIMER_PATH,
        outro_jingle=outro_jingle,
    )

    # Step 5: Publish
    publish_episode(episode_dir, episode_path, metadata, cover_art_paths, script)

    return episode_path

Dependencies

Python Packages

pip install google-genai python-dotenv fal-client cloudinary psycopg2-binary boto3 Pillow requests

System Requirements

ffmpeg ffprobe

Required for audio processing and assembly

Source Code

The complete pipeline code is open source and available on GitHub: