Archive: This documents the V2 pipeline. See Pipeline V4 for the current implementation.

Episode Generation Pipeline

The complete workflow from voice prompt to published podcast episode

Pipeline Overview

[Diagram: production process showing the pipeline stages]

Architecture

Voice Input

Audio prompts recorded with the Voicenotes app or any voice memo app

Script Generation

Gemini 2.5 Flash transcribes and generates diarized dialogue

Voice Synthesis

Chatterbox TTS with cloned voices (Corn & Herman)

Cover Art

Flux Schnell generates episode artwork via fal.ai

Audio Assembly

FFmpeg concatenates intro, prompt, dialogue, and outro

Publishing

Cloudinary CDN + Astro blog + PostgreSQL database

Pipeline Stages

1. Voice Capture & Preprocessing

Audio prompts are placed in the prompts/to-process/ queue. The pipeline automatically:

  • Detects and trims leading/trailing silence
  • Compresses long pauses to maintain pacing
  • Normalizes audio levels for consistent output

Silence Detection & Trimming

Uses FFmpeg's silencedetect filter to identify silence periods at the beginning and end of recordings. Configurable threshold (-50dB default) and minimum duration (0.5s) parameters allow fine-tuning for different recording environments.
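
For example, the analysis step can be run on its own to inspect the detected silence intervals (the threshold and duration values here match the defaults above):

ffmpeg -i input.mp3 -af silencedetect=noise=-50dB:d=0.5 -f null -

The filter logs silence_start and silence_end timestamps, which can then be parsed to decide how much to trim from each end of the recording.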

Pause Compression

Long pauses within speech (common in voice memos) are compressed to maintain natural pacing. Pauses exceeding 2 seconds are reduced to 0.8 seconds using intelligent splice points that preserve speech rhythm.

Technical Implementation

ffmpeg -i input.mp3 -af "silenceremove=1:0:-50dB:1:5:-50dB, compand=attacks=0.3:points=-80/-900|-45/-15|-27/-9|0/-7|20/-7:gain=5" output.mp3

2. Transcription & Script Generation

Gemini 2.5 Flash receives the audio and generates a complete diarized podcast script featuring two AI hosts:

  • Corn - The curious, enthusiastic host who asks probing questions
  • Herman - The knowledgeable expert who provides deep insights

Scripts are dynamically sized (15-45 minutes) based on topic complexity and prompt duration.
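
A minimal sketch of this step using the google-genai SDK; the system prompt, user prompt, and file path are illustrative, not the production values:

import os
from google import genai
from google.genai import types

SYSTEM_PROMPT = (
    "You write podcast scripts. Every line must start with exactly "
    "'CORN: ' or 'HERMAN: '. No narration or stage directions."
)

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Upload the preprocessed voice prompt so Gemini can listen to it directly
audio = client.files.upload(file="prompts/to-process/prompt.mp3")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        audio,
        "Transcribe this prompt, then write a diarized podcast script "
        "exploring the topic it raises.",
    ],
    config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
)
script = response.text  # one "CORN:" / "HERMAN:" line per turn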

System Prompt Architecture

The Gemini system prompt is carefully engineered to produce consistent, high-quality diarized scripts. Key directives include:

  • Strict Diarization Format - Every line must begin with exactly CORN: or HERMAN: followed by their dialogue. No narrative text, stage directions, or speaker labels in other formats.
  • Character Coherence - Each host maintains consistent personality traits throughout. Corn stays curious and energetic; Herman remains thoughtful and authoritative.
  • No Cross-Talk Markers - Interruptions and overlapping speech are avoided since TTS processes one speaker at a time.
  • Natural Conversation Flow - Responses should feel like genuine dialogue, not scripted Q&A. Hosts can agree, disagree, build on ideas, and ask follow-up questions.
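
An illustrative fragment of the expected output format (the dialogue itself is invented for the example):

CORN: So Herman, what actually happens when I drop a voice memo into the queue?
HERMAN: The pipeline trims the silence, hands the audio to Gemini, and a full script comes back before the TTS stage even starts.
CORN: That fast? Walk me through the first step.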

Character Definitions

Corn

The Curious Host

  • Asks probing questions that guide the conversation
  • Expresses genuine enthusiasm and wonder
  • Summarizes complex points for the audience
  • Occasionally plays devil's advocate
  • Uses conversational, accessible language

Herman

The Expert

  • Provides deep technical insights and context
  • Draws connections to broader concepts
  • Cites research and real-world examples
  • Acknowledges uncertainty when appropriate
  • Balances depth with accessibility

Dynamic Length Calculation

Script length is determined by topic complexity and the user's prompt duration:

  • Short prompts (under 30s) - 15-25 minute episodes, focused exploration
  • Medium prompts (30s-2min) - 25-35 minute episodes, thorough coverage
  • Long prompts (2min+) - 35-45 minute episodes, deep dive with tangents
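
A sketch of how these buckets can be mapped to a target length, assuming the prompt duration is measured upstream (the helper name and exact boundaries are illustrative):

def target_minutes(prompt_seconds: float) -> tuple[int, int]:
    """Map prompt duration to a (min, max) episode length in minutes."""
    if prompt_seconds < 30:
        return (15, 25)  # short prompt: focused exploration
    if prompt_seconds < 120:
        return (25, 35)  # medium prompt: thorough coverage
    return (35, 45)      # long prompt: deep dive with tangents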

Output Validation

The pipeline validates script output before proceeding:

  • Every line must match the ^(CORN|HERMAN): .+$ pattern
  • Minimum 50 exchanges required (ensures substantive content)
  • Speaker alternation ratio checked (neither host should dominate)
  • Average line length validated (prevents single-word responses)
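
A minimal sketch of these checks; the regex and the 50-exchange minimum come from the list above, while the alternation and line-length tolerances are assumptions:

import re

LINE_RE = re.compile(r"^(CORN|HERMAN): .+$")

def validate_script(script: str) -> list[str]:
    """Return a list of problems; an empty list means the script passes."""
    lines = [l for l in script.strip().splitlines() if l.strip()]
    problems = []

    if not all(LINE_RE.match(l) for l in lines):
        problems.append("found a line that is not 'CORN: ...' or 'HERMAN: ...'")
    if len(lines) < 50:
        problems.append(f"only {len(lines)} exchanges (minimum 50)")

    corn_share = sum(1 for l in lines if l.startswith("CORN:")) / max(len(lines), 1)
    if not 0.35 <= corn_share <= 0.65:
        problems.append("one speaker dominates the conversation")

    if lines and sum(len(l) for l in lines) / len(lines) < 40:
        problems.append("average line too short (single-word responses?)")

    return problems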

3. Voice Synthesis (TTS)

The script is converted to audio using Chatterbox TTS with instant voice cloning:

  • Voice samples uploaded to fal.ai (or local Docker container)
  • Long segments automatically chunked to avoid output limits
  • Parallel processing for faster generation

Voice Cloning Setup

Each host uses a unique cloned voice. Voice reference samples (10-30 seconds of clean speech) are uploaded at pipeline start and cached for the session.

  • Corn's Voice - Warm, energetic, slightly higher pitch
  • Herman's Voice - Calm, measured, authoritative tone
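
A sketch of the upload step using the fal_client package; the sample paths are illustrative, and the returned URLs are what later TTS requests reference:

import fal_client

VOICE_SAMPLES = {
    "CORN": "voices/corn_reference.wav",
    "HERMAN": "voices/herman_reference.wav",
}

def upload_voice_samples() -> dict[str, str]:
    """Upload each reference clip once; fal.ai returns a stable URL per file."""
    return {speaker: fal_client.upload_file(path) for speaker, path in VOICE_SAMPLES.items()}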

Intelligent Chunking Strategy

Chatterbox TTS has output length limits. Long dialogue segments are automatically chunked:

  • Maximum 500 characters per TTS request
  • Split at sentence boundaries when possible
  • Preserve natural speech patterns across chunks
  • Chunks are rejoined with 50ms crossfade to hide seams
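
A sketch of the chunking logic: split on sentence boundaries and only hard-split when a single sentence exceeds the limit (the 500-character ceiling is from the list above, the rest is illustrative):

import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split a dialogue segment into TTS-sized chunks, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Flush the buffer, then hard-split the oversized sentence
            if current:
                chunks.append(current)
                current = ""
            while len(sentence) > max_chars:
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_chars:
            current += " " + sentence
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks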

Parallel Processing

TTS generation is the slowest pipeline stage. To optimize:

  • Up to 4 concurrent TTS requests to fal.ai
  • Segments queued by speaker to maximize cache hits
  • Failed requests automatically retried with exponential backoff
  • Progress tracked per-segment for accurate ETA
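
A sketch of the retry wrapper around a single TTS request; the callable and the retry limits are illustrative:

import random
import time

def with_retries(tts_request, max_attempts: int = 4):
    """Call tts_request(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return tts_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter so parallel workers do not retry in lockstep
            time.sleep(2 ** attempt + random.random())

In the pipeline this would wrap each chunk's request inside the worker pool that issues up to four concurrent calls.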

4. Metadata & Cover Art

Generated in parallel with TTS:

  • Episode title and description via Gemini
  • Blog post article (~800-1200 words)
  • 3 cover art variants via Flux Schnell

Metadata Generation

Gemini analyzes the generated script to produce:

  • Episode Title - Catchy, descriptive, SEO-friendly (max 80 chars)
  • Short Description - 2-3 sentence hook for podcast apps
  • Full Description - Detailed summary with timestamps
  • Tags/Categories - Auto-classified for filtering
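
A sketch of requesting this metadata as structured JSON from Gemini; the prompt wording and key names mirror the list above but are not the production values:

import json
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
script = open("episode_script.txt").read()  # diarized script from the previous stage

prompt = (
    "Given this podcast script, return JSON with keys: title (max 80 characters), "
    "short_description, full_description, and tags (a list of strings).\n\n" + script
)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[prompt],
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)
metadata = json.loads(response.text)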

Cover Art Generation

Flux Schnell generates 3 variant cover images based on episode themes:

  • 1024x1024 resolution (podcast standard)
  • Consistent visual style with show branding
  • Theme-appropriate imagery based on topic
  • Best variant selected automatically via CLIP scoring
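
A sketch of the image request via fal_client; the fal-ai/flux/schnell endpoint and its arguments come from fal.ai's public model catalog but should be verified against the current schema, and the CLIP-based selection is not shown:

import fal_client

episode_theme = "the hidden history of lighthouse keepers"  # illustrative

result = fal_client.subscribe(
    "fal-ai/flux/schnell",
    arguments={
        "prompt": f"Podcast cover art about {episode_theme}, consistent show branding",
        "image_size": "square_hd",  # 1024x1024
        "num_images": 3,
    },
)
image_urls = [image["url"] for image in result["images"]]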

5. Audio Assembly

FFmpeg concatenates the final episode in order:

  1. Intro jingle
  2. AI disclaimer
  3. User's original voice prompt
  4. AI-generated dialogue
  5. Outro jingle

Segment Structure

Each episode follows a consistent structure:

  • Intro Jingle - ~8s
  • AI Disclaimer - ~12s
  • User's Voice Prompt - variable
  • AI Dialogue - 15-45 min
  • Outro Jingle - ~10s

Concatenation Process

FFmpeg handles the audio assembly with careful attention to transitions:

  • Format Normalization - All segments converted to 44.1kHz stereo before joining
  • Crossfades - 200ms crossfade between major sections for smooth transitions
  • Gap Insertion - 500ms silence between prompt and dialogue for natural pacing
  • Concat Demuxer - Used for lossless joining of pre-normalized segments

Technical Implementation

ffmpeg -f concat -safe 0 -i segments.txt -c:a libmp3lame -b:a 192k -ar 44100 -ac 2 episode.mp3
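
The segments.txt referenced above is a standard concat-demuxer file list; an illustrative version (file names are assumptions) looks like:

file 'intro_jingle.mp3'
file 'ai_disclaimer.mp3'
file 'user_prompt_processed.mp3'
file 'dialogue_full.mp3'
file 'outro_jingle.mp3'

Because every listed file has already been normalized to 44.1kHz stereo, the demuxer can join them directly, as described in the list above.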

6. Loudness Optimization

Final audio mastering ensures consistent playback across all platforms:

  • EBU R128 loudness normalization to -16 LUFS
  • True peak limiting to -1.5 dBTP
  • Dynamic range compression for mobile listening

Why Loudness Normalization Matters

Different audio sources (jingles, TTS, user recordings) have vastly different volume levels. Without normalization:

  • Listeners constantly adjust volume between segments
  • Quiet segments get lost on mobile/in cars
  • Loud segments cause distortion or listener fatigue
  • Podcast platforms may reject or re-encode non-compliant audio

EBU R128 Standard

We follow the EBU R128 broadcast standard, widely adopted by podcast platforms:

  • Target Loudness - -16 LUFS (optimal for podcasts)
  • Loudness Range - 5-10 LU (natural dynamics preserved)
  • True Peak - -1.5 dBTP (headroom for lossy encoding)
  • Measurement - Integrated loudness across entire episode

Two-Pass Processing

Loudness normalization requires two passes:

  1. Analysis Pass - Measure current loudness (LUFS), peak levels, and dynamic range
  2. Normalization Pass - Apply calculated gain adjustment with true peak limiting

# Pass 1: Analyze
ffmpeg -i episode.mp3 -af loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json -f null -

# Pass 2: Normalize with measured values
ffmpeg -i episode.mp3 -af loudnorm=I=-16:TP=-1.5:LRA=11:measured_I=-23.5:measured_TP=-4.2:measured_LRA=8.1:measured_thresh=-34.2:offset=-0.5:linear=true output.mp3

Quality Assurance

Post-normalization verification ensures compliance:

  • Final loudness must be within ±0.5 LU of target
  • No true peaks exceeding -1.0 dBTP
  • Clipping detection (zero samples at max)
  • Duration verification (no truncation)
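
A sketch of this check, re-running loudnorm in analysis mode and parsing the JSON report it prints to stderr (tolerances follow the list above):

import json
import subprocess

def verify_loudness(path: str, target_lufs: float = -16.0) -> bool:
    """Re-measure the mastered file and confirm loudness and true peak compliance."""
    proc = subprocess.run(
        ["ffmpeg", "-i", path, "-af",
         "loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json", "-f", "null", "-"],
        capture_output=True, text=True,
    )
    # loudnorm prints a flat JSON object at the end of stderr
    report = json.loads(proc.stderr[proc.stderr.rfind("{"):])
    within_target = abs(float(report["input_i"]) - target_lufs) <= 0.5
    peak_ok = float(report["input_tp"]) <= -1.0
    return within_target and peak_ok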

7. Publishing

The complete episode is published to multiple destinations:

  • Cloudinary - CDN hosting for audio and images
  • Astro Blog - Markdown post with embedded player
  • Neon PostgreSQL - Episode metadata and transcript
  • Wasabi - S3-compatible archive backup

CDN Upload (Cloudinary)

Audio and images are uploaded to Cloudinary for global CDN distribution:

  • Audio - MP3 uploaded with podcast-specific transformations
  • Cover Art - Multiple sizes generated (1400x1400, 600x600, 300x300)
  • URLs - Stable URLs returned for embedding
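
A sketch of the upload calls (folder names are illustrative); note that Cloudinary stores audio under resource_type="video":

import os
import cloudinary
import cloudinary.uploader

cloudinary.config(
    cloud_name=os.environ["CLOUDINARY_CLOUD_NAME"],
    api_key=os.environ["CLOUDINARY_API_KEY"],
    api_secret=os.environ["CLOUDINARY_API_SECRET"],
)

audio = cloudinary.uploader.upload(
    "episode.mp3",
    resource_type="video",  # Cloudinary's resource type that covers audio
    folder="episodes",
)
cover = cloudinary.uploader.upload("cover.png", folder="covers")
audio_url, cover_url = audio["secure_url"], cover["secure_url"]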

Blog Post Generation

An Astro-compatible markdown file is generated:

  • YAML frontmatter with all metadata
  • Embedded audio player component
  • Full transcript with speaker labels
  • Related episode links (when applicable)
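
An illustrative shape for that frontmatter; the field names and values are invented here, since the real keys depend on the Astro content collection schema:

---
title: "Why Do Cats Purr?"
description: "Corn and Herman dig into the surprising science of purring."
pubDate: 2025-01-15
audio: https://res.cloudinary.com/<cloud_name>/video/upload/episodes/why-do-cats-purr.mp3
cover: https://res.cloudinary.com/<cloud_name>/image/upload/covers/why-do-cats-purr.png
tags: ["animals", "science"]
---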

Database Recording

Episode metadata stored in Neon PostgreSQL for querying:

  • Full-text search on transcripts
  • Tag-based filtering
  • Analytics tracking (play counts, duration listened)
  • RSS feed generation source
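
A sketch of the insert using psycopg2; the episodes table and its columns are assumptions, not the actual schema:

import os
import psycopg2

def record_episode(metadata: dict, audio_url: str, transcript: str) -> int:
    """Insert one episode row and return its id (assumes an 'episodes' table exists)."""
    conn = psycopg2.connect(os.environ["POSTGRES_URL"])
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO episodes (title, description, tags, audio_url, transcript)
            VALUES (%s, %s, %s, %s, %s)
            RETURNING id
            """,
            (metadata["title"], metadata["short_description"],
             metadata["tags"], audio_url, transcript),
        )
        return cur.fetchone()[0]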

Archive Backup

All assets backed up to Wasabi S3 for long-term storage:

  • Original prompt audio preserved
  • Generated script (JSON + plain text)
  • Final episode audio
  • All cover art variants
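
A sketch of the backup upload with boto3, pointing the S3 client at the Wasabi endpoint configured in the Environment Variables section below (the key layout is illustrative):

import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["WASABI_ENDPOINT"],
    region_name=os.environ["WASABI_REGION"],
    aws_access_key_id=os.environ["WASABI_ACCESS_KEY"],
    aws_secret_access_key=os.environ["WASABI_SECRET_KEY"],
)

bucket = os.environ["WASABI_BUCKET"]
for local_path, key in [
    ("prompt.mp3", "episode-042/prompt.mp3"),
    ("script.json", "episode-042/script.json"),
    ("episode.mp3", "episode-042/episode.mp3"),
]:
    s3.upload_file(local_path, bucket, key)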

Environment Variables

The pipeline requires the following environment variables. Create a .env file in the backend directory:

.env
# ======================
# AI Services
# ======================

# Google Gemini - Script generation & transcription
GEMINI_API_KEY=your_gemini_api_key_here

# fal.ai - TTS (Chatterbox) and image generation (Flux)
FAL_KEY=your_fal_api_key_here

# ======================
# Media Hosting
# ======================

# Cloudinary - CDN for audio and images
CLOUDINARY_CLOUD_NAME=your_cloud_name
CLOUDINARY_API_KEY=your_api_key
CLOUDINARY_API_SECRET=your_api_secret

# ======================
# Storage & Database
# ======================

# Wasabi S3-compatible storage (backup)
WASABI_ACCESS_KEY=your_wasabi_access_key
WASABI_SECRET_KEY=your_wasabi_secret_key
WASABI_BUCKET=myweirdprompts
WASABI_REGION=eu-central-2
WASABI_ENDPOINT=https://s3.eu-central-2.wasabisys.com

# Neon PostgreSQL database
POSTGRES_URL=postgres://user:pass@host/database?sslmode=require

# ======================
# Optional: Local TTS
# ======================

# Set to "true" to use local Chatterbox Docker instead of fal.ai
USE_LOCAL_TTS=false
CHATTERBOX_URL=http://localhost:8881

Core Pipeline Code

The main entry point processes prompts from the queue:

generate_episode.py (excerpt)
import concurrent.futures
from pathlib import Path

def generate_podcast_episode(
    prompt_audio_path: Path,
    episode_name: str | None = None,
) -> Path:
    """
    Generate a complete podcast episode from a user's audio prompt.

    Workflow:
    1. Upload voice samples + generate script (parallel)
    2. Parse script into segments
    3. Generate metadata + dialogue audio (parallel)
    4. Generate cover art (parallel with TTS)
    5. Assemble final episode
    6. Publish to Cloudinary, blog, and database
    """
    # Initialize clients (get_fal_client() is called for its initialization
    # side effect; its return value is not needed here)
    gemini_client = get_gemini_client()
    get_fal_client()

    # Step 1: Parallel - upload voices, generate script, process prompt.
    # (episode_dir, episode_path, processed_prompt_path, and the jingle paths
    # are set up earlier in the full function; that setup is omitted from this excerpt.)
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        voice_future = executor.submit(upload_voice_samples)
        script_future = executor.submit(
            transcribe_and_generate_script,
            gemini_client,
            prompt_audio_path
        )
        prompt_future = executor.submit(
            process_prompt_audio,
            prompt_audio_path,
            processed_prompt_path
        )

        voice_refs = voice_future.result()
        script = script_future.result()
        processed_prompt_path = prompt_future.result()

    # Step 2: Parse diarized script
    segments = parse_diarized_script(script)

    # Step 3: Parallel - metadata/cover + TTS
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        metadata_future = executor.submit(generate_metadata_and_cover)
        audio_future = executor.submit(
            generate_dialogue_audio,
            segments,
            episode_dir,
            voice_refs
        )

        metadata, cover_art_paths = metadata_future.result()
        dialogue_audio_path = audio_future.result()

    # Step 4: Assemble final episode
    concatenate_episode(
        dialogue_audio=dialogue_audio_path,
        output_path=episode_path,
        user_prompt_audio=processed_prompt_path,
        intro_jingle=intro_jingle,
        disclaimer_audio=DISCLAIMER_PATH,
        outro_jingle=outro_jingle,
    )

    # Step 5: Publish
    publish_episode(episode_dir, episode_path, metadata, cover_art_paths, script)

    return episode_path

Dependencies

Python Packages

pip install google-genai python-dotenv fal-client cloudinary psycopg2-binary boto3 Pillow requests

System Requirements

ffmpeg ffprobe

Required for audio processing and assembly

Source Code

The complete pipeline code is open source and available on GitHub: