Archive: This documents the V2 pipeline. See Pipeline V4 for the current implementation.
Episode Generation Pipeline
The complete workflow from voice prompt to published podcast episode
Pipeline Overview
Architecture
- Voice Input - Audio prompts recorded via the Voicenotes app or any voice memo tool
- Script Generation - Gemini 2.5 Flash transcribes the prompt and generates a diarized dialogue
- Voice Synthesis - Chatterbox TTS with cloned voices (Corn & Herman)
- Cover Art - Flux Schnell generates episode artwork via fal.ai
- Audio Assembly - FFmpeg concatenates intro, prompt, dialogue, and outro
- Publishing - Cloudinary CDN + Astro blog + PostgreSQL database
Pipeline Stages
Voice Capture & Preprocessing
Audio prompts are placed in the prompts/to-process/ queue. The pipeline automatically:
- Detects and trims leading/trailing silence
- Compresses long pauses to maintain pacing
- Normalizes audio levels for consistent output
Silence Detection & Trimming
Uses FFmpeg's silencedetect filter to identify silence periods at the beginning and end of recordings. Configurable threshold (-50dB default) and minimum duration (0.5s) parameters allow fine-tuning for different recording environments.
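For reference, a detection-only pass with these defaults can be run on its own; silencedetect logs silence_start and silence_end timestamps to stderr:
ffmpeg -i input.mp3 -af silencedetect=noise=-50dB:d=0.5 -f null -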
Pause Compression
Long pauses within speech (common in voice memos) are compressed to maintain natural pacing. Pauses exceeding 2 seconds are reduced to 0.8 seconds using intelligent splice points that preserve speech rhythm.
Technical Implementation
ffmpeg -i input.mp3 -af "silenceremove=1:0:-50dB:1:5:-50dB, compand=attacks=0.3:points=-80/-900|-45/-15|-27/-9|0/-7|20/-7:gain=5" output.mp3
Transcription & Script Generation
Gemini 2.5 Flash receives the audio and generates a complete diarized podcast script featuring two AI hosts:
- Corn - The curious, enthusiastic host who asks probing questions
- Herman - The knowledgeable expert who provides deep insights
Scripts are dynamically sized (roughly 15-45 minutes) based on topic complexity and prompt length.
System Prompt Architecture
The Gemini system prompt is carefully engineered to produce consistent, high-quality diarized scripts. Key directives include:
- Strict Diarization Format - Every line must begin with exactly CORN: or HERMAN: followed by their dialogue. No narrative text, stage directions, or speaker labels in other formats.
- Character Coherence - Each host maintains consistent personality traits throughout. Corn stays curious and energetic; Herman remains thoughtful and authoritative.
- No Cross-Talk Markers - Interruptions and overlapping speech are avoided since TTS processes one speaker at a time.
- Natural Conversation Flow - Responses should feel like genuine dialogue, not scripted Q&A. Hosts can agree, disagree, build on ideas, and ask follow-up questions.
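A hypothetical exchange in the required format:
CORN: Okay Herman, I have to ask: what actually happens to my voice memo once I hit upload?
HERMAN: Quite a lot, actually. The pipeline trims the silence, compresses the long pauses, and only then hands the audio to the model.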
Character Definitions
Corn
The Curious Host
- Asks probing questions that guide the conversation
- Expresses genuine enthusiasm and wonder
- Summarizes complex points for the audience
- Occasionally plays devil's advocate
- Uses conversational, accessible language
Herman
The Expert
- Provides deep technical insights and context
- Draws connections to broader concepts
- Cites research and real-world examples
- Acknowledges uncertainty when appropriate
- Balances depth with accessibility
Dynamic Length Calculation
Script length is determined by topic complexity and the user's prompt duration:
- Short prompts (under 30s) - 15-25 minute episodes, focused exploration
- Medium prompts (30s-2min) - 25-35 minute episodes, thorough coverage
- Long prompts (2min+) - 35-45 minute episodes, deep dive with tangents
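A minimal sketch of this mapping in Python (the function name is illustrative; thresholds come from the tiers above):
def target_length_minutes(prompt_seconds: float) -> tuple[int, int]:
    """Map prompt duration to a target episode length range in minutes."""
    if prompt_seconds < 30:
        return (15, 25)  # short prompt: focused exploration
    if prompt_seconds < 120:
        return (25, 35)  # medium prompt: thorough coverage
    return (35, 45)      # long prompt: deep dive with tangents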
Output Validation
The pipeline validates script output before proceeding:
- Every line must match the ^(CORN|HERMAN): .+$ pattern
- Minimum 50 exchanges required (ensures substantive content)
- Speaker alternation ratio checked (neither host should dominate)
- Average line length validated (prevents single-word responses)
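A sketch of these checks (the 50-exchange minimum and regex come from the list above; the balance and line-length thresholds are illustrative assumptions):
import re

LINE_PATTERN = re.compile(r"^(CORN|HERMAN): .+$")

def validate_script(script: str) -> list[str]:
    """Return a list of validation errors; an empty list means the script passes."""
    lines = [line for line in script.splitlines() if line.strip()]
    errors = []
    if not lines or not all(LINE_PATTERN.match(line) for line in lines):
        errors.append("malformed line: every line must match ^(CORN|HERMAN): .+$")
    if len(lines) < 50:
        errors.append("fewer than 50 exchanges")
    corn_share = sum(line.startswith("CORN:") for line in lines) / max(len(lines), 1)
    if not 0.35 <= corn_share <= 0.65:  # illustrative balance threshold
        errors.append("speaker alternation is too one-sided")
    if sum(len(line) for line in lines) / max(len(lines), 1) < 40:  # illustrative
        errors.append("average line length too short")
    return errors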
Voice Synthesis (TTS)
The script is converted to audio using Chatterbox TTS with instant voice cloning:
- Voice samples uploaded to fal.ai (or local Docker container)
- Long segments automatically chunked to avoid output limits
- Parallel processing for faster generation
Voice Cloning Setup
Each host uses a unique cloned voice. Voice reference samples (10-30 seconds of clean speech) are uploaded at pipeline start and cached for the session.
- Corn's Voice - Warm, energetic, slightly higher pitch
- Herman's Voice - Calm, measured, authoritative tone
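A minimal sketch of the upload-and-cache step using fal_client (which reads FAL_KEY from the environment); the sample file paths are illustrative:
from functools import lru_cache

import fal_client

@lru_cache(maxsize=1)
def upload_voice_samples() -> dict[str, str]:
    """Upload the reference clips once per session and cache the hosted URLs."""
    return {
        "CORN": fal_client.upload_file("voices/corn_reference.wav"),
        "HERMAN": fal_client.upload_file("voices/herman_reference.wav"),
    }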
Intelligent Chunking Strategy
Chatterbox TTS has output length limits. Long dialogue segments are automatically chunked:
- Maximum 500 characters per TTS request
- Split at sentence boundaries when possible
- Preserve natural speech patterns across chunks
- Chunks are rejoined with 50ms crossfade to hide seams
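A sketch of the chunker (500-character limit from above; the sentence-boundary regex is a simplification, and a single sentence longer than the limit passes through unsplit):
import re

MAX_TTS_CHARS = 500

def chunk_text(text: str, max_chars: int = MAX_TTS_CHARS) -> list[str]:
    """Split a dialogue segment into TTS-sized chunks at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks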
Parallel Processing
TTS generation is the slowest pipeline stage. To optimize:
- Up to 4 concurrent TTS requests to fal.ai
- Segments queued by speaker to maximize cache hits
- Failed requests automatically retried with exponential backoff
- Progress tracked per-segment for accurate ETA
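The retry behaviour might look like the following sketch (delay constants are illustrative); wrapped calls can then be submitted to a ThreadPoolExecutor with max_workers=4 to match the concurrency limit above:
import random
import time

def with_retry(make_request, max_attempts: int = 4, base_delay: float = 1.0):
    """Call make_request(), retrying failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter so parallel retries don't align
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))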
Metadata & Cover Art
Generated in parallel with TTS:
- Episode title and description via Gemini
- Blog post article (~800-1200 words)
- 3 cover art variants via Flux Schnell
Metadata Generation
Gemini analyzes the generated script to produce:
- Episode Title - Catchy, descriptive, SEO-friendly (max 80 chars)
- Short Description - 2-3 sentence hook for podcast apps
- Full Description - Detailed summary with timestamps
- Tags/Categories - Auto-classified for filtering
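A minimal sketch of one such call with the google-genai SDK (prompt wording is illustrative):
import os

from google import genai

def generate_title(script: str) -> str:
    """Ask Gemini for an episode title derived from the finished script."""
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="Write a catchy, SEO-friendly podcast episode title "
                 "(80 characters max) for this script:\n\n" + script,
    )
    return response.text.strip()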
Cover Art Generation
Flux Schnell generates 3 variant cover images based on episode themes:
- 1024x1024 resolution (podcast standard)
- Consistent visual style with show branding
- Theme-appropriate imagery based on topic
- Best variant selected automatically via CLIP scoring
Audio Assembly
FFmpeg concatenates the final episode in order:
- Intro jingle
- AI disclaimer
- User's original voice prompt
- AI-generated dialogue
- Outro jingle
Segment Structure
Each episode follows a consistent structure: intro jingle, AI disclaimer, user prompt, AI dialogue, outro jingle.
Concatenation Process
FFmpeg handles the audio assembly with careful attention to transitions:
- Format Normalization - All segments converted to 44.1kHz stereo before joining
- Crossfades - 200ms crossfade between major sections for smooth transitions
- Gap Insertion - 500ms silence between prompt and dialogue for natural pacing
- Concat Demuxer - Used for lossless joining of pre-normalized segments
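The demuxer reads a plain-text manifest listing the pre-normalized files in playback order; for this pipeline, segments.txt would look roughly like this (filenames illustrative):
file 'intro_jingle.mp3'
file 'disclaimer.mp3'
file 'user_prompt.mp3'
file 'dialogue.mp3'
file 'outro_jingle.mp3'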
Technical Implementation
ffmpeg -f concat -safe 0 -i segments.txt -c:a libmp3lame -b:a 192k -ar 44100 -ac 2 episode.mp3
Loudness Optimization
Final audio mastering ensures consistent playback across all platforms:
- EBU R128 loudness normalization to -16 LUFS
- True peak limiting to -1.5 dBTP
- Dynamic range compression for mobile listening
Why Loudness Normalization Matters
Different audio sources (jingles, TTS, user recordings) have vastly different volume levels. Without normalization:
- Listeners constantly adjust volume between segments
- Quiet segments get lost on mobile/in cars
- Loud segments cause distortion or listener fatigue
- Podcast platforms may reject or re-encode non-compliant audio
EBU R128 Standard
We follow the EBU R128 broadcast standard, widely adopted by podcast platforms:
- Target Loudness - -16 LUFS (optimal for podcasts)
- Loudness Range - 5-10 LU (natural dynamics preserved)
- True Peak - -1.5 dBTP (headroom for lossy encoding)
- Measurement - Integrated loudness across entire episode
Two-Pass Processing
Loudness normalization requires two passes:
- Analysis Pass - Measure current loudness (LUFS), peak levels, and dynamic range
- Normalization Pass - Apply calculated gain adjustment with true peak limiting
# Pass 1: Analyze
ffmpeg -i episode.mp3 -af loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json -f null -
# Pass 2: Normalize with measured values
ffmpeg -i episode.mp3 -af loudnorm=I=-16:TP=-1.5:LRA=11:measured_I=-23.5:measured_TP=-4.2:measured_LRA=8.1:measured_thresh=-34.2:offset=-0.5:linear=true output.mp3
Quality Assurance
Post-normalization verification ensures compliance:
- Final loudness must be within ±0.5 LU of target
- No true peaks exceeding -1.0 dBTP
- Clipping detection (no samples pinned at digital full scale)
- Duration verification (no truncation)
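The two passes and the verification step can be wired together programmatically. A sketch, assuming the loudnorm measurements are the last JSON block ffmpeg prints to stderr:
import json
import subprocess

LOUDNORM = "loudnorm=I=-16:TP=-1.5:LRA=11"

def measure(path: str) -> dict:
    """Pass 1: analysis only; parse the JSON block loudnorm prints to stderr."""
    result = subprocess.run(
        ["ffmpeg", "-i", path, "-af", f"{LOUDNORM}:print_format=json", "-f", "null", "-"],
        capture_output=True, text=True,
    )
    return json.loads(result.stderr[result.stderr.rindex("{"):])

def normalize(path: str, out: str) -> None:
    """Pass 2: apply linear normalization using the measured values, then verify."""
    m = measure(path)
    af = (f"{LOUDNORM}:measured_I={m['input_i']}:measured_TP={m['input_tp']}:"
          f"measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}:"
          f"offset={m['target_offset']}:linear=true")
    subprocess.run(["ffmpeg", "-y", "-i", path, "-af", af, out], check=True)
    final_i = float(measure(out)["input_i"])
    assert abs(final_i - (-16.0)) <= 0.5, f"loudness {final_i} LUFS out of tolerance"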
Publishing
The complete episode is published to multiple destinations:
- Cloudinary - CDN hosting for audio and images
- Astro Blog - Markdown post with embedded player
- Neon PostgreSQL - Episode metadata and transcript
- Wasabi - S3-compatible archive backup
CDN Upload (Cloudinary)
Audio and images are uploaded to Cloudinary for global CDN distribution:
- Audio - MP3 uploaded with podcast-specific transformations
- Cover Art - Multiple sizes generated (1400x1400, 600x600, 300x300)
- URLs - Stable URLs returned for embedding
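A minimal upload sketch with the cloudinary SDK (public IDs are illustrative; note that Cloudinary stores audio under its video resource type):
import os

import cloudinary
import cloudinary.uploader

cloudinary.config(
    cloud_name=os.environ["CLOUDINARY_CLOUD_NAME"],
    api_key=os.environ["CLOUDINARY_API_KEY"],
    api_secret=os.environ["CLOUDINARY_API_SECRET"],
)

audio = cloudinary.uploader.upload(
    "episode.mp3",
    resource_type="video",  # Cloudinary's resource type for audio files
    public_id="episodes/ep-042",  # illustrative public ID
)
cover = cloudinary.uploader.upload("cover.png", public_id="covers/ep-042")
print(audio["secure_url"], cover["secure_url"])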
Blog Post Generation
An Astro-compatible markdown file is generated:
- YAML frontmatter with all metadata
- Embedded audio player component
- Full transcript with speaker labels
- Related episode links (when applicable)
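A hypothetical frontmatter block (all field names and values are illustrative, not the pipeline's actual schema):
---
title: "Why Do We Dream?"
description: "Corn and Herman dig into the neuroscience of dreaming."
pubDate: 2025-01-15
audio: https://res.cloudinary.com/<cloud>/video/upload/episodes/ep-042.mp3
cover: https://res.cloudinary.com/<cloud>/image/upload/covers/ep-042.jpg
tags: [neuroscience, sleep]
---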
Database Recording
Episode metadata stored in Neon PostgreSQL for querying:
- Full-text search on transcripts
- Tag-based filtering
- Analytics tracking (play counts, duration listened)
- RSS feed generation source
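A sketch of the insert with psycopg2, assuming a hypothetical episodes table (schema and column names are illustrative):
import os

import psycopg2

def record_episode(metadata: dict, transcript: str) -> None:
    """Insert one episode row; the connection string comes from POSTGRES_URL."""
    conn = psycopg2.connect(os.environ["POSTGRES_URL"])
    with conn, conn.cursor() as cur:  # the outer block commits on success
        cur.execute(
            "INSERT INTO episodes (title, description, audio_url, transcript)"
            " VALUES (%s, %s, %s, %s)",
            (metadata["title"], metadata["description"],
             metadata["audio_url"], transcript),
        )
    conn.close()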
Archive Backup
All assets backed up to Wasabi S3 for long-term storage:
- Original prompt audio preserved
- Generated script (JSON + plain text)
- Final episode audio
- All cover art variants
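A backup sketch with boto3, pointed at the Wasabi endpoint via the environment variables documented below (object key layout is illustrative):
import os
from pathlib import Path

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["WASABI_ENDPOINT"],
    aws_access_key_id=os.environ["WASABI_ACCESS_KEY"],
    aws_secret_access_key=os.environ["WASABI_SECRET_KEY"],
    region_name=os.environ["WASABI_REGION"],
)

def archive_episode(episode_dir: Path) -> None:
    """Upload every file in the episode directory to the archive bucket."""
    bucket = os.environ["WASABI_BUCKET"]
    for path in episode_dir.iterdir():
        if path.is_file():
            s3.upload_file(str(path), bucket, f"episodes/{episode_dir.name}/{path.name}")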
Environment Variables
The pipeline requires the following environment variables. Create a .env file in the backend directory:
# ======================
# AI Services
# ======================
# Google Gemini - Script generation & transcription
GEMINI_API_KEY=your_gemini_api_key_here
# fal.ai - TTS (Chatterbox) and image generation (Flux)
FAL_KEY=your_fal_api_key_here
# ======================
# Media Hosting
# ======================
# Cloudinary - CDN for audio and images
CLOUDINARY_CLOUD_NAME=your_cloud_name
CLOUDINARY_API_KEY=your_api_key
CLOUDINARY_API_SECRET=your_api_secret
# ======================
# Storage & Database
# ======================
# Wasabi S3-compatible storage (backup)
WASABI_ACCESS_KEY=your_wasabi_access_key
WASABI_SECRET_KEY=your_wasabi_secret_key
WASABI_BUCKET=myweirdprompts
WASABI_REGION=eu-central-2
WASABI_ENDPOINT=https://s3.eu-central-2.wasabisys.com
# Neon PostgreSQL database
POSTGRES_URL=postgres://user:pass@host/database?sslmode=require
# ======================
# Optional: Local TTS
# ======================
# Set to "true" to use local Chatterbox Docker instead of fal.ai
USE_LOCAL_TTS=false
CHATTERBOX_URL=http://localhost:8881
Core Pipeline Code
The main entry point processes prompts from the queue:
import concurrent.futures
from pathlib import Path

def generate_podcast_episode(
    prompt_audio_path: Path,
    episode_name: str | None = None,
) -> Path:
    """
    Generate a complete podcast episode from a user's audio prompt.

    Workflow:
    1. Upload voice samples + generate script (parallel)
    2. Parse script into segments
    3. Generate metadata + dialogue audio (parallel)
    4. Generate cover art (parallel with TTS)
    5. Assemble final episode
    6. Publish to Cloudinary, blog, and database
    """
    # Initialize clients (the fal client only needs to be configured once)
    gemini_client = get_gemini_client()
    get_fal_client()

    # Per-episode working directory and output paths
    # (EPISODES_DIR is a module-level constant, assumed defined elsewhere)
    episode_name = episode_name or prompt_audio_path.stem
    episode_dir = EPISODES_DIR / episode_name
    episode_dir.mkdir(parents=True, exist_ok=True)
    processed_prompt_path = episode_dir / "prompt_processed.mp3"
    episode_path = episode_dir / "episode.mp3"

    # Step 1: Parallel - upload voices, generate script, process prompt
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        voice_future = executor.submit(upload_voice_samples)
        script_future = executor.submit(
            transcribe_and_generate_script, gemini_client, prompt_audio_path
        )
        prompt_future = executor.submit(
            process_prompt_audio, prompt_audio_path, processed_prompt_path
        )
        voice_refs = voice_future.result()
        script = script_future.result()
        prompt_future.result()  # raises here if preprocessing failed

    # Step 2: Parse diarized script
    segments = parse_diarized_script(script)

    # Step 3: Parallel - metadata/cover + TTS
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        metadata_future = executor.submit(generate_metadata_and_cover)
        audio_future = executor.submit(
            generate_dialogue_audio, segments, episode_dir, voice_refs
        )
        metadata, cover_art_paths = metadata_future.result()
        dialogue_audio_path = audio_future.result()

    # Step 4: Assemble final episode
    # (intro_jingle, outro_jingle, and DISCLAIMER_PATH are module-level assets)
    concatenate_episode(
        dialogue_audio=dialogue_audio_path,
        output_path=episode_path,
        user_prompt_audio=processed_prompt_path,
        intro_jingle=intro_jingle,
        disclaimer_audio=DISCLAIMER_PATH,
        outro_jingle=outro_jingle,
    )

    # Step 5: Publish
    publish_episode(episode_dir, episode_path, metadata, cover_art_paths, script)

    return episode_path
Dependencies
Python Packages
pip install google-genai python-dotenv fal-client cloudinary psycopg2-binary boto3 Pillow requests
System Requirements
ffmpeg and ffprobe - required for audio processing and assembly
Source Code
The complete pipeline code is open source and available on GitHub.