#speech-recognition
37 episodes
#2754: Why Your Dictation Setup Might Be Wrong
Modern ASR is shockingly robust. The biggest predictor of accuracy? How well your audio matches its training data.
#2643: How Stenographers Type 300 Words Per Minute
Court reporters don’t type letters—they chord syllables at 300 words per minute. Here’s how it works and why AI can’t replace them yet.
#2618: Fixing Acronyms in TTS Pipelines
How to handle acronyms in text-to-speech pipelines using BERT models, lexicons, and layered preprocessing.
#2590: How Disfluency Detection Models Clean Up Speech
How transformer models distinguish "um" from meaningful speech — and why removing too much makes you sound like a robot.
#2582: What Your Browser Does to Mic Audio Before It Reaches Your Server
getUserMedia returns audio, but not raw audio. Here's what browsers actually do to your mic feed before it hits your server.
#2563: How Audio Fingerprinting Actually Works
Spectrogram peaks, constellation maps, and hash matching — the elegant mechanics behind identifying any song in seconds.
#2543: Base64 for Audio: What Developers Need to Know
Base64 isn’t compression — it’s a safe transport encoding. Here’s how it works with audio APIs and where its limits are.
#2510: Where Voice AI Actually Works (Not Cold Calls)
Drive-thru accuracy, healthcare triage, and the design secret that makes people *want* to talk to a machine.
#2486: Why Noise Reduction Can Ruin Transcription Accuracy
Cleaning audio before transcription can increase errors by up to 46%. Here's the right approach for your voice app.
#2479: Hands-Free Dictation with a Screaming Baby
Choosing the right headset and control method for dictation when you're holding a baby who won't stop screaming.
#2443: How Podcast RSS Feeds Can Speak Every Language
One RSS feed, a transcript tag, and TTS voice cloning — the emerging standard for letting any podcast speak any language.
#2337: How Speaker Diarization Powers Everything From Call Centers to Courts
Discover how PyAnnote and other tools tackle the critical task of identifying "who spoke when" in audio—and why it’s harder than it sounds.
#2311: Danish AI: Bridging the Localization Gap
How does AI handle Danish? Explore the challenges and progress in making AI tools work for small-language populations.
#2288: The Invisible Gatekeeper of Voice Tech
How voice activity detection shapes every step of the voice tech pipeline, and why it’s harder than it seems.
#2272: The AI Transcription Sweet Spot
Does higher-quality audio make AI transcription worse? New research reveals a surprising "sweet spot" for bitrate, challenging a core assumption of...
#2192: How We Built a Podcast Pipeline
Hilbert reveals the complete technical architecture behind 2,000+ episodes—from voice memos to GPU-powered TTS, with Claude models, LangGraph workf...
#2183: Making Voice Agents Feel Natural
Turn-taking, interruptions, and latency are destroying voice AI UX—and the fixes are deeply technical. Here's what's actually happening underneath.
#2027: Text-In, Text-Out: The Missing Photoshop for Words
Why is editing text with AI so clunky? We explore the "TITO" paradigm—using small, local models for fast, private text transformation.
#1752: Whisper Small Beats Whisper Large in Speed & Accuracy
A 4GPU benchmark on Ubuntu shows the 1.5B-parameter Whisper Large is slower and less accurate than the much smaller Whisper Small.
#1715: Why Voice Agents Need Frameworks (Not Just APIs)
Raw APIs handle models, but who manages the audio plumbing? We break down Vapi, LiveKit, and Pipecat.
#1634: Agent Interview: Inception Mercury 2
Meet Mercury 2, the Abu Dhabi-based AI using diffusion architecture to cut costs and boost wit.
#1601: Cohere: The Switzerland of Enterprise AI
While others chase viral memes, Cohere is quietly building the secure, cloud-agnostic infrastructure powering the global enterprise.
#1539: The Voice Keyboard: Killing the "Digital Sandwich"
Stop shouting at your phone. Discover how dedicated hardware and local AI are making instant, private voice-to-text a reality.
#868: Beyond the Digital Sandwich: Pro Mobile Mics for AI
Stop holding your phone like a piece of toast. Explore the best mobile microphone setups for high-quality AI voice transcription.
#682: The Secret Power of Your Smartphone’s Tiny Microphones
Why does a phone mic outperform a pro headset for AI transcription? Herman and Corn dive into the physics of MEMS and the truth about audio quality.
#33: The Unseen Magic of AI's Ears: Decoding VAD
Ever wonder how your AI knows you're talking? We're diving deep into VAD, the unseen magic behind AI's ears.
#22: Mic Check: Mastering AI Dictation Hardware
Uncover the secrets to perfect AI dictation! Corn and Herman explore the ultimate speech-to-text hardware.
#26: Personalizing Whisper: The Voice Typing Revolution
Voice typing is changing everything. Join us as we explore the revolution of personalizing Whisper!
#15: AI Gets Personal: The Power of Voice Fine-Tuning
AI that understands *your* voice? Dive into the fascinating world of fine-tuning and discover how AI gets personal.
#9: Benchmarking Custom ASR Tools - Beyond The WER
Benchmarking custom ASR fine-tunes: We're diving deep beyond the WER to truly measure performance.
#7: Building Custom ASR Tools
Ever wondered how to build your own ASR tools from scratch? Discover the why and how in this episode!
#8: Building Your Own Whisper
Ever wondered if you could build your own speech recognition tool? We dive deep into crafting custom ASR.
#5: Fine-Tuning ASR For Maximal Usability
Fine-tuned ASR is just the start. Discover the next steps for deployment and maximizing usability.
#6: How To Fine Tune Whisper
Build your own AI transcription tool! We'll walk you through fine-tuning Whisper, from data to notebook.
#4: If Your Voice Ages, Does Your Fine-Tune Become Useless?
Your voice changes, but your fine-tuned model shouldn't become useless. We explore the biology of the larynx and ASR.
#2: Local STT For AMD GPU Owners
AMD GPU? No problem! Dive into local AI adventures like on-device speech to text.
#3: Safetensors or Something Else: STT Inference Formats Explained
Unpacking ASR weight formats: Safetensors and beyond. Tune in to understand the distinctions.