Daniel sent us this one — he's building a voice app for transcription and he's got his son Ezra in the background, which means screaming, chaos, the full toddler soundtrack. He's asking about background noise removal frameworks, specifically the spectrum from heavyweight machine learning to lightweight options, whether his Irish accent matters for how these models perform, and which noise types are actually the hardest to deal with. Then he's got this split use case — transcription accuracy versus podcast-quality audio where it needs to sound natural, not like a robot in a tin can. And the real question underneath all of it: can you get decent removal that sounds good without burning a mountain of compute?
This is one of those problems where the obvious answer is wrong. Most people building a voice pipeline think step one is clean the audio, step two is transcribe it. But there's this finding from Deepgram last year that I cannot stop thinking about — they tested forty different configurations, four ASR models across ten noise conditions, and in those tests, running noise suppression before transcription made the word error rate worse by anywhere from one percent to forty-six percent.
So you take noisy audio, you clean it up, and the transcription gets less accurate?
That's exactly what they found. They called it the noise reduction paradox. Modern end-to-end neural ASR models like their Nova-three system are trained on raw noisy audio, and they learn to use subtle acoustic cues that noise suppression strips away. The cleaning algorithm doesn't know which frequencies matter for speech recognition — it just knows which ones look like noise. So you end up throwing out signal the ASR model was actually using.
Daniel's whole premise of "I need to clean this before I transcribe it" might be backwards.
For transcription specifically, yes. If he's using a modern ASR model, the best pipeline might be feed it the raw audio and let the model sort it out. Now, that's for transcription. The podcast use case is completely different — nobody wants to listen to a screaming toddler in their earbuds. But for pure voice-to-text accuracy, noise reduction can be actively harmful.
That's a pretty fundamental split. Most people building something like this would assume one clean audio path serves both purposes.
That assumption is exactly what trips people up. Let me walk through the algorithm landscape, because the choice of framework depends entirely on which path you're optimizing for. At the lightweight end, you've got traditional digital signal processing — spectral subtraction, Wiener filtering. These are computationally trivial, they've been around for decades, but they introduce what's called musical noise artifacts. It's this warbly, watery sound that's actually more distracting than the original background hum. Speex and WebRTC's built-in noise suppression live in this category, and neither is actively maintained anymore.
Musical noise — is that the thing where silence suddenly sounds like a ghost whistling?
Pretty much. Spectral subtraction works by estimating the noise profile during silent segments, then subtracting that frequency profile from the speech segments. But noise isn't perfectly stationary, so your estimate is always slightly wrong, and those errors manifest as random tonal blips. It sounds artificial in a way that human ears find really grating.
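To make the mechanism concrete, here's a rough Python sketch of textbook magnitude spectral subtraction. It isn't any particular library's implementation, and the half-second noise window is just illustrative: estimate a noise spectrum from a stretch you assume is silence, subtract it from every frame, and clamp at zero.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(audio, sr, noise_seconds=0.5):
    """Textbook magnitude spectral subtraction (sketch, not production code)."""
    f, t, spec = stft(audio, fs=sr, nperseg=512)
    mag, phase = np.abs(spec), np.angle(spec)

    # Estimate the noise profile from the first noise_seconds of frames,
    # which we assume are silence (hop size is nperseg // 2 = 256 samples).
    n_noise_frames = max(1, int(noise_seconds * sr / 256))
    noise_profile = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)

    # Subtract the profile everywhere and clamp at zero. The estimation
    # errors this clamping leaves behind are the "musical noise" blips.
    cleaned_mag = np.maximum(mag - noise_profile, 0.0)

    _, cleaned = istft(cleaned_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return cleaned
```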
Which makes it a non-starter for the podcast path. What about the machine learning heavyweights Daniel mentioned?
The full deep learning approach — Krisp, NVIDIA Audio Effects, Picovoice Koala — these use large neural networks trained on millions of noise samples. They handle the widest range of noise types, including the really hard stuff like babble noise, which is multiple overlapping conversations. But they're computationally expensive. Krisp's SDK requires filling out a sales form just to access it. NVIDIA's solution needs a GPU with Tensor Cores. If Daniel's running this on a phone or a standard server without a GPU, those might be impractical.
There's a middle ground?
This is where it gets interesting. There's a hybrid approach that combines traditional DSP with a tiny neural network, and it punches way above its weight class. The best example is something called PercepNet, which Amazon developed for their Chime video conferencing. It placed second in the Interspeech twenty twenty Deep Noise Suppression Challenge in the real-time track. And here's the wild number: it runs on less than five percent of a single CPU core — they tested on an Intel i7 eight-five-six-five-U at one point eight gigahertz. Less than five percent.
Five percent of one core? That's basically free.
It's absurdly efficient. The philosophy behind PercepNet is do as little as possible with the deep neural network. It uses traditional DSP for pitch tracking and periodic speech components — the parts of your voice that are predictable and mathematically well-understood. The neural network only handles the unvoiced stuff, the fricatives and transients where DSP alone falls apart. By constraining the DNN to a tiny role, they avoid most of the artifacts that make processed audio sound robotic.
That machine-like timbre Daniel mentioned.
The more processing you do, the more you risk stripping away the natural variation in someone's voice. PercepNet's design is philosophically interesting — the best-sounding noise removal might be the one that does the least processing. It's not about throwing more neural network at the problem. It's about being surgical.
For the podcast path, PercepNet sounds like the sweet spot — decent quality, minimal compute, natural-sounding output. What about the open source side? Daniel's an open source developer, he's going to want something he can actually inspect and modify.
RNNoise is the classic answer. Mozilla built it, it uses a gated recurrent unit combined with classic DSP, and it runs at about ten milliseconds of latency on CPU with minimal resources. The problem is it's unmaintained now, and it degrades pretty badly on babble noise and music. It was trained on specific noise types, and if your noise doesn't match the training distribution, performance falls off a cliff.
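For a sense of how small the integration surface is, here's a rough ctypes sketch of calling the RNNoise C library from Python. The shared-library path is an assumption, and the exact input scaling is worth checking against the RNNoise headers for whichever build he uses; the library expects 10-millisecond frames of 48 kilohertz mono audio.

```python
import ctypes
import numpy as np

# Assumed path; build librnnoise from source for your platform.
lib = ctypes.CDLL("librnnoise.so")
lib.rnnoise_create.restype = ctypes.c_void_p
lib.rnnoise_create.argtypes = [ctypes.c_void_p]
lib.rnnoise_destroy.argtypes = [ctypes.c_void_p]
lib.rnnoise_process_frame.restype = ctypes.c_float
lib.rnnoise_process_frame.argtypes = [
    ctypes.c_void_p,
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
]

FRAME = 480  # 10 ms at 48 kHz

def denoise(samples_48k: np.ndarray) -> np.ndarray:
    """samples_48k: float32 mono, scaled roughly like 16-bit PCM values."""
    state = lib.rnnoise_create(None)
    out = np.copy(samples_48k).astype(np.float32)
    buf = np.empty(FRAME, dtype=np.float32)
    for i in range(0, len(samples_48k) - FRAME + 1, FRAME):
        frame = np.ascontiguousarray(samples_48k[i:i + FRAME], dtype=np.float32)
        # Writes the denoised frame into buf; the return value is a
        # voice-activity probability for the frame.
        lib.rnnoise_process_frame(
            state,
            buf.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            frame.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        )
        out[i:i + FRAME] = buf
    lib.rnnoise_destroy(state)
    return out
```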
Ezra screaming is definitely not in the training distribution. That's a specialized noise profile if I've ever heard one.
That brings us to the noise type question, which I think is the most underappreciated part of this whole problem. Not all background noise is created equal. Stationary noise — fan hum, HVAC, electrical buzz — that's easy. Traditional DSP handles it beautifully because the noise profile is constant. Transient noises like door slams, dog barks, car horns — those are harder because they're sudden and brief, so the model needs temporal context to distinguish them from speech onsets.
Ezra screaming, restaurant chatter, other conversations — those are a fundamentally different category.
That's babble noise, also known as the cocktail party problem. It's competing speech in the same frequency range as the target voice, with identical spectral characteristics. To a DSP algorithm, another person talking looks exactly like the person you want to hear. Even lightweight neural models struggle because they don't have enough capacity to separate two voices in the same frequency band. This is where full deep learning systems have the biggest advantage — but even they're limited by the single-channel problem.
Single-channel meaning one microphone?
Humans solve the cocktail party problem partly because we have two ears — binaural hearing gives us spatial cues that help separate sound sources. With a single microphone, you lose all spatial information. You're asking an algorithm to unmix audio that's been summed into one waveform, which is mathematically underdetermined. There's no unique solution.
If Daniel's recording voice memos on his phone while holding Ezra, he's dealing with the hardest noise type through the hardest channel configuration, and he wants it to sound good with minimal compute.
With an Irish accent, which adds another wrinkle. There's almost no published research on whether noise suppression models perform differently across accents. ASR accent bias is well-documented — non-native speakers, children, dialect speakers all get higher word error rates. But noise suppression models? The training data for most of these systems is overwhelmingly American English. An Irish accent has different spectral characteristics, different vowel placements, different intonation patterns. If the model was trained to recognize "speech" based on American English features, it might misclassify accented phonemes as noise, or vice versa.
Daniel's voice itself might be partially outside the model's comfort zone. That's a genuine gap — not just for him, but for voice app developers generally. If nobody's testing accent bias in noise suppression, there's a whole class of users getting worse performance and nobody's measuring it.
The fundamental frequency difference matters too. Male voices typically sit between eighty-five and one hundred eighty hertz. Female voices are higher, one sixty-five to two fifty-five. Male speech energy concentrates in lower frequency bands, which is exactly where traffic rumble and HVAC hum live. So for male speakers, low-frequency noise suppression is more critical because the noise and the voice are competing for the same spectral real estate.
Which means Daniel's specific combination — male, Irish accent, single microphone, babble noise from a toddler — is basically the perfect storm for noise suppression. Every challenging factor converges on his use case.
Yet people do this every day. Podcasters record in untreated rooms, journalists file from chaotic environments, parents dictate notes while holding babies. The tools exist, they just need to be matched to the use case carefully. Let me lay out what I think the actual pipeline should look like for Daniel's two paths.
Go for it.
For transcription, step one is try nothing. Feed the raw audio directly to a modern ASR model — Deepgram Nova-three, Whisper, whatever he's using — and measure the word error rate. If the ASR handles the noise natively, you're done. No compute spent, no artifacts introduced, no accent bias from a separate preprocessing step. If the raw audio error rate is too high, then add a lightweight hybrid like RNNoise or PercepNet, but only after confirming it actually improves accuracy on his specific voice and noise conditions.
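To make the "measure first" step concrete, here's a minimal sketch of that comparison using the jiwer library for word error rate. The transcribe and denoise arguments are placeholders for whatever ASR and suppressor he actually wires in.

```python
from jiwer import wer  # pip install jiwer

def compare_pipelines(audio_path, reference_transcript, transcribe, denoise):
    """Compare WER on raw audio versus denoised audio for the same clip."""
    raw_wer = wer(reference_transcript, transcribe(audio_path))
    cleaned_wer = wer(reference_transcript, transcribe(denoise(audio_path)))

    print(f"raw WER:     {raw_wer:.3f}")
    print(f"cleaned WER: {cleaned_wer:.3f}")

    # If raw wins or ties, skip the suppressor on the transcription path.
    return "raw" if raw_wer <= cleaned_wer else "cleaned"
```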
Test before you add complexity.
You'd be shocked how many pipelines skip that step. For the podcast path, the priority flips. You need noise suppression, but you need it to preserve natural timbre. PercepNet is the standout if he can implement it — the Amazon Chime SDK exposes it, and there are open source reimplementations floating around. If that's not accessible, RNNoise with conservative settings, or one of the newer ultra-lightweight models like DeepFilterNet two or DTLN. Datadog apparently found DTLN more effective than RNNoise at similar efficiency.
The heavy options — Krisp, NVIDIA — those are for when latency doesn't matter and you can throw a GPU at the problem. Post-production cleanup rather than real-time.
The architecture I'd recommend is two separate audio paths. One raw feed going to the ASR for transcription, one cleaned feed going to the podcast output. They have different goals, they should have different processing.
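In code the fork is almost nothing, which is part of the appeal. A minimal sketch, with transcribe and suppress_noise standing in for whichever ASR and suppressor end up in the pipeline:

```python
def process_recording(raw_audio, transcribe, suppress_noise):
    """Fork one capture buffer into two independent paths."""
    transcript = transcribe(raw_audio)           # ASR path: no preprocessing
    podcast_audio = suppress_noise(raw_audio)    # listening path: cleaned
    return transcript, podcast_audio
```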
That feels like the core engineering insight here. Don't try to find one setting that serves both masters, because the optimization targets are contradictory. What helps transcription can hurt listenability and vice versa.
There's a deeper point about how we think about noise. The term "noise" implies something to be removed, but for a neural ASR model, ambient sound is context. The model learns that certain acoustic patterns correlate with certain phonemes even in noisy conditions. When you preprocess that away, you're not helping the model — you're depriving it of information it was trained to use.
It's like if you hired a translator who's fluent in a dialect, but before you let them listen, you ran the audio through a filter that strips out dialect features because you decided they're "noise."
That's a perfect analogy. The ASR model is the translator, and it's been trained on messy real-world audio. Let it do its job.
For Daniel's specific situation — he's holding Ezra, dictating into his phone, wants both a clean transcription and maybe usable podcast audio. What's the practical recommendation? Step by step.
First, pick an ASR model that's designed for noisy audio. Deepgram Nova-three explicitly markets this capability. Whisper large models also handle noise reasonably well. Test the raw audio transcription quality before doing anything else.
If raw audio transcription fails?
Then add RNNoise or PercepNet at a low aggressiveness setting. Not to make the audio pristine, just to knock down the worst of the noise. The goal is to help the ASR, not to produce listening-quality audio. Over-suppression is the enemy.
For the podcast path?
Separate processing chain. If he's recording on a phone, the built-in noise suppression on modern devices is actually pretty good — Apple's Voice Isolation on iPhones uses a neural engine and produces surprisingly natural results. For post-production, iZotope RX has a dialogue denoiser that's industry standard for a reason. Adobe Enhance Speech is free and produces good results, though you can hear the processing if you listen closely.
If he wants something he can integrate programmatically into his own app?
That's where PercepNet or an RNNoise derivative makes sense. There's an emerging model called EchoFree that uses only two hundred seventy-eight thousand parameters and thirty million multiply-accumulate operations — that's tiny. It combines linear filtering with a neural post-filter on Bark-scale features. The paper's on arXiv. DeepFilterNet two uses frequency-dependent masking for high frequencies and deep filtering for low frequencies. These are all in the sweet spot of decent quality, low compute, open source.
What about the screaming specifically? Ezra screaming is not the same as office babble or traffic hum.
Infant vocalizations are actually a distinct acoustic category. They're high-pitched, they have irregular harmonic structure, and they're extremely broadband — they cover a wide frequency range. Most noise suppression models aren't trained on infant screams. Daniel might need to fine-tune a model on his own data if he wants optimal performance. Record a few minutes of Ezra making noise, a few minutes of silence, a few minutes of his own voice, and use that to adapt a lightweight model.
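A sketch of what that data preparation could look like: mix clean recordings of his own voice with recordings of Ezra at random signal-to-noise ratios, which gives the noisy-and-clean pairs most lightweight suppressors are trained on. The file names are hypothetical, and in practice he'd loop over many clean clips rather than one.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR (mono assumed)."""
    noise = np.resize(noise, clean.shape)  # loop/trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean, sr = sf.read("daniel_voice.wav", dtype="float32")
noise, _ = sf.read("ezra_screaming.wav", dtype="float32")

rng = np.random.default_rng(0)
for i in range(100):
    snr = rng.uniform(0, 15)  # toddlers are loud, so favour low SNRs
    sf.write(f"train/noisy_{i:03d}.wav", mix_at_snr(clean, noise, snr), sr)
    sf.write(f"train/clean_{i:03d}.wav", clean, sr)
```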
Fine-tuning a noise suppression model on your own toddler. That's either absurdly over-engineered or exactly the kind of thing a parent building a voice app would do.
It's the difference between a generic solution and one that actually works for your specific acoustic environment. Most noise suppression failure modes come from distribution mismatch between training data and deployment conditions. If your deployment condition includes a specific child's scream profile, training on that data is the most direct way to close the gap.
Let me pull on the accent thread a bit more. You said there's no published research on accent bias in noise suppression. What would that bias actually look like if it exists?
The most likely failure mode is that the voice activity detector — the component that decides whether each audio frame is speech or silence — misclassifies accented phonemes as noise. If the model was trained primarily on American English, it learns that certain spectral patterns mean "speech." An Irish accent shifts vowel formants, changes consonant articulation, alters the temporal envelope of speech. If those shifted patterns don't match the model's internal definition of "speech," it might attenuate them.
Daniel says something in his natural accent, and the noise suppressor quietly decides part of his voice is background babble and filters it out.
Which would produce exactly the machine-like timbre he's worried about. It's not that the model is adding distortion — it's that it's removing parts of the actual voice signal. The voice sounds thin or processed because it literally is thinner than the original.
This would affect both transcription and podcast quality, just in different ways. For transcription, missing phonemes means wrong words. For podcast, it means that robotic quality.
The fix is the same in both cases: test with your actual voice. Daniel should record the same sentence in quiet conditions and with Ezra screaming, run both through his pipeline, and compare. If the cleaned noisy version sounds noticeably different from the quiet version — not just noisier, but different in timbre — the suppression is eating his voice.
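If he wants a number rather than an impression, a quick-and-dirty version of that test is to compare the long-term average spectrum of the two cleaned recordings. A sketch, assuming both files share a sample rate and were recorded at roughly the same level; the file names are placeholders.

```python
import numpy as np
import soundfile as sf
from scipy.signal import welch

def long_term_spectrum(path):
    audio, sr = sf.read(path, dtype="float32")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # fold to mono
    freqs, psd = welch(audio, fs=sr, nperseg=2048)
    return freqs, 10 * np.log10(psd + 1e-12)

freqs, quiet_db = long_term_spectrum("cleaned_quiet.wav")
_, noisy_db = long_term_spectrum("cleaned_noisy.wav")

# Big losses inside the speech band suggest the suppressor is removing
# voice, not just background noise.
speech_band = (freqs >= 100) & (freqs <= 4000)
diff = quiet_db[speech_band] - noisy_db[speech_band]
print(f"mean level lost in speech band: {diff.mean():.1f} dB")
print(f"worst single-bin loss:          {diff.max():.1f} dB")
```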
That's a dead simple test. No fancy equipment needed.
It catches problems that generic benchmarks miss. Industry benchmarks like the Deep Noise Suppression Challenge use standardized test sets with American English speakers. They tell you how well a model works on average, not how well it works on a specific Irish male voice with a specific toddler in the background.
Which brings us back to Daniel's core question. Can you get decent removal and decent sounding audio that isn't vastly computationally expensive? I think the answer is a qualified yes, but it depends on what "decent" means for your specific setup.
If decent means studio-quality output, no. That requires heavy processing and probably manual cleanup. But if decent means listenable, natural-sounding, clearly better than raw audio, then yes — PercepNet-level quality at under five percent CPU is absolutely achievable. The key is accepting that you're not trying to eliminate noise, you're trying to reduce it enough that it stops being distracting while preserving the voice characteristics that make the audio sound human.
For transcription, the counterintuitive answer is that "decent" might mean doing nothing at all. Let the ASR handle the noise.
Which is the hardest thing to explain to a product manager or a client. "We're not going to clean the audio" sounds like negligence. But the data supports it.
There's a broader principle here about machine learning pipelines in general. When you chain models together, each step optimizes for its own objective, and those objectives might conflict. The noise suppressor optimizes for signal-to-noise ratio. The ASR optimizes for word error rate. Those aren't the same thing.
The conflict is measurable. That Deepgram study found semantic word error rate increases of up to forty-six percent when adding speech enhancement preprocessing. Forty-six percent worse. That's not a rounding error, that's breaking the system.
Daniel's pipeline design question isn't really "which noise suppressor should I use" — it's "should I use a noise suppressor at all for the transcription path, and if so, how do I validate that it's actually helping."
For the podcast path, the question is "which suppressor preserves the most natural voice timbre while being cheap enough to run in my target environment." Different question, different answer, different evaluation criteria.
Let's talk about the newer lightweight models for a minute. You mentioned EchoFree and DeepFilterNet two and DTLN. Are these actually usable today, or are they research projects that Daniel would have to implement from scratch?
DTLN has a Rust implementation — DTLN-rs — that's being used in production. DeepFilterNet two has a Python implementation with pretrained weights available. EchoFree is newer, the paper's from mid-twenty-twenty-five, but the architecture is simple enough that a competent developer could implement it. The trend in this space is toward smaller models that run on-device, partly for latency and partly for privacy — nobody wants their voice data going to a cloud service for noise removal.
Privacy is another dimension Daniel probably cares about. If the noise suppression runs locally, the audio never leaves the device before transcription.
For a parent recording voice memos at home, that matters. You don't want recordings of your child being processed on someone else's server.
Local processing, lightweight models, accent-aware testing, separate paths for transcription and podcast. That's a coherent architecture.
The one thing I haven't mentioned is voice activity detection, which is a separate but related problem. If Daniel's app needs to know when he's speaking versus when Ezra is screaming — for turn-taking in a voice interface, or for segmenting recordings — that requires a different kind of model. Noise suppression and VAD are often bundled together, but they're distinct tasks.
VAD is probably harder in his environment too, because a screaming toddler triggers the same acoustic features as speech.
Babble noise and infant vocalizations are particularly challenging for VAD because they have speech-like characteristics. Traditional energy-based VAD fails completely — a scream is high-energy. Machine learning VAD does better but still struggles with the speech-versus-speech discrimination problem.
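Here's a toy energy-threshold VAD just to show the failure mode: a scream carries plenty of frame energy, so it clears the threshold exactly the way speech does. The threshold value is arbitrary.

```python
import numpy as np

def energy_vad(audio, sr, frame_ms=30, threshold_db=-35):
    """Flag frames whose RMS level exceeds a fixed threshold as 'speech'."""
    frame = int(sr * frame_ms / 1000)
    flags = []
    for i in range(0, len(audio) - frame + 1, frame):
        rms = np.sqrt(np.mean(audio[i:i + frame] ** 2) + 1e-12)
        flags.append(20 * np.log10(rms) > threshold_db)
    # True means "speech"... or a door slam, or Ezra at full volume.
    return np.array(flags)
```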
Daniel didn't ask about VAD specifically, but it's worth flagging. If his voice app does anything beyond passive transcription, he's going to hit that problem too.
The solution space is similar — test with real data, don't assume off-the-shelf models work out of the box, and be prepared to fine-tune.
If we were to summarize the algorithm landscape Daniel should care about, what's the shortlist?
For lightweight hybrid approaches, PercepNet is the gold standard for quality-per-compute. RNNoise is the fallback if he needs something fully open source and doesn't mind that it's unmaintained. For the podcast post-production path, iZotope RX if he has budget, Adobe Enhance Speech if he wants free and good enough. For the transcription path, his ASR model's native noise handling is the first thing to try. And if he wants to experiment with the cutting edge, DTLN-rs and DeepFilterNet two are worth benchmarking against his own voice.
The heavy deep learning options — Krisp, NVIDIA — are for when none of the lightweight approaches work and he has the compute budget to spare.
Or for offline batch processing. If he records a podcast episode and has time to run it through a GPU-accelerated denoiser before publishing, the heavy options make sense. They're just not real-time on a phone.
One thing we haven't touched on — you mentioned the "noise reduction paradox" is from Deepgram's research. Are there independent replications of that finding, or is it one company's study?
There's a separate paper on arXiv from late twenty-twenty-five that did a systematic study of speech enhancement effects on modern ASR, and they found similar results across multiple ASR models. It's not just Deepgram — it's a general property of end-to-end neural ASR systems. The models learn to use noisy features, and preprocessing removes those features.
The paradox is real and replicated. Good to know.
It's one of those findings that changes how you think about pipeline design. The old model was clean-then-process. The new model is process-then-clean-if-needed. It's a genuine paradigm shift.
It's the kind of thing that sounds wrong until you understand the mechanism, which makes it hard to explain to stakeholders. "Trust me, the noisy audio works better" is not an intuitive sell.
Which is why having the data matters. Daniel should run the experiment himself — raw audio word error rate versus cleaned audio word error rate on his own voice and noise conditions. If the raw audio wins, he has his evidence. If the cleaned audio wins, he has his answer. Either way, he's not guessing.
And now: Hilbert's daily fun fact.
The national animal of Scotland is the unicorn. It has been since the twelve hundreds, when William the First adopted it as a symbol on the Scottish royal coat of arms. Scotland is one of the few countries whose national animal is a mythological creature.
For Daniel's practical next steps, here's what I'd recommend. First, benchmark his ASR model on raw noisy audio with his own voice and Ezra's actual background noise. Don't use simulated noise, don't use generic test sets. Record real dictation sessions and measure the word error rate. That's his baseline.
Second, if the baseline is acceptable, ship it. The simplest pipeline that works is the best pipeline.
Third, for the podcast audio path, test PercepNet or RNNoise at low to medium aggressiveness. Listen critically for timbre changes, not just noise reduction. If his voice sounds different — thinner, more metallic, less natural — reduce the aggressiveness or try a different model.
Fourth, if he's getting that machine-like timbre, check whether the voice activity detector is misclassifying his accented speech. Record in quiet, record in noise, compare the cleaned versions. If the cleaned noisy recording sounds fundamentally different from the cleaned quiet recording, the suppressor is eating his voice.
Fifth, accept that with a screaming toddler in the room, perfect is not on the table. The goal is good enough — transcription that's accurate, audio that's listenable. Chasing the last ten percent of noise removal often costs more in compute and artifacts than it's worth.
There's something almost philosophical about that. Noise is part of the recording because Ezra is part of Daniel's life. The goal isn't to pretend the toddler doesn't exist — it's to communicate clearly despite the chaos.
That's the engineering mindset. Work with the constraints you have, not the constraints you wish you had.
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop, and thanks to DeepSeek V four Pro for generating today's script. If you enjoyed this episode, leave us a review wherever you listen — it genuinely helps other people find the show.
We'll be back next time with another prompt from Daniel.