Daniel sent us this one. He's asking about AI voice agents, what's actually happening under the hood, and why so many of them still feel like you're talking to a robot that's buffering its personality. He points out that a lot of platforms, the Vapi-style tools, stitch together separate components (speech-to-text, an LLM, text-to-speech) with orchestration glue holding it all together. The result is cumulative latency, and all the prosody and emotion get lost the moment audio gets flattened into text and back. Then he contrasts that with the new class of natively integrated speech-to-speech models (OpenAI's Realtime API, Gemini Live, Kyutai's Moshi, Sesame, Cartesia, Hume), systems that process audio tokens end-to-end, skip the text bottleneck, and enable full-duplex conversation with sub-second response times. He wants us to walk through the architectural difference, why latency and naturalness diverge so sharply, where the pipeline tools sit on the spectrum, and what the integrated-model future looks like for phone agents, accessibility, and companion AI. There's a lot to unpack here.
There really is. And I think the place to start is that most people don't realize what they're actually hearing when they talk to one of these agents. They sense something is off — the pauses are slightly too long, the voice doesn't quite react to being interrupted, there's this uncanny flatness — but they can't name why. The "why" is exactly what Daniel's getting at. The pipeline approach literally destroys information at step one.
"Destroys" is a strong word.
It's the right word. When you speak, your voice carries an enormous amount of information beyond the words themselves — pitch contours, energy, rhythm, micro-pauses, breath patterns, emotional valence. A traditional automatic speech recognition engine takes all of that and reduces it to a transcript. Just the words. Everything else is discarded. Then an LLM processes the text, which is fine for reasoning but it has no access to how you said it. Then a TTS engine generates audio from the LLM's text output, and it has to guess at the prosody from scratch — or more commonly, it just doesn't. You get flat, consistent delivery regardless of context.
The text transcript is the bottleneck.
And it's not just a semantic bottleneck, it's a temporal one. Each stage in that pipeline adds latency. The ASR has to wait until you've finished speaking, or at least until a clean endpoint is detected. Then the LLM has to generate a full response. Then the TTS has to synthesize that entire response before playback begins. Even with streaming at each stage, you're looking at cumulative delays. Vapi published a deep dive on this — they were seeing median latency around one-point-two to one-point-five seconds end-to-end in their early pipeline, and that's with aggressive optimization. A natural human conversation has response latencies around two hundred milliseconds. You're an order of magnitude slower.
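[Show notes: here's roughly what that cumulative budget looks like as arithmetic. Every stage number below is an illustrative assumption, not a measurement from Vapi or anyone else; the point is just that plausible per-stage delays stack up into that buffering-robot territory.]

```python
# Illustrative latency budget for a cascaded ASR -> LLM -> TTS pipeline.
# All numbers are assumptions for illustration, not vendor measurements.

PIPELINE_STAGES_MS = {
    "endpoint_detection": 300,    # confirming the speaker actually finished
    "asr_final_transcript": 200,  # finalizing the transcript
    "llm_first_token": 400,       # time to first token from the language model
    "tts_first_audio": 150,       # time to first synthesized audio frame
    "network_and_buffering": 150, # transport, jitter buffers, codec overhead
}

total_ms = sum(PIPELINE_STAGES_MS.values())
print(f"cumulative time-to-first-audio: {total_ms} ms")  # 1200 ms
print("typical human turn-taking gap: ~200 ms")
```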
That one-point-two seconds is what you hear as the awkward pause where you wonder if the thing heard you.
And the uncanniness compounds because humans are exquisitely sensitive to conversational timing. We detect sub-two-hundred-millisecond deviations. When the pause is too long, your brain flags it as unnatural before you consciously register why. It's the same reason Zoom calls feel draining — even tiny latencies break the rhythm of turn-taking.
By the way, before we go deeper — DeepSeek V four Pro is writing our script today. So if anything comes out especially articulate, that's why.
Welcome aboard, DeepSeek. Alright, so let me walk through what the integrated models do differently. The fundamental architectural shift is that instead of converting audio to text and back, these models operate directly on audio tokens. They take in raw audio, encode it into a learned representation — sometimes called audio embeddings or acoustic tokens — and the model reasons across those tokens directly, then decodes back to audio. There's no intermediate text representation at all.
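[Show notes: a schematic sketch of that loop. The three helpers are hypothetical stand-ins for a real acoustic tokenizer, autoregressive model, and vocoder; the thing to notice is that no string of text ever appears anywhere in the flow.]

```python
# Schematic speech-to-speech loop: audio tokens in, audio tokens out.
# encode_frame, model_step, and decode_tokens are hypothetical placeholders.

from typing import Iterator, List

def encode_frame(pcm_frame: bytes) -> List[int]:
    """Stand-in acoustic tokenizer: raw audio -> discrete audio tokens."""
    return [len(pcm_frame) % 1024]

def model_step(context: List[int]) -> List[int]:
    """Stand-in autoregressive step over audio tokens (no transcript anywhere)."""
    return [(t + 1) % 1024 for t in context[-4:]]

def decode_tokens(tokens: List[int]) -> bytes:
    """Stand-in vocoder: audio tokens -> playable PCM."""
    return bytes(len(tokens))

def converse(mic_frames: Iterator[bytes]) -> Iterator[bytes]:
    context: List[int] = []
    for frame in mic_frames:
        context += encode_frame(frame)  # hear: user audio joins the context
        reply = model_step(context)     # think: predict the next audio tokens
        context += reply                # the model also hears itself
        yield decode_tokens(reply)      # speak: tokens back out as audio
```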
The model is thinking in sound.
That means prosody, emotion, speaker state — all of that is preserved in the representation. The model can hear that you're frustrated or hesitant or sarcastic, and it can respond with appropriate emotional coloring in its own voice. Sesame's CSM paper is really instructive here. They use a dual-tower architecture — one tower for semantic understanding, one for acoustic generation — and they're coupled through what they call a "conversational speech model" that handles the turn-taking dynamics explicitly. Their system can generate overlapping speech, backchanneling, all the things that make conversation feel alive.
Backchanneling being the "mm-hmm" and "yeah" stuff.
Little vocal nods that signal you're tracking. Pipeline systems almost never do this well because the ASR treats those as noise to filter out, and the LLM has no mechanism to generate them in real time. With a speech-to-speech model, the system can produce those micro-responses while you're still speaking, because it's processing audio continuously rather than waiting for a turn boundary.
Let's talk about the players in this space, because Daniel listed quite a few and they're not all doing the same thing.
OpenAI's Realtime API — that's gpt-four-o-realtime and the newer gpt-realtime — is probably the most visible. They've got this WebSocket-based interface where audio frames stream in, the model processes them natively, and audio frames stream out. They support function calling, which is critical for agent use cases where the model needs to actually do things — check a calendar, query a database, process a payment. The latency numbers they've published are aggressive. They're claiming median response times under three hundred milliseconds for simple utterances, though real-world performance varies with model load.
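[Show notes: a minimal sketch of that WebSocket flow in Python, using the event shapes OpenAI documented at launch (input_audio_buffer.append going in, response.audio.delta coming out). The protocol has kept evolving, so verify the names against current docs. Assumes pip install websockets.]

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets (v14+; older versions use extra_headers)

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

def play(pcm: bytes) -> None:
    """Stub: hand decoded PCM to your actual audio output device."""

async def talk(pcm16_chunks: list[bytes]) -> None:
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Stream microphone audio in; with server-side turn detection (the
        # default), the model decides on its own when to respond.
        for chunk in pcm16_chunks:
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode(),
            }))
        # Audio comes back incrementally as delta events.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.audio.delta":
                play(base64.b64decode(event["delta"]))

# asyncio.run(talk(chunks_from_your_microphone))
```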
Google's Gemini Live?
Similar idea, different implementation. Google's approach leverages their in-house TPU infrastructure and they've done some clever things with audio tokenization to reduce the token rate. The Gemini Live system uses what they call a "multimodal native" architecture — the same model handles text, images, and audio, which means the audio understanding benefits from the broader training. But the key difference from OpenAI is that Google has tighter integration with Android. Gemini Live is deeply embedded in the Android experience, which gives it lower client-side latency and better access to on-device context.
The platform advantage matters.
And that's one reason these things are hard to compare on raw benchmarks alone. A model might have three-hundred-millisecond server latency, but if the client stack adds another two hundred milliseconds of buffering and codec overhead, the user experience is still half a second. Integration quality matters as much as model quality.
What about Moshi? Kyutai's thing.
Moshi is fascinating because it's open. Kyutai — the French nonprofit lab — released Moshi in September twenty twenty-four, and they published the full architecture. It's a seven-billion-parameter model that processes audio in streaming chunks with a theoretical latency of around two hundred milliseconds. What's notable is they designed it explicitly for full-duplex operation — the model can listen and speak simultaneously, which is how humans actually converse. We don't wait for the other person to finish before formulating a response. We're processing their speech, predicting where they're going, and preparing our reply in parallel. Moshi does that.
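[Show notes: a toy of what "listen and speak simultaneously" means structurally. Everything here is a placeholder so the demo actually runs; the point is that ingestion and generation are two concurrent loops over one shared context, with no turn boundary anywhere in the control flow.]

```python
# Toy full-duplex loop: listening and speaking never block each other.
# Frames and tokens are faked with integers for the sake of a runnable demo.

import asyncio

context: list[int] = []  # shared rolling context of audio tokens

async def listen(mic_frames):
    """Ingest user audio continuously, even while the agent is talking."""
    async for frame in mic_frames:
        context.append(frame)  # stand-in for real acoustic tokenization

async def speak(play):
    """Emit agent audio continuously; silence is just another token."""
    for _ in range(5):  # bounded so the demo terminates
        context.append(-1)  # stand-in for a generated audio token
        await play(context[-1])

async def fake_mic():
    for frame in range(5):
        yield frame
        await asyncio.sleep(0.01)

async def fake_play(token):
    print("agent token:", token)
    await asyncio.sleep(0.01)

async def main():
    # Both directions run at once: there is no "your turn / my turn" here.
    await asyncio.gather(listen(fake_mic()), speak(fake_play))

asyncio.run(main())
```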
It's open-weight?
Weights, code, the whole thing. That matters for the ecosystem because it means researchers and smaller companies can build on it without licensing costs. The trade-off is that Moshi's raw conversational quality isn't quite at the level of the proprietary systems — it can sound a bit more synthetic — but the architectural innovations are real and influential.
Then there's Hume with EVI, which is coming at this from a different angle entirely.
Yeah, Hume is the interesting one philosophically. Their whole thing is emotional intelligence. EVI — the Empathic Voice Interface — is built on what they call an "empathic large language model" that's been fine-tuned on millions of human emotional expressions. They're not just processing the words; they're measuring vocal burst patterns, pitch variability, speech rate, and mapping those to emotional dimensions. The output is a voice that modulates its emotional expression in real time based on what it detects in the user. So if you sound frustrated, the agent's voice shifts to a calmer, more reassuring register. If you sound excited, it matches your energy.
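[Show notes: to make "mapping vocal features to emotional dimensions" concrete, here's a deliberately crude sketch. This is not Hume's method; real systems learn the mapping from data, while this just thresholds two hand-picked prosodic statistics.]

```python
# Crude prosody-to-register mapping, for illustration only.
# Requires: pip install numpy

import numpy as np

def choose_register(pitch_hz: np.ndarray, energy: np.ndarray) -> str:
    """Pick a response speaking style from simple prosodic statistics."""
    pitch_var = float(np.std(pitch_hz))   # high variance suggests arousal
    mean_energy = float(np.mean(energy))  # rough loudness proxy
    if pitch_var > 40 and mean_energy > 0.6:
        return "match_energy"        # excited user: mirror the enthusiasm
    if pitch_var > 40:
        return "calm_reassuring"     # agitated user: de-escalate
    return "neutral_warm"            # default register

# Synthetic feature tracks for demonstration:
print(choose_register(np.array([180.0, 260.0, 150.0, 240.0]),
                      np.array([0.7, 0.8])))  # -> match_energy
```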
Does that actually work, or is it a parlor trick?
I think it's somewhere in between. The emotional detection is genuinely impressive — their research papers show strong correlations with human annotator judgments across a wide range of emotional dimensions. The synthesis side is good but not perfect. Sometimes the emotional modulation can feel slightly theatrical, like an actor who's trying a bit too hard. But the direction is clearly right. If we want voice agents that people actually want to talk to, emotional attunement is non-negotiable.
Cartesia Sonic and ElevenLabs Conversational AI are more on the voice quality side, right?
Yes, and this is where the landscape gets interesting because they're approaching integration from the TTS end rather than the model end. Cartesia built their Sonic model on state-space models — SSMs — which are an alternative to transformers that offer much faster inference. They're claiming sub-hundred-millisecond time-to-first-audio for TTS, which is fast. ElevenLabs has been the gold standard for voice quality and they've been building out their Conversational AI product to handle the full stack, including turn-taking logic and interruption handling. But both of them are, at their core, still doing text-to-speech — just really, really good text-to-speech with minimal latency.
They're not truly speech-to-speech in the way Moshi or GPT Realtime are.
They still go through a text representation. The LLM generates text, their TTS synthesizes it. The difference is they've optimized that pipeline to the point where the latency is low enough that some of the uncanny-valley problems diminish. But they don't solve the information-loss problem. Prosody from the user's speech still gets discarded at the ASR stage. The emotional intelligence has to be added back in through separate analysis rather than being natively preserved.
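[Show notes: time-to-first-audio is easy to measure yourself against any streaming TTS endpoint. The URL and payload below are placeholders, not a real vendor API; adapt them to whichever service you're probing. Assumes pip install requests.]

```python
import time
import requests  # pip install requests

def time_to_first_audio(url: str, payload: dict, headers: dict) -> float:
    """Milliseconds from request start until the first audio bytes arrive."""
    start = time.perf_counter()
    with requests.post(url, json=payload, headers=headers, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1024):
            if chunk:  # first non-empty chunk stops the clock
                return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")

# Hypothetical usage:
# ms = time_to_first_audio("https://api.example-tts.com/v1/stream",
#                          {"text": "Hello there"},
#                          {"Authorization": "Bearer ..."})
```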
Let's get to the question I think is actually the most interesting one Daniel raised. Why do pipeline tools like Vapi, Retell, Bland still dominate production, given all these obvious seams?
This is the practical reality that the hype pieces miss. There are several reasons, and they're all interconnected. First: the integrated models are new. GPT Realtime only launched in October twenty twenty-four. Moshi was September twenty twenty-four. Sesame's CSM is even more recent. Meanwhile, Vapi and Retell have been in production for years. They have SDKs, documentation, enterprise SLAs, compliance certifications, debugging tooling, analytics dashboards. Companies don't rip out working infrastructure for architectural elegance.
The boring answer is always "it works and we know how to debug it."
Second reason: cost. The integrated speech-to-speech models are expensive to run. You're paying for a large multimodal model to process every audio frame in real time. Pipeline architectures let you mix and match — you can use a cheap ASR like Deepgram or Whisper, a mid-tier LLM, and a fast TTS. Vapi's blog actually talks about this explicitly — they've been optimizing their pipeline to get latency down while keeping costs manageable for high-volume use cases like call centers. When you're handling millions of calls a day, the per-minute cost difference between a pipeline and a native speech model is enormous.
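[Show notes: the per-minute rates below are made-up placeholders, but the arithmetic shows why this argument dominates at call-center volume.]

```python
# Back-of-envelope daily cost at high volume. Rates are hypothetical;
# substitute real vendor pricing before drawing any conclusions.

calls_per_day = 1_000_000
minutes_per_call = 3

rates_per_minute = {
    "pipeline (cheap ASR + mid LLM + fast TTS)": 0.01,
    "native speech-to-speech model":             0.10,
}

daily_minutes = calls_per_day * minutes_per_call
for name, rate in rates_per_minute.items():
    print(f"{name}: ${daily_minutes * rate:,.0f}/day")

# pipeline: $30,000/day vs native: $300,000/day, a 10x gap at these
# assumed rates, which dwarfs most naturalness arguments.
```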
Businesses care about cost more than whether the agent sounds slightly more human.
Right now, yes. Third reason: ecosystem lock-in and customization. With a pipeline, you can swap out any component. Want to use Claude instead of GPT for the reasoning layer? Want to use a custom fine-tuned TTS voice? Want to add a sentiment analysis module in parallel with the ASR? The integrated models are more monolithic — you get what the provider gives you. That's fine for demos, but production deployments almost always need customization.
The fourth reason?
Tool use and structured output. This is the big one. Voice agents in production aren't just chatting — they're doing things. Booking appointments, looking up account information, processing returns. They need structured outputs, API calls, database queries. The integrated speech-to-speech models are still figuring this out. OpenAI's Realtime API supports function calling, which is good, but the tooling around it is immature compared to the text-based LLM ecosystem. With a pipeline, you have the full power of the text LLM ecosystem — structured output parsing, tool definitions, validation, retry logic. You lose some of that when you go directly audio-to-audio.
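[Show notes: for concreteness, this is the general shape of a tool definition in the function-calling ecosystem; the Realtime API accepts a similar flat structure. Field names vary slightly by vendor, and book_appointment is our hypothetical example, not anyone's real API.]

```python
# A JSON-schema tool definition, the lingua franca of LLM tool use.
book_appointment_tool = {
    "type": "function",
    "name": "book_appointment",
    "description": "Book an appointment slot for the caller.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "slot": {"type": "string",
                     "description": "ISO 8601 start time, e.g. 2025-03-01T14:00"},
            "service": {"type": "string", "enum": ["consult", "follow_up"]},
        },
        "required": ["customer_id", "slot", "service"],
    },
}

# In a text pipeline you validate the returned arguments and retry on
# failure; with a native speech model you depend on the provider surfacing
# an equivalent function-call event mid-conversation.
```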
The pipeline isn't just legacy inertia — there are real structural advantages for certain use cases.
I think the honest assessment is that pipelines are better for transactional use cases where accuracy and tool integration matter more than conversational naturalness. Call center deflection, appointment scheduling, order status — you don't need emotional attunement for those. You need it to work reliably, cheaply, and to integrate with the CRM. Pipelines do that well. The integrated speech-to-speech models shine in open-ended conversational contexts — companion AI, tutoring, therapy-adjacent applications, creative collaboration — where the quality of the interaction itself is the product.
That's a useful distinction. Though I suspect the line blurs over time as the integrated models get cheaper and better at tool use.
That's exactly where this is heading. And Vapi knows this — they've already added GPT Realtime support to their platform. You can now choose between their optimized pipeline and the native Realtime model, depending on the use case. I think that's the near-term future: platforms that abstract over both approaches, routing to the right architecture based on the conversation context.
Let's talk about what happens when these integrated models are actually deployed at scale for phone agents. Daniel mentioned that specifically.
Phone agents are the hardest deployment environment, bar none. You've got codec compression — audio gets crushed down to eight kilohertz for telephony. You've got network jitter, packet loss, variable latency. You've got background noise, speakerphones, bad connections. And you've got the fact that people talk to phone agents differently — they're more impatient, more likely to interrupt, more likely to speak in fragments. The integrated models handle some of this better because they're trained on more naturalistic audio, but the telephony channel itself degrades the very prosodic information that speech-to-speech models are designed to preserve.
The phone network eats your advantage.
There's work on this — models being fine-tuned specifically for telephony audio, better codec handling in the tokenization layer — but it's an active research area, not a solved problem. And this is another reason pipelines persist: the ASR-LLM-TTS stack has been battle-tested against terrible phone audio for years. The integrated models are still catching up on robustness.
What about accessibility? That seems like a clearer win.
For accessibility applications — screen readers, communication aids for people with speech impairments, navigation assistance — the naturalness gains are transformative. A screen reader that can adjust its speaking rate and emotional tone based on context makes a massive difference in comprehension and user comfort. And these are often local, on-device applications where you don't have the telephony degradation problem. Apple's been investing heavily in on-device speech models, and I expect that to accelerate.
Companion AI is the other use case Daniel flagged, and that's the one that gets the most press.
Because it's the most sci-fi. And honestly, this is where the integrated models are most compelling. If you're building something that's supposed to feel like a conversational partner — whether that's a language tutor, a mental wellness coach, or just an AI friend — the pipeline approach is fundamentally crippled. You can't build emotional rapport through a text bottleneck. Sesame's demos are striking in this regard. Their model generates little laughs, hesitations, breathing sounds — things that would never survive an ASR-to-TTS round trip. It feels present in a way that even the best pipeline system doesn't.
There's something slightly unsettling about how good these are getting. The Sesame demo voice, the one that went viral — people had strong reactions.
Yeah, the Maya voice. It's intentionally designed to be warm and slightly intimate. And you're right, people had reactions ranging from "this is amazing" to "this makes me deeply uncomfortable." That's the uncanny valley in reverse. As these systems get more natural, they cross a threshold where your brain stops treating them as technology and starts applying social expectations. And when the system then does something slightly off — a pause that's a few hundred milliseconds too long, an emotional response that doesn't quite fit — it feels worse than a clearly robotic voice, because the violation of expectation is sharper.
The better it gets, the worse the failures feel.
And that's going to be a design challenge for years. How do you signal "I'm an AI" without making the interaction unpleasant? Some companies are leaning into it — giving their agents slightly stylized voices that are pleasant but clearly synthetic. Others are trying to cross the valley entirely and be indistinguishable. I'm not sure which approach wins.
Let's go back to something you mentioned earlier about Vapi's latency optimizations. What did they actually do to get their pipeline faster?
A few things. One, they moved to streaming everywhere: streaming ASR that emits partial transcripts, a streaming LLM that starts generating from those partials before the user's turn is fully transcribed, and streaming TTS that starts playback before the full response is synthesized. This is parallelism at the component level. Two, they did a lot of work on turn detection, using voice activity detection models that can spot the end of a speaker's turn faster and more accurately, which cuts the dead air between turns. Three, they optimized their WebSocket infrastructure to reduce network round trips. Their blog mentions getting median latency down to around eight hundred milliseconds in the optimized pipeline, which is good for a multi-component system.
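[Show notes: a runnable toy of that component-level parallelism. Queues stand in for the vendor streaming APIs; each stage consumes its upstream's partial output instead of waiting for completion, which is where the latency savings come from.]

```python
import asyncio

async def asr(mic_q, text_q):
    while (frame := await mic_q.get()) is not None:
        await text_q.put(f"partial[{frame}]")  # emit partial transcripts early
    await text_q.put(None)

async def llm(text_q, reply_q):
    while (partial := await text_q.get()) is not None:
        await reply_q.put(f"token({partial})")  # generate before the turn ends
    await reply_q.put(None)

async def tts(reply_q):
    while (token := await reply_q.get()) is not None:
        print("playing:", token)  # playback starts on the first token

async def main():
    mic_q, text_q, reply_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    for frame in ("a", "b", "c"):
        mic_q.put_nowait(frame)
    mic_q.put_nowait(None)  # end-of-speech marker
    await asyncio.gather(asr(mic_q, text_q), llm(text_q, reply_q), tts(reply_q))

asyncio.run(main())
```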
Eight hundred milliseconds versus the two hundred to three hundred that the integrated models claim.
And that five-hundred-millisecond difference is roughly the threshold where most people start noticing. Below about four hundred milliseconds, conversations feel natural. Above seven or eight hundred, they start feeling stilted. The integrated models are on the right side of that threshold; even optimized pipelines are on the wrong side.
What about barge-in handling? That's the other thing that makes these systems feel robotic — when you try to interrupt and it just keeps talking.
This is technically difficult in both architectures but for different reasons. In a pipeline, the ASR is typically configured to ignore input while the TTS is playing — otherwise you get feedback loops and garbled transcripts. So the system is literally deaf while it's speaking. Some platforms do echo cancellation to allow barge-in, but it's finicky. In a speech-to-speech model, the architecture naturally supports full-duplex — the model is always listening, always processing. But the challenge shifts to deciding when to yield. You don't want the model to interrupt itself every time you cough or say "uh-huh." Moshi handles this with an explicit turn-taking model that predicts conversational boundaries. It's not perfect.
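[Show notes: the yield decision can be made embarrassingly simple, and in many deployed systems it is. The toy policy below distinguishes backchannels from real interruptions purely by duration; the threshold is made up, and learned turn-taking models like Moshi's are doing something far richer.]

```python
# Toy barge-in policy: brief vocalizations are backchannels, sustained
# speech takes the floor. The 400 ms cutoff is an illustrative assumption.

BACKCHANNEL_MAX_MS = 400  # "mm-hmm" / "yeah" usually fit under this

def should_yield(user_speech_ms: float, agent_is_speaking: bool) -> bool:
    if not agent_is_speaking:
        return False  # nothing to interrupt
    return user_speech_ms > BACKCHANNEL_MAX_MS

assert not should_yield(250, agent_is_speaking=True)  # backchannel: keep going
assert should_yield(900, agent_is_speaking=True)      # interruption: stop talking
```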
We've covered the architecture, the players, the production reality. What's the forward look? Where is this in two or three years?
I think we see convergence on a few fronts. One, the integrated models get dramatically cheaper — we're already seeing price drops from OpenAI on Realtime API. As inference hardware improves and model distillation techniques mature, the cost argument for pipelines weakens. Two, tool use gets solved. The speech-to-speech models will either gain native structured output capabilities or we'll see hybrid architectures where the audio model handles conversation and hands off to a text model for tool execution, but with shared representations so prosody is preserved.
Not fully end-to-end but with a much thinner text bridge.
Three, on-device models become viable. Running a seven-billion-parameter speech model locally on a phone is not crazy; we're already seeing small language models run on-device. When the speech model runs locally, the network round trip disappears, latency drops toward the model's raw inference time, and most of the privacy concerns evaporate. That's huge for accessibility and companion applications. Four, I think we'll see specialization. Different voices, different personalities, different conversational styles optimized for different contexts. Your banking agent should not sound like your AI companion. The emotional register should match the use case.
The pipeline platforms?
They evolve into orchestration layers. Vapi and Retell aren't going anywhere — they're building the infrastructure that makes these models deployable. Call routing, telephony integration, compliance, analytics, A-B testing. That stuff is hard and valuable regardless of whether the underlying model is a pipeline or speech-to-speech. I think they become model-agnostic platforms that let you mix and match based on the use case.
One thing we haven't touched on is evaluation. How do you even measure whether one of these systems is good?
This is a hard problem. Latency you can measure. Word error rate you can measure. But "naturalness" and "engagement" and "emotional appropriateness" are subjective. The industry is converging on a mix of automated metrics and human evaluation — mean opinion score studies where raters judge conversational quality on multiple dimensions. Sesame's paper uses something called the "Conversational Naturalness Score." Hume has their own emotional alignment metrics. But there's no standard benchmark yet, which makes it hard to compare systems apples-to-apples.
Which means the marketing claims are basically unfalsifiable.
For now, yes. "Sub-two-hundred-millisecond latency" is a number. "Most human-like voice AI" is a vibe. And until we have agreed-upon evaluation frameworks, the vibe claims will dominate. That's frustrating if you're trying to make engineering decisions, but it's where we are.
Alright, so if someone's building with this stuff today — Daniel's a developer, he's in this space — what's the actual decision framework?
I'd say it comes down to three questions. One, is conversational quality core to your product or is it nice-to-have? If you're building a companion or a tutor, pay the premium for a speech-to-speech model. If you're building an appointment scheduler, a pipeline is fine. Two, what's your volume and budget? At a thousand calls a day, the cost difference might not matter. At a million calls a day, it absolutely does. Three, how much tool integration do you need? If your agent needs to make complex API calls with structured parameters, the text-based LLM ecosystem is still more mature. Factor that into your architecture.
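[Show notes: those three questions, folded into a toy routing function. The cutoffs are arbitrary illustrations of where the trade-offs tend to flip today, not recommendations.]

```python
def pick_architecture(quality_is_core: bool,
                      calls_per_day: int,
                      heavy_tool_use: bool) -> str:
    """Toy decision rule reflecting the three questions above."""
    if heavy_tool_use:
        return "pipeline"          # text-LLM tooling is still more mature
    if quality_is_core and calls_per_day < 100_000:
        return "speech_to_speech"  # pay the premium where quality is the product
    return "pipeline"              # cost wins at volume / transactional work

print(pick_architecture(True, 1_000, False))      # -> speech_to_speech
print(pick_architecture(False, 1_000_000, True))  # -> pipeline
```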
Probably keep an eye on the convergence timeline, because those trade-offs are shrinking.
I'd say within eighteen months, the cost and tool-use gaps narrow to the point where speech-to-speech is the default for most new deployments, and pipelines become a legacy compatibility option. The direction of travel is clear.
Now: Hilbert's daily fun fact.
In eighteen thirty-two, the HMS Beagle — the ship that would later carry Charles Darwin — was nearly rejected by the British Admiralty because its captain, Robert FitzRoy, was concerned that the ship's previous captain had committed suicide on board and he feared the vessel was cursed.
For listeners who want to actually do something with this — if you're evaluating voice AI for a project, start by testing both architectures with your actual use case. Don't trust demos. Record real user interactions, run them through both a pipeline system and a speech-to-speech system, and compare not just latency numbers but user satisfaction. The qualitative difference might matter more or less than you expect depending on your domain. If you're on a budget, look at Moshi — it's open, it's free to experiment with, and it'll give you a real sense of what speech-to-speech feels like without vendor lock-in. If you're deploying at scale, talk to the platforms about their roadmaps. Vapi, Retell, and Bland are all adding native model support, and the pricing models are shifting fast. Don't optimize for today's cost structure.
If you're just curious as a listener, try the Sesame demo if it's still up. It's the best showcase of what speech-to-speech feels like in practice. You'll notice the difference within ten seconds of talking to it — the responsiveness, the little vocal gestures, the way it handles being interrupted. It's the kind of thing that's hard to describe but immediately obvious when you experience it.
The broader question I keep coming back to is what happens when these systems are good enough that people prefer talking to them over filling out a form or tapping through an app. That's a real interface shift. Voice has been the "next big thing" for twenty years, but the technology was never good enough to justify the friction. That's changing.
It changes the economics of whole industries. Customer support, telehealth, education, eldercare — any domain where human conversation is the primary interface but human labor is the primary cost. We're not there yet, but we're closer than most people realize.
Thanks to our producer Hilbert Flumingtop for making this happen. This has been My Weird Prompts. You can find every episode at myweirdprompts.com or wherever you get your podcasts. If you enjoyed this, leaving a review helps more than you'd think. We'll be back soon.