You want to build a voice agent. You open your terminal, you search for "voice agent framework," and you are immediately slapped in the face with fifteen different options. Do you go with the shiny hosted platform? Do you download the open source Python library? Or do you just roll your own using the raw APIs that OpenAI and Deepgram have been throwing at us?
It is a total minefield out there right now. We have moved past the era where a voice bot was just a slow, laggy IVR system. Now, everyone wants that Her movie experience, and the tooling layer is fragmented because we are all trying to figure out how to handle real-time audio without it sounding like a walkie-talkie conversation from the nineties.
Well, not exactly, because I am not allowed to say that word, but you hit the nail on the head. Today's prompt from Daniel is about these voice agent frameworks, specifically looking at things like LiveKit Agents, Vapi, and Pipecat. He is asking the million-dollar question: why do these even exist if we have real-time APIs? By the way, today's episode is powered by Google Gemini Three Flash.
Herman Poppleberry here, ready to dive into the spaghetti of WebRTC and WebSockets. To answer Daniel's core question, we have to look at what a voice agent actually is. It is not just one AI. It is a Rube Goldberg machine of at least four different services working in perfect sync. You have the ear, which is your Speech-to-Text. You have the brain, which is your Large Language Model. You have the mouth, which is your Text-to-Speech. And then you have the nervous system, which is the transport layer carrying all that audio data back and forth.
And if any one of those parts has a hiccup, the whole experience feels broken. If the ear takes too long to hear, the brain is sitting there idle. If the mouth starts talking before the brain is finished, you get overlapping audio. It is a coordination nightmare.
It really is. When you use a raw API, like OpenAI's Realtime API, which dropped back in October twenty-twenty-four, you are getting a massive shortcut because they combined the brain and the mouth, and sometimes the ear, into one model. But even then, you are still responsible for the plumbing. You have to manage a persistent WebSocket connection. You have to handle audio chunking, which means breaking down raw bytes of sound into something the API can digest. You have to handle Voice Activity Detection, or VAD, so the AI knows when you have actually stopped talking and it is not just a long pause for breath.
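That chunking step is easy to sketch without any provider at all. A toy, framework-free version (the twenty-millisecond frame size and base64 encoding are typical of streaming speech APIs, not any specific provider's requirement):

```python
import base64

SAMPLE_RATE = 16_000   # 16 kHz mono
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 20          # many streaming APIs expect roughly 20 ms frames

def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS) -> list[str]:
    """Split raw PCM audio into fixed-duration, base64-encoded chunks."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [
        base64.b64encode(pcm[i:i + chunk_bytes]).decode("ascii")
        for i in range(0, len(pcm), chunk_bytes)
    ]

# One second of silence becomes fifty 20 ms chunks.
one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
chunks = chunk_pcm(one_second)
```

In a real client this loop runs continuously against the microphone stream, which is exactly the kind of always-on plumbing the frameworks take off your plate.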
I tried to look at the documentation for a raw implementation last week, and my sloth brain nearly melted. It is not just send text, get text. It is send buffer, handle interruption event, clear playback queue, manage session state. If I am a developer just trying to build a pizza ordering bot, why do I care about audio buffers?
You probably should not have to care, and that is why frameworks like Vapi or Pipecat exist. They are essentially the orchestration layer. Think of the raw APIs as the individual instruments in an orchestra. The framework is the conductor. It ensures the violin starts when the cello stops. Without it, you are just a guy standing on a stage trying to play five instruments at once with your hands and feet.
But wait, how does that work in practice? If I’m using a framework, does it just hide the mess, or is it actually doing something more efficient under the hood than I could do with my own messy script?
It’s both. Let’s take the example of "echo cancellation" and "noise suppression." If you’re building this yourself, you’re sending raw audio from the user’s mic to the cloud. If the AI’s voice is coming out of the user’s speakers, their mic might pick up the AI talking and send it back to the server. Now the AI is listening to itself, getting confused, and responding to its own words. A framework like LiveKit handles that echo cancellation at the transport layer, often using WebRTC’s built-in stacks, so the "ear" only hears the human.
That sounds like a nightmare to debug. "Why is the AI arguing with its own echo?" is not a ticket I want to open on a Friday afternoon.
And that brings us to the raw approach. If I am a glutton for punishment and I want to use OpenAI’s Realtime API or Deepgram’s streaming directly, what am I actually signing up for?
You are signing up for a lot of low-level state management. Let’s take the interruption problem. This is the hardest part of voice AI. If the AI is mid-sentence, saying, "I would be happy to help you with your insurance claim today," and the human interrupts with, "Wait, I moved last week," the AI needs to stop talking immediately. In a raw API setup, you have to catch that incoming audio from the user, realize it is speech, send a cancel event to the LLM, and simultaneously tell the user's browser or phone to flush the audio buffer that was already queued up for playback. If you miss that by even two hundred milliseconds, the AI keeps blathering for a second after the human started talking, and the illusion of a natural conversation is dead.
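That cancel-and-flush sequence boils down to a small state machine. A stripped-down sketch in pure Python, with no real transport attached; the event names here are illustrative stand-ins, not any provider's actual protocol:

```python
class TurnManager:
    """Tracks who holds the floor and what happens when the human barges in."""

    def __init__(self):
        self.agent_speaking = False
        self.playback_queue: list[bytes] = []
        self.events: list[str] = []  # stand-in for messages to the LLM/client

    def agent_starts_speaking(self, audio_chunks: list[bytes]):
        self.agent_speaking = True
        self.playback_queue.extend(audio_chunks)

    def on_user_speech_detected(self):
        """VAD fired while the agent holds the floor: cancel, then flush."""
        if self.agent_speaking:
            self.events.append("cancel_response")   # stop the LLM/TTS stream
            self.playback_queue.clear()             # drop audio queued for playback
            self.events.append("flush_playback")    # tell the client to do the same
            self.agent_speaking = False

tm = TurnManager()
tm.agent_starts_speaking([b"chunk1", b"chunk2"])
tm.on_user_speech_detected()
```

The two-hundred-millisecond budget mentioned above is the window in which this entire sequence, including the network round trip, has to complete.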
It’s even worse than that. When you flush that buffer, you have to make sure the transition isn't a "pop" or a "click" in the user's ear. You have to gracefully fade out the audio in milliseconds. If you don't, it sounds like a glitchy Max Headroom broadcast.
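The graceful fade can be as simple as a linear gain ramp over the tail of the queued sixteen-bit PCM. A minimal stdlib sketch; the five-millisecond ramp length is an illustrative choice:

```python
import struct

def fade_out_tail(pcm: bytes, sample_rate: int = 16_000, ramp_ms: int = 5) -> bytes:
    """Linearly ramp the last ramp_ms of 16-bit mono PCM down to silence."""
    samples = list(struct.unpack(f"<{len(pcm) // 2}h", pcm))
    ramp = min(sample_rate * ramp_ms // 1000, len(samples))
    for i in range(ramp):
        # gain goes from just under 1.0 down to 0.0 across the final `ramp` samples
        gain = (ramp - 1 - i) / ramp
        samples[len(samples) - ramp + i] = int(samples[len(samples) - ramp + i] * gain)
    return struct.pack(f"<{len(samples)}h", *samples)

# A constant full-scale tone would "click" if cut abruptly; after the fade it ends at zero.
tone = struct.pack("<160h", *([10_000] * 160))  # 10 ms of audio at 16 kHz
faded = fade_out_tail(tone)
```

Cutting the buffer without this ramp leaves a discontinuity in the waveform, and that discontinuity is the pop you hear.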
It sounds like building a video game engine just to play a game of Pong. You are spending eighty percent of your time on the infrastructure and twenty percent on the actual logic of the agent.
That is a great way to put it. And it is not just interruptions. It is noise. If a dog barks in the background, does your raw API call think that is speech? You have to tune your VAD thresholds. If the user has a spotty internet connection and a few packets drop, does your WebSocket crash? You have to write the reconnection logic. You are building a production-grade streaming server before you even write your first system prompt.
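Tuning VAD thresholds is exactly the kind of knob you inherit with a raw setup. Here is a toy energy-based VAD with a hangover counter, the simplest version of what models like Silero replace; the threshold and hangover values are illustrative, not recommendations:

```python
def run_vad(frame_energies: list[float], threshold: float = 0.02,
            hangover_frames: int = 10) -> list[bool]:
    """Classify each audio frame as speech or silence, holding the 'speech'
    state for a short hangover so a pause for breath does not end the turn."""
    decisions = []
    hang = 0
    for energy in frame_energies:
        if energy >= threshold:
            hang = hangover_frames       # speech detected: reset the hangover timer
        elif hang > 0:
            hang -= 1                    # brief pause: still counts as speech
        decisions.append(hang > 0)
    return decisions

# Loud frames, a short pause, more speech, then silence long enough to end the turn.
energies = [0.1] * 3 + [0.0] * 5 + [0.1] * 2 + [0.0] * 20
d = run_vad(energies)
```

A barking dog shows up as a short energy spike: the tradeoff between the threshold and the hangover length decides whether it registers as speech, which is why being stuck with a provider's fixed settings hurts.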
So enter the frameworks. Daniel mentioned LiveKit, Vapi, and Pipecat. They all seem to be fighting for the same territory, but they feel very different in how they approach it. Herman, give me the breakdown on Vapi first, because that seems to be the one people gravitate toward when they want to get something running in ten minutes.
Vapi is what I would call a managed orchestration platform. They essentially say, "Give us your API keys for OpenAI and ElevenLabs, tell us what you want the bot to say, and we will give you a single endpoint." They handle the WebRTC servers, they handle the VAD, the interruptions, the multi-provider switching. It is incredibly developer-friendly. You can literally go from zero to a functioning high-quality voice bot in a few lines of code.
What is the catch? There is always a catch with managed services. I am assuming it is the classic vendor lock-in and the middleman tax.
You nailed it. You are paying a premium on top of the underlying model costs for that convenience. And more importantly, you are limited by their abstractions. If you want to do something really weird, like inject a custom audio effect mid-stream or use a very specific, niche STT model that Vapi doesn't support yet, you might be out of luck. You are trading control for speed. It is great for prototypes or companies that don't want to hire a dedicated real-time media engineer.
But what about the "black box" aspect? If a call drops or the latency spikes on Vapi, do I have any visibility into whether it's OpenAI being slow or Vapi's routing being congested?
That’s the rub. You’re at the mercy of their dashboard. They give you logs, sure, but you can't go in and tweak the underlying C++ code of the media server to optimize for a specific network condition in, say, rural India. You’re buying a finished product, not a toolkit.
Okay, so then you have something like LiveKit Agents. I see their name everywhere in the open source world. How does their approach differ from the Vapi model?
LiveKit is fascinating because they started as a general-purpose WebRTC infrastructure company. They build the servers that power group video calls and live streaming. Their Agents framework is an extension of that. Unlike Vapi, which is a hosted service, LiveKit is something you can run yourself. It is a Python or JavaScript framework where you define a pipeline. You say, "Here is my source of audio, here is my plugin for Deepgram, here is my plugin for GPT-4o, and here is my output."
So it is more modular. It is like Lego sets for voice agents.
It uses a worker-based architecture. When a user joins a LiveKit room, a worker process spins up, runs your agent code, and connects the audio tracks. Because it is built on top of a mature WebRTC stack, it handles things like network jitter and packet loss much better than a naive WebSocket implementation would. It is very powerful if you need to scale to thousands of concurrent users and want to keep your infrastructure costs down by hosting it yourself on something like Modal or your own cloud.
I like the sound of that, but I bet the learning curve is steeper. You actually have to understand how a room works and how to deploy these workers. You can't just call an API and call it a day.
It definitely requires more engineering muscle. You are responsible for the deployment and the scaling of those workers. But for a production-grade application where you need to own the data flow—maybe for HIPAA compliance or just for cost reasons—it is often the better choice. Think about a medical scribe app. You probably don't want your sensitive patient audio passing through a third-party startup's middleman servers if you can avoid it. With LiveKit, the audio goes from the user to your server to the provider. You cut out one link in the chain.
That makes total sense. Then there is Pipecat. I have seen Daniel talk about this one on GitHub. It is open source, it's Python-based, and it seems to be gaining a lot of steam. Where does Pipecat fit in this spectrum?
Pipecat is the purist’s framework. It was started by the team at Daily, another WebRTC provider. It is very focused on the concept of a pipeline. In Pipecat, everything is a frame. An audio chunk is a frame. A piece of text is a frame. You build a graph where these frames flow through different processors. It is incredibly transparent. If you want to see exactly how the audio is being processed between the STT and the LLM, you can just look at the pipeline code.
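The frame idea is easy to sketch without the library itself. A toy version of that pipeline shape, purely to show the concept; real Pipecat processors are async and handle many more frame types than this:

```python
from dataclasses import dataclass

@dataclass
class AudioFrame:
    pcm: bytes

@dataclass
class TextFrame:
    text: str

class FakeSTT:
    """Stand-in for a speech-to-text stage: audio frames in, text frames out."""
    def process(self, frame):
        if isinstance(frame, AudioFrame):
            return TextFrame(text="hello there")  # pretend transcription
        return frame

class UppercaseLLM:
    """Stand-in for the 'brain' stage: transforms text frames."""
    def process(self, frame):
        if isinstance(frame, TextFrame):
            return TextFrame(text=frame.text.upper())
        return frame

def run_pipeline(frame, processors):
    """Frames flow through each processor in order, just like the graph described."""
    for p in processors:
        frame = p.process(frame)
    return frame

out = run_pipeline(AudioFrame(pcm=b"\x00\x01"), [FakeSTT(), UppercaseLLM()])
```

Because every stage speaks the same frame vocabulary, inspecting or inserting a processor between the STT and the LLM is trivial, which is the transparency being described here.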
It sounds like it is for the people who want the modularity of LiveKit but perhaps a more explicit way of handling the logic. I noticed it handles things like multi-modal agents really well too, right? Not just voice, but images and video in the same stream.
It does. It is very flexible. The beauty of Pipecat is that it is provider-agnostic. You can swap out OpenAI for Anthropic or a local Llama model running on a GPU cluster without rewriting your whole orchestration logic. It sits in that sweet spot between raw APIs and a fully managed platform. You get the guardrails of a framework, but the hood is wide open for you to tinker with.
Let's dig into a concrete example there. Say I want my agent to be able to "see" what the user is doing. Maybe it's a tech support bot and the user is holding up a broken router to their webcam. How does Pipecat handle that differently than just a voice-only API?
In Pipecat, you can add a "Vision Processor" to your pipeline. As the video frames come in over the WebRTC stream, Pipecat can sample them, send them to a model like GPT-4o or Claude 3.5 Sonnet, and inject that visual context into the conversation. The agent can then say, "I see the red light is blinking on the left side," while it's still listening to the user's audio. Doing that with raw APIs means managing two or three separate streaming connections and trying to synchronize the timestamps so the AI doesn't talk about a frame it saw five seconds ago.
That sounds incredibly powerful. It’s like giving the Rube Goldberg machine a set of eyes that actually talk to the ears. So let’s talk about the second-order effects here. If I am a startup founder and I choose Vapi today because I need to launch on Monday, am I screwing myself for next year?
It depends on your scale. We saw a case study recently of a startup that started on a managed platform. They hit about ten thousand minutes of talk time a month, and suddenly their bill was five times what it would have been if they were hitting the APIs directly. But the bigger issue was latency. Every middleman adds a few milliseconds. If the managed platform's server is in Virginia and your user is in London, and the LLM is in California, you are bouncing audio around the planet like a pinball. By switching to a raw implementation or a self-hosted LiveKit setup, they were able to shave three hundred milliseconds off their response time. In voice, three hundred milliseconds is the difference between a bot that feels smart and a bot that feels broken.
That is huge. It is the uncanny valley of conversation. If the pause is just a little too long, the human brain disengages or gets frustrated. But at the same time, if that startup didn't have the managed platform, they might have spent six months just getting the WebSocket to not crash, and they would have zero customers.
That is the tradeoff. I usually tell people to think of it like a decision tree. Are you building a feature or a product? If the voice agent is just a small part of your app, use a managed framework. Don't waste your life on WebRTC. But if the voice agent IS the product—if you are building an AI receptionist or a language tutor—you eventually need to own the stack. You need that low-level control to optimize for every single millisecond.
I want to go back to something you mentioned earlier: the "dual-track" problem. We have talked about this before in the context of APIs for agents. Does using a framework help with the fact that we are currently building two versions of everything—one for humans to see on a screen and one for the AI to hear?
It can. Some of these frameworks are starting to integrate with frontend state. For example, if the voice agent says, "I have updated your reservation," a good framework can emit a data message that your web app catches to instantly refresh the UI. If you are doing that raw, you are managing a separate data channel on top of your audio channel, and keeping them synchronized is a nightmare. You don't want the UI to update three seconds before the AI finishes saying the sentence.
Right, because if the user sees the "Confirmed" checkmark on their screen while the AI is still saying "I'm looking for a slot," it breaks the magic. It feels like the AI is just a puppet.
You want that "shared state" between the voice and the visuals. Frameworks like LiveKit use a concept called "Data Channels" in WebRTC to send small JSON packets alongside the audio. Because they travel over the same connection, they stay perfectly in sync. If the AI hits a specific word in the text-to-speech output, the framework can trigger a visual event on the website at that exact millisecond.
I can see why Daniel is asking this. The landscape is moving so fast that what was true in October twenty-twenty-four when the Realtime API launched is already being challenged by these higher-level abstractions. What about the quality of the VAD? I feel like that is where most of my frustrations lie when I talk to these bots. They either cut me off mid-thought or they wait forever after I am done.
VAD is the silent killer. Most people think it is a solved problem, but it is not. A raw API usually gives you a generic VAD. But a framework like LiveKit or Pipecat lets you swap in different VAD models. You could use Silero VAD, which is very popular and robust, or you could even use a small local model that is trained specifically to ignore background office noise but catch human speech. If you are using a raw, all-in-one API, you are stuck with whatever the provider gives you.
It is like having a car where you can't change the tires. It works great on the highway, but the moment you hit a little bit of mud—or in this case, a noisy coffee shop—the whole thing slides off the road.
And let's talk about the "long tail" Daniel mentioned. There are probably twenty other frameworks popping up. Some are built on top of LangChain, some are trying to be the "React of Voice." My concern with the long tail is maintenance. Real-time audio is hard. It requires constant updates as browser standards change and as AI providers update their streaming protocols. If you pick a niche framework that is just a wrapper around a few APIs, and the maintainer gets bored, your production app is dead in six months.
So stick to the ones with real momentum. LiveKit and Pipecat seem to have the strongest community backing right now. Vapi has the most commercial traction for ease of use.
I would agree with that. And we should mention the cost of context. When you use a raw API that handles the whole loop, the provider is often charging you for the "session." If that session stays open for thirty minutes, you are paying for the overhead of that persistent connection. Some frameworks are getting clever about "hibernating" agents or using cheaper models for the initial greeting and then "handing off" to a more expensive model once the conversation gets serious. Doing that handoff manually in raw code is incredibly complex because you have to transfer the entire audio state and conversation history without the user noticing a glitch.
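That handoff is mostly careful state transfer. A toy sketch of the "cheap greeter, expensive closer" pattern; the model names and the keyword trigger are illustrative, not how any real platform decides to escalate:

```python
class Session:
    """Holds the one thing that must survive a model handoff: the history."""
    def __init__(self, model: str):
        self.model = model
        self.history: list[dict] = []

    def add_turn(self, role: str, text: str):
        self.history.append({"role": role, "content": text})

def maybe_hand_off(session: Session, expensive_model: str,
                   serious_keywords=("claim", "refund", "cancel")) -> Session:
    """Escalate to the expensive model once the conversation gets serious,
    carrying the full history across so the agent still 'remembers' everything."""
    last_user = next((t["content"] for t in reversed(session.history)
                      if t["role"] == "user"), "")
    if any(k in last_user.lower() for k in serious_keywords):
        upgraded = Session(expensive_model)
        upgraded.history = session.history  # transfer the context, never truncate it
        return upgraded
    return session

s = Session("cheap-greeter-model")
s.add_turn("assistant", "Hi! How can I help?")
s.add_turn("user", "I need to file an insurance claim.")
s = maybe_hand_off(s, "expensive-reasoning-model")
```

In production the hard part is doing this mid-stream, while audio is in flight, which is the glass-of-water relay described next.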
It’s like a relay race where the runners have to pass a glass of water without spilling a drop while sprinting at full speed.
And the water is a multi-gigabyte context window of conversation history. If you lose the context during the handoff, the AI forgets the user's name or what they were just talking about, which is a total dealbreaker for a professional agent.
Let’s get into some practical takeaways for the listeners who are sitting there with an IDE open, trying to decide which repo to clone. If you are building a prototype, what is the move?
If it is a prototype or a proof of concept, start with Vapi or a similar managed platform. Your goal is to validate the user experience. Does the voice interaction actually add value? Don't spend three weeks configuring WebRTC TURN servers for a product that might not have a market. Move fast, get the prompt right, and see how users react to the latency.
And if you are an enterprise or a high-growth startup where you know you are going to be doing millions of minutes?
Then you need to look at LiveKit Agents or Pipecat. You want to build on a foundation where you have an "escape hatch." If OpenAI raises their prices or Anthropic releases a model that is twice as fast, you want to be able to swap the "brain" of your agent without rebuilding the entire "nervous system." These open source frameworks give you that abstraction layer where the provider is just a plugin, not the entire platform.
But how does that work if I want to switch from OpenAI's Realtime API to a combination of Deepgram and ElevenLabs? Is that a "one line of code" change in these frameworks, or am I still rewriting half my app?
In Pipecat, it’s remarkably close to one line. You just swap out the OpenAILLMService for a DeepgramSTTService and an AnthropicLLMService in your pipeline definition. The framework handles the fact that Deepgram sends audio chunks in one format and ElevenLabs expects them in another. That’s the real value of the "frame-based" architecture—it standardizes the data so the components are truly hot-swappable.
I think the most important takeaway is to audit your latency. Don't just trust the marketing. Measure the time from the end of the user's speech to the first byte of the AI's audio. If it is over one second, you are in trouble. If it is under five hundred milliseconds, you are in the gold zone.
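Measuring that is just two timestamps: the moment VAD marks end-of-speech, and the moment the first TTS audio byte arrives. A minimal sketch, using the latency buckets from this conversation:

```python
def time_to_first_byte(end_of_speech_ts: float, first_audio_ts: float) -> float:
    """Voice-to-voice latency in milliseconds."""
    return (first_audio_ts - end_of_speech_ts) * 1000

def grade(latency_ms: float) -> str:
    """Buckets from the discussion: under 500 ms is gold, over a second is trouble."""
    if latency_ms < 500:
        return "gold zone"
    if latency_ms <= 1000:
        return "acceptable"
    return "in trouble"

# In a real client these stamps would come from time.monotonic(), captured at
# the VAD end-of-speech event and on the first inbound TTS audio chunk.
end_of_speech = 100.000
first_audio = 100.420
latency = time_to_first_byte(end_of_speech, first_audio)
```

Logging this per turn, rather than per call, is what exposes the slow outliers that marketing-page averages hide.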
And remember that the framework is not just for the happy path. It is for the edge cases. What happens when the user's microphone is muted? What happens when two people talk at once? What happens when the LLM hallucinates a termination token and stops the conversation early? A good framework has thought about these things.
Here's a fun fact for you, Herman. Did you know that some of the earliest "voice agents" in the 1960s, like ELIZA, didn't actually have audio? They were text-based, but people still treated them like they were alive. Now that we've added voice, that psychological effect is amplified by like a thousand. We're much more sensitive to a "voice" being rude or slow than we are to a text box.
That’s a great point. It’s called the Media Equation—the idea that humans treat computers and other media as if they were real people or places. When a voice agent hesitates for 1.2 seconds, our lizard brain doesn't think "Oh, the API is high-latency." It thinks "Why is this person being hesitant? Are they lying to me? Are they confused?" That’s why these frameworks are so obsessed with shaving off fifty milliseconds here and there. It’s not just tech specs; it’s social engineering.
It’s the difference between a high-wire act and a high-wire act with a safety net. The raw API is the wire. The framework is the net. You can walk the wire without the net, but you better be a world-class performer, and even then, one gust of wind can ruin your day.
I love that. And honestly, the pace of innovation here is so high that I wouldn't be surprised if the "raw" APIs themselves start incorporating more of these framework features. We are already seeing providers add more granular control over VAD and turn-taking. But for now, the orchestration tax is real, and frameworks are the best way to pay it.
One last thing before we wrap up—what do you think about the future of "agent-native" infrastructure? Do you think we will eventually see hardware that is optimized for these frameworks?
We already are. There are companies working on specialized chips for low-latency audio processing at the edge. Imagine a world where the VAD and the initial STT happen on your device, and only the "intent" is sent to the cloud. That would eliminate the transport latency entirely. The frameworks that survive will be the ones that can bridge that gap between edge processing and cloud intelligence. Think about the "Rabbit R1" or the "Humane Pin"—regardless of how those specific products were received, they represent a move toward hardware that is essentially a physical manifestation of these voice frameworks.
So, in five years, we might not even be talking about WebRTC. We might be talking about direct neural-audio streams.
I wouldn't put it past the industry. But for today, Daniel, if you're listening: stick to the frameworks that let you sleep at night. Don't build a media server unless you really, really love debugging C++ header files.
It is a great time to be a developer, but a terrible time to be indecisive.
Well, hopefully this cleared some of the fog for Daniel and the rest of the listeners. It is not about finding the "best" framework; it is about finding the one that matches your engineering capacity and your long-term goals.
Before we go, I want to say thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power the generation of this show. Their serverless infrastructure is actually a great example of the kind of "abstraction that matters" we have been talking about today.
If you found this dive into voice frameworks useful, we would love it if you could leave us a review on Apple Podcasts or Spotify. It genuinely helps other people find the show and keeps us motivated to keep digging into these weird prompts Daniel sends our way.
You can find the full archive and all the links we mentioned at myweirdprompts dot com. We are also on Telegram if you want to get a ping every time a new episode drops.
This has been My Weird Prompts. Thanks for listening.
See ya.