Hey Herman, you know what I was thinking about this morning while I was making coffee? I was looking out the window at the sun hitting the stone walls of the Old City, just waiting for my phone’s little padlock icon to click open so I could check the news. It hit me just how much we have accepted facial recognition as the absolute default. It is so seamless that we barely notice it anymore. It is almost like a ghost in the machine that just knows it is me. But then I saw that audio prompt our housemate Daniel sent us last night, and it really shifted my perspective. He was asking about voice biometrics and why we are not using our voices to unlock everything in the same way we use our faces or our thumbs.
It is a fascinating question, Corn, and honestly, a bit of a provocative one given where we are in two thousand twenty six. And hello everyone, Herman Poppleberry here. You know, Daniel is right to point out that there is this weird, almost frustrating lopsidedness in how we have adopted biometrics over the last decade. We had fingerprints, which were the big thing for years—remember the old capacitive sensors that would fail if your hands were even slightly sweaty? Then we moved to facial recognition, which is now the gold standard for convenience. But voice? Voice feels like it is stuck in this perpetual state of being "almost ready" but never quite arriving as a primary security layer. It is the "fusion power" of the biometric world—always ten years away.
Exactly. And Daniel mentioned he has been poking around GitHub looking for projects that implement voice authentication, but he found that most of them are either ancient—like, from the pre-generative artificial intelligence era—or just not reliable enough for daily use. It makes me wonder if there is a fundamental technical hurdle we are missing, or if it is more of a social and privacy issue. I mean, we live here in Jerusalem, and even just walking down Jaffa Street, you see people talking to their phones constantly. They are sending voice notes, giving commands to their cars, talking to their glasses. The technology to capture voice is everywhere, so why is it not being used to secure our most private data?
Well, I think we have to look at the nature of the data itself, Corn. To understand why voice is struggling, we have to understand why Face ID succeeded so spectacularly. When you use something like modern facial recognition on a high-end smartphone in two thousand twenty six, it is not just taking a flat photo of you. It is using a vertical-cavity surface-emitting laser, or VCSEL, to project thirty thousand invisible infrared dots onto your face. It is creating a high-resolution three-dimensional map. That is a massive amount of high-fidelity spatial data. It is very difficult to spoof that with a simple two-dimensional image or even a high-quality video on another screen because the sensor is looking for depth and contour.
So you are saying that from a purely mathematical standpoint, a face has more unique "bits" of information that are harder to replicate in the physical world?
In a way, yes. Think about the "dimensionality" of the signal. Voice, at its core, is a one-dimensional signal over time. It is a pressure wave. While there is an incredible amount of complexity in the harmonics, the jitter, the shimmer, and the unique cadence of a person's speech, it is fundamentally a signal that we have become very, very good at recording and manipulating. To spoof a high-end facial recognition system, you often need a three-dimensional physical mask made of specialized materials, or a very sophisticated digital injection into the camera feed at the hardware level. But to capture someone's voice? All you need is a decent directional microphone and a bit of distance. We leave our "voice prints" everywhere. Every time you take a call in public, every time you send a voice note, you are essentially broadcasting your biometric key to anyone within earshot.
That is actually a bit terrifying when you put it that way. It is like leaving your house keys on the sidewalk every time you speak. But Daniel raised another point that I think is even more relevant to us as privacy-conscious users. He mentioned that many people are uncomfortable with "always-on" cameras. I totally get that. I have a physical slider on my laptop camera, and I know you have one on your desktop monitor too. There is something visceral about a lens pointing at you. But if we move to voice biometrics, does that not just trade one privacy nightmare for another? Instead of a camera watching you, you have a microphone that has to be constantly "listening" for your wake word or your specific voice print.
You have hit on the big architectural trade-off. For voice biometrics to be as convenient as facial recognition, the device has to be in a state of constant readiness. It has to be sampling audio to decide if the person speaking is the authorized user. Now, we have made huge strides in hardware. Modern chips have what we call low-power "secure enclaves" or "neural engines" that can do this processing locally on the device. The audio never has to leave the phone; it never goes to the cloud. But the "perception" of being always heard is a huge psychological barrier. People remember the scandals from the early twenty-twenties where contractors were listening to Siri or Alexa recordings. That trust hasn't been fully rebuilt.
Right, and it is not just the privacy of the primary user. It is the privacy of everyone around them. If I am using voice authentication in a crowded cafe near the shuk, my phone is essentially recording snippets of everyone else's conversations just to find my voice in the noise. That feels like a much wider net than a camera that is only looking at the person directly in front of it. A camera has a field of view; a microphone has a "field of hearing" that is much harder to bound.
Definitely. But let’s look at why facial recognition took off so much faster from a "user intent" perspective. I think this is the secret sauce. When you pick up your phone, you are almost always looking at it. The act of using the device naturally aligns with the biometric check. It requires zero extra effort. It is "passive" authentication. With voice, you usually have to actively say something. It is a conscious action. Unless you are already in the habit of using voice assistants, it feels like an extra step. And let’s be honest, Corn, it feels a bit weird to talk to your computer to unlock it when you are in a quiet office or a library.
Oh, for sure. I would feel like a total nerd saying "Unlock computer" or "It is me, Corn" in the middle of a quiet workspace. I’d be the guy everyone stares at. But wait, Herman, what about the security aspect Daniel mentioned? He was specifically asking about deepfakes and voice cloning. We are in February of two thousand twenty six now, and the progress in generative audio over the last twenty-four months has been staggering. I can go to a dozen different websites right now, upload thirty seconds of your voice from an old episode of this podcast, and have a near-perfect clone of you saying whatever I want. I could make you confess to being a secret fan of bad eighties synth-pop!
Hey, I have never hidden my love for a good synthesizer! But you are right. That is the million-dollar question, and honestly, it is the biggest reason why voice is not being used for high-stakes security like banking or government access. A few years ago, we talked about "replay attacks," where someone would just record you saying a specific phrase and play it back to the microphone. We solved that with "challenge-response" systems, where the device asks you to say a random string of words. But generative artificial intelligence has made those defenses obsolete. A sophisticated attacker can now generate those random words in your voice in real-time.
So if the device says, "Repeat after me: The quick brown fox jumps over the lazy dog," the attacker’s software can just synthesize that on the fly?
Exactly. The latency on these models is now down to under two hundred milliseconds. It is essentially instantaneous. However, there are some defensive technologies that are quite clever. Researchers are looking at "liveness detection" for audio. This involves looking for physiological artifacts that are present in a human throat and mouth that a speaker or a digital injection might not perfectly replicate. For example, when you speak, there are sub-audible frequencies, certain patterns of air turbulence, and "plosives"—those little pops of air when you say words starting with 'P' or 'B'—that sound very different when they come from a human mouth versus the diaphragm of a speaker.
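The challenge-response check the hosts are describing can be sketched in a few lines. Everything here is illustrative: the transcribed text and the voice embeddings are assumed to come from some external speech stack (they are passed in directly), and the word list and similarity threshold are invented for the example.

```python
import secrets
import math

WORDS = ["quick", "brown", "fox", "lazy", "dog", "jumps", "river", "stone"]

def make_challenge(n_words=4):
    # A random phrase the user must speak; because it is unpredictable,
    # a pre-recorded replay of a fixed phrase will not match.
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))

def cosine(a, b):
    # Cosine similarity between two voice-print embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def verify(challenge, spoken_text, probe_emb, enrolled_emb, threshold=0.85):
    # 1) Content check: did the user say the random words we asked for?
    if spoken_text.strip().lower() != challenge:
        return False
    # 2) Speaker check: does the voice match the enrolled print?
    return cosine(probe_emb, enrolled_emb) >= threshold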
But is that not just another arms race? As soon as we find a way to detect a fake based on air turbulence, the generative models will just be trained on that specific data to become even more realistic. It feels like a losing battle for the defenders because the attackers only have to succeed once, while the defense has to be perfect every time.
It is an arms race, but that is the history of all security. Even with facial recognition, we have seen "adversarial attacks" where people wear specific glasses or infrared-emitting jewelry to trick the system. The difference is the "barrier to entry." For voice cloning, the barrier has dropped through the floor. You do not need a specialized lab or a three-dimensional printer anymore; you just need a decent graphics card and some open-source code from GitHub.
That brings me back to what Daniel was saying about those GitHub projects. He mentioned they are often outdated. I suspect that is because the field is moving so fast that a project from two thousand twenty-three is practically prehistoric. If a library was built before the current wave of large-scale "Zero-Shot" text-to-speech models, its security assumptions are probably totally broken.
I think you are spot on. A lot of those older projects rely on "Gaussian Mixture Models" or basic "i-vectors"—technologies that were designed to distinguish between different people in a clean environment. They were never designed to distinguish between a person and a high-quality artificial intelligence clone of that person. To do this right today, you need a system that is constantly being updated with the latest "anti-spoofing" or "deepfake detection" layers. That is incredibly hard for a small open-source project to maintain. You need massive datasets of both real and synthetic voices to train a discriminator that can tell the difference.
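The architecture Herman sketches, a deepfake discriminator gating a conventional speaker match, reduces to roughly this shape. The detector and scorer are hypothetical callables standing in for trained models, and the thresholds are arbitrary; the point is only that the anti-spoofing stage is a separate, independently updatable layer that older GMM/i-vector stacks never had.

```python
def authenticate(audio, enrolled_print, spoof_detector, speaker_scorer,
                 spoof_threshold=0.5, match_threshold=0.8):
    # Stage 1: anti-spoofing. A discriminator trained on real versus
    # synthetic speech estimates the probability the audio is fake.
    # Legacy pipelines skip this stage entirely, which is why a
    # high-quality clone sails straight through them.
    if spoof_detector(audio) >= spoof_threshold:
        return False
    # Stage 2: classic speaker verification against the enrolled print.
    return speaker_scorer(audio, enrolled_print) >= match_threshold
```

In practice the discriminator is the part that rots fastest, which is exactly the maintenance burden Herman says small open-source projects struggle with.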
So, where does that leave us? Are we just going to stick with faces and fingerprints forever? Or is there a place where voice biometrics actually makes sense? Daniel seemed hopeful that there was a future here.
I think the future is "multi-modal" authentication. This is where things get really interesting. Instead of relying on just one thing, your device might look at your face and simultaneously listen to the way you say a specific phrase. It is checking for "lip-sync" consistency. If the audio of your voice doesn't perfectly match the micro-movements of your lips, the system flags it as a deepfake. Or it might combine your voice print with your "behavioral biometrics," like the way you hold the phone, the slight tremor in your hand, or the speed at which you type. The more layers you add, the exponentially harder it becomes for an attacker to fake everything simultaneously in a live environment.
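One hedged way to picture the multi-modal fusion Herman describes is a weighted score combination with a per-modality floor, so that a strong face score cannot paper over a failed voice check. The modality names, weights, floor rule, and thresholds here are all invented for illustration, not any shipping system's policy.

```python
def fuse(scores, weights, threshold=0.8, floor=0.3):
    """Combine per-modality match scores (each in [0, 1]).

    Two rules, both assumptions for this sketch:
    - the weighted average must clear a global threshold, and
    - no single modality may fall below a floor, so an attacker
      cannot compensate for a failed face check with a cloned voice.
    """
    if any(s < floor for s in scores.values()):
        return False
    total = sum(weights[m] * s for m, s in scores.items())
    return total / sum(weights.values()) >= threshold
```

The exponential difficulty Herman mentions comes from the floor: the attacker must fake every modality at once, live, not just the weakest one.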
That makes sense. It is like that old security saying about "something you know, something you have, and something you are." But now it is more like "multiple things you are, all at once." I actually read about a system recently that uses "bone conduction" for voice authentication. It uses an accelerometer in a wearable—like a pair of smart glasses or an earbud—to pick up the vibrations of your voice through your skull rather than through the air. That would be incredibly hard to spoof with a deepfake because the attacker would need to physically vibrate your head!
Now that is the kind of nerdy detail I live for! Herman Poppleberry approved. But seriously, that points to a larger trend. We are moving away from "simple" biometrics toward "complex" ones that are tied to the physical body in ways that digital signals can't easily replicate. But we have to be careful. As we make these systems more complex, we also make them more prone to "false negatives." Imagine you have a bad cold, or you are just really tired and your voice is raspy. Your voice changes. If the system is too strict, you get locked out of your own life.
That is a great point. I have actually had my Face ID fail because I was wearing a heavy scarf and a beanie during that cold snap we had here in Jerusalem last month. It is annoying, but I can just type in my passcode. If a voice system fails every time I have a sore throat or if I am shouting over the wind, I am going to turn it off pretty quickly. Convenience is the enemy of security, but it is also the only reason people use security in the first place.
And that is the "sweet spot" that facial recognition hit. It reached a point where it is "secure enough" for ninety-nine percent of people and "convenient enough" that they actually leave it turned on. Voice has not found that sweet spot yet. It is currently either too easy to spoof or too frustrating to use in a real-world, noisy environment.
You know, we have been talking a lot about the technical side, but I want to go back to the "why" of why we are even doing this. Daniel mentioned things like FIDO, YubiKeys, and Passkeys. These are all part of the "passwordless" movement. The goal is to get rid of the weakest link in security, which is the human brain trying to remember "P-a-s-s-w-o-r-d-one-two-three" or using the same password for their bank and their pizza delivery app.
Exactly. Passkeys are a huge step forward because they use public-key cryptography. Your biometric, whether it is your face, your fingerprint, or eventually your voice, is just the "local" trigger. It is the key that unlocks the private key stored in the secure hardware of your device. The biometric data itself never leaves your phone. This is a crucial point that I think a lot of people miss. When you use Face ID, Apple or Google does not have a giant database of everyone’s faces. They just have a mathematical representation on your specific device. If we can get voice to that same level of local, secure processing, it becomes a very powerful tool.
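The passkey flow Herman outlines, where the biometric only gates a local private key and the server sees nothing but a signature over a fresh challenge, can be shown with textbook RSA and deliberately tiny numbers. This is a toy: real passkeys use hardware-backed elliptic-curve keys (P-256, Ed25519) inside a secure enclave, never parameters remotely like these.

```python
import hashlib

# Toy RSA key pair (classic textbook values, utterly insecure).
N, E = 3233, 17          # public key, registered with the server
D = 2753                 # private key, never leaves the device

def device_sign(challenge, biometric_ok):
    # The biometric (face, fingerprint, voice...) is only the local
    # gate: it unlocks use of the private key. No biometric data and
    # no private key ever leave the device.
    if not biometric_ok:
        raise PermissionError("biometric check failed")
    h = int.from_bytes(hashlib.sha256(challenge).digest(), "big") % N
    return pow(h, D, N)  # sign the server's random challenge

def server_verify(challenge, signature):
    # The server holds only the public key and a fresh challenge:
    # no biometric template, no password, nothing worth stealing.
    h = int.from_bytes(hashlib.sha256(challenge).digest(), "big") % N
    return pow(signature, E, N) == h
```

This is why the "giant database of faces" Herman debunks does not exist: the server-side secret is just a public key, and a fresh challenge per login defeats replay.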
Right, and that is why Daniel’s concern about "always-on cameras" is so valid. Even if the data is processed locally, the "physical" presence of an active camera or microphone is a vulnerability. If a piece of malware gets deep enough into your operating system, it could theoretically bypass those protections and start streaming your audio or video. It is the difference between a locked door and a door that doesn't exist.
It is a risk, for sure. But here is the thing about voice that might actually be a security "advantage" in some cases. Voice is "dynamic." You can change what you say. Your face is "static." You can’t exactly change the distance between your eyes or the shape of your nose if someone managed to spoof your facial data. With voice, you could have a system where the "passphrase" changes every time, or it is based on a secret that only you know. It combines "something you are" with "something you know." It is a two-factor check in a single interaction.
That is a fascinating angle. So it is like a verbal password that also checks your identity. But then we are back to the "social friction" of talking to our devices. I wonder if this will become more normal as we move toward "ambient computing." If we have smart speakers in every room and wearable devices like glasses that we are already talking to, then voice biometrics just becomes a natural, invisible part of that interaction.
I think that is exactly where it is going. If you are wearing a pair of smart glasses, you don’t want to take them off to look at a camera or fumble for a fingerprint sensor on the arm of the glasses. You want to just say, "Hey, what’s my schedule today?" and have the glasses recognize that it is you and give you the private information. In that context, voice is the only biometric that makes sense. It is the only one that doesn't break the "flow" of the interaction.
But man, the deepfake thing still haunts me. If I am a high-value target—like a CEO or a journalist—an attacker could just record me speaking at a conference or even just pull audio from this podcast, create a model, and then call my bank or my smart home system. We have already seen cases of "voice phishing" where people receive calls from their "boss" asking for an emergency wire transfer. In two thousand twenty-four and twenty-five, those scams cost companies hundreds of millions of dollars.
It is happening already, and it is getting more sophisticated. There was a case recently where a company was scammed because an employee thought they were on a video call with their Chief Financial Officer, but it was all deepfaked—the voice, the face, the background. If we can’t even trust our eyes and ears on a live call, how can we expect a simple algorithm to do it?
This makes me think that the "future" Daniel is asking about might not be about finding a "better" voice algorithm, but about changing the way we handle identity entirely. Maybe we need "digital signatures" for everything. If every audio stream had a cryptographic signature that proved it came from a specific, trusted hardware device—like my specific phone—then it wouldn't matter how good the clone was. The signature wouldn't match.
That is the "zero-trust" approach. Do not trust the data, trust the "provenance" of the data. But that requires a massive overhaul of our entire digital infrastructure. Every microphone, every camera, every communication protocol would need to support this kind of signing. We are starting to see the beginnings of this with things like "C-two-P-A" for images, which is meant to track the history and authenticity of digital media to fight misinformation. But for real-time audio? We are a long way off from that being a universal standard.
It feels like we are in this "uncanny valley" of security right now. We have moved past the old, simple methods like four-digit pins, but we haven't quite reached the robust, future-proof ones. We are stuck in the middle, trying to patch things up with biometrics that are increasingly vulnerable to the very artificial intelligence we are using to build them.
It is a bit of a "cat and mouse" game. But I want to give Daniel some credit here. He mentioned that he prefers voice because it feels less invasive than always-on cameras. I think there is a huge segment of the population that feels the same way. There is something about the "gaze" of a camera that feels much more intrusive than the "ear" of a microphone, even if the technical risks are similar. We have been living with microphones in our pockets for twenty years, but the camera always feels like a new intrusion.
It is psychological. We are social animals. In nature, if something is staring at you, it is usually a predator or a mate. A camera is a "permanent stare." A microphone feels more passive, like a friend listening to a story. But as we have discussed, that feeling might be misleading in a world of generative artificial intelligence.
So, what would we tell Daniel about the "future" of voice biometrics? Is it a dead end, or is it just waiting for its moment?
I think it is waiting for "context." Voice biometrics will probably never be the "only" way we unlock our phones, but it will be an essential part of how we interact with the world as computing becomes more invisible. The key will be "active" authentication. Instead of a device "always" listening, it will only listen when we initiate a command, and then it will use that brief window to verify who we are using a combination of our voice print, our bone conduction vibrations, and maybe even the unique way we pronounce certain vowels.
And we need to get much better at "synthetic media detection." We need the artificial intelligence that protects us to be just as smart as the artificial intelligence that is trying to trick us. It is going to be a constant, invisible battle happening in the background of our devices. It is the "Silent War" of the two thousand twenties.
You know, it is funny. We started this talking about the convenience of unlocking a phone while making coffee, but we ended up talking about a global arms race in artificial intelligence and cryptography. That is usually how our conversations go, isn't it? We start with a toaster and end up with the heat death of the universe.
Guilty as charged. But that is the reality of tech in two thousand twenty-six. Nothing is "just" a feature anymore. Everything is connected to these deeper shifts in how we define truth, identity, and privacy. If you are not thinking about the philosophical implications of your lock screen, are you even living in the twenty-first century?
Well, I for one am going to keep my camera slider closed and maybe be a little more careful about who is listening when I talk to my phone in public. But I do hope some of those GitHub projects get updated. It would be cool to have a "My Weird Prompts" voice command that actually knows it is me and maybe automatically starts the kettle.
We could call it "The Poppleberry Protocol." It has a certain ring to it, don't you think?
Oh boy. Let's not get ahead of ourselves, Herman. But seriously, this has been a great dive. Daniel, thanks for sending that prompt in. It really got us thinking about the trade-offs we make every single day just to check our email.
Absolutely. And if you are listening to this and you have got your own thoughts on biometrics, or maybe you are a developer working on one of these "prehistoric" GitHub projects and you want to defend your code, we would love to hear from you. You can find the contact form on our website at myweirdprompts dot com.
And hey, if you have been enjoying these deep dives into the weird and wonderful world of technology and philosophy, do us a favor and leave a review on your favorite podcast app. It really does help other people find the show. We have been doing this for over six hundred episodes now, and it is the community that keeps us going. We are still independent, still weird, and still questioning everything.
It really is the community. We have got a lot more ground to cover in the coming weeks, from the future of decentralized privacy to the strange corners of the internet that Daniel keeps finding for us.
Thanks for joining us today. You can find all our past episodes and the RSS feed at myweirdprompts dot com, and we are on Spotify and Apple Podcasts as well.
Stay curious, stay skeptical, and stay secure. This has been My Weird Prompts.
Catch you in the next one. Goodbye!
Goodbye!