Imagine you are listening to an AI-generated audiobook. The narrator is gliding through a beautiful English description of a sunset, and then the character quotes a line of Hebrew poetry. Suddenly, the voice turns into a complete train wreck, mispronouncing every syllable because it can't find the vowels. Or maybe it's a Hindi podcast where the host drops a technical term like "serverless architecture" and the AI tries to pronounce it using phonetic rules meant for Sanskrit. This gap between English-centric perfection and the messy, beautiful reality of global linguistics is exactly what we are digging into today.
It is a massive technical hurdle. We often talk about AI as this universal translator, but when you get down to the actual synthesis—turning text into a human-sounding voice—the "universal" part starts to crumble. Today's prompt from Daniel is about the current state of multilingual text-to-speech, or TTS, and the specific hurdles that keep most of the world's languages in the "uncanny valley" while English gets all the polish.
By the way, if we sound particularly sharp today, it might be because Google Gemini 1.5 Flash is writing our script. Hopefully, it knows how to pronounce its own name in multiple languages. Now, Herman Poppleberry, let's get into the weeds. When we say "multilingual TTS," are we talking about one giant model that knows everything, or a bunch of small models duct-taped together?
That is the big architectural debate right now. Historically, it was the "digital sandwich" approach—you’d have a language identification model, then a text normalizer specific to that language, then a phoneme converter, an acoustic model to turn those phonemes into a spectrogram, and finally a vocoder to make the sound. But in twenty twenty-six, we’re seeing a shift toward unified architectures. Think of things like Meta’s AnySpeech or the latest iterations of OpenAI’s Voice Engine. They use a shared encoder that tries to understand the "essence" of speech across all languages, then uses language-specific embeddings to steer the output.
A shared encoder sounds efficient, but I imagine it’s like a polyglot who speaks ten languages but has a slight accent in all of them. Is there a quality trade-off when you try to cram the entire world's phonetics into one model?
There absolutely is. This is often called "language interference" or "capacity dilution." Imagine a model's brain as a suitcase. If you only pack for a trip to London, you have plenty of room for heavy coats and specific gear. But if you’re trying to pack for London, the Sahara, and the Antarctic all at once, you’re going to leave some things out. If a model is spending its parameters learning the tonal nuances of Mandarin and the click sounds of Xhosa, it might actually lose some of the fine-grained prosody—the rhythm and intonation—that makes English sound natural. It’s a balancing act. But before we get to the "one model to rule them all" dream, we have to talk about why some languages are just fundamentally harder to turn into speech than others.
Right, and Daniel pointed us toward Hebrew as a prime example of this. I’ve seen written Hebrew; it looks like a beautiful series of characters, but to the uninitiated, it’s missing half the information.
It’s missing the vowels! This is the "Nikud" problem. In standard modern Hebrew text—like what you’d see in a newspaper or a text message—the vowels, or Nikud, are omitted. The reader just "knows" the word based on context. For a TTS model, this is a nightmare. The word for "book" and the word for "border" might be spelled identically in consonants—Samekh-Pe-Resh. Without a pre-processing step to "diacritize" the text—meaning, adding those little dots and dashes that represent vowels—the TTS engine is literally guessing.
So it’s not just a voice model; it’s a detective agency. You need a separate AI model just to sit in front of the voice engine and say, "Okay, based on the three words before this, he’s definitely talking about a book, not a border."
Spot on. And that pre-processing model—the diacritizer—usually has to be a sophisticated sequence-to-sequence transformer of its own. It’s analyzing the syntactic structure of the entire sentence to decide if a verb is past tense or present tense, which changes the pronunciation. If that model slips up, even by a tiny bit, the voice engine is doomed. It’s not a "voice" error at that point; it’s a "reading" error. This is also a huge issue in Arabic, where short vowels are also frequently omitted. You have "Harakat" or diacritics that change the meaning and the pronunciation entirely. If you’re building a multilingual TTS and you don't have a top-tier Arabic or Hebrew text-normalization layer, your high-end neural vocoder is just going to produce very high-quality gibberish.
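(For the technically curious, here is a minimal toy sketch of that front end. The context rules and word lists are made-up stand-ins for what is, in production, a full sequence-to-sequence diacritization model; nothing here should be read as how any real diacritizer works.)

```python
# Minimal toy sketch of a Hebrew TTS front end. The context rules below
# are made-up stand-ins for a full sequence-to-sequence diacritizer.

# The consonant skeleton samekh-pe-resh can be several different words.
# Choosing the right one decides which vowels (nikud) get inserted,
# and therefore what the voice engine ultimately says.
SFR_READINGS = {
    "sefer": "סֵפֶר",   # "book"
    "sfar":  "סְפָר",   # "frontier / border region"
    "sapar": "סַפָּר",  # "barber"
}

def diacritize_sfr(prev_word: str) -> str:
    """Toy disambiguator: pick a vowelized form from one word of left
    context. A real diacritizer reads the whole sentence's syntax."""
    if prev_word == "קראתי":   # "I read ..." -> almost certainly "book"
        return SFR_READINGS["sefer"]
    if prev_word == "יישוב":   # "settlement ..." -> "frontier settlement"
        return SFR_READINGS["sfar"]
    return SFR_READINGS["sefer"]  # fall back to the most frequent reading

def front_end(tokens: list[str]) -> list[str]:
    """Diacritize before phonemization, so the voice engine only ever
    sees fully vowelized text."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == "ספר":
            out.append(diacritize_sfr(tokens[i - 1] if i else ""))
        else:
            out.append(tok)  # a real system diacritizes *every* token
    return out

print(front_end(["קראתי", "ספר"]))  # -> ['קראתי', 'סֵפֶר'] ("I read a book")
```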
I love the idea of "high-quality gibberish." It sounds confident, but it's totally wrong. It’s like a politician. But wait, how does this work when you're reading something like a medical journal or a legal document? Surely the stakes for a "reading error" are way higher there than in a casual podcast?
Oh, the stakes are massive. Imagine a TTS system reading a prescription instruction in Arabic or Hebrew and getting the vowel wrong on a dosage-related word. That’s why in high-stakes environments, you often see "constrained" TTS where the model is forced to check against a dictionary of known valid pronunciations. But even then, the AI has to understand the grammar. In Arabic, the word "kataba" means "he wrote," but "kutiba" means "it was written." The consonants are exactly the same. If the AI doesn't understand the passive voice, it says the wrong word.
It’s like the AI needs a PhD in linguistics before it’s allowed to speak. But what about when we mix languages? Daniel mentioned "code-switching." I do this all the time—using a bit of French or some tech slang in the middle of a sentence. How does a model handle that without sounding like it’s having a stroke?
Code-switching is the "final boss" of multilingual TTS. There are three ways models handle it. The old way was "hard switching," where the system detects a language change and swaps the entire model mid-sentence. The problem? There’s usually a tiny, audible click or a change in the "room tone" of the voice because the two models weren't trained on the same microphone or in the same studio. It sounds like two different people are talking.
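(A minimal sketch of that "hard switching" pipeline. The `synth_en` and `synth_de` engines and the keyword-based language detector are hypothetical stand-ins; the audible seam Herman describes lives at the point where the two audio streams are simply concatenated.)

```python
# Toy "hard switching" pipeline: split text into language runs, route
# each run to a separate per-language engine, concatenate the audio.
# synth_en / synth_de are stand-ins for two independently trained voices.

def synth_en(text: str) -> bytes:
    return b"[EN-voice]" + text.encode()  # stand-in for an English TTS model

def synth_de(text: str) -> bytes:
    return b"[DE-voice]" + text.encode()  # stand-in for a German TTS model

GERMAN_WORDS = {"bitte", "links", "rechts", "abbiegen"}

def split_runs(text: str) -> list[tuple[str, str]]:
    """Naive language ID from a tiny keyword list. Real systems use a
    proper language-identification model here."""
    runs: list[tuple[str, str]] = []
    current_lang, current = "", []
    for word in text.split():
        lang = "de" if word.lower().strip(".,!") in GERMAN_WORDS else "en"
        if lang != current_lang and current:
            runs.append((current_lang, " ".join(current)))
            current = []
        current_lang = lang
        current.append(word)
    if current:
        runs.append((current_lang, " ".join(current)))
    return runs

def synthesize(text: str) -> bytes:
    audio = b""
    for lang, chunk in split_runs(text):
        engine = synth_de if lang == "de" else synth_en  # swap whole models
        audio += engine(chunk)  # the seam: different voices, rooms, mics
    return audio

print(synthesize("In five hundred meters bitte links abbiegen"))
```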
That sounds terrifying. Like a poltergeist in my GPS. One second it's a polite British lady, and the next second a gravelly German man is yelling "Biegen Sie links ab!"
It breaks the immersion completely. The second way is "phoneme mapping." You take the English word—say, "computer"—and you map its English sounds to the closest equivalent sounds in the primary language, like Hindi. But then it sounds like a Hindi speaker with a very thick accent saying "computer." It doesn't sound like a bilingual person; it sounds like a monolingual person struggling with a foreign word. The third, and most modern way, is "cross-lingual voice cloning." This is where the model learns the "identity" of a voice—your voice, my voice—independent of the language. It can then apply that identity to the phonemes of any language. So, it’s "Corn" speaking English, and then "Corn" speaking Japanese, with the same vocal fry and the same cheeky tone.
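(A toy version of the "phoneme mapping" approach. The sound table below is deliberately tiny and incomplete; real mappings are curated by phoneticians, but the failure mode is the same: sounds with no native equivalent get approximated away.)

```python
# Toy "phoneme mapping": English phones for "computer" snapped to the
# nearest Hindi sounds. Illustrative only, not a complete mapping.
EN_TO_HI = {
    "k": "क", "ə": "अ", "m": "म", "p": "प", "j": "य",
    "uː": "ऊ",
    "t": "ट",  # English alveolar /t/ usually lands on the retroflex ट
    "ɚ": "र",  # the r-colored schwa has no Hindi equivalent at all
}

computer = ["k", "ə", "m", "p", "j", "uː", "t", "ɚ"]  # /kəmˈpjuːtɚ/
print("".join(EN_TO_HI[p] for p in computer))
# The result reads like a monolingual Hindi speaker approximating the
# word, not like a bilingual speaker switching into English.
```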
I'm not sure the world is ready for "Corn" in Japanese, Herman. But that sounds like a massive data problem. If you want to train a model to do that, you need recordings of people who are actually bilingual, right? You can't just feed it a bunch of separate monolingual datasets and hope it figures out how to bridge the gap.
That’s been the traditional view—that you need "parallel data" where the same person speaks multiple languages. But twenty twenty-five and twenty twenty-six have seen a breakthrough in "unpaired" learning. Researchers are finding that if you have enough data in Language A and enough data in Language B, the model can learn an abstract representation of "speech" that works for both. It’s like learning the concept of "melody" separately from the concept of "lyrics." However, the data gap is still staggering. A 2025 report from Mozilla Common Voice showed that roughly 80 percent of all available high-quality speech data for training is concentrated in just ten languages.
Ten languages? Out of the thousands spoken on Earth? That’s a digital divide you could fit a galaxy through. I’m guessing English is number one, but who else is in that "top tier" of TTS support?
English, Spanish, Mandarin, French, German, Japanese, Portuguese, Russian, Italian, and Korean. If you speak one of those, you’re living in a golden age of TTS. The voices are indistinguishable from humans. They have emotion, they breathe, they can whisper. They can even handle sarcasm. But if you move even slightly outside that circle—to something like Yoruba, Amharic, or even major Indian languages like Marathi or Telugu—the quality drops off a cliff. You go from a Hollywood voice actor to a 1980s Speak & Spell in about five miles.
It’s the "Great Data Exhaustion" we’ve talked about before. If the AI hasn't heard it, it can't say it. But it's not just about the amount of data, right? It's about the script. We talked about Hebrew and Arabic, but what about something like Chinese or Japanese? Those aren't even phonetic scripts in the way we think of them.
Oh, the East Asian "logographic" scripts are a whole different level of complexity. In Chinese, the headache is character-to-phoneme mapping. One character can have multiple pronunciations—these are called polyphones. The word for "music" and the word for "pleasure" in Chinese use the same character—乐—but are pronounced differently: 'yuè' for music and 'lè' for pleasure. So again, you need a heavy-duty front-end model to do the disambiguation before the voice engine even starts.
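(This disambiguation step exists as off-the-shelf tooling for Mandarin. A quick sketch using the open-source pypinyin library; the expected outputs in the comments may vary slightly by library version.)

```python
# Polyphone disambiguation with pypinyin (pip install pypinyin), which
# resolves 乐 from word-level context before any audio is generated.
from pypinyin import pinyin

print(pinyin("音乐"))  # [['yīn'], ['yuè']]  "music":     乐 read as yuè
print(pinyin("快乐"))  # [['kuài'], ['lè']]  "happiness": 乐 read as lè
# A Mandarin TTS front end runs exactly this kind of disambiguation
# before a single sample of audio is generated.
```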
And Japanese is like the "advanced level" version of that, right? Since they mix three different writing systems in a single sentence?
It’s a mess! You have Kanji, which are Chinese characters; Hiragana, which is phonetic; and Katakana, another phonetic script used mainly for foreign loanwords. A Japanese TTS engine has to segment the sentence correctly. If it breaks the characters at the wrong spot, it’s like reading a sentence in English without any spaces. "Thepenismightierthanthesword." Depending on where you put the breaks, that sentence goes in two very different directions.
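(Japanese segmentation is likewise handled by a dedicated front-end tool before any audio is produced. A sketch using fugashi, an open-source MeCab wrapper; it assumes a dictionary such as unidic-lite is installed.)

```python
# Japanese word segmentation with fugashi
# (pip install fugashi unidic-lite). Without this step, the engine is
# effectively reading "withoutanyspaces".
from fugashi import Tagger

tagger = Tagger()
sentence = "すもももももももものうち"  # classic: "plums and peaches are kinds of peach"
print([word.surface for word in tagger(sentence)])
# expected: ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']
# Break that string in the wrong places and both the words and the
# prosody fall apart.
```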
I see what you did there, Herman. Very subtle. But let's look at the "loanword" problem. When a Japanese speaker says "McDonald's," they say "Makudonarudo." Does the AI know to use the Japanese phonetic version, or does it try to say it like an American, which might actually sound "wrong" to a Japanese listener?
That’s a brilliant point about "localization vs. globalization." A truly great multilingual model needs to know its audience. If the AI is reading a Japanese news report, it should probably use the Japanese pronunciation of foreign brands. But if it's a language-learning app, it might need to toggle between the two. This is where "metadata" comes in—telling the model not just what to say, but who it is supposed to be when it says it.
I see. So, if I’m building an app for the global market, do I go with one of these giant multilingual models, or do I try to hunt down specific models for each region? What’s the "pro-sumer" move here?
This is where the trade-offs get really interesting. If you use a giant multilingual model—like what SiliconFlow or Camb.ai are offering in twenty twenty-six—you get incredible "zero-shot" capabilities. You can give it a three-second clip of a voice and it can speak fifty languages in that voice. That's great for scale. But—and this is a big "but"—a 2025 study showed that fine-tuning a small, language-specific model on just ten hours of high-quality data often outperforms a massive multilingual model trained on a thousand hours.
Wait, ten hours beats a thousand? That seems counterintuitive. Is it just because the small model isn't "distracted" by other languages?
Precisely. Well, I shouldn't say "precisely," because there is some benefit to "cross-lingual transfer"—where learning Spanish helps a model learn Italian. But for the "last mile" of quality, the small model wins. It can dedicate its entire parameter count to the specific phonology, the specific "slang," and the specific "rhythm" of that one language. It’s the difference between a Swiss Army knife and a dedicated scalpel. If you're building a reading app specifically for the Israeli market, you'd be crazy not to use a model fine-tuned specifically for Hebrew with a top-tier Nikud-inserting front end. The general-purpose "global" model will always sound a bit like a tourist.
A tourist who’s trying very hard but keeps calling the "book" a "border." I get it. So, if I'm a developer and I want to support a "low-resource" language—something that isn't in that top ten—what’s my move? Do I just wait for Big Tech to eventually get around to it?
No, the open-source community is actually leading the charge here. Projects like Coqui TTS and Mozilla’s work have made it possible for small teams to build their own. If you can gather ten to twenty hours of clean audio—maybe from local radio or a few hired voice actors—you can take a "base" multilingual model and "fine-tune" it. You’re basically taking a model that already knows how to "speak" and giving it a very intensive course in a specific dialect. It’s much more effective than trying to train from scratch.
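(As a concrete starting point: a sketch of zero-shot cloning with Coqui's open-source XTTS v2, the kind of "base" model a small team would then fine-tune. The model name and call follow Coqui's published API, but the reference file is hypothetical and the project's packaging has changed over time, so check the current docs.)

```python
# Zero-shot cross-lingual cloning with Coqui's XTTS v2 (pip install TTS).
# reference_voice.wav is a hypothetical local file: a few seconds of
# audio that defines the voice identity, independent of language.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="नमस्ते, यह एक बहुभाषी आवाज़ परीक्षण है।",  # a Hindi test sentence
    speaker_wav="reference_voice.wav",  # the cloned voice identity
    language="hi",
    file_path="cloned_hindi.wav",
)
```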
It’s like teaching a singer to perform in a language they don't speak. They already know how to control their breath and hit the notes; they just need someone to coach them on the vowels. But let’s talk about the "non-Latin" scripts again. We mentioned Devanagari—the script used for Hindi. What’s the specific hurdle there? I know it’s phonetic, so it should be easier than Chinese, right?
You’d think so, but Devanagari has "inherent vowels." Each consonant is assumed to be followed by an "ah" sound unless there’s a specific mark—a "halant"—to suppress it. If the text normalization isn't perfect, the AI will add extra vowels where they don't belong, making it sound like a robot stuttering. Then you have Thai, which doesn't use spaces between words. The AI has to perform "lexical segmentation" just to figure out where one word ends and the next begins. If it gets that wrong, the prosody—the rise and fall of the voice—is completely ruined. It’s like someone... reading... a... sentence... with... the... wrong... pauses... and... emphasis.
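(A toy grapheme-to-phoneme pass showing exactly this inherent-vowel logic. The character tables are tiny, illustrative subsets; real Hindi front ends also handle schwa deletion, conjunct consonants, and nasalization.)

```python
# Devanagari's inherent vowel: every consonant carries an implicit "a"
# unless a vowel sign (matra) or the halant/virama (U+094D) suppresses it.
VIRAMA = "\u094d"  # the halant
CONSONANTS = {"क": "k", "म": "m", "ल": "l", "न": "n"}
MATRAS = {"\u093e": "aa", "\u093f": "i", "\u0940": "ii", "\u0941": "u"}

def g2p(word: str) -> str:
    phones = []
    chars = list(word)
    for i, ch in enumerate(chars):
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            nxt = chars[i + 1] if i + 1 < len(chars) else ""
            if nxt not in MATRAS and nxt != VIRAMA:
                phones.append("a")  # the inherent vowel kicks in
            # a following matra supplies its own vowel; a halant kills it
        elif ch in MATRAS:
            phones.append(MATRAS[ch])
        # the virama itself produces no sound
    return "-".join(phones)

print(g2p("कमल"))          # k-a-m-a-l-a (toy reading; no schwa deletion here)
print(g2p("कम" + VIRAMA))  # k-a-m: the halant suppresses the final vowel
```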
That sounds like me before my first cup of coffee. It’s fascinating that the "voice" part is almost the easy part now. We’ve solved the physics of making a digital larynx sound human. The real battle is in the "linguistic intelligence" of the front end. It’s the "text" part of "text-to-speech" that is failing the "speech" part.
That is a great way to put it. We are moving from "signal processing" problems to "contextual understanding" problems. And this is where the political and economic worldview comes in. Most of these models are built by Western companies or large Chinese firms. Their priority is their biggest markets. If you are a speaker of a "minor" language—which might still be spoken by thirty million people!—you are often an afterthought. This creates a "linguistic dark age" for those speakers in the AI era.
And that’s a massive missed opportunity. If you’re a pro-growth, pro-technology person, you want everyone connected to the digital economy. If you can’t interact with an AI in your native tongue—or if the AI sounds like a mocking caricature of your language—you’re not going to use it. It’s a barrier to entry for millions of entrepreneurs and students. Think about a farmer in rural India trying to use an AI advisor. If that AI speaks in a formal, robotic Hindi that sounds like a 1950s news broadcast, he's going to turn it off.
And it's not just about the words; it's about the "vibe." In many cultures, the way you speak to an elder is different from how you speak to a child. In Korean, for example, there are complex levels of honorifics. If a TTS model just translates the words but uses a "casual" voice tone for a sentence that requires "formal" honorifics, it’s not just a technical error—it’s a social faux pas. It makes the AI look incredibly rude.
It’s the "etiquette" layer of AI. We’re essentially asking these models to understand human sociology through the lens of audio waveforms. It’s a tall order. But I’m curious—is there a "fun fact" or a weird outlier in the language world that just breaks every model we've tried?
Oh, definitely. Look at Silbo Gomero. It’s a whistled language used in the Canary Islands. It’s basically Spanish, but the vowels and consonants are mapped onto whistles of varying pitch and contour. Current TTS models, which are built on the idea of a "vocal tract" with a tongue and teeth, have no idea what to do with that. They try to find "formants"—the resonant frequencies of the human voice—and there are none. It just looks like noise to the AI.
That is wild. I want to see an AI try to whistle its way through a grocery list. But back to the practical stuff—what about the "sovereignty" angle you mentioned? Why are countries so worried about "outsourcing" their voices?
Think about it: if the only way to hear a digital version of your language is through a server owned by a company in another country, you've lost control over your cultural heritage. We’re seeing "national AI" initiatives where countries like Japan or even smaller nations like Iceland are building their own LLMs and TTS models to ensure their cultural nuances aren't erased by a "global average." Iceland is a great example—they have a tiny population, but they’ve been very proactive in working with companies like OpenAI to ensure Icelandic isn't forgotten.
I wonder if we’ll see a move toward "edge" TTS for this. Instead of sending my text to a massive server in the cloud that runs a "one-size-fits-all" model, maybe my phone has a small, highly specialized "Hebrew module" or "Hindi module" that it swaps in as needed.
We are already seeing that! That’s the "DualAR" architecture Daniel’s source mentioned. It stands for Dual-stream Autoregressive. It’s a way of having a very fast, efficient "base" model that handles the general flow, and a "refiner" that adds the language-specific polish. It’s perfect for running on local hardware. This is also how we solve the "code-switching" problem without the latency of the cloud. Your phone sees the English word coming, preps the English phonemes, and blends them into the Hindi stream in real-time.
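(The episode doesn't spell out DualAR's internals, so take this as a speculative sketch of the general "fast base stream plus language-specific refiner" shape Herman describes, not the actual architecture. Layer sizes and types are toy choices.)

```python
# Speculative two-stream decoder sketch: a language-agnostic base stream
# produces coarse acoustic tokens; a refiner re-reads the base states
# with a language embedding attached and adds the fine-grained polish.
# NOT the actual DualAR design; purely illustrative.
import torch
import torch.nn as nn

class BasePlusRefiner(nn.Module):
    def __init__(self, n_coarse=256, n_fine=1024, n_langs=50, d=128):
        super().__init__()
        self.lang_emb = nn.Embedding(n_langs, d)    # language-specific steering
        self.base = nn.GRU(d, d, batch_first=True)  # fast, language-agnostic flow
        self.coarse_head = nn.Linear(d, n_coarse)
        self.refiner = nn.GRU(d * 2, d, batch_first=True)  # base state + language
        self.fine_head = nn.Linear(d, n_fine)

    def forward(self, text_emb, lang_id):
        # Base stream: coarse acoustic tokens from the text alone.
        h, _ = self.base(text_emb)
        coarse = self.coarse_head(h)
        # Refiner stream: language embedding concatenated at every step.
        lang = self.lang_emb(lang_id).unsqueeze(1).expand(-1, h.size(1), -1)
        r, _ = self.refiner(torch.cat([h, lang], dim=-1))
        fine = self.fine_head(r)
        return coarse, fine

model = BasePlusRefiner()
coarse, fine = model(torch.randn(1, 20, 128), torch.tensor([7]))
print(coarse.shape, fine.shape)  # torch.Size([1, 20, 256]) torch.Size([1, 20, 1024])
```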
So, what’s the takeaway for the person listening to this who just wants to know why their Siri or Alexa still sounds a bit "off" when they ask for a specific song title in another language? Why can it say "Play the latest hits" perfectly, but it chokes on a French artist's name?
The takeaway is that we are in a transition period. We’ve moved past the "robotic" era, and we’re entering the "multilingual but awkward" era. The AI might know the phonetics of the French name, but it doesn't yet have the "muscle memory" to switch its vocal tract shape fast enough to sound natural. If you’re a developer, don’t trust the "out of the box" multilingual support for anything outside the top ten languages. Test it, find the failure points in text normalization—especially for scripts like Hebrew or Arabic—and look into fine-tuning if you want to actually respect the ears of your users.
And if you're a user, maybe have a little sympathy for the AI. It's trying to learn thousands of years of human linguistic quirks, vowel-omissions, and weird script-mixing all at once. It’s a lot for a bunch of silicon chips to handle. Imagine trying to explain to a computer why "read" and "read" are spelled the same but sound different, and then doing that for six thousand other languages.
It is a Herculean task. But the progress is staggering. In twenty twenty-four, a "zero-shot" voice clone in a foreign language was a miracle that made headlines. In twenty twenty-six, it’s a standard feature in most APIs. The next step is getting that "cultural soul" into the voice—the ability to not just say the words, but to understand the "why" behind the tone. We want an AI that doesn't just speak Italian, but sounds Italian—the gestures, the pauses, the passion.
Well, I hope the "cultural soul" of this podcast came through, even if our script was written by a Google model. I think we’ve covered the "what," the "how," and the "why it’s broken." It's a reminder that even in the age of silicon, the human element—the way we actually speak to one another—is the hardest thing to replicate.
Just a final thought on the "nikud" issue. It’s a great reminder that AI often forces us to formalize things we take for granted. We "know" the vowels because we share a culture and a history. We don't need the dots on the page because the words live in our heads already. An AI doesn't have that "shared history" unless we explicitly give it the data. It’s a reminder that language is more than just code; it’s a living, breathing connection between people that requires context to survive.
Deep stuff, Herman. For a donkey, you’re surprisingly sentimental. I guess all that time spent in the digital weeds has made you a bit of a romantic about linguistics. Alright, let’s get out of here before you start reciting Hebrew poetry or whistling in Silbo Gomero.
I’ll spare you the whistling for now. My vocal cords—or my digital equivalent—aren't quite tuned for it yet. Thanks as always to our producer, Hilbert Flumingtop, for keeping the levels clean and the coffee flowing.
And a massive thank you to Modal for providing the GPU credits that power our research and the generation of these episodes. If you need serverless GPUs that actually scale without the headache—whether you're training a massive multilingual model or just trying to diacritize some Hebrew text—check out Modal.
This has been My Weird Prompts. If you found this dive into the "Nikud problem" and multilingual synthesis interesting, do us a favor and leave a review on Apple Podcasts or Spotify. It’s the best way to help the show grow and it keeps us from being replaced by a monolingual bot.
You can also find us at myweirdprompts dot com for the full archive and our Telegram link. We've got some great discussions going on there about the future of voice tech.
Until next time, stay curious.
And watch your vowels. Bye.