#666: The Tokenization Tax: AI’s Hidden Language Barrier

Is AI truly universal, or are we trapped in an English-speaking bubble? Discover how the "tokenization tax" impacts global AI equity.

Episode Details
Duration: 30:17
Pipeline: V4
TTS Engine: LLM

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

On a chilly February afternoon in Jerusalem, podcast hosts Herman and Corn Poppleberry sat down to tackle one of the most pressing, yet often overlooked, crises in the development of artificial intelligence: the linguistic digital divide. Triggered by a listener’s question regarding the performance of AI for speakers of "long-tail" languages, the brothers explored whether the current AI revolution is a universal human achievement or merely a sophisticated echo chamber for the English-speaking world.

The Great Data Exhaustion and the Long Tail

The discussion began with Herman defining the current state of AI training, a period researchers are calling the "Great Data Exhaustion." For years, AI developers have relied on the massive, easily accessible troves of English-language data found on the internet. However, as the industry runs out of high-quality English text to scrape, it is finally being forced to look toward the "long tail" of human language.

Herman explained that if you graph languages by the amount of available digital data, English is the undisputed king, followed by high-resource languages like Spanish, Chinese, and French. From there, the curve drops off sharply. "Long-tail" languages—such as Icelandic, Quechua, Wolof, or specific dialects of Arabic—have a much smaller digital footprint. This scarcity reflects not the number of speakers but differences in digital literacy and internet access, along with oral traditions that projects like Common Crawl have never archived.

The Tokenization Tax: A Literal Cost of Language

One of the most striking insights from the episode was the concept of the "tokenization tax." Corn and Herman broke down the technical reality that AI models do not read words but "tokens"—small chunks of text. In high-resource languages like English, common words are often a single token. In contrast, when a model encounters a long-tail language it has not seen much of, it often has to break a single word into five or six tiny, nonsensical fragments just to process it.

Herman argued that this creates a two-tiered system of AI utility. First, it fills up the model’s "context window" much faster, meaning a speaker of a long-tail language has a significantly shorter functional memory for their prompts compared to an English speaker. Second, because AI companies charge by the token, users of languages like Telugu or Amharic are literally paying more for the same amount of information. It is a financial and technical penalty for simply using one's native tongue.
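You can measure this tax directly. Below is a minimal sketch using the open-source tiktoken library to compare token counts for roughly equivalent sentences; the Telugu rendering is approximate, and exact counts vary by tokenizer, but the tokens-per-word ratio for low-resource scripts typically runs several times the English figure.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is a byte-pair encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

# Roughly equivalent sentences; the Telugu rendering is approximate.
samples = {
    "English": "The doctor said you should take this medicine twice a day.",
    "Telugu": "డాక్టర్ మీరు ఈ మందును రోజుకు రెండుసార్లు తీసుకోవాలని చెప్పారు.",
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    words = text.split()
    # "Fertility" = tokens per word; a higher ratio means a heavier tax.
    print(f"{lang}: {len(words)} words -> {len(tokens)} tokens "
          f"(~{len(tokens) / len(words):.1f} tokens/word)")
```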

The English-Speaking Bubble and Cultural Hallucination

The conversation then shifted to the "English-speaking bubble." Even when models are capable of speaking a long-tail language through a process called "cross-lingual transfer," they often carry a heavy Western bias. Herman explained that the model essentially "thinks" in the logic of its primary training data—English—and then maps those concepts onto the target language.

This results in "cultural hallucinations," where the AI might use grammatically correct words but apply Western-centric values to concepts like family, justice, or property. Corn noted that even in a mid-resource language like Hebrew, the AI often feels "stiff" or "formal," failing to capture the lived-in reality of modern slang or the blending of cultures. The risk, the brothers warned, is that the world is effectively being told that to use the most powerful tools in history, it must conform to a Western worldview.

Moving Toward Linguistic Sovereignty

Despite the challenges, the episode highlighted several beacons of hope. Herman pointed to a shift away from the "scrape everything" mentality toward more intentional, community-led data collection. He cited the Masakhane project in Africa as a primary example. Instead of relying on Silicon Valley to "solve" African languages, Masakhane is a grassroots organization of native speakers and researchers building their own high-quality, culturally relevant datasets.

The hosts also discussed the rise of specialized, sovereign models. Rather than one giant "God-model" trained on the whole web, nations and regions are building models tailored to their specific linguistic needs. The Jais model in the United Arab Emirates was highlighted as a success story—a model that outperformed much larger Western counterparts in Arabic because it was built from the ground up with that language and culture in mind.

Conclusion: The Future of the Digital Divide

As the sun set over the stone walls of Jerusalem, Herman and Corn concluded that the fight for language equity in AI is about more than just translation—it is about sovereignty. If the future of work and creativity is to be built on AI, then every culture must have the right to be "legible" to the machine on its own terms.

The "tokenization tax" and the English-speaking bubble are significant hurdles, but the move toward synthetic data and community-led initiatives offers a path forward. The goal, as Herman put it, is to ensure that the AI revolution does not harden into a new kind of digital divide, but instead provides a platform where the "long tail" of human culture can finally be heard.

Full Transcript

Episode #666: The Tokenization Tax: AI’s Hidden Language Barrier

Daniel's Prompt
Daniel
How does training data for "long-tail" languages—those with fewer speakers and less available online content—affect the performance and accuracy of AI models compared to major models like ChatGPT? Do users of these languages have a poorer experience because of the smaller training corpora, and are we in an "English-speaking bubble" regarding our expectations of AI capabilities?
Corn
Hey everyone, welcome back to My Weird Prompts. I am Corn, and I am sitting here in our living room in Jerusalem with my brother. It is a bit of a chilly February afternoon here, the seventeenth to be exact, and the light is hitting the stone walls in that specific way it only does in late winter.
Herman
Herman Poppleberry, at your service. It is a beautiful day outside, as you say, but I have been glued to my monitor looking at some fascinating data sets. I have been tracking the performance shifts in the latest frontier models that dropped over the last few months of twenty-twenty-five and early twenty-twenty-six, and the numbers are telling a very specific story.
Corn
Of course you have. You know, our housemate Daniel sent us a really provocative prompt this morning. He was listening back to some of our older discussions about how models from the East and West differ, and it got him thinking about a much broader issue. Specifically, the languages that get left behind. He asked: How does training data for long-tail languages—those with fewer speakers and less available online content—affect the performance and accuracy of AI models compared to major models like ChatGPT? Do users of these languages have a poorer experience because of the smaller training corpora, and are we in an English-speaking bubble regarding our expectations of AI capabilities?
Herman
Right, the long-tail languages. It is such a critical topic because we often talk about AI as this universal human achievement, but the reality under the hood is much more lopsided. We are currently in the middle of what researchers are calling the Great Data Exhaustion. We have basically run out of high-quality English text to train on, so the industry is finally being forced to look at the rest of the world, but the starting line is not equal.
Corn
Exactly. Daniel was asking whether users of these long-tail languages—languages with fewer speakers or just less of a digital footprint—are essentially getting a second-class experience. Are we living in an English-speaking bubble where we assume everyone is seeing the same level of magic from these models?
Herman
That is a piercing question. And the short answer is yes, we are absolutely in a bubble. But the technical reasons why, and the ways researchers are trying to bridge that gap, are where it gets really interesting. It is not just about having fewer books to read; it is about the fundamental way the machine perceives the structure of human thought.
Corn
Well, let us start with the basics for anyone who might not be familiar with the term. When we say long-tail languages, what are we actually talking about?
Herman
Think of a graph where you plot languages by the amount of data available on the internet. At the very top, you have English, which is the undisputed king, making up over fifty percent of the entire web. Then you have languages like Spanish, Chinese, French, and German. These are high-resource languages. But as you move down the curve, it drops off fast. You hit languages like Icelandic, which has a tiny population of only about three hundred and seventy thousand people but a very high digital literacy, and then you get into languages like Quechua or Wolof or even certain dialects of Arabic where there just is not a lot of digitized text for a model to scrape. These are the low-resource or long-tail languages.
Corn
And that scraping part is key, right? Because models like ChatGPT or Gemini or the newer Llama four models are trained on things like Common Crawl.
Herman
Exactly. Common Crawl is this massive, multi-petabyte dataset that tries to archive the entire web. But the web itself is not a representative sample of human thought. It is a representative sample of who has had high-speed internet and the desire to blog or post on forums over the last thirty years. If your culture has a strong oral tradition or if your language was suppressed or if your community just did not have widespread internet access until recently, you are going to be under-represented in that training data. In fact, for many African and Indigenous languages, the amount of data available is less than one-one-thousandth of what is available for English.
Corn
So if I am a speaker of a language with, say, only a few hundred thousand speakers, and I go to one of these major models, what is my actual experience? Is it just that it does not understand me, or is it that it is literally worse at thinking in my language?
Herman
It is both, and it is even more subtle than that. There is something called the tokenization tax. This is something most English speakers never even think about. You see, AI models do not read words; they read tokens, which are little chunks of text. In English, a common word like apple might be one token. But in a long-tail language that the model has not seen much of, it might have to break a single word into five or six tiny, nonsensical fragments just to process it.
Corn
Wait, so the model is working harder just to read the prompt?
Herman
Much harder. Imagine if every time you read a sentence in English, you had to spell out every third word letter-by-letter. It would slow you down and make it harder to remember the beginning of the sentence by the time you get to the end. That is exactly what happens to the model. It means the model has a shorter effective memory in those languages because its context window gets filled up with these tiny fragments. If an English speaker can fit a fifty-page document into the model's memory, a speaker of a long-tail language might only be able to fit ten pages because their words are being chopped into so many more tokens.
Corn
That sounds incredibly unfair. It also means it is more expensive to run, right?
Herman
Precisely. Since most AI companies charge by the token, a user writing in a language like Telugu or Amharic is literally paying more for the same amount of information than someone writing in English. It is a literal tax on your native tongue. But more importantly, if the model has not seen enough examples of how your language actually works—the grammar, the idioms, the cultural context—it starts to hallucinate more. It tries to force the logic of English onto your language. It might use the right words but in a way that feels uncanny or grammatically "off" because it is essentially translating its internal English thoughts on the fly.
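To put rough numbers on both penalties Herman describes, the arithmetic below assumes a 128,000-token context window, 400 words per page, a flat hypothetical price per thousand tokens, and made-up fertility figures; none of these correspond to a specific model, but the ratios are what matter.

```python
# All figures are illustrative assumptions, not real model specs or prices.
CONTEXT_WINDOW = 128_000      # tokens of model memory
WORDS_PER_PAGE = 400
PRICE_PER_1K_TOKENS = 0.01    # dollars, hypothetical flat rate

fertility = {"English": 1.3, "Amharic": 6.5}  # assumed tokens per word

for lang, tokens_per_word in fertility.items():
    pages_in_context = CONTEXT_WINDOW / (tokens_per_word * WORDS_PER_PAGE)
    cost_per_page = tokens_per_word * WORDS_PER_PAGE / 1000 * PRICE_PER_1K_TOKENS
    print(f"{lang}: ~{pages_in_context:.0f} pages fit in context, "
          f"~${cost_per_page:.4f} per page of prompt")
```

With these assumed figures, the Amharic user fits five times fewer pages into the model's memory and pays five times more per page, which is the two-tiered system in miniature.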
Corn
That is a fascinating point. It is almost like the model is thinking in English and then using a very sophisticated but slightly broken dictionary to spit out the results in the target language.
Herman
That is exactly what is happening in many cases. There is this phenomenon called cross-lingual transfer. It is actually one of the miracles of modern AI. You can train a model mostly on English, and then give it a tiny bit of, say, Yoruba, and it will suddenly be able to speak Yoruba surprisingly well. The model learns the underlying concepts—the idea of a cat or the concept of justice—in English, and then it maps those concepts to the new vocabulary.
Corn
But that sounds like a double-edged sword. If it is learning the concept of justice primarily through an English-language, Western-centric lens, does that carry over when it speaks a totally different language?
Herman
You hit the nail on the head. That is the English-speaking bubble Daniel was mentioning. We are not just exporting a language; we are exporting a worldview. If the model’s reasoning was forged in the fires of Reddit, Wikipedia, and digitized Western philosophy, its responses in any language are going to reflect those biases. It might use the right words for a local cultural concept but completely miss the nuance of how that concept actually functions in that society. For example, the concept of "family" or "property" can vary wildly between a Western individualist society and a more collectivist culture. The AI, having been raised on the Western web, will default to the individualist interpretation even when speaking a language from a collectivist culture.
Corn
I imagine this is something we see even here in Jerusalem with Hebrew. Hebrew is an interesting case because it is a high-tech society, but the language itself has a relatively small number of speakers—maybe nine or ten million people.
Herman
Right. Hebrew is what I would call a mid-resource language. It is not in the long-tail, but it is nowhere near the scale of English. And you can feel it. If you use a model in Hebrew, it often feels a bit formal, a bit stiff. It sounds like a very polite textbook from the nineteen-fifties. It struggles with modern slang or the way we blend Hebrew and Arabic or Hebrew and English in daily life. It lacks that lived-in feel. And because Hebrew is written from right to left, there are still occasional technical glitches in how the models display the text or handle punctuation.
Corn
And if it feels that way for Hebrew, I can only imagine what it is like for a language with one million speakers and very little digitized literature. So, what are the implications here? Are we basically telling the world that if you want to use the most powerful tools ever created, you have to do it in English?
Herman
That is the risk. We could be creating a new kind of digital divide. In the past, the divide was about who has the hardware—who has the laptop or the smartphone. Now, it might be about whose culture is legible to the machine. If you are a developer in Vietnam or Kenya and you want to build an app for your local community, you are starting at a disadvantage if the underlying model does not understand the linguistic nuances of your users. You are building on a foundation that was not made for you.
Corn
But Herman, you are always reading about the latest breakthroughs. Surely people are working on this. Is there a way to fix the data scarcity problem?
Herman
Oh, people are working on it with incredible intensity. There are a few different strategies. One is called synthetic data, which became the big trend in twenty-twenty-four and twenty-twenty-five. If you have a small amount of high-quality text in a rare language, you can use a large model to generate more text in that style. It is like using a small seed to grow a whole field of data.
Corn
That sounds like a bit of a circular logic problem, though. If the model is already biased, won't the synthetic data just be more biased?
Herman
Yes, it is a huge concern. It is like an echo chamber. If the model makes a slight grammatical error in the rare language and then trains on its own error, that error becomes baked into the next generation of the model. This is called model collapse. Another approach, which I find much more inspiring, is community-led data collection. There is a project called Masakhane in Africa, which is a grassroots organization of researchers working to build translation models for African languages. They are not just scraping the web; they are actively working with native speakers to create high-quality, culturally relevant datasets. They are proving that you do not need a trillion words if the words you have are high-quality and representative.
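As a schematic sketch of the seed-and-grow loop Herman describes, the version below builds in the two guards against that echo chamber: it conditions generation only on the original human-written seeds, never on prior synthetic output, and it filters every candidate before keeping it. Both llm_complete and passes_quality_filter are hypothetical placeholders, not real APIs.

```python
import random

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a call to a large generative model."""
    raise NotImplementedError

def passes_quality_filter(text: str) -> bool:
    """Placeholder: in practice, native-speaker review or a learned classifier."""
    return len(text.split()) > 3

def grow_corpus(seed_texts: list[str], rounds: int = 1000) -> list[str]:
    corpus = list(seed_texts)
    for _ in range(rounds):
        # Condition only on human-written seeds, never on prior synthetic
        # output, so generation errors cannot compound across rounds.
        example = random.choice(seed_texts)
        candidate = llm_complete(
            "Write one new sentence in the same language and style as:\n"
            + example
        )
        if passes_quality_filter(candidate):
            corpus.append(candidate)
    return corpus
```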
Corn
That seems much more sustainable. It is about sovereignty, right? Giving people the tools to represent themselves to the AI rather than just being scraped by a company in California.
Herman
Exactly. And there is also a move toward smaller, specialized models. Instead of one giant model that tries to know everything, you might have a model that is specifically fine-tuned for the languages of the South Pacific or the indigenous languages of North America. These models can be more efficient and more accurate because they are not trying to carry the weight of the entire internet. We saw this with the Jais model in the United Arab Emirates, which was specifically built to be the world's best Arabic model. It outperformed much larger Western models in Arabic because it was built from the ground up with that linguistic logic.
Corn
I want to go back to this idea of the reasoning versus the language. Daniel mentioned this in his prompt—the idea that maybe we can train a model to reason in a high-resource language and then just give it a language twist at the end. Is that actually how it works? Can you separate thinking from speaking?
Herman
That is one of the biggest debates in linguistics and AI right now. It relates to the Sapir-Whorf hypothesis—the idea that the language you speak shapes the way you think. In AI, we see that models do develop a sort of universal internal representation. If you look at the hidden layers of a transformer model, the concept of a tree looks very similar whether the input was the word tree in English or árbol in Spanish.
Corn
So there is a universal language of thought inside the machine?
Herman
To an extent, yes. But the mapping is never perfect. Different languages carve up the world in different ways. Some languages have twenty different words for types of snow; others might not distinguish between blue and green. If the model’s internal reasoning is built on a language that does not make those distinctions, it might struggle to reason about those concepts even if it knows the right words in the target language. It is like trying to describe a rainbow to someone using only a black and white vocabulary. You can use the words, but the underlying understanding is missing.
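One rough way to see that shared representation is to embed a sentence and its translation with an off-the-shelf multilingual encoder and compare the vectors. This sketch assumes xlm-roberta-base; raw cosine scores from such models tend to run high across the board, so the gap between the translation pair and the unrelated pair, not the absolute values, is the signal.

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

# A multilingual encoder; any comparable model would do for this probe.
name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states into one sentence vector."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.last_hidden_state.mean(dim=1).squeeze()

a = embed("The tree stands near the river.")
b = embed("El árbol está cerca del río.")   # Spanish translation
c = embed("The invoice is overdue.")        # unrelated meaning

print("tree/árbol:", torch.cosine_similarity(a, b, dim=0).item())
print("tree/invoice:", torch.cosine_similarity(a, c, dim=0).item())
```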
Corn
That really highlights the bubble. We think the AI is being objective, but it is actually just being very, very good at English-style objectivity.
Herman
Right. And then you have the issue of the safety filters. This is something people often miss. Most of the work on making AI safe—preventing it from giving instructions for dangerous things or using hate speech—is done in English. When you translate those models to long-tail languages, those safety filters often break. There have been cases where a model that is perfectly polite in English will suddenly start spewing toxic content in a language where the safety training was thinner. Researchers call this language-based jailbreaking.
Corn
That is terrifying. So you could have a situation where the most vulnerable populations are being served the least safe versions of the technology.
Herman
Precisely. It is a massive oversight. And it is not just about toxicity; it is about accuracy in critical fields like medicine or law. If you are using an AI to translate medical advice into a local dialect and it misses a crucial negation—like saying you should take a medicine instead of you should not take it—the consequences are real. In twenty-twenty-five, there was a documented case of a translation AI giving incorrect dosage instructions for a common antibiotic in a rural province because it confused two similar-sounding words in the local dialect.
Corn
It feels like we are at a crossroads. Either we accept that AI is an English-first technology that the rest of the world has to adapt to, or we make a conscious, massive effort to diversify the training foundations.
Herman
I think the latter is starting to happen, but it is a race against time. The major labs are starting to realize that the next billion users are not going to be English speakers. If they want to be global companies, they have to solve this. But it is expensive. Collecting high-quality data in a thousand different languages is a monumental task compared to just scraping the English-speaking web. It requires boots on the ground, linguists, and cultural consultants.
Corn
It makes me wonder about the future of the internet itself. If everyone starts using AI to generate content, and that AI is primarily trained on English, will we see a homogenization of world languages? Will the long-tail languages start to sound more like English because that is how the AI speaks them?
Herman
That is a legitimate fear among linguists. We call it linguistic leveling. It is the idea that the unique quirks and structures of smaller languages might get smoothed over because the AI provides a path of least resistance. If the AI always suggests a certain way of phrasing something that is grammatically correct but culturally English, people might just start using it because it is easier. Over a few generations, the language itself could lose its unique character.
Corn
It is like the G P S effect. Once you start relying on the map, you stop learning the landmarks. If we rely on English-biased AI to communicate, we might lose the landmarks of our own native tongues.
Herman
That is a great analogy. But on the flip side, AI could also be a tool for language revitalization. There are people using AI right now to document endangered languages, to create dictionaries for languages that have never had them, and to help young people learn the tongues of their ancestors. In late twenty-twenty-five, a group in New Zealand used a custom model to help teach Te Reo Māori to thousands of students, using an AI that was specifically trained on archival recordings of elders. It all depends on who is holding the steering wheel.
Corn
I think that is a really important point to end this segment on. It is not just about the data; it is about the intent. If we treat these languages as an afterthought, they will remain in the long-tail. But if we see them as essential to the human experience, we can use these tools to amplify them.
Herman
Absolutely. And I think we should dig into some of the more technical aspects of how this works in our next segment. I want to talk about how researchers are actually measuring this gap, because there are some really surprising benchmarks out there that show just how wide the chasm is.
Corn
Let us do it. But before we get into the weeds, I want to take a quick second to say if you are enjoying this deep dive, we would really appreciate it if you could leave us a review on your podcast app or on Spotify. We are doing this as a collaboration between us and our housemate Daniel, and hearing from you guys really helps us know what topics you want us to tackle next.
Herman
Yeah, it genuinely helps the show grow. We love seeing those notifications pop up. It makes the long hours of data-crunching feel worth it.
Corn
Alright, Herman, you mentioned benchmarks. When we say an AI is worse in a certain language, how do we actually know? Is there a standardized test for this?
Herman
There are several. One of the most famous is called M M L U, which stands for Massive Multitask Language Understanding. It is a huge set of questions across fifty-seven subjects like S T E M, the humanities, and more. When you run M M L U in English, the top models are scoring in the eighty or ninety percent range. But when you translate that same test into a low-resource language, the scores often plummet.
Corn
How big is the drop?
Herman
Oh, it can be staggering. You might go from an eighty-five percent in English to a thirty or forty percent in a language like Swahili or Bengali. And that is for the same model! The reasoning ability is there, but it gets lost in translation. It is like trying to take a physics exam in a language you only learned six months ago. You know the physics, but you cannot understand the questions or articulate the answers. In twenty-twenty-four, a study showed that for over seventy percent of the world's languages, the leading AI models performed no better than random guessing on complex reasoning tasks.
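For a sense of how those per-language scores are produced, here is a schematic of the grading loop: the same multiple-choice items, translated into each target language, are posed to one model and graded identically. ask_model and translated_mmlu are hypothetical placeholders for whatever API and translated dataset are being tested.

```python
def ask_model(question: str, choices: list[str]) -> int:
    """Hypothetical wrapper: returns the index of the model's chosen answer."""
    raise NotImplementedError

def accuracy(items: list[dict]) -> float:
    """Each item: {"question": str, "choices": [str], "answer": int}."""
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

# The headline comparison runs the *same* translated test per language:
# for lang, items in translated_mmlu.items():   # hypothetical dataset
#     print(lang, accuracy(items))
```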
Corn
That is a perfect way to put it. It is a bottleneck. The intelligence is trapped behind a linguistic barrier.
Herman
Exactly. And researchers are finding that even when the model gets the answer right, the path it takes to get there is different. In English, it might use a logical, step-by-step process. In a low-resource language, it might rely more on pattern matching or just guessing based on a few keywords it recognizes. This is why the responses can feel shallow or repetitive.
Corn
This brings us back to Daniel's point about the English-speaking bubble. If we only look at the English benchmarks, we think we are approaching Artificial General Intelligence. But if we look at the global benchmarks, we realize we are still very far away.
Herman
Right. We are building a very smart English-speaking brain. But a truly general intelligence should be able to reason across any human medium. There is this other benchmark called Flores-two hundred, which was released by Meta. It is a translation benchmark for two hundred different languages. It showed that for many of these languages, the state-of-the-art models are still barely functional. They can do basic translation, but they fail at anything involving nuance, sarcasm, or complex sentence structures.
Corn
So, for someone living in a country where one of these languages is dominant, the AI revolution probably feels a lot less revolutionary.
Herman
I think that is a very fair assessment. If your interactions with the world's most advanced tech are filled with errors, slow responses, and cultural tone-deafness, you are not going to be an early adopter. You are going to see it as a toy or a gimmick rather than a fundamental tool. This creates a feedback loop where people in those regions use the tools less, which means less data is generated, which means the models don't improve for them.
Corn
What about the companies themselves? Are they being transparent about this? When I sign up for a service, does it tell me, hey, this works great in English but you are on your own in Icelandic?
Herman
Not really. They usually just list the supported languages in a long menu. They might have a little disclaimer buried in the terms of service, but they want to project an image of universal capability. It is actually a bit of a marketing problem. If they admitted how much worse the models are in other languages, it would hurt their global expansion plans. However, in the last year, we have seen some companies start to release "model cards" that specifically break down performance by language, which is a step in the right direction.
Corn
It is interesting because we see this in other areas of tech too. Voice recognition was notorious for this for years—it worked great for people with standard American accents and terribly for everyone else.
Herman
Exactly! It is the same underlying issue. The training data comes from a specific demographic, so the product works best for that demographic. But with large language models, the stakes are higher because we are talking about the primary interface for information in the future. If the AI is the new librarian of the world, we need to make sure it speaks more than one language.
Corn
So what is the takeaway for our listeners who might be in that bubble? I mean, we are sitting here in Jerusalem, we speak English at home, we are using these tools in English. How should we be thinking about this?
Herman
I think the first step is just awareness. When you see a headline saying AI has passed the Bar Exam or the Medical Licensing Exam, remember that it passed the English version of those exams. We should be asking, could it pass the exam in Hindi? Could it pass it in Portuguese? If the answer is no, then the AI hasn't really mastered the subject; it has mastered the subject as expressed in English. It is a subtle but vital distinction.
Corn
It is a humility check for the industry.
Herman
Definitely. And for developers, it is a call to action. We need to be investing in local datasets. We need to be supporting open-source projects that prioritize linguistic diversity. There is a model called B L O O M, which was a massive collaborative effort involving hundreds of researchers. They made a conscious effort to include many more languages than the typical proprietary models. It was one of the first models where English was not the majority of the training data.
Corn
I love that. It is about building a bigger table rather than just inviting more people to sit at the small one.
Herman
Well said, Corn. And I think we are seeing a shift in the research community. There is a lot more prestige now in working on low-resource languages. It is seen as one of the hard problems left to solve. In twenty-twenty-five, the top prize at the leading AI conference went to a team that developed a new way to train models on purely oral languages with no written script.
Corn
You know, it occurs to me that this is not just a technical problem; it is a political one. If a government wants its citizens to be competitive in the twenty-first century, they need to make sure their national language is well-represented in these models.
Herman
You are seeing that already. Countries like France and the United Arab Emirates are investing billions in building their own national AI models. They realize that they cannot rely on Silicon Valley to preserve their linguistic and cultural heritage. They want models that understand their specific legal systems, their histories, and their social norms. It is a matter of national sovereignty.
Corn
It is almost like a new space race, but instead of the moon, we are racing to map the human mind in every possible language.
Herman
I like that. The linguistic space race. And the winners won't just be the ones with the most compute; they will be the ones with the most diverse and high-quality data. The goal is an AI that doesn't just translate, but actually understands the world through a thousand different windows.
Corn
So, to wrap this up, Daniel's prompt really hit on a fundamental truth. We are in a bubble, and that bubble is shaped like the English-speaking world. But the walls of that bubble are starting to thin as people realize that a truly global AI has to speak the whole world's language.
Herman
Exactly. It is a journey from English-centric AI to a truly polyglot intelligence. It is going to take a lot of work, a lot of community involvement, and a lot of rethinking how we value data. We have to move past the idea that more data is always better and start realizing that diverse data is what actually leads to intelligence.
Corn
Well, Herman, I think we have thoroughly explored the long-tail today. I feel like I have a much better handle on why my Hebrew prompts sometimes sound like they were written by a very confused Victorian gentleman.
Herman
Haha, exactly. He is just trying his best with the tokens he has been given. He is a very smart gentleman, he just needs a better dictionary and a few more years of living in the modern world.
Corn
Before we go, I want to remind everyone that you can find us on Spotify and at our website, myweirdprompts dot com. We have an R S S feed there if you want to subscribe, and there is a contact form if you want to send us a prompt like Daniel did. We love hearing your weird ideas, especially the ones that make us question our own bubbles.
Herman
And seriously, leave that review. It really does help us reach more people and keep these deep dives going.
Corn
Thanks for listening to My Weird Prompts. I am Corn.
Herman
And I am Herman Poppleberry. We will see you next time.
Corn
Bye everyone.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.