#2060: The Tokenizer's Hidden Tax on Non-English Text

Why does a simple greeting in Mandarin cost more to process than in English? It's the tokenizer's hidden inefficiency.

Episode Details
Episode ID
MWP-2216
Published
Duration
22:58
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

When you type a simple greeting in Mandarin, like "ni hao," it might look like two characters on your screen. Behind the scenes, however, a standard AI model trained primarily on English data could chop that greeting into six, seven, or even ten numerical pieces. This inefficiency isn't just a technical curiosity; it's a direct cost driver and a performance bottleneck. At the heart of this issue is the tokenizer, the invisible translator that sits between human language and machine math. It is one of the most overlooked components in artificial intelligence, yet it dictates whether a model is fast, cheap, and globally capable.

The Transformer's Translator
Modern AI models, particularly transformers, are essentially massive calculators performing linear algebra on vectors of numbers. They have no innate concept of letters or words. The tokenizer's job is to convert raw text into a sequence of these numerical vectors, called tokens. The challenge is finding the right balance. Using individual characters results in sequences that are too long, making the computational cost of the self-attention mechanism—which grows quadratically with sequence length—prohibitively expensive. Using whole words creates a vocabulary so vast it becomes mathematically unmanageable. Subword tokenization emerged as the solution, breaking words into meaningful chunks like "ap" and "ple" or "un" and "characteristically." This middle ground preserves semantic meaning while keeping sequence lengths manageable.
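To make the trade-off concrete, here is a toy comparison in Python. The subword split is invented for illustration and does not come from any real vocabulary; the cost function just counts pairwise token interactions.

```python
# Sketch: why sequence length matters. Self-attention cost grows with
# the square of the number of tokens, so character-level input is far
# more expensive than subword input for the same text.
text = "uncharacteristically long sequences are expensive"

# Character-level: every character is its own token.
char_tokens = list(text)

# Hypothetical subword split (illustrative only, not a real vocabulary).
subword_tokens = ["un", "character", "istic", "ally", " long",
                  " sequences", " are", " expensive"]

def attention_cost(n: int) -> int:
    """Pairwise token interactions: n^2, ignoring constant factors."""
    return n * n

print(len(char_tokens), attention_cost(len(char_tokens)))    # 49 -> 2401
print(len(subword_tokens), attention_cost(len(subword_tokens)))  # 8 -> 64
```

Eight subword tokens versus forty-nine characters is roughly a 37x difference in attention work for the same sentence.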

Algorithms and Their Biases
Several key algorithms power today's tokenizers, each with distinct approaches. Byte-Pair Encoding (BPE), the grandfather of modern methods, is a greedy, frequency-based algorithm. It starts with individual characters and iteratively merges the most frequent adjacent pair of symbols in the training data (after the first merge, these units are no longer single characters) until a target vocabulary size is reached. This makes it exceptionally good for languages with repetitive patterns, like English, which is why early GPT models were so effective with it.
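The merge loop can be sketched in a few lines of Python. This is a minimal toy on a four-word corpus, ignoring the byte-level details and pre-tokenization real implementations add.

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Minimal BPE sketch: greedily merge the most frequent adjacent pair."""
    words = [list(w) for w in corpus]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the winning merge everywhere before counting again.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

corpus = ["the", "then", "there", "that"]
print(bpe_train(corpus, 2))  # [('t', 'h'), ('th', 'e')]
```

On this corpus "t"+"h" is the most frequent pair (it appears in all four words), so "th" is learned first, then "th"+"e" — exactly the example in the text.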

BERT, developed by Google, uses WordPiece, a more sophisticated cousin. Instead of just counting raw frequency, WordPiece merges units based on how much they increase the probability of the training data, optimizing for information gain. This mathematical rigor helped BERT excel at understanding linguistic nuances. However, both BPE and WordPiece share a critical weakness: they typically require pre-tokenization based on spaces. This works well for languages that use spaces but fails for languages like Japanese or Thai, which do not.

This is where SentencePiece revolutionized the field. Released by Google in 2018, SentencePiece treats the entire input as a raw stream of Unicode characters, ignoring human-defined rules like spaces. It even treats the space itself as a character, often represented by an underscore. This language-agnostic approach makes SentencePiece the backbone of models like T5, Llama, and Mistral, enabling them to process diverse global languages without needing pre-defined word boundaries.
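The space-as-character convention is easy to sketch. The marker below is the ▁ symbol (U+2581) that SentencePiece actually uses; everything else here is a simplification of its behavior.

```python
# SentencePiece-style convention: the space is replaced by a visible
# marker so the text becomes one unbroken stream of symbols, with no
# language-specific word-splitting rules needed.
MARKER = "\u2581"  # the underscore-like symbol: ▁

def to_stream(text: str) -> str:
    return text.replace(" ", MARKER)

def from_stream(stream: str) -> str:
    return stream.replace(MARKER, " ")

s = to_stream("ni hao world")
print(s)                                  # ni▁hao▁world
print(from_stream(s) == "ni hao world")   # True — lossless round trip
```

Because the space is just another symbol in the stream, the same code path handles English, Japanese, or Thai without caring whether word boundaries exist.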

The "Token Tax" and Representation
A significant consequence of these design choices is what can be called the "token tax" on non-English languages. Most tokenizers are trained on internet scrapes that are overwhelmingly English. Consequently, the vocabulary is optimized for English words, where common terms like "challenge" might be a single token. For a language with less representation in the training data, a single word might be broken down into many basic units or even individual bytes.

This has two major impacts. First, cost: since AI services often charge per token, processing a sentence in Telugu can cost significantly more than the same sentence in English. Second, effective context window: if a model has a 32,000-token limit and a language requires four tokens per word, it can only fit 8,000 words, whereas an English user might fit 25,000 words in the same space. This disparity highlights a critical need for larger, more diverse vocabularies that can represent the world's languages more equitably.
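The arithmetic is simple enough to sketch. The tokens-per-word figures below are the illustrative ones from the text, not measurements of any particular tokenizer.

```python
# Effective context window: a fixed token limit buys very different
# word budgets depending on the tokens-per-word ratio for a language.
CONTEXT_LIMIT = 32_000

def effective_words(tokens_per_word: float) -> int:
    return round(CONTEXT_LIMIT / tokens_per_word)

print(effective_words(4.0))   # 8000 words for a poorly represented language
print(effective_words(1.28))  # 25000 words for English-like tokenization
```

Same model, same limit — a threefold difference in how much text fits in memory, purely from the tokenizer's vocabulary.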

Optimization and the Unsung Heroes
To address these inefficiencies, newer tokenizers like tiktoken, used by GPT-4, have emerged. tiktoken is OpenAI's highly optimized BPE implementation, featuring a much larger vocabulary of around 100,000 tokens. Written in Rust, it uses aggressive optimizations and simplified regex rules to achieve blazing speeds, making it viable for real-time, large-scale applications.

Equally important are detokenizers, the unsung heroes that convert the model's numerical output back into human-readable text. This process is deceptively complex. It must correctly handle "glue" rules, such as prefixes denoting spaces or contractions that attach to previous words without a space. A failure here results in awkward outputs like "This is a test ." with a misplaced space before the period, undermining the perceived polish of the AI.
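A minimal sketch of such glue rules, using a SentencePiece-style space marker; the token shapes are invented for illustration.

```python
# Detokenization sketch: tokens carrying a leading space marker start
# a new word; everything else glues onto the previous token with no
# space (punctuation, contractions, suffixes).
SPACE = "\u2581"  # ▁

def detokenize(tokens: list[str]) -> str:
    out = []
    for tok in tokens:
        if tok.startswith(SPACE):
            out.append(" " + tok[len(SPACE):])
        else:
            out.append(tok)  # glue rule: attach directly
    return "".join(out).lstrip()

tokens = [SPACE + "This", SPACE + "is", SPACE + "a", SPACE + "test", "."]
print(detokenize(tokens))  # "This is a test." — no stray space before "."
```

Drop the glue rule and the period becomes its own "word", producing exactly the awkward "This is a test ." output the text describes.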

Finally, tokenizers manage special control tokens, such as `<|endoftext|>` or [SEP]. These aren't words but commands that tell the model when to stop generating or how to distinguish between different parts of a prompt. They are the invisible formatting marks that structure the model's reasoning.
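A sketch of how a stop token works in a decoding loop. The stand-in "model" is a fake callable, though 50256 really is the end-of-text ID in GPT-2's vocabulary.

```python
# Decoding-loop sketch: generation watches for the end-of-text ID and
# stops. The control token is consumed, never shown to the user.
EOS_ID = 50256  # <|endoftext|> in GPT-2's vocabulary

def generate(next_token, max_tokens: int = 100) -> list[int]:
    out = []
    for _ in range(max_tokens):
        tok = next_token(out)
        if tok == EOS_ID:
            break
        out.append(tok)
    return out

# A stand-in "model" that emits three tokens and then stops.
script = iter([11, 22, 33, EOS_ID])
print(generate(lambda ctx: next(script)))  # [11, 22, 33]
```

Without the check, the model would keep emitting tokens until the hard limit — which is exactly why a model with a broken stop token "doesn't know when to shut up."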

In conclusion, tokenization is far more than a simple pre-processing step. It is a foundational engineering choice that influences a model's efficiency, cost, and linguistic inclusivity. As AI continues to scale globally, the evolution of tokenizers will be pivotal in ensuring that the technology serves all languages, not just those that dominate the training data.


#2060: The Tokenizer's Hidden Tax on Non-English Text

Corn
Imagine you are sitting at your computer and you type a simple greeting in Mandarin, maybe ni hao, into a standard AI model trained primarily on English data. On your screen, it looks like two characters. But behind the scenes, in the guts of the machine, that simple greeting might be getting chopped into six, seven, or even ten different numerical pieces. It costs more to process, it takes up more of the model's memory, and sometimes, the model just flat-out gets confused.
Herman
It is the invisible tax of the digital age, Corn. We are talking about tokenization. It is the hidden machinery that sits between human thought and machine math, and honestly, it is one of the most overlooked bottlenecks in artificial intelligence today. Whether a model is fast, cheap, or actually understands a global language often comes down to the quality of its tokenizer. Think of it like a sieve; if the holes are too big, you lose the nuance. If they are too small, you're just left with a pile of dust that's hard to reconstruct.
Corn
Well, we are going to peel back the curtain on that machinery today. Today's prompt from Daniel is a deep dive into exactly this. He wants to know what tokenizers actually are, how they are developed for different world languages, and he is asking about the specific packages like SentencePiece, BPE, and tiktoken. We are also going to get into detokenizers, multimodal mapping, and even how you might tokenize something weird like a CSV file differently than a poem.
Herman
This is a great one because it moves us past the magic of the chatbot and into the engineering of the transformer. Oh, and before we really get rolling, just a quick note that today’s episode is powered by Google Gemini three Flash. It is the model behind the script today, which is actually quite meta considering we are discussing how these models perceive text.
Corn
Very meta. So, Herman, let's start with the absolute basics for the uninitiated, but let's not linger there. If a transformer cannot read raw text, what is it actually seeing? When I type a word, what is the first thing that happens?
Herman
The first thing to understand is that a transformer is essentially a massive calculator. It performs linear algebra on vectors of numbers. It has no concept of the letter A or the word apple. So, the tokenizer is the translator. It takes a string of text and breaks it into chunks called tokens. Now, early on, people thought, why not just use words? But words are messy. You have prefixes, suffixes, and millions of unique strings. If you try to give a model a vocabulary of every possible word in every language, the math becomes impossible. So, we use subword tokenization.
Corn
Right, so instead of "apple," it might see "ap" and "ple." Or for a complex word like "uncharacteristically," it might break it into four or five chunks. But why can't we just use characters? Why not just give the model the alphabet and let it figure it out? I mean, twenty-six letters is a much smaller vocabulary to manage than fifty thousand subwords.
Herman
Efficiency is the short answer. If you use characters, the sequence length becomes enormous. A five hundred word essay becomes three thousand characters. Since the computational cost of the self-attention mechanism in a transformer grows quadratically with the sequence length, character-level processing is incredibly expensive. If you double the length of the sequence, you quadruple the work the model has to do. Tokens are the middle ground. They are long enough to carry semantic meaning—like "ing" or "pre"—but short enough to keep the math manageable.
Corn
Okay, so it is a compression problem. But Daniel asked about the specific algorithms. I see these names floating around in research papers all the time. BPE, WordPiece, SentencePiece. They sound like different brands of the same thing, but I imagine the technical implementation varies quite a bit. Let's start with BPE because that seems to be the grandfather of the current crop.
Herman
Byte-Pair Encoding, or BPE, is fascinating because it actually started as a data compression algorithm back in the nineties. It was adapted for NLP by researchers who realized it was perfect for building a vocabulary from scratch. The way it works is greedy and iterative. You start with individual characters. Then, you look at your massive training data and find the most frequent pair of characters that appear next to each other. Let's say "t" and "h" appear together more than anything else. You merge them into a new token, "th." Then you repeat the process. Maybe "th" and "e" are the next most frequent. Now you have "the." You keep doing this until you hit a pre-defined vocabulary size, like thirty-two thousand or fifty thousand tokens.
Corn
So it is literally building a dictionary based on what it sees most often. That explains why GPT-2 and GPT-3, which used BPE, were so good at English. English is full of repeating patterns. But what happens when you throw a curveball at it? What about BERT? I know Google used something called WordPiece for that. How does it differ from the "greedy" approach of BPE?
Herman
WordPiece is the slightly more sophisticated cousin of BPE. While BPE just looks at raw frequency, WordPiece looks at likelihood. It asks, if I merge these two units, how much does it increase the probability of the training data? It is a bit more mathematically rigorous. It tries to ensure that the pieces it creates are the most informative pieces possible. It is why BERT was so effective at understanding the nuances of language in those early days of the transformer revolution. It wasn't just counting; it was optimizing for information gain.
Corn
But both of those have a bit of a weakness, right? They usually require pre-tokenization. You have to tell the algorithm where the words start and end using spaces before it can even begin its work. That works great for English or French, but what about languages that don't use spaces? If I am writing in Japanese or Thai, a space-based pre-tokenizer is going to have a very bad day.
Herman
That is exactly where SentencePiece comes in, and this is really where the field shifted toward true global utility. SentencePiece, which was released by Google around twenty-eighteen, treats the entire input as a raw stream of Unicode characters. It doesn't care about spaces. In fact, it treats the space itself as a character, often represented by a little underscore symbol in the underlying data. This is why SentencePiece is the backbone of models like T5, Llama, and Mistral. It is language-agnostic. It can look at a block of Chinese text and determine the best way to chunk it without needing to understand where a word supposedly begins.
Corn
I love that. It is basically saying, "I don't care about your human rules for grammar or punctuation; I am just looking for repeating patterns in the bits." But we should talk about the "token tax" Daniel mentioned. This is something people really feel in their wallets if they are building apps. Why does a sentence in Telugu cost ten times more to process than the same sentence in English?
Herman
It is a matter of training data representation. Most of these tokenizers are trained on massive scrapes of the internet, which is overwhelmingly English. If a tokenizer's vocabulary is full of common English words like "challenge" or "infrastructure," those only take one token. But if it has never seen a complex word in a different script, it has to fall back to the most basic units it knows, sometimes even individual bytes. So, a single word in a non-Latin script might be broken into eight different tokens. Since you pay per token and your context window is measured in tokens, the person using the non-English language is getting less intelligence for more money.
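Herman's byte-fallback point can be made concrete with the standard library's UTF-8 encoding; the example words are illustrative.

```python
# Byte fallback sketch: when a word is missing from the vocabulary, a
# byte-level tokenizer falls back to raw UTF-8 bytes. Latin characters
# cost one byte each, but many scripts cost several bytes per character.
english = "hello"
mandarin = "你好"  # "ni hao" — two characters on screen

print(len(english.encode("utf-8")))   # 5 bytes for five characters
print(len(mandarin.encode("utf-8")))  # 6 bytes for just two characters
```

Two visible characters becoming six fallback units is the mechanical heart of the token tax: the reader sees less text, the meter counts more tokens.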
Corn
It is essentially a lack of representation in the dictionary. If the dictionary doesn't have your word, you have to spell it out letter by letter, and spelling takes more space than just saying the word. That has to affect the attention mechanism too, right? If the model is looking back at a sequence of tokens, and those tokens are just tiny fragments of a word, it has to work harder to maintain the semantic meaning over that distance.
Herman
You hit the nail on the head. If the tokens are too small, the model's effective context window shrinks. If you have a thirty-two thousand token limit, and your language requires four tokens per word, you can only fit eight thousand words in memory. An English user might fit twenty-five thousand words in that same space. This is why we are seeing a massive push for larger vocabularies. GPT-4 uses something called tiktoken, which is OpenAI’s highly optimized BPE implementation. It has a much larger vocabulary, around one hundred thousand tokens, which makes it significantly more efficient across multiple languages compared to the older GPT-3 tokenizer.
Corn
Tiktoken is also incredibly fast, right? I remember seeing benchmarks where it was processing text at speeds that made the older Python-based tokenizers look like they were standing still. Why is there such a speed gap? Is it just the language it's written in?
Herman
Oh, it is blazing. It is written in Rust and uses very aggressive optimizations. But it's also about how it handles the regex—the regular expressions—that split the text initially. Traditional tokenizers often get bogged down in complex regex rules. Tiktoken simplifies that and uses thread-level parallelism. When you are serving millions of users, the latency of just the tokenization step actually starts to matter. If you can shave fifty milliseconds off the pre-processing, that is a huge win for the user experience, especially in real-time streaming applications.
Corn
Alright, so we have chopped the text up into numbers, the transformer has done its math, and now it spits out a sequence of numbers at the other end. We can't show those to the user. We need a detokenizer. Daniel called them the unsung heroes. How do we turn the numbers back into something a human can read without it looking like a jumbled mess of missing spaces?
Herman
Detokenization is trickier than it sounds because of those "glue" rules we mentioned. If the tokenizer used a specific prefix to denote a space, the detokenizer has to know exactly how to strip those out and stitch the words back together. SentencePiece makes this very elegant because the spaces are literal characters in the sequence. You just replace the underscores with spaces and you are done. But with BPE, you might have tokens that are meant to be attached to the previous word without a space, like a contraction or a punctuation mark. If the detokenizer fails, you get that classic error where the output looks like "This is a test ." with a weird space before the period.
Corn
I have seen that! It makes the AI look like it is stuttering. It is amazing how much of what we perceive as AI personality or polish is actually just the quality of the detokenizer and the post-processing script. But what about when a model needs to stop? Daniel asked about "special tokens." I see things like <|endoftext|> or [SEP] in technical docs. Are those part of the vocabulary too?
Herman
They are. Those are control tokens. They don't represent a string of text; they represent a command to the model. The [SEP] token in BERT tells the model "this is the end of sentence A and the beginning of sentence B." The <|endoftext|> token in GPT models is like a period at the end of a thought. Without these, the model wouldn't know when to stop generating or how to distinguish between different parts of a prompt. They are like the invisible formatting marks in a Word document.
Corn
And to Daniel's question about whether these are always under the hood—yes, absolutely. Even if you are using an open-weights model like Llama-3 or Mistral, the tokenizer is a mandatory companion. You cannot separate them. If you try to use a Llama-3 tokenizer with a Mistral model, you get absolute gibberish. It would be like trying to read a book where every letter has been shifted by five positions in the alphabet. The ID number four hundred and two might mean "the" in one model and "apple" in another.
Herman
You can't swap them. Now, where this gets really interesting is when we move beyond just raw text. Daniel asked about multimodal mapping. This is the frontier. How do you tokenize an image? You can't just find repeating sub-words in a picture.
Corn
Right, an image is a grid of pixels. Do we just treat every pixel as a token? That sounds like a computational nightmare. If a 1080p image has two million pixels, the attention matrix would be... well, it wouldn't fit in any GPU I know.
Herman
It would be impossible. Instead, models like the Vision Transformer, or ViT, break the image into small patches, maybe sixteen by sixteen pixels. Each patch is then flattened into a vector and treated as a token. It is a "visual word." More advanced models use things like VQ-VAE, or Vector Quantized Variational Autoencoders, to create a discrete vocabulary of visual parts. So the model might have a token that represents a vertical edge or a specific texture. It’s essentially turning the image into a "sentence" of visual concepts.
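The patch arithmetic Herman describes is easy to check; 224x224 is the standard ViT input size, and 16x16 is its usual patch size.

```python
# ViT-style patch tokenization: a 224x224 image cut into 16x16 patches
# yields a 14x14 grid — 196 "visual words" instead of ~150k pixels.
def num_patches(height: int, width: int, patch: int = 16) -> int:
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

print(num_patches(224, 224))  # 196 tokens for a standard ViT input
```

Each 16x16 RGB patch flattens to 16 * 16 * 3 = 768 values before projection, so the sequence stays short enough for quadratic attention to be affordable.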
Corn
It is like Lego bricks for images. The model sees a sequence of these visual tokens just like it sees a sequence of text tokens. But what about the unified approach? I have heard that some of the newer frontier models are trying to put everything into one big bucket.
Herman
That is the goal. Models like Gemini or Apple’s recent research on AToken are moving toward a unified ID space. Imagine a vocabulary of one hundred thousand tokens where the first fifty thousand are text, the next twenty thousand are visual patches, and the last thirty thousand are audio frames. This allows the transformer to process a video, a voice memo, and a text prompt all in the exact same mathematical space. It doesn't see them as different modalities; it just sees a sequence of tokens. It makes the model truly "natively" multimodal rather than just having a text model with an image "adapter" bolted onto the side.
Corn
That is wild. It is like a universal language of data. But let's get back to something more terrestrial. Daniel asked about CSV data versus raw text. If I am building a model to handle massive spreadsheets or financial data, would I want a different tokenizer? My gut says yes, because standard tokenizers are notoriously bad at math.
Herman
Your gut is right. If you give a standard tokenizer a number like one thousand two hundred thirty-four point fifty-six, it might break it into "1", "234", ".", and "56". From the model's perspective, the number one thousand two hundred thirty-four has no mathematical relationship to the number one thousand two hundred thirty-five. It is just a different sequence of strings. It doesn't "see" the value; it sees the label.
Corn
Right, it doesn't understand that the four is in the ones place and the three is in the tens place. It just sees them as arbitrary symbols. That explains why LLMs sometimes struggle with basic arithmetic—they are literally hallucinating the relationship between the tokens. But how do you fix that in the tokenizer?
Herman
Precisely. So, for data-heavy tasks, we use specialized strategies. One common approach is Digits-as-Tokens. You force the tokenizer to treat every single digit as its own unique token. So "1234" becomes the sequence "1", "2", "3", "4". This forces the model to learn the positional value of the numbers, almost like we do in elementary school with columns. It makes the sequence longer, which is the trade-off, but it makes the model significantly better at mathematical reasoning.
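The digits-as-tokens strategy can be sketched with a small regex — a toy splitter, not any production tokenizer.

```python
import re

def tokenize_digits(text: str) -> list[str]:
    # Force every digit to be its own token; keep runs of non-digit
    # characters together. This makes positional value explicit.
    return re.findall(r"\d|[^\d]+", text)

print(tokenize_digits("price: 1234.56"))
# ['price: ', '1', '2', '3', '4', '.', '5', '6']
```

The sequence gets longer, as Herman notes, but now "1234" and "1235" share three of four tokens instead of being unrelated strings.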
Corn
And for a CSV, you probably want the comma to be a very high-priority token that never gets merged with anything else. You need it to act as a structural anchor.
Herman
If the tokenizer merges a comma with the first digit of the next column, the entire table structure collapses in the model's mind. For structured data, you want a tokenizer that respects the delimiters. There are actually custom tokenizers used in high-frequency trading or scientific research that are designed specifically to preserve the integrity of the data format. They treat a row of a CSV as a single logical unit or a cell as a distinct entity. It's about preserving the "grid" even when the data is flattened into a line.
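A delimiter-respecting split of the kind Herman describes might look like this — an illustrative scheme, not a real tokenizer, and it ignores quoted fields that a full CSV parser would handle.

```python
# Structured-data sketch: commas are always standalone tokens so the
# column structure survives tokenization and can never be merged into
# a neighboring value.
def tokenize_csv_row(row: str) -> list[str]:
    tokens = []
    for cell in row.split(","):
        if tokens:
            tokens.append(",")  # the delimiter is its own anchor token
        tokens.append(cell)
    return tokens

print(tokenize_csv_row("2024,AAPL,187.44"))
# ['2024', ',', 'AAPL', ',', '187.44']
```

With the delimiters pinned, the model can always recover which column a value belongs to, even though the grid has been flattened into a line.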
Corn
It is amazing how much of the intelligence we see is just a result of how well the data was prepared for the machine to digest it. It reminds me of those "glitch tokens" Daniel mentioned in his notes. I remember reading about things like "SolidGoldMagikarp." For those who don't know, there were these weird tokens in the GPT-3 vocabulary that, if you typed them, the model would just lose its mind or start talking about weird Reddit threads.
Herman
Oh, the glitch tokens are a classic example of tokenizer failure. Because the tokenizer is trained on a massive scrape of the internet, it picks up weird strings that appear frequently, like a specific Reddit username or a piece of source code from a website. But if those strings don't appear in the actual training set for the transformer itself, the model never learns what they mean. It has a word in its dictionary that it has never seen used in a sentence.
Corn
It's like a person who knows the word "quincunx" exists in the dictionary, but they've never actually seen a quincunx or heard anyone use it. When someone says it, they just freeze.
Herman
It is like a blind spot in the model's eyes. It sees the word, but it has no mental image for it. I think the takeaway here for developers and even just curious users is that tokenization isn't just a pre-processing step. It is the foundation. If you are building something specialized, you have to look at your tokens. If you're building a legal AI, for example, your tokenizer better know common Latin phrases as single units, or you're wasting context window space on fragments.
Corn
That is a huge practical takeaway. Don't just look at the MMLU scores; look at the token-to-word ratio for your specific domain. So, Herman, what is the future here? Are we going to see a world where tokenizers are learned on the fly? Or are we moving toward a universal tokenizer that every model uses?
Herman
There is a lot of research into "token-free" models, things like Mamba or certain types of state-space models that try to process raw bytes directly. The idea is to remove the middleman entirely. If you can make the math efficient enough to handle raw bytes—which are the ultimate universal units—you eliminate the token tax, you eliminate glitch tokens, and you make the model truly universal. But for now, the transformer architecture is so dominant, and it relies so heavily on these discrete units, that I think we are going to see vocabularies just getting larger and more multimodal. We are moving from fifty thousand tokens to two hundred and fifty thousand tokens.
Corn
It is a fascinating evolution. We started with words, moved to sub-words, and now we are moving toward these unified multimodal atoms of data. It’s almost like we’re trying to build a digital periodic table that can describe any kind of information with the same set of elements.
Herman
And we have to remember that this all serves the goal of better compression. A better tokenizer is a better compressor. And as the saying goes in AI circles, compression is intelligence. The more information you can pack into a single mathematical operation, the more capable the model becomes. If one token can represent a whole concept rather than just a syllable, the model's "thought" per layer is much more dense.
Corn
Well, I feel like I have a much better handle on why my API bill is so high when I try to translate my poetry into ancient Greek. It is all about those sub-word mergers. I'm literally paying for the model to spell out words it doesn't recognize.
Herman
It usually is, Corn. It usually is. If you're using an older model for a rare language, you're essentially paying a "legacy tax" on every sentence.
Corn
Alright, I think we have covered the spread here. We looked at the major players—BPE, WordPiece, SentencePiece. We talked about why your CSVs might need a special touch, and how images are being chopped into patches. And we even touched on the weird world of glitch tokens and the future of byte-level processing.
Herman
It is a deep rabbit hole. If you are a developer listening to this, I highly recommend playing around with the Hugging Face tokenizers library. You can actually load different tokenizers and see exactly how they chop up your specific data. It is a very eye-opening exercise. Try putting in some code, some emojis, and some non-English text and see which tokenizer handles them most efficiently.
Corn
Definitely. It’s the kind of thing you can’t "un-see" once you’ve looked at it. Well, that is our deep dive for today. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power this show. If you are building something that needs serious compute—whether you're training a custom tokenizer or running a massive transformer—Modal is the way to go.
Herman
This has been My Weird Prompts. If you are enjoying the show, maybe leave us a review on Apple Podcasts or Spotify. It actually helps a lot in getting these deep dives into more ears. And if you have a question like Daniel's about the "guts" of these machines, send it our way.
Corn
And you can always find us at myweirdprompts dot com for the full archive and all the ways to subscribe. We will see you in the next one.
Herman
Later, everyone. Keep an eye on those tokens.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.