If you have ever spent five minutes looking at an A P I pricing page for a large language model, you have probably seen that familiar rule of thumb. You know the one I am talking about. It is the idea that one thousand tokens is roughly equal to seven hundred fifty words. It is the foundational metric of the generative era. It is how we budget, how we benchmark, and how we understand the scale of the data we are feeding into these digital brains. But here is the thing that has been bothering me lately, and it is something our housemate Daniel brought up in a prompt he sent over this morning. That rule of thumb is a total lie the second you step outside the world of plain text.
It really is. And it is not just a little bit off. It is fundamentally a different mathematical universe. Herman Poppleberry here, by the way. And Corn, you are absolutely right. When we talk about audio, images, or video, we are not just counting words anymore. We are dealing with continuous signals. Daniel was asking specifically about why the pricing and the mechanics of multimodal inputs feel so much like a black box compared to text. And he is hitting on the most important architectural shift in A I that has happened over the last year or two. We have moved from the digital sandwich approach, which we talked about way back in episode nine hundred ninety two, to these native multimodal models like Gemini and G P T four o. But for the average developer or even a power user, the way a minute of audio or a high resolution image actually gets turned into tokens is completely opaque.
It feels like we are flying blind. I mean, if I upload a ten minute audio file of a meeting, am I consuming ten thousand tokens? A hundred thousand? Is it based on the file size, the duration, or the complexity of the sound? And more importantly, what is actually happening under the hood? Daniel wanted to know if we need unique tokenizers for every single file type or if the model just sees everything as a giant stream of numbers. So, today we are going to crack open that black box. We are going to look at the mechanics of multimodal tokenization, the move toward unified latent spaces, and why your context window is probably disappearing a lot faster than you realize when you start throwing media at it.
I love this topic because it forces us to look at the difference between a discrete signal and a continuous one. Text is easy because humans already did the hard work of discretizing it for us. We have letters, we have words, we have spaces. But a sound wave or a frame of video is a continuous flow of information. There are no natural boundaries. So, how does a transformer, which is fundamentally a sequence processing engine, handle something that does not have clear steps?
That is the perfect place to start. Let us bridge that gap. In the old days, if I sent an audio file to an A I, there was a pre processing step. It would go through an automatic speech recognition system, get turned into a text transcript, and then that text would be tokenized. But that is not what is happening anymore, right? We are talking about native ingestion.
We have moved away from that two step process because you lose so much information in the middle. If you just transcribe audio to text, you lose the tone, the prosody, the background noise, the emotion. To solve this, researchers had to figure out how to tokenize the signal itself. To answer Daniel’s first question about whether we need specific tokenizers for every type of input, the answer is a bit of a yes and no. Technically, you do need an encoder that is trained to understand the specific structure of that data. You cannot just feed raw binary data from a J P E G into a text tokenizer and expect it to work. The text tokenizer is looking for patterns in characters. An image encoder is looking for spatial relationships, edges, and textures.
So, it is not about the file extension. It is not like there is a separate tokenizer for a W A V file versus an M P three file. It is about the underlying modality.
Right. Once the data is decoded into its raw form, like a sequence of audio samples or a grid of pixels, it goes through a process called Vector Quantization, or V Q. This is the secret sauce. Imagine you have a color wheel with every possible shade of blue. That is a continuous signal. Vector Quantization is like saying, I am going to pick one hundred specific shades of blue and give each one a number. Every time I see a shade on the wheel, I will just map it to the closest number in my codebook. Now, instead of an infinite range of blues, I have a discrete set of tokens.
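That codebook idea can be sketched in a few lines of Python. This is a minimal illustration only: the codebook values below are invented, and real codebooks are learned during training and live in a high dimensional vector space, not on a one dimensional line.

```python
# Minimal vector quantization sketch: map each continuous sample to
# the index of the nearest entry in a small, fixed codebook. The
# codebook below is made up for illustration.

def quantize(samples, codebook):
    """Return one discrete token (a codebook index) per sample."""
    tokens = []
    for x in samples:
        nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
        tokens.append(nearest)
    return tokens

codebook = [0.0, 0.25, 0.5, 0.75, 1.0]  # five "shades" instead of infinitely many
signal = [0.1, 0.6, 0.9, 0.52]          # a continuous signal
print(quantize(signal, codebook))        # -> [0, 2, 4, 2]
```

Every continuous value snaps to its closest code, so an infinite range of inputs becomes a finite vocabulary the transformer can treat like text tokens.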
That makes a lot of sense. So, the codebook is essentially the vocabulary for that modality. But that brings up an efficiency problem. If I am sampling audio at, say, sixteen kilohertz, that is sixteen thousand data points every single second. Even if I am smart about how I group them, that sounds like a massive amount of tokens compared to the few words a human can speak in a second.
You have hit on the exact reason why audio and video are so much more expensive and computationally heavy. In a typical text conversation, one thousand tokens might cover several paragraphs of deep thought. In audio, depending on the model's architecture, one thousand tokens might only cover twenty seconds of sound. For instance, some of the newer multimodal encoders take a chunk of audio, maybe twenty milliseconds long, and compress it into a single token. Even at that rate, you are looking at fifty tokens per second. A one minute clip suddenly becomes three thousand tokens. Compare that to a transcript of that same minute, which might only be one hundred fifty words or two hundred tokens.
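That back-of-envelope budget is easy to compute. The twenty millisecond chunk size follows the discussion above; real encoders vary, so treat these numbers as illustrative rather than any provider's actual rates.

```python
# Audio token budget, assuming one token per fixed-length chunk.
# chunk_ms=20 gives 50 tokens per second, per the discussion.

def audio_tokens(duration_seconds, chunk_ms=20):
    tokens_per_second = 1000 / chunk_ms
    return int(duration_seconds * tokens_per_second)

print(audio_tokens(60))   # -> 3000   (one minute clip)
print(audio_tokens(600))  # -> 30000  (a ten minute meeting)
```

Against a two hundred token transcript of the same minute, that is the factor of fifteen or so in sequence length.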
So, we are talking about a factor of fifteen or twenty in terms of sequence length. This really changes how we think about the K V cache, which we discussed in episode one thousand eighty one. If I am filling up my context window with these dense audio tokens, I am going to hit that memory wall much faster than I would with text.
Much faster. And this is where the transparency issue comes in. When you use an A P I like Gemini, they often charge you per second of audio rather than per token. They do that to make it easier for the user, but it hides the technical reality. Behind the scenes, that one second of audio is being expanded into a long sequence of vectors that the transformer has to attend to. This is why you see limitations on how much video or audio you can upload. It is not just about storage; it is about the quadratic cost of attention. Every extra audio token has to attend to every other token in the sequence, so the cost grows much faster than the input does.
I want to dig into the idea of the unified latent space. Daniel asked if multiple tokenizers function simultaneously when you send audio and text in the same call. If I am using the Gemini A P I and I send a prompt that says, listen to this clip and tell me what the person is feeling, how does the model actually look at both of those things at once? Are they two separate streams that get merged later, or are they interleaved?
In the most modern architectures, they are interleaved. This is the big breakthrough of models like G P T four o and the latest iterations of Gemini. They use what is called a multimodal prefix or an interleaved sequence. Imagine a single long line of tokens. The first ten might be the text of your prompt. The next five hundred might be the tokens representing the audio clip. The model’s attention mechanism treats them all as part of the same sequence. It can look at a text token and an audio token at the same time and calculate the relationship between them.
But for that to work, the audio tokens and the text tokens have to exist in the same mathematical space, right? If the text tokenizer outputs a number representing the word apple, and the audio tokenizer outputs a number representing a specific frequency, those numbers mean nothing to each other unless they have been aligned.
That is the alignment layer. During training, the model is shown millions of examples of audio paired with text descriptions. It learns that the vector for the spoken word apple and the vector for the written word apple should point in roughly the same direction in this high dimensional space. This is why we call it a latent space. It is a hidden, underlying map of concepts that is independent of how the concept was delivered. Whether I see a picture of a dog, hear a bark, or read the word D O G, the model’s internal representation should land in the same neighborhood of that map.
That is fascinating. It suggests that the tokenizer itself is almost just a translator at the border. Once you are inside the country of the model, everyone speaks the same language of vectors. But that leads to Daniel’s third question, which I think is a huge point of confusion for a lot of people. When we upload these files, are they entering an ephemeral R A G pipeline, or are they being ingested like text? For anyone who missed our earlier deep dives, R A G stands for Retrieval Augmented Generation. It is usually where you store a bunch of documents in a database and the A I goes and looks them up. People seem to think that because audio and video are big files, the A I must be doing some kind of lookup.
And that is a very common misconception. For most of these high end multimodal A P Is, it is not R A G. It is native ingestion. When you upload that video to Gemini, it is actually putting those frames directly into the context window. It is essentially reading the entire file as part of the prompt. This is why the context windows have grown so large. When Gemini announced a one million or even two million token window, a lot of text users were like, why would I ever need that many tokens? But the answer is video. A few minutes of high quality video at several frames per second, where each frame is subdivided into patches that are tokenized individually, can easily eat up hundreds of thousands of tokens.
So, it is not searching a database of your video; it is literally watching the video as it processes your question. That is a massive difference in terms of how the model can reason about the data. If it were R A G, it might only find the specific parts of the video that seem relevant to your keywords. But with native ingestion, it can understand the temporal flow, the subtle changes over time, and the connection between a sound in the beginning and an action at the end.
But it also means you are paying for every single one of those tokens in every turn of the conversation. If you are in a chat session and you upload a video, and then you ask five follow up questions, you are re processing that video sequence every single time unless the system is using some very clever caching. This is where the tokenization tax really starts to hurt. We talked about this in episode six hundred sixty six regarding language barriers, but there is a similar tax for multimodal data. If the tokenizer is inefficient, you are essentially wasting money on redundant data.
Let us talk about that efficiency. How do we measure it? With text, we can look at the compression ratio. If a tokenizer can represent a complex idea in fewer tokens, it is more efficient. Is there a similar benchmark for audio or video?
It is much harder to define because it depends on the sampling rate and the resolution. But here is a concrete example. In January of twenty twenty six, there was a significant update to the Gemini A P I regarding how it handles multimodal inputs. They introduced a more aggressive form of temporal compression for video. Before that, the model might have been looking at every single frame, or maybe two frames per second, regardless of what was happening in the video. The update allowed the encoder to identify redundant frames. If nothing is moving in the shot for three seconds, it can represent that entire block with fewer tokens. It is almost like a smart video codec, but instead of optimizing for human eyes, it is optimizing for the transformer’s attention.
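The redundant-frame idea can be sketched with a simple keep-if-changed filter. The difference metric and threshold below are invented for illustration, and frames are just lists of numbers standing in for pixels; a real encoder works in a learned feature space, not raw pixel deltas.

```python
# Sketch of temporal compression: keep a frame only when it differs
# enough from the last frame we kept. Threshold and metric are
# illustrative, not any provider's actual algorithm.

def keep_changed_frames(frames, threshold=0.1):
    kept = [frames[0]]
    for frame in frames[1:]:
        prev = kept[-1]
        diff = sum(abs(a - b) for a, b in zip(frame, prev)) / len(frame)
        if diff > threshold:
            kept.append(frame)
    return kept

static = [[0.5, 0.5]] * 4          # nothing moving in the shot
motion = [[0.5, 0.5], [0.9, 0.1]]  # then something changes
print(len(keep_changed_frames(static + motion)))  # -> 2 frames survive
```

Three seconds of a motionless shot collapses to roughly one representative frame's worth of tokens, which is where the cost savings come from.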
That is a huge leap forward. It reminds me of how we talked about audio engineering as prompt engineering in episode five hundred ninety eight. If I am a developer and I want to save money, I should probably be thinking about the signal density I am sending. If I send a high definition, sixty frames per second video of a person sitting still and talking, I am wasting a massive amount of compute. I could probably downsample that to five frames per second at seven twenty p, and the model would still understand the content perfectly.
You absolutely should. In fact, input normalization is going to be one of the most important skills for A I engineers in the next year. Most people just throw the raw file at the A P I and hope for the best. But if you understand that the tokenizer is going to chop that image into sixteen by sixteen pixel patches, you can resize your images to be multiples of sixteen to avoid weird artifacts or wasted padding tokens. For audio, if the model’s internal encoder is optimized for sixteen kilohertz, sending it a forty eight kilohertz professional studio recording is not giving you better results. It is just creating more work for the pre processor and potentially leading to more tokens than you need.
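Snapping image dimensions to a multiple of the patch size is simple arithmetic. The sixteen pixel patch size follows the discussion; check your model's documentation for the real value before relying on this.

```python
# Round image dimensions down to a multiple of the patch size so no
# patches are wasted on padding. patch=16 is the example from the
# discussion, not a universal constant.

def snap_to_patch(width, height, patch=16):
    snap = lambda n: max(patch, (n // patch) * patch)
    return snap(width), snap(height)

print(snap_to_patch(1000, 750))  # -> (992, 736), both multiples of 16
```

A one thousand by seven hundred fifty image quietly carries padding on both axes; resizing it yourself keeps every patch full of real signal.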
This brings us back to the opacity of pricing. If the A P I is charging me per second, but the underlying cost is based on tokens, and the number of tokens can change based on how the model compresses the signal, then the price per second is really just a simplified average. It feels like the industry is trying to hide the complexity from us, but in doing so, they are preventing us from being efficient.
It is a classic trade off. They want it to be user friendly. They want you to think, okay, it costs one cent per minute of audio. That is easy to put in a spreadsheet. But if you are building an enterprise application that processes millions of minutes, you need to know if you can get that down to half a cent by optimizing your bitrates. Right now, most providers do not give you that level of granularity. We are in this weird middle ground where the technology is multimodal, but the business model is still stuck in the text era.
I want to go back to Daniel’s question about whether we need unique tokenizers for every file type. You said yes and no, but let us look at the future. Do you think we will ever get to a point where there is a truly universal tokenizer? Something that does not care if it is looking at a pixel, a sound wave, or a sensor reading from an industrial machine?
There is a lot of research into what we call token free or continuous latent models. The idea is to skip the discrete tokenization step entirely and just map the raw signal directly into a continuous vector space. If we can do that, we eliminate the codebook. We eliminate the need to decide ahead of time which one hundred shades of blue are the most important. The model would just see the raw gradients. The problem is that transformers are currently designed to work with discrete sequences. To move to a truly universal, continuous input, we might need a fundamental change in the architecture, something like the state space models or Mamba architectures that people are getting excited about.
That would be a massive shift. It would mean the end of the token as the primary unit of A I. But until then, we are stuck with this hidden economy of multimodal tokens. Herman, you mentioned the January update for Gemini. Have you seen similar transparency in other models? For example, does G P T four o give any indication of how it is patchifying images?
They have started to. In their documentation, they explain that a high detail image is broken down into five hundred twelve pixel tiles, with each tile costing a fixed number of tokens on top of a small base cost, depending on the detail level you select. But even then, it is an approximation. There is this hidden layer of logic that decides how many tokens to allocate based on the complexity of the image. If you have a very detailed architectural blueprint, the model might decide it needs more tokens to capture all the lines and text than it would for a picture of a clear blue sky. This is what I call the cognitive load of the input. It is not just about the size of the file; it is about the density of the information.
That is such a crucial point. Cognitive load. It makes me think about how we as humans process information. If I show you a blank white wall, you do not need to spend much mental energy to describe it. But if I show you a busy street in Jerusalem, your brain is working overtime to categorize all the people, the cars, the stone textures. A I is finally starting to work the same way. The problem is that our current billing systems are based on the white wall and the busy street costing the exact same amount because they are both one image.
And that is why I think we are going to see a push for a multimodal token standard. We need a way for developers to say, I am sending you this many bits of information, and I expect it to cost this much. Right now, you might send the same image twice and get slightly different results because of how the cloud provider is balancing their compute load or which version of the encoder is active that day. It is very difficult to build a predictable business on top of that.
So, what can people actually do today? If you are a developer or a curious user like Daniel, and you want to be smart about this, where do you start?
Step one is to stop thinking in terms of file size and start thinking in terms of signal density. For images, look at the patch size of the model you are using. If it is sixteen by sixteen, or thirty two by thirty two, make sure your images are scaled appropriately. For audio, find out the native sampling rate of the encoder. Usually, sixteen kilohertz or twenty four kilohertz is the sweet spot. Anything above that is likely being discarded or downsampled anyway, so you might as well do it yourself and save the bandwidth.
And for video?
Video is all about the frame rate. Most A I models do not need thirty frames per second to understand what is happening. For most tasks, like summarizing a meeting or identifying an object, one or two frames per second is plenty. If you can reduce your video from thirty frames to two, you have just cut your token consumption by more than ninety percent without losing much semantic meaning. This is the kind of input normalization that separates the hobbyists from the pros right now.
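Because tokens scale linearly with frames, the savings from downsampling are easy to estimate. Actual per-frame token counts depend on the encoder; the linear relationship is the point.

```python
# Fractional token savings from downsampling frame rate, assuming
# tokens scale linearly with the number of frames.

def savings(original_fps, reduced_fps):
    return 1 - reduced_fps / original_fps

print(f"{savings(30, 2):.0%}")  # -> 93%, dropping 30 fps to 2 fps
```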
That is a great practical takeaway. I also think we need to be very careful with long conversations involving media. If you are using a chat interface and you have uploaded a video, remember that every time you ask a new question, you might be paying for that entire video again. It is often better to start a new thread if your next question does not actually require the context of that heavy media file.
That is a pro tip right there. The context window is a hungry beast. Every time you hit enter, you are feeding it the entire history of the chat. If that history includes a ten minute audio file, you are paying a massive tokenization tax on every single message.
It is funny, we started this show years ago talking about simple text prompts, and now we are talking about managing multi gigabyte streams of sensor data and video frames. The weird prompts are getting a lot more complex, but the core challenge remains the same. We are trying to figure out the most efficient way to communicate with these systems.
It really is the same journey. We are just moving from vocabulary to signal. And I think Daniel’s question highlights how much we still have to learn about the bridge between the physical world of waves and light and the digital world of vectors and tokens. We are building a map of the world, one patch and one millisecond at a time.
Well, I think we have covered a lot of ground here. We have looked at why the token to word rule fails for multimodal, how Vector Quantization creates a codebook for sound and light, and why native ingestion is replacing the old R A G pipelines for media. It is a fascinating time to be looking at this, especially with the rapid updates we are seeing in early twenty twenty six.
It really is. And I hope this gives Daniel and our listeners a bit more clarity when they are looking at those A P I dashboards. Don’t let the simple pricing fool you. There is a lot of complex engineering happening in those black boxes.
Definitely. Let's really lean into that complexity for a second, Herman. For the listeners who are building these systems, we need to address the actual sequence length math. If I have a context window of one million tokens, like we see in Gemini one point five Pro, and I am feeding it a high resolution video, how many minutes are we actually talking about before the model starts to lose its mind?
That is the million dollar question. Or, given A P I costs, maybe the ten thousand dollar question. Let's do the math. If you are using a model that samples video at one frame per second, and each frame is subdivided into a sixteen by sixteen grid of patches, that is two hundred fifty six tokens per frame. At one frame per second, a ten minute video is six hundred seconds, which equals one hundred fifty three thousand six hundred tokens. That sounds manageable for a million token window, right? But what if you need more detail? What if you are sampling at ten frames per second to catch fast motion? Suddenly, that same ten minute video is over one and a half million tokens. You have just blown past the context window of almost every model on the market.
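That sequence length math, written out. The two hundred fifty six tokens per frame assumes the sixteen by sixteen patch grid from the discussion; real encoders differ, so swap in your model's actual figures.

```python
# Video token count: tokens per frame * frames per second * duration.
# tokens_per_frame=256 assumes a 16x16 grid of patches per frame.

def video_tokens(minutes, fps, tokens_per_frame=256):
    return int(minutes * 60 * fps * tokens_per_frame)

print(video_tokens(10, 1))   # -> 153600   fits a one million token window
print(video_tokens(10, 10))  # -> 1536000  blows past it
```

The same ten minutes of video swings by an order of magnitude depending on one sampling decision, which is exactly why frame rate is the first knob to turn.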
And that is just the video. That doesn't include the audio track or the text prompt you are sending along with it. This is why we see these models sometimes hallucinating or "forgetting" the beginning of a video. It is not necessarily that the model is bad; it is that the token density of the video has pushed the most relevant information out of the active attention mechanism.
And this brings us back to the "Tokenization Tax" from episode six hundred sixty six. In that episode, we talked about how certain languages like Telugu or Burmese are tokenized very inefficiently compared to English, meaning speakers of those languages pay more for the same A I performance. We are seeing a "Multimodal Tax" now. If you don't know how to normalize your inputs, you are paying a massive premium for data that the model doesn't even need to do its job.
I want to go deeper into the "Digital Sandwich" versus "Native Multimodal" distinction. In episode nine hundred ninety two, we talked about how early voice assistants were basically three separate models taped together. You had the A S R for speech to text, the L L M for reasoning, and the T T S for text to speech. In that world, tokenization was simple because everything was converted to text before the "brain" ever saw it. But in a native model like G P T four o, the "brain" is seeing the audio tokens directly. Herman, does that mean the model actually "hears" the frequency, or is it still just seeing a number that represents a frequency?
It is seeing a number, but that number is part of a vector that captures the relationship between that frequency and all the others around it. It is like the difference between reading the sheet music for a song and actually hearing the vibration of the strings. The native model can "feel" the vibration in the data. This is why native multimodal models are so much better at detecting sarcasm or emotion in a voice. They aren't just reading the words "I am fine"; they are seeing the token for a high pitched, strained frequency that contradicts the words.
That is a powerful shift. But it also means the tokenizer has to be much more sophisticated. If the codebook for the audio tokenizer is too small, the model becomes "tone deaf." It might not have a token for a specific inflection, so it just maps it to the closest thing it knows, and suddenly the sarcasm is gone.
Precisely. This is why the development of these codebooks is such a closely guarded secret. Google, Open A I, and Anthropic are all competing to create the most efficient and expressive codebooks. They want to represent the maximum amount of human experience with the minimum number of tokens. It is the ultimate compression challenge.
So, when Daniel asks if we need unique tokenizers for every file type, the answer is that we need unique "encoders" that can map those files into a shared "latent space." Once they are in that space, they are all just tokens. But the journey from an M P four file to a token is where the magic, and the cost, happens.
And that journey is getting more efficient. The January twenty twenty six update I mentioned earlier for Gemini is a great example. They started using what is called "Dynamic Patching." Instead of dividing every image into a fixed grid, the model looks at the image first and decides where the information is. If you have a picture of a person standing in front of a plain white wall, the model might only use a few tokens for the wall and hundreds of tokens for the person's face. It is a more "human" way of looking at things. We don't give equal attention to every square inch of our field of vision.
That feels like the beginning of the end for the "token" as a fixed unit of cost. If the number of tokens for an image can change based on what is in the image, then the A P I providers are going to have a hard time explaining their bills to users. "Why did this photo of my cat cost twice as much as the photo of my dog?" "Well, your cat has more complex fur patterns, sir."
It sounds ridiculous, but that is exactly where we are headed. We are moving from a "per word" economy to a "per unit of complexity" economy. And for developers, that means we need new tools. We need "token simulators" that can tell us how much a file will cost before we hit the "send" button.
I think that is a great place to wrap up the technical deep dive. We have covered the shift from the digital sandwich to native ingestion, the mechanics of Vector Quantization, the reality of the context window wall, and the future of dynamic, complexity based tokenization.
It has been a journey. And I think it really highlights that we are still in the "Wild West" of multimodal A I. The rules are being written in real time.
Definitely. And hey, if you are finding these deep dives helpful, we would really appreciate it if you could leave us a review on your favorite podcast app. Whether you are on Spotify or Apple Podcasts, those reviews really help other people find the show and join the conversation.
They really do. We love seeing the feedback and the new questions that come in. It keeps us on our toes.
You can find all our past episodes, including the ones we mentioned today like episode nine hundred ninety two on voice A I and episode one thousand eighty one on the K V cache, over at myweirdprompts.com. There is a search bar there so you can dig through our entire archive of over a thousand episodes.
Thanks for joining us again. It is always a pleasure to dive into the weeds with you, Corn.
Same here, Herman. Until next time, this has been My Weird Prompts.
Take care, everyone.
I was just thinking, Herman, about that codebook analogy you used for the colors. If we have a codebook for everything, does that mean the A I’s world is essentially just a very large, but finite, collection of symbols? Like, is there any room for true novelty that hasn't been tokenized?
That is a deep philosophical question to end on. In a discrete system, yes, everything is a combination of existing symbols. But the number of possible combinations in a high dimensional space is so vast that it might as well be infinite. It is like the alphabet. We only have twenty six letters, but we have not run out of new things to say yet.
That is a fair point. Though I still think a sloth’s perspective on time might need its own specialized tokenizer. It is a much slower frequency than what most models are trained for.
We will have to look into that for episode two thousand. A sloth specific encoder.
I will start working on the training data. It might take a while.
I would expect nothing less.
Alright, we should probably head out. Daniel is probably wondering where his audio prompt went.
He is probably already working on the next one.
Most likely. Thanks for listening, everyone. We will catch you in the next one.
Bye for now.
One last thing, for those of you interested in the technical specifics of the Gemini update I mentioned, they have a developer blog post from mid January that goes into the temporal compression algorithms. It is a bit of a dense read, but if you are building video applications, it is essential.
Good call. It explains the difference between their fixed rate sampling and the new dynamic sampling. It is a game changer for cost optimization.
Alright, now we are really going. Thanks again.
See you.
I wonder if we should have mentioned the geopolitical side of this. I mean, the fact that the leading multimodal models are almost entirely coming out of American companies like Google and Open A I is a huge deal for digital sovereignty.
It is. And the infrastructure required to run these multimodal encoders is so massive that it creates a natural moat. Not many countries or companies can afford the compute to tokenize the world in real time.
It is a new kind of power. The power to define the codebook for reality.
That is a heavy thought. Maybe a topic for next week.
Maybe so. Alright, for real this time, thanks for listening.
Goodbye.
And don't forget to check out the website at myweirdprompts.com for the full transcript and links to the research papers we discussed.
We will have everything linked there.
Great. Talk soon.
Talk soon.