We are diving straight into the deep end today because Google just did something that is going to fundamentally change how every single one of us builds retrieval systems. Today's prompt from Daniel is about Gemini Embedding 2, which dropped on March tenth, twenty twenty-six, and it is a massive shift. We are talking about the first natively multimodal embedding model from Google that maps text, images, video, audio, and documents into one single joint vector space.
It really is a watershed moment, Corn. And I should mention, since we’re talking about cutting-edge models, today’s episode is actually powered by Google Gemini 3 Flash. But back to Embedding 2—the headline number everyone is circling is that seventy percent latency reduction. When you’re building production RAG—Retrieval-Augmented Generation—or complex search engines, a seventy percent drop in latency isn't just a "nice to have." It’s the difference between a feature being feasible or being a total non-starter.
I want to poke at that "natively multimodal" part, Herman Poppleberry. Because, let’s be honest, we’ve had multimodal search for a while. If I go to a retail site and upload a picture of a shoe to find similar shoes, that’s multimodal. So, what is the actual architectural leap here? Is it just Google putting their brand on something that already existed, or is the "native" part doing some heavy lifting?
The "native" part is doing all the heavy lifting. To understand why this is a big deal, you have to look at the "bolt-on" era we’ve been living in. Historically, if you wanted a multimodal system, you were essentially playing matchmaker between completely different neural networks. You’d have a text encoder, maybe something BERT-based, and an image encoder like CLIP from OpenAI or a Vision Transformer. These models were trained separately, on different data, and they lived in different mathematical universes. To make them talk to each other, you had to perform this complex "alignment" process, usually using contrastive learning, to force the image of a dog and the word "dog" to land near each other in a shared space.
So it was like trying to get a French speaker and a Japanese speaker to agree on a color by having them both point at a crayon box. They aren't actually speaking the same language; they’re just being coerced into a shared middle ground.
That’s a decent way to visualize it, but the technical reality was even messier. Because those encoders were separate, you had massive overhead. You had to run two different inference passes through two different architectures. That’s where the latency comes from. And more importantly, you had "modality bias." The text encoder only understands the world through tokens, and the image encoder only understands pixels. They never truly "share" a brain. Gemini Embedding 2 uses a shared transformer architecture. The same weights, the same attention mechanisms, the same underlying "understanding" is applied regardless of whether the input is a snippet of audio, a frame of video, or a paragraph of text.
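That "coerced middle ground" is, concretely, a contrastive objective. Here is a minimal NumPy sketch of the CLIP-style InfoNCE loss the bolt-on era used to force two separately trained encoders into one space — a simplified illustration, not any particular model's training code:

```python
import numpy as np

def info_nce_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric contrastive (CLIP/InfoNCE-style) loss over a batch of
    paired text and image embeddings. Pairs at the same index are
    positives; every other item in the batch is a negative."""
    # L2-normalize so dot products are cosine similarities
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    logits = t @ v.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(t))          # positives sit on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average both directions: text -> image and image -> text
    return (xent(logits) + xent(logits.T)) / 2
```

Training pushes matched pairs together and everything else apart — which is exactly the "matchmaker" step a natively multimodal model no longer needs as a separate, after-the-fact stage.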
Okay, let’s get into the weeds of that shared transformer. If I’m a developer, I’m used to tokenizing text. I know how that works—you break words into sub-words. But how does a single transformer "tokenize" a PDF with images, or an audio file of a deposition, into the same space?
It’s all about the unified tokenization strategy. In these newer native architectures, everything—whether it’s a patch of an image or a slice of a spectrogram from an audio file—is projected into a common embedding space before it ever hits the transformer blocks. For images, we’re often talking about "patches." You divide the image into a grid, flatten those patches, and treat them as tokens. For audio, you might take a short window of the waveform or a mel-spectrogram. The key is that once they are in that initial latent space, the transformer treats them as sequences of vectors. It doesn't care if vector number five hundred came from a pixel or a phoneme.
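The patch-tokenization step Herman describes can be sketched in a few lines. This assumes ViT-style square patches and a hypothetical learned projection; the actual patch sizes and projections inside Gemini Embedding 2 are unpublished model internals:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patch
    'tokens' -- the ViT-style preprocessing that happens before the
    transformer blocks. H and W must be multiples of `patch`."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, cols, p, p, c)
    return grid.reshape(-1, patch * patch * c)  # (num_tokens, token_dim)

def project_tokens(tokens, proj):
    """Linear projection into the shared model width. After this step
    the transformer sees only vectors -- it no longer knows or cares
    whether a token came from pixels, text, or a spectrogram slice."""
    return tokens @ proj
```

A 224-by-224 RGB image, for instance, becomes a sequence of 196 tokens — and from the transformer's point of view, that sequence is no different in kind from a sequence of sub-word tokens.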
But surely the scale of information is different? A second of video has way more raw data than a sentence. How does the model balance that so the "text" part of the embedding isn't just drowned out by the massive amount of data in a "video" embedding?
That is exactly where the shared attention mechanism earns its keep. In the old bolt-on models, you had no cross-modal attention during the actual encoding. With a native model like Gemini Embedding 2, the model can learn that certain visual tokens are highly correlated with specific text tokens during the training phase. It learns a more "semantic" representation that transcends the medium. When Google says it supports dimensions from one hundred twenty-eight up to three thousand seventy-two, they’re giving you the ability to tune how much "resolution" you need in that shared space. They recommend seven hundred sixty-eight as the sweet spot, which is interesting because it suggests they’ve found a way to compress that multimodal richness into a relatively standard vector size without losing the nuance.
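If Embedding 2 follows the Matryoshka-style training Google used for its earlier text embedding models — an assumption on our part, since the training recipe isn't public — then picking a dimension between one twenty-eight and three thousand seventy-two is just truncating the full vector and re-normalizing:

```python
import numpy as np

def truncate_embedding(vec, dims=768):
    """Matryoshka-style truncation: keep the first `dims` components of
    the full embedding and re-normalize. Only valid if the model was
    trained so that prefixes of the vector remain useful embeddings
    on their own (as with Matryoshka Representation Learning)."""
    v = np.asarray(vec, dtype=np.float64)[:dims]
    return v / np.linalg.norm(v)
```

That's what makes the seven-sixty-eight "sweet spot" a dial rather than a separate model: you trade index size and speed against resolution with a slice.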
I want to dig into this idea of "vector debt." I love that term because it sounds like something I’d have to explain to a bankruptcy judge. What does vector debt look like in a real-world system using the old separate-encoder approach?
Vector debt is the hidden cost of maintaining multiple, misaligned embedding spaces. Imagine you’re running a massive digital asset manager for a movie studio. You have a vector database for your scripts—that’s text. You have another one for your raw footage—that’s video. And another for your soundtracks—audio. If a director says, "Find me a scene where the music is tense and the actor looks angry," you are in a world of hurt. You have to query three different indexes, try to "rank-fuse" the results, and hope that your "tense music" vector in the audio space actually correlates with the "angry" vector in the video space. The "debt" is the engineering hours spent building these translation layers and the inevitable loss of accuracy because the systems weren't built together.
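The "rank-fuse" step Herman mentions is typically reciprocal rank fusion. Here's a minimal sketch of the glue code that vector debt forces you to write and maintain when scripts, footage, and soundtracks each live in their own index:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Classic RRF: merge ranked ID lists from separate indexes (say,
    text, video, and audio) into a single ranking. Each list
    contributes 1 / (k + rank) per document; k=60 is the value from
    the original RRF paper."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

It works, but notice what it can't do: it fuses rankings, not meanings. If "tense music" and "angry actor" never land near each other in their respective spaces, no amount of rank fusion recovers that connection — that's the accuracy loss Herman is calling debt.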
And I’m guessing the "seventy percent latency reduction" Google is touting comes from the fact that you aren't running those three separate gauntlets anymore?
Precisely. You aren't managing three different model deployments. You aren't moving data between different inference servers. You feed the multimodal prompt—maybe a text description plus a reference image—into one model, and you get one vector back. It simplifies the infrastructure immensely. But it’s not just about speed. It’s about "cross-modal retrieval." In a native space, the distance between a video of a "golden retriever jumping into a lake" and the text "dog splashing in water" is calculated directly. There’s no "translation" loss.
Let’s talk about a practical case study then. You mentioned retail earlier. Let’s say I’m building a high-end furniture search. A customer takes a photo of a mid-century modern chair in a cafe and types "but I want this in blue velvet." In the old world, how did we handle that?
In the old world, you’d likely use a multi-stage pipeline. Stage one: extract features from the image using a vision model. Stage two: extract features from the text using a language model. Stage three: use a "fusion" layer—usually a small neural network you trained yourself—to combine those two vectors into a "query vector." Then you’d hit your database. The problem is that your fusion layer is a bottleneck. It’s only as good as the data you used to train it. If you didn't have many examples of "blue velvet" paired with "mid-century chairs," the model might prioritize the "blue" but ignore the "mid-century" geometry.
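That stage-three fusion layer could be as small as a single learned projection over the concatenated vectors. A toy sketch — the dimensions and initialization here are illustrative, not from any production system:

```python
import numpy as np

class LateFusion:
    """Minimal 'stage three' fusion layer: concatenate a text vector
    and an image vector, apply a learned linear projection, and
    normalize. In the bolt-on era this small, hand-trained layer was
    the bottleneck -- only as good as its paired training data."""
    def __init__(self, text_dim, image_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 0.02, size=(text_dim + image_dim, out_dim))

    def __call__(self, text_vec, image_vec):
        fused = np.concatenate([text_vec, image_vec]) @ self.w
        return fused / np.linalg.norm(fused)
```

If that weight matrix never saw "blue velvet" paired with mid-century geometry during training, no query-time cleverness fixes it — which is the failure mode Herman just described.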
And with Gemini Embedding 2?
With a native model, you just pass both inputs. The model’s internal attention has already seen millions of examples of how text modifiers like "blue velvet" interact with visual structures like "mid-century modern." It produces a single vector that inherently represents the intersection of those two concepts. You don't have to build the fusion layer. The model is the fusion layer.
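The shape of that request can be sketched like so. The field names here are our own illustrative stand-ins, not the real Gemini API surface — the architectural point is simply that one payload carries both modalities and the response is one vector:

```python
def build_multimodal_request(text=None, image_bytes=None, dims=768):
    """Assemble a mixed-modality embedding request (hypothetical
    payload shape -- consult the actual API reference for real field
    names). One request carries both the customer's photo and the
    modifier text; the model returns a single vector of length `dims`
    representing their intersection."""
    parts = []
    if text is not None:
        parts.append({"kind": "text", "value": text})
    if image_bytes is not None:
        parts.append({"kind": "image", "value": image_bytes})
    if not parts:
        raise ValueError("need at least one modality")
    return {"parts": parts, "output_dimensionality": dims}
```

Compare that to the old pipeline: no separate vision pass, no separate text pass, no fusion layer to train — the encode-encode-fuse choreography collapses into one call.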
That sounds like a dream for developers, but what’s the catch? There’s always a catch. If I’m dumping video, audio, and text into one model, doesn't that model become a massive, expensive beast to run? Google might have the TPUs to do it, but what about the rest of us?
Well, that’s why the API-first approach is so significant here. You aren't running the weights on your own hardware most of the time; you’re hitting the Gemini API. But even if you were, the efficiency gains in the "one model to rule them all" approach often outweigh the size of the model. Think about it: instead of loading three separate twenty-billion parameter models into VRAM, you load one fifty-billion parameter model that handles everything. The memory footprint is actually better.
I want to go back to the PDF thing. Google specifically called out "documents" or PDFs as a modality. Now, to a layman, a PDF is just text. But to anyone who has ever tried to scrape one, a PDF is a nightmare of tables, charts, and images with text wrapped around them. How is a unified embedding space changing the "RAG on documents" game?
This is actually one of the most underrated parts of the announcement. Traditional RAG is "text-only" by default. You use a library to rip the text out of a PDF, chunk it, and embed it. If there was a crucial chart on page five showing a revenue spike, that information was essentially lost to the vector database unless you had a separate, very expensive process to describe that chart in text. Gemini Embedding 2 treats the PDF as a visual and structural entity. It "sees" the layout. It understands that a caption next to an image is related to that image. When you embed a page of a PDF, the resulting vector captures the semantic meaning of the text, the visual data in the charts, and the spatial relationship between them.
So if I ask my AI assistant, "Which quarter had the highest growth according to these documents?" and the answer is only found in a bar chart, the native multimodal embedding will actually find that page?
That’s the promise. In a text-only RAG system, that query would fail because the word "quarter" and the numbers might not appear in the text in a way that matches the query. But the multimodal embedding knows what a "bar chart" looks like and how it maps to the concept of "growth." It bridges the gap between the "visual linguistic" world and the "raw data" world.
We should probably talk about what this does to the vector database industry. Companies like Pinecone, Weaviate, and Milvus have spent years optimizing for text vectors. Now, suddenly, the "payload" of a search is much more complex. Does a unified space make their lives easier or harder?
In the short term, much easier. One of the biggest headaches for vector DBs was "multi-vector indexing." If you wanted to search across images and text, you often had to maintain two separate indexes and do the "join" in the application layer. With a unified model, you have one index. A vector is a vector. Whether it was generated from a video or a tweet, it’s just a list of numbers in a seven hundred sixty-eight-dimensional space. The database doesn't care about the source modality. This simplifies the architecture of RAG systems significantly. You have one source of truth, one index to scale, and one set of benchmarks to worry about.
But wait, if I have a unified index, what happens to my "metadata filtering"? If I only want to search videos, I’m still tagging that in the metadata, right? The embedding isn't magically telling the database "this vector came from a video file."
You still use metadata for categorical filtering, but the "semantic" overlap is more fluid now. For example, if you search for "explosions" in a unified index, you might get back a text description of a firework show, an audio file of a car backfiring, and a video of a Michael Bay movie. Because they are in the same space, the "concept" of an explosion is the primary driver, not the file type. This allows for much more creative discovery. Think about a newsroom. A journalist could search for a specific event and get the relevant articles, the raw interview audio, and the B-roll footage all in one ranked list based purely on how well they match the "event" semantically.
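A unified index with a categorical metadata filter might look like this in miniature. The `index` layout here is our own stand-in, not any particular database's API — but it shows why the modality tag has to live in metadata now that the vector itself no longer reveals its source:

```python
import numpy as np

def search(query_vec, index, modality=None, top_k=5):
    """Search a unified index: one vector space for every modality,
    plus an optional categorical filter. `index` is a list of
    (doc_id, vector, metadata) triples; metadata records the source
    modality since the embedding itself does not."""
    q = query_vec / np.linalg.norm(query_vec)
    hits = []
    for doc_id, vec, meta in index:
        if modality is not None and meta.get("modality") != modality:
            continue  # categorical pre-filter on metadata
        score = float(q @ (vec / np.linalg.norm(vec)))
        hits.append((doc_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_k]
```

Leave `modality=None` and you get the journalist's experience: articles, interview audio, and B-roll in one ranked list, ordered purely by semantic distance to the event.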
I can see some potential pitfalls here though. If everything is in one space, do we run into a "jack of all trades, master of none" problem? Is a unified model as good at "text-to-text" retrieval as a dedicated, highly specialized text embedding model like the old Ada or the newer specialized Gemini text models?
That is the million-dollar question. Usually, there is a "multimodality tax." When you force a model to learn images and audio alongside text, you sometimes see a slight regression in pure text performance because the "representational capacity" of the model is being spread thin. However, Google’s claim is that because the architecture is "native"—meaning it was built from the ground up to be multimodal—it doesn't suffer from that as much. They are using the massive scale of the Gemini training set to ensure that the text "clusters" are still as robust as ever. But in practice, if you are building a system that is ninety-nine percent text-based, you might still stick with a specialized text-only model for that last sliver of accuracy. For the vast majority of other use cases, though? The convenience of the multimodal space is going to win.
Let’s look at the competition for a second. Google isn't the first to the party. We’ve had CLIP from OpenAI since twenty twenty-one. Meta has ImageBind, which I remember being a huge deal because it handled six modalities—text, image, audio, depth, thermal, and IMU data. How does Gemini Embedding 2 stack up against something like ImageBind?
ImageBind was a brilliant research project, but it never really became the "industry standard" for production RAG. It was more of a "look what’s possible" moment. Google’s entry is different because it’s integrated directly into the Vertex AI and Gemini ecosystem. It’s "production-ready" in a way that research models often aren't. And specifically, the inclusion of "video" as a first-class citizen is huge. ImageBind could handle video as a sequence of images, but Gemini Embedding 2 is leveraging the temporal understanding that Google has been perfecting with the main Gemini models. It understands "action" and "change over time" in a way that a simple image-text model can’t.
"Temporal understanding" is a fancy way of saying it knows the difference between a glass falling and a glass being picked up, even if the frames look similar.
And that is critical for security applications, sports analytics, or even just navigating a personal video library. If you search for "the moment the birthday cake candles were blown out," a model that only sees "stills" might just give you any picture of a cake. A model with temporal native embeddings understands the "action" of blowing out the candles.
So, looking forward, where does this leave the "manual" engineers? The people who spent the last three years perfecting their "alignment" scripts and their custom fusion layers. Are they just... obsolete now?
Not obsolete, but their job description is shifting. We’re moving from "feature engineering" and "model alignment" to "system orchestration." The hard work of "making the models talk" is being subsumed into the foundation models themselves. The new challenge is: how do you build a user experience that actually takes advantage of this? If I can search across every modality, how do I present those results to a user without overwhelming them? How do I handle "multimodal prompt engineering"?
"Don't just tell the AI what you want, show it a picture and play it a sound." That’s a very different interface than a search box.
It really is. We’re going to see a "multimodal-first" design philosophy. Imagine a creative director at an ad agency. They don't just type "I want a moody vibe." They drag in a color palette, a fifteen-second clip of a classic film, and a Spotify link. The unified embedding space takes all of those disparate inputs and finds the exact assets in their library that match that "vibe" across all media types. That was technically impossible to do efficiently two years ago. Now, it’s an API call away.
I want to talk about the "documents" part again because I think that’s where the most immediate business value is. We’ve talked about RAG a lot on this show, and the biggest failure point is always the "unstructured" nature of business data. If I’m an insurance company and I have a million claim forms that are a mix of handwritten notes, photos of car accidents, and typed text, a unified embedding space feels like a literal "get out of jail free" card for my data science team.
It truly is. Think about the traditional pipeline for that insurance company. They’d have an OCR—Optical Character Recognition—system to read the text. They’d have a computer vision model to detect "fender bender" versus "total loss" in the photos. They’d have a sentiment analysis model for the notes. Then they’d try to join all those outputs in a SQL database to make sense of the claim. With Gemini Embedding 2, you embed the entire claim folder into one vector. That vector "contains" the visual evidence of the crash and the linguistic evidence of the claim. When an adjuster searches for "fraudulent-looking rear-end collisions," the model is looking for the "semantic signature" of fraud that spans both the text and the images simultaneously.
It’s almost like the model is developing "intuition."
I wouldn't go that far, but it’s certainly developing a more "holistic" representation. It’s no longer looking through a keyhole at one modality at a time. It sees the whole room. And the seventy percent latency reduction means you can do this at the scale of a million claims without your cloud bill looking like a phone number.
Let’s pivot to the second-order effects. If we have these unified spaces, does it change how we train the next generation of models? If we already have a perfect mapping of the world in vectors, do we even need to train the "reasoning" part of the AI on raw data anymore? Can we just train it on the embeddings?
That is a very high-level "researchy" question, but the answer is: likely yes. We’re already seeing "embedding-to-embedding" models. If the embedding space is "rich" enough—meaning it truly captures the essence of the world—then the reasoning layer just needs to learn how to manipulate those vectors. It doesn't need to relearn what a "dog" looks like every time. This could lead to much smaller, much faster "reasoning" models because they are standing on the shoulders of these massive multimodal embedding giants.
It’s like learning to play chess once you already know what the pieces are and how they move. You don't have to keep staring at the wood grain of the knight; you just focus on the strategy.
That’s a great analogy, actually. The embeddings provide the "what," and the next generation of models can focus entirely on the "why" and the "how." But there are new challenges. One of them is "interpretability." If a unified model returns a result, and I ask "Why did you think this video matched my text query?", it’s much harder to explain. In the old "bolt-on" world, I could say, "The text model found these three keywords, and the image model found these two shapes." In a native model, the "understanding" is smeared across the entire transformer. It’s more "accurate," but it’s more of a black box.
That’s going to be a nightmare for compliance in fields like medicine or law. "The AI said this X-ray matches the symptoms in the patient’s chart, but we don't know why."
Right. And that’s where we’ll see a new field of "multimodal explainability" emerge. How do we probe a unified vector to see which parts of the input—the text, the image, or the audio—were the primary drivers of the embedding? We’re going to need new tools to "de-bias" these models too. If a model hears a certain accent in an audio file, does it shift the vector into a "less trustworthy" part of the space? These are the kinds of subtle, multimodal biases that were easier to spot when the models were separate. Now, they could be buried deep in the shared weights.
Let’s talk about the "latency" piece one more time because I think people underestimate it. When you say "seventy percent reduction," most people think "Oh, the search results come back faster." But in the world of automation and agents, latency is the difference between an agent that works and an agent that "stutters."
If you’re building a real-time multimodal agent—something that’s looking through a camera and listening to you at the same time—you need to turn those inputs into "actionable" vectors in milliseconds. If the embedding process takes five hundred milliseconds, the "human-AI" loop feels broken. It’s uncanny and frustrating. If you can drop that to one hundred fifty milliseconds? Now you have an agent that can respond to your gestures and your words in what feels like real-time. This is the "infrastructure" that makes things like the "Her" style AI assistants actually possible.
So, what should our listeners actually do with this? Most of the people listening are either building these systems or managing the teams that do. If they’re currently using a text-only RAG or a "hand-aligned" multimodal system, is it time to tear it all down?
I wouldn't say tear it down today, but I would say: start a "shadow" index. Take a subset of your data—maybe ten percent—and run it through the Gemini Embedding 2 API. Build a parallel retrieval pipeline and run some A/B tests. Look at your "Mean Reciprocal Rank" or your "Normalized Discounted Cumulative Gain." My bet is that for any data that isn't strictly "clean text," you’re going to see a significant jump in retrieval quality. And more importantly, look at your "time to first token" in your RAG pipeline. If you can shave half a second off the retrieval step, your users are going to feel that.
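Mean Reciprocal Rank, one of the two metrics Herman names, is simple enough to compute by hand for that shadow-index A/B test. A minimal sketch:

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """MRR over an evaluation set: for each query, take 1/rank of the
    first relevant document in the returned list (0 if none appear),
    then average across queries. Higher is better; 1.0 means every
    query's top hit was relevant."""
    total = 0.0
    for results, relevant in zip(results_per_query, relevant_per_query):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results_per_query)
```

Run the same labeled queries against your current pipeline and the shadow index, compare the two MRR numbers, and you have a concrete answer instead of a hunch.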
And what about the "vector database" side? If I’m using something like Pinecone, do I need to change my schema?
You might actually be able to simplify your schema. If you were storing separate vectors for "image_embedding" and "text_description_embedding," you can move to a single "unified_embedding" field. This reduces your storage costs and simplifies your query logic. But you also need to rethink your "chunking" strategy. Traditional text chunking is based on character counts or sentence boundaries. How do you "chunk" a multimodal document? Do you chunk by page? By "semantic scene"? This is a new frontier for "unstructured" data engineering.
I suspect we’re going to see a lot of "best practices" papers coming out in the next six months about "Multimodal Chunking."
I’m already seeing some! People are experimenting with "visual-aware chunking" where you use the layout of the PDF to decide where a "thought" begins and ends, rather than just looking at the text. It’s a much more natural way to process information.
It feels like we’re finally moving past the "AI is a calculator for words" phase and into the "AI is a system that actually perceives" phase. Which is both exciting and, if I’m honest, a little bit "sloth-brain" overwhelming.
It’s a lot to take in. But that’s the nature of twenty twenty-six, Corn. The "vector gold rush" we talked about a couple of years ago has matured. It’s no longer about just "having" embeddings; it’s about having the right embeddings that capture the full context of human information. Google’s move here is a signal that the "siloed" approach to AI—where vision, audio, and text are separate departments—is officially dead. The future is unified.
Well, I’m glad I have you to read the "Unified Future" manual for me, Herman. Because I’m still trying to figure out how to get my smart toaster to stop burning the sourdough.
Maybe you just need to embed the "concept" of "not burnt" into its latent space, Corn.
I’ll get right on that. Before we wrap up, I think we’ve covered the "what" and the "how" pretty thoroughly. The "why" is clear: efficiency, accuracy, and a massive reduction in the "engineering tax" of multimodal systems. Any final thoughts on the "where is this heading" part?
I think the next step is "streaming multimodal embeddings." Right now, we’re still thinking in terms of "files"—upload a video, get a vector. The next leap is a continuous "embedding stream" from a live sensor. Imagine an AR headset that is constantly generating a unified vector of everything you see and hear, and using that to query a "personal memory" database in real-time. That requires even more latency reduction, but Gemini Embedding 2 is the foundational stone for that kind of world.
That sounds like a privacy nightmare, but also... incredibly cool. "Hey AI, where did I put my keys? I know you were 'embedding' the living room five minutes ago."
"Your keys are at coordinates X, Y, Z in your unified semantic space." We’re mapping the physical world into the digital one, one vector at a time.
On that note, I think we’ve squeezed all the juice out of this particular orange. This has been "My Weird Prompts." If you’re building with Gemini Embedding 2, or if you’ve found some weird edge cases where it fails, we want to hear from you. Show at myweirdprompts dot com.
And thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. Big thanks to Modal as well for providing the GPU credits that power our generation pipeline.
If you found this technical deep dive useful, do us a favor and leave a review on Apple Podcasts or Spotify. It actually makes a huge difference in helping other nerds find the show.
We’re also on Telegram if you want to get notified the second a new episode drops—just search for "My Weird Prompts."
Catch you in the next one.
See ya.