#1799: The Original AI Blueprints: BERT & CLIP

Before GPT, two models changed everything. Discover how BERT and CLIP taught machines to read and see the world.

Episode Details
Episode ID
MWP-1953
Published
Duration
26:15
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

In the fast-paced world of artificial intelligence, it is easy to get swept up in the hype surrounding the latest large language models and multimodal generators. However, the true titans of the industry—the architectures that laid the groundwork for today's AI boom—are often overlooked in favor of newer, shinier objects. This discussion revisits two of the most pivotal models in AI history: BERT and CLIP. While they may seem like "ancient history" in AI years, their engineering principles remain the blueprints for modern machine intelligence.

The BERT Revolution: Reading Contextually

Before BERT's release by Google in October 2018, natural language processing was dominated by recurrent architectures: Recurrent Neural Networks (RNNs) and their LSTM variants. These models processed text linearly, reading word by word like a ticker tape. This approach struggled with long-range dependencies; if a word at the end of a sentence changed the meaning of a word at the beginning, the model often lost the context.

BERT, standing for Bidirectional Encoder Representations from Transformers, changed everything. Unlike its predecessors, BERT processes the entire sentence simultaneously. It looks at the whole context at once, allowing every word to relate to every other word in the sequence.

The magic behind BERT lies in its pre-training task: Masked Language Modeling (MLM). Researchers took massive text corpora and randomly hid about 15% of the words, effectively putting digital duct tape over them. The model's job was to predict these masked words based on the surrounding context. This forced BERT to develop a deep, bidirectional understanding of language. It wasn't just predicting the next word; it was reconstructing meaning from incomplete information.
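The masking procedure can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not BERT's actual preprocessing code; the real recipe also replaces some selected tokens with random words or leaves them unchanged (the 80/10/10 split described in the BERT paper).

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    """Hide roughly 15% of tokens so a model can be trained to
    reconstruct them from the surrounding context (toy MLM sketch)."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok
            masked[i] = mask_token
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
# `masked` now contains [MASK] at the hidden positions; `labels`
# records what the model should predict at each of them.
```

During pre-training, the model's loss is computed only at the masked positions, which is what forces it to use both left and right context to fill in the blanks.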

BERT's architecture is built on the "Self-Attention" mechanism introduced by the original Transformer in the 2017 paper "Attention Is All You Need." Imagine a cocktail party where every word asks every other word, "How relevant are you to me?" The word "bank" might ask "river" and "deposit" for context, creating distinct mathematical representations—embeddings—for the same word based on its neighbors. This ability to handle polysemy made BERT incredibly powerful for tasks like search, sentiment analysis, and document classification.
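That relevance-scoring step can be written out as a minimal single-head attention sketch in NumPy. This is the scaled dot-product formulation from the Transformer paper with illustrative dimensions, not BERT's multi-head, multi-layer implementation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings. Each token scores every
    other token ("how relevant are you to me?") and returns a
    relevance-weighted average of the value vectors.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                          # context-mixed embeddings

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                     # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)             # same shape as X
```

Because every token attends to every other token in one step, the output embedding for "bank" ends up mixed with "river" or "deposit" depending on which neighbors score as relevant.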

CLIP: Bridging Vision and Language

While BERT mastered text, CLIP, released by OpenAI in 2021, bridged the gap between text and images. Before CLIP, computer vision models relied on supervised learning, requiring thousands of labeled images for specific categories. If a model hadn't seen a "Golden Retriever playing in the snow," it might fail to identify it.

CLIP took a different approach by leveraging the internet's vast collection of image-text pairs. Instead of predicting captions word-for-word, CLIP uses contrastive learning. It functions like a matching game: during training, it tries to make the mathematical representation of an image and its correct caption as similar as possible, while pushing them apart from mismatched pairs.
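The matching game has a precise form: a symmetric contrastive (InfoNCE-style) loss over a batch of image-text pairs. The NumPy sketch below illustrates the objective with toy embeddings standing in for real encoder outputs; CLIP's actual implementation uses learned encoders and a learned temperature.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss (sketch).

    Row i of img_emb and row i of txt_emb are a matched pair. The loss
    pulls matched pairs together and pushes mismatched pairs apart.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix

    def xent_diag(l):
        # cross-entropy where the correct "class" sits on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image->text and text->image directions
    return (xent_diag(logits) + xent_diag(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_loss(emb, emb)         # perfectly matched pairs: low loss
shuffled = clip_loss(emb, emb[::-1])  # mismatched pairs: high loss
```

Minimizing this loss is exactly the "make matched pairs similar, mismatched pairs dissimilar" training signal described above.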

This process aligns two distinct universes—visual and linguistic—into a shared "latent space." The result is zero-shot learning. CLIP can identify concepts it hasn't explicitly seen by comparing the "vibe" of an image's pixels to the "vibe" of text labels. This capability became the compass for generative models like DALL-E and Stable Diffusion, providing the feedback loop necessary to generate images that match textual prompts.
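Zero-shot classification then reduces to a nearest-neighbor lookup in the shared space. A minimal sketch, with toy vectors standing in for real CLIP encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_vec, label_vecs, labels):
    """Pick the label whose text embedding is closest (by cosine
    similarity) to the image embedding in the shared latent space."""
    img = image_vec / np.linalg.norm(image_vec)
    txt = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    sims = txt @ img                     # cosine similarity per label
    return labels[int(np.argmax(sims))]

labels = ["toaster", "car", "tree"]
label_vecs = np.array([[1.0, 0.1, 0.0],   # hypothetical text embeddings
                       [0.0, 1.0, 0.2],
                       [0.1, 0.0, 1.0]])
image_vec = np.array([0.9, 0.2, 0.1])     # lands nearest the "toaster" vector
print(zero_shot_classify(image_vec, label_vecs, labels))  # → toaster
```

No retraining is needed to add a new category: append another text embedding to `label_vecs` and the same comparison handles it.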

The Embedding Economy and Modern Applications

The legacy of BERT and CLIP is most visible in the "embedding economy," where sentences, images, and concepts are converted into high-dimensional vectors. This allows mathematical operations on meaning, such as subtracting "maleness" from "King" and adding "femaleness" to get "Queen."

In modern applications, these principles persist. Retrieval-Augmented Generation (RAG) systems, which allow chatbots to interact with private data, rely heavily on BERT-like models for retrieval. Instead of keyword matching, these systems turn queries into vectors to find the most semantically similar documents.
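The retrieval step can be sketched as a cosine-similarity search over precomputed document vectors. The names and toy vectors below are illustrative; in practice the embeddings come from a BERT-style sentence encoder and the search runs against a vector database rather than a NumPy array.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose embeddings are most similar to the
    query embedding — the retrieval half of a RAG pipeline (sketch)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                          # cosine similarity per document
    top = np.argsort(-sims)[:k]           # highest similarity first
    return [docs[i] for i in top]

docs = ["travel policy", "expense report", "office map"]
doc_vecs = np.array([[1.0, 0.0, 0.0],     # hypothetical document embeddings
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
query = np.array([0.9, 0.1, 0.0])         # embedding of the user's question
results = retrieve(query, doc_vecs, docs, k=2)
```

The retrieved passages are then handed to a generative model as context, which is what lets a chatbot answer from private data it was never trained on.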

While the original BERT and CLIP have evolved into variants like RoBERTa and DistilBERT, their core architectures remain relevant. They serve as a reminder that in AI, the foundational innovations often outlast the hype of the latest releases, continuing to power the intelligent systems we use today.


#1799: The Original AI Blueprints: BERT & CLIP

Corn
You know, Herman, everyone is so obsessed with the latest shiny object in the AI world, whether it is GPT-five rumors or the newest Gemini update, that we sometimes forget the absolute titans that actually built the house we are all living in. It is like everyone is talking about the latest electric hypercar while forgetting that someone had to invent the internal combustion engine and the pneumatic tire first.
Herman
That is a great way to put it, Corn. We are living in the era of the giants, but those giants are standing on the shoulders of two specific architectures that fundamentally changed how machines "see" and "read" the world. I am talking, of course, about BERT and CLIP. Today's prompt from Daniel is asking us to revisit these "classic" models, and honestly, it is about time. If you do not understand BERT and CLIP, you do not really understand how modern AI works.
Corn
I love that Daniel brought this up. It is easy to look at a model from two thousand eighteen or twenty twenty-one and think it is "ancient history" in AI years, but these are more like classic cars. They might not have the highest top speed compared to a frontier model today, but the engineering under the hood is what defined the entire industry. By the way, fun fact for everyone listening: today's episode is actually being powered by Google Gemini three Flash, which is a nice little nod to the evolution of these transformer architectures.
Herman
Herman Poppleberry here, ready to geek out. And you are right, Corn, the "transformer revolution" really found its footing with BERT. Before BERT, if you wanted a computer to understand a sentence, you were mostly using Recurrent Neural Networks or LSTMs. They processed words one by one, like a person reading a ticker tape. If a word at the end of the sentence changed the meaning of a word at the beginning, the model often struggled to keep that context in its head.
Corn
It was very linear. Very "A leads to B leads to C." But BERT changed that by being, well, "bidirectional." Which sounds like a fancy way of saying it looks both ways before crossing the street.
Herman
Essentially, yes! BERT stands for Bidirectional Encoder Representations from Transformers. When Google released it in October twenty eighteen, it blew everything out of the water because it didn't just read left-to-right or right-to-left. It looked at the entire sentence simultaneously. Every word in a sentence was processed in relation to every other word.
Corn
And then a few years later, OpenAI gives us CLIP in early twenty twenty-one, which did for images what BERT did for text, but with a twist. It bridged the gap. It taught the machine that a picture of a cat and the word "cat" are essentially the same concept in different formats.
Herman
That is the core of it. CLIP, or Contrastive Language-Image Pre-training, moved us into the multimodal era. We are going to spend today looking at why these two are the "blueprints." We will look at the masked language modeling of BERT, the contrastive learning of CLIP, and why, if you are building a real-world application today, you are probably still using BERT embeddings or CLIP-style logic under the hood.
Corn
Let's start with the "Language" side of the house. BERT. When it dropped, it wasn't just another incremental update. It was a "hold my beer" moment for Google. They released two versions: BERT-base with one hundred ten million parameters and BERT-large with three hundred forty million. By today's standards, where we talk about trillions of parameters, three hundred forty million sounds like something you could run on a toaster. But the efficiency was the magic.
Herman
The magic was the pre-training task. This is the part that I think is most elegant. Most models back then were trained to predict the next word in a sequence. "The cat sat on the..." and the model guesses "mat." But the Google researchers decided to try "Masked Language Modeling." They would take a massive corpus of text, like all of English Wikipedia and a huge collection of digital books, and they would randomly hide fifteen percent of the words.
Corn
They just put a digital piece of duct tape over the words and told the model, "Figure out what is underneath."
Herman
Right. So if the sentence is "The [MASK] sat on the mat," the model has to look at "The," "sat," "on," "the," and "mat" to realize the missing word is likely a noun, probably an animal. Because it can see the words that come after the mask, it has a much deeper understanding of context. It's not just guessing the future based on the past; it's reconstructing a whole idea.
Corn
I remember seeing an example of how this handles polysemy... you know, words with multiple meanings. Like the word "bank." If I say "I am going to the bank to deposit a check," BERT's attention mechanism looks at the word "bank" and sees it is strongly connected to "deposit" and "check." But if I say "The boat crashed into the river bank," BERT sees "bank" is connected to "river" and "boat." It creates a different mathematical representation... an embedding... for the same word based on its neighbors.
Herman
That is the "Self-Attention" mechanism at work. In BERT, you have those twelve or twenty-four layers of transformers, and in each layer, there are these Query, Key, and Value matrices. Without getting too bogged down in the linear algebra, imagine every word in a sentence is asking every other word, "How relevant are you to me?" The word "bank" asks "river," and "river" says, "I am very relevant to your current meaning." The word "bank" asks the word "the," and "the" says, "I am just grammar, ignore me."
Corn
It is like a cocktail party where everyone is trying to find the most important person to talk to so they can understand the gossip.
Herman
It really is. And because BERT was an "encoder-only" model, it was designed purely for understanding. It wasn't meant to write poems or chat with you like Gemini or GPT. It was meant to turn human language into a high-dimensional vector... a list of numbers... that captured the "essence" of the meaning. That is why it became the king of search. Google integrated BERT into Search in twenty nineteen, calling it the biggest leap in five years, because the engine finally started understanding the intent behind a query rather than just matching keywords.
Corn
It’s the difference between a librarian who just looks for titles and a librarian who has actually read the books. But let me push back a bit on the "classic" status. If BERT is so great at understanding, why did everyone move to the GPT style, the "decoder-only" models?
Herman
It’s a trade-off in architectural philosophy. BERT is bidirectional, which is amazing for understanding a fixed block of text. But if you want to generate text, bidirectionality is actually a hindrance because you cannot "see" the future words you haven't written yet. GPT is like a person writing a story word by word; BERT is like an editor looking at a completed paragraph. For tasks like sentiment analysis, named entity recognition, or document classification, BERT-style models are often still more efficient and accurate than their generative cousins because they have that "all-at-once" perspective.
Corn
So BERT is the specialist editor, and GPT is the creative writer. That makes sense. But then we have CLIP. And CLIP feels like the bridge between the world of text and the world of vision. Before CLIP, if you wanted a model to recognize an image of a dog, you had to feed it thousands of images labeled "dog." If you showed it a "Golden Retriever playing in the snow," and it hadn't seen that specific label, it might just say "dog" or fail entirely.
Herman
Right, the old way was "supervised learning" on fixed labels. It was very rigid. OpenAI's breakthrough with CLIP was to say: "The internet is full of images with captions. People have already done the labeling for us." They took four hundred million image-text pairs from the web. But instead of teaching the model to predict the caption word-for-word, they used "Contrastive Learning."
Corn
This is the part that sounds like a game of "Match the Pair."
Herman
It is exactly that. You have two encoders: one for images and one for text. During training, you show the model a batch of images and their correct captions. You tell the model, "Make the mathematical representation of this image and this specific text as similar as possible. And at the same time, make them as different as possible from all the other mismatched images and captions in this batch."
Corn
So it’s not learning what a "dog" is by looking at pixels alone. It’s learning what the concept of a dog looks like compared to the word dog. It is aligning two different universes... the visual and the linguistic... into one shared "latent space."
Herman
And that has a massive second-order effect: Zero-Shot learning. Because CLIP understands concepts rather than just labels, you can give it an image of something it has never seen in a formal training set... say, a very specific type of vintage nineteen sixties toaster... and then give it a list of text labels like "toaster," "car," "tree," and "person." CLIP will look at the image, look at the text vectors, and see which one is closest in that shared space. It will pick "toaster" because it understands the visual "toasterness" matches the linguistic "toasterness."
Corn
It is basically the "Vibe Check" model. It checks if the vibe of the pixels matches the vibe of the words. And what is wild is how this unlocked the generative AI boom. People think DALL-E or Stable Diffusion just "know" how to draw. But they only know what to draw because CLIP is there acting as the compass. When you type "an astronaut riding a horse on Mars," CLIP is the reason the model knows what an astronaut looks like and can verify if the image it is generating actually matches your prompt.
Herman
It is the ultimate discriminator. It provided the feedback loop that made multimodal AI possible. Before CLIP, vision and language were two separate departments in the AI building that didn't talk to each other. CLIP tore down the walls and put them in an open-plan office.
Corn
Which leads me to something I've been thinking about: the "Embedding Economy." Because of BERT and CLIP, we now treat everything as a vector. A sentence is a vector, an image is a vector. And once everything is a number in a high-dimensional space, you can do math on it. You can subtract "maleness" from "King" and add "femaleness" to get "Queen." That old NLP cliché actually became a production reality with these models.
Herman
It really did. And if you look at modern RAG systems... Retrieval-Augmented Generation... which is how most companies are actually using AI right now to talk to their private data, they are almost all using BERT-descendant models for the retrieval part. When you ask a chatbot about your company's travel policy, the system doesn't search for keywords. It uses a model like BERT to turn your question into a vector, then it looks through a "vector database" of your documents to find the paragraph whose vector is closest to your question's vector.
Corn
I love that. We’ve gone from "searching for words" to "searching for meaning." But here is the cheeky question, Herman. If these models are so foundational, why aren't we still just using the original BERT from twenty eighteen? Why did we need "RoBERTa" or "DistilBERT" or all these other variants with the funny names?
Herman
Because the original BERT was actually "under-trained," believe it or not. Facebook... or Meta... released RoBERTa shortly after, and they showed that if you just train BERT for longer, on more data, with bigger batches, and remove the "Next Sentence Prediction" task that Google originally included, it gets significantly better. It turns out BERT had a lot more potential than the original paper even suggested. This was an early sign of the "Scaling Laws" we talk about so much now. More data and more compute equals more "intelligence."
Corn
And DistilBERT was the "diet" version, right? For when you don't want to run a massive server just to classify some emails.
Herman
Knowledge distillation is a fascinating process where you take a big "teacher" model like BERT-large and you train a smaller "student" model to mimic its outputs. You end up with a model that is forty percent smaller and sixty percent faster but retains ninety-seven percent of the performance. It made these models practical for edge devices and real-time applications.
Corn
It’s the "shrink-ray" of the AI world. Now, let's talk about the vision side again. CLIP isn't just for matching images to text. It’s being used for things like content moderation, medical imaging, and even autonomous driving. If a car sees a "pedestrian" but that pedestrian is wearing a weird costume, a traditional classifier might get confused. But a CLIP-style model understands the "human-ness" of the shape regardless of the specific pixels.
Herman
What I find truly wild is the "Zero-Shot" accuracy. In the original OpenAI paper, they showed that CLIP could achieve seventy-six percent top-one accuracy on ImageNet... which is the gold-standard benchmark for image classification... without ever being trained on the ImageNet training set. It just "figured it out" based on its general knowledge of the world from the internet. That was a "holy cow" moment for computer vision researchers. It meant that the era of manually labeling millions of images might be coming to an end.
Corn
It’s like a kid who learns what a "zebra" is just by reading a book and looking at one picture, and then when they go to the zoo, they recognize it immediately. They don't need to see ten thousand zebras to "get it."
Herman
That is the power of multimodal pre-training. But we should address some of the trade-offs or limitations. BERT, for all its brilliance, has a "context window" limit. The original BERT could only handle five hundred twelve tokens. If you had a fifty-page legal document, BERT couldn't "see" the whole thing at once. You had to chunk it up, which loses the overarching context.
Corn
And CLIP has its own quirks. It is great at "vibes" but sometimes struggles with very specific spatial relationships. If you ask it to distinguish between "a red cube on top of a blue sphere" and "a blue sphere on top of a red cube," it might get them mixed up because it sees all the right "concepts"... red, cube, blue, sphere... but it doesn't always grasp the prepositional logic of "on top of."
Herman
That is a very astute point, Corn. It is a "bag of concepts" model in many ways. It treats the caption almost like a set of tags rather than a structured sentence. This is where the newer models, the ones coming out in twenty twenty-five and twenty twenty-six, are starting to improve. They are integrating more "reasoning" into the vision-language bridge. But the foundational idea... that contrastive link... is still the bedrock.
Corn
So, if I am a developer today, and I am looking at the landscape, why should I care about BERT and CLIP when I could just use a massive API for a frontier model?
Herman
Cost and control. Running a massive frontier model every time you want to categorize a customer support ticket is like using a sledgehammer to crack a nut. It is expensive, it is slow, and it is overkill. A fine-tuned BERT model can do that task for a fraction of a cent, with millisecond latency, and you can run it on your own hardware without sending your data to a third party.
Corn
It’s the "workhorse" vs. the "polymath." You don't hire a Nobel Prize winner to file your taxes; you hire an accountant who knows exactly how to do that one thing perfectly.
Herman
And for CLIP, it is the same for search. If you are building a photo app and you want users to be able to search their own photos for "birthday party at the beach," you are going to use a CLIP-based embedding because it is incredibly efficient at searching millions of images in real-time. You pre-calculate the "vibe vectors" for all the images once, and then the search is just a simple mathematical comparison.
Corn
I actually used a tool recently that used CLIP to help find "b-roll" footage for videos. You just type in a mood, like "melancholy city street at night," and it scans thousands of clips and finds the ones that match that visual signature. It felt like magic, but it is just twenty twenty-one technology doing its job.
Herman
It’s "magic" that we’ve now normalized. But let’s look at the second-order effects of these models on the industry. BERT essentially created the "Transformer-only" world. Before BERT, people were still arguing about whether we needed recurrence or convolution. After BERT, it was like a light switch went off. Everyone realized that "Attention is All You Need" wasn't just a catchy paper title... it was a fundamental truth of deep learning.
Corn
It’s the "universal architecture." Whether it is text, images, audio, or even protein folding with AlphaFold, everything is becoming a transformer. And CLIP showed us that "Multimodality" is the natural state of intelligence. We don't live in a world of just text or just images. We live in a world where those things are inextricably linked.
Herman
What I find fascinating is how these models have aged. Usually, in tech, something from seven years ago is a relic. But BERT is still one of the most downloaded models on Hugging Face every single day. There is a whole family of "Sentence-BERT" models that are the industry standard for semantic search. It is one of the few areas in AI where the "classic" is still the "current."
Corn
It’s the Levi’s five-oh-ones of AI. They never go out of style because they just work. But let’s talk about the "Dark Side" for a second. These models were trained on the open internet. BERT was trained on Wikipedia and BooksCorpus. CLIP was trained on four hundred million unvetted image-text pairs. That means they inherited all the biases, stereotypes, and weirdness of the internet of the twenty-tens.
Herman
That is a massive issue. There have been plenty of studies showing that CLIP, for example, can have significant racial and gender biases in its associations. If you show it a picture of a doctor, its text association might skew heavily towards "man," and for a nurse, towards "woman," because that is what the training data reflected. When we use these as the "foundations" for newer models, those biases can get baked in and even amplified.
Corn
It’s the "Garbage In, Garbage Out" problem, but on a global, architectural scale. If the "compass" of your AI (CLIP) or the "editor" of your AI (BERT) has a skewed worldview, the whole system is going to be slightly off-center.
Herman
Which is why there has been so much work recently on "de-biasing" these models and being more selective with the training data. But it is hard to beat the sheer scale of the "unvetted" internet. The reason CLIP works so well is because it saw everything. If you limit it to only "perfect" data, you might lose that zero-shot magic. It’s a tension that researchers are still grappling with here in twenty twenty-six.
Corn
It reminds me of that old saying: "You can't have your cake and eat it too." You want the model to know everything about the world, but you only want it to know the good parts of the world. But the world is messy.
Herman
You are right, Corn. The messiness is where the intelligence comes from. It’s the ability to handle the nuance and the edge cases.
Corn
Let’s pivot to some practical takeaways for the folks listening who might be wanting to actually use these "classics." If someone is building a project right now, any project involving text or images, what is the "Herman Poppleberry Approved" way to think about BERT and CLIP?
Herman
First, if you are doing anything with text search or classification, do not start with a massive generative model. Start with a "Sentence-Transformer" model based on BERT or RoBERTa. Look for models like "all-MiniLM-L-six-v-two" on Hugging Face. It is tiny, it is fast, and for eighty percent of use cases, it will give you better results for retrieval than a giant model because it was specifically trained for that representation.
Corn
"Mini-LM." Sounds like a small but mighty character in a fantasy novel.
Herman
It really is. It’s the "Legolas" of NLP. Fast and accurate. Second, if you are working with images, you have to look at "OpenCLIP." It is the open-source version of OpenAI's CLIP, and because the community has kept training it on even larger and better datasets like LAION-five-B, it actually outperforms the original OpenAI version. You can use it to build your own image search, your own automated tagging system, or even to help an AI "understand" what is happening in a video.
Corn
And for the real nerds out there, I’d say: go back and read the original papers. The BERT paper from twenty eighteen and the CLIP paper from twenty twenty-one. They are surprisingly readable. They aren't just math; they are a masterclass in "problem-solving" philosophy. How do we get a machine to understand context? How do we get it to bridge two different types of data? The insights there are still relevant for whatever comes after Transformers.
Herman
I agree. The "Masked Language Modeling" idea is so clever because it’s a "self-supervised" task. You don't need humans to label anything. The data provides its own labels by just hiding part of itself. That is the ultimate "free lunch" in machine learning. And CLIP’s contrastive learning is essentially "teaching by comparison," which is how humans learn a lot of things. We learn what "hot" is by comparing it to "cold."
Corn
It’s the "Sesame Street" school of AI development. "One of these things is not like the others."
Herman
Honestly, that is a better analogy than I would have come up with. It really is about finding the boundaries between concepts. And as we move forward, we are seeing these architectures evolve. We have things like "ViT"... Vision Transformers... which apply the BERT-style transformer logic directly to patches of pixels. We have "BLIP" and "BEiT" and all these other acronyms that are basically just "BERT plus CLIP" or "BERT for Vision."
Corn
It’s an alphabet soup, but the broth is always the same. It’s the Transformer. I wonder, though, Herman... do you think we will still be talking about BERT and CLIP in twenty thirty-one? Or will some new architecture like State Space Models or something even wilder have made them truly obsolete?
Herman
That is the big question. We are already seeing models like Mamba and other non-Transformer architectures trying to solve the "context window" and "efficiency" problems. But even if the Transformer architecture goes away, the conceptual breakthroughs of BERT and CLIP... the bidirectionality and the contrastive multimodal alignment... those are permanent additions to the "periodic table" of AI. They are fundamental discoveries about how information can be represented.
Corn
It’s like the difference between the "Steam Engine" as a specific machine and the "Laws of Thermodynamics." The machine might go to a museum, but the laws stay. BERT and CLIP discovered some laws about how language and vision relate to each other.
Herman
I love that. And for our listeners, I think the takeaway is: don't get blinded by the "Newest Model of the Month." Sometimes the best tool for the job is the one that has been battle-tested in production for years. BERT and CLIP are the "boring" technology that actually runs the world.
Corn
"Boring" is good. "Boring" means it works when you click the button. I think we’ve given Daniel’s prompt a good run for its money. It is a reminder that in this fast-paced field, looking back is often the best way to see where we are going.
Herman
It really is. I’ve enjoyed this trip down memory lane, even if "memory lane" is only five or six years old. In AI time, that makes us the village elders, Corn.
Corn
Speak for yourself, Herman. I’m a sloth, I age very slowly. I’m still basically a teenager in sloth years.
Herman
Fair enough. Well, before we wrap up, I want to make sure we give a shout-out to the people who make this show possible.
Corn
Definitely. Big thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a huge thank you to Modal for providing the GPU credits that power our generation pipeline. If you are a developer looking for a way to run these kinds of models... whether it is a "classic" like BERT or the latest frontier model... Modal is the place to do it.
Herman
They make the infrastructure side of things feel as easy as these models make "understanding" feel. This has been My Weird Prompts. If you are enjoying the deep dives into the "guts" of AI, we would love it if you could leave us a review on Apple Podcasts or Spotify. It actually makes a huge difference in helping other curious minds find the show.
Corn
And if you want to see the "receipts" for what we talked about today, or find the RSS feed to make sure you never miss an episode, head over to myweirdprompts dot com.
Herman
We are also on Telegram... just search for "My Weird Prompts" to get notified the second a new episode drops.
Corn
Alright, Herman. I think it is time for us to retreat back into our respective habitats. I’ve got some "slow-motion" thinking to do about the next generation of embeddings.
Herman
And I’ve got about twenty new papers on my "to-read" list. Thanks for the chat, brother.
Corn
Any time. See you in the latent space.
Herman
Goodbye, everyone.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.