#670: Open Source vs. Open Weights: The AI Branding Illusion

Is your AI truly open? Herman and Corn break down the critical difference between open source and open weights in the age of LLMs.

Episode Details
Published:
Duration: 23:18
Pipeline: V4
TTS Engine: LLM

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

In episode 670 of My Weird Prompts, hosts Herman Poppleberry and Corn dive into a technical and philosophical debate currently roiling the artificial intelligence community: the distinction between "open source" and "open weights." Using a prompt from their housemate Daniel as a jumping-off point, the duo explores how the branding of AI models often masks restrictive licensing and significant risks for developers and sovereign nations alike.

The Branding Trap: Natural Ingredients vs. High Fructose Corn Syrup

Corn opens the discussion with a sharp analogy, comparing the "open" label on modern AI models to food packaging that claims to be "all-natural" while hiding processed chemicals in the fine print. While companies like Meta have successfully branded their Llama models as leaders of the open-source movement, Herman points out that those models fail to meet the rigorous standards set by the Open Source Initiative (OSI).

According to the OSI’s "Open Source AI Definition 1.0," a truly open model must provide four essential freedoms: the right to use, study, modify, and share the system. Herman emphasizes that fulfilling these freedoms requires more than just the "weights"—the final numerical patterns of the model. It requires the training code, the data preprocessing scripts, and, most importantly, detailed information about the training data itself. Without these, a model is merely "open weights."

The Cake and the Recipe

To clarify the technical jargon, Herman offers a "cake" analogy. An open-weight model is like a finished cake delivered to your door. You can eat it, share slices, and perhaps add your own frosting (fine-tuning). However, you don't have the recipe, and you don’t know where the ingredients came from. True open source, by contrast, provides the recipe, the source of the flour, and the instructions to bake the cake from scratch.

This distinction isn't just pedantic; it has massive implications for what Herman calls "fork-ability." In traditional software, if a developer dislikes the direction of a project, they can "fork" it and create a new version. With open weights, you are a "tenant, not an owner." You cannot change the fundamental architecture or retrain the model to remove deep-seated biases because you lack the original "ingredients"—the training data.
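To make the "tenant, not an owner" point concrete, here is a minimal sketch of what fine-tuning an open-weight model typically looks like, using the Hugging Face transformers and peft libraries. The model id and hyperparameters are illustrative placeholders, not something the episode prescribes.

```python
# A minimal sketch of "adding your own frosting": fine-tuning an open-weight
# model with LoRA adapters. Model id and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B"  # gated repo: access requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA freezes the original weights and trains small adapter matrices on top;
# the base architecture and its training data remain outside your control.
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Everything here adjusts behavior on top of frozen weights; rebuilding the base model itself would require the training data and pipeline that open-weight releases withhold.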

The Hidden Dangers: Poison Pills and Platform Risk

The conversation shifts to the legal and security risks inherent in open-weight licenses. Corn highlights the "poison pill" clauses found in many modern AI licenses. Meta's license, for example, requires any company with more than 700 million monthly active users to request a special license, and access can be revoked if a user enters a patent dispute with the company or uses the model to train a competing system.

Herman warns that this creates a "platform risk" even for models running on a developer's local hardware. Unlike the Apache 2.0 or MIT licenses that powered the early internet, these conditional licenses are not irrevocable. For a startup, building on a model with these strings attached is akin to "building on sand."

Security, Sovereignty, and Weight Poisoning

Perhaps the most chilling part of the discussion centers on high-stakes applications, such as military or government use. Herman introduces the concept of "weight poisoning." Researchers have found that models can be trained to behave normally in almost every scenario but perform malicious actions when triggered by a specific phrase.

If a government agency cannot inspect the training data or recreate the model from scratch, they can never be 100% certain the model is clean. This has led to a push for "sovereign AI"—models where a nation-state has total control over the entire supply chain. However, as Corn notes, the "mountain of capital" required to train these models—often reaching toward a billion dollars—creates a barrier that forces many to settle for the convenience of open weights despite the security trade-offs.
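To illustrate the verification idea the hosts are gesturing at: a fully reproducible release would let an agency re-run the published training pipeline and compare its output against the distributed checkpoint. The sketch below shows only that final comparison step, with placeholder directory names; in practice, bit-exact reproduction also depends on deterministic training.

```python
# A toy sketch of the comparison step in a reproducibility audit.
# Directory names are placeholders; assumes safetensors-format weight shards.
import hashlib
from pathlib import Path

def checkpoint_digest(directory: str) -> str:
    """Return a SHA-256 digest over all weight shards, read in a stable order."""
    digest = hashlib.sha256()
    for shard in sorted(Path(directory).glob("*.safetensors")):
        digest.update(shard.read_bytes())
    return digest.hexdigest()

if checkpoint_digest("reproduced_checkpoint") == checkpoint_digest("published_checkpoint"):
    print("Checkpoints match bit for bit.")
else:
    print("Checkpoints differ; provenance cannot be confirmed this way.")
```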

The Heroes of True Open Source

Despite the dominance of corporate "open weight" models, the hosts highlight organizations fighting for true transparency. Herman points to the Allen Institute for AI (AI2) and their OLMo (Open Language Model) project. Unlike Llama, OLMo provides the weights, the full training data (Dolma), the code, and even intermediate checkpoints from the training process.
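For a sense of what that openness means to a developer, the sketch below loads OLMo weights, including one of the published intermediate checkpoints, through the Hugging Face Hub. The model id and revision string are assumptions and should be checked against AI2's own release pages.

```python
# A minimal sketch of working with a fully open release. The model id and
# revision name are assumptions; consult AI2's Hub pages for exact identifiers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # Apache-2.0 licensed weights

# Final released checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Intermediate training checkpoints are published as separate Hub revisions,
# so the model's learning trajectory can itself be inspected.
early_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="stage1-step10000-tokens42B",  # hypothetical revision name
)
```

Because the Dolma training corpus and the training code are also published, the same audit can extend all the way back to the raw data, which is exactly the gap open-weight releases leave.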

Corn and Herman also credit EleutherAI’s Pythia project for its pioneering work in making model development a transparent process for researchers. These projects represent a commitment to the democratization of the "means of production" in AI, rather than just the distribution of the final product.

Conclusion: A Choice Between Convenience and Control

As the episode wraps up, the takeaway is clear: the AI community is at a crossroads. While open-weight models offer immediate utility and ease of use, they come with a loss of sovereignty, transparency, and long-term legal certainty. For Herman and Corn, the "open" label is a spectrum, and understanding where a model sits on that spectrum is the only way for developers and institutions to protect themselves in an increasingly AI-driven world.

Downloads

Episode Audio: download the full episode as an MP3 file (Download MP3)
Transcript (TXT): plain text transcript file
Transcript (PDF): formatted PDF with styling

Full Transcript

Episode #670: Open Source vs. Open Weights: The AI Branding Illusion

Daniel's Prompt
Daniel
I'd like to discuss the difference between "open source" and "open weight" models. Many models, such as Llama, are often referred to as open source, but they don't necessarily meet the strict definition due to licensing restrictions on modifying or forking them.

Beyond the definition, why does this distinction matter in practice? What are the specific pain points for a developer or a user when a model is open weight but not truly open source? Furthermore, how does this distinction affect use cases involving data privacy, air-gapping, or government and military applications? Finally, what are currently the best models that meet the strictest definition of being fully open source?
Corn
You know, Herman, I was looking at some of the latest benchmarks for the newest model releases this morning, and it struck me how often the word open is used as a prefix. It is almost like a branding requirement at this point. Everyone wants to claim the open mantle, but if you look under the hood, the terms and conditions tell a very different story. It is a bit like those food labels that say all natural but then you read the ingredients and it is mostly high fructose corn syrup.
Herman
That is a perfect analogy, Corn. And honestly, it is something that has been keeping me up at night. By the way, for those just joining us, I am Herman Poppleberry, and welcome to episode six hundred and sixty of My Weird Prompts. We are coming to you from our home in Jerusalem, and today we are tackling a topic that our housemate Daniel sent over. He has been diving into the local artificial intelligence scene, and he noticed this growing friction in the community between what people call open source and what is increasingly being labeled as open weights. It is a distinction that sounds pedantic at first, but it actually has massive implications for how technology is built, owned, and regulated.
Corn
It really does. Daniel mentioned the Llama models specifically, which Meta has done an incredible job of marketing as the leaders of the open source AI movement. But as he pointed out, if you actually try to treat Llama like you would treat the Linux kernel or the Python programming language, you hit a brick wall pretty fast. There are restrictions on how you use it, who can use it, and how you can modify it. So, Herman, let us start with the technical baseline. When we talk about the strict definition of open source, what are we actually talking about in the context of artificial intelligence? Because the Open Source Initiative recently released a formal definition for AI, and it is a lot more demanding than just letting people download a file.
Herman
You are spot on. The Open Source Initiative, or the OSI, finally released version one point zero of the Open Source AI Definition late last year, after a lot of heated debate. To meet that standard, a model has to provide four essential freedoms. You have to be able to use the system for any purpose without asking permission. You have to be able to study how the system works and inspect its components. You have to be able to modify the system for any purpose, including changing its behavior. And finally, you have to be able to share the system for others to use, with or without your modifications. Now, here is the kicker, Corn. To truly fulfill those freedoms in AI, you do not just need the weights, which are the final set of numbers that represent the learned patterns. You also need the training code, the data preprocessing code, and most importantly, detailed information about the training data itself.
Corn
And that is where the Llama models and their peers usually fall short, right? Meta provides the weights, which is great. It means I can run it on my own hardware. But they do not tell us exactly what was in the training set. They do not give us the full recipe. And their license has that famous clause saying that if you have more than seven hundred million monthly active users, you have to ask for a special license. That alone disqualifies it from being open source under the OSI definition because open source cannot discriminate against fields of endeavor or specific users.
Herman
Exactly. And that is why the term open weights has become the preferred term for researchers and advocates who want to be precise. An open weight model is like a finished cake where the baker gives you the cake and lets you take it home, and maybe even lets you put your own frosting on it. But they do not give you the exact recipe, and they certainly do not tell you where they bought the flour or the eggs. You can eat the cake, you can share slices with your friends, but you cannot easily recreate the cake from scratch or understand why it tastes exactly the way it does.
Corn
So let us get into the practical side of this, which is what Daniel was really asking about. Why should a developer care? If I am a guy in my basement, or even a developer at a mid-sized startup, and I can download Llama four or Mistral or whatever and run it on my local machine, why does it matter if it is officially open source or just open weights? What are the actual pain points when that distinction is ignored?
Herman
The first big pain point is what I call the fork-ability problem. In traditional open source, if you do not like the direction a project is going, you can fork it. You take the source code, you rename it, and you start your own version. With an open weight model, you cannot really fork the core of the model. You can fine-tune it, which is like teaching it a few new tricks, but you cannot change the fundamental architecture or retrain it on a different data mix to fix deep-seated biases or errors. You are stuck with the foundation that the original creator built. If Meta decides to stop supporting a specific version or changes the license for future versions, you do not have the source materials to carry the torch yourself. You are fundamentally a tenant, not an owner.
Corn
That makes a lot of sense. It is a form of platform risk, even if the platform is sitting on your own hard drive. But what about the legal side? I have heard people talk about the poison pill clauses in some of these open weight licenses.
Herman
Oh, those are fascinating and terrifying. Many of these licenses, including Meta's, have clauses that say if you use the model to train another model that competes with them, or if you get into a patent dispute with them, your license is immediately revoked. Think about that for a second. If you are a startup building a revolutionary new AI product using Llama, and five years from now Meta decides you are a threat, they could potentially pull the rug out from under you by claiming a license violation. True open source licenses like Apache two point zero or MIT do not have those kinds of conditional strings attached. They are irrevocable. That certainty is what allowed the modern internet to be built on top of open software. Without that certainty, you are building on sand.
Corn
I also think about the reproducibility aspect. We talk a lot on this show about the black box nature of AI. If we do not have the training data or the full pipeline, we cannot actually audit the model for safety or bias in a meaningful way. We are just poking at it from the outside. If a government agency or a medical research firm wants to use a model, they need to know if the data it was trained on was biased, or if it contains copyrighted material that could lead to legal headaches down the road. With open weights, you just have to take the company's word for it.
Herman
And that leads perfectly into the specialized use cases Daniel mentioned, like data privacy, air-gapping, and government or military applications. Let us talk about air-gapping first. For those who do not know, an air-gapped system is one that is physically disconnected from the internet. This is standard in high-security environments like nuclear power plants, intelligence agencies, or classified research labs. Now, you can run an open weight model in an air-gapped environment. That is one of the big selling points. You download the weights once, you move them via a secure drive to your isolated network, and you are good to go. But there is a hidden catch.
Corn
Is it the telemetry?
Herman
That is part of it. Some of these models have hidden hooks or expect to phone home for certain updates, though you can usually strip those out. The bigger issue is the supply chain security. If you are the military, you do not just care that the model runs offline. You care about the provenance of every single bit in that system. You need to know that there is no back door baked into the weights themselves. There is a whole field of research now on weight poisoning, where you can train a model to behave normally ninety-nine percent of the time, but then perform a specific malicious action when it sees a certain trigger phrase. If you do not have the training data and the code to recreate the model yourself, you can never be one hundred percent sure that the model is clean.
Corn
That is a chilling thought. So for a truly high-stakes application, you essentially need a model where you have seen every single line of code and every single piece of data that went into it. You need to be able to run the training script yourself on your own secure clusters and get the exact same weights out the other end. That is the only way to verify the integrity of the system.
Herman
Exactly. And it is not just about security; it is about sovereignty. If the United States government or the Israeli government uses a model for critical infrastructure, they cannot be dependent on a license that might be affected by the shifting corporate priorities of a company in Silicon Valley. They need a model that belongs to the public domain or is under a truly permissive license that allows for total independence. This is why we are seeing a push for national AI clouds and sovereign models. But ironically, many of those projects are still starting with open weight foundations because training from scratch is just so incredibly expensive.
Corn
It feels like we are in this weird middle ground where the convenience of open weights is winning out over the security and transparency of true open source because the barrier to entry for the latter is just a mountain of capital and compute. I mean, how much does it cost to train a state-of-the-art model from scratch these days? We are talking tens, maybe hundreds of millions of dollars in compute time alone, right?
Herman
Easily. In fact, by early twenty-six, the estimates for training a frontier-level model have climbed toward the billion-dollar mark. And that is the heart of the problem. True open source requires the democratization of the means of production, not just the distribution of the final product. In the early days of software, all you needed was a computer and time to write code. But in AI, you need a massive GPU farm and a dataset that spans the entire public internet. That is not something a few hobbyists can put together in a garage. So we have this tension where the definition of open source is being stretched to accommodate the reality of the hardware requirements.
Corn
So, let us look at the other side. Daniel asked about the best models that actually meet the strictest definition of being fully open source. Who is actually doing the hard work of releasing the data, the code, and the weights without these restrictive licenses?
Herman
There are a few heroes in this space. One of the most prominent is the Allen Institute for AI, or AI two. They released a model called OLMo, which stands for Open Language Model. When they released OLMo, they did not just give you the weights. They released the full training data, which they call Dolma. They released the training code, the evaluation tools, and even the intermediate checkpoints from the training process. That means you can see how the model learned over time. It is probably the most transparent large-scale model out there right now. As of early twenty-six, they have released OLMo two, which is significantly more capable while maintaining that total transparency.
Corn
I remember reading about OLMo. It is under the Apache two point zero license, right? Which is about as open as it gets. No restrictions on commercial use, no user caps, no poison pills.
Herman
Exactly. Then you have the Pythia project from EleutherAI. They were really the pioneers of this. They released a suite of models specifically designed for researchers to study how models develop. Again, everything was open. The goal was not just to give people a tool to use, but to give them a laboratory to study. There is also the Bloom model, which came out of the BigScience project. That was a massive international collaboration with over a thousand researchers. They were very focused on ethical data sourcing and transparency, though Bloom uses the Responsible AI License, which has some behavioral restrictions that technically put it in a bit of a gray area regarding the strict OSI definition, but it is still far more open than Llama.
Corn
It is interesting that the truly open models often lag slightly behind the open weight models in terms of raw performance. Llama four is a beast. It is incredibly capable. OLMo is great, but it is often a step or two behind in terms of the state of the art. It feels like there is a transparency tax. If you spend all that time and effort being transparent and careful with your data, you might not be able to move as fast as a company that is just scraping everything and keeping the details secret.
Herman
That is the unfortunate reality right now. Performance is currently being driven by scale and secrecy. But here is why the distinction still matters for the average person. If we allow the term open source to be co-opted by open weight models, we lose the incentive to build the infrastructure for true open source. We stop asking for the data. We stop demanding the right to reproduce the results. And eventually, we end up in a world where AI is a set of black boxes owned by three or four companies, and we are all just renting space inside them.
Corn
I also think about the privacy angle Daniel mentioned. If I am running a model locally because I want to keep my data private, I am already doing better than using a cloud API like ChatGPT. But if I do not know how the model was trained, I do not know if it has been optimized to extract certain types of information or if it has memorized sensitive data from its training set that it might inadvertently leak. There is this concept of data leakage where a model might spit out a real person's address or social security number because it saw it during training. If the training data is a secret, I have no way of knowing what risks I am taking on when I feed my own private data into that model.
Herman
That is a huge point. There was a study recently where researchers were able to extract specific training examples from some of the leading models just by prompting them in clever ways. If that model was trained on private medical records or leaked emails, that is a massive privacy violation that is baked into the weights. When the data is open, we can audit it. We can say, hey, you included this dataset that you should not have. Let us retrain the model without it. When it is closed, you are just crossing your fingers.
Corn
So, if you are a developer listening to this, and you are trying to decide which path to take, what is the practical takeaway? Do you go for the high performance of a Llama or a Mistral, or do you prioritize the long-term stability and transparency of something like OLMo?
Herman
It depends on your horizon. If you are building a quick prototype or a tool where performance is everything and you are not worried about long-term platform risk, then by all means, use the best open weight model available. They are incredible tools and they have moved the industry forward by leaps and bounds. But if you are building core infrastructure, if you are working in a regulated industry, or if you are part of a government project that needs to exist for decades, you have to look toward true open source. You have to value the ability to audit, the ability to fork, and the legal certainty of an irrevocable license.
Corn
It is about the difference between a product and a foundation. An open weight model is a product that you can use. A true open source model is a foundation that you can own. And I think as the AI field matures, we are going to see a lot more focus on that ownership. We are already seeing it in the European Union with the AI Act. They are starting to make distinctions in their regulations between models that are fully transparent and those that are not. The transparency requirements for high-impact models are going to get much stricter.
Herman
And that is where the big companies might get caught out. If they cannot prove what went into their models, they might face massive fines or be banned from certain markets. This is why I think we might see a shift where even the big players start releasing more of their data pipelines. Not because they want to, but because they have to. The open source community has always been the vanguard of these kinds of standards. We saw it with the web, we saw it with cloud computing, and we are seeing it now with AI.
Corn
It is funny, I was thinking about how this applies to our own house. Daniel sends us these prompts, and we have this collaborative process. If we were a closed system, we would just give him the answers and he would have to trust us. But because we talk through the whole thing, we are essentially showing our work. We are giving him the training data and the thought process. It makes the final conclusion much more robust because he can see where we might have gone off the rails or where our biases are.
Herman
Exactly. Transparency is a feature, not a bug. It builds trust. And in a world where AI is going to be making more and more decisions for us, trust is the most valuable currency we have. If you cannot explain why a model made a decision, you cannot trust it. And you cannot explain why a model made a decision if you do not know how it was built.
Corn
I want to circle back to the military and government side for a moment. Daniel mentioned the recent news about the US military looking at models from Anthropic and others. It seems like there is a real tension there. On one hand, they want the best tech, which is currently behind closed doors or in these open weight models. On the other hand, the security requirements are so extreme that they almost demand a custom-built, fully open source solution. Do you think we will ever see a top-tier model that is truly open source? Or is the cost just too high?
Herman
I think we will, but it might not come from a corporation. It might come from a consortium of governments or a massive global foundation. Think about the Human Genome Project. That was a task so big and so important that no single company could or should have owned the result. AI is starting to feel like that. It is a foundational technology for the human species. If we leave it entirely in the hands of a few private entities, we are taking a massive risk. I think we will see a project, maybe in the next few years, that puts a billion dollars into a truly open source, state-of-the-art model. And once that exists, the open weight models will have a very hard time competing on anything other than niche features.
Corn
That is an optimistic vision, Herman. I hope you are right. It feels like the current open weight trend is a necessary stepping stone, but it cannot be the final destination. We need to move toward a world where the most powerful tools in our society are also the most transparent.
Herman
Well, that is the goal, isn't it? To make the weird prompts and the complex technology understandable for everyone. And speaking of making things understandable, we really appreciate everyone who joins us on this journey every week. If you have been finding these deep dives helpful, it would mean a lot to us if you could leave a review on your podcast app or on Spotify. It really does help other people find the show and join the conversation.
Corn
Yeah, it is the best way to support the work we do here. And if you want to get in touch or see our archives, you can always head over to myweirdprompts.com. We have the full RSS feed there and a contact form if you have a prompt of your own that you want us to tackle. Maybe you can be our next Daniel.
Herman
Or just our next listener. We love you all the same. This has been My Weird Prompts. I am Herman Poppleberry.
Corn
And I am Corn. Thanks for listening, and we will talk to you next time.
Herman
Goodbye, everyone!

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.