Alright, we have a really fun one today. Our housemate Daniel has been doing some digital archeology in the back rooms of the internet.
Herman Poppleberry here. And yeah, Daniel was actually playing around with GPT-one. Which, for anyone who joined the AI party in the last couple of years, probably feels like trying to use a stone tool to build a skyscraper.
It is wild. He sent us this audio where he was basically poking at it with a stick and getting frustrated because it kept calling Paris a village and then devolved into absolute gibberish after three sentences. It’s such a perfect starting point for us because GPT-one is often cited as this massive landmark, but if you actually use it today, it feels... well, broken.
It’s not broken, Corn! It’s just... primitive. Think of it like the Wright brothers' first flyer. It stayed in the air for twelve seconds and traveled one hundred and twenty feet. If you tried to use it to fly from Jerusalem to New York today, you’d call it a failure, but at the time, the fact that it flew at all changed the world.
That’s a fair analogy. But Daniel’s question is really the core of it. Why does it seem so much less capable? I mean, it’s still a transformer, right? The "T" in GPT stands for Transformer. That’s the same architecture we’re using for the massive models today. So what happened between that original GPT-one paper from June eleventh, twenty eighteen, and now?
A lot of things, but the biggest one is scale. GPT-one had a hundred and seventeen million parameters. Now, to a regular person, a hundred million sounds like a lot. But compared to what we’re working with in twenty twenty-six, it’s a grain of sand. For context, GPT-four’s parameter count was never disclosed by OpenAI, but third-party estimates put it somewhere around one point eight trillion in a mixture-of-experts setup, and the models we’re seeing this year are pushing even further into agentic autonomy. That’s roughly a fifteen-thousand-fold increase in raw capacity over the original.
Fifteen thousand times. That’s like comparing the brain of a fruit fly to a human. But it’s not just the size, right? Because Daniel mentioned the context window. He said it started outputting gibberish after just a few turns.
Exactly. The context window on GPT-one was only five hundred and twelve tokens. In modern terms, that’s maybe a page of text. But here’s the technical kicker: it used what we call absolute positional embeddings, which means the model literally had a hard-coded limit. It couldn’t see token five hundred and thirteen if it wanted to. And once it hit that limit, it had none of the attention tricks modern models use to stay on track. It would just start attending to its own previous errors: one typo or one weird word choice, and it would get "distracted" by its own mistake and spiral into a feedback loop of nonsense.
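A quick note for anyone following along in code: here is a minimal sketch of what a learned absolute position embedding with a hard five-hundred-and-twelve-row table looks like. This is illustrative PyTorch, not GPT-one's actual implementation; the sizes roughly match the published GPT-one configuration, but they are stated here from memory.

```python
import torch
import torch.nn as nn

# Illustrative sketch of GPT-1-style learned absolute position embeddings.
# The position table is a fixed-size lookup: 512 rows, one row per position.
MAX_POSITIONS = 512   # GPT-one's hard context limit
D_MODEL = 768         # GPT-one's hidden size (per the 2018 paper)
VOCAB_SIZE = 40478    # roughly GPT-one's BPE vocabulary size

token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
position_emb = nn.Embedding(MAX_POSITIONS, D_MODEL)

def embed(token_ids: torch.Tensor) -> torch.Tensor:
    """Add token and position embeddings for one sequence of token ids."""
    seq_len = token_ids.shape[-1]
    if seq_len > MAX_POSITIONS:
        # There is no row 513 in the position table: the model literally
        # has no way to represent where a token past the limit sits.
        raise ValueError(f"{seq_len} tokens exceeds the {MAX_POSITIONS}-token window")
    positions = torch.arange(seq_len)
    return token_emb(token_ids) + position_emb(positions)

embed(torch.randint(0, VOCAB_SIZE, (512,)))    # fits
# embed(torch.randint(0, VOCAB_SIZE, (513,)))  # raises ValueError: out of positions
```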
I love that image of the model getting distracted by its own shadow. But there's another part of Daniel's experience that I think is really important. He mentioned that he asked "How are you?" and it gave him a very cold, robotic "I am an AI model, I do not have feelings." He noted it wasn't tuned for conversation. And I think this is where a lot of people get confused. We think of GPT as a chatbot, but GPT-one was never intended to be a chatbot, was it?
Not even close! This is the biggest misconception about early large language models. Back in 2018, the goal wasn't “let’s make a digital friend.” The goal was "let's see if we can predict the next word in a sentence well enough to understand language." GPT-one was a proof of concept for something called unsupervised pre-training. Before this, if you wanted an AI to do sentiment analysis, you had to train it specifically on a dataset of labeled "happy" and "sad" sentences.
Right, the "supervised" approach. You had to hold the model's hand for every specific task.
Exactly. OpenAI’s big gamble with GPT-one was saying, "Hey, what if we just feed it a ton of books and let it figure out how language works on its own by trying to guess the next word?" They used the BookCorpus dataset, which contained over 11,000 unpublished books scraped from Smashwords, mostly romance, fantasy, and adventure novels.
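For anyone curious what "guess the next word" means as an actual training objective, here is a rough sketch of the standard next-token prediction loss. It assumes a generic PyTorch causal language model; it is not OpenAI's training code, just the shape of the idea.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction loss over a batch of raw text.

    token_ids: (batch, seq_len) integer ids, e.g. tokenized BookCorpus chapters.
    model: any causal language model mapping token ids to per-position logits.
    """
    inputs = token_ids[:, :-1]     # everything except the last token
    targets = token_ids[:, 1:]     # the same text, shifted left by one
    logits = model(inputs)         # (batch, seq_len - 1, vocab_size)
    # Cross-entropy between each position's predicted distribution and the
    # word that actually came next. No labels, no annotators: the text
    # itself is the supervision.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```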
Wait, unpublished books? So GPT-one's entire worldview was shaped by aspiring novelists writing about dragons and star-crossed lovers?
Pretty much! That’s why it has that weird, slightly dramatic flair sometimes. It was literally trained on the tropes of indie romance and sci-fi. But the point was that after it "read" those books, it had a general understanding of English. Then, if you wanted it to do a specific task, you would "fine-tune" it. You’d give it a much smaller, specific dataset for, say, answering questions. But the "base" model—the one Daniel was playing with—was just a raw text predictor. It had no idea it was supposed to be talking to a human.
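And here is a rough sketch of what that fine-tuning step looks like: bolt a small linear head onto the pretrained stack and train it on a modest labeled dataset. The pretrained_body module and the sizes are stand-ins, not GPT-one's real code, and the original paper also kept the language-modeling loss as an auxiliary objective, which is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierOnTopOfLM(nn.Module):
    """Sketch of GPT-one-style fine-tuning: a small linear head on top of
    the pretrained transformer, trained on a task-specific dataset."""

    def __init__(self, pretrained_body: nn.Module, d_model: int = 768, n_classes: int = 2):
        super().__init__()
        self.body = pretrained_body          # stands in for the pretrained GPT stack
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.body(token_ids)        # (batch, seq_len, d_model)
        last = hidden[:, -1, :]              # state after reading the whole input
        return self.head(last)               # (batch, n_classes) task logits

def fine_tune_step(model, optimizer, token_ids, labels):
    """One supervised step on the small task-specific dataset."""
    logits = model(token_ids)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```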
This is such a crucial distinction. Today, when we use a model, we’re actually using a very layered cake. You have the base model that’s read the whole internet, then you have the instruction tuning where it learns to follow commands, and then you have RLHF—Reinforcement Learning from Human Feedback—where humans literally rate its answers to make it sound more "helpful" and "pleasant."
Right. GPT-one had zero of that. No instruction tuning, no RLHF, no safety filters. It was just a raw engine. If you asked it a question, it wasn't trying to "answer" you; it was trying to complete the document. If your prompt looked like the start of a Q and A, it might try to act like a Q and A. But if it got confused, it might decide the best "next word" was just the letter "A" repeated a hundred times.
It’s like the difference between an engine sitting on a workbench and a finished Tesla. Daniel was basically trying to drive the engine block and wondering why there were no cup holders or a steering wheel.
That is a perfect way to put it. And to address Daniel’s point about BERT, since he mentioned that name comes up a lot: BERT was Google’s big model, released just a few months after GPT-one. BERT was an "encoder-only" model, while GPT was "decoder-only." BERT was actually better at a lot of things back then, like understanding the relationship between words in a sentence, because it could look at a word and see the context both before and after it. On the GLUE benchmark, BERT-large averaged around eighty-two, versus roughly seventy-five for GPT-one.
Oh, I remember this. "Bidirectional Encoder Representations from Transformers."
Look at you with the full acronym! Yeah, BERT was the king of "understanding." But GPT was the king of "generating." Because GPT only looks at what came before, it’s much better at actually writing new text. And it turns out, as we’ve seen over the last eight years, the "generating" path was the one that led to general intelligence. If you can generate the next word perfectly, you have to understand the world.
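For the curious, the encoder-versus-decoder difference largely comes down to the attention mask. A minimal sketch, assuming nothing beyond plain PyTorch: a GPT-style causal mask lets each position attend only to what came before it, while a BERT-style encoder skips the mask entirely.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """GPT-style (decoder-only) attention mask: position i may attend only
    to positions 0..i. A BERT-style encoder uses no such mask, so every
    position sees the whole sentence, before and after."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```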
So, if GPT-one wasn't a chatbot, what were people actually using it for in twenty eighteen? I mean, was there any practical application, or was it just a lab experiment?
It was mostly a benchmark buster. GPT-one came out and absolutely smashed the state-of-the-art on nine out of twelve of the tests they used at the time. That was the "Aha!" moment for the industry. It proved that pre-training on a generic dataset actually helped the model perform better on specific tasks it had never seen before. It was the birth of the "foundation model."
That’s the "Generative Pre-trained" part. The "Pre-trained" is the revolution.
Exactly. Before that, everyone thought you needed specialized architectures for every task. GPT-one said, "No, just give me a big enough transformer and enough books, and I can learn to do anything with just a little bit of extra training."
It’s interesting, though, because Daniel’s experience with it outputting gibberish highlights how much we take for granted now. We expect these models to be coherent for thousands of words. But back then, even getting a paragraph that didn't contradict itself was a win. I remember reading that GPT-two—the successor—was the one where people started getting actually worried about its ability to generate "deceptive" or "too-convincing" text.
Yeah, GPT-two was the one where OpenAI initially said, "This is too dangerous to release." Which, looking back at it from twenty twenty-six, feels almost adorable. GPT-two had one point five billion parameters, more than ten times the size of GPT-one, and that jump in scale was enough to take the text from "mostly gibberish" to "wait, did a human write this?"
It’s that exponential curve. But let’s go back to Daniel’s question about early applications. Were there any prototypes for chatbots back then using GPT-one?
There were some very early experiments, but they were mostly in the "creative writing" space. People would use it to help them write stories. You’d write a sentence, and the model would give you the next one. It was very "collaborative" because the model was so unstable that the human had to do all the heavy lifting to keep the story on the rails. It was more like a sophisticated auto-complete than a conversation.
I actually remember a few "AI Dungeon" style games that started around then. They were using these early models to try and generate infinite adventure stories. And yeah, they were notorious for having the characters suddenly turn into inanimate objects or the plot just dissolving into a loop about a "dark and stormy night."
Right! And that brings us back to the context window. If the model can only "remember" the last five hundred or so tokens, it’s going to forget that you’re holding a sword or that you’re in a cave. By the time you get to the end of the page, the "beginning" of the page is gone from its memory. It’s like a goldfish trying to write a novel.
A goldfish on romance novels! It’s a hilarious image. But okay, so if we look at where we are now—today is January twentieth, twenty twenty-six—we’re seeing models that can process millions of tokens. We’re talking about "infinite" context windows, or at least windows large enough to hold an entire library. When you look back at GPT-one from this vantage point, does it feel like a dead end or a direct ancestor?
Oh, it’s a direct ancestor. Absolutely. The core idea—unsupervised pre-training on a decoder-only transformer—is still the heart of everything. We just added more layers, more parameters, better data, and that crucial "alignment" layer that makes it talk like a person. If you stripped away all the safety filters and the instruction tuning from GPT-five today, you’d still see the ghost of GPT-one in there. It’s just a much, much smarter ghost.
It’s like looking at the first single-celled organism. It’s not "bad" at being an organism; it’s just the starting point. I think what’s really fascinating is that Daniel was able to just "download" it and run it. That’s another huge change. In twenty eighteen, running a hundred-million-parameter model required some decent hardware. Now, you can run it on a phone or even in a browser.
Yeah, the "smallness" of it is now its greatest feature. We use models that size for very specific, tiny tasks now—like "is this email spam?" or "what is the sentiment of this tweet?" We don't need a trillion parameters to tell us a tweet is angry. So GPT-one’s "size" has actually become the standard for "edge" AI.
That’s a great point. We’ve gone through this cycle where we built these massive, god-like models, and now we’re trying to shrink that intelligence back down into "small" models that are the same size as GPT-one but way more capable because the training techniques have improved so much.
Exactly. A "small" model today with a hundred million parameters would absolutely smoke GPT-one, even though they’re the same size. We have better data now. We don't just use over 11,000 unpublished books; we use high-quality synthetic data, curated web scrapes, and textbooks. The "quality" of the "reading material" matters just as much as the "brain size."
So, for the listeners who want to try this themselves—because it is a trip—what should they look for? If they go to a site like Hugging Face and pull up these ancient models, what’s the best way to interact with them to see what they were actually meant for?
Don't treat it like a chatbot. Treat it like a "complete the sentence" game. Give it a very clear, structured start. Like, "The capital of France is..." and see if it says "Paris." Or start a story with a very specific style and see if it can mimic it for three sentences. If you try to argue with it or ask it for its "opinion," you’re just going to get the gibberish Daniel found. It’s a mirror, not a person.
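If you do want to poke at it yourself, a sketch like the following is roughly all it takes, assuming the original GPT-one checkpoint hosted on Hugging Face under the openai-community/openai-gpt id and the Transformers library installed.

```python
# Sketch of poking at GPT-one the way it was meant to be used: as a raw
# text completer, not a chatbot. Assumes the original weights are hosted
# on Hugging Face under the "openai-community/openai-gpt" id.
from transformers import pipeline

generator = pipeline("text-generation", model="openai-community/openai-gpt")

# Structured, fill-in-the-blank prompts play to its strengths...
print(generator("The capital of France is", max_new_tokens=5)[0]["generated_text"])

# ...open-ended conversation does not. Expect this one to wander.
print(generator("How are you feeling today?", max_new_tokens=40)[0]["generated_text"])
```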
It’s a mirror that’s been shattered and glued back together. I think the takeaway for me is just the sheer speed of this. We’re talking about less than eight years. In eight years, we went from a model that couldn't stay on topic for a paragraph to models that are passing bar exams and diagnosing rare diseases.
It’s the fastest technological ramp-up in human history. And I think it’s important to remember that people were blown away by GPT-one at the time. There were headlines about how "OpenAI’s new model understands language better than ever." We were impressed by the "village of Paris" back then because before that, models couldn't even get "Paris" right half the time!
It’s all about perspective. I remember being impressed by my first flip phone because it could send a text message. Now I’m annoyed if my watch doesn't translate a foreign language in real-time. We’re very good at moving the goalposts on "impressive."
We really are. But I think there’s something beautiful about going back and looking at these early versions. It reminds you that this isn't magic. It’s engineering. It’s math. It’s trial and error. GPT-one was a very successful error that pointed the way to everything else.
Well, I think we've thoroughly dissected the "village of Paris." This was a great prompt from Daniel. It’s easy to forget the history when the future is moving so fast.
Definitely. And hey, if you’re out there listening and you’ve been enjoying these deep dives into the "weird" side of AI and tech, we’d really appreciate it if you could leave us a review on Spotify or wherever you get your podcasts. It actually helps a lot more than you’d think.
Yeah, it keeps the lights on and the sloths fed. You can find us at myweirdprompts.com if you want to send in your own prompt—maybe you found an old AI in a digital basement somewhere that we should talk about.
Or just a question that’s been bugging you. We’re here for all of it.
Alright, that’s Episode two hundred and twenty-six. Thanks for hanging out with us in Jerusalem. I’m Corn.
And I’m Herman Poppleberry. We’ll talk to you next week.
Stay curious, everyone.
And keep poking the models with sticks. It’s the only way to learn.
Exactly. Bye everyone!
Bye!