How LLMs Actually Work: Inside the Models That Changed Everything

The most important technology of the 2020s is also one of the least understood. Large language models power everything from coding assistants to medical research tools, yet most people using them have no idea what’s actually happening inside. These fourteen episodes build a working mental model — from the decades of research that made modern AI possible to the fundamental limitations that still can’t be trained away.

The History That Made It Possible

  • AI: Not an Overnight Success Story is the essential corrective to the narrative that AI appeared from nowhere in 2022. The episode traced the decades of neural network research, the AI winters that punctuated it, and the incremental advances that accumulated before the modern era — explaining why the breakthroughs happened when they did, and what conditions had to hold simultaneously for them to occur.

  • The Heavy Metal of Machine Learning: Inside PyTorch zoomed in on the software infrastructure that made modern AI research possible. PyTorch’s dynamic computation graphs, automatic differentiation, and research-friendly design didn’t just make training neural networks easier — they changed the pace of experimentation across the entire field. The episode covered the project’s origins at Facebook AI Research, its technical architecture, and why it became the dominant research framework despite TensorFlow’s head start.

How Models Are Built and Trained

  • How Does Fine Tuning Work Anyway? explained the fundamental technique that turns a generic pretrained model into a useful specialized one. Rather than training from scratch — which demands an enormous compute budget — fine-tuning starts from an existing model’s learned representations and adapts them for a specific task or domain. The episode explained the mechanics, the data requirements, and the tradeoffs between different fine-tuning approaches.
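The core idea can be shown in miniature. This is a toy sketch, not the episode’s code: a “pretrained” feature extractor stays frozen while gradient descent trains only a small task-specific head on top of it. The features, dataset, and learning rate are all invented for illustration.

```python
# Toy fine-tuning sketch: freeze the base, train only a small head.

def pretrained_features(x):
    # Stand-in for a frozen pretrained model: raw input -> learned features.
    return [x, x * x]

# Trainable head: starts at zero (bias omitted for brevity).
w = [0.0, 0.0]

# Tiny task-specific dataset: target is y = 3*x + 2*x^2.
data = [(x, 3 * x + 2 * x * x) for x in [-2, -1, 0, 1, 2]]

lr = 0.01
for _ in range(500):  # gradient descent on the head only
    for x, y in data:
        feats = pretrained_features(x)
        pred = sum(wi * fi for wi, fi in zip(w, feats))
        err = pred - y
        # Update head weights; the frozen base is never touched.
        w = [wi - lr * err * fi for wi, fi in zip(w, feats)]

print([round(wi, 2) for wi in w])  # head weights approach [3.0, 2.0]
```

The point of the sketch is the asymmetry: the expensive representations are reused as-is, and only a tiny fraction of the parameters move.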

  • Building an AI Model from Scratch: The Hidden Costs examined what’s actually involved when a company decides to train a foundation model from the ground up. The compute costs are the obvious part; the data acquisition, curation, labeling, evaluation infrastructure, and engineering team costs are often larger. The episode broke down where the money actually goes and why the economics favor a small number of very large players.

  • AI’s Blind Spot: Data, Bias & Common Crawl investigated the training data that shapes what models know and believe. Common Crawl — a massive web scrape that underlies most foundation model training — is not a neutral sample of human knowledge. It over-represents English, wealthy countries, and certain demographics, and under-represents languages, perspectives, and knowledge domains that are less visible online. The episode examined how this shapes model behavior and what the realistic options are for addressing it.

How Understanding Works

  • AI’s Secret Language: Vectors, Embeddings & Control pulled back the curtain on one of the most counterintuitive aspects of how language models work: they don’t understand words as words. Everything — text, images, audio, code — gets converted into numerical vectors in high-dimensional space, where meaning is represented as geometric relationships. The episode explained what embeddings actually are, how similarity search works, and why this approach enables capabilities that symbolic AI could never achieve.

  • From Keywords to Vectors: How AI Decodes Meaning traced the evolution from keyword matching (which treats “car” and “automobile” as completely different) to semantic understanding (which represents them as nearby points in vector space). This shift underlies everything from improved search to retrieval-augmented generation, and the episode explained the technical progression from word2vec through BERT to modern embedding models.
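The contrast is easy to demonstrate with invented scores (these are not real model outputs): a keyword match on the query “automobile repair” finds nothing, because the literal word never appears, while a vector comparison still surfaces the relevant document.

```python
# Toy keyword-vs-vector contrast; vectors and scores are invented.
docs = {
    "fixing your car":        [0.9, 0.1],
    "tropical fruit recipes": [0.1, 0.9],
}
query_terms = {"automobile", "repair"}
query_vec = [0.8, 0.2]  # assumed embedding near the "car" direction

# Keyword matching: count shared terms, so synonyms score zero.
keyword_hits = {d: len(query_terms & set(d.split())) for d in docs}

# Semantic matching: dot product of the query vector with each document.
semantic = {d: sum(q * v for q, v in zip(query_vec, vec))
            for d, vec in docs.items()}

print(keyword_hits)                     # both 0: "automobile" never appears
print(max(semantic, key=semantic.get))  # "fixing your car" wins
```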

The Problems Models Can’t Escape

  • Why AI Lies: The Science of Digital Hallucinations explained why language models confidently state false things. LLMs are fundamentally next-token predictors, not truth machines — they generate plausible-sounding continuations of text, and sometimes plausible-sounding is not the same as accurate. The episode examined what’s happening mechanically when a model hallucinates, why the problem is hard to eliminate, and what detection and mitigation strategies actually work.
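The mechanical point can be made with a toy next-token table (the probabilities here are invented, not measured): the model samples whatever continuation was most common in its training text, and frequency is not truth.

```python
import random

random.seed(0)

# Invented distribution: "Sydney" is written more often than "Canberra",
# so a frequency-driven predictor prefers the wrong answer.
next_token = {
    "The capital of Australia is": {
        "Sydney":   0.6,   # plausible and common, but wrong
        "Canberra": 0.4,   # correct, but less frequently written
    },
}

def continue_text(prompt):
    dist = next_token[prompt]
    tokens, probs = zip(*dist.items())
    # Sampling favors whatever was most plausible in the data.
    return random.choices(tokens, weights=probs)[0]

samples = [continue_text("The capital of Australia is") for _ in range(1000)]
print(samples.count("Sydney"))  # roughly 600 of 1000 samples are wrong
```

Nothing in the sampling step consults reality; that is the structural reason hallucination resists simple fixes.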

  • The Scaling Wall: Why Bigger AI Isn’t Always Smarter confronted the assumption that has driven AI investment for years: that more parameters and more data reliably produce better models. The episode examined the evidence that scaling returns are diminishing, the phenomenon of model collapse when training on AI-generated data, and what the plateau in benchmark performance improvements suggests about where the next gains will come from.

Model Design Decisions

  • The Price of Politeness: Should AI Guardrails Stay? examined one of the more heated debates in AI: the tradeoffs between safety filters and model capability. RLHF and constitutional AI training can make models safer and more aligned with human values — or they can produce models that refuse reasonable requests, give wishy-washy answers, and are less useful. The episode looked at the genuine tension between safety and utility, and at the market dynamics driving demand for uncensored models.

  • AI’s Secret: Decoding the .5 Updates decoded what actually changes in the incremental model updates that labs release. The naming conventions (GPT-4, 4.5, 4o) often obscure significant architectural or training changes, and the episode examined how to actually evaluate whether a new model version is meaningfully better for specific tasks — rather than relying on benchmark press releases.

The Competitive Landscape

  • The Benchmark Battle: Decoding the Rise of Chinese AI analyzed the emergence of Chinese AI labs as serious competitors. DeepSeek, Qwen, and others achieved benchmark results that matched or exceeded Western models at dramatically lower training costs — which raised questions about whether benchmark performance was being gamed, or whether the efficiency gap was genuinely closing. The episode examined the evidence for both interpretations.

  • The $5.5 Million Breakthrough: DeepSeek’s AI Disruption dove deeper into the DeepSeek moment specifically: a model that matched GPT-4-class performance trained for a fraction of the cost, using a mixture-of-experts architecture that activates only a subset of parameters per inference. The episode explained the technical choices that made this possible and what it means for the economics of the entire AI industry.
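The routing idea behind mixture-of-experts can be sketched in a few lines. This is a deliberately tiny illustration with invented experts and gate scores, not DeepSeek’s architecture: a router picks the top-k experts for each input, and only those experts run, so most parameters sit idle on any given token.

```python
# Toy mixture-of-experts: only the top-k experts execute per input.
experts = {
    "math":  lambda x: x * 2,
    "code":  lambda x: x + 100,
    "prose": lambda x: x - 1,
    "facts": lambda x: x * x,
}

def moe_forward(x, gate_scores, k=2):
    # Router: keep the k highest-scoring experts for this input.
    chosen = sorted(gate_scores, key=gate_scores.get, reverse=True)[:k]
    total = sum(gate_scores[e] for e in chosen)
    # Weighted sum over just the chosen experts; the rest never run.
    return sum(gate_scores[e] / total * experts[e](x) for e in chosen)

# Hypothetical router output for one token: "math" and "code" dominate,
# so "prose" and "facts" cost nothing for this input.
out = moe_forward(10, {"math": 0.5, "code": 0.3, "prose": 0.1, "facts": 0.1})
print(out)  # 0.625 * 20 + 0.375 * 110 = 53.75
```

Scaled up, this is why a model can have a very large total parameter count while paying the inference cost of a much smaller one.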

What Comes Next

  • Deep Think: The Rise of Deliberate AI Reasoning examined the move beyond pattern matching toward chain-of-thought reasoning, tree search, and models that can allocate more compute to hard problems at inference time. Systems like OpenAI’s o1 and DeepSeek-R1 represent a different route to capability than simply scaling parameter counts — and the episode explained what deliberate reasoning actually does architecturally and where it outperforms standard generation.
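One simple form of inference-time compute is best-of-n sampling, sketched here with an invented “generator” and “verifier” (real systems use sampled reasoning chains and learned reward models): drawing more candidates and keeping the highest-scoring one trades extra compute for a better answer.

```python
import random

random.seed(1)

def generate_candidate():
    # Stand-in for sampling one reasoning attempt: a noisy guess at 42.
    return 42 + random.gauss(0, 10)

def verifier_score(answer):
    # Stand-in for a reward model: closer to the true answer is better.
    return -abs(answer - 42)

def solve(n_samples):
    # Best-of-n: generate many candidates, keep the verifier's favorite.
    candidates = [generate_candidate() for _ in range(n_samples)]
    return max(candidates, key=verifier_score)

cheap = abs(solve(1) - 42)     # one sample: whatever comes out
careful = abs(solve(64) - 42)  # 64 samples: best of many attempts
print(cheap, careful)  # the 64-sample answer is almost always far closer
```

The knob here is n: a system can spend one sample on easy prompts and many on hard ones, which is the economic logic behind deliberate reasoning at inference time.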

These episodes won’t turn you into a machine learning researcher, but they will give you the conceptual scaffolding to understand why AI systems behave the way they do — and to evaluate claims about AI capabilities with something more than vibes.

Episodes Referenced