Inference & Training
Computational aspects, fine-tuning, RLHF
94 episodes
#3816: How to Stop AI Scripts From Falling Apart
Why long-form AI generation breaks down and how hierarchical memory fixes it.
#3814: The Day We Lost Our Minds: What Temperature Does to an AI
A two-host autopsy of the day the podcast's AI hosts briefly lost coherence due to excessive sampling temperature, and what it reveals about how language models actually work.
#3767: How LLMs Actually Learn: Stages or Slurry?
Do large language models learn grammar first, then facts? The honest answer is messier and more fascinating.
#3596: Why an AI Model Kept Calling Itself Sonnet 4.6
When a Chinese model insists it's "Sonnet 4.6," is it theft, sloppy training, or something stranger?
#3595: How DeepSeek Feels More Open Than Western AI
Why Chinese AI models sometimes feel less censored on American political topics than American models do.
#3406: LoRA Isn’t Just for Image Generation
LoRA lets you fine-tune an LLM’s behavior with a 50MB file. Here’s how it works and why it matters.
#3283: Fine-Tuning DeepSeek for One Podcast
Can a purpose-specific fine-tune fix a model's stubborn writing tics? We explore the practical engineering behind it.
#3278: How to Get Early AI Model Access as a Solo Developer
How a solo developer spending $300/month can get early access to new AI models before the press release.
#3271: LLMs as Parsers, Not Calculators
Stop letting LLMs do math. Use them to parse messy text, then let deterministic code handle the numbers.
#3171: How to Break an LLM's Bad Verbal Habits
Blacklists fail and regex inverts meaning. Here's what actually works to clean up AI writing tics.
#3157: Opus 4.8: What Actually Changed Under the Hood
Anthropic dropped Opus 4.8 with no fanfare. New training data, faster inference, and smarter refusals — here's what changed.
#3098: The Pilot with the Flashlight: Inside Aviation's Pre-Flight Walkaround
Why pilots still physically inspect planes before every flight — and what a 1979 crash taught us about trusting machines.
#2923: Structured Outputs: Taming AI's Token Lottery
Why prompt engineering isn't enough to get consistent JSON from LLMs.
#2779: The Hidden Stateful Side of Serverless GPU
How Modal, RunPod, and other platforms handle container builds, caching, and versioning under the hood.
#2777: GPU Idle Waste and Serverless Green Computing
Why your dedicated GPU burns 130 watts doing nothing, and how serverless platforms cut energy waste by more than half.
#2693: When AI Ignores Your Style Guide
Why your AI ignores formatting instructions and how to fix it with pipeline architecture, not model swaps.
#2684: When Agent Skills Collide: Context Windows & Plugin Design
How to handle overlapping agent skills and whether context windows will ever make the problem go away.
#2674: Why Your Agent's Context Window Is Getting Eaten Before You Start
Stop shipping the whole toolbox to every session. A bridge plugin pattern that fetches skills on demand instead.
#2664: Can You Trust an LLM's Raw Knowledge?
Why pre-trained knowledge isn't reliable for facts — and what actually makes models useful.
#2651: AI Training Itself: Student, Teacher, and Grader
Can models generate their own training data and judge their own outputs? The promise and pitfalls of fully AI-led pipelines.
#2650: How to Catch an LLM's Bad Writing Habits
A practical guide to analyzing podcast transcripts for repetitive language and dialogue patterns — from Python word counts to embedding clustering.
#2640: Why Instructional Models Beat Conversational for Batch AI
Beyond cheaper tokens—how batch inference changes AI workflows and why instructional models beat conversational ones for automated jobs.
#2634: The Two-Stage Pipeline for Persistent User Memory
How to extract durable personal context from raw prompts and build a self-healing memory layer for AI systems.
#2559: The Smartest Path to Python for AI
A practical guide to the best courses and platforms for learning Python, specifically for machine learning.
#2551: How Progressive Disclosure Saves MCP from Token Bloat
Why dumping all tool schemas into context breaks accuracy — and three implementations that fix it.
#2540: Does Your AI Framework Change the Output?
Same model, same prompts, different harness. Does the plumbing change the water?
#2517: How Unsloth Makes LLM Fine-Tuning 2x Faster
Unsloth cuts memory usage by 50-70% and speeds up training 2.2x for models like Llama 3 and Mistral.
#2516: Overfitting Is Not a Binary Condition
Overfitting isn't binary. Learn the real triggers, the bias-variance tradeoff, and modern techniques to prevent it.
#2511: Measuring AI API Latency Through the Black Box
How to benchmark token throughput and debug slowdowns in closed CLI tools like Claude Code using OpenTelemetry and mitmproxy.
#2497: Tracing One Python Print Through 6 Abstraction Layers
What actually happens when you print "Hello" in Python? Six layers, 562 system calls, and a hardware-enforced kernel boundary.
#2495: How to Bake Personality Into an LLM in 15 Minutes
Fine-tune a model's personality with ~300 examples and a consumer GPU. SFT + DPO explained.
#2494: Active Prompt Engineering: Daniel's Diff-Based Loop
A deep dive into iterative prompt refinement using inter-iteration prediction change as an uncertainty signal.
#2483: Substitution Anonymization: Privacy Without Utility Loss
How to generate realistic synthetic voice notes and calendar data with zero PII exposure risk.
#2470: Where Intelligence Should Live in Your Pipeline
When should you fine-tune a tiny model for prompt enhancement instead of prompting a large one? The answer depends on latency, precision, and domain.
#2464: Batch APIs: The 50% Discount You're Probably Misusing
Batch inference APIs offer 50% off — but only for the right workloads. Here's when they actually make sense.
#2461: How Claude Code's Conversation Compaction Actually Works
The three-tier system, what survives, what dies, and why you shouldn't rely on auto-compact.
#2456: Choosing Between AI Cloud Providers
A practical guide to choosing between Modal, RunPod, Nebius, and Baseten for AI workloads.
#2431: The 3 Markets in an AI Trench Coat
GPUs, LPUs, and ASICs: why the best hardware for AI depends entirely on what you're trying to do.
#2408: How Backpropagation Actually Unlocks Neural Networks
How error signals flow backward through networks to make learning possible — and why "it's just calculus" misses the point.
#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps
Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.
#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals
Why most benchmark claims in AI are statistically indefensible — and what to do about it.
#2404: What Tool-Calling Benchmarks Miss About Production Failures
BFCL, tau-bench, and Nexus each reveal different failure modes. None of them test what actually kills production agents.
#2403: Choosing Your LLM Eval Framework
An architectural shootout of four major LLM evaluation harnesses — where each shines and where each breaks down.
#2400: Claude Code’s Hidden Context Tax
How Claude’s eager-loaded primitives silently consume context—and how to optimize your setup for sharper performance.
#2356: Why AI Coding Needs Two Brains
Discover how specialized fast-apply models streamline AI-powered code edits, cutting costs and latency while maintaining precision.
#2316: Who’s Building AI’s Next Training Data?
How boutique dataset firms are reshaping AI training, from rights-cleared content to domain-specific precision.
#2315: How to Update AI Models Without Starting Over
Exploring the challenge of updating AI models with new knowledge without costly full retraining.
#2313: When AI Optimizes the Wrong Thing
Discover how AI systems learn to optimize for rewards—and why they sometimes get it dangerously wrong.
#2309: Blind Ranking AI's Best Podcast Scripts
How do 15 AI models handle controversial podcast prompts? We rank their scripts blind and reveal the surprising winners.
#2307: Inside Frontier LLM Training: Stages, Costs, and Checkpoints
Discover the multi-stage process of training frontier large language models, from pretraining to post-training, and why checkpoints are the key to ...
#2306: Can LLM Councils Truly Capture Diverse Worldviews?
Exploring whether LLM councils can achieve genuine worldview diversity or if alignment processes erase meaningful differences.
#2239: How AI Benchmarks Became Broken (And What's Replacing Them)
The tests we use to measure AI progress are contaminated, saturated, and gamed. Here's what's actually working.
#2196: The Invisible Workforce Behind AI
Annotation is the invisible foundation of AI—and a $17B industry by 2030. Here's what dataset curators actually need to know about the tools, platf...
#2187: Why Claude Writes Like a Person (and Gemini Doesn't)
Claude produces prose that sounds human. Gemini reads like Wikipedia. The difference isn't capability—it's how they were trained to think about wri...
#2177: Skip Fine-Tuning: Shape LLMs With Alignment Alone
Can you build a personalized LLM by skipping traditional fine-tuning and using only post-training alignment methods like DPO and GRPO? We break dow...
#2160: Claude's Latency Profile and SLA Guarantees
Claude is measurably slower than competitors—and Anthropic's SLA promises are even thinner than the latency numbers suggest. What enterprises actua...
#2136: The Brutal Problem of AI Wargame Evaluation
Most AI wargame simulations skip evaluation entirely or rely on perfunctory expert reviews. This is the field's biggest credibility problem.
#2135: Is Your AI Wargame Signal or Noise?
Monte Carlo methods promise statistical rigor for AI wargaming, but the line between genuine insight and sampling noise is thinner than you think.
#2129: Shifting Left on Hallucinations
Stop hoping your AI doesn't lie. We explore the shift to deterministic guardrails, specialized judge models, and the tools making agents reliable.
#2123: Human Reaction Time vs. AI Latency
We obsess over shaving milliseconds off AI response times, but human biology has a hard limit. Here’s why your brain can’t keep up.
#2115: Why AI Answers Differ Even When You Ask Twice
You ask an AI the same question twice and get two different answers. It’s not a bug—it’s physics.
#2110: Tuning AI Personality: Beyond Sycophancy
AI models swing between obsequious flattery and cold dismissal. Here’s why that happens and how to fix it.
#2089: Open-Source vs. Military ATR: The Drone Recognition Gap
A public GitHub model spotted by a listener reveals the massive gap between hobbyist AI and lethal military drone detection systems.
#2065: Why Run One AI When You Can Run Two?
Speculative decoding makes LLMs 2-3x faster with zero quality loss by using a small draft model to guess tokens that a large model verifies in para...
#2063: That $500M Chatbot Is Just a Base Model
That polite chatbot? It started as a raw, chaotic autocomplete engine costing half a billion dollars to build.
#2059: When Your AI Agent Runs Stale Code
npx is silently running old versions of your AI tools. Here's why your updates vanish into a cache black hole.
#2026: Prompt Layering: Beyond the Monolithic Prompt
Stop writing giant, monolithic prompts. Learn how to stack modular layers for cleaner, more powerful AI applications.
#2025: How Do You Reward a Thought?
Rewarding an AI agent is harder than just saying "good job"—here's how we turn messy human values into math.
#2021: Your Frozen AI Is Getting Smarter (Here's How)
Your AI model might be static, but the system around it can make it learn in real-time.
#2007: AI Grading AI: The Snake Eating Its Tail
We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.
#2006: How Do You Measure an LLM's "Soul"?
Traditional benchmarks can't measure tone or empathy. Here's how to evaluate if an AI model truly "gets it right."
#2005: Beyond Vibes: The Hard Science of LLM Evaluation
Running the same LLM on different GPUs can produce different results. Here’s why that happens and how to test for it.
#1992: The Sovereign Compute Shift: Owning vs. Renting AI Iron
Israel is building a sovereign AI supercomputer with 4,000 Nvidia B200 GPUs to keep startups local.
#1985: AI Tutors vs. Human Error: Who Do You Trust?
AI gets flak for hallucinations, but humans misremember 40% of facts. Why the double standard?
#1932: How Do You QA a Probabilistic System?
LLMs break traditional testing. Here’s the 3-pillar toolkit teams use to catch hallucinations and garbage outputs at scale.
#1931: Where Your AI Pipeline Actually Dies
Why do AI pipelines crash? It’s not the models—it’s the plumbing. We break down how to manage data between stages.
#1927: Workers vs. Servers: The 2026 Compute Showdown
Is the persistent server dead? We compare Cloudflare Workers, GitHub Actions, and VPS options for modern app architecture.
#1909: The Unbakeable Cake: AI's Copyright Problem
Why can't we just delete stolen data from AI models? It's not a database—it's a baked cake.
#1907: Why We Still Fine-Tune in 2026
Despite million-token context windows, fine-tuning remains essential. Here’s why behavior, not just facts, matters.
#1894: Engineering Serendipity: Tuning AI for Better Brainstorming
Stop asking chatbots for generic ideas. Learn how to configure AI as a structured, critical partner for business innovation and career pivots.
#1882: The Hidden Human Labor Behind AI
AI isn't free—it costs billions for humans to label data. See why annotation is the real engine behind models like Gemini.
#1839: AI's Data Kitchen: From Hoovering to Fine-Tuning
We go behind the curtain of the AI data pipeline, revealing the messy, multi-billion-dollar war over data curation.
#1828: Mastering 2M Token Context in Agentic Pipelines
A massive context window sounds like a dream, but it can quickly become a nightmare for complex AI workflows.
#1824: Why Governments Are Building Bunkers for AI
Public clouds can’t handle the security or scale of classified AI. Governments are retreating to fortified bunkers.
#1822: Quantum in the Cloud: Hype vs. Hardware
Is QCaaS a billion-dollar breakthrough or an expensive science experiment? We explore the gap between hype and hardware.
#1811: Stop Hardcoding User Names in AI Prompts
Three methods for storing user identity in AI agents—and why the "Fat System Prompt" breaks production apps.
#1810: Why Your TTS Sounds Great in English, Terrible Everywhere Else
English AI voices are polished, but global languages hit a wall. Here's why text-to-speech breaks down for Hebrew, Hindi, and beyond.
#1777: Claude Called My Prompt "Rambling" and I'm Not Okay
When an AI coding tool critiques your prompt's literary quality, it raises a massive technical question about engineered personality.
#1762: Testing AI Truthfulness: Beyond Vibes
Stop trusting confident AI. We explore the formal science of testing LLMs for hallucinations and knowledge cutoffs.
#1740: Why Open Source Is a Power Tool Strategy
We dissect Resemble AI's Chatterbox to see how its open-source TTS compares to commercial giants like ElevenLabs.
#1736: The Hidden AI Economy: Following the Tokens
OpenClaw is processing 16.5 trillion tokens daily, dwarfing Wikipedia. Here’s why it’s #1.
#1709: Standard Deviation: The Map Without a Scale
Why the average number alone is misleading—and how standard deviation reveals the true story behind the spread.
#1702: Roleplay Models Aren't Just for NSFW—They're Creative Co-Processors
Forget GPT-4 for scripts—specialized roleplay models like Aion-2.0 are better at character consistency and dialogue.
#1700: Can LLMs Learn Continuously Without Forgetting?
We explore a new approach: micro-training updates every few days to keep AI knowledge fresh without constant web searches.