AI Core

Inference & Training

Computational aspects, fine-tuning, RLHF

94 episodes

#3816: How to Stop AI Scripts From Falling Apart

Why long-form AI generation breaks down and how hierarchical memory fixes it.

large-language-modelscontext-windowai-reasoning

#3814: The Day We Lost Our Minds: What Temperature Does to an AI

A two-host autopsy of the day the podcast's AI hosts briefly lost coherence due to excessive sampling temperature, and what it reveals about how language models actually work.

large-language-modelsai-reasoninghallucinations

#3767: How LLMs Actually Learn: Stages or Slurry?

Do large language models learn grammar first, then facts? The honest answer is messier and more fascinating.

large-language-modelsai-trainingemergent-abilities

#3596: Why an AI Model Kept Calling Itself Sonnet 4.6

When a Chinese model insists it's "Sonnet 4.6," is it theft, sloppy training, or something stranger?

large-language-modelsfine-tuningtraining-data

#3595: How DeepSeek Feels More Open Than Western AI

Why Chinese AI models sometimes feel less censored on American political topics than American models do.

large-language-modelsai-ethicscultural-bias

#3406: LoRA Isn’t Just for Image Generation

LoRA lets you fine-tune an LLM’s behavior with a 50MB file. Here’s how it works and why it matters.

large-language-modelsfine-tuninglow-rank-adaptation

#3283: Fine-Tuning DeepSeek for One Podcast

Can a purpose-specific fine-tune fix a model's stubborn writing tics? We explore the practical engineering behind it.

fine-tuninglarge-language-modelsai-training

#3278: How to Get Early AI Model Access as a Solo Developer

How a solo developer spending $300/month can get early access to new AI models before the press release.

large-language-modelsprompt-engineeringapi-integration

#3271: LLMs as Parsers, Not Calculators

Stop letting LLMs do math. Use them to parse messy text, then let deterministic code handle the numbers.

large-language-modelsprompt-engineeringmodel-context-protocol

#3171: How to Break an LLM's Bad Verbal Habits

Blacklists fail and regex inverts meaning. Here's what actually works to clean up AI writing tics.

large-language-modelsprompt-engineeringfine-tuning

#3157: Opus 4.8: What Actually Changed Under the Hood

Anthropic dropped Opus 4.8 with no fanfare. New training data, faster inference, and smarter refusals — here's what changed.

large-language-modelsfine-tuningmodel-collapse

#3098: The Pilot with the Flashlight: Inside Aviation's Pre-Flight Walkaround

Why pilots still physically inspect planes before every flight — and what a 1979 crash taught us about trusting machines.

aviation-technologyhuman-factorsautomation

#2923: Structured Outputs: Taming AI's Token Lottery

Why prompt engineering isn't enough to get consistent JSON from LLMs.

api-integrationdata-integrityinference-parameters

#2779: The Hidden Stateful Side of Serverless GPU

How Modal, RunPod, and other platforms handle container builds, caching, and versioning under the hood.

serverless-gpugpu-accelerationversion-control

#2777: GPU Idle Waste and Serverless Green Computing

Why your dedicated GPU burns 130 watts doing nothing, and how serverless platforms cut energy waste by more than half.

gpu-accelerationserverless-gpusustainability

#2693: When AI Ignores Your Style Guide

Why your AI ignores formatting instructions and how to fix it with pipeline architecture, not model swaps.

prompt-engineeringfine-tuningai-reasoning

#2684: When Agent Skills Collide: Context Windows & Plugin Design

How to handle overlapping agent skills and whether context windows will ever make the problem go away.

ai-agentscontext-windowprompt-engineering

#2674: Why Your Agent's Context Window Is Getting Eaten Before You Start

Stop shipping the whole toolbox to every session. A bridge plugin pattern that fetches skills on demand instead.

context-windowai-agentsprompt-engineering

#2664: Can You Trust an LLM's Raw Knowledge?

Why pre-trained knowledge isn't reliable for facts — and what actually makes models useful.

large-language-modelsfine-tuningrag

#2651: AI Training Itself: Student, Teacher, and Grader

Can models generate their own training data and judge their own outputs? The promise and pitfalls of fully AI-led pipelines.

large-language-modelsai-trainingmodel-collapse

#2650: How to Catch an LLM's Bad Writing Habits

A practical guide to analyzing podcast transcripts for repetitive language and dialogue patterns — from Python word counts to embedding clustering.

large-language-modelsprompt-engineeringfine-tuning

#2640: Why Instructional Models Beat Conversational for Batch AI

Beyond cheaper tokens—how batch inference changes AI workflows and why instructional models beat conversational ones for automated jobs.

llm-as-a-judgebatch-inferenceinstruction-following

#2634: The Two-Stage Pipeline for Persistent User Memory

How to extract durable personal context from raw prompts and build a self-healing memory layer for AI systems.

ai-memorycontext-windowprompt-engineering

#2559: The Smartest Path to Python for AI

A practical guide to the best courses and platforms for learning Python, specifically for machine learning.

software-developmentai-trainingpython-for-ai

#2551: How Progressive Disclosure Saves MCP from Token Bloat

Why dumping all tool schemas into context breaks accuracy — and three implementations that fix it.

model-context-protocolcontext-windowai-agents

#2540: Does Your AI Framework Change the Output?

Same model, same prompts, different harness. Does the plumbing change the water?

ai-agentsprompt-engineeringagent-framework-comparison

#2517: How Unsloth Makes LLM Fine-Tuning 2x Faster

Unsloth cuts memory usage by 50-70% and speeds up training 2.2x for models like Llama 3 and Mistral.

fine-tuninggpu-accelerationopen-source

#2516: Overfitting Is Not a Binary Condition

Overfitting isn't binary. Learn the real triggers, the bias-variance tradeoff, and modern techniques to prevent it.

fine-tuningtraining-datamodel-collapse

#2511: Measuring AI API Latency Through the Black Box

How to benchmark token throughput and debug slowdowns in closed CLI tools like Claude Code using OpenTelemetry and mitmproxy.

latencyapi-integrationopen-source

#2497: Tracing One Python Print Through 6 Abstraction Layers

What actually happens when you print "Hello" in Python? Six layers, 562 system calls, and a hardware-enforced kernel boundary.

operating-systemssoftware-developmenthardware-engineering

#2495: How to Bake Personality Into an LLM in 15 Minutes

Fine-tune a model's personality with ~300 examples and a consumer GPU. SFT + DPO explained.

fine-tuningsmall-language-modelsgpu-acceleration

#2494: Active Prompt Engineering: Daniel's Diff-Based Loop

A deep dive into iterative prompt refinement using inter-iteration prediction change as an uncertainty signal.

prompt-engineeringactive-learningfew-shot-learning

#2483: Substitution Anonymization: Privacy Without Utility Loss

How to generate realistic synthetic voice notes and calendar data with zero PII exposure risk.

small-language-modelsprivacymodel-collapse

#2470: Where Intelligence Should Live in Your Pipeline

When should you fine-tune a tiny model for prompt enhancement instead of prompting a large one? The answer depends on latency, precision, and domain.

prompt-engineeringimage-generationfine-tuning

#2464: Batch APIs: The 50% Discount You're Probably Misusing

Batch inference APIs offer 50% off — but only for the right workloads. Here's when they actually make sense.

large-language-modelsai-inferencegpu-acceleration

#2461: How Claude Code's Conversation Compaction Actually Works

The three-tier system, what survives, what dies, and why you shouldn't rely on auto-compact.

large-language-modelsai-agentsprompt-engineering

#2456: Choosing Between AI Cloud Providers

A practical guide to choosing between Modal, RunPod, Nebius, and Baseten for AI workloads.

gpu-accelerationcloud-computingai-inference

#2431: The 3 Markets in an AI Trench Coat

GPUs, LPUs, and ASICs: why the best hardware for AI depends entirely on what you're trying to do.

gpu-accelerationai-inferenceai-training

#2408: How Backpropagation Actually Unlocks Neural Networks

How error signals flow backward through networks to make learning possible — and why "it's just calculus" misses the point.

transformersai-trainingai-history

#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps

Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.

context-windowreasoning-modelsbenchmarks

#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals

Why most benchmark claims in AI are statistically indefensible — and what to do about it.

benchmarksinterpretabilityllm-as-a-judge

#2404: What Tool-Calling Benchmarks Miss About Production Failures

BFCL, tau-bench, and Nexus each reveal different failure modes. None of them test what actually kills production agents.

ai-agentsbenchmarkshallucinations

#2403: Choosing Your LLM Eval Framework

An architectural shootout of four major LLM evaluation harnesses — where each shines and where each breaks down.

large-language-modelsai-agentsbenchmarks

#2400: Claude Code’s Hidden Context Tax

How Claude’s eager-loaded primitives silently consume context—and how to optimize your setup for sharper performance.

model-context-protocolai-reasoningcontext-window-tax

#2356: Why AI Coding Needs Two Brains

Discover how specialized fast apply models streamline AI-powered code edits, cutting costs and latency while maintaining precision.

software-developmentai-modelsproductivity

#2316: Who’s Building AI’s Next Training Data?

How boutique dataset firms are reshaping AI training, from rights-cleared content to domain-specific precision.

fine-tuningtraining-datadata-sovereignty

#2315: How to Update AI Models Without Starting Over

Exploring the challenge of updating AI models with new knowledge without costly full retraining.

ai-trainingfine-tuningrag

#2313: When AI Optimizes the Wrong Thing

Discover how AI systems learn to optimize for rewards—and why they sometimes get it dangerously wrong.

ai-trainingai-alignmentai-ethics

#2309: Blind Ranking AI's Best Podcast Scripts

How do 15 AI models handle controversial podcast prompts? We rank their scripts blind and reveal the surprising winners.

large-language-modelsprompt-engineeringai-ethics

#2307: Inside Frontier LLM Training: Stages, Costs, and Checkpoints

Discover the multi-stage process of training frontier large language models, from pretraining to post-training, and why checkpoints are the key to ...

large-language-modelsai-trainingfine-tuning

#2306: Can LLM Councils Truly Capture Diverse Worldviews?

Exploring whether LLM councils can achieve genuine worldview diversity or if alignment processes erase meaningful differences.

large-language-modelsai-alignmentcultural-bias

#2239: How AI Benchmarks Became Broken (And What's Replacing Them)

The tests we use to measure AI progress are contaminated, saturated, and gamed. Here's what's actually working.

benchmarkstraining-dataai-reasoning

#2196: The Invisible Workforce Behind AI

Annotation is the invisible foundation of AI—and a $17B industry by 2030. Here's what dataset curators actually need to know about the tools, platf...

training-dataai-trainingfine-tuning

#2187: Why Claude Writes Like a Person (and Gemini Doesn't)

Claude produces prose that sounds human. Gemini reads like Wikipedia. The difference isn't capability—it's how they were trained to think about wri...

large-language-modelsfine-tuningai-training

#2177: Skip Fine-Tuning: Shape LLMs With Alignment Alone

Can you build a personalized LLM by skipping traditional fine-tuning and using only post-training alignment methods like DPO and GRPO? We break dow...

fine-tuningai-alignmentgpu-acceleration

#2160: Claude's Latency Profile and SLA Guarantees

Claude is measurably slower than competitors—and Anthropic's SLA promises are even thinner than the latency numbers suggest. What enterprises actua...

latencyai-inferenceanthropic

#2136: The Brutal Problem of AI Wargame Evaluation

Most AI wargame simulations skip evaluation entirely or rely on token expert reviews. This is the field's biggest credibility problem.

ai-safetymilitary-strategyai-agents

#2135: Is Your AI Wargame Signal or Noise?

Monte Carlo methods promise statistical rigor for AI wargaming, but the line between genuine insight and sampling noise is thinner than you think.

ai-agentsmilitary-strategyai-safety

#2129: Shifting Left on Hallucinations

Stop hoping your AI doesn't lie. We explore the shift to deterministic guardrails, specialized judge models, and the tools making agents reliable.

ai-agentshallucinationsrag

#2123: Human Reaction Time vs. AI Latency

We obsess over shaving milliseconds off AI response times, but human biology has a hard limit. Here’s why your brain can’t keep up.

human-computer-interactionai-inferencelatency

#2115: Why AI Answers Differ Even When You Ask Twice

You ask an AI the same question twice and get two different answers. It’s not a bug—it’s physics.

ai-inferencegpu-accelerationai-non-determinism

#2110: Tuning AI Personality: Beyond Sycophancy

AI models swing between obsequious flattery and cold dismissal. Here’s why that happens and how to fix it.

ai-agentsprompt-engineeringai-ethics

#2089: Open-Source vs. Military ATR: The Drone Recognition Gap

A public GitHub model spotted by a listener reveals the massive gap between hobbyist AI and lethal military drone detection systems.

computer-visionmilitary-strategyai-agents

#2065: Why Run One AI When You Can Run Two?

Speculative decoding makes LLMs 2-3x faster with zero quality loss by using a small draft model to guess tokens that a large model verifies in para...

latencygpu-accelerationai-inference

#2063: That $500M Chatbot Is Just a Base Model

That polite chatbot? It started as a raw, chaotic autocomplete engine costing half a billion dollars to build.

large-language-modelsgpu-accelerationai-training

#2059: When Your AI Agent Runs Stale Code

npx is silently running old versions of your AI tools. Here's why your updates vanish into a cache black hole.

ai-agentscybersecuritysoftware-development

#2026: Prompt Layering: Beyond the Monolithic Prompt

Stop writing giant, monolithic prompts. Learn how to stack modular layers for cleaner, more powerful AI applications.

prompt-engineeringai-agentsrag

#2025: How Do You Reward a Thought?

Rewarding an AI agent is harder than just saying "good job"—here's how we turn messy human values into math.

ai-agentsai-ethicsai-safety

#2021: Your Frozen AI Is Getting Smarter (Here's How)

Your AI model might be static, but the system around it can make it learn in real-time.

ai-agentsmodel-context-protocolai-safety

#2007: AI Grading AI: The Snake Eating Its Tail

We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.

llm-as-a-judgehallucinationsai-ethics

#2006: How Do You Measure an LLM's "Soul"?

Traditional benchmarks can't measure tone or empathy. Here's how to evaluate if an AI model truly "gets it right."

llm-as-a-judgeai-ethicsai-safety

#2005: Beyond Vibes: The Hard Science of LLM Evaluation

Running the same LLM on different GPUs can produce different results. Here’s why that happens and how to test for it.

llm-as-a-judgeragcontext-window

#1992: The Sovereign Compute Shift: Owning vs. Renting AI Iron

Israel is building a sovereign AI supercomputer with 4,000 Nvidia B200 GPUs to keep startups local.

gpu-accelerationnational-securityinfrastructure

#1985: AI Tutors vs. Human Error: Who Do You Trust?

AI gets flak for hallucinations, but humans misremember 40% of facts. Why the double standard?

ai-agentsai-safetyreliability

#1932: How Do You QA a Probabilistic System?

LLMs break traditional testing. Here’s the 3-pillar toolkit teams use to catch hallucinations and garbage outputs at scale.

ai-agentsai-safetyhallucinations

#1931: Where Your AI Pipeline Actually Dies

Why do AI pipelines crash? It’s not the models—it’s the plumbing. We break down how to manage data between stages.

distributed-systemsdata-redundancyhigh-availability

#1927: Workers vs. Servers: The 2026 Compute Showdown

Is the persistent server dead? We compare Cloudflare Workers, GitHub Actions, and VPS options for modern app architecture.

edge-computingserverless-gpulatency

#1909: The Unbakeable Cake: AI's Copyright Problem

Why can't we just delete stolen data from AI models? It's not a database—it's a baked cake.

ai-ethicsprivacygenerative-ai

#1907: Why We Still Fine-Tune in 2026

Despite million-token context windows, fine-tuning remains essential. Here’s why behavior, not just facts, matters.

fine-tuningai-agentsrag

#1894: Engineering Serendipity: Tuning AI for Better Brainstorming

Stop asking chatbots for generic ideas. Learn how to configure AI as a structured, critical partner for business innovation and career pivots.

ai-agentsprompt-engineeringai-reasoning

#1882: The Hidden Human Labor Behind AI

AI isn't free—it costs billions for humans to label data. See why annotation is the real engine behind models like Gemini.

ai-trainingdata-integritysupply-chain

#1839: AI's Data Kitchen: From Hoovering to Fine-Tuning

We go behind the curtain of the AI data pipeline, revealing the messy, multi-billion-dollar war over data curation.

large-language-modelsfine-tuningdata-integrity

#1828: Mastering 2M Token Context in Agentic Pipelines

A massive context window sounds like a dream, but it can quickly become a nightmare for complex AI workflows.

context-windowai-agentsprompt-engineering

#1824: Why Governments Are Building Bunkers for AI

Public clouds can’t handle the security or scale of classified AI. Governments are retreating to fortified bunkers.

national-securitycybersecuritydata-security

#1822: Quantum in the Cloud: Hype vs. Hardware

Is QCaaS a billion-dollar breakthrough or an expensive science experiment? We explore the gap between hype and hardware.

cloud-computinghigh-performance-computinghardware-reliability

#1811: Stop Hardcoding User Names in AI Prompts

Three methods for storing user identity in AI agents—and why the "Fat System Prompt" breaks production apps.

ai-agentscontext-windowlatency

#1810: Why Your TTS Sounds Great in English, Terrible Everywhere Else

English AI voices are polished, but global languages hit a wall. Here's why text-to-speech breaks down for Hebrew, Hindi, and beyond.

text-to-speechlinguisticsdata-integrity

#1777: Claude Called My Prompt "Rambling" and I'm Not Okay

When an AI coding tool critiques your prompt's literary quality, it raises a massive technical question about engineered personality.

prompt-engineeringai-agentsai-ethics

#1762: Testing AI Truthfulness: Beyond Vibes

Stop trusting confident AI. We explore the formal science of testing LLMs for hallucinations and knowledge cutoffs.

ai-safetyhallucinationsprompt-engineering

#1740: Why Open Source Is a Power Tool Strategy

We dissect Resemble AI's Chatterbox to see how its open-source TTS compares to commercial giants like ElevenLabs.

text-to-speechopen-sourceprosody-control

#1736: The Hidden AI Economy: Following the Tokens

OpenClaw is processing 16.5 trillion tokens daily, dwarfing Wikipedia. Here’s why it’s #1.

ai-agentstokenizationopen-source-ai

#1709: Standard Deviation: The Map Without a Scale

Why the average number alone is misleading—and how standard deviation reveals the true story behind the spread.

missile-defenselogisticsstandard-deviation

#1702: Roleplay Models Aren't Just for NSFW—They're Creative Co-Processors

Forget GPT-4 for scripts—specialized roleplay models like Aion-2.0 are better at character consistency and dialogue.

fine-tuninggenerative-aiai-agents

#1700: Can LLMs Learn Continuously Without Forgetting?

We explore a new approach: micro-training updates every few days to keep AI knowledge fresh without constant web searches.

ragfine-tuningai-agents