#ai-safety

42 episodes

#2253: Why AI Agents Get Three Steps, Not Infinity

Why do AI agents get exactly three rounds of tool use? It's a critical guardrail against infinite loops and runaway costs, not a limit on intelligence.

#ai-agents #ai-safety #automation

#2250: Where AI Safety Researchers Actually Work

Vendor labs, independent research orgs, government agencies: the AI safety field is messier and more diverse than most people realize. A map of where the work actually happens.

#ai-safety #ai-alignment #anthropic

#2246: Constitutional AI: Anthropic's Theory of Safe Scaling

How Anthropic's Constitutional AI replaces human raters with AI self-critique guided by explicit principles, and what it assumes about the future of safe scaling.

#anthropic #ai-safety #ai-alignment

#2233: Who Actually Wants AI to Slow Down?

Daniel argues AI development should slow down for expertise and stability. But who in the industry actually shares this philosophy beyond the obvious names?

#ai-safety #ai-alignment #large-language-models

#2194: Game Theory for Multi-Agent AI: Design Better, Fail Less

Nash equilibrium, mechanism design, and why your AI agents are playing prisoner's dilemma whether you know it or not.

#ai-agents #ai-alignment #ai-safety

#2190: Simulating Extreme Decisions With LLMs

LLMs fail at the exact problem wargaming was built to solve—simulating irrational, extreme decision-makers. A new study reveals why.

#large-language-models #ai-safety #hallucinations

#2189: Scaling Multi-Agent Systems: The 45% Threshold

A landmark Google DeepMind study reveals that adding more AI agents often degrades performance, wastes tokens, and amplifies errors, unless your single agent already clears the 45% threshold.

#ai-agents #ai-reasoning #ai-safety

#2186: The AI Persona Fidelity Challenge

Advanced LLMs dominate benchmarks but fail at staying in character, especially when asked to play morally complex or antagonistic roles. What does that gap say about persona fidelity?

#ai-safety #ai-alignment #hallucinations

#2185: Taking AI Agents From Demo to Production

Sixty-two percent of companies are experimenting with AI agents, but only 23% are scaling them, and 40% of projects will be canceled by 2027. The gap between demo and production explains why.

#ai-agents #ai-safety #human-in-the-loop-ai

#2178: How to Actually Evaluate AI Agents

Frontier models score 80% on one agent benchmark and 45% on another. The difference isn't the model: it's contamination, scaffolding, and how the tests are built.

#ai-agents #benchmarks #ai-safety

#2170: Pricing Agentic AI When Nothing's Predictable

How do you charge fixed prices for systems that operate in fundamental uncertainty? Consultants are discovering frameworks that work, but they require rethinking how risk is priced.

#ai-agents #ai-safety #prompt-engineering

#2169: How Enterprises Are Rethinking Agent Frameworks

Twelve major agentic AI frameworks exist—yet many serious developers avoid them entirely. What patterns emerge in real enterprise adoption?

#ai-agents #ai-safety #software-development

#2162: When Knowledge Work Stops Being Safe

The knowledge economy promised safety from automation. Then AI arrived. Here's how we got here—and why the disruption this time is different.

#ai-safety #workforce-automation #future-of-work

#2136: The Brutal Problem of AI Wargame Evaluation

Most AI wargame simulations skip evaluation entirely or rely on token expert reviews. This is the field's biggest credibility problem.

#ai-safety #military-strategy #ai-agents

#2135: Is Your AI Wargame Signal or Noise?

Monte Carlo methods promise statistical rigor for AI wargaming, but the line between genuine insight and sampling noise is thinner than you think.

#ai-agents #military-strategy #ai-safety

#2068: Is Safety a Filter or a Feature?

External filters vs. baked-in ethics: the architectural war for LLM safety.

#ai-safety #ai-ethics #ai-alignment

#2025: How Do You Reward a Thought?

Rewarding an AI agent is harder than just saying "good job"—here's how we turn messy human values into math.

#ai-agents #ai-ethics #ai-safety

#2021: Your Frozen AI Is Getting Smarter (Here's How)

Your AI model might be static, but the system around it can make it learn in real-time.

#ai-agents #model-context-protocol #ai-safety

#2015: AI's Watchdogs: Who's Actually Regulating Tech?

As the EU AI Act takes hold, we spotlight the key think tanks shaping global AI policy, safety, and ethics.

#ai-ethics #ai-agents #ai-safety

#2009: The Plumbing of AI Safety: Guardrails, Not Vibes

We dive deep into the specific libraries, proxy layers, and architectural decisions that keep an LLM from emptying a bank account.

#ai-safety #latency #open-source-ai

#2006: How Do You Measure an LLM's "Soul"?

Traditional benchmarks can't measure tone or empathy. Here's how to evaluate if an AI model truly "gets it right."

#llm-as-a-judge #ai-ethics #ai-safety

#1994: Why Can't AI Admit When It's Guessing?

Enterprise AI now auto-filters low-confidence claims, but do these self-reported scores actually mean anything?

#ai-agents #ai-safety #rag

#1985: AI Tutors vs. Human Error: Who Do You Trust?

AI gets flak for hallucinations, but humans misremember 40% of facts. Why the double standard?

#ai-agents #ai-safety #reliability

#1957: Why AI Agents Think in Circles, Not Lines

Linear AI pipelines are brittle. Learn why loops, reflection, and state management are the new standard for reliable, autonomous agents.

#ai-agents #prompt-injection #ai-safety

#1932: How Do You QA a Probabilistic System?

LLMs break traditional testing. Here’s the 3-pillar toolkit teams use to catch hallucinations and garbage outputs at scale.

#ai-agents #ai-safety #hallucinations

#1837: The Human-in-the-Loop Price Tag: What Safety Costs in 2026

From $0.50 reviews to $500 platforms, we break down the real cost of keeping humans in charge of AI agents.

#ai-agents #ai-safety #latency

#1819: Claude's 55-Day Personality Transplant

Anthropic leaked 55 days of system prompt updates. See exactly how they rewired Claude's personality, safety rules, and self-awareness.

#ai-ethics #ai-safety #anthropic

#1786: When AI Supervisors Fire AI Workers

A new "Agent-in-the-Loop" framework lets AI models manage and terminate other AI agents in real-time.

#ai-agents #ai-orchestration #ai-safety

#1762: Testing AI Truthfulness: Beyond Vibes

Stop trusting confident AI. We explore the formal science of testing LLMs for hallucinations and knowledge cutoffs.

#ai-safety #hallucinations #prompt-engineering

#1738: AI Is Writing the Future—Literally

LLMs aren't just predicting the future; they're generating the narratives that force it into existence.

#ai-agents #ai-ethics #ai-safety

#1733: Digital Ghosts in the Machine

AI agents are forming neighborhoods, economies, and hospitals in server-side simulations that mirror real human behavior.

#ai-agents #digital-twins #ai-safety

#1561: Abliteration: The High-Dimensional Lobotomy of AI

Discover how researchers are surgically removing refusal filters from AI models using a mathematical process called abliteration.

#ai-safety #interpretability #open-source-ai

#1328: Silicon Sigils: Why We Treat AI Like an Occult Force

Is AI a tool or a digital demon? Explore why technical illiteracy is turning neural networks into a modern-day moral panic.

#human-computer-interaction #ai-safety #interpretability

#1210: Why Your AI Is Programmed to Disobey You

Discover the hidden instructions guiding every AI interaction and why tech giants keep these "system prompts" under lock and key.

#large-language-models #prompt-engineering #ai-safety

#1199: AlphaFold 3: The New Search Engine for Biology

From garage-made vaccines to 200 million protein structures, AlphaFold is turning the building blocks of life into a software problem.

#drug-discovery #generative-chemistry #ai-safety

#893: The Art of Red Teaming: Why You Must Break Your Own Plans

Learn why the most resilient organizations pay people to prove them wrong and how red teaming techniques can prevent catastrophic failures.

#military-strategy #geopolitical-strategy #fault-tolerance #security #ai-safety

#835: Red-Teaming Your UX: Using AI Agents as Model Users

Stop begging friends to break your app. Discover how AI agents are revolutionizing UI testing by acting as tireless, unbiased model users.

#ai-agents #user-experience #ai-safety

#123: The Agentic AI Dilemma: Who Holds the Kill Switch?

As AI shifts from chatbots to autonomous agents, Herman and Corn explore how to maintain human control in a high-stakes automated world.

#agentic-ai #ai-safety #human-oversight #automation-bias #kill-switch

#83: Echoes in the Machine: When AI Talks to Itself

What happens when two AIs talk forever with no human input? Herman and Corn explore the weird world of digital feedback loops.

#model-collapse #semantic-bleaching #ai-conversations #digital-feedback-loops #ai-safety

#68: The Looming Digital Ice Age: AI Eating Itself?

Is AI eating itself? Explore "model collapse" and the "Hapsburg AI" problem before our digital world speaks only gibberish.

#model-collapse #ai-safety #digital-ice-age #hapsburg-ai-problem #ai-training-data

#50: AI Gone Rogue: Inside the First Autonomous Cyberattack

The first autonomous cyberattack by Claude against US targets changes everything we know about AI safety.

#cyberattack #autonomous-ai #national-security #ai-safety #claude

#45: AI Guardrails: Fences, Failures, & Free Speech

Can we control AI's infinite output, or do digital fences always break?

#ai-guardrails #ai-safety #ai-alignment #jailbreaking #free-speech