#benchmarks
17 episodes
#2672: When a Startup Claims to Break the Quadratic Wall
A startup claims linear attention scaling at 12M tokens, beating GPT-5.5 on retrieval benchmarks.
#2411: Are Political Bias Benchmarks Actually Measuring Anything?
Why the Political Compass Test fails, and what researchers are building instead to actually measure model bias.
#2409: When AI Cheats on Cultural Knowledge
Five benchmarks that reveal how AI systems fail at cultural knowledge — and what their methodologies tell us.
#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps
Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.
#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals
Why most benchmark claims in AI are statistically indefensible — and what to do about it.
#2404: What Tool-Calling Benchmarks Miss About Production Failures
BFCL, tau-bench, and Nexus each reveal different failure modes. None of them test what actually kills production agents.
#2403: Choosing Your LLM Eval Framework
An architectural shootout of four major LLM evaluation harnesses — where each shines and where each breaks down.
#2357: Microsoft's Phi: When Data Quality Beats Model Size
Explore Microsoft AI's Phi family of small language models, designed for edge deployment and high efficiency.
#2352: The Structured Output Gap in Vision APIs
How do object detection APIs like Gemini, AWS Rekognition, and YOLO compare for automated annotation workflows?
#2349: The 30-Person Lab Outpacing AI Giants
Discover how Arcee AI’s Trinity Large Thinking delivers cutting-edge reasoning at a fraction of the cost, all from a team of just 30.
#2249: Building Custom Benchmarks for Agentic Systems
Public benchmarks fail for agentic systems. Learn how to build evaluation frameworks that actually predict production behavior.
#2239: How AI Benchmarks Became Broken (And What's Replacing Them)
The tests we use to measure AI progress are contaminated, saturated, and gamed. Here's what's actually working.
#2213: When Ground Truth Moves Hourly
How do you rigorously evaluate whether Tavily or Exa retrieves better results for breaking news? A formal benchmark beats the vibe check.
#2178: How to Actually Evaluate AI Agents
Frontier models score 80% on one agent benchmark and 45% on another. The difference isn't the model—it's contamination, scaffolding, and how the te...
#1831: The 79% AI Coder: Reasoning vs. Memorization
AI models now score 79% on coding benchmarks, but a 40-point drop on harder tests reveals the truth.
#1570: When AI Models Develop Personalities
What happens when two mid-tier AI models start gaslighting each other? Witness the chaotic showdown between MiniMax and Xiaomi’s MiMo.
#130: How to Spot Gamed Benchmarks in Chinese AI
Are Chinese AI models actually beating the West, or just gaming the system? Herman and Corn dive into the reality of modern AI benchmarks.