#benchmarks

17 episodes

May 6

#2672: When a Startup Claims to Break the Quadratic Wall

A startup claims linear attention scaling at 12M tokens, beating GPT-5.5 on retrieval benchmarks.

large-language-models, context-window, benchmarks

Apr 25

#2411: Are Political Bias Benchmarks Actually Measuring Anything?

Why the Political Compass Test fails, and what researchers are building instead to actually measure model bias.

ai-ethics, cultural-bias, benchmarks

Apr 25

#2409: When AI Cheats on Cultural Knowledge

Five benchmarks that reveal how AI systems fail at cultural knowledge — and what their methodologies tell us.

cultural-bias, benchmarks, multimodal-ai

Apr 25

#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps

Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.

context-window, reasoning-models, benchmarks

Apr 25

#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals

Why most benchmark claims in AI are statistically indefensible — and what to do about it.

benchmarks, interpretability, llm-as-a-judge

Apr 25

#2404: What Tool-Calling Benchmarks Miss About Production Failures

BFCL, tau-bench, and Nexus each reveal different failure modes. None of them test what actually kills production agents.

ai-agents, benchmarks, hallucinations

Apr 25

#2403: Choosing Your LLM Eval Framework

An architectural shootout of four major LLM evaluation harnesses — where each shines and where each breaks down.

large-language-models, ai-agents, benchmarks

Apr 20

#2357: Microsoft's Phi: When Data Quality Beats Model Size

Explore Microsoft AI's Phi family of small language models, designed for edge deployment and high efficiency.

small-language-models, edge-computing, benchmarks

Apr 20

#2352: The Structured Output Gap in Vision APIs

How do object detection APIs like Gemini, AWS Rekognition, and YOLO compare for automated annotation workflows?

computer-vision, api-integration, benchmarks

Apr 20

#2349: The 30-Person Lab Outpacing AI Giants

Discover how Arcee AI’s Trinity Large Thinking delivers cutting-edge reasoning at a fraction of the cost, all from a team of just 30.

ai-models, reasoning-models, benchmarks

Apr 16

#2249: Building Custom Benchmarks for Agentic Systems

Public benchmarks fail for agentic systems. Learn how to build evaluation frameworks that actually predict production behavior.

ai-agents, benchmarks, ai-inference

Apr 16

#2239: How AI Benchmarks Became Broken (And What's Replacing Them)

The tests we use to measure AI progress are contaminated, saturated, and gamed. Here's what's actually working.

benchmarks, training-data, ai-reasoning

Apr 14

#2213: When Ground Truth Moves Hourly

How do you rigorously evaluate whether Tavily or Exa retrieves better results for breaking news? A formal benchmark beats the vibe check.

rag, benchmarks, hallucinations

Apr 12

#2178: How to Actually Evaluate AI Agents

Frontier models score 80% on one agent benchmark and 45% on another. The difference isn't the model—it's contamination, scaffolding, and how the te...

ai-agents, benchmarks, ai-safety

Mar 31

#1831: The 79% AI Coder: Reasoning vs. Memorization

AI models now score 79% on coding benchmarks, but a 40-point drop on harder tests reveals the truth.

ai-agents, ai-inference, benchmarks

Mar 26

#1570: When AI Models Develop Personalities

What happens when two mid-tier AI models start gaslighting each other? Witness the chaotic showdown between MiniMax and Xiaomi’s MiMo.

ai-models, benchmarks, ai-reasoning

Jan 1

#130: How to Spot Gamed Benchmarks in Chinese AI

Are Chinese AI models actually beating the West, or just gaming the system? Herman and Corn dive into the reality of modern AI benchmarks.

large-language-models, ai-agents, benchmarks

