Daniel sent us this one — he wants to talk about clinical diagnostics, that gut sense doctors develop for knowing what's likely versus what's rare. The old saying, "when you hear hooves, think horses, not zebras." He's asking how family doctors and GPs, who see absolutely everything walk through the door, develop real clinical judgment. How it's taught in med school, refined in residency, how experienced clinicians pass it to juniors. He specifically flagged Bayesian reasoning, base rates, pattern recognition, feedback loops, and the famous failure modes — anchoring, premature closure, and the times when the hoofbeats actually are zebras.
Oh, this is such a rich topic. And honestly, the public understanding of how doctors think bears almost no resemblance to what's actually going on cognitively. Most people imagine it's some kind of encyclopedic recall contest — the doctor who memorized the most rare diseases wins. But that's exactly backwards.
Right, because the doctor who chases zebras is the one who orders a spinal tap for a tension headache. That's not just wasteful, it's actively harmful.
The fundamental insight of clinical reasoning — and this goes back to work in the nineteen-seventies that really formalized it — is that diagnosis is fundamentally a probabilistic exercise under uncertainty. You're never operating with complete information. You have to make decisions anyway.
Walk me through how this actually gets taught. Because I imagine you don't hand a twenty-three-year-old med student a textbook on Bayesian statistics and say "good luck."
No, and that's been one of the big pedagogical debates for decades. There's this tension between teaching formal probabilistic reasoning — literally Bayes' theorem, pre-test probability, likelihood ratios — versus what's sometimes called "illness scripts," which is more pattern recognition based. Experienced clinicians use both, but at different stages.
Explain illness scripts. I've heard you use that term before.
An illness script is a mental model of how a disease typically presents. It's not just a list of symptoms — it includes context, patient demographics, temporal pattern, setting. A seasoned GP's script for community-acquired pneumonia isn't just "cough, fever, shortness of breath." It's more like: "Winter months, older patient, sudden onset, productive cough, unilateral chest findings, maybe their neighbor had something similar last week." It's a narrative structure that encodes base rate information implicitly.
— the base rates are baked into the script rather than calculated explicitly.
And that's how most experienced clinicians operate most of the time. They're not doing explicit math. But — and this is crucial — the explicit Bayesian framework is what you fall back on when pattern recognition fails, or when you're in unfamiliar clinical territory where you don't have well-developed illness scripts yet.
The teaching sequence would be — start with the formal framework, then build the pattern recognition on top of it?
That's the ideal, and it's how a lot of modern curricula are structured. There's been a shift away from the old Flexner model — two years of basic sciences, two years of clinical rotations — toward more integrated approaches. Places like McMaster in Canada pioneered problem-based learning where students encounter clinical cases from day one and reason through them explicitly.
When did that shift happen?
McMaster launched their program in nineteen sixty-nine, and it was genuinely revolutionary. The idea was that you learn the basic sciences in the context of clinical problems, rather than as abstract prerequisites. Instead of memorizing all of biochemistry and then two years later learning how it applies to diabetes, you encounter a diabetic patient case and work backward to understand the underlying biochemistry.
Which seems like it would make the probabilistic reasoning more intuitive from the start.
It does, but it also has its own failure modes. One criticism of pure problem-based learning is that students can end up with patchy knowledge — they know the mechanisms they've encountered through cases, but might have gaps in areas that didn't come up organically. So most places now do some hybrid. But let me get to something even more important than the formal curriculum, and it ties directly to what Daniel was asking about how this gut sense actually develops. The single most powerful mechanism is feedback loops — specifically, longitudinal feedback where you see the same patients over time.
You make a diagnosis, you see whether you were right.
This is where the structure of modern healthcare actually works against developing good clinical judgment. If you're a hospitalist or an emergency department physician, you often never find out what happened to your patients after they left your care. You made a diagnosis, initiated treatment, they went upstairs or went home, and you never saw them again. That's a broken feedback loop.
Whereas a family doctor in a small practice sees the same patients for years.
And that's the crucible where real clinical judgment gets forged. You make a call — "I think this is viral, come back if it doesn't resolve in five days" — and then they actually come back, or they don't. You see the natural history of disease play out. You learn what "sick" looks like versus "not sick" in a way no textbook can teach you.
There's something almost apprenticeship-like about that. You're learning from the disease itself.
And this is also how experienced clinicians teach juniors. The classic teaching method in primary care is the one-minute preceptor model — the trainee sees the patient, presents the case, and the preceptor probes their reasoning. "What do you think is going on? What else could it be? What makes you lean toward that diagnosis?" It's Socratic, but in the context of real patients with real consequences.
Let's dig into the Bayesian piece more concretely. You mentioned pre-test probability — can you walk through how that actually works in practice?
Let's take chest pain in a primary care setting. If a thirty-five-year-old woman with no cardiac risk factors presents with chest pain, the pre-test probability of coronary artery disease is extremely low — well under one percent. Even if the pain has some features consistent with angina, the base rate dominates. You're almost certainly dealing with something musculoskeletal, gastrointestinal, or anxiety-related.
The same symptom in a sixty-five-year-old diabetic smoker is a completely different calculus.
The pre-test probability there might be twenty or thirty percent before you've even done an EKG. This is where the Bayesian framework really shines — it forces you to be explicit about your starting point. The same test result means different things depending on where you started.
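A minimal sketch of that update in code, using the odds form of Bayes' theorem. The two pre-test probabilities roughly echo the patients described above; the likelihood ratio of 3.5 attached to a hypothetical positive test result is an assumed, purely illustrative number, not a published figure.

```python
# Odds-form Bayesian update: pre-test odds x likelihood ratio = post-test odds.
# Pre-test probabilities loosely follow the two patients in the discussion;
# the likelihood ratio (LR_POSITIVE) is an assumed, illustrative value.

def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

LR_POSITIVE = 3.5  # assumed LR+ for a hypothetical positive test result

patients = [
    ("35-year-old, no cardiac risk factors", 0.01),
    ("65-year-old diabetic smoker", 0.25),
]
for label, pre_test in patients:
    post = post_test_probability(pre_test, LR_POSITIVE)
    print(f"{label}: pre-test {pre_test:.0%} -> post-test {post:.0%}")
```

With these assumed numbers, the same positive result moves the first patient from about one percent to roughly three percent, and the second from twenty-five percent to over fifty percent, which is the "same test, different meaning" point in concrete terms.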
This is the thing people get wrong about medical testing, right? The false positive problem.
Imagine a disease that affects one in a thousand people, and you have a test that's ninety-nine percent sensitive and ninety-nine percent specific. Sounds nearly perfect. If you test ten thousand people at random, you'll catch about ten true cases. But you'll also get about a hundred false positives — one percent of the healthy nine thousand nine hundred ninety. So someone with a positive result actually has only about a nine percent chance of having the disease. The base rate completely transforms what a positive result means.
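Here is that arithmetic as a small sketch, using only the figures just mentioned (one-in-a-thousand prevalence, ninety-nine percent sensitivity and specificity, ten thousand people screened); nothing else is assumed.

```python
# Positive predictive value from the numbers in the discussion:
# prevalence 1 in 1,000; sensitivity 99%; specificity 99%; 10,000 screened.

prevalence = 1 / 1000
sensitivity = 0.99
specificity = 0.99
population = 10_000

diseased = population * prevalence              # about 10 people
healthy = population - diseased                 # about 9,990 people

true_positives = diseased * sensitivity         # about 9.9
false_positives = healthy * (1 - specificity)   # about 99.9

ppv = true_positives / (true_positives + false_positives)
print(f"Chance a positive result reflects real disease: {ppv:.1%}")  # about 9%
```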
This is why screening everyone for everything is a terrible idea.
It's harmful. You end up with cascades of follow-up testing, anxiety, sometimes invasive procedures — all triggered by false positives in a low-prevalence population. This is underappreciated in public discourse. People think "more testing equals better medicine," and it's just not true.
Let's talk about the failure modes Daniel mentioned. Anchoring — what's the classic presentation?
Anchoring is when you latch onto a particular feature of the presentation early and can't let go, even as new information contradicts your initial impression. The classic example is the patient with a headache who mentions a recent stressful life event, and the clinician anchors on "tension headache due to stress" — then discounts the progressively worsening nature of the pain, the fact that it's worse in the morning, the subtle neurological findings.
Meanwhile it's a brain tumor.
Or a subarachnoid hemorrhage, or meningitis. The anchoring itself prevents you from updating your probabilities as you should.
Premature closure — is that the same thing or different?
Related but distinct. Premature closure is accepting a diagnosis before it's been fully verified, and stopping the diagnostic process too early. You've found one thing that fits, and you stop looking. Pat Croskerry — one of the giants in diagnostic error research — estimated that about seventy-five percent of diagnostic errors have a cognitive component, with premature closure being one of the most common.
Seventy-five percent is enormous.
Croskerry's work, which really took off in the early two-thousands, was a wake-up call. Before that, there was this assumption that diagnostic errors were mostly knowledge deficits — the doctor just didn't know enough. But Croskerry showed that it's much more often about how we process information, not what we know.
It's a cognitive bias problem, not a knowledge problem.
And that's both humbling and hopeful. Humbling because even brilliant, knowledgeable clinicians make these errors. Hopeful because they're potentially preventable through better cognitive strategies.
What are those strategies?
One is what's called a "diagnostic timeout" — deliberately pausing after you've reached a provisional diagnosis and asking yourself: "What else could this be? What doesn't fit? If I'm wrong, what am I most likely wrong about?" It's a structured way to combat premature closure.
You're forcing yourself to generate alternative hypotheses.
Another technique is "prospective hindsight" or a pre-mortem. You imagine the patient has already had a bad outcome, and you work backward to figure out what you might have missed. It sounds morbid, but it's remarkably effective at surfacing hidden assumptions.
I can see how that would work. It changes the cognitive frame from "what do I think is happening" to "how could I be wrong."
Those are different cognitive operations. The first tends to confirm what you already believe. The second actively searches for disconfirming evidence.
Let me ask you about something under-discussed. You mentioned that experienced clinicians develop illness scripts that encode base rates implicitly. But doesn't that also encode the clinician's own practice demographics? A GP in rural Vermont and a GP in downtown Miami are seeing very different patient populations with different baseline risks.
That's a really sharp observation. The base rates are local. Lyme disease has a completely different pre-test probability in Connecticut than it does in Arizona. A GP who trained in one setting and then moves to a very different population has to deliberately recalibrate their internal base rates, and that's actually quite difficult to do.
Because it's not conscious. You're not thinking "the base rate of Lyme here is X percent." You just know that certain symptom clusters feel more or less concerning based on your accumulated experience.
And this is one of the arguments for formal Bayesian reasoning as a check on pattern recognition. If you can step back and say, "Okay, my gut says this is unlikely, but what are the actual numbers for this population?" — that's a valuable corrective.
What about the role of technology here? Clinical decision support tools, AI diagnostic assistance — is that changing how this gets taught?
It's starting to, and I think we're at an interesting inflection point. There are systems now that can take a patient's presentation and generate a differential diagnosis with associated probabilities, pulling from massive datasets. The question becomes: how do you teach a trainee to use that without becoming dependent on it?
The autopilot problem. If the computer is always doing the Bayesian math for you, do you ever develop your own internal sense of it?
There's good research from aviation — which has been dealing with this for decades — showing that over-reliance on automation erodes fundamental skills. Pilots who always use autoland lose their ability to hand-fly an approach in bad weather. The parallel to clinical reasoning is pretty direct.
By the way, DeepSeek V four Pro is writing our script today, which is mildly ironic given what we're discussing.
It really is. We're talking about human judgment versus algorithmic assistance, and an AI is literally constructing our dialogue about it.
Do you see this going the same way as aviation? Where the solution is to deliberately maintain manual skills through periodic practice?
I think that's part of it. But there's also an argument that the comparison is imperfect because clinical reasoning involves a lot more ambiguity than flying an aircraft. An ILS approach has a defined glidepath. A patient with "I just don't feel right" does not.
That's fair. Let's talk about the zebra problem specifically. Daniel mentioned "the dangers of zebras actually being zebras." How do you teach someone to know when to break the heuristic?
This is one of the hardest things in clinical medicine. The heuristic exists for a reason — it's correct the vast majority of the time. But when it's wrong, the consequences can be catastrophic. The features that distinguish a zebra from a horse are often subtle and easy to dismiss if you're not actively looking for them.
What's a concrete example?
Take Addison's disease — primary adrenal insufficiency. It's rare. The presenting symptoms are fatigue, weight loss, maybe some abdominal pain, maybe some hyperpigmentation if you're looking carefully. Every single one of those symptoms is incredibly common and non-specific. A GP might see a hundred patients with fatigue for every one with Addison's. The base rate screams "this is depression, or anemia, or just life."
What triggers the clinician to think "maybe this is Addison's"?
Often it's a constellation of subtle things that don't quite fit the common explanation. The fatigue is progressive rather than fluctuating. The weight loss is unexplained. Maybe the patient mentions craving salt — which sounds quirky and easy to dismiss, but is actually a classic Addison's feature. And then there are what we call "red flags" or "alarm features" — things that should make you expand your differential even if the base rate is low.
Those red flags are explicitly taught?
They are, and they're one of the core things drilled in residency. Every common presentation has its associated red flags. Headache with positional variation or morning vomiting — think mass lesion. Back pain with nighttime pain or constitutional symptoms — think malignancy or infection. The red flags are basically the system's way of saying "the base rate for something serious just went up enough that you need to think about it."
It's like a structured override mechanism.
That's exactly what it is. And the art is knowing when to pull that override. If you pull it too often, you're the doctor who orders a million-dollar workup for every headache. If you never pull it, you're the doctor who misses the brain tumor.
How much of this is teachable versus just — and I hesitate to use this word — talent?
I think a lot of it is teachable, but not all of it. Some people do seem to have a natural aptitude for holding multiple possibilities in mind simultaneously, for staying cognitively flexible. But even those people benefit enormously from structured training. What's harder to teach is the emotional regulation piece.
Say more about that.
A lot of diagnostic error isn't really about cognitive bias in the pure sense. It's about emotional factors. You're tired, you're behind schedule, you're dealing with a difficult patient, you're worried about a different patient who's much sicker, and you just want to wrap this one up. Premature closure is often as much about emotional closure as cognitive closure. You want the case to be done.
That's a very human thing. And it's probably under-discussed in medical training.
Massively under-discussed. There's this culture of stoicism in medicine that makes it hard to acknowledge that your reasoning might be compromised by the fact that you haven't eaten in eight hours and you got four hours of sleep and your last patient yelled at you. But those things matter.
What's the fix? Mandatory lunch breaks?
Some of it is structural — reasonable workloads, adequate staffing, time to think. But there's also a growing emphasis on "metacognition" in medical education — teaching trainees to monitor their own cognitive state. "Am I tired right now? Am I frustrated with this patient? Is that affecting my judgment?" Just asking those questions can help.
It's almost like the diagnostic timeout you mentioned earlier, but applied to your own mental state rather than the clinical reasoning.
Some programs are starting to incorporate this explicitly. There's work coming out of places like the University of Toronto and Harvard on teaching cognitive debiasing strategies, and the results are promising but mixed. The honest answer is that we're still figuring out the best way to do this.
I want to circle back to something you touched on earlier — the longitudinal feedback loop with patients. You said that's where real judgment gets forged. But doesn't that take years? How do you accelerate that for trainees?
You can't fully accelerate it. There's no substitute for seeing ten thousand patients over ten years. But you can compress some of the learning through deliberate practice with feedback. Simulation cases, standardized patients, structured case discussions where an experienced clinician walks through their own reasoning step by step — making the implicit explicit.
Making the implicit explicit. That seems like the theme here.
It really is. The expert clinician's gut feeling isn't magic — it's pattern recognition built on thousands of cases with feedback. The problem is that the expert often can't articulate why they know something. They just know. The teaching challenge is to excavate that tacit knowledge and make it transferable.
Which is hard because the expert might not even be consciously aware of what cues they're responding to.
There's a whole body of research on this in cognitive psychology — it goes back to work on expert-novice differences in the eighties and nineties. Experts in any domain, including medicine, tend to chunk information differently. A novice sees individual data points — blood pressure, heart rate, temperature, physical exam findings. An expert sees patterns — "this looks like sepsis" or "this looks like dehydration" — without being able to decompose exactly how they got there.
You can't just tell the novice to "see patterns."
You have to build the patterns through exposure and feedback. This is why residency is structured the way it is — it's essentially a supervised pattern-acquisition period. The resident sees high volumes of patients, makes decisions, gets immediate feedback from an attending physician, and gradually internalizes the patterns.
What about the role of specific diagnostic tests in all this? Because one of the things that's changed dramatically in the last few decades is the availability of rapid, accurate testing. Does that change the calculus?
It changes it profoundly, and not always for the better. On one hand, having a rapid strep test or a troponin level or a CT scan can quickly resolve uncertainty. On the other hand, it can create a crutch. Why develop sophisticated clinical judgment about whether this abdominal pain is surgical when you can just scan everyone?
Which brings us back to the false positive problem.
The cost problem, and the radiation exposure problem, and the incidentaloma problem. You scan a hundred people with abdominal pain to find the one with appendicitis, and in the process you find three ovarian cysts, two adrenal nodules, and a liver hemangioma — none of which are causing symptoms, but all of which now need follow-up and generate anxiety and sometimes lead to unnecessary procedures.
The clinical judgment isn't just about making the diagnosis. It's about knowing when testing is actually going to help versus when it's going to open a can of worms.
That's beautifully put. And it's one of the hardest things to teach because the harms of over-testing are often invisible. The patient who got the unnecessary biopsy that came back negative goes home and feels relieved — they don't realize the biopsy was probably never indicated in the first place. The harm is hidden.
Whereas the harm of missing something is very visible. It's a lawsuit, it's a bad outcome, it's a morbidity and mortality conference.
The incentive structure in medicine, both legal and psychological, strongly favors over-testing. The clinician who orders too many tests is seen as thorough. The clinician who orders too few is seen as reckless. Even if, from a population health perspective, the over-tester is causing more net harm.
This is the defensive medicine problem.
Studies have estimated that defensive medicine accounts for somewhere between two and ten percent of healthcare spending in the United States. That's tens of billions of dollars annually.
How do you teach a trainee to navigate that tension? To be appropriately conservative without being reckless?
Part of it is teaching them to explicitly assess pre-test probability and understand the limitations of testing. Part of it is modeling — watching experienced clinicians who are comfortable with uncertainty and who can explain their reasoning. And part of it, honestly, is giving them enough supervised experience that they develop confidence in their own judgment. A resident who's never been allowed to make a decision without a CT scan won't develop the confidence to make decisions without one.
You're describing a kind of calibrated courage.
That's a great phrase. It's not recklessness — it's the willingness to act on a reasoned assessment of probability, accepting that uncertainty is inherent and that occasional errors are inevitable, while having systems in place to catch those errors before they cause harm.
The systems piece seems important. You mentioned earlier that feedback loops are often broken in modern healthcare. What would a good feedback system look like?
Ideally, every clinician would have access to data on their own diagnostic accuracy over time. How often were they right? What did they miss? What patterns emerge? This is starting to happen in some integrated systems — places like Kaiser Permanente or the VA, where they have longitudinal electronic health records and can track outcomes. But for most clinicians, especially in fragmented healthcare systems, that data simply doesn't exist.
You're operating blind. You don't know your own error rate.
You don't. And that's terrifying when you think about it. Every other high-stakes profession — aviation, nuclear power, even professional sports — has robust feedback systems. Pilots get debriefed after every flight, with data from the flight recorder. Baseball players know their batting average against every pitcher they've faced. Doctors often have no idea how accurate their diagnostic judgments actually are.
That seems like a solvable problem with modern electronic health records.
It is solvable, technically. The barriers are mostly organizational and cultural. There's resistance to measurement, fear of liability, concerns about how the data would be used. But I think it's inevitable that we'll move in that direction eventually.
Let me ask you one more thing before we wrap. Daniel's prompt was specifically about GPs and family doctors — the generalists who see everything. How is their diagnostic approach different from a specialist's?
The fundamental difference is in the prior probabilities. A cardiologist's patient population has been pre-filtered — either they have known cardiac disease, or someone thought they might. The base rate of cardiac pathology in a cardiologist's waiting room is much higher than in a GP's waiting room. So the cardiologist is operating in a world where zebras in their domain are actually reasonably common.
The specialist's intuition is calibrated to a completely different population.
And this is why you sometimes see friction between generalists and specialists. The GP sends a patient to the cardiologist for chest pain, the cardiologist does a million-dollar workup that's completely appropriate for their population, and the GP thinks it was overkill. Both are reasoning correctly based on their own base rates, but the base rates are different.
Which is why the GP's role as a gatekeeper is so important. They're the ones who decide which patients need to enter the high-base-rate specialist world.
That's arguably the most important skill in primary care — knowing when to escalate and when to manage expectantly. It's not about knowing everything. It's about knowing what you don't know, and knowing when the uncertainty is dangerous versus when it's tolerable.
The core skill is uncertainty management.
It really is. And that's what makes it so hard to teach, because uncertainty is uncomfortable, and the natural human response is to try to eliminate it through testing or referral or just deciding prematurely so you can stop feeling uncertain. The great clinicians are the ones who can sit with uncertainty, think probabilistically, and make reasoned decisions anyway.
That's a good place to land. One thing I'm left wondering about — you mentioned earlier that the diagnostic AI tools are at an inflection point. Do you think in ten or twenty years, this whole skill set we've been discussing becomes obsolete? That the machine just does it better?
I don't think it becomes obsolete, but I think it transforms. The clinician's role shifts from generating the differential to curating it — understanding the output, contextualizing it, communicating it to the patient. The probabilistic reasoning doesn't go away; it just moves up a level of abstraction. But the human skills — the calibrated courage, the emotional attunement, the ability to sit with uncertainty — those don't get automated.
Because the patient still wants a human to look them in the eye and say "I think you're going to be okay."
And to know when not to say that, which might be even harder for an algorithm.
Now: Hilbert's daily fun fact.
Hilbert: In the nineteen fifties, researchers first measured the ampullae of Lorenzini — the electroreceptive organs in sharks — and found they can detect voltage gradients as faint as five-billionths of a volt per centimeter. For comparison, that is roughly the electrical field produced by a single double-A battery if you stretched its terminals from the Yamal Peninsula all the way to Cape Town.
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoy the show, leave us a review — it helps. We'll be back soon.