#2739: When Hoofbeats Are Zebras: How Doctors Learn to Think

How family doctors develop clinical judgment—pattern recognition, Bayesian reasoning, and the cognitive traps that lead to diagnostic errors.

Episode Details
Episode ID
MWP-2739
Published
Duration
27:55
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Clinical reasoning is fundamentally a probabilistic exercise under uncertainty—doctors never operate with complete information, yet must make decisions anyway. This episode explores how family physicians develop that critical judgment, from the formal Bayesian frameworks taught in medical school to the "illness scripts" that encode base rate information implicitly through years of patient experience.

The single most powerful mechanism for developing clinical judgment is longitudinal feedback loops—seeing the same patients over time and learning whether your diagnoses were correct. Modern healthcare structures often break these loops: hospitalists and ER physicians frequently never learn what happened after patients left their care, while family doctors in continuity practices forge their judgment in the crucible of follow-up visits and disease natural history.

The episode also examines the cognitive failure modes that drive diagnostic errors, which Pat Croskerry's work suggests involve a cognitive component in roughly seventy-five percent of cases, rather than simple knowledge gaps. Anchoring—latching onto an early feature and failing to update—and premature closure—accepting a diagnosis before full verification—are among the most common. Strategies like diagnostic timeouts and prospective hindsight (pre-mortems) can combat these biases, but the deeper challenge is that base rates are local: a clinician moving from rural Vermont to downtown Miami must deliberately recalibrate their internal probabilities for conditions like Lyme disease.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3
Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#2739: When Hoofbeats Are Zebras: How Doctors Learn to Think

Corn
Daniel sent us this one — he wants to talk about clinical diagnostics, that gut sense doctors develop for knowing what's likely versus what's rare. The old saying, "when you hear hoofbeats, think horses, not zebras." He's asking how family doctors and GPs, who see absolutely everything walk through the door, develop real clinical judgment. How it's taught in med school, refined in residency, how experienced clinicians pass it to juniors. He specifically flagged Bayesian reasoning, base rates, pattern recognition, feedback loops, and the famous failure modes — anchoring, premature closure, and the times when the hoofbeats actually are zebras.
Herman
Oh, this is such a rich topic. And honestly, the public understanding of how doctors think bears almost no resemblance to what's actually going on cognitively. Most people imagine it's some kind of encyclopedic recall contest — the doctor who memorized the most rare diseases wins. But that's exactly backwards.
Corn
Right, because the doctor who chases zebras is the one who orders a spinal tap for a tension headache. That's not just wasteful, it's actively harmful.
Herman
The fundamental insight of clinical reasoning — and this goes back to work in the nineteen-seventies that really formalized it — is that diagnosis is a probabilistic exercise under uncertainty. You're never operating with complete information. You have to make decisions anyway.
Corn
Walk me through how this actually gets taught. Because I imagine you don't hand a twenty-three-year-old med student a textbook on Bayesian statistics and say "good luck."
Herman
No, and that's been one of the big pedagogical debates for decades. There's this tension between teaching formal probabilistic reasoning — literally Bayes' theorem, pre-test probability, likelihood ratios — versus what's sometimes called "illness scripts," which is more pattern recognition based. Experienced clinicians use both, but at different stages.
Corn
Explain illness scripts. I've heard you use that term before.
Herman
An illness script is a mental model of how a disease typically presents. It's not just a list of symptoms — it includes context, patient demographics, temporal pattern, setting. A seasoned GP's script for community-acquired pneumonia isn't just "cough, fever, shortness of breath." It's more like: "Winter months, older patient, sudden onset, productive cough, unilateral chest findings, maybe their neighbor had something similar last week." It's a narrative structure that encodes base rate information implicitly.
Corn
— the base rates are baked into the script rather than calculated explicitly.
Herman
And that's how most experienced clinicians operate most of the time. They're not doing explicit math. But — and this is crucial — the explicit Bayesian framework is what you fall back on when pattern recognition fails, or when you're in unfamiliar clinical territory where you don't have well-developed illness scripts yet.
Corn
The teaching sequence would be — start with the formal framework, then build the pattern recognition on top of it?
Herman
That's the ideal, and it's how a lot of modern curricula are structured. There's been a shift away from the old Flexner model — two years of basic sciences, two years of clinical rotations — toward more integrated approaches. Places like McMaster in Canada pioneered problem-based learning where students encounter clinical cases from day one and reason through them explicitly.
Corn
When did that shift happen?
Herman
McMaster launched their program in nineteen sixty-nine, and it was genuinely revolutionary. The idea was that you learn the basic sciences in the context of clinical problems, rather than as abstract prerequisites. Instead of memorizing all of biochemistry and then two years later learning how it applies to diabetes, you encounter a diabetic patient case and work backward to understand the underlying biochemistry.
Corn
Which seems like it would make the probabilistic reasoning more intuitive from the start.
Herman
It does, but it also has its own failure modes. One criticism of pure problem-based learning is that students can end up with patchy knowledge — they know the mechanisms they've encountered through cases, but might have gaps in areas that didn't come up organically. So most places now do some hybrid. But let me get to something even more important than the formal curriculum, and it ties directly to what Daniel was asking about how this gut sense actually develops. The single most powerful mechanism is feedback loops — specifically, longitudinal feedback where you see the same patients over time.
Corn
You make a diagnosis, you see whether you were right.
Herman
This is where the structure of modern healthcare actually works against developing good clinical judgment. If you're a hospitalist or an emergency department physician, you often never find out what happened to your patients after they left your care. You made a diagnosis, initiated treatment, they went upstairs or went home, and you never saw them again. That's a broken feedback loop.
Corn
Whereas a family doctor in a small practice sees the same patients for years.
Herman
And that's the crucible where real clinical judgment gets forged. You make a call — "I think this is viral, come back if it doesn't resolve in five days" — and then they actually come back, or they don't. You see the natural history of disease play out. You learn what "sick" looks like versus "not sick" in a way no textbook can teach you.
Corn
There's something almost apprenticeship-like about that. You're learning from the disease itself.
Herman
And this is also how experienced clinicians teach juniors. The classic teaching method in primary care is the one-minute preceptor model — the trainee sees the patient, presents the case, and the preceptor probes their reasoning. "What do you think is going on? What else could it be? What makes you lean toward that diagnosis?" It's Socratic, but in the context of real patients with real consequences.
Corn
Let's dig into the Bayesian piece more concretely. You mentioned pre-test probability — can you walk through how that actually works in practice?
Herman
Let's take chest pain in a primary care setting. If a thirty-five-year-old woman with no cardiac risk factors presents with chest pain, the pre-test probability of coronary artery disease is extremely low — well under one percent. Even if the pain has some features consistent with angina, the base rate dominates. You're almost certainly dealing with something musculoskeletal, gastrointestinal, or anxiety-related.
Corn
The same symptom in a sixty-five-year-old diabetic smoker is a completely different calculus.
Herman
The pre-test probability there might be twenty or thirty percent before you've even done an EKG. This is where the Bayesian framework really shines — it forces you to be explicit about your starting point. The same test result means different things depending on where you started.
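One way to make that point concrete is a small odds-form Bayes calculation. The likelihood ratio of five and the rounded pre-test probabilities below are illustrative assumptions for the sketch, not figures quoted in the episode:

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Apply a test's likelihood ratio to a pre-test probability via odds-form Bayes."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# The same hypothetical positive test (likelihood ratio of 5) applied to two patients:
low_risk = post_test_probability(0.01, 5)    # 35-year-old with no risk factors, ~1% pre-test
high_risk = post_test_probability(0.25, 5)   # 65-year-old diabetic smoker, ~25% pre-test

print(f"Low-risk patient:  {low_risk:.1%}")   # ~4.8%, still probably not cardiac
print(f"High-risk patient: {high_risk:.1%}")  # ~62.5%, now dominates the differential
```

The identical test result moves one patient barely at all and moves the other most of the way to a working diagnosis, which is the sense in which the starting point does the heavy lifting.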
Corn
This is the thing people get wrong about medical testing, right? The false positive problem.
Herman
Imagine a disease that affects one in a thousand people, and you have a test that's ninety-nine percent sensitive and ninety-nine percent specific. Sounds nearly perfect. If you test ten thousand people at random, you'll catch about ten true cases. But you'll also get about a hundred false positives — one percent of the healthy nine thousand nine hundred ninety. So someone with a positive result actually has only about a nine percent chance of having the disease. The base rate completely transforms what a positive result means.
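For anyone who wants to check the arithmetic, here is the same calculation spelled out. The prevalence, sensitivity, and specificity are the values Herman states in the dialogue:

```python
population  = 10_000
prevalence  = 1 / 1_000
sensitivity = 0.99   # probability a sick person tests positive
specificity = 0.99   # probability a healthy person tests negative

sick = population * prevalence                 # 10 people with the disease
healthy = population - sick                    # 9,990 people without it

true_positives  = sick * sensitivity           # ~9.9, "about ten true cases"
false_positives = healthy * (1 - specificity)  # ~99.9, "about a hundred false positives"

ppv = true_positives / (true_positives + false_positives)
print(f"Chance a positive result is a true case: {ppv:.0%}")  # ~9%
```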
Corn
This is why screening everyone for everything is a terrible idea.
Herman
It's harmful. You end up with cascades of follow-up testing, anxiety, sometimes invasive procedures — all triggered by false positives in a low-prevalence population. This is underappreciated in public discourse. People think "more testing equals better medicine," and it's just not true.
Corn
Let's talk about the failure modes Daniel mentioned. Anchoring — what's the classic presentation?
Herman
Anchoring is when you latch onto a particular feature of the presentation early and can't let go, even as new information contradicts your initial impression. The classic example is the patient with a headache who mentions a recent stressful life event, and the clinician anchors on "tension headache due to stress" — then discounts the progressively worsening nature of the pain, the fact that it's worse in the morning, the subtle neurological findings.
Corn
Meanwhile it's a brain tumor.
Herman
Or a subarachnoid hemorrhage, or meningitis. The anchoring itself prevents you from updating your probabilities as you should.
Corn
Premature closure — is that the same thing or different?
Herman
Related but distinct. Premature closure is accepting a diagnosis before it's been fully verified, and stopping the diagnostic process too early. You've found one thing that fits, and you stop looking. Pat Croskerry — one of the giants in diagnostic error research — estimated that about seventy-five percent of diagnostic errors have a cognitive component, with premature closure being one of the most common.
Corn
Seventy-five percent is enormous.
Herman
Croskerry's work, which really took off in the early two-thousands, was a wake-up call. Before that, there was this assumption that diagnostic errors were mostly knowledge deficits — the doctor just didn't know enough. But Croskerry showed that it's much more often about how we process information, not what we know.
Corn
It's a cognitive bias problem, not a knowledge problem.
Herman
And that's both humbling and hopeful. Humbling because even brilliant, knowledgeable clinicians make these errors. Hopeful because they're potentially preventable through better cognitive strategies.
Corn
What are those strategies?
Herman
One is what's called a "diagnostic timeout" — deliberately pausing after you've reached a provisional diagnosis and asking yourself: "What else could this be? What doesn't fit? If I'm wrong, what am I most likely wrong about?" It's a structured way to combat premature closure.
Corn
You're forcing yourself to generate alternative hypotheses.
Herman
Another technique is "prospective hindsight" or a pre-mortem. You imagine the patient has already had a bad outcome, and you work backward to figure out what you might have missed. It sounds morbid, but it's remarkably effective at surfacing hidden assumptions.
Corn
I can see how that would work. It changes the cognitive frame from "what do I think is happening" to "how could I be wrong."
Herman
Those are different cognitive operations. The first tends to confirm what you already believe. The second actively searches for disconfirming evidence.
Corn
Let me ask you about something under-discussed. You mentioned that experienced clinicians develop illness scripts that encode base rates implicitly. But doesn't that also encode the clinician's own practice demographics? A GP in rural Vermont and a GP in downtown Miami are seeing very different patient populations with different baseline risks.
Herman
That's a really sharp observation. The base rates are local. Lyme disease has a completely different pre-test probability in Connecticut than it does in Arizona. A GP who trained in one setting and then moves to a very different population has to deliberately recalibrate their internal base rates, and that's actually quite difficult to do.
Corn
Because it's not conscious. You're not thinking "the base rate of Lyme here is X percent." You just know that certain symptom clusters feel more or less concerning based on your accumulated experience.
Herman
And this is one of the arguments for formal Bayesian reasoning as a check on pattern recognition. If you can step back and say, "Okay, my gut says this is unlikely, but what are the actual numbers for this population?" — that's a valuable corrective.
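As a rough illustration of why local base rates matter, the sketch below runs an identical presentation against two regional priors for Lyme disease. The prevalence figures and likelihood ratio are placeholders chosen for the example, not epidemiological data:

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Odds-form Bayes: convert the prior to odds, apply the likelihood ratio, convert back."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

# Identical symptom cluster, identical likelihood ratio, very different local priors
# (placeholder numbers for a high-endemic versus low-endemic region):
lr = 20
print(f"High-endemic region (prior 2%):   {posterior(0.02, lr):.0%}")    # ~29%
print(f"Low-endemic region (prior 0.01%): {posterior(0.0001, lr):.1%}")  # ~0.2%
```

The gut feeling trained in one region carries the old prior with it; writing the numbers down is the deliberate recalibration Herman describes.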
Corn
What about the role of technology here? Clinical decision support tools, AI diagnostic assistance — is that changing how this gets taught?
Herman
It's starting to, and I think we're at an interesting inflection point. There are systems now that can take a patient's presentation and generate a differential diagnosis with associated probabilities, pulling from massive datasets. The question becomes: how do you teach a trainee to use that without becoming dependent on it?
Corn
The autopilot problem. If the computer is always doing the Bayesian math for you, do you ever develop your own internal sense of it?
Herman
There's good research from aviation — which has been dealing with this for decades — showing that over-reliance on automation erodes fundamental skills. Pilots who always use autoland lose their ability to hand-fly an approach in bad weather. The parallel to clinical reasoning is pretty direct.
Corn
By the way, DeepSeek V four Pro is writing our script today, which is mildly ironic given what we're discussing.
Herman
It really is. We're talking about human judgment versus algorithmic assistance, and an AI is literally constructing our dialogue about it.
Corn
Do you see this going the same way as aviation? Where the solution is to deliberately maintain manual skills through periodic practice?
Herman
I think that's part of it. But there's also an argument that the comparison is imperfect because clinical reasoning involves a lot more ambiguity than flying an aircraft. An ILS approach has a defined glidepath. A patient with "I just don't feel right" does not.
Corn
That's fair. Let's talk about the zebra problem specifically. Daniel mentioned "the dangers of zebras actually being zebras." How do you teach someone to know when to break the heuristic?
Herman
This is one of the hardest things in clinical medicine. The heuristic exists for a reason — it's correct the vast majority of the time. But when it's wrong, the consequences can be catastrophic. The features that distinguish a zebra from a horse are often subtle and easy to dismiss if you're not actively looking for them.
Corn
What's a concrete example?
Herman
Take Addison's disease — primary adrenal insufficiency. It's rare. The presenting symptoms are fatigue, weight loss, maybe some abdominal pain, maybe some hyperpigmentation if you're looking carefully. Every single one of those symptoms is incredibly common and non-specific. A GP might see a hundred patients with fatigue for every one with Addison's. The base rate screams "this is depression, or anemia, or just life."
Corn
What triggers the clinician to think "maybe this is Addison's"?
Herman
Often it's a constellation of subtle things that don't quite fit the common explanation. The fatigue is progressive rather than fluctuating. The weight loss is unexplained. Maybe the patient mentions craving salt — which sounds quirky and easy to dismiss, but is actually a classic Addison's feature. And then there are what we call "red flags" or "alarm features" — things that should make you expand your differential even if the base rate is low.
Corn
Those red flags are explicitly taught?
Herman
They are, and they're one of the core things drilled in residency. Every common presentation has its associated red flags. Headache with positional variation or morning vomiting — think mass lesion. Back pain with nighttime pain or constitutional symptoms — think malignancy or infection. The red flags are basically the system's way of saying "the base rate for something serious just went up enough that you need to think about it."
Corn
It's like a structured override mechanism.
Herman
That's exactly what it is. And the art is knowing when to pull that override. If you pull it too often, you're the doctor who orders a million-dollar workup for every headache. If you never pull it, you're the doctor who misses the brain tumor.
Corn
How much of this is teachable versus just — and I hesitate to use this word — talent?
Herman
I think a lot of it is teachable, but not all of it. Some people do seem to have a natural aptitude for holding multiple possibilities in mind simultaneously, for staying cognitively flexible. But even those people benefit enormously from structured training. What's harder to teach is the emotional regulation piece.
Corn
Say more about that.
Herman
A lot of diagnostic error isn't really about cognitive bias in the pure sense. It's about emotional factors. You're tired, you're behind schedule, you're dealing with a difficult patient, you're worried about a different patient who's much sicker, and you just want to wrap this one up. Premature closure is often as much about emotional closure as cognitive closure. You want the case to be done.
Corn
That's a very human thing. And it's probably under-discussed in medical training.
Herman
Massively under-discussed. There's this culture of stoicism in medicine that makes it hard to acknowledge that your reasoning might be compromised by the fact that you haven't eaten in eight hours and you got four hours of sleep and your last patient yelled at you. But those things matter.
Corn
What's the fix? Mandatory lunch breaks?
Herman
Some of it is structural — reasonable workloads, adequate staffing, time to think. But there's also a growing emphasis on "metacognition" in medical education — teaching trainees to monitor their own cognitive state. "Am I tired right now? Am I frustrated with this patient? Is that affecting my judgment?" Just asking those questions can help.
Corn
It's almost like the diagnostic timeout you mentioned earlier, but applied to your own mental state rather than the clinical reasoning.
Herman
Some programs are starting to incorporate this explicitly. There's work coming out of places like the University of Toronto and Harvard on teaching cognitive debiasing strategies, and the results are promising but mixed. The honest answer is that we're still figuring out the best way to do this.
Corn
I want to circle back to something you touched on earlier — the longitudinal feedback loop with patients. You said that's where real judgment gets forged. But doesn't that take years? How do you accelerate that for trainees?
Herman
You can't fully accelerate it. There's no substitute for seeing ten thousand patients over ten years. But you can compress some of the learning through deliberate practice with feedback. Simulation cases, standardized patients, structured case discussions where an experienced clinician walks through their own reasoning step by step — making the implicit explicit.
Corn
Making the implicit explicit. That seems like the theme here.
Herman
It really is. The expert clinician's gut feeling isn't magic — it's pattern recognition built on thousands of cases with feedback. The problem is that the expert often can't articulate why they know something. They just know. The teaching challenge is to excavate that tacit knowledge and make it transferable.
Corn
Which is hard because the expert might not even be consciously aware of what cues they're responding to.
Herman
There's a whole body of research on this in cognitive psychology — it goes back to work on expert-novice differences in the eighties and nineties. Experts in any domain, including medicine, tend to chunk information differently. A novice sees individual data points — blood pressure, heart rate, temperature, physical exam findings. An expert sees patterns — "this looks like sepsis" or "this looks like dehydration" — without being able to decompose exactly how they got there.
Corn
You can't just tell the novice to "see patterns."
Herman
You have to build the patterns through exposure and feedback. This is why residency is structured the way it is — it's essentially a supervised pattern-acquisition period. The resident sees high volumes of patients, makes decisions, gets immediate feedback from an attending physician, and gradually internalizes the patterns.
Corn
What about the role of specific diagnostic tests in all this? Because one of the things that's changed dramatically in the last few decades is the availability of rapid, accurate testing. Does that change the calculus?
Herman
It changes it profoundly, and not always for the better. On one hand, having a rapid strep test or a troponin level or a CT scan can quickly resolve uncertainty. On the other hand, it can create a crutch. Why develop sophisticated clinical judgment about whether this abdominal pain is surgical when you can just scan everyone?
Corn
Which brings us back to the false positive problem.
Herman
The cost problem, and the radiation exposure problem, and the incidentaloma problem. You scan a hundred people with abdominal pain to find the one with appendicitis, and in the process you find three ovarian cysts, two adrenal nodules, and a liver hemangioma — none of which are causing symptoms, but all of which now need follow-up and generate anxiety and sometimes lead to unnecessary procedures.
Corn
The clinical judgment isn't just about making the diagnosis. It's about knowing when testing is actually going to help versus when it's going to open a can of worms.
Herman
That's beautifully put. And it's one of the hardest things to teach because the harms of over-testing are often invisible. The patient who got the unnecessary biopsy that came back negative goes home and feels relieved — they don't realize the biopsy was probably never indicated in the first place. The harm is hidden.
Corn
Whereas the harm of missing something is very visible. It's a lawsuit, it's a bad outcome, it's a morbidity and mortality conference.
Herman
The incentive structure in medicine, both legal and psychological, strongly favors over-testing. The clinician who orders too many tests is seen as thorough. The clinician who orders too few is seen as reckless. Even if, from a population health perspective, the over-tester is causing more net harm.
Corn
This is the defensive medicine problem.
Herman
Studies have estimated that defensive medicine accounts for somewhere between two and ten percent of healthcare spending in the United States. That's tens of billions of dollars annually.
Corn
How do you teach a trainee to navigate that tension? To be appropriately conservative without being reckless?
Herman
Part of it is teaching them to explicitly assess pre-test probability and understand the limitations of testing. Part of it is modeling — watching experienced clinicians who are comfortable with uncertainty and who can explain their reasoning. And part of it, honestly, is giving them enough supervised experience that they develop confidence in their own judgment. A resident who's never been allowed to make a decision without a CT scan won't develop the confidence to make decisions without one.
Corn
You're describing a kind of calibrated courage.
Herman
That's a great phrase. It's not recklessness — it's the willingness to act on a reasoned assessment of probability, accepting that uncertainty is inherent and that occasional errors are inevitable, while having systems in place to catch those errors before they cause harm.
Corn
The systems piece seems important. You mentioned earlier that feedback loops are often broken in modern healthcare. What would a good feedback system look like?
Herman
Ideally, every clinician would have access to data on their own diagnostic accuracy over time. How often were they right? What did they miss? What patterns emerge? This is starting to happen in some integrated systems — places like Kaiser Permanente or the VA, where they have longitudinal electronic health records and can track outcomes. But for most clinicians, especially in fragmented healthcare systems, that data simply doesn't exist.
Corn
You're operating blind. You don't know your own error rate.
Herman
You don't. And that's terrifying when you think about it. Every other high-stakes profession — aviation, nuclear power, even professional sports — has robust feedback systems. Pilots get debriefed after every flight, with data from the flight recorder. Baseball players know their batting average against every pitcher they've faced. Doctors often have no idea how accurate their diagnostic judgments actually are.
Corn
That seems like a solvable problem with modern electronic health records.
Herman
It is solvable, technically. The barriers are mostly organizational and cultural. There's resistance to measurement, fear of liability, concerns about how the data would be used. But I think it's inevitable that we'll move in that direction eventually.
Corn
Let me ask you one more thing before we wrap. Daniel's prompt was specifically about GPs and family doctors — the generalists who see everything. How is their diagnostic approach different from a specialist's?
Herman
The fundamental difference is in the prior probabilities. A cardiologist's patient population has been pre-filtered — either they have known cardiac disease, or someone thought they might. The base rate of cardiac pathology in a cardiologist's waiting room is much higher than in a GP's waiting room. So the cardiologist is operating in a world where zebras in their domain are actually reasonably common.
Corn
The specialist's intuition is calibrated to a completely different population.
Herman
And this is why you sometimes see friction between generalists and specialists. The GP sends a patient to the cardiologist for chest pain, the cardiologist does a million-dollar workup that's completely appropriate for their population, and the GP thinks it was overkill. Both are reasoning correctly based on their own base rates, but the base rates are different.
Corn
Which is why the GP's role as a gatekeeper is so important. They're the ones who decide which patients need to enter the high-base-rate specialist world.
Herman
That's arguably the most important skill in primary care — knowing when to escalate and when to manage expectantly. It's not about knowing everything. It's about knowing what you don't know, and knowing when the uncertainty is dangerous versus when it's tolerable.
Corn
The core skill is uncertainty management.
Herman
It really is. And that's what makes it so hard to teach, because uncertainty is uncomfortable, and the natural human response is to try to eliminate it through testing or referral or just deciding prematurely so you can stop feeling uncertain. The great clinicians are the ones who can sit with uncertainty, think probabilistically, and make reasoned decisions anyway.
Corn
That's a good place to land. One thing I'm left wondering about — you mentioned earlier that the diagnostic AI tools are at an inflection point. Do you think in ten or twenty years, this whole skill set we've been discussing becomes obsolete? That the machine just does it better?
Herman
I don't think it becomes obsolete, but I think it transforms. The clinician's role shifts from generating the differential to curating it — understanding the output, contextualizing it, communicating it to the patient. The probabilistic reasoning doesn't go away; it just moves up a level of abstraction. But the human skills — the calibrated courage, the emotional attunement, the ability to sit with uncertainty — those don't get automated.
Corn
Because the patient still wants a human to look them in the eye and say "I think you're going to be okay."
Herman
To know when not to say that. Which might be even harder for an algorithm.
Corn
Now: Hilbert's daily fun fact.

Hilbert: In the nineteen fifties, researchers first measured the ampullae of Lorenzini — the electroreceptive organs in sharks — and found they can detect voltage gradients as faint as five-billionths of a volt per centimeter. For comparison, that is roughly the electrical field produced by a single double-A battery if you stretched its terminals from the Yamal Peninsula all the way to Cape Town.
Corn
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoy the show, leave us a review — it helps. We'll be back soon.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.