Daniel sent us this one — he's asking about correlation analysis, both the basics and the advanced stuff. He wants to know what techniques are out there, what the pitfalls are, and how to think about correlation beyond just "here's a number, ship it." There's a lot lurking under the surface here, because correlation is one of those things everyone learns in week one of statistics and then... mostly gets wrong for the rest of their career.
The prompt gets at something really important — what do you actually do when the obvious Pearson correlation isn't enough? Because the standard intro stats version of correlation is basically the statistical equivalent of beige wallpaper. It works fine on tidy, linear, normally-distributed data and completely falls apart everywhere else.
Which is everywhere.
Which is basically everywhere, yes. So let's start with the foundation and then build up. Pearson's r — the one everyone knows. It measures linear correlation between two continuous variables, ranges from negative one to positive one, zero means no linear relationship. It was developed by Karl Pearson in the eighteen nineties, building on Francis Galton's earlier work on regression.
Galton, who also gave us eugenics and the concept of statistical regression to the mean. A mixed legacy, let's say.
But the math stuck. Pearson's r is essentially covariance divided by the product of standard deviations. The formula normalizes everything so you get a unitless number. And that is both its strength and its weakness — it's wonderfully interpretable until it isn't.
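As a quick sketch for the show notes, here's that identity in Python — the simulated data below is purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.8, size=200)

# Pearson's r "by hand": covariance divided by the product of standard deviations.
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# The library version, which also returns a p-value.
r_scipy, p = stats.pearsonr(x, y)

print(round(r_manual, 4), round(r_scipy, 4))  # identical up to floating point
```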
The "until it isn't" is doing a lot of work there. What breaks first?
Pearson's r is catastrophically sensitive to outliers. A single extreme point can drag your correlation from zero point eight to zero point two, or create a phantom correlation where none exists. There's a classic dataset — the Anscombe quartet from nineteen seventy-three — four sets of data with identical means, variances, and correlations, but the scatterplots look completely different.
Anscombe's quartet should be mandatory viewing before anyone is allowed to report a correlation coefficient. One of the sets is basically a straight line with one outlier that destroys the fit. Another is a perfect parabola — a completely nonlinear relationship. All four give you the same r, about zero point eight two.
And that brings us to the first rule of correlation analysis: always plot your data. A correlation coefficient without a scatterplot is a press release, not an analysis.
"A press release, not an analysis." I'm putting that on a mug.
That's Pearson. Now, what do you do when your data isn't normally distributed or the relationship isn't linear? That's where Spearman's rank correlation comes in. Spearman's rho — developed by Charles Spearman in nineteen oh four — works by converting your data to ranks and then computing Pearson's r on those ranks.
Instead of asking "do these numbers move together," you're asking "do these rankings move together."
And that makes it robust to outliers and works for monotonic relationships — relationships that always go up or always go down, even if they're not straight lines. If Y consistently increases as X increases, but the curve bends, Spearman catches it at full strength; Pearson understates it.
Monotonic but not linear — like diminishing returns.
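A minimal sketch of the rank trick — the exponential curve below is monotonic but strongly nonlinear (accelerating rather than diminishing, but the same principle applies):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=300)
y = np.exp(x) + rng.normal(size=300)  # strictly increasing, far from linear

# Spearman is literally Pearson computed on the ranks.
rho_manual, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
rho_scipy, _ = stats.spearmanr(x, y)
r_pearson, _ = stats.pearsonr(x, y)

print(round(rho_manual, 3), round(rho_scipy, 3), round(r_pearson, 3))
# Spearman sits near 1.0; Pearson is substantially lower because the curve bends.
```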
And Spearman doesn't assume normality, which makes it the go-to for a lot of real-world data. Then there's Kendall's tau, developed by Maurice Kendall in nineteen thirty-eight. It's also rank-based but uses a different approach — it looks at concordant and discordant pairs. For every pair of observations, it asks: do they point in the same direction or opposite directions?
It's counting agreements versus disagreements in ordering.
Kendall's tau tends to be more robust than Spearman with small samples and handles ties better. It's also got a cleaner probabilistic interpretation — tau is basically the probability of concordance minus the probability of discordance for a randomly selected pair. I find it elegant in a way Spearman isn't.
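Here's the concordant-versus-discordant counting done explicitly, checked against the library version — a sketch on simulated, tie-free data:

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=60)
y = x + rng.normal(size=60)

# Count concordant and discordant pairs directly.
conc = disc = 0
for i, j in combinations(range(len(x)), 2):
    s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    conc += s > 0
    disc += s < 0

tau_manual = (conc - disc) / (conc + disc)  # P(concordant) - P(discordant)
tau_scipy, _ = stats.kendalltau(x, y)
print(round(float(tau_manual), 4), round(tau_scipy, 4))
```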
When would you reach for Kendall over Spearman?
Small samples, lots of ties in the data, or when you want that probabilistic interpretation. In practice, Spearman is more common, but Kendall is arguably better behaved statistically. They'll usually give you similar answers, but when they diverge, trust Kendall.
We've covered the big three. Pearson for linear and normal, Spearman for monotonic and robust, Kendall for small samples and ties. That's the starter pack.
Here's where most people stop. But the prompt is asking about advanced techniques, and this is where it gets genuinely interesting. Because the fundamental limitation of all three of these is that they measure bivariate association — the relationship between exactly two variables. In the real world, you almost always have more than two variables interacting.
That's where you get the classic "ice cream sales correlate with drowning deaths" problem. Both go up in summer. The correlation is real but the causal interpretation is nonsense.
The lurking variable problem. And the technique designed to handle this is partial correlation. Partial correlation measures the relationship between two variables while controlling for one or more other variables. You're essentially asking: if I hold Z constant, what's the residual correlation between X and Y?
In the ice cream example, you'd control for temperature and watch the correlation between ice cream sales and drownings vanish.
The math is straightforward — you regress X on Z, regress Y on Z, take the residuals from both regressions, and then correlate those residuals. What's left is the relationship between X and Y that isn't explained by Z.
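A sketch of that residual recipe, using a made-up version of the ice cream example (the coefficients are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)            # "temperature": the common cause
x = 2.0 * z + rng.normal(size=n)  # "ice cream sales"
y = 1.5 * z + rng.normal(size=n)  # "drownings"

def residuals(a, b):
    """Residuals of a least-squares regression of a on b (with intercept)."""
    design = np.column_stack([np.ones_like(b), b])
    coef, *_ = np.linalg.lstsq(design, a, rcond=None)
    return a - design @ coef

r_raw, _ = stats.pearsonr(x, y)                                  # large and spurious
r_partial, _ = stats.pearsonr(residuals(x, z), residuals(y, z))  # near zero
print(round(r_raw, 3), round(r_partial, 3))
```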
This scales to multiple control variables?
You can compute partial correlations controlling for entire sets of variables. This is foundational in fields like epidemiology and econometrics, where you're constantly trying to isolate relationships from confounders. But there's a trap here — partial correlation assumes linear relationships among all variables. If the confounder relationship is nonlinear, your partial correlation can be misleading.
Everything in statistics is a model with assumptions, and every assumption is a lie waiting to be exposed.
That's the most Corn thing you've ever said.
I have my moments. So partial correlation handles confounders — what's next?
Let's talk about distance correlation. This is a much more recent development — Gábor Székely and his colleagues introduced it in two thousand seven. And it solves a fundamental problem. Pearson, Spearman, Kendall — they all measure some specific kind of association. If two variables are related but the relationship is non-monotonic, those measures can give you zero even when there's a clear dependence.
Give me an example of non-monotonic dependence.
Take X spread symmetrically around zero and let Y equal X squared. As X increases, Y first decreases, then increases. Pearson gives you zero. Spearman gives you zero. There's clearly a relationship — a perfect deterministic one — but all the standard measures miss it completely.
Because they're all measuring "does Y go up when X goes up" in some form, and here Y goes up and down.
Distance correlation solves this. The intuition is beautiful: instead of measuring how values covary, you measure how distances between pairs of points covary. For any two pairs of observations, you look at the distance between the X values and the distance between the Y values. If X and Y are related in any way, the pattern of pairwise distances in X carries systematic information about the pattern of pairwise distances in Y.
You're comparing the distance matrices of the two variables.
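A naive O of n squared sketch: the sample version double-centers each distance matrix and then correlates the entries. (The dedicated `dcor` package on PyPI is an option if you'd rather not roll your own.)

```python
import numpy as np

def distance_correlation(x, y):
    """Naive sample distance correlation (Székely et al., 2007 estimator)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def centered(m):
        d = np.sqrt(((m[:, None, :] - m[None, :, :]) ** 2).sum(-1))  # pairwise distances
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()         # double centering

    a, b = centered(x), centered(y)
    dcov2 = (a * b).mean()
    return np.sqrt(dcov2 / np.sqrt((a * a).mean() * (b * b).mean()))

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=400)
y = x ** 2  # the non-monotonic example from above

print(round(float(np.corrcoef(x, y)[0, 1]), 3))  # Pearson: ~0
print(round(float(distance_correlation(x, y)), 3))  # clearly positive
```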
And distance correlation has a property that is almost magical: it equals zero if and only if the variables are statistically independent. Not just uncorrelated, not just free of any monotonic trend — fully independent. For any dependence at all, distance correlation is positive.
That's a strong claim. Any dependence whatsoever?
In the population limit, at least. With finite samples, you're estimating it, so there's noise. But the theoretical property is that distance correlation characterizes independence completely. No other correlation measure does that.
Why isn't this everywhere? Why do people still use Pearson?
A few reasons. It's computationally heavier — you're working with n-by-n distance matrices, so it scales poorly to massive datasets. It's less intuitive to explain. And honestly, a lot of fields are just slow to adopt new statistical methods. But for moderate-sized datasets where you don't know the functional form of the relationship, distance correlation is arguably the best tool we have.
There are tests based on it?
Yes, there's a permutation test for distance correlation that gives you a p-value for dependence. You shuffle one variable, recompute the distance correlation thousands of times, and see where your observed value falls in that null distribution. It's computationally intensive but conceptually clean.
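Sketching that permutation test, reusing the `distance_correlation` function from the snippet above:

```python
import numpy as np

def dcor_permutation_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for dependence via distance correlation."""
    rng = np.random.default_rng(seed)
    observed = distance_correlation(x, y)
    null = np.array([distance_correlation(x, rng.permutation(y))
                     for _ in range(n_perm)])
    # One-sided: how often does shuffled (independent) data look this dependent?
    p_value = (1 + (null >= observed).sum()) / (1 + n_perm)
    return observed, p_value

obs, p = dcor_permutation_test(x, y)  # x, y from the parabola example above
print(round(float(obs), 3), round(float(p), 4))
```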
We've got Pearson, Spearman, Kendall, partial correlation, distance correlation. What's the next rung up the ladder?
Canonical correlation analysis. This is where we stop looking at pairs of variables and start looking at sets of variables. CCA finds linear combinations of variables in one set that are maximally correlated with linear combinations of variables in another set.
Instead of correlating X with Y, you're correlating a weighted sum of X-one through X-n with a weighted sum of Y-one through Y-m.
And it gives you not just one pair of canonical variates but multiple pairs, each orthogonal to the previous ones, each capturing the next strongest mode of covariation between the two sets. It's like principal component analysis, but instead of maximizing variance within one set, you're maximizing correlation between two sets.
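Scikit-learn ships an implementation; here's a sketch on synthetic data where both sets share one hidden signal (the data-generating setup is our own invention):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(5)
n = 300
shared = rng.normal(size=(n, 1))  # hidden signal common to both sets

X = shared @ rng.normal(size=(1, 5)) + rng.normal(size=(n, 5))  # set one: 5 variables
Y = shared @ rng.normal(size=(1, 4)) + rng.normal(size=(n, 4))  # set two: 4 variables

cca = CCA(n_components=2)
U, V = cca.fit_transform(X, Y)  # canonical variates for each set

# The first pair of canonical variates correlates strongly;
# the second picks up whatever covariation is left (here, mostly noise).
print(round(float(np.corrcoef(U[:, 0], V[:, 0])[0, 1]), 3))
print(round(float(np.corrcoef(U[:, 1], V[:, 1])[0, 1]), 3))
```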
This sounds like the kind of thing that gets used in genomics or neuroscience, where you have thousands of measurements on one side and behavioral outcomes on the other.
Classic use case. You've got gene expression data — thousands of genes — and you want to see how they relate to a set of clinical measurements. CCA finds the combinations of genes that most strongly correlate with combinations of symptoms. Harold Hotelling developed it in nineteen thirty-five, and it's been a workhorse in multivariate statistics ever since.
What are the pitfalls?

Overfitting is the big one. With high-dimensional data — more variables than observations — CCA will find perfect correlations even with random noise. Regularization is essential in modern applications. There's regularized CCA, sparse CCA that forces many weights to zero for interpretability, kernel CCA for nonlinear relationships. The basic idea has spawned a whole family of methods.
Kernel CCA — that's where you map everything into a higher-dimensional space first?
Right, using the kernel trick from machine learning. You implicitly transform your variables into a feature space where relationships that are nonlinear in the original space become linear. Then you do CCA in that transformed space. It's powerful but even more prone to overfitting, so you need careful cross-validation.
We've covered the spectrum from "correlate two columns in Excel" to "kernelized regularized sparse canonical correlation analysis." There's something satisfying about that arc.
There's one more I want to mention because it addresses a specific pain point: autocorrelation in time series. If you compute a standard correlation between two time series — say, GDP and stock prices — you're going to get a misleading answer because both series are correlated with their own past values.
The spurious regression problem. Granger and Newbold's classic paper.
From nineteen seventy-four. They showed that if you regress two independent random walks on each other, you get statistically significant correlations most of the time. The solution for correlation specifically is to work with differenced data or to use the cross-correlation function, which computes correlation at different time lags.
You're asking not just "are these related" but "does X lead Y by three months?"
The cross-correlation function gives you a correlation coefficient for each lag. X at time t correlated with Y at time t-minus-k. This is fundamental in time series econometrics, signal processing, any domain where timing matters. And it surfaces lead-lag relationships that simple contemporaneous correlation completely misses.
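A by-hand sketch: difference the series first, then correlate at each candidate lag. (statsmodels also has a ready-made `ccf` function; the construction below, with its built-in three-step lead, is invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
base = rng.normal(size=n + 3).cumsum()  # a random walk
x = base[3:]                            # the "leading" series
y = base[:-3] + rng.normal(size=n)      # y follows x with a 3-step delay

# Differencing kills the random-walk autocorrelation that creates
# spurious correlation between the raw levels.
dx, dy = np.diff(x), np.diff(y)

for lag in range(7):
    r = np.corrcoef(dx[: len(dx) - lag], dy[lag:])[0, 1]
    print(lag, round(float(r), 3))  # the spike lands at lag 3
```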
Which brings us to the philosophical question lurking behind all of this. Correlation is not causation — everyone knows that. But what is correlation actually telling you?
That's the question, isn't it? Correlation is a measure of association, and association can arise for many reasons. Direct causation, reverse causation, common causes, selection bias, measurement error, sheer coincidence. The correlation coefficient itself is silent on the mechanism.
Yet people desperately want it to speak. There's an entire industry of "X is correlated with Y, therefore you should do Z" that skips the hard work of identifying mechanisms.
This is where the techniques we've discussed become tools for investigation rather than endpoints. Partial correlation helps you rule out confounders. Cross-correlation helps you establish temporal precedence — which is one of the Bradford Hill criteria for causation. Distance correlation tells you whether there's any dependence at all worth investigating.
Bradford Hill — the nine criteria for causal inference in epidemiology. Strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, analogy.
Temporality — the cause must precede the effect — is the only one that's strictly necessary. Cross-correlation directly addresses that for time series. But even with a strong cross-correlation at a plausible lag, you still haven't proven causation. You've just made the case more plausible.
Let's talk about a specific pitfall that I think deserves more attention: restriction of range. If you only look at a narrow slice of your data, correlations can vanish or appear out of nowhere.
Classic example is SAT scores and college GPA. Within a highly selective university, the correlation might look weak because everyone has high SAT scores. You've restricted the range on the predictor. But across the full population of college students, the correlation is substantial. This trips up people constantly.
Because they're computing correlations on their existing customers, or their current employees, or whatever convenient sample is sitting in the database, and forgetting that the sample isn't representative of the population they're trying to reason about.
And the math is unforgiving — the standard correction formula relates the restricted correlation to the full-population one through the ratio of standard deviations. If the spread in your sample is much smaller than in the population, your observed correlation can be dramatically attenuated.
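A quick simulation of that attenuation — the numbers are invented, but the effect is the point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100_000
sat = rng.normal(size=n)
gpa = 0.5 * sat + rng.normal(scale=np.sqrt(1 - 0.5 ** 2), size=n)  # population r = 0.5

r_full, _ = stats.pearsonr(sat, gpa)

# "Selective university": keep only the top 5% of SAT scores.
admitted = sat > np.quantile(sat, 0.95)
r_restricted, _ = stats.pearsonr(sat[admitted], gpa[admitted])

print(round(r_full, 3), round(r_restricted, 3))  # ~0.5 versus something far smaller
```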
Another one: ecological correlation. Correlating group-level averages and then interpreting the result as if it applies to individuals.
The ecological fallacy. A classic from sociology — Emile Durkheim found that Protestant regions had higher suicide rates, but that doesn't mean individual Protestants were more likely to commit suicide. The correlation at the aggregate level doesn't necessarily hold at the individual level. And in fact, the sign can flip — that's Simpson's paradox territory.
Simpson's paradox is the ultimate cautionary tale for correlation analysis. A trend appears in several groups of data but disappears or reverses when the groups are combined.
The famous Berkeley graduate admissions case from nineteen seventy-three. Overall, men were admitted at a higher rate than women. But broken down by department, women were admitted at equal or higher rates in most departments. The apparent discrimination was an artifact of women applying to more competitive departments.
The aggregate correlation pointed in one direction, the disaggregated correlations pointed in another. And both were "correct" in the sense that the math was fine — the interpretation was the problem.
Which brings us to a practical framework. When I'm doing correlation analysis, I try to follow a workflow. Step one: plot everything. Scatterplots, distributions, time series plots if relevant. Step two: check for outliers and decide how to handle them — remove, transform, or use robust methods. Step three: choose your correlation measure based on what you've seen in the plots.
Step four: think about confounders. What else could be driving both variables? Compute partial correlations if you have data on plausible confounders. Step five: check for subgroup effects — could the relationship differ across categories? Step six: if it's time series, check for autocorrelation and use cross-correlation functions. Step seven: don't overinterpret. Report the uncertainty. Report the assumptions. And never, ever say "therefore" when you mean "is associated with."
"Never say 'therefore' when you mean 'is associated with.'" That might be the best statistical advice I've ever heard.
It's aspirational. I violate it all the time. But I try.
Let's dig into something you mentioned earlier — the computational scaling issue with distance correlation. You said it works with n-by-n distance matrices. At what point does it become impractical?
The naive computation is O of n squared in both time and memory. For a million observations, that's a trillion pairwise distances. You're not doing that on a laptop. But there's been progress. Some recent work uses random projections or binning to approximate distance correlation in near-linear time. The two thousand nineteen paper by Chaudhuri and Hu introduced a fast algorithm based on averaging over random one-dimensional projections.
You project the data onto random lines, compute one-dimensional distance correlations, and average?
The one-dimensional distance correlation can be computed quickly — there's a near-linear-time algorithm based on sorting — and by averaging over enough random projections you get a consistent estimate of the multivariate version. It's a clever trick that makes the method feasible for much larger datasets.
The trade-off is variance — you're introducing approximation error.
Right, but with enough projections, that variance becomes manageable. It's the same principle as Monte Carlo integration. You're trading computational cost for statistical precision in a controlled way.
Let's circle back to something more basic that I think gets overlooked: the distinction between correlation and agreement. A correlation of zero point nine doesn't mean two measurements agree — it means they move together. If I have a scale that's consistently ten pounds too high, it correlates perfectly with the true weight but doesn't agree with it at all.
That's such an important point. Correlation measures linear association, not agreement. For agreement, you want measures like the intraclass correlation coefficient or Bland-Altman plots. The ICC is specifically designed for situations where you want to know if two measurements are interchangeable — same mean, same scale. Pearson doesn't care about either.
This trips up people in medical research constantly. Two devices measuring the same thing, high Pearson correlation, everyone concludes they're equivalent. But the new device could be systematically biased and wildly variable, and Pearson wouldn't flag it.
Bland and Altman's nineteen eighty-six Lancet paper on this is one of the most cited statistical papers of all time, and people still get it wrong. Their method is to plot the difference between measurements against their average. It's brilliantly simple — you can see bias, heteroscedasticity, outliers, all at a glance.
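A minimal sketch of the plot, using the biased-scale example from a moment ago:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
true_weight = rng.uniform(50, 120, size=80)
device_a = true_weight + rng.normal(scale=0.5, size=80)
device_b = true_weight + 10 + rng.normal(scale=0.5, size=80)  # r ~ 1.0, zero agreement

diff = device_b - device_a
mean = (device_a + device_b) / 2

plt.scatter(mean, diff, s=12)
plt.axhline(diff.mean(), color="k", label="mean difference (bias)")
for bound in (diff.mean() - 1.96 * diff.std(ddof=1),
              diff.mean() + 1.96 * diff.std(ddof=1)):
    plt.axhline(bound, color="k", linestyle="--")  # 95% limits of agreement
plt.xlabel("mean of the two measurements")
plt.ylabel("difference (B - A)")
plt.legend()
plt.show()
```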
Heteroscedasticity — the variance changes across the range. Another thing standard correlation methods don't handle well.
Another reason to plot before computing. A funnel shape in the scatterplot — narrow at one end, wide at the other — tells you the correlation isn't uniform across the range. Maybe it's strong for low values and weak for high values, or vice versa. A single correlation coefficient collapses all of that variation into one misleading number.
We've got agreement versus association, heteroscedasticity, restriction of range, ecological fallacy, Simpson's paradox, autocorrelation, outliers, nonlinearity, confounding. The list of ways to misuse correlation is longer than the list of correlation techniques.
That's before we get to the multiple testing problem. If you compute correlations between every pair of variables in a dataset with a hundred variables, that's nearly five thousand correlations. At a five percent significance level, you expect about two hundred fifty false positives. Without correction, you'll "discover" patterns in pure noise.
The garden of forking paths, correlation edition. P-hacking by browsing.
Bonferroni correction is the simplest fix — divide your significance threshold by the number of tests. But it's conservative. The Benjamini-Hochberg procedure controls the false discovery rate and is more powerful. Either way, if you're doing exploratory correlation analysis on high-dimensional data, you need some form of multiple testing correction, or you're just telling stories about noise.
Which, to be fair, is a thriving industry.
It's the business model of a surprising number of things, yes.
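That false-positive arithmetic is easy to reproduce on pure noise, and statsmodels handles the correction:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(9)
data = rng.normal(size=(200, 100))  # 100 variables of pure noise

pvals = []
for i in range(100):
    for j in range(i + 1, 100):
        pvals.append(stats.pearsonr(data[:, i], data[:, j])[1])
pvals = np.array(pvals)  # 4,950 tests

naive_hits = int((pvals < 0.05).sum())            # roughly 250 "discoveries"
reject_bh, *_ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(naive_hits, int(reject_bh.sum()))           # BH should keep essentially none
```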
Let's talk about mutual information for a moment. It's not a correlation measure in the classical sense, but it captures dependence. How does it fit into this landscape?
Mutual information comes from information theory — it's Shannon's concept from nineteen forty-eight. It measures how much knowing one variable reduces your uncertainty about another. If X and Y are independent, mutual information is zero. If Y is a deterministic function of X, mutual information is the entropy of Y.
Like distance correlation, it captures any form of dependence, not just linear or monotonic.
And it's got a beautiful mathematical foundation. The catch is estimation — estimating mutual information from data is hard, especially in high dimensions. There are k-nearest-neighbor methods, kernel density methods, and more recently, neural network-based estimators. But none of them are as straightforward as computing Pearson's r.
The curse of dimensionality hits mutual information estimation particularly hard.
The k-NN estimator by Kraskov, Stögbauer, and Grassberger from two thousand four is the most widely used — it's in basically every mutual information toolbox. But it's still finicky with small samples and high dimensions.
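Scikit-learn wraps an estimator in that k-nearest-neighbor family; here's a sketch contrasting dependence with independence:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(10)
x = rng.uniform(-1, 1, size=1000)
y_dep = x ** 2 + rng.normal(scale=0.05, size=1000)  # non-monotonic dependence
y_ind = rng.normal(size=1000)                       # independent noise

# The estimator expects a 2-D feature matrix, hence the reshape.
mi_dep = mutual_info_regression(x.reshape(-1, 1), y_dep,
                                n_neighbors=3, random_state=0)
mi_ind = mutual_info_regression(x.reshape(-1, 1), y_ind,
                                n_neighbors=3, random_state=0)

print(round(float(mi_dep[0]), 3), round(float(mi_ind[0]), 3))  # positive vs. ~0
```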
For practical dependence detection, distance correlation often wins on ease of use, even though mutual information has the deeper theoretical pedigree.
That's my read, yes. Distance correlation has the "drop it in and it works" property that mutual information estimation hasn't quite achieved yet. But the field is moving fast.
What about correlation for categorical data? We've been mostly talking about continuous variables.
That's a whole parallel universe. For nominal categories, you've got Cramér's V, which is derived from the chi-squared statistic and ranges from zero to one. For ordinal categories — Likert scales, rankings — you've got polychoric correlation, which assumes there's an underlying continuous normal variable that's been discretized.
Polychoric correlation — that's the one where you're estimating what the Pearson correlation would be if you could observe the underlying continuous variable.
It's widely used in psychometrics and structural equation modeling. For binary variables, it reduces to the tetrachoric correlation. These are all based on the idea that the categories are crude measurements of something continuous.
If you don't buy that assumption?
Then you're back to rank-based methods or Cramér's V. There's no single right answer — it depends on what you believe about the data-generating process. Which is true of basically everything we've discussed.
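For completeness, Cramér's V from a chi-squared test on a toy contingency table (the counts are invented):

```python
import numpy as np
from scipy import stats

# Rows: two groups. Columns: three response categories.
table = np.array([[30, 15, 5],
                  [10, 25, 15]])

chi2, p, dof, expected = stats.chi2_contingency(table)
n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))  # 0 = no association, 1 = perfect

print(round(float(cramers_v), 3), round(float(p), 4))
```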
That might be the unifying theme of this entire episode. Every technique has assumptions. The skill isn't in knowing the formulas — it's in knowing when the assumptions are violated and what to do about it.
That's what separates statistical literacy from recipe-following. A recipe says "compute Pearson correlation, if p is less than zero point zero five, you win." Statistical literacy says "look at the data, think about the generating process, choose the right tool, report the uncertainty, and for heaven's sake don't claim causation without a mechanism."
The prompt asked about basic and advanced techniques. I think we've covered the techniques. But the meta-lesson might be more valuable: correlation analysis is as much about skepticism as it is about computation.
The best correlation analyst is the one who trusts their own results the least.
Verifies everything three different ways before believing it.
Which is exhausting, but it's the job.
To summarize the toolkit for anyone keeping score at home. Basic: Pearson, Spearman, Kendall. Always plot first. Intermediate: partial correlation for controlling confounders, cross-correlation for time series, Cramér's V and polychoric for categorical data. Advanced: distance correlation for detecting any form of dependence, canonical correlation for relating sets of variables, mutual information for information-theoretic dependence. And throughout: check for outliers, check for range restriction, check for subgroup effects, correct for multiple comparisons, and never confuse correlation with agreement or association with causation.
That's the syllabus. And honestly, if someone internalizes even half of that, they're ahead of ninety percent of people who report correlations for a living.
The bar is low, but the ladder is tall.
That's another mug.
I'm building a whole kitchen set at this point.
One thing I want to add before we wrap — there's been some interesting work recently on correlation in high-dimensional settings. When you have more variables than observations, the sample correlation matrix is a terrible estimate of the population correlation matrix. Random matrix theory gives you tools to understand what's signal and what's noise.
This is the Marchenko-Pastur distribution territory?
The eigenvalue distribution of a random correlation matrix follows a known law. Eigenvalues that fall outside that distribution are potentially meaningful structure. It's used in finance for cleaning correlation matrices before portfolio optimization — you essentially shrink the noisy eigenvalues and keep the ones that stick out.
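A sketch of the pure-noise benchmark — the Marchenko-Pastur upper edge tells you roughly where noise eigenvalues stop:

```python
import numpy as np

rng = np.random.default_rng(11)
n_obs, n_var = 500, 100
noise = rng.normal(size=(n_obs, n_var))
corr = np.corrcoef(noise, rowvar=False)  # 100 x 100 sample correlation matrix

eigvals = np.linalg.eigvalsh(corr)

# Marchenko-Pastur upper edge for an iid-noise correlation matrix:
q = n_var / n_obs
edge = (1 + np.sqrt(q)) ** 2

print(round(float(eigvals.max()), 3), "vs. MP edge", round(float(edge), 3))
# Pure noise stays (essentially) at or below the edge; eigenvalues well above
# it are candidates for real structure worth keeping.
```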
Even the humble correlation matrix has a whole field of study dedicated to figuring out which parts of it are real.
Which feels like a fitting place to land. Correlation is simple to compute and bottomless to understand. Every time you think you've mastered it, there's another layer.
Somewhere out there, someone is computing a Pearson correlation on two columns of dirty data, not plotting anything, and writing a press release about their groundbreaking discovery.
The circle of life.
And now: Hilbert's daily fun fact.
Hilbert: In eighteen twelve, astronomer Honoré Flaugergues observed a bright, transient lunar phenomenon from his observatory in French Guiana — he named the effect "clair de terre lunaire," believing it was reflected earthlight, though modern astronomers still debate what he actually saw.
...clair de terre lunaire. That's going to be stuck in my head all day.
This has been My Weird Prompts. Our producer is Hilbert Flumingtop. If you enjoyed this, leave us a review wherever you get your podcasts — it helps more than you'd think. I'm Herman Poppleberry.
I'm Corn. Go plot your data.