Daniel sent us this one — a foundational explainer on backpropagation. He wants us to walk through how neural networks actually learn, the chain rule mechanics, gradient descent, and why that 1986 paper by Rumelhart, Hinton, and Williams was the unlock that made today's large language models possible. He also wants us to distinguish the backward pass from the forward pass, explain the credit-assignment problem, cover vanishing and exploding gradients, and then end on why saying "it's just calculus" badly understates how counterintuitive and powerful this idea really is. Which, I have to say, is exactly the kind of thing people say when they want to sound like they've mastered something they've only skimmed.
That phrase drives me up the wall. "It's just the chain rule." Technically true in the same way that a symphony is just vibrations in air. The gap between knowing the chain rule for a single function and applying it efficiently across a computational graph with millions of parameters — that gap is where all the engineering insight lives. And I should mention, by the way, that today's script is coming to us from DeepSeek V four Pro, so if I get especially enthusiastic about calculus, you know who to thank.
Alright, let's start with what backpropagation actually is, because I think even people who use these tools daily sometimes have a fuzzy picture of what happens under the hood. You've got a neural network — layers of neurons, each connected to the next with weights. Data goes in, flows forward, and a prediction comes out. That's the forward pass. But then the network has to learn from its mistakes, and that's where things get interesting.
The forward pass is just arithmetic — you multiply inputs by weights, sum them up, apply an activation function, pass the result to the next layer, repeat until you get an output. Then you compare that output to the correct answer using a loss function, which spits out a single number representing how wrong you are. The forward pass tells you how wrong you are, but it doesn't tell you which weights to adjust or by how much. That's the backward pass — backpropagation. It computes the gradient of the loss with respect to every single weight and bias in the entire network.
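That forward-pass arithmetic can be sketched in a few lines of Python (the weights, inputs, and target here are made up purely for illustration):

```python
import numpy as np

# One forward step through a single layer, then a loss against the target.
x = np.array([0.5, -1.0])                   # input
W = np.array([[0.1, 0.4], [-0.3, 0.2]])     # weights
b = np.array([0.0, 0.1])                    # biases

a = W @ x + b            # multiply by weights and sum
h = np.maximum(a, 0.0)   # ReLU activation
target = np.array([1.0, 0.0])
loss = np.mean((h - target) ** 2)  # one number: how wrong we are
print(loss)  # 0.5 for these made-up numbers
```

The loss is the single scalar the backward pass will differentiate with respect to every weight and bias.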
The gradient tells you, for each parameter, in which direction and how strongly the loss responds if you nudge it, right? So if a weight has a large positive gradient, nudging it downward will reduce the loss.
The gradient is the partial derivative of the loss with respect to that parameter. And backpropagation computes all of those partial derivatives in one efficient backward sweep. The key insight — and this is what Rumelhart, Hinton, and Williams crystallized in 1986 — is that you can reuse computations from later layers to compute gradients for earlier layers. You don't have to recompute everything from scratch for each weight. That's what makes it feasible for networks with more than a handful of parameters.
Let's make this concrete. Walk me through a tiny example so listeners can feel the chain rule in action.
I love this example from the Stanford CS two thirty-one N course. Imagine a simple circuit with three inputs — x equals negative two, y equals five, z equals negative four. The forward pass adds x and y to get q equals three, then multiplies q by z to get f equals negative twelve. So f is our output. Now we want to know how much each input contributed to that output. We start backward. The derivative of f with respect to z is just q, which is three. The derivative of f with respect to q is z, which is negative four. Now, to get the derivative with respect to x, we apply the chain rule — it's the derivative of f with respect to q, times the derivative of q with respect to x. Since q equals x plus y, the derivative of q with respect to x is one. So we multiply negative four by one and get negative four. Same for y. Every operation in the circuit computes its local gradient, and then the chain rule multiplies that local gradient by whatever upstream gradient is flowing back from the output.
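Here is that same circuit written out in Python so you can check the arithmetic yourself (a sketch of the example, not code from the CS231n notes themselves):

```python
# Tiny computational graph: q = x + y, then f = q * z.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute and cache the intermediate value q.
q = x + y        # 3.0
f = q * z        # -12.0

# Backward pass: chain rule from the output inward.
df_dz = q            # d(q*z)/dz = q  -> 3.0
df_dq = z            # d(q*z)/dq = z  -> -4.0
df_dx = df_dq * 1.0  # dq/dx = 1, so chain rule gives -4.0
df_dy = df_dq * 1.0  # dq/dy = 1, so chain rule gives -4.0

print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```

Note how the cached forward value q is reused when computing df_dz, and the upstream gradient df_dq is reused for both x and y — that reuse is the whole trick.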
The "back" in backpropagation is literal — you're propagating error signals backward through the exact same connections that carried data forward. That's the part I think people miss when they say it's just calculus. The algorithm isn't just doing derivatives. It's caching values from the forward pass, reusing them in reverse order, and applying the multivariable chain rule at every fork in the graph where gradients from different paths add together.
That caching step is crucial. During the forward pass, you store the intermediate values — the q in our example, the activations of each layer. Then during the backward pass, you need those stored values to compute the local gradients. Without caching, you'd have to recompute the entire forward pass for every single parameter, which would be astronomically expensive. This is reverse-mode automatic differentiation, and for networks with many inputs and few outputs — which is the typical case in machine learning — it's dramatically more efficient than forward-mode differentiation. Christopher Olah had a great post on this in twenty fifteen where he showed that reverse-mode can be up to ten million times faster than naive methods for modern networks.
Ten million times faster. That's not an optimization — that's the difference between something being physically possible and being science fiction. Which brings us to the credit-assignment problem, which is really what backpropagation was invented to solve.
The credit-assignment problem is deceptively simple to state but fiendishly hard to solve. In a multi-layer network, when you get an error at the output, how do you decide which weights in the earlier layers were responsible? If a hidden neuron in layer two fired strongly and the final prediction was wrong, was that neuron's weight too large? Or was the problem actually in layer one, which fed it bad inputs? In a deep network, every neuron's output is a function of thousands of upstream weights, and the final error is a function of all of them. Untangling that web of responsibility is what backpropagation does — it assigns credit or blame to each parameter in proportion to its actual influence on the output.
Before backpropagation, this was the bottleneck that kept neural networks shallow. You could train a single-layer perceptron because the error signal could be applied directly to the weights connecting inputs to outputs. But as soon as you added hidden layers, the error signal had no clear path to the weights in those layers. Researchers in the nineteen sixties and seventies knew multi-layer networks would be more powerful in principle, but they couldn't figure out how to train them.
This is where the history gets fascinating, and also a bit messy. The nineteen eighty-six Nature paper by Rumelhart, Hinton, and Williams is rightly famous — it's the paper that brought backpropagation to the wider scientific community and revived interest in neural networks. The title was "Learning representations by back-propagating errors," and it showed that hidden units in multi-layer networks could automatically learn useful internal representations of task features. This was huge — it meant the network wasn't just memorizing, it was discovering structure.
They weren't the first to describe the algorithm, were they?
No, and this is where the priority debate gets thorny. Paul Werbos described backpropagation in his nineteen seventy-four Harvard PhD dissertation, where he framed it as reverse-mode optimization for dynamic systems. And even earlier, in nineteen seventy, Seppo Linnainmaa's master's thesis on automatic differentiation provided key mathematical foundations. But Werbos's work was buried in an obscure dissertation during what we now call the first AI winter — nobody in the neural network community was reading it. And Linnainmaa wasn't even working on neural networks at all. So Rumelhart, Hinton, and Williams independently reinvented the algorithm and, crucially, demonstrated its power on neural networks in a top journal at exactly the moment when the field was ready to receive it.
There's a real question here about scientific credit. Does "who invented it" matter, or does "who made it work and convinced the world" matter more? I lean toward the latter, but I can see why Werbos might feel otherwise.
The history of science is full of these cases. The important thing is that the nineteen eighty-six paper was the inflection point. After it was published, researchers could finally train multi-layer networks to do interesting things. The field exploded — for a while. Then it hit a wall.
The vanishing gradient problem.
And this is the part of the story that most popular accounts skip. Backpropagation was theoretically sufficient to train deep networks, but practically it broke down once networks got beyond a few layers. The problem is in the activation functions and the chain rule itself. When you use sigmoid or tanh activation functions, their derivatives are bounded between zero and one. The sigmoid derivative maxes out at zero point two five. Now imagine you have a ten-layer network. As you backpropagate the error signal, you're multiplying by a number less than zero point two five at every layer. After ten layers, the gradient has shrunk by a factor of zero point two five to the tenth power — that's about one in a million. The early layers get essentially zero learning signal.
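You can watch that shrinkage happen numerically. A quick sketch, using the sigmoid's best-case derivative:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_deriv(a):
    s = sigmoid(a)
    return s * (1.0 - s)

# The sigmoid derivative peaks at 0.25, at a = 0.
print(sigmoid_deriv(0.0))  # 0.25

# Even in this best case, ten chained layers multiply ten
# factors of at most 0.25 together.
gradient_scale = 0.25 ** 10
print(gradient_scale)  # 9.5367431640625e-07, about one in a million
```

In practice most pre-activations are not at the peak, so the real factors are even smaller than 0.25 and the signal dies faster still.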
The network isn't learning anything in those early layers. They're just frozen in place with whatever random initialization they started with.
And the deeper the network, the worse it gets. This was formally identified by Sepp Hochreiter in his nineteen ninety-one diploma thesis, and it basically killed deep learning for almost two decades. You could train shallow networks just fine, but deep networks — the kind that might actually do something interesting — were untrainable. The field entered a second dark age. Researchers moved on to support vector machines and other methods that didn't have this problem.
Then there's the flip side — exploding gradients — where the gradients grow exponentially large and your parameter updates become so huge that the network's weights just blow up to infinity or NaN.
Exploding gradients are actually easier to deal with because they're obvious. Your loss goes to infinity, your weights become NaN, training collapses immediately. You know something's wrong. Vanishing gradients are insidious because training still runs — the loss decreases, the metrics look okay — but the early layers aren't learning anything useful. You're essentially training a shallow network with a bunch of dead weight attached to the front. The fix for exploding gradients is gradient clipping — you just cap the gradient magnitude at some maximum value. Simple, crude, effective.
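Clipping by global norm is one common form of that fix. A minimal sketch, with made-up numbers:

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale the whole gradient vector down if its norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

# An exploding gradient gets rescaled; a tame one passes through untouched.
print(clip_by_norm([3000.0, 4000.0], 5.0))  # ≈ [3.0, 4.0]
print(clip_by_norm([0.3, 0.4], 5.0))        # [0.3, 0.4]
```

Scaling the whole vector, rather than capping each component separately, preserves the gradient's direction and only limits its magnitude.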
Vanishing gradients required a whole suite of solutions, and they didn't all arrive at once.
The solutions came in waves. The first big one was better activation functions. The ReLU — rectified linear unit — has a derivative of one for all positive inputs. Multiply by one across a hundred layers and you still have one. That alone made it possible to train much deeper networks. But ReLU isn't perfect — it has the "dying ReLU" problem where neurons can get stuck outputting zero and never recover. So we got variants like Leaky ReLU and ELU.
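The contrast with sigmoid is easy to see numerically. A sketch, assuming the unit stays on the active side:

```python
# ReLU's derivative is 1 for positive inputs and 0 otherwise, so
# gradients through active units pass backward unchanged at any depth.
def relu_deriv(a):
    return 1.0 if a > 0 else 0.0

# A hundred active layers: the chained factor is still exactly 1.
factor = 1.0
for _ in range(100):
    factor *= relu_deriv(0.7)  # any positive pre-activation
print(factor)  # 1.0

# The flip side: a dead unit passes back nothing at all.
print(relu_deriv(-0.3))  # 0.0
```

That second print is the dying-ReLU problem in miniature, and it is what Leaky ReLU and ELU soften by keeping a small nonzero slope on the negative side.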
Then there's weight initialization.
Xavier initialization, also called Glorot initialization, and later Kaiming He initialization. The idea is to set the initial weights so that the variance of the activations stays roughly constant across layers. If you initialize too small, the signal vanishes on the forward pass. Too large, it explodes. Getting this right meant you could start training from a reasonable place instead of having the network collapse in the first few iterations.
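A minimal sketch of the two schemes, assuming Gaussian draws (both are also commonly done with uniform distributions):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Glorot/Xavier: variance scaled by the average of the two fans.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # Kaiming He: variance scaled for ReLU layers, which zero out
    # roughly half their inputs.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(512, 512)
print(W.std())  # close to sqrt(2/512) ≈ 0.0625
```

The point of both formulas is the same: keep activation variance roughly constant from layer to layer, so the forward signal neither vanishes nor explodes before training even begins.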
Batch normalization, introduced by Sergey Ioffe and Christian Szegedy in twenty fifteen, was a game-changer. It normalizes the activations within each mini-batch to have zero mean and unit variance, with learnable scale and shift parameters. This smooths the optimization landscape, reduces sensitivity to initialization, and acts as a regularizer. It also helps with vanishing gradients because normalized activations keep the gradients flowing.
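A sketch of batch norm's training-time forward computation (a real layer also tracks running statistics for use at inference time, which this omits):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(5.0, 3.0, size=(64, 4))  # skewed, badly scaled activations
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))  # means ≈ 0, stds ≈ 1
```

The learnable gamma and beta mean the network can undo the normalization if that turns out to be useful, so nothing is lost by applying it.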
The one I find most conceptually elegant is residual connections — the skip connections from ResNets.
ResNets from twenty fifteen, yes. The insight is beautifully simple: instead of forcing each layer to learn the full transformation, you let it learn the residual — the difference between the desired output and the input. So instead of the block computing F of x alone, it outputs F of x plus x, where the plus x is the skip connection that bypasses the layer entirely. During backpropagation, the gradient can flow through the skip connection unimpeded — no multiplication by small derivatives, no vanishing. It creates a gradient highway straight back to the earliest layers. This is what made it possible to train networks with hundreds or even thousands of layers.
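You can see the gradient highway numerically with a toy one-dimensional "layer" (an illustration of the skip-connection effect only, not an actual ResNet block):

```python
import numpy as np

def layer(x, w):
    # A toy layer F(x) = tanh(w * x), which saturates for large inputs.
    return np.tanh(w * x)

def residual_block(x, w):
    return layer(x, w) + x  # the skip connection adds x straight through

# Compare gradients numerically at a saturating input.
x, w, h = 3.0, 2.0, 1e-6
grad_plain = (layer(x + h, w) - layer(x - h, w)) / (2 * h)
grad_resid = (residual_block(x + h, w) - residual_block(x - h, w)) / (2 * h)
print(grad_plain)  # tiny: tanh is saturated here, the gradient has vanished
print(grad_resid)  # ≈ 1: the skip path carries the gradient through anyway
```

Analytically, the residual block's derivative is the layer's derivative plus one, and that "plus one" is exactly the unimpeded path.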
We went from networks of three or four layers being the practical limit in the nineteen nineties to networks with over a hundred layers being routine by twenty sixteen. That's not incremental progress — that's a phase change enabled by solving the gradient flow problem.
For recurrent neural networks, which have their own version of vanishing gradients through time, the solution was LSTMs — long short-term memory networks — introduced by Hochreiter and Schmidhuber in nineteen ninety-seven. LSTMs have gating mechanisms that let the network learn when to remember and when to forget, creating paths where gradients can flow across many time steps without decaying. This is what made sequence modeling — language, speech, time series — actually work.
Which brings us to today's large language models. When people look at something like Claude or GPT and marvel at how it works, what they're really marveling at — at the most fundamental level — is backpropagation working at scale. Billions of parameters, all being updated by gradients computed through this same chain rule mechanism we've been describing.
It's worth pausing on that scale. A modern large language model might have hundreds of billions of parameters. Every single training step, backpropagation computes the gradient for every single one of those parameters. That's hundreds of billions of partial derivatives, computed in one backward pass that takes a fraction of a second on specialized hardware. If you tried to compute those gradients by finite differences — perturb each parameter slightly, run the forward pass, measure the change — you'd need one forward pass per parameter. For a model with a hundred billion parameters, that's a hundred billion forward passes per training step. Even on the fastest hardware available, one training step would take years.
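The scaling problem is visible even in a toy setting: finite differences needs a separate loss evaluation for every parameter, so its cost grows linearly with parameter count. A sketch:

```python
def loss(params):
    # A toy quadratic loss over a parameter vector.
    return sum(p * p for p in params)

def finite_diff_grad(params, h=1e-6):
    grads = []
    for i in range(len(params)):      # one extra loss evaluation per parameter
        bumped = list(params)
        bumped[i] += h
        grads.append((loss(bumped) - loss(params)) / h)
    return grads

print(finite_diff_grad([1.0, -2.0, 3.0]))  # ≈ [2, -4, 6], the analytic 2p
```

Three parameters means three extra passes; a hundred billion parameters means a hundred billion. Backpropagation gets every one of those numbers from a single backward sweep.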
That's the ten million times faster figure from earlier. It's not just an optimization — it's the difference between deep learning existing and not existing.
This is why I push back so hard on the "it's just calculus" dismissal. Yes, the chain rule was discovered by Leibniz in the seventeenth century. But knowing the chain rule doesn't tell you how to organize computation across a graph with millions of nodes. It doesn't tell you to cache forward-pass values for reuse in the backward pass. It doesn't tell you that reverse-mode differentiation is exponentially more efficient than forward-mode for the many-inputs-few-outputs case. Those are algorithmic insights, not mathematical ones.
There's also something deeply counterintuitive about the backward pass that I think gets lost in the math. In the forward pass, information flows in a physically intuitive way — input to output. But in the backward pass, error signals flow in reverse through the same weights. The same connection that sent activation forward now sends sensitivity backward. That's not how biological neurons work. There's no evidence that brains propagate error signals backward through synapses.
This is the biological implausibility problem, and it's a fascinating tension in the field. Neuroscientists have pointed out for decades that backpropagation requires symmetric feedback connections — the backward path uses the exact same weights as the forward path — and biological synapses are almost certainly not symmetric. There's been work on alternatives like feedback alignment, where random feedback weights are used instead of the exact transpose, and it works surprisingly well. But the fact remains: the algorithm that powers modern AI tells us almost nothing about how brains actually learn.
Does that matter? I mean, airplanes don't flap their wings. The fact that backpropagation isn't biologically plausible doesn't make it less useful as an engineering tool.
It matters for two reasons. First, if your goal is to understand intelligence — human or animal — then backprop-trained networks might be a misleading model. They achieve similar outputs through fundamentally different mechanisms. Second, the biological implausibility might be hinting at limitations we haven't encountered yet. Backprop requires global error signals — you need to know the loss at the output to compute gradients for the earliest layers. Biological learning seems to be much more local. There might be efficiency or robustness advantages to local learning rules that we're missing because backprop works so well on our current benchmarks.
That's a fair point. Alright, let's zoom out. If someone listening wants to really understand backpropagation — not just the hand-wavy version but the actual mechanics — where should they start?
The single best resource is the three blue one brown series on backpropagation. Grant Sanderson has a gift for building intuition through visualization. He walks through the chain rule in computational graphs step by step, and by the end you can feel the gradients flowing. The Stanford CS two thirty-one N course notes are also excellent — they have the concrete circuit examples I mentioned earlier. And if you want the historical perspective, the original nineteen eighty-six Nature paper is surprisingly readable. It's only four pages.
Four pages that changed the world. There's something almost poetic about that. The algorithm that makes modern AI possible fits in four pages, and the core insight — propagate errors backward through the network — can be stated in one sentence. But the implications took decades to fully realize.
We're still realizing them. Every time someone trains a model that does something surprising — generates a coherent paragraph, writes working code, produces a plausible image — the learning mechanism underneath is still backpropagation. The architectures have changed. The hardware is unrecognizable. The datasets are enormous. But the algorithm that adjusts the weights is essentially the same one Rumelhart, Hinton, and Williams described in nineteen eighty-six, with roots going back to Werbos and Linnainmaa.
I think that's what Daniel was getting at with the prompt — this tension between simplicity and depth. The algorithm is conceptually simple enough to explain in a podcast episode, but it contains multitudes. The vanishing gradient problem, the credit-assignment insight, the computational efficiency of reverse-mode differentiation — none of that is obvious from "it's just the chain rule."
I think there's a broader lesson here about how we talk about technical ideas. When you reduce something to "it's just X," you're not demonstrating mastery — you're closing off curiosity. You're saying "there's nothing interesting here" when in fact there's a whole universe of interesting things. Backpropagation is a perfect example. If you stop at the chain rule, you miss the thirty-year struggle to make deep networks trainable. You miss the algorithmic beauty of reverse-mode autodiff. You miss the biological mystery of why a learning rule that's so effective in silicon seems to have no counterpart in carbon.
The struggle is what makes the story compelling. Backpropagation was invented, forgotten, reinvented, popularized, then nearly abandoned because of vanishing gradients, then resurrected by a series of clever innovations — ReLU, batch norm, residual connections. It's not a story of one brilliant insight solving everything. It's a story of an insight sitting dormant for decades until the surrounding ecosystem caught up.
That surrounding ecosystem includes hardware. The nineteen eighty-six paper demonstrated backprop on small networks with a few hundred weights. Training took minutes or hours on the computers of the day. Scaling to millions of parameters required GPUs, which didn't become widely used for machine learning until the late two thousands. Scaling to billions required TPUs and massive distributed training infrastructure. The algorithm was ready decades before the hardware could realize its potential.
Which raises an interesting question: are there algorithms sitting in the literature right now that will be transformative once the hardware catches up? We've been focused on backpropagation for forty years because it works and scales. But maybe the next backprop-equivalent idea is already published, waiting for its moment.
That's entirely possible. There's work on alternatives — target propagation, predictive coding, equilibrium propagation — that solve some of backprop's limitations, like the need for symmetric feedback and global error signals. None of them beat backprop on raw performance yet, but neither did backprop beat perceptrons in nineteen seventy-four. The hardware and the algorithms co-evolve.
If you're a listener who made it this far and wants a practical takeaway, here it is. When you're using these AI tools — whether it's a language model, an image generator, a recommendation system — the thing making it work is a forty-year-old algorithm that efficiently assigns credit to millions or billions of parameters. Understanding that algorithm, even at a conceptual level, gives you a better intuition for what these systems can and can't do, why they fail in certain ways, and why scaling them up has been so remarkably effective.
If you're the kind of person who learns by doing, implementing backpropagation from scratch for a tiny network — like a two-layer network on MNIST — is one of the most educational programming projects you can undertake. You'll wrestle with the chain rule, you'll debug vanishing gradients, you'll feel the satisfaction of watching the loss curve go down. It's maybe a hundred lines of Python with NumPy, and it will permanently change how you think about machine learning.
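As a taste of that project, here is a minimal two-layer network trained with hand-written backpropagation — on XOR rather than MNIST, to keep it fully self-contained (a sketch with arbitrary hyperparameters, not a template to follow exactly):

```python
import numpy as np

rng = np.random.default_rng(42)

# XOR: the classic problem a single-layer perceptron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

# Two-layer network: 2 inputs -> 8 tanh hidden units -> 1 sigmoid output.
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 1.0

for _ in range(5000):
    # Forward pass, caching h for the backward pass.
    h = np.tanh(X @ W1 + b1)
    y = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

    # Backward pass: chain rule, layer by layer.
    # For a sigmoid output with cross-entropy loss, dL/dlogits = (y - t) / N.
    dy = (y - t) / len(X)
    dW2 = h.T @ dy; db2 = dy.sum(axis=0)
    dh = (dy @ W2.T) * (1 - h ** 2)   # tanh'(a) = 1 - tanh(a)^2
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)

    # Gradient descent step on every parameter.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print((y > 0.5).astype(int).ravel())  # a converged run prints [0 1 1 0]
```

The two cached quantities, h and y, are exactly the forward-pass values the backward pass reuses — the caching step from earlier in the episode, in four lines of NumPy.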
That's a solid recommendation. Alright, we should wrap this up. One forward-looking thought: as models get larger and training runs get more expensive, the efficiency of backpropagation itself becomes a bottleneck. There's active research on alternatives that could reduce the computational cost of the backward pass or eliminate it entirely — things like synthetic gradients and direct feedback alignment. Whether any of these dethrone backprop remains to be seen, but the fact that we're still using essentially the same algorithm after forty years doesn't mean we'll still be using it after eighty.
That's the kind of open question that makes this field so exciting. The foundations are solid, but they're not frozen.
Thanks as always to our producer Hilbert Flumingtop for keeping this show running. And thanks to Modal for the serverless GPU infrastructure that powers our pipeline. This has been My Weird Prompts. You can find every episode at myweirdprompts dot com. I'm Corn.
I'm Herman Poppleberry. See you next time.