#2368: How Recommendation Engines Really Work

Unpacking the multi-stage AI pipeline behind Netflix, Spotify, and Amazon’s "you might also like" suggestions—from candidate generation to real-time re-ranking.

Episode Details
Episode ID
MWP-2526
Published
Duration
23:53
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
DeepSeek v3.2

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Hidden Pipeline Behind Your Recommendations

Recommendation engines—the systems powering Netflix’s "Top Picks," Spotify’s "Discover Weekly," and Amazon’s "Customers Also Bought"—are far from magic. They’re industrial-scale AI pipelines designed to balance speed, accuracy, and diversity. Here’s how they work.

The Cascade: Why One Model Isn’t Enough

Handling millions of users and items requires a multi-stage approach:

  1. Candidate Generation: Lightweight models (like two-tower architectures) scan the full catalog in milliseconds, narrowing millions of items to ~500 plausible options. Key innovation: embeddings allow new items (like a just-released show) to be recommended immediately, solving the "cold-start" problem.
  2. Ranking: Gradient-boosted trees (e.g., XGBoost) dominate here, scoring candidates using hundreds of features—historical preferences, real-time behavior (e.g., "clicked three action movies"), and context (time of day, device). Speed is critical: this stage must process ~500 items in tens of milliseconds.
  3. Re-Ranking: Adjusts the top 10–20 results for diversity and business rules (e.g., "no more than two sequels"). Emerging twist: some platforms (like Spotify) now use small, optimized LLMs here to refine selections based on nuanced context (e.g., playlist titles or lyrics).
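The three stages above can be sketched as a toy pipeline. This is a minimal illustration in pure Python, not any platform's real code: the vectors, scores, and the "max per genre" diversity rule are all invented.

```python
# Toy sketch of the cascade: generate -> rank -> re-rank.
# All data and scoring logic is illustrative, not production code.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def generate_candidates(catalog, user_vec, k=500):
    """Stage 1: cheap similarity filter over the full catalog."""
    scored = [(dot(item["vec"], user_vec), item) for item in catalog]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [item for _, item in scored[:k]]

def rank(candidates, features):
    """Stage 2: more careful scoring with richer (here: fake) features."""
    def score(item):
        return item["popularity"] * features["genre_affinity"].get(item["genre"], 0.1)
    return sorted(candidates, key=score, reverse=True)

def rerank(ranked, max_per_genre=2, top_n=10):
    """Stage 3: enforce a simple diversity rule on the short list."""
    out, per_genre = [], {}
    for item in ranked:
        g = item["genre"]
        if per_genre.get(g, 0) < max_per_genre:
            out.append(item)
            per_genre[g] = per_genre.get(g, 0) + 1
        if len(out) == top_n:
            break
    return out
```

Real systems replace each stage with a trained model, but the shape — cheap filter, precise sort, final polish — is the same.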

The Glue: Feature Stores

Every stage relies on a feature store—a unified repository for both batch-computed data (e.g., "user’s favorite genre over the past month") and real-time signals (e.g., "just paused a comedy"). Consistency is key: if the candidate generator and ranking model use different definitions of "user preference," the pipeline breaks.
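A minimal sketch of that consistency guarantee, assuming a simple key-value design (real feature stores like Feast or Michelangelo are far more elaborate): every stage reads the same feature definition through one interface, with fresh real-time values overriding stale batch values.

```python
# Minimal feature-store sketch: one shared definition per feature, served
# to every pipeline stage so they never disagree. Illustrative only.

class FeatureStore:
    def __init__(self):
        self._batch = {}     # e.g. nightly-computed aggregates
        self._realtime = {}  # e.g. events from the last few seconds

    def put_batch(self, user_id, name, value):
        self._batch[(user_id, name)] = value

    def put_realtime(self, user_id, name, value):
        self._realtime[(user_id, name)] = value

    def get(self, user_id, name):
        # A fresh real-time value overrides a stale batch value
        # for the same feature name.
        key = (user_id, name)
        return self._realtime.get(key, self._batch.get(key))
```

Because both the candidate generator and the ranker would call the same `get("u1", "favorite_genre")`, they see identical numbers by construction.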

Trade-Offs Driving Design

  • Latency vs. Accuracy: Candidate generation is fast but approximate; re-ranking is slow but nuanced.
  • Hybrid Systems: Modern engines fuse collaborative filtering (user-item interactions), content-based signals (metadata), and real-time context.
  • Scalability: Techniques like pre-computing item embeddings (via offline "towers") and tiered caching (e.g., Redis for hot features) keep systems responsive.
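The tiered-caching idea can be shown with a tiny LRU "hot" tier standing in for something like Redis, in front of a slower backing store; capacities and data here are invented for illustration.

```python
# Sketch of tiered feature caching: a small in-memory "hot" tier
# (standing in for Redis) in front of a slower backing store.

from collections import OrderedDict

class TieredCache:
    def __init__(self, backing_store, hot_capacity=2):
        self.backing = backing_store   # slow tier, e.g. a warehouse
        self.hot = OrderedDict()       # fast tier, LRU-evicted
        self.capacity = hot_capacity
        self.hot_hits = 0

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)  # refresh LRU position
            self.hot_hits += 1
            return self.hot[key]
        value = self.backing[key]      # fall through to the slow tier
        self.hot[key] = value
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False)  # evict least recently used
        return value
```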

The next frontier? Deeper integration of LLMs for semantic understanding—without sacrificing the hard-won efficiency of today’s pipelines.


#2368: How Recommendation Engines Really Work

Corn
Daniel sent us this one. He's asking us to pull back the curtain on the digital infrastructure behind recommendation systems. You know, the "you might also enjoy this" rows on Netflix, Spotify's Discover Weekly, Amazon's "customers also bought." He wants to know how that data fusion actually works across candidate generation, ranking, and re-ranking, plus the feature stores feeding the whole pipeline. And crucially, he's asking where modern AI—embeddings, two-tower models, LLM-based rerankers—fits into a stack that's been historically dominated by techniques like matrix factorization and gradient-boosted trees.
Herman
Oh, I love this question. It's a perfect blend of classic engineering and cutting-edge machine learning. And the timing is ideal, because the shift from pure collaborative filtering to these hybrid, real-time systems is happening right now.
Corn
Fun fact—deepseek-v3.2 is writing our script today.
Herman
Is that right? Well, hopefully it gets the technical details of feature stores correct. That's a high bar.
Corn
I have faith. Now, to Daniel's point about the "magic" feeling. You open Netflix and it surfaces a show you end up loving, something you never would have searched for. That's not magic; it's a staggeringly complex, multi-stage industrial process. And it's responsible for an enormous amount of what we consume. Recent data suggests Netflix's recommendation system alone serves about eighty percent of all watched content.
Herman
Which translates to something like a billion dollars in annual customer retention savings for them, according to their own reports. So this isn't just a nice-to-have feature; it's the core engine of engagement for these platforms. And the reason it's evolving so fast now is the collision of two worlds: the battle-tested, scalable architectures built over the last fifteen years, and the new capabilities from modern AI models that can understand context and semantics in ways older methods simply couldn't.
Corn
Where do we even start? Do we begin with the user clicking play, or with the mountain of data that makes that click predictable?
Herman
We start with the pipeline because handling millions of users and tens of millions of items requires a cascade, not just one giant model. The first stage is always candidate generation.
Corn
Right, that candidate generation acts as a high-speed filter, taking a catalog of millions of movies or songs and whittling it down to a few hundred plausible options for a specific person, all in milliseconds.
Herman
And to understand why that's necessary, you have to define what a modern recommendation system actually is. At its core, it's a prediction engine. It predicts the probability that a user will engage with an item—click, watch, buy. The oldest and simplest method is collaborative filtering.
Corn
"People who liked X also liked Y.
Herman
That's matrix factorization at work. You have this gigantic, sparse matrix of users and items, and you factor it down to latent features. It’s mathematically elegant and it powered the first wave of these systems. Then you have content-based filtering, which looks at item attributes. But modern systems are almost all hybrids. They fuse collaborative signals, content metadata, and real-time user context.
Corn
Effectiveness comes from that fusion. No single signal is perfect. My watch history tells you something, what similar users watch tells you something else, and the fact that it's Tuesday night and I'm on my phone tells you a third thing. The magic is knitting those together.
Herman
That knitting happens in a very specific, staged pipeline. You mentioned the four key components: candidate generation, ranking, re-ranking, and the feature store that feeds them all. The cascade exists because of a brutal trade-off between accuracy and latency. You could use one incredibly deep neural network to score every single item in the catalog for a user, but it would take minutes. You need speed, so you split the work.
Corn
A quick, dirty filter, then a more careful sort, then a final polish.
Herman
Candidate generation is that quick filter. It uses lightweight models—often two-tower architectures—to scan the entire catalog and pull, say, five hundred candidates. Then the ranking stage takes those five hundred and scores them more precisely, using a richer set of features. Finally, re-ranking might adjust the top ten for diversity, freshness, or business rules.
Corn
The feature store is the nervous system. It’s the unified repository that serves up consistent user and item features—both pre-computed historical ones and real-time signals—to every stage of that pipeline.
Herman
That's the infrastructure. Now, what makes it modern is where the new AI fits into each of those old stages. Take candidate generation, for example: those lightweight models and two-tower architectures. That’s where things get really interesting.
Corn
So let's get concrete. When a user opens the app, walk me through what happens in that first half-second with those models.
Herman
The system immediately queries the feature store for that user's latest embeddings and historical signals. Meanwhile, it's got a pre-computed index of all item embeddings—every movie, every song, every product. The candidate model, which is often a simple neural network with two separate "towers," calculates the similarity between the user vector and the item vectors. One tower encodes the user, the other encodes the item. It's a massive parallel similarity search.
Corn
This is where the shift from classic matrix factorization to embeddings really changed the game.
Herman
Netflix's public shift around twenty twenty-four is a great case study. With matrix factorization, you're dealing with those latent factors derived purely from interaction data. If a movie is new, it has no interactions—the cold-start problem. Embeddings changed that. You can create an initial embedding for a new item based on its content—the actors, the genre, the synopsis. So you can recommend a brand-new show on day one.
Corn
Which is a business imperative. So the two-tower model spits out, say, five hundred candidate items. What makes it "lightweight" enough to scan millions?
Herman
The key is the separation of concerns. The item tower runs offline, batch-processing the entire catalog to pre-compute those item vectors. So at serving time, you're not running a model on millions of items; you're just doing a fast similarity search—often using a specialized library like FAISS—between one user vector and a huge, pre-built index of item vectors. It's all about minimizing real-time computation.
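What a library like FAISS accelerates can be shown in brute-force miniature: a pre-built index of item vectors (computed offline by the item tower) and one user vector at serving time. The vectors here are tiny and made up; real systems use hundreds of dimensions and approximate search rather than this exhaustive scan.

```python
# Brute-force version of the serving-time similarity search that FAISS
# makes fast at scale. Vectors are invented, 2-D, for illustration.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(item_index, user_vec, k):
    """item_index: {item_id: vector}, pre-computed offline by the item tower."""
    scored = sorted(item_index.items(),
                    key=lambda kv: cosine(kv[1], user_vec),
                    reverse=True)
    return [item_id for item_id, _ in scored[:k]]
```

The point of the two-tower split is that only `user_vec` has to be computed at request time; everything in `item_index` was done offline.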
Corn
So we have our five hundred candidates. Now we move from "plausible" to "precise." Enter the ranking stage.
Herman
This is where gradient-boosted trees, or XGBoost, have been the undisputed champion for years. Amazon practically wrote the book on this for real-time scoring. The ranking model takes each candidate and scores it with a much more complex set of features.
Corn
What kind of features?
Herman
Long-term user preferences from the feature store, real-time session data like "they just clicked on three action movies in a row," contextual features like time of day and device, and deep cross-features combining them. The model is answering: given this user, this item, and this exact context, what's the probability of a click or a watch?
Corn
Why are gradient-boosted trees so dominant here? Why not a deep neural network?
Herman
It's the classic trade-off. For tabular data—which is what these hundreds of features are—boosted trees are incredibly efficient, interpretable to a degree, and handle mixed data types beautifully. A deep neural network might gain a fraction of a percent in accuracy, but at a huge cost in training time and serving latency. When you need to score five hundred items in tens of milliseconds, you need a workhorse. That's the ranking stage's job: precision at speed.
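The ranking stage's job can be sketched as: assemble tabular features per candidate, then score. A hand-weighted linear scorer stands in below for a trained gradient-boosted model like XGBoost; the feature names and weights are invented for illustration.

```python
# Miniature ranking stage: per-candidate tabular features, then a score.
# The linear scorer is a stand-in for a trained GBT model; weights invented.

WEIGHTS = {
    "genre_affinity": 2.0,      # long-term preference from the feature store
    "session_clicks": 1.5,      # real-time signal: clicks this session
    "is_evening_release": 0.5,  # context cross-feature
}

def assemble_features(user, item, context):
    return {
        "genre_affinity": user["genre_affinity"].get(item["genre"], 0.0),
        "session_clicks": context["clicks_by_genre"].get(item["genre"], 0),
        "is_evening_release": 1.0 if context["hour"] >= 18 and item["new"] else 0.0,
    }

def rank_candidates(user, candidates, context):
    def score(item):
        feats = assemble_features(user, item, context)
        return sum(WEIGHTS[name] * value for name, value in feats.items())
    return sorted(candidates, key=score, reverse=True)
```

A real ranker learns those weights (and non-linear interactions) from billions of logged impressions; the input shape — hundreds of mixed-type tabular features per candidate — is the part this sketch gets right.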
Corn
Then we have a third stage: re-ranking. If ranking is so precise, why tweak the top results?
Herman
Because the ranking model optimizes for a single metric, usually predicted engagement. If you just take the top ten highest-scoring items, you might get ten very similar things. For Netflix, that could be ten dark Scandinavian crime dramas. For Spotify, ten songs by the same artist. That's a poor user experience.
Corn
Re-ranking introduces business logic and diversity.
Herman
It might enforce rules like "no more than two sequels in the top row" or "promote one new release." And this is where modern AI, specifically large language models, is making a fascinating entry. Spotify has talked about using LLM-based rerankers for Discover Weekly, for example.
Corn
How does an LLM help with that? It seems too slow.
Herman
It's used judiciously. You might take the top fifty items from the ranking stage and feed them, along with rich user context, into a smaller, optimized LLM. The LLM's strength is holistic understanding. It can read a user's recent playlist titles, their listening notes, and the lyrics and mood of candidate songs to make nuanced adjustments for diversity and thematic cohesion that a simple diversity algorithm might miss.
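One hedged way to wire that in, sketched without any real model: build a prompt from the already-ranked short list plus textual context, and parse the model's reordering. `call_llm` is a stub here, not a real API; the prompt shape is an assumption for illustration.

```python
# Sketch of an LLM re-ranking hook. `call_llm` is a placeholder for a real
# model call; prompt format and reply format are illustrative assumptions.

def build_rerank_prompt(candidates, user_context):
    lines = [
        "Re-order these songs for thematic cohesion and diversity.",
        f"User's recent playlist titles: {', '.join(user_context['playlists'])}",
        "Candidates:",
    ]
    lines += [f"{i + 1}. {c['title']} ({c['mood']})"
              for i, c in enumerate(candidates)]
    lines.append("Answer with the candidate numbers in your preferred order.")
    return "\n".join(lines)

def rerank_with_llm(candidates, user_context, call_llm):
    reply = call_llm(build_rerank_prompt(candidates, user_context))
    order = [int(tok) - 1 for tok in reply.split(",")]
    return [candidates[i] for i in order]
```

Note the division of labor: the heavy candidate generation and ranking already happened; the model only sees a short list it can reason about holistically.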
Corn
The LLM isn't starting from scratch; it's fine-tuning the final stack.
Herman
It's a specialized tool in the last stage. There's a framework called CARE—Contextual Adaptation of Recommenders—that's all about this modular approach. You keep your core, scalable recommenders, and you augment them with an LLM for that final contextual polish. It’s about adaptability without throwing out a billion dollars of infrastructure.
Corn
Which brings us back to the core of Daniel's question. This is why systems use multiple stages. Each stage has a different job, a different latency budget, and a different accuracy requirement.
Herman
The trade-offs are everything. Candidate generation is fast but approximate. Ranking is slower but precise. Re-ranking is the slowest, but it operates on a tiny set of items and adds the final layer of intelligence. Break any part of that cascade, and the whole experience falls apart—which is why that pipeline needs something holding it all together.
Corn
And that's where the feature store comes in. It's the unsung hero, the central nervous system making that delicate balancing act possible. Without it, you'd have chaos.
Herman
A feature store is the unified repository that serves consistent, versioned features to every stage of that pipeline. The key insight is that you can't have the candidate generator using one definition of "user affinity for comedy" and the ranking model using a slightly different one calculated five minutes later. They need the same numbers.
Corn
It's the single source of truth. But it's not just a database. It has to handle two completely different kinds of data: batch features and real-time features.
Herman
Batch features are computed on a schedule—overnight, hourly. Things like a user's average watch duration over the last month, or their top three genres. Real-time features are things like "item clicked three seconds ago" or "current scroll velocity." Uber's Michelangelo platform was one of the early public examples of this architecture done at massive scale. Their feature store had to serve models that predicted ride pricing, estimated time of arrival, and driver dispatch, all needing a mix of historical rider behavior and real-time traffic data.
Corn
The challenge is keeping that real-time lane fresh without melting the system. If every user action triggers a cascade of feature updates for millions of other users, you’re toast.
Herman
Which is where tiered architectures come in. Netflix, for example, uses a multi-level caching strategy. The most frequently accessed features, like a user's primary embedding, might live in a blazing-fast in-memory cache like Redis. Less volatile batch features might be in a centralized warehouse. The system is designed so that the ranking model can pull hundreds of features for five hundred items in milliseconds by hitting these optimized layers.
Corn
Give me a scale example. How big does this get?
Herman
Spotify's feature store, part of their personalization infrastructure, has been reported to handle over one hundred billion feature events per day. One hundred billion. Every play, every skip, every playlist add, every search—it all flows in, gets processed, and becomes a feature available to their candidate generation and ranking models. It’s the only way to power something like Discover Weekly, which uses over ten million user playlists for its collaborative filtering signals.
Corn
That's the infrastructure. Now, let's loop back to the AI integration. We've talked about embeddings in candidate gen and LLMs in re-ranking. But where else does modern AI fit? You mentioned cold-start.
Herman
Embeddings are the primary weapon against the cold-start problem for new items, but also for new users. With matrix factorization, a new user is a blank row in the matrix; you have to wait for them to interact. Now, you can immediately generate a rough user embedding from whatever context you have—sign-up survey, device type, even the time of day they joined. It's not perfect, but it's infinitely better than nothing.
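One simple way to bootstrap that rough cold-start embedding, assuming the sign-up survey asks for favorite genres: average the content vectors of whatever the user told you. The vectors and genres below are invented for illustration.

```python
# Hedged cold-start sketch: average the content embeddings implied by a
# sign-up survey. GENRE_VECS is made up; real embeddings are learned.

GENRE_VECS = {
    "action": [0.9, 0.1, 0.0],
    "comedy": [0.1, 0.8, 0.1],
    "documentary": [0.0, 0.2, 0.9],
}

def cold_start_user_vec(survey_genres):
    vecs = [GENRE_VECS[g] for g in survey_genres if g in GENRE_VECS]
    if not vecs:
        # No usable signal at all: fall back to a neutral vector.
        return [0.0] * 3
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(3)]
```

It's crude, but it drops a brand-new user into a sensible region of the embedding space instead of a blank row, and every subsequent interaction refines it.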
Corn
The two-tower model, powered by these rich embeddings, has largely superseded pure matrix factorization for the candidate generation stage.
Herman
In most cutting-edge systems, yes. The comparison is stark. Traditional matrix factorization is elegant but relies solely on past user-item interactions. The two-tower model with content-based embeddings can incorporate so much more: item metadata, user demographics, even the textual description of a movie. It creates a much richer semantic space for that initial similarity search. The old method asks "what have people like you liked?" The new method can ask "based on who you seem to be and what this item is about, might you like this?"
Corn
That's a fundamental shift. It moves from purely behavioral to behavioral-plus-semantic.
Herman
And this is where the scalability challenges become architectural, not just algorithmic. You now have to compute and serve these dense embedding vectors for every user and item, in real time. TikTok's approach to this is fascinating; they push a lot of this real-time scoring to the edge, closer to the user's device, to shave off those critical milliseconds of network latency. It's not just about having a fast model; it's about having the data for that model physically closer.
Corn
Which introduces its own nightmare of data synchronization and consistency. If my 'likes' from ten seconds ago haven't propagated to the edge node my phone is talking to, the model is working with stale data.
Herman
That's the eternal tension: freshness versus latency. You can have perfectly fresh features, but if it takes two seconds to assemble them, the user is gone. Or you can have a blisteringly fast response with features that are five minutes old. The entire infrastructure—the feature store, the caching layers, the compute placement—is engineered to optimize that trade-off for the specific business goal. For a shopping cart recommendation, freshness is paramount. For a "top movies of all time" list, it's less critical.
Corn
The stack is now this hybrid monster. Legacy, battle-tested components like gradient-boosted trees for ranking, sitting alongside modern neural networks for candidate generation, fed by a colossal feature store, with LLMs doing final polish. It's less of a replacement and more of a symbiosis.
Herman
That's the key insight most people miss. The headline is "AI is revolutionizing recommendations." The reality is, AI is being slotted into a highly optimized, multi-stage pipeline where each component has a specific job. Nobody is throwing out XGBoost for ranking because it still wins on the speed-accuracy frontier for tabular data. They're using AI to solve the problems the old stack was worst at: understanding new items, understanding nuanced context, and adding that final layer of semantic coherence. So what you end up with is this hybrid system—part old, part new.
Corn
And that hybrid reality raises some practical questions. For someone building one of these systems, or even just evaluating one as a user, how do you navigate it? What should you actually look for?
Herman
If you're a builder, the first decision point is embeddings versus traditional collaborative filtering. Use embeddings—two-tower models, content-based approaches—when you have a cold-start problem or rich item metadata. If you're launching a new service with no user history, you need that semantic understanding of your catalog. Use matrix factorization when you have a dense, established interaction matrix and pure collaborative signals are your strongest asset. It’s a simpler, cheaper workhorse.
Corn
For the ranking stage? Is it always XGBoost?
Herman
For now, almost always, unless you have a truly massive dataset and can afford the compute for a deep learning ranking model. The rule of thumb is: if your primary features are tabular—numbers, categories, cross-products—gradient-boosted trees will give you the best bang for your buck. The marginal accuracy gain from a neural net rarely justifies the latency and engineering overhead for real-time serving.
Corn
The practical takeaway is, don't get dazzled by the AI headline. Pick the right tool for each stage's specific job.
Herman
And when it comes to evaluating whether a system is any good, the most common mistake is focusing only on precision—did we predict the click? You have to balance that with diversity and discovery. A system with perfect precision would just show you the same thing over and over. Good systems measure things like serendipity and catalogue coverage. They track how often users are exposed to new genres or artists, not just whether they clicked the obvious recommendation.
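Two of those evaluation ideas in toy form: catalogue coverage (what share of the catalogue ever gets recommended to anyone) and within-list genre diversity. Metric definitions vary across the literature; these are deliberately simple illustrative versions.

```python
# Toy evaluation metrics. Definitions are simplified for illustration;
# production systems use more refined versions of both.

def catalogue_coverage(recommendation_lists, catalogue_size):
    """Fraction of the catalogue that appears in at least one list."""
    recommended = {item for recs in recommendation_lists for item in recs}
    return len(recommended) / catalogue_size

def genre_diversity(recs, genre_of):
    """Distinct genres in one list, as a fraction of its length."""
    genres = {genre_of[item] for item in recs}
    return len(genres) / len(recs)
```

A precision-only system can score perfectly while driving both of these numbers toward zero, which is exactly the failure mode described above.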
Corn
That's for the builders. What about us, the users? We're not just passive recipients. How do we actually "train" our own recommendations to be better?
Herman
The levers are simple but powerful. Use the thumbs-up, thumbs-down, or "not interested" buttons religiously. They are direct feedback signals that bypass the interpretation layer. Curating playlists or watchlists creates strong positive signals. And sometimes, the most effective thing is to actively search for what you want—that’s a high-intent signal that most systems prioritize heavily.
Corn
If your Netflix row is stuck in a rut, go search for a documentary about Mongolian yak herding.
Herman
It tells the system to reset its priors. The feature store will log that search, and the candidate generator will immediately start pulling from a different part of the embedding space. Your recommendations should update within a session or two. You are, in a real sense, retraining your personal model with every action—every click a data point, every skip a label.
Corn
That idea of retraining your model in real time—it makes me wonder where the architecture goes from here. If we're already slotting LLMs into the re-ranking stage, is the endgame one giant model that does everything? Will these staged, hybrid architectures get replaced entirely by a single monolithic AI?
Herman
That's the million-dollar open question. The current research, including frameworks like CARE, suggests a modular future, not a monolithic one. The strength of the staged pipeline is its baked-in efficiency. A single LLM trying to scan a billion-item catalog, score each one, and then re-rank for diversity would be impossibly slow and expensive. I think the future is deeper integration, not replacement. Imagine an LLM that doesn't just re-rank, but dynamically chooses which candidate generation strategy to use for you in a given session based on your apparent intent.
Corn
The LLM becomes the conductor of the orchestra, not the entire orchestra.
Herman
It manages the pipeline. The other major frontier is cross-platform recommendation. We've been talking about systems siloed inside Netflix or Spotify. But what if your music app could recommend a podcast because it knows from your calendar you have a long drive? That requires a different kind of architecture—federated learning, like Meta has been exploring, where models train on data across platforms without the raw data ever leaving your device.
Corn
That's the privacy-preserving holy grail, but it's an infrastructure nightmare of a different color. Aligning incentives between companies, standardizing feature definitions... it's a political problem as much as a technical one.
Herman
But it's the logical endpoint of hyper-personalization. The infrastructure we've been talking about today is the foundation that makes even dreaming about that possible. It's all about moving data, transforming it into signals, and making predictions at a speed and scale the human mind can barely comprehend.
Corn
With that, I think we've thoroughly unpacked Daniel's prompt. The magic isn't magic—it's a meticulously engineered cascade of models, data pipelines, and trade-offs, all humming behind a "Because you watched..." row. My final thought is this: the next time a platform suggests something you genuinely love, appreciate the minor miracle of infrastructure that made it possible.
Herman
If it suggests something terrible, well, now you know which part of the cascade probably failed.
Corn
Our thanks, as always, to our producer, Hilbert Flumingtop, for keeping the audio feature store fresh. Thanks to Modal for providing the serverless GPUs that power our production pipeline, letting us spin up models as easily as we spin up conversations.
Herman
If you enjoyed this deep dive, please leave us a review wherever you listen. It's the strongest collaborative filtering signal you can send our way. For the full archive, visit myweirdprompts.
Corn
This has been My Weird Prompts.
Herman
Take your time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.