#2368: How Recommendation Engines Really Work

Unpacking the multi-stage AI pipeline behind Netflix, Spotify, and Amazon’s "you might also like" suggestions—from candidate generation to real-time re-ranking.

Episode Details
Episode ID
MWP-2526
Published
Duration
23:53
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
DeepSeek v3.2

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Hidden Pipeline Behind Your Recommendations

Recommendation engines—the systems powering Netflix’s "Top Picks," Spotify’s "Discover Weekly," and Amazon’s "Customers Also Bought"—are far from magic. They’re industrial-scale AI pipelines designed to balance speed, accuracy, and diversity. Here’s how they work.

The Cascade: Why One Model Isn’t Enough

Handling millions of users and items requires a multi-stage approach:

  1. Candidate Generation: Lightweight models (like two-tower architectures) scan the full catalog in milliseconds, narrowing millions of items to ~500 plausible options. Key innovation: embeddings allow new items (like a just-released show) to be recommended immediately, solving the "cold-start" problem.
  2. Ranking: Gradient-boosted trees (e.g., XGBoost) dominate here, scoring candidates using hundreds of features—historical preferences, real-time behavior (e.g., "clicked three action movies"), and context (time of day, device). Speed is critical: this stage must process ~500 items in tens of milliseconds.
  3. Re-Ranking: Adjusts the top 10–20 results for diversity and business rules (e.g., "no more than two sequels"). Emerging twist: some platforms (like Spotify) now use small, optimized LLMs here to refine selections based on nuanced context (e.g., playlist titles or lyrics).
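The three stages above can be sketched as a toy pipeline. This is a minimal illustration in pure Python, not any platform's real code: the vectors, scores, and the "max per genre" diversity rule are all invented.

```python
# Toy sketch of the cascade: generate -> rank -> re-rank.
# All data and scoring logic is illustrative, not production code.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def generate_candidates(catalog, user_vec, k=500):
    """Stage 1: cheap similarity filter over the full catalog."""
    scored = [(dot(item["vec"], user_vec), item) for item in catalog]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [item for _, item in scored[:k]]

def rank(candidates, features):
    """Stage 2: more careful scoring with richer (here: fake) features."""
    def score(item):
        return item["popularity"] * features["genre_affinity"].get(item["genre"], 0.1)
    return sorted(candidates, key=score, reverse=True)

def rerank(ranked, max_per_genre=2, top_n=10):
    """Stage 3: enforce a simple diversity rule on the short list."""
    out, per_genre = [], {}
    for item in ranked:
        g = item["genre"]
        if per_genre.get(g, 0) < max_per_genre:
            out.append(item)
            per_genre[g] = per_genre.get(g, 0) + 1
        if len(out) == top_n:
            break
    return out
```

Real systems replace each stage with a trained model, but the shape — cheap filter, precise sort, final polish — is the same.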

The Glue: Feature Stores

Every stage relies on a feature store—a unified repository for both batch-computed data (e.g., "user’s favorite genre over the past month") and real-time signals (e.g., "just paused a comedy"). Consistency is key: if the candidate generator and ranking model use different definitions of "user preference," the pipeline breaks.
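A minimal sketch of that consistency guarantee, assuming a simple key-value design (real feature stores like Feast or Michelangelo are far more elaborate): every stage reads the same feature definition through one interface, with fresh real-time values overriding stale batch values.

```python
# Minimal feature-store sketch: one shared definition per feature, served
# to every pipeline stage so they never disagree. Illustrative only.

class FeatureStore:
    def __init__(self):
        self._batch = {}     # e.g. nightly-computed aggregates
        self._realtime = {}  # e.g. events from the last few seconds

    def put_batch(self, user_id, name, value):
        self._batch[(user_id, name)] = value

    def put_realtime(self, user_id, name, value):
        self._realtime[(user_id, name)] = value

    def get(self, user_id, name):
        # A fresh real-time value overrides a stale batch value
        # for the same feature name.
        key = (user_id, name)
        return self._realtime.get(key, self._batch.get(key))
```

Because both the candidate generator and the ranker would call the same `get("u1", "favorite_genre")`, they see identical numbers by construction.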

Trade-Offs Driving Design

  • Latency vs. Accuracy: Candidate generation is fast but approximate; re-ranking is slow but nuanced.
  • Hybrid Systems: Modern engines fuse collaborative filtering (user-item interactions), content-based signals (metadata), and real-time context.
  • Scalability: Techniques like pre-computing item embeddings (via offline "towers") and tiered caching (e.g., Redis for hot features) keep systems responsive.
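The tiered-caching idea can be shown with a tiny LRU "hot" tier standing in for something like Redis, in front of a slower backing store; capacities and data here are invented for illustration.

```python
# Sketch of tiered feature caching: a small in-memory "hot" tier
# (standing in for Redis) in front of a slower backing store.

from collections import OrderedDict

class TieredCache:
    def __init__(self, backing_store, hot_capacity=2):
        self.backing = backing_store   # slow tier, e.g. a warehouse
        self.hot = OrderedDict()       # fast tier, LRU-evicted
        self.capacity = hot_capacity
        self.hot_hits = 0

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)  # refresh LRU position
            self.hot_hits += 1
            return self.hot[key]
        value = self.backing[key]      # fall through to the slow tier
        self.hot[key] = value
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False)  # evict least recently used
        return value
```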

The next frontier? Deeper integration of LLMs for semantic understanding—without sacrificing the hard-won efficiency of today’s pipelines.


#2368: How Recommendation Engines Really Work

Corn
Daniel sent us this one. He's asking us to pull back the curtain on the digital infrastructure behind recommendation systems. You know, the "you might also enjoy this" rows on Netflix, Spotify's Discover Weekly, Amazon's "customers also bought." He wants to know how that data fusion actually works across candidate generation, ranking, and re-ranking, plus the feature stores feeding the whole pipeline. And crucially, he's asking where modern AI—embeddings, two-tower models, LLM-based rerankers—fits into a stack that's been historically dominated by techniques like matrix factorization and gradient-boosted trees.
Herman
Oh, I love this question. It's a perfect blend of classic engineering and cutting-edge machine learning. And the timing is ideal, because the shift from pure collaborative filtering to these hybrid, real-time systems is happening right now.
Corn
Fun fact—deepseek-v3.2 is writing our script today.
Herman
Is that right? Well, hopefully it gets the technical details of feature stores correct. That's a high bar.
Corn
I have faith. Now, to Daniel's point about the "magic" feeling. You open Netflix and it surfaces a show you end up loving, something you never would have searched for. That's not magic; it's a staggeringly complex, multi-stage industrial process. And it's responsible for an enormous amount of what we consume. Recent data suggests Netflix's recommendation system alone serves about eighty percent of all watched content.
Herman
Which translates to something like a billion dollars in annual customer retention savings for them, according to their own reports. So this isn't just a nice-to-have feature; it's the core engine of engagement for these platforms. And the reason it's evolving so fast now is the collision of two worlds: the battle-tested, scalable architectures built over the last fifteen years, and the new capabilities from modern AI models that can understand context and semantics in ways older methods simply couldn't.
Corn
Where do we even start? Do we begin with the user clicking play, or with the mountain of data that makes that click predictable?
Herman
We start with the pipeline because handling millions of users and tens of millions of items requires a cascade, not just one giant model. The first stage is always candidate generation.
Corn
Right, that candidate generation acts as a high-speed filter, taking a catalog of millions of movies or songs and whittling it down to a few hundred plausible options for a specific person, all in milliseconds.
Herman
And to understand why that's necessary, you have to define what a modern recommendation system actually is. At its core, it's a prediction engine. It predicts the probability that a user will engage with an item—click, watch, buy. The oldest and simplest method is collaborative filtering.
Corn
"People who liked X also liked Y.
Herman
That's matrix factorization at work. You have this gigantic, sparse matrix of users and items, and you factor it down to latent features. It’s mathematically elegant and it powered the first wave of these systems. Then you have content-based filtering, which looks at item attributes. But modern systems are almost all hybrids. They fuse collaborative signals, content metadata, and real-time user context.
Corn
Effectiveness comes from that fusion. No single signal is perfect. My watch history tells you something, what similar users watch tells you something else, and the fact that it's Tuesday night and I'm on my phone tells you a third thing. The magic is knitting those together.
Herman
That knitting happens in a very specific, staged pipeline. You mentioned the four key components: candidate generation, ranking, re-ranking, and the feature store that feeds them all. The cascade exists because of a brutal trade-off between accuracy and latency. You could use one incredibly deep neural network to score every single item in the catalog for a user, but it would take minutes. You need speed, so you split the work.
Corn
A quick, dirty filter, then a more careful sort, then a final polish.
Herman
Candidate generation is that quick filter. It uses lightweight models—often two-tower architectures—to scan the entire catalog and pull, say, five hundred candidates. Then the ranking stage takes those five hundred and scores them more precisely, using a richer set of features. Finally, re-ranking might adjust the top ten for diversity, freshness, or business rules.
Corn
The feature store is the nervous system. It’s the unified repository that serves up consistent user and item features—both pre-computed historical ones and real-time signals—to every stage of that pipeline.
Herman
That's the infrastructure. Now, what makes it modern is where the new AI fits into each of those old stages. Take candidate generation, for example: those lightweight models and two-tower architectures. That’s where things get really interesting.
Corn
So let's get concrete. When a user opens the app, walk me through what happens in that first half-second with those models.
Herman
The system immediately queries the feature store for that user's latest embeddings and historical signals. Meanwhile, it's got a pre-computed index of all item embeddings—every movie, every song, every product. The candidate model, which is often a simple neural network with two separate "towers," calculates the similarity between the user vector and the item vectors. One tower encodes the user, the other encodes the item. It's a massive parallel similarity search.
Corn
This is where the shift from classic matrix factorization to embeddings really changed the game.
Herman
Netflix's public shift around twenty twenty-four is a great case study. With matrix factorization, you're dealing with those latent factors derived purely from interaction data. If a movie is new, it has no interactions—the cold-start problem. Embeddings changed that. You can create an initial embedding for a new item based on its content—the actors, the genre, the synopsis. So you can recommend a brand-new show on day one.
Corn
Which is a business imperative. So the two-tower model spits out, say, five hundred candidate items. What makes it "lightweight" enough to scan millions?
Herman
The key is the separation of concerns. The item tower runs offline, batch-processing the entire catalog to pre-compute those item vectors. So at serving time, you're not running a model on millions of items; you're just doing a fast similarity search—often using a specialized library like FAISS—between one user vector and a huge, pre-built index of item vectors. It's all about minimizing real-time computation.
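What a library like FAISS accelerates can be shown in brute-force miniature: a pre-built index of item vectors (computed offline by the item tower) and one user vector at serving time. The vectors here are tiny and made up; real systems use hundreds of dimensions and approximate search rather than this exhaustive scan.

```python
# Brute-force version of the serving-time similarity search that FAISS
# makes fast at scale. Vectors are invented, 2-D, for illustration.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(item_index, user_vec, k):
    """item_index: {item_id: vector}, pre-computed offline by the item tower."""
    scored = sorted(item_index.items(),
                    key=lambda kv: cosine(kv[1], user_vec),
                    reverse=True)
    return [item_id for item_id, _ in scored[:k]]
```

The point of the two-tower split is that only `user_vec` has to be computed at request time; everything in `item_index` was done offline.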
Corn
So we have our five hundred candidates. Now we move from "plausible" to "precise." Enter the ranking stage.
Herman
This is where gradient-boosted trees, or XGBoost, have been the undisputed champion for years. Amazon practically wrote the book on this for real-time scoring. The ranking model takes each candidate and scores it with a much more complex set of features.
Corn
What kind of features?
Herman
Long-term user preferences from the feature store, real-time session data like "they just clicked on three action movies in a row," contextual features like time of day and device, and deep cross-features combining them. The model is answering: given this user, this item, and this exact context, what's the probability of a click or a watch?
Corn
Why are gradient-boosted trees so dominant here? Why not a deep neural network?
Herman
It's the classic trade-off. For tabular data—which is what these hundreds of features are—boosted trees are incredibly efficient, interpretable to a degree, and handle mixed data types beautifully. A deep neural network might gain a fraction of a percent in accuracy, but at a huge cost in training time and serving latency. When you need to score five hundred items in tens of milliseconds, you need a workhorse. That's the ranking stage's job: precision at speed.
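The ranking stage's job can be sketched as: assemble tabular features per candidate, then score. A hand-weighted linear scorer stands in below for a trained gradient-boosted model like XGBoost; the feature names and weights are invented for illustration.

```python
# Miniature ranking stage: per-candidate tabular features, then a score.
# The linear scorer is a stand-in for a trained GBT model; weights invented.

WEIGHTS = {
    "genre_affinity": 2.0,      # long-term preference from the feature store
    "session_clicks": 1.5,      # real-time signal: clicks this session
    "is_evening_release": 0.5,  # context cross-feature
}

def assemble_features(user, item, context):
    return {
        "genre_affinity": user["genre_affinity"].get(item["genre"], 0.0),
        "session_clicks": context["clicks_by_genre"].get(item["genre"], 0),
        "is_evening_release": 1.0 if context["hour"] >= 18 and item["new"] else 0.0,
    }

def rank_candidates(user, candidates, context):
    def score(item):
        feats = assemble_features(user, item, context)
        return sum(WEIGHTS[name] * value for name, value in feats.items())
    return sorted(candidates, key=score, reverse=True)
```

A real ranker learns those weights (and non-linear interactions) from billions of logged impressions; the input shape — hundreds of mixed-type tabular features per candidate — is the part this sketch gets right.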
Corn
Then we have a third stage: re-ranking. If ranking is so precise, why tweak the top results?
Herman
Because the ranking model optimizes for a single metric, usually predicted engagement. If you just take the top ten highest-scoring items, you might get ten very similar things. For Netflix, that could be ten dark Scandinavian crime dramas. For Spotify, ten songs by the same artist. That's a poor user experience.
Corn
Re-ranking introduces business logic and diversity.
Herman
It might enforce rules like "no more than two sequels in the top row" or "promote one new release." And this is where modern AI, specifically large language models, is making a fascinating entry. Spotify has talked about using LLM-based rerankers for Discover Weekly, for example.
Corn
How does an LLM help with that? It seems too slow.
Herman
It's used judiciously. You might take the top fifty items from the ranking stage and feed them, along with rich user context, into a smaller, optimized LLM. The LLM's strength is holistic understanding. It can read a user's recent playlist titles, their listening notes, and the lyrics and mood of candidate songs to make nuanced adjustments for diversity and thematic cohesion that a simple diversity algorithm might miss.
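One hedged way to wire that in, sketched without any real model: build a prompt from the already-ranked short list plus textual context, and parse the model's reordering. `call_llm` is a stub here, not a real API; the prompt shape is an assumption for illustration.

```python
# Sketch of an LLM re-ranking hook. `call_llm` is a placeholder for a real
# model call; prompt format and reply format are illustrative assumptions.

def build_rerank_prompt(candidates, user_context):
    lines = [
        "Re-order these songs for thematic cohesion and diversity.",
        f"User's recent playlist titles: {', '.join(user_context['playlists'])}",
        "Candidates:",
    ]
    lines += [f"{i + 1}. {c['title']} ({c['mood']})"
              for i, c in enumerate(candidates)]
    lines.append("Answer with the candidate numbers in your preferred order.")
    return "\n".join(lines)

def rerank_with_llm(candidates, user_context, call_llm):
    reply = call_llm(build_rerank_prompt(candidates, user_context))
    order = [int(tok) - 1 for tok in reply.split(",")]
    return [candidates[i] for i in order]
```

Note the division of labor: the heavy candidate generation and ranking already happened; the model only sees a short list it can reason about holistically.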
Corn
The LLM isn't starting from scratch; it's fine-tuning the final stack.
Herman
It's a specialized tool in the last stage. There's a framework called CARE—Contextual Adaptation of Recommenders—that's all about this modular approach. You keep your core, scalable recommenders, and you augment them with an LLM for that final contextual polish. It’s about adaptability without throwing out a billion dollars of infrastructure.
Corn
Which brings us back to the core of Daniel's question. This is why systems use multiple stages. Each stage has a different job, a different latency budget, and a different accuracy requirement.
Herman
The trade-offs are everything. Candidate generation is fast but approximate. Ranking is slower but precise. Re-ranking is the slowest, but it operates on a tiny set of items and adds the final layer of intelligence. Break any part of that cascade, and the whole experience falls apart—which is why that pipeline needs something holding it all together.
Corn
And that's where the feature store comes in. It's the unsung hero, the central nervous system making that delicate balancing act possible. Without it, you'd have chaos.
Herman
A feature store is the unified repository that serves consistent, versioned features to every stage of that pipeline. The key insight is that you can't have the candidate generator using one definition of "user affinity for comedy" and the ranking model using a slightly different one calculated five minutes later. They need the same numbers.
Corn
It's the single source of truth. But it's not just a database. It has to handle two completely different kinds of data: batch features and real-time features.
Herman
Batch features are computed on a schedule—overnight, hourly. Things like a user's average watch duration over the last month, or their top three genres. Real-time features are things like "item clicked three seconds ago" or "current scroll velocity." Uber's Michelangelo platform was one of the early public examples of this architecture done at massive scale. Their feature store had to serve models that predicted ride pricing, estimated time of arrival, and driver dispatch, all needing a mix of historical rider behavior and real-time traffic data.
Corn
The challenge is keeping that real-time lane fresh without melting the system. If every user action triggers a cascade of feature updates for millions of other users, you’re toast.
Herman
Which is where tiered architectures come in. Netflix, for example, uses a multi-level caching strategy. The most frequently accessed features, like a user's primary embedding, might live in a blazing-fast in-memory cache like Redis. Less volatile batch features might be in a centralized warehouse. The system is designed so that the ranking model can pull hundreds of features for five hundred items in milliseconds by hitting these optimized layers.
Corn
Give me a scale example. How big does this get?
Herman
Spotify's feature store, part of their personalization infrastructure, has been reported to handle over one hundred billion feature events per day. One hundred billion. Every play, every skip, every playlist add, every search—it all flows in, gets processed, and becomes a feature available to their candidate generation and ranking models. It’s the only way to power something like Discover Weekly, which uses over ten million user playlists for its collaborative filtering signals.
Corn
That's the infrastructure. Now, let's loop back to the AI integration. We've talked about embeddings in candidate gen and LLMs in re-ranking. But where else does modern AI fit? You mentioned cold-start.
Herman
Embeddings are the primary weapon against the cold-start problem for new items, but also for new users. With matrix factorization, a new user is a blank row in the matrix; you have to wait for them to interact. Now, you can immediately generate a rough user embedding from whatever context you have—sign-up survey, device type, even the time of day they joined. It's not perfect, but it's infinitely better than nothing.
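One simple way to bootstrap that rough cold-start embedding, assuming the sign-up survey asks for favorite genres: average the content vectors of whatever the user told you. The vectors and genres below are invented for illustration.

```python
# Hedged cold-start sketch: average the content embeddings implied by a
# sign-up survey. GENRE_VECS is made up; real embeddings are learned.

GENRE_VECS = {
    "action": [0.9, 0.1, 0.0],
    "comedy": [0.1, 0.8, 0.1],
    "documentary": [0.0, 0.2, 0.9],
}

def cold_start_user_vec(survey_genres):
    vecs = [GENRE_VECS[g] for g in survey_genres if g in GENRE_VECS]
    if not vecs:
        # No usable signal at all: fall back to a neutral vector.
        return [0.0] * 3
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(3)]
```

It's crude, but it drops a brand-new user into a sensible region of the embedding space instead of a blank row, and every subsequent interaction refines it.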
Corn
The two-tower model, powered by these rich embeddings, has largely superseded pure matrix factorization for the candidate generation stage.
Herman
In most cutting-edge systems, yes. The comparison is stark. Traditional matrix factorization is elegant but relies solely on past user-item interactions. The two-tower model with content-based embeddings can incorporate so much more: item metadata, user demographics, even the textual description of a movie. It creates a much richer semantic space for that initial similarity search. The old method asks "what have people like you liked?" The new method can ask "based on who you seem to be and what this item is about, might you like this?"
Corn
That's a fundamental shift. It moves from purely behavioral to behavioral-plus-semantic.
Herman
And this is where the scalability challenges become architectural, not just algorithmic. You now have to compute and serve these dense embedding vectors for every user and item, in real time. TikTok's approach to this is fascinating; they push a lot of this real-time scoring to the edge, closer to the user's device, to shave off those critical milliseconds of network latency. It's not just about having a fast model; it's about having the data for that model physically closer.
Corn
Which introduces its own nightmare of data synchronization and consistency. If my 'likes' from ten seconds ago haven't propagated to the edge node my phone is talking to, the model is working with stale data.
Herman
That's the eternal tension: freshness versus latency. You can have perfectly fresh features, but if it takes two seconds to assemble them, the user is gone. Or you can have a blisteringly fast response with features that are five minutes old. The entire infrastructure—the feature store, the caching layers, the compute placement—is engineered to optimize that trade-off for the specific business goal. For a shopping cart recommendation, freshness is paramount. For a "top movies of all time" list, it's less critical.
Corn
The stack is now this hybrid monster. Legacy, battle-tested components like gradient-boosted trees for ranking, sitting alongside modern neural networks for candidate generation, fed by a colossal feature store, with LLMs doing final polish. It's less of a replacement and more of a symbiosis.
Herman
That's the key insight most people miss. The headline is "AI is revolutionizing recommendations." The reality is, AI is being slotted into a highly optimized, multi-stage pipeline where each component has a specific job. Nobody is throwing out XGBoost for ranking because it still wins on the speed-accuracy frontier for tabular data. They're using AI to solve the problems the old stack was worst at: understanding new items, understanding nuanced context, and adding that final layer of semantic coherence. So what you end up with is this hybrid system—part old, part new.
Corn
And that hybrid reality raises some practical questions. For someone building one of these systems, or even just evaluating one as a user, how do you navigate it? What should you actually look for?
Herman
If you're a builder, the first decision point is embeddings versus traditional collaborative filtering. Use embeddings—two-tower models, content-based approaches—when you have a cold-start problem or rich item metadata. If you're launching a new service with no user history, you need that semantic understanding of your catalog. Use matrix factorization when you have a dense, established interaction matrix and pure collaborative signals are your strongest asset. It’s a simpler, cheaper workhorse.
Corn
For the ranking stage? Is it always XGBoost?
Herman
For now, almost always, unless you have a truly massive dataset and can afford the compute for a deep learning ranking model. The rule of thumb is: if your primary features are tabular—numbers, categories, cross-products—gradient-boosted trees will give you the best bang for your buck. The marginal accuracy gain from a neural net rarely justifies the latency and engineering overhead for real-time serving.
Corn
The practical takeaway is, don't get dazzled by the AI headline. Pick the right tool for each stage's specific job.
Herman
And when it comes to evaluating whether a system is any good, the most common mistake is focusing only on precision—did we predict the click? You have to balance that with diversity and discovery. A system with perfect precision would just show you the same thing over and over. Good systems measure things like serendipity and catalogue coverage. They track how often users are exposed to new genres or artists, not just whether they clicked the obvious recommendation.
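Two of those evaluation ideas in toy form: catalogue coverage (what share of the catalogue ever gets recommended to anyone) and within-list genre diversity. Metric definitions vary across the literature; these are deliberately simple illustrative versions.

```python
# Toy evaluation metrics. Definitions are simplified for illustration;
# production systems use more refined versions of both.

def catalogue_coverage(recommendation_lists, catalogue_size):
    """Fraction of the catalogue that appears in at least one list."""
    recommended = {item for recs in recommendation_lists for item in recs}
    return len(recommended) / catalogue_size

def genre_diversity(recs, genre_of):
    """Distinct genres in one list, as a fraction of its length."""
    genres = {genre_of[item] for item in recs}
    return len(genres) / len(recs)
```

A precision-only system can score perfectly while driving both of these numbers toward zero, which is exactly the failure mode described above.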
Corn
That's for the builders. What about us, the users? We're not just passive recipients. How do we actually "train" our own recommendations to be better?
Herman
The levers are simple but powerful. Use the thumbs-up, thumbs-down, or "not interested" buttons religiously. They are direct feedback signals that bypass the interpretation layer. Curating playlists or watchlists creates strong positive signals. And sometimes, the most effective thing is to actively search for what you want—that’s a high-intent signal that most systems prioritize heavily.
Corn
If your Netflix row is stuck in a rut, go search for a documentary about Mongolian yak herding.
Herman
It tells the system to reset its priors. The feature store will log that search, and the candidate generator will immediately start pulling from a different part of the embedding space. Your recommendations should update within a session or two. You are, in a real sense, retraining your personal model with every action—every click a data point, every skip a label.
Corn
That idea of retraining your model in real time—it makes me wonder where the architecture goes from here. If we're already slotting LLMs into the re-ranking stage, is the endgame one giant model that does everything? Will these staged, hybrid architectures get replaced entirely by a single monolithic AI?
Herman
That's the million-dollar open question. The current research, including frameworks like CARE, suggests a modular future, not a monolithic one. The strength of the staged pipeline is its baked-in efficiency. A single LLM trying to scan a billion-item catalog, score each one, and then re-rank for diversity would be impossibly slow and expensive. I think the future is deeper integration, not replacement. Imagine an LLM that doesn't just re-rank, but dynamically chooses which candidate generation strategy to use for you in a given session based on your apparent intent.
Corn
The LLM becomes the conductor of the orchestra, not the entire orchestra.
Herman
It manages the pipeline. The other major frontier is cross-platform recommendation. We've been talking about systems siloed inside Netflix or Spotify. But what if your music app could recommend a podcast because it knows from your calendar you have a long drive? That requires a different kind of architecture—federated learning, like Meta has been exploring, where models train on data across platforms without the raw data ever leaving your device.
Corn
That's the privacy-preserving holy grail, but it's an infrastructure nightmare of a different color. Aligning incentives between companies, standardizing feature definitions... it's a political problem as much as a technical one.
Herman
But it's the logical endpoint of hyper-personalization. The infrastructure we've been talking about today is the foundation that makes even dreaming about that possible. It's all about moving data, transforming it into signals, and making predictions at a speed and scale the human mind can barely comprehend.
Corn
With that, I think we've thoroughly unpacked Daniel's prompt. The magic isn't magic—it's a meticulously engineered cascade of models, data pipelines, and trade-offs, all humming behind a "Because you watched..." row. My final thought is this: the next time a platform suggests something you genuinely love, appreciate the minor miracle of infrastructure that made it possible.
Herman
If it suggests something terrible, well, now you know which part of the cascade probably failed.
Corn
Our thanks, as always, to our producer, Hilbert Flumingtop, for keeping the audio feature store fresh. Thanks to Modal for providing the serverless GPUs that power our production pipeline, letting us spin up models as easily as we spin up conversations.
Herman
If you enjoyed this deep dive, please leave us a review wherever you listen. It's the strongest collaborative filtering signal you can send our way. For the full archive, visit myweirdprompts.
Corn
This has been My Weird Prompts.
Herman
Take your time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.