Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Sergey, and today I'd like to talk about modern approaches to search and
recommendation systems, highlight some of the scaling challenges, and show how they
can be addressed based on my hands-on experience. Over the years, I have led
work on video search, personalized video recommendations, and product search at
companies like Yandex and Microsoft.
This presentation is fairly short, so I'll focus only on the most
fundamental problems and solutions.
Each of these topics deserves a separate deep dive.
If you have questions, feel free to reach out to me.
This talk covers the core ideas behind machine learning-powered
search and recommendation systems: why they matter and how
they are built, ranked, and scaled.
We'll look at key models, system design practices, and recent trends shaping
the future of recommender technologies.
Search and recommendation systems are essential for helping users navigate
huge catalogs of content or products.
They reduce information overload and improve the user experience
by surfacing relevant items.
They also drive key business metrics, more clicks, purchases, and return visits,
and help optimize for long-term value using techniques like reinforcement
learning. And the impact is clear:
30% of YouTube views, 35% of Amazon purchases, and 80% of Netflix watches
all come from recommendations.
These systems are not just useful.
They are critical to modern digital platforms.
When evaluating search and recommendation systems, we often
look at two types of goals.
Short term and long term.
Short-term targets are based on immediate user actions.
Things like clicks, time spent, or purchases.
These are easy to track and give us quick feedback, but if we focus only on
them, we risk overfitting to short-term interests and ignoring real user value.
Long-term targets are about user satisfaction over time.
They are harder to measure, but they better reflect true user value,
such as retention, subscription continuation, or content diversity.
These metrics encourage exploration and help build loyal users.
In short, short-term metrics are helpful, but long-term metrics are what
really drive sustainable success.
Let's take a look at the big picture.
This is a high-level architecture of a typical machine learning-powered
search and recommendation system.
At the top, we have the serving layer, which is the heart of the real-time response.
It includes candidate generation, which retrieves a manageable set of potentially
relevant items; feature generation, which builds features from user, query, and
item data; and models and business rules, where ranking models score candidates
and business logic handles diversity, exploration, and cold-start cases.
These are the blocks we'll focus on for the rest of the presentation:
candidate generation and ranking.
Now, let's zoom out for a second and look at the full loop. At the input, we
have user data, queries, items, and real-time events coming through a data
processing layer. That's how context and behavior enter the system.
Equally important is the logging infrastructure.
Everything users do, clicks, scrolls, dwell time,
is logged and sent into offline pipelines.
Why is that important?
Because we rely on this data for training, evaluation, and experimentation.
If we log incorrectly, we will train on biased and broken labels,
which directly hurts system quality.
In the training pipeline, we apply labeling strategies, some
manual, some assisted by LLMs, and train our models using full
retraining, active learning, or online updates. At the bottom, we measure success.
That includes both offline metrics, like NDCG, and online
metrics like CTR, time spent, and revenue.
The loop is closed by using these metrics to guide further model improvements.
So again, in this talk, we'll go deeper into the candidate
generation and ranking layers.
But keep in mind, the rest of this architecture is equally critical to
building a successful system. To make recommendations at scale, we use a
multi-stage retrieval and ranking funnel.
The idea is simple.
We start with millions or even billions of items and progressively narrow
them down to just the top 50 or so that we actually show to the user.
Let's walk through each stage.
We begin with user profile and query understanding.
This gives us context.
What has the user done in the past?
What are they looking for right now?
Then we move into candidate generation.
This stage focuses on high recall.
We want to gather as many relevant items as possible.
The goal here is to cast a wide net and collect around 10 K items.
We also factor in things like diversity, freshness, and popularity.
Next is pre-ranking.
At this point, we filter down from 10K to roughly 1K items using
lightweight models, often gradient-boosted decision trees or fast neural networks.
We are still optimizing for recall and diversity here, but the
computation needs to be fast.
In the L2 and L3 layers, we do the heavy lifting.
These stages use complex models like deep neural networks or transformers to
evaluate each of those 1K candidates in depth and select the top 50 or so.
The key metric here is NDCG, which reflects ranking quality and relevance.
Finally, there is the re-ranking stage, where we apply business rules, for
example, to increase category diversity, boost new items, or ensure legal
and policy compliance.
What is powerful about this funnel is that it balances performance and efficiency.
We use simple models early for speed and heavy models later for
accuracy, so the system stays fast but delivers high-quality results.
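As a rough sketch of this funnel, here is a minimal Python illustration, where `cheap_score` and `heavy_score` are hypothetical stand-ins for the lightweight pre-ranking model and the heavy L2/L3 model:

```python
import random

def cheap_score(item):
    # Stand-in for a lightweight pre-ranking model (e.g., a small GBDT).
    return item % 97

def heavy_score(item):
    # Stand-in for an expensive L2/L3 model (e.g., a transformer).
    return (item * 31) % 1009

def funnel(catalog, k_candidates=10_000, k_prerank=1_000, k_final=50):
    # Stage 1: candidate generation -- cast a wide net (high recall).
    candidates = random.sample(catalog, min(k_candidates, len(catalog)))
    # Stage 2: pre-ranking -- cheap model narrows ~10K down to ~1K.
    prerank = sorted(candidates, key=cheap_score, reverse=True)[:k_prerank]
    # Stage 3: heavy ranking -- expensive model picks the top ~50.
    return sorted(prerank, key=heavy_score, reverse=True)[:k_final]

results = funnel(list(range(1_000_000)))
```

Each stage applies a progressively more expensive scorer to a progressively smaller set, which is exactly why the overall system stays fast.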
Once we have understood the query and user profile, the first step in the
pipeline is candidate generation.
The goal here is to select a small, manageable set, typically
thousands of items, from a catalog that may contain millions,
billions, or sometimes even trillions of items.
This step needs to be fast, provide high recall, and maintain diversity.
There are several approaches to do this efficiently.
First, we have ANN, or approximate nearest neighbor search, which finds
the closest items in embedding space.
Popular libraries include Faiss from Meta and ScaNN from Google, with fast
algorithms like HNSW, which is the most popular.
ANN relies on good embeddings, which we can get from collaborative filtering,
like ALS (alternating least squares), which is fast but struggles with cold
starts, and from content-based models such as two-tower architectures based
on DSSM or BERT.
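To make the idea concrete, here is a minimal sketch of retrieval in embedding space using exact cosine similarity over toy vectors; real systems replace this brute-force scan with an ANN index such as Faiss with HNSW:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(user_vec, items, k=2):
    # Exact nearest-neighbor search; ANN libraries approximate this
    # scan so it stays fast even over billions of items.
    ranked = sorted(items, key=lambda pair: cosine(user_vec, pair[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

items = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
top = retrieve([1.0, 0.0], items)
```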
Another method is random walk, which discovers related items
by walking through a graph.
It's good for diversity and cold-start problems,
though a bit outdated in modern systems. In search-based systems,
we often use an inverted index, which maps query terms to documents.
This scales extremely well to trillions of items thanks to techniques like term
sharding and BM25-based scoring. Finally, don't underestimate heuristics.
Simple rule-based filters like popularity, recency, or user
subscriptions are still widely used, especially when latency matters.
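As an illustration of the inverted-index approach, here is a toy BM25 scorer over a hypothetical three-document corpus (the `k1` and `b` defaults are conventional values, not from the talk):

```python
import math
from collections import Counter

docs = {
    "d1": "red running shoes".split(),
    "d2": "red dress".split(),
    "d3": "blue running jacket".split(),
}
N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N

# Inverted index: term -> set of documents containing it.
index = {}
for doc_id, terms in docs.items():
    for t in terms:
        index.setdefault(t, set()).add(doc_id)

def bm25(query, k1=1.5, b=0.75):
    # Score only documents that share at least one term with the query.
    scores = Counter()
    for t in query.split():
        postings = index.get(t, set())
        if not postings:
            continue
        idf = math.log(1 + (N - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id in postings:
            tf = docs[doc_id].count(t)
            dl = len(docs[doc_id])
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return [doc_id for doc_id, _ in scores.most_common()]
```

Because only the posting lists of the query terms are touched, this style of retrieval shards well across huge corpora.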
The takeaway is that there is no one-size-fits-all solution.
Most production systems combine multiple methods, ANN, heuristics, and inverted
indexes, to balance performance, coverage, and latency.
Once we have generated a list of candidate items, the next step
is to rank them, to decide what to show and in what order.
The goal of ranking is to sort the candidates by relevance and other
objectives like diversity or conversion potential.
There are two broad types of optimization goals. Short-term metrics
like NDCG, MRR, or precision at k
are widely used in industry and are based on offline judgments
and historical user behavior.
And there are long-term metrics, which are still an active research area.
This includes approaches like reinforcement learning and sequence
models that aim to optimize for lifetime value or sustained engagement.
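The offline metrics mentioned here, precision at k, MRR, and NDCG, can be sketched in a few lines of plain Python over a list of graded relevance labels in ranked order:

```python
import math

def precision_at_k(relevances, k):
    # Fraction of the top-k results that are relevant (label > 0).
    return sum(1 for r in relevances[:k] if r > 0) / k

def mrr(relevances):
    # Reciprocal rank of the first relevant result.
    for rank, r in enumerate(relevances, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

def dcg(relevances, k):
    # Discounted cumulative gain: relevance discounted by log2 of position.
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    # DCG of this ranking divided by DCG of the ideal (sorted) ranking.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```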
Let's look at the main approaches used. Gradient-boosted decision
tree models such as XGBoost or LightGBM are very popular in production.
They're fast and interpretable and work well on structured features.
They are often used as the final ranking layer.
On the other hand, deep neural networks like DSSM, BERT, and other transformer-
based models can model more complex interactions and are commonly used in
upstream stages like ANN, or as feature generators for gradient boosting.
Of course ranking isn't easy.
It comes with several challenges and biases.
Biases such as label and position bias come from implicit feedback like clicks.
There is the cold-start problem, when we lack history for new users or items,
the classic trade-off between relevance and diversity, and another trade-off,
exploration versus exploitation.
And explainability is also a concern: gradient boosting is easier to explain,
while deep neural networks often behave like black boxes. Privacy
is becoming more important, and many teams are exploring federated
learning to train models without centralizing sensitive data.
And finally, latency: deep models can be heavy,
so we need techniques like distillation or caching to keep the system responsive.
In practice, teams often combine these methods, using neural networks
for representation and gradient boosting for final scoring, to
get the best of both worlds.
This slide shows a typical architecture used for ranking in
modern recommendation systems.
We start on the left with input features.
These include user features like demographics or behavior signals;
item features like price, popularity, or category;
interaction features, for example, how often the user has interacted with
similar items; contextual signals like time of day or device type;
and often a score from a lightweight two-tower neural network
that helps pre-rank items.
These features are then passed to a set of gradient-boosted decision tree models.
One model estimates the probability of relevance.
Another predicts the probability that the user will
click if the item is shown.
And a third estimates the conversion rate if the user clicks.
These scores can be combined into a final ranking score depending
on business goals like engagement or revenue.
In more advanced pipelines,
we may also use a multitask deep neural network, shown here at the bottom.
This DNN is trained to predict the same targets, relevance, clicks, and
conversions, but it can model more complex relationships in the input.
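Here is a minimal sketch of how such per-head scores might be blended; the weight and the value per conversion are hypothetical knobs, not numbers from the talk:

```python
def final_score(p_rel, p_click, p_convert, w_rel=1.0, value_per_conversion=5.0):
    # Hypothetical blend: relevance plus the expected value of showing the item.
    # P(conversion | shown) = P(click | shown) * P(convert | click).
    expected_value = p_click * p_convert * value_per_conversion
    return w_rel * p_rel + expected_value

# Rank candidates by the blended score: (p_rel, p_click, p_convert) per item.
candidates = {
    "item_a": (0.9, 0.10, 0.02),
    "item_b": (0.6, 0.30, 0.20),
}
ranked = sorted(candidates, key=lambda i: final_score(*candidates[i]), reverse=True)
```

Tuning the blend weights is how the same model heads can be steered toward engagement or toward revenue.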
Finally, once we have the scores, we apply the re-ranking logic.
This includes promoting diversity to avoid showing too much of the same
content, introducing exploration to test new or less-known items, and
applying business rules, for example, to boost sponsored content or ensure
policy compliance in production.
This architecture is often hybrid.
The DNN might be used offline or upstream, while
gradient boosting remains the final model for online serving due to
its speed and interpretability.
Let's now compare the two most common approaches to ranking, gradient-
boosted decision trees and deep neural networks, starting with
gradient boosting on the left.
These models are widely used in production, especially
for the final ranking stage.
They work very well on structured data, are easy to interpret, and
support fast training and A/B testing.
They are also great for small and medium datasets and have low latency, making
them ideal for high-throughput environments.
But gradient boosting also has its limitations.
It struggles to model complex interactions, doesn't handle sequences or
multimodal data well, and doesn't scale as effectively with large datasets
compared to neural models.
Now moving to deep neural networks on the right side: these are seeing
increased adoption in both research and industry.
DNNs can capture both explicit and implicit feature interactions,
handle embeddings for high-cardinality features, and enable end-to-end training
on sequences like user history. They also support multitask and transfer
learning and scale better with more data.
However, deep networks also bring their own set of challenges.
They are computationally expensive, harder to debug, and less interpretable.
They can suffer from issues like bias, fairness, and even
hallucinations, and they tend to be sensitive to noise or
small changes in input.
Because of these trade-offs, many real-world systems adopt
a hybrid approach, using neural networks upstream for candidate
generation or feature extraction and gradient boosting for final ranking,
to balance performance and efficiency.
When building large-scale neural networks for recommendations,
performance is important, but scalability and efficiency are critical.
This slide outlines some of the key design principles that help us get both.
First, we have late fusion and bi-encoder architectures.
Instead of scoring user-item pairs in real time with a single tower,
we split them into a user tower, which runs online, and an item tower,
which we precompute offline.
This can lead to over 100x speed-up while preserving more than 80%
of the total profit in some systems.
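A minimal sketch of the late-fusion idea, with toy precomputed item vectors and a hypothetical user tower that simply averages the embeddings of past interactions:

```python
# Offline: the item tower precomputes one vector per item (toy values here).
ITEM_VECS = {
    "item_1": [0.2, 0.9],
    "item_2": [0.8, 0.1],
}

def user_tower(history):
    # Online: hypothetical user tower -- here just the mean of the
    # embeddings of items the user has interacted with.
    vecs = [ITEM_VECS[i] for i in history]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def score_all(history):
    u = user_tower(history)
    # Late fusion: a single dot product per item at serving time,
    # instead of running a joint user-item network for every pair.
    return {i: sum(a * b for a, b in zip(u, v)) for i, v in ITEM_VECS.items()}
```

Because item vectors never change per request, only the user tower and the dot products run online, which is where the speed-up comes from.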
Next is contrastive learning.
This trains the model to distinguish between positive
and hard negative examples using losses like InfoNCE.
The goal is to produce embeddings that are compatible
with fast dot-product retrieval.
In some cases, teams report up to a 100% uplift in retrieval performance.
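A minimal sketch of the InfoNCE loss in plain Python, scoring one positive against a set of negatives (the temperature value is a common default, not from the talk):

```python
import math

def info_nce(user_vec, pos_vec, neg_vecs, temperature=0.1):
    # InfoNCE: negative log-softmax of the positive's similarity
    # against the similarities of the negatives.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(user_vec, pos_vec) / temperature]
    logits += [dot(user_vec, n) / temperature for n in neg_vecs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_sum)
```

The loss is small when the user and positive embeddings align and the negatives don't, which is exactly the geometry dot-product retrieval needs.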
A common challenge at scale is embedding size, so we apply
compression techniques like hashing, quantization, or distillation to
keep memory and latency in check.
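The hashing-trick variant of embedding compression can be sketched like this, with hypothetical table sizes: raw IDs are hashed into a fixed number of buckets, so memory stays bounded regardless of catalog size:

```python
import hashlib
import random

NUM_BUCKETS = 1_000  # compressed table size (hypothetical)
EMBED_DIM = 4

random.seed(0)
# One shared table of NUM_BUCKETS rows instead of one row per raw ID.
table = [[random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
         for _ in range(NUM_BUCKETS)]

def embed(item_id: str):
    # Hash the raw ID into a bucket; colliding IDs share a row,
    # trading a little accuracy for a bounded memory footprint.
    bucket = int(hashlib.md5(item_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return table[bucket]
```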
Another critical aspect is bias correction, including popularity bias,
position bias, and feedback loops.
One trick here is to add a context-aware tower during training,
but drop it at inference time.
We also apply hard negative mining to make training more meaningful:
instead of using random negatives, we sample negatives that are hard
for the model to distinguish.
This makes the learning signal much stronger.
Then there is multi-signal learning.
Our models often consume inputs from multiple modalities, text, images,
or tabular data, and from different domains like search queries,
watch history, or cart activity.
Finally, sequential modeling is crucial.
We use transformer-based encoders to capture the temporal aspects
of user behavior, using recent events to predict what comes next.
Together, these principles make it possible to run deep learning at massive
scale in production recommendation systems without sacrificing speed or quality.
Let's now talk about why scaling a recommender system is much harder
than scaling LLMs or vision models.
On the left side, we see that LLMs and computer vision
models scale extremely well.
They take in long sequences like text or pixels, benefit from dense labels
and strong supervision, and have powerful pretraining tasks like next-token
prediction. They use deep transformer architectures and are often latency
tolerant; we don't mind waiting a few seconds for a response. In these domains,
scaling almost always improves quality, and scaling laws are well established.
Now contrast that with recommender systems.
On the right, we deal with massive embedding tables:
billions of user and item IDs.
Our models are often shallow multi-layer perceptrons, and they
must respond in milliseconds,
so we are under heavy latency constraints.
Our data is based on short, sparse behavior sequences,
just a few clicks or skips.
Feedback is implicit, and we don't have a universal self-supervised
training task like LLMs do.
Also no clear scaling law exists here.
As we grow models, we quickly run into bias, noise, and diminishing returns.
The key takeaway is in the box at the bottom: recommender systems
hit unique scaling limits, including latency, implicit feedback, massive
embeddings, and domain-specific biases.
These can't just be solved by making models deeper or wider.
We need domain-specific innovations.
Let's wrap up with some of the most important trends in neural
network recommender systems today.
First, we are seeing a strong shift from traditional gradient-
boosted models to deep learning, what we call neural rankers.
Companies like YouTube, Meta, and Alibaba use these models in
production to power personalized recommendations at scale.
Multi-stage pipelines, with bi-encoders for fast retrieval and deep models for
final scoring, are now fairly standard. While not new, they remain a reliable
way to balance quality and latency at scale.
We are also seeing LLMs being used in novel ways, not just for generating text,
but to interpret embedding spaces and even answer queries based on vectors alone.
In terms of architecture, transformer innovations like Performer from
DeepMind are pushing the envelope for recommendation-specific
models. When it comes to scaling,
several papers have shown that scaling laws apply to recommendation embeddings
and sequence models just like they do in NLP: Meta's 2024 research on trillion-
parameter transducers and the Wukong project aimed to define scaling laws
specific to recommendations. Another growing trend is sequence modeling.
We are moving beyond simple next-item prediction toward modeling
richer, multimodal, time-aware user behavior.
In the area of graph neural networks, models are being used to better capture
user-item relationships, especially in the long tail.
These use inductive or transductive approaches depending on whether the
graph structure is fixed or dynamic.
Finally, reinforcement learning continues to gain traction.
It's used to optimize long-term goals like retention or LTV.
Reinforcement learning also helps with exploration, which is key to avoiding
feedback loops and filter bubbles.
Overall, these trends show that recommendation systems are evolving rapidly,
with a growing focus on scalability, long-term value, and deep user modeling.
That brings us to the end of the presentation.
Thank you so much for your attention.
I've covered the fundamentals of machine learning-powered search and
recommendation systems, explored key architectural components, looked at
challenges in scaling, and discussed emerging trends in this space.
I hope this gave you a useful foundation and maybe even sparked
ideas you'd like to explore further.
If you have any questions or want to continue the conversation,
I'd be happy to chat.
Thanks again.