Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Sergey, and today I'd like to talk about modern approaches to search and
recommendation systems, highlight some of the scaling challenges, and show how they
can be addressed based on my hands-on experience. Over the years, I have led
work on video search, personalized video recommendations, and product search at
companies like Yandex and Microsoft.
This presentation is fairly short, so I'll focus only on the most
fundamental problems and solutions.
Each of these topics deserves a separate deep dive.
If you have questions, feel free to reach out to me.
This talk covers the core ideas behind machine learning-powered
search and recommendation systems: why they matter and how
they are built, ranked, and scaled.
We'll look at key models, system design practices, and recent trends shaping
the future of recommender technologies.
Search and recommendation systems are essential for helping users navigate
huge catalogs of content or products.
They reduce information overload and improve the user experience
by surfacing relevant items.
They also drive key business metrics, more clicks, purchases, and return visits,
and help optimize for long-term value using techniques like reinforcement
learning. And the impact is clear:
30% of YouTube views, 35% of Amazon purchases, and 80% of Netflix watches
all come from recommendations.
These systems are not just useful.
They are critical to modern digital platforms.
When evaluating search and recommendation systems, we often
look at two types of goals.
Short term and long term.
Short-term targets are based on immediate user actions.
Things like clicks, time spent, or purchases.
These are easy to track and give us quick feedback, but if we focus only on
them, we risk overfitting to short-term interests and ignoring real user value.
Long-term targets are about user satisfaction over time.
They are harder to measure, but they better reflect true user value,
such as retention, subscription continuation, or content diversity.
These metrics encourage exploration and help build loyal users.
In short, short-term metrics are helpful, but long-term metrics are what
really drive sustainable success.
Let's take a look at the big picture.
This is a high-level architecture of a typical machine learning-powered
search and recommendation system.
At the top, we have the serving layer, which is the heart of the real-time response.
It includes candidate generation, which retrieves a manageable set of potentially
relevant items; feature generation, which builds features from user, query, and
item data; and models and business rules, where ranking models score candidates
and business logic handles diversity, exploration, and cold-start cases.
These are the blocks we'll focus on for the rest of the presentation:
candidate generation and ranking.
Now, let's zoom out for a second and look at the full loop. At the input, we
have user data, queries, items, and real-time events coming through a data
processing layer. That's how context and behavior enter the system.
Equally important is the logging infrastructure.
Everything users do, clicks, scrolls, dwell time,
is logged and sent into offline pipelines.
Why is that important?
Because we rely on this data for training, evaluation, and experimentation.
If we log incorrectly, we will train on biased and broken labels,
which directly hurts system quality.
In the training pipeline, we apply labeling strategies, some
manual, some assisted by LLMs, and train our models using full
retraining, active learning, or online updates. At the bottom, we measure success.
That includes both offline metrics, like NDCG, and online
metrics like CTR, time spent, and revenue.
The loop is closed by using these metrics to guide further model improvements.
So again, in this talk, we'll go deeper into the candidate
generation and ranking layers.
But keep in mind, the rest of this architecture is equally critical to
building a successful system. To make recommendations at scale, we use a
multi-stage retrieval and ranking funnel.
The idea is simple.
We start with millions or even billions of items and progressively narrow
them down to just the top 50 or so that we actually show to the user.
Let's walk through each stage.
We begin with user profile and query understanding.
This gives us context.
What has the user done in the past?
What are they looking for right now?
Then we move into candidate generation.
This stage focuses on high recall.
We want to gather as many relevant items as possible.
The goal here is to cast a wide net and collect around 10 K items.
We also factor in things like diversity, freshness, and popularity.
Next is pre-ranking.
At this point, we filter down from 10K to roughly 1K items using
lightweight models, often gradient-boosted decision trees or fast neural networks.
We are still optimizing for recall and diversity here, but the
computation needs to be fast.
In the L2 and L3 layers, we do the heavy lifting.
These stages use complex models like deep neural networks or transformers to
evaluate each of those 1K candidates in depth and select the top 50 or so.
The key metric here is NDCG, which reflects ranking quality and relevance.
Finally, there is the re-ranking stage, where we apply business rules, for
example, to increase category diversity, boost new items, or ensure legal
and policy compliance.
What is powerful about this funnel is that it balances performance and efficiency.
We use simple models early for speed and heavy models later for
accuracy, so the system stays fast but delivers high-quality results.
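As a rough sketch of this funnel, here is a minimal Python illustration, where `cheap_score` and `heavy_score` are hypothetical stand-ins for the lightweight pre-ranking model and the heavy L2/L3 model:

```python
import random

def cheap_score(item):
    # Stand-in for a lightweight pre-ranking model (e.g., a small GBDT).
    return item % 97

def heavy_score(item):
    # Stand-in for an expensive L2/L3 model (e.g., a transformer).
    return (item * 31) % 1009

def funnel(catalog, k_candidates=10_000, k_prerank=1_000, k_final=50):
    # Stage 1: candidate generation -- cast a wide net (high recall).
    candidates = random.sample(catalog, min(k_candidates, len(catalog)))
    # Stage 2: pre-ranking -- cheap model narrows ~10K down to ~1K.
    prerank = sorted(candidates, key=cheap_score, reverse=True)[:k_prerank]
    # Stage 3: heavy ranking -- expensive model picks the top ~50.
    return sorted(prerank, key=heavy_score, reverse=True)[:k_final]

results = funnel(list(range(1_000_000)))
```

Each stage applies a progressively more expensive scorer to a progressively smaller set, which is exactly why the overall system stays fast.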
Once we have understood the query and user profile, the first step in the
pipeline is candidate generation.
The goal here is to select a small, manageable set, typically
thousands of items, from a catalog that may contain millions,
billions, or sometimes even trillions of items.
This step needs to be fast, provide high recall, and maintain diversity.
There are several approaches to do this efficiently.
First, we have ANN, or approximate nearest neighbor search, which finds
the closest items in embedding space.
Popular libraries include Faiss from Meta and ScaNN from Google, with fast
algorithms like HNSW, which is the most popular.
ANN relies on good embeddings, which we can get from collaborative filtering,
like ALS (alternating least squares), which is fast but struggles with cold
starts, and from content-based models such as two-tower architectures based
on DSSM or BERT.
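To make the idea concrete, here is a minimal sketch of retrieval in embedding space using exact cosine similarity over toy vectors; real systems replace this brute-force scan with an ANN index such as Faiss with HNSW:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(user_vec, items, k=2):
    # Exact nearest-neighbor search; ANN libraries approximate this
    # scan so it stays fast even over billions of items.
    ranked = sorted(items, key=lambda pair: cosine(user_vec, pair[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

items = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
top = retrieve([1.0, 0.0], items)
```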
Another method is random walk, which discovers related items
by walking through a graph.
It's good for diversity and cold-start problems,
though a bit outdated in modern systems. In search-based systems,
we often use an inverted index, which maps query terms to documents.
This scales extremely well to trillions of items thanks to techniques like term
sharding and BM25-based scoring. Finally, don't underestimate heuristics.
Simple rule-based filters like popularity, recency, or user
subscriptions are still widely used, especially when latency matters.
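As an illustration of the inverted-index approach, here is a toy BM25 scorer over a hypothetical three-document corpus (the `k1` and `b` defaults are conventional values, not from the talk):

```python
import math
from collections import Counter

docs = {
    "d1": "red running shoes".split(),
    "d2": "red dress".split(),
    "d3": "blue running jacket".split(),
}
N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N

# Inverted index: term -> set of documents containing it.
index = {}
for doc_id, terms in docs.items():
    for t in terms:
        index.setdefault(t, set()).add(doc_id)

def bm25(query, k1=1.5, b=0.75):
    # Score only documents that share at least one term with the query.
    scores = Counter()
    for t in query.split():
        postings = index.get(t, set())
        if not postings:
            continue
        idf = math.log(1 + (N - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id in postings:
            tf = docs[doc_id].count(t)
            dl = len(docs[doc_id])
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return [doc_id for doc_id, _ in scores.most_common()]
```

Because only the posting lists of the query terms are touched, this style of retrieval shards well across huge corpora.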
The takeaway is that there is no one-size-fits-all solution.
Most production systems combine multiple methods, ANN, heuristics, and inverted
indexes, to balance performance, coverage, and latency.
Once we have generated a list of candidate items, the next step
is to rank them, to decide what to show and in what order.
The goal of ranking is to sort the candidates by relevance and other
objectives like diversity or conversion potential.
There are two broad types of optimization goals. Short-term metrics
like NDCG, MRR, or precision at k
are widely used in industry and are based on offline judgments
and historical user behavior.
And there are long-term metrics, which are still an active research area.
This includes approaches like reinforcement learning and sequence
models that aim to optimize for lifetime value or sustained engagement.
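The offline metrics mentioned here, precision at k, MRR, and NDCG, can be sketched in a few lines of plain Python over a list of graded relevance labels in ranked order:

```python
import math

def precision_at_k(relevances, k):
    # Fraction of the top-k results that are relevant (label > 0).
    return sum(1 for r in relevances[:k] if r > 0) / k

def mrr(relevances):
    # Reciprocal rank of the first relevant result.
    for rank, r in enumerate(relevances, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

def dcg(relevances, k):
    # Discounted cumulative gain: relevance discounted by log2 of position.
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    # DCG of this ranking divided by DCG of the ideal (sorted) ranking.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```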
Let's look at the main approaches used. Gradient-boosted decision
tree models such as XGBoost or LightGBM are very popular in production.
They're fast and interpretable and work well on structured features.
They are often used as the final ranking layer.
On the other hand, deep neural networks like DSSM, BERT, and other transformer-
based models can model more complex interactions and are commonly used in
upstream stages like ANN, or as feature generators for gradient boosting.
Of course ranking isn't easy.
It comes with several challenges and biases.
Biases such as label and position bias come from implicit feedback like clicks.
There is the cold-start problem, when we lack history for new users or items,
the classic trade-off between relevance and diversity, and another trade-off,
exploration versus exploitation.
And explainability is also a concern: gradient boosting is easier to explain,
while deep neural networks often behave like black boxes. Privacy
is becoming more important, and many teams are exploring federated
learning to train models without centralizing sensitive data.
And finally, latency: deep models can be heavy,
so we need techniques like distillation or caching to keep the system responsive.
In practice, teams often combine these methods, using neural networks
for representation and gradient boosting for final scoring, to
get the best of both worlds.
This slide shows a typical architecture used for ranking in
modern recommendation systems.
We start on the left with input features.
These include user features like demographics or behavior signals;
item features like price, popularity, or category;
interaction features, for example, how often the user has interacted with
similar items; contextual signals like time of day or device type;
and often a score from a lightweight two-tower neural network
that helps pre-rank items.
These features are then passed to a set of gradient-boosted decision tree models.
One model estimates the probability of relevance.
Another predicts the probability that the user will
click if the item is shown.
And a third estimates the conversion rate if the user clicks.
These scores can be combined into a final ranking score depending
on business goals like engagement or revenue.
In more advanced pipelines,
we may also use a multitask deep neural network, shown here at the bottom.
This DNN is trained to predict the same targets, relevance, clicks, and
conversions, but it can model more complex relationships in the input.
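Here is a minimal sketch of how such per-head scores might be blended; the weight and the value per conversion are hypothetical knobs, not numbers from the talk:

```python
def final_score(p_rel, p_click, p_convert, w_rel=1.0, value_per_conversion=5.0):
    # Hypothetical blend: relevance plus the expected value of showing the item.
    # P(conversion | shown) = P(click | shown) * P(convert | click).
    expected_value = p_click * p_convert * value_per_conversion
    return w_rel * p_rel + expected_value

# Rank candidates by the blended score: (p_rel, p_click, p_convert) per item.
candidates = {
    "item_a": (0.9, 0.10, 0.02),
    "item_b": (0.6, 0.30, 0.20),
}
ranked = sorted(candidates, key=lambda i: final_score(*candidates[i]), reverse=True)
```

Tuning the blend weights is how the same model heads can be steered toward engagement or toward revenue.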
Finally, once we have the scores, we apply the re-ranking logic.
This includes promoting diversity to avoid showing too much of the same
content, introducing exploration to test new or less-known items, and
applying business rules, for example, to boost sponsored content or ensure
policy compliance in production.
This architecture is often hybrid.
The DNN might be used offline or upstream, while
gradient boosting remains the final model for online serving due to
its speed and interpretability.
Let's now compare the two most common approaches to ranking, gradient-
boosted decision trees and deep neural networks, starting with
gradient boosting on the left.
These models are widely used in production, especially
for the final ranking stage.
They work very well on structured data, are easy to interpret, and
support fast training and A/B testing.
They are also great for small and medium datasets and have low latency, making
them ideal for high-throughput environments.
But gradient boosting also has its limitations.
It struggles to model complex interactions, doesn't handle sequences or
multimodal data well, and doesn't scale as effectively with large datasets
compared to neural models.
Now moving to deep neural networks on the right side: these are seeing
increased adoption in both research and industry.
DNNs can capture both explicit and implicit feature interactions,
handle embeddings for high-cardinality features, and enable end-to-end training
on sequences like user history. They also support multitask and transfer
learning and scale better with more data.
However, deep networks also bring their own set of challenges.
They are computationally expensive, harder to debug, and less interpretable.
They can suffer from issues like bias, fairness, and even
hallucinations, and they tend to be sensitive to noise or
small changes in input.
Because of these trade-offs, many real-world systems adopt
a hybrid approach, using neural networks upstream for candidate
generation or feature extraction and gradient boosting for final ranking,
to balance performance and efficiency.
When building large-scale neural networks for recommendations,
performance is important, but scalability and efficiency are critical.
This slide outlines some of the key design principles that help us get both.
First, we have late fusion and bi-encoder architectures.
Instead of scoring user-item pairs in real time with a single tower,
we split them into a user tower, which runs online, and an item tower,
which we precompute offline.
This can lead to over 100x speed-up while preserving more than 80%
of the total profit in some systems.
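A minimal sketch of the late-fusion idea, with toy precomputed item vectors and a hypothetical user tower that simply averages the embeddings of past interactions:

```python
# Offline: the item tower precomputes one vector per item (toy values here).
ITEM_VECS = {
    "item_1": [0.2, 0.9],
    "item_2": [0.8, 0.1],
}

def user_tower(history):
    # Online: hypothetical user tower -- here just the mean of the
    # embeddings of items the user has interacted with.
    vecs = [ITEM_VECS[i] for i in history]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def score_all(history):
    u = user_tower(history)
    # Late fusion: a single dot product per item at serving time,
    # instead of running a joint user-item network for every pair.
    return {i: sum(a * b for a, b in zip(u, v)) for i, v in ITEM_VECS.items()}
```

Because item vectors never change per request, only the user tower and the dot products run online, which is where the speed-up comes from.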
Next is contrastive learning.
This trains the model to distinguish between positive
and hard negative examples using losses like InfoNCE.
The goal is to produce embeddings that are compatible
with fast dot-product retrieval.
In some cases, teams report up to a 100% uplift in retrieval performance.
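A minimal sketch of the InfoNCE loss in plain Python, scoring one positive against a set of negatives (the temperature value is a common default, not from the talk):

```python
import math

def info_nce(user_vec, pos_vec, neg_vecs, temperature=0.1):
    # InfoNCE: negative log-softmax of the positive's similarity
    # against the similarities of the negatives.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(user_vec, pos_vec) / temperature]
    logits += [dot(user_vec, n) / temperature for n in neg_vecs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_sum)
```

The loss is small when the user and positive embeddings align and the negatives don't, which is exactly the geometry dot-product retrieval needs.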
A common challenge at scale is embedding size, so we apply
compression techniques like hashing, quantization, or distillation to
keep memory and latency in check.
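The hashing-trick variant of embedding compression can be sketched like this, with hypothetical table sizes: raw IDs are hashed into a fixed number of buckets, so memory stays bounded regardless of catalog size:

```python
import hashlib
import random

NUM_BUCKETS = 1_000  # compressed table size (hypothetical)
EMBED_DIM = 4

random.seed(0)
# One shared table of NUM_BUCKETS rows instead of one row per raw ID.
table = [[random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
         for _ in range(NUM_BUCKETS)]

def embed(item_id: str):
    # Hash the raw ID into a bucket; colliding IDs share a row,
    # trading a little accuracy for a bounded memory footprint.
    bucket = int(hashlib.md5(item_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return table[bucket]
```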
Another critical aspect is bias correction, including popularity bias,
position bias, and feedback loops.
One trick here is to add a context-aware tower during training,
but drop it at inference time.
We also apply hard negative mining to make training more meaningful:
instead of using random negatives, we sample negatives that are hard
for the model to distinguish.
This makes the learning signal much stronger.
Then there is multi-signal learning.
Our models often consume inputs from multiple modalities, text, images,
or tabular data, and from different domains like search queries,
watch history, or cart activity.
Finally, sequential modeling is crucial.
We use transformer-based encoders to capture the temporal aspects
of user behavior, using recent events to predict what comes next.
Together, these principles make it possible to run deep learning at massive
scale in production recommendation systems without sacrificing speed or quality.
Let's now talk about why scaling a recommender system is much harder
than scaling LLMs or vision models.
On the left side, we see that LLMs and computer vision
models scale extremely well.
They take in long sequences like text or pixels, benefit from dense labels
and strong supervision, and have powerful pretraining tasks like next-token
prediction. They use deep transformer architectures and are often latency
tolerant; we don't mind waiting a few seconds for a response. In these domains,
scaling almost always improves quality, and scaling laws are well established.
Now contrast that with recommender systems.
On the right, we deal with massive embedding tables:
billions of user and item IDs.
Our models are often shallow multi-layer perceptrons, and they
must respond in milliseconds,
so we are under heavy latency constraints.
Our data is based on short, sparse behavior sequences,
just a few clicks or skips.
Feedback is implicit, and we don't have a universal self-supervised
training task like LLMs do.
Also no clear scaling law exists here.
As we grow models, we quickly run into bias, noise, and diminishing returns.
The key takeaway is in the box at the bottom: recommender systems
hit unique scaling limits, including latency, implicit feedback, massive
embeddings, and domain-specific biases.
These can't just be solved by making models deeper or wider.
We need domain-specific innovations.
Let's wrap up with some of the most important trends in neural
network recommender systems today.
First, we are seeing a strong shift from traditional gradient-
boosted models to deep learning, what we call neural rankers.
Companies like YouTube, Meta, and Alibaba use these models in
production to power personalized recommendations at scale.
Multi-stage pipelines, with bi-encoders for fast retrieval and deep models for
final scoring, are now fairly standard. While not new, they remain a reliable
way to balance quality and latency at scale.
We are also seeing LLMs being used in novel ways, not just for generating text,
but to interpret embedding spaces and even answer queries based on vectors alone.
In terms of architecture, transformer innovations like Performer from
DeepMind are pushing the envelope for recommendation-specific
models. When it comes to scaling,
several papers have shown that scaling laws apply to recommendation embeddings
and sequence models just like they do in NLP: Meta's 2024 research on trillion-
parameter transducers and the Wukong project aimed to define scaling laws
specific to recommendations. Another growing trend is sequence modeling.
We are moving beyond simple next-item prediction toward modeling
richer, multimodal, time-aware user behavior.
In the area of graph neural networks, models are being used to better capture
user-item relationships, especially in the long tail.
These use inductive or transductive approaches depending on whether the
graph structure is fixed or dynamic.
Finally, reinforcement learning continues to gain traction.
It's used to optimize long-term goals like retention or LTV.
Reinforcement learning also helps with exploration, which is key to avoiding
feedback loops and filter bubbles.
Overall, these trends show that recommendation systems are evolving rapidly,
with a growing focus on scalability, long-term value, and deep user modeling.
That brings us to the end of the presentation.
Thank you so much for your attention.
I've covered the fundamentals of machine learning-powered search and
recommendation systems, explored key architectural components, looked at
challenges in scaling, and discussed emerging trends in this space.
I hope this gave you a useful foundation and maybe even sparked
ideas you'd like to explore further.
If you have any questions or want to continue the conversation,
I'd be happy to chat.
Thanks again.