Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I am PR Singh.
Today I'm going to take you inside how spam detection is handled at truly massive scale.
We are talking about billions of users, millions of decisions every second, and
everything has to happen in real time.
Now when we talk about spam, it's not just annoying messages.
We are talking about fraud, phishing, scams, harmful content, et cetera,
things that directly affect user trust and platform integrity.
And here is the hard part: at this scale, even a tiny error, say a tenth of a percent,
means millions of people are impacted.
And that's why we need MLOps that is not just clever, but resilient,
privacy-preserving, and battle-tested.
So let me walk you through how we are going to explore this challenge today.
So here is the roadmap.
We'll start by looking at the unique challenges of working
at billion-user scale.
Then I will show you the end-to-end MLOps architecture, how large
scale systems design pipelines to keep up with constant change.
We'll also dive into data strategies, because what data you choose,
either to label or to train on, is absolutely critical.
Then we'll look at the production side: scaling,
monitoring, and incident response.
Finally, we'll end by looking ahead, what spam looks like in
the future, and the technologies we will need to stay ahead of it.
So think of this talk as a blueprint, not just theory, but patterns you
can adapt to your own systems.
So let's start at the beginning:
the sheer challenge of operating at this scale.
Every day, billions of pieces of content move across platforms.
At peak load, large scale systems are making millions of evaluations per second.
And for user-facing decisions like blocking a comment or
flagging a post, the response must be under a hundred milliseconds.
Now, think about the consequences of error at this scale.
A 0.1% false positive rate does not sound terrible in theory,
but in practice that could mean millions of legitimate users
waking up to find their posts blocked or their comments deleted.
And on the other side, a false negative means spam or scams getting through
to millions of people and harming them.
The constraints are equally tough.
Large scale systems are detecting across text, images, videos,
even behavioral patterns.
They have to support hundreds of languages as well.
They have to constantly adapt, because attackers change tactics every day,
and all of this must comply with strict privacy laws like GDPR and CCPA.
It's like playing chess with millions of opponents, and they all
change their strategies overnight.
So meeting challenges like this requires more than just good models.
It requires an end-to-end MLOps architecture, and that is what we
are going to talk about next.
Here's what that looks like.
First, ingest data from multiple sources, which includes real-time
streams and batch histories.
Then comes feature engineering, where they extract both instant signals
like device fingerprinting and long-term signals like account history.
Models are retrained continuously, often daily, but training is not enough.
You have to validate rigorously against golden data sets,
fairness checks, and adversarial examples.
And when a deployment happens, they don't just flip a switch.
They use shadow deployments, where new models run in parallel
silently for days, weeks, or even months.
And then come canary rollouts, which gradually shift traffic,
with instant rollback if needed.
This is not just a pipeline, it's a living system.
It adapts, learns and recovers without stopping.
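To make the shadow deployment idea concrete, here is a minimal Python sketch. The two model functions are hypothetical stand-ins for a production model and a candidate, not a real API; only the live model's verdict ever reaches the user, while disagreements are logged for offline analysis.

```python
# Hypothetical stand-ins: any callable returning a spam probability works.
def current_model(content: str) -> float:
    return 0.9 if "free money" in content else 0.1

def candidate_model(content: str) -> float:
    return 0.95 if "free" in content else 0.05

THRESHOLD = 0.5
disagreements = []  # collected for offline analysis of the candidate

def score_with_shadow(content: str) -> bool:
    """Serve the current model's decision; run the candidate silently."""
    live = current_model(content) >= THRESHOLD
    shadow = candidate_model(content) >= THRESHOLD
    if live != shadow:
        # The user never sees the shadow result; we only record the diff.
        disagreements.append(content)
    return live  # only the live model affects users

decision = score_with_shadow("free crypto giveaway")
```

Here the candidate flags "free crypto giveaway" while the live model lets it through, so the item lands in the disagreement log without changing what the user sees.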
But what makes these models powerful is not the pipeline alone, but the features we
build from multimodal data.
So let's talk about multimodal feature engineering.
Spam today does not come in just one form.
Attackers use text, images, videos, and even behavioral manipulation.
For text, large scale systems use transformer embeddings like BERT, n-grams,
entity recognition, sentiment and intent classification, et cetera.
For images and video, you can rely on CNNs like EfficientNet and ResNet.
OCR is very common for memes, along with frame-level signal analysis.
But content is only half the story.
Behavior is often the giveaway.
Things like account age, posting frequency, engagement metrics, or
unusual device fingerprints can be just as revealing as the content itself.
Finally, context matters the most.
For example, a post at 2:00 AM from a new device,
linked to trending spam topics, tells a different story than the same
post from a long-standing account.
When you put all these signals together, you catch patterns
no single dimension can reveal.
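As a rough illustration of "putting the signals together", here is a toy sketch that assembles one feature vector from several modalities. The individual extractors are placeholders I've invented; in a real system they would be transformer embeddings, CNN outputs, and behavioral aggregates.

```python
# Placeholder extractors -- real systems use learned embeddings instead.
def text_features(text: str) -> list[float]:
    return [len(text) / 100.0, float("http" in text)]

def behavior_features(posts_last_hour: int, account_age_days: int) -> list[float]:
    return [posts_last_hour / 10.0, min(account_age_days / 365.0, 1.0)]

def context_features(hour_of_day: int, new_device: bool) -> list[float]:
    # Posting between midnight and 5 AM is treated as an off-hours signal.
    return [float(0 <= hour_of_day < 5), float(new_device)]

def build_feature_vector(text, posts_last_hour, account_age_days, hour, new_device):
    # Concatenation is the simplest fusion; cross-attention comes later.
    return (text_features(text)
            + behavior_features(posts_last_hour, account_age_days)
            + context_features(hour, new_device))

# The 2 AM, new-device, link-heavy example from above:
vec = build_feature_vector("buy now http://spam.example", 40, 2, 2, True)
```

Even in this toy form you can see the point: no single slice of `vec` looks damning, but together the off-hours posting, the fresh account, and the link add up.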
Now let's look at how we bring those signals together in the model itself,
something we call an ensemble model architecture.
Large scale systems don't rely on one big model.
Instead, they use a hierarchical ensemble.
First come fast pre-filters: lightweight models that
can process huge volume quickly.
They're designed for high recall, catching anything even slightly suspicious.
Then we bring in specialist models: deeper, domain-specific
networks for text, images, videos, and behavior.
And then comes the multimodal fusion, where they combine
the signals using cross-attention.
This is where we find hidden correlations, like when the text looks fine but,
paired with image features, it becomes clearly spam or harm.
Finally, a meta learner integrates everything weighing outputs
based on historical reliability.
The result is speed when we need it, and depth when it matters.
It's the difference between quickly scanning luggage at an airport
versus pulling aside a suspicious bag for deeper inspection.
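The airport analogy maps directly onto a two-stage cascade: a cheap, high-recall pre-filter that lets most traffic exit early, and a slower specialist that only runs on the suspicious fraction. The word list, thresholds, and scoring functions below are illustrative assumptions, not a real model.

```python
# Illustrative spam lexicon for the cheap first stage.
SPAM_WORDS = {"winner", "prize", "crypto", "free"}

def fast_prefilter(text: str) -> float:
    """Cheap lexical score; the gate is set low so little spam slips past."""
    words = set(text.lower().split())
    return len(words & SPAM_WORDS) / max(len(words), 1)

def specialist(text: str) -> float:
    """Stand-in for a deep domain model; only runs on flagged items."""
    return min(fast_prefilter(text) * 2.0, 1.0)

def classify(text: str) -> str:
    if fast_prefilter(text) < 0.1:   # high-recall gate: most traffic exits here
        return "clean"
    return "spam" if specialist(text) >= 0.5 else "review"
```

The design point is cost: the specialist is only paid for on the small slice of traffic the pre-filter could not clear, which is what makes depth affordable at millions of evaluations per second.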
And to serve these models in production, we need specialized infrastructure.
So let's talk about distributed serving architecture.
Serving at this scale is a systems challenge as much as an ML one.
Large scale systems use context-aware load balancing to send
requests to the right resources.
Horizontal autoscaling helps handle sudden traffic spikes.
A low-latency feature store ensures models get what they need without
recomputing, and specialized hardware like GPUs and TPUs accelerates inference.
Performance optimization is constant.
Model quantization makes networks smaller and faster.
Batching lets us process multiple requests together.
Prioritization ensures critical cases are not stuck in the queue, and
circuit breakers prevent cascading failures if something goes wrong.
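Here is a minimal circuit-breaker sketch: after N consecutive failures the breaker opens and callers get a fast fallback instead of piling onto a failing dependency. Real implementations also add timers to half-open the circuit and probe for recovery; that part is omitted here.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn, fallback):
        if self.open:
            return fallback          # fail fast, protect the dependency
        try:
            result = fn()
            self.failures = 0        # success resets the count
            return result
        except Exception:
            self.failures += 1
            return fallback

breaker = CircuitBreaker(failure_threshold=2)

def flaky_model():
    # Stand-in for an inference backend that is currently down.
    raise TimeoutError("inference backend is down")

# A sensible spam-pipeline fallback: allow content, flag for async review.
verdict = breaker.call(flaky_model, fallback="allow_and_review")
```

Note the fallback choice is itself a policy decision: failing open (allow and review later) keeps users unblocked, while failing closed would block content when the model is unavailable.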
All of this allows us to consistently hit 80 millisecond end-to-end
latency even under extreme load.
Now, 80 milliseconds is just a figure; it could be anything,
whatever matters to your organization.
But even the best infrastructure needs smart data
strategies to keep the models learning.
So let's talk about smart sampling and learning strategies.
Labeling billions of samples is not possible.
It's simply not possible.
The question is which samples matter most.
So large scale systems use uncertainty sampling,
sending low-confidence cases to human review.
Diversity sampling makes sure we don't miss underrepresented
languages or formats.
Adversarial sampling deliberately seeks out the cases where models fail,
making the system even stronger.
And time-sensitive sampling ensures we catch new attack
vectors as they emerge.
Together, these strategies drastically reduce labeling cost while
keeping the models robust.
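The first of those strategies, uncertainty sampling, is simple enough to sketch: send the items whose spam probability is closest to the decision boundary (0.5) to human labelers. The scores below are made-up examples, not output from a real model.

```python
def uncertainty(prob: float) -> float:
    """0 for confident predictions, 0.5 at maximum uncertainty."""
    return 0.5 - abs(prob - 0.5)

def select_for_labeling(scored_items, budget: int):
    """Pick the `budget` most uncertain (item, probability) pairs."""
    ranked = sorted(scored_items, key=lambda it: uncertainty(it[1]), reverse=True)
    return [item for item, _ in ranked[:budget]]

# Fabricated model scores: a is confidently spam, c confidently clean,
# b and d sit near the boundary and are worth a human label.
scored = [("post_a", 0.98), ("post_b", 0.52), ("post_c", 0.07), ("post_d", 0.45)]
to_label = select_for_labeling(scored, budget=2)
```

The confident predictions (0.98, 0.07) are skipped; the labeling budget is spent where a human answer changes the model the most.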
One example I can think of is when attackers started hiding spam in memes.
Adversarial sampling surfaced these cases early, letting us retrain
before it spread widely.
But not all spam comes from individuals.
Often it's coordinated campaigns.
So let's discuss a little bit about coordinated attack detection.
This is where things get interesting.
A single post may look normal, but when thousands of accounts act in
unison, that's a coordinated attack.
Large scale systems use real time streaming analysis to spot
anomalies in activity patterns.
They apply graph neural networks to model user-content interactions as dynamic
graphs, exposing suspicious clusters.
And they deploy countermeasures in real time, like throttling activity,
adding friction with CAPTCHAs, or flagging suspicious clusters for further review.
One case we can discuss here is 10,000 accounts liking
a post in under one minute.
Individually, nothing stood out, but when viewed as a graph,
the coordination becomes obvious.
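A toy version of that burst detector: flag a post when too many distinct accounts act on it inside a short window. Real systems use streaming engines and graph neural networks; the events, window, and threshold below are fabricated for illustration (think thousands of accounts, not three).

```python
from collections import defaultdict

WINDOW_SECONDS = 60
BURST_THRESHOLD = 3  # tiny for the example; thousands in production

def find_bursts(events):
    """events: list of (account_id, post_id, timestamp_seconds)."""
    by_post = defaultdict(list)
    for account, post, ts in events:
        by_post[post].append((ts, account))
    flagged = set()
    for post, acts in by_post.items():
        acts.sort()
        for ts, _ in acts:
            # Distinct accounts acting on this post within one window.
            in_window = {a for t, a in acts if ts <= t < ts + WINDOW_SECONDS}
            if len(in_window) >= BURST_THRESHOLD:
                flagged.add(post)
                break
    return flagged

events = [("u1", "p9", 0), ("u2", "p9", 10), ("u3", "p9", 30),
          ("u4", "p7", 0), ("u4", "p7", 500)]
suspicious = find_bursts(events)
```

Note that `p7` is untouched even though it received two events: one account acting twice is normal behavior, while three distinct accounts inside a minute on `p9` is the unison pattern we care about.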
Now while large scale systems fight these attacks, they
also have to protect user privacy.
Everything I've described so far has to be done under strict privacy laws like
GDPR and CCPA. Large scale systems use federated learning to train across
decentralized data without centralizing raw content.
Differential privacy ensures individuals can't be re-identified.
The systems also employ advanced cryptographic techniques like homomorphic
encryption for secure inference, multi-party computation to distribute
trust, and zero-knowledge proofs for compliance without exposing raw data.
The key point here: privacy and performance are not opposites.
With the right techniques, you can have both.
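To give a flavor of one of these techniques, here is a sketch of differential privacy applied to an aggregate: releasing a spam count with Laplace noise calibrated to sensitivity 1 and a chosen epsilon. This illustrates the mechanism only; production systems track a full privacy budget across queries, and the counts here are made up.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count: int, epsilon: float) -> float:
    """Counting queries have sensitivity 1, so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)  # fixed seed so the example is reproducible
noisy = private_count(10_000, epsilon=0.5)
```

With epsilon 0.5 the noise scale is 2, so the released count is within a few units of the truth, yet no individual's presence in the data can be inferred from it, which is exactly the privacy-versus-utility trade the epsilon parameter controls.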
Now let's look at how to keep these models fresh and reliable in production.
The training cycle is continuous.
Large scale systems retrain models daily on fresh data, using
Bayesian optimization for hyperparameters.
Then they evaluate against golden data sets, fairness slices, and adversarial examples.
New models run in shadow deployment first, silently scoring real
traffic without affecting users.
Once stable, they move into progressive rollout, which means
canary tests, gradual traffic shifts,
automated metric analysis, and instant rollback if something slips.
This rhythm ensures the models
keep pace with evolving spam tactics without ever destabilizing the system.
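An automated canary gate can be sketched in a few lines: promote only if the canary's precision and latency stay within tolerance of the baseline. The metric names, values, and tolerances here are illustrative assumptions, not a standard.

```python
def canary_decision(baseline: dict, canary: dict,
                    max_precision_drop: float = 0.005,
                    max_latency_increase_ms: float = 10.0) -> str:
    """Compare canary metrics against the baseline and decide the rollout."""
    if baseline["precision"] - canary["precision"] > max_precision_drop:
        return "rollback"
    if canary["p99_latency_ms"] - baseline["p99_latency_ms"] > max_latency_increase_ms:
        return "rollback"
    return "promote"

baseline    = {"precision": 0.992, "p99_latency_ms": 78.0}
good_canary = {"precision": 0.991, "p99_latency_ms": 80.0}
bad_canary  = {"precision": 0.970, "p99_latency_ms": 79.0}
```

The gate is deliberately asymmetric: a tiny precision dip is tolerated because metrics are noisy, but a drop past the threshold triggers the instant rollback rather than waiting for a human.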
But deployment is only half the story.
Monitoring keeps everything healthy.
Usually large scale systems monitor at four levels.
Model performance, which basically covers precision, recall, and drift detection.
System performance is another, which covers request latency, queue depth,
and resource utilization.
Data drift is the third, which includes feature distributions,
schema validations, and statistical tests.
And business impact, of course, like user engagement metrics, false
positive appeals, and overall platform health.
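For the data-drift level, one common approach is the Population Stability Index (PSI) over pre-bucketed feature distributions. The bucket proportions below are fabricated, and the 0.2 alert threshold is a widely used rule of thumb rather than a standard.

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Compare two distributions given as per-bucket proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # guard against log(0)
        total += (a - e) * math.log(a / e)
    return total

# Fabricated bucket proportions for one feature.
training_dist = [0.25, 0.25, 0.25, 0.25]
live_stable   = [0.24, 0.26, 0.25, 0.25]
live_shifted  = [0.05, 0.10, 0.25, 0.60]

drifted = psi(training_dist, live_shifted) > 0.2  # 0.2 = common alert threshold
```

The stable distribution scores near zero while the shifted one blows well past the threshold, which is the kind of signal that would page an on-call before users notice degraded decisions.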
The idea is to catch problems before users do.
Whether it's a model drifting or a sudden traffic spike,
we want to detect it early and respond fast.
And when issues do happen, we rely on a structured incident response.
So let's talk about anomaly detection and incident response.
We know incidents are inevitable;
resilience comes from how you respond.
Large scale systems use automated anomaly detection, which involves
outlier detection, multivariate analysis, and baseline comparisons.
If something is off, the systems follow structured playbooks:
severity-based escalation, automated rollbacks,
shadow mode for investigations, and configurable safety thresholds.
And after every major incident, they run post-mortems,
not just to fix, but to learn and strengthen the system.
This structured discipline turns outages into opportunities for resilience.
Now let's look forward at what's next in this arms race.
The arms race is not slowing down.
First, synthetic content.
Let's talk about that.
Attackers are already experimenting with AI generated spam, deep fakes,
synthetic text that slips past filters.
Second, cross-platform coordination is another thing.
Spam campaigns don't stay confined;
they spread across apps and platforms.
Collaboration while preserving privacy will be the key here.
Third, let's talk about adversarial robustness.
We must strengthen defenses against intentional manipulation of models.
Finally, explainable AI. Enforcement actions must be transparent,
both for users and regulators.
We need models that not only predict, but also explain why they took a certain action.
The future will demand even faster adaptation.
This is not a one-time problem, but a continuous race.
So let's wrap this up with the key takeaways.
Here are four key principles I want you to remember.
Spam detection must be multimodal, meaning you have to take care of
text, images, videos, and behavior together.
MLOps at billion-user scale requires specialized architecture;
that is something we have to be aware of.
Privacy and performance can coexist with the right techniques.
That's very important.
Continuous adaptation is non-negotiable.
You must know this is an evolving arms race.
If you keep these principles in mind, you'll be better prepared to design
systems that don't just work in research, but that scale in the real world.
And with that, let me thank you.
Thank you for listening.
I hope this gave you a clear look into what it takes to build robust, scalable,
privacy-preserving MLOps systems for spam detection at global scale.
This fight against spam never ends, but with the right
architecture, we can stay ahead.
Thank you.