Transcript
Hi everyone.
Welcome to the Conf42 Machine Learning Conference.
Hope you're having a good day.
My name is S Kati and I've actually been working as a senior software
engineer at Meta for about a year now.
Before that I used to work at Google for about four years and prior to that,
Walmart Labs and prior to that in a startup for about another five years.
Today we are going to talk about something that I've been working on for quite some time.
And I think it'll give you a great deal of insight on how to make your decisions when it comes to building analytics pipelines.
On that note, we are basically going to do a deep dive into the fascinating
world of real time machine learning.
We'll explore how to build analytics pipelines that deliver insights in less than a second, given the enormous volume of data generated today, which is around 2.5 quintillion bytes each day. That is not a trivial amount.
It's quite enormous, right?
Traditional methods simply aren't keeping up.
We'll unpack why speed matters, look at the challenges involved, and discuss how industries are gaining major advantages from real-time ML.
Now let's actually talk about the data explosion challenge.
First, let's understand the sheer scale of the data problem we face every day.
Like literally as we said, we produce about 2.5 quintillion bytes of data,
which is even hard to visualize.
So to put it into perspective: if we were to stack Blu-ray discs containing this data, they'd reach all the way to the moon.
Again, it's almost impossible to visualize.
Around 75% of the companies now rely on machine learning applications
that need immediate responses.
So for these businesses, waiting isn't an option.
Real-time analytics provides insights 35 percent faster and boosts efficiency by almost 42%.
Imagine how crucial that speed is for decision making during live events, medical diagnostics, and financial transactions.
Of course, you will see many businesses which still rely on batch processing.
We'll come to what batch processing is versus what stream processing is.
But the thing I want you to keep in perspective is that every business has its own needs.
For some businesses, it's very important to process data in real time.
Take financial transactions, for example: real-time insights are the ones which matter.
If a fraud occurred, let's say, two hours ago but you're only detecting it right now, it wouldn't make much sense.
We'll put things into perspective as we go through the slides, but I just wanted to illustrate how businesses want to process data.
And of course, it differs from business to business.
Now, coming to batch versus stream processing: I want you to take a moment to understand what batch processing is versus what stream processing is.
Batch processing typically takes hours or days, and is used for things like monthly or weekly reports.
For example, let's say you are running an ad tech business and you want your business people to get more insight into how your ads are performing over a day, the past week, the past month, and so on.
You would rather stick with batch processing pipelines which run, say, every hour or every day, and that would definitely fit your needs.
That's totally fine.
You would not want to go for a very engineering-heavy stream processing pipeline.
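To make that concrete, here is a minimal sketch of what such a scheduled batch report could look like, assuming the raw ad events land in a file with campaign_id, event_time, impressions, and clicks columns; the file name and columns are purely illustrative, not from any specific pipeline.

```python
# Minimal sketch of a scheduled batch ad report. The input file and its
# columns (campaign_id, event_time, impressions, clicks) are assumptions
# made for this example, not a real pipeline's schema.
import pandas as pd

def build_daily_ad_report(path="ad_events.parquet"):
    events = pd.read_parquet(path)
    events["day"] = pd.to_datetime(events["event_time"]).dt.date
    report = (
        events.groupby(["day", "campaign_id"])
        .agg(impressions=("impressions", "sum"), clicks=("clicks", "sum"))
        .reset_index()
    )
    report["ctr"] = report["clicks"] / report["impressions"]
    return report  # a job like this could be scheduled hourly or daily

if __name__ == "__main__":
    print(build_daily_ad_report().head())
```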
But on the other note, think about what I have been discussing just now.
Think about fraud detection.
What if your bank took hours to identify suspicious activities?
It's not at all good.
So now, coming to that, there is an intermediate approach called micro-batching, which cuts the latency down to milliseconds, and which in my opinion is really good.
Micro-batching has been serving near-real-time needs for quite some time now.
But real-time streaming goes a step further: how do you cut that latency down to under 10 milliseconds?
I agree there are instances where you would want to go for micro-batching, but you definitely want to understand where your business fits, right?
So let's say you are on amazon.com and you are buying a couple of things, say diapers for your kids.
It takes a couple of minutes for you to go from the homepage to the product page, to the add-to-cart page, and then the checkout page.
Now let's say that you are an engineer at Amazon and you want to identify all the people who are buying diapers.
So you got an event into Amazon saying, hey, somebody bought diapers.
You would want to identify what items they were looking at prior to that, as in what the path of the user was, starting from the homepage, which is really important, right?
Otherwise, where would you want to put ads?
How would you categorize things which are usually bought together?
For that sort of thing, you want to analyze the path of the user.
To do that, you really want to use micro-batching: when the purchase event occurs, you go back a couple of windows, which is a couple of seconds or minutes, to identify what really went on.
That sort of thing is where micro-batching is really used, and it works really well.
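To illustrate the idea, here is a rough, framework-free sketch of a micro-batch loop: events are collected for a short interval, and when a checkout shows up, the recent windows are searched to reconstruct that user's path. The event fields (user_id, page, ts) and the window sizes are assumptions for the example; in practice, tools like Spark Structured Streaming give you this kind of micro-batch trigger out of the box.

```python
# Hand-rolled micro-batch sketch: process events in short intervals and
# look back over recent windows to reconstruct a buyer's path.
# Event fields (user_id, page, ts) are assumed for illustration.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 5        # micro-batch interval
LOOKBACK_WINDOWS = 60     # how many past windows to keep (~5 minutes here)

recent_windows = deque(maxlen=LOOKBACK_WINDOWS)

def process_micro_batch(batch):
    """Called once per interval with the events that arrived in it."""
    by_user = defaultdict(list)
    for window in recent_windows:
        for ev in window:
            by_user[ev["user_id"]].append(ev)
    for ev in batch:
        by_user[ev["user_id"]].append(ev)
        if ev["page"] == "checkout":
            path = [e["page"] for e in sorted(by_user[ev["user_id"]], key=lambda e: e["ts"])]
            print(f"user {ev['user_id']} bought after path: {path}")
    recent_windows.append(batch)

def run(event_source):
    """event_source is assumed to yield event dicts as they arrive."""
    batch, deadline = [], time.time() + WINDOW_SECONDS
    for ev in event_source:
        batch.append(ev)
        if time.time() >= deadline:   # close the window and process it
            process_micro_batch(batch)
            batch, deadline = [], time.time() + WINDOW_SECONDS
```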
But there are obviously some businesses which want to take it a step further, to real-time streaming, which, as I said, cuts the latency to under 10 milliseconds.
Now, this near-instant processing is essential in scenarios like live stock trading, real-time gaming, dynamic online content, and so on.
As you can see, many FinTech startups have come up, and FinTech scenarios have literally exploded after the AI boom.
Live stock trading is an excellent example where real-time streaming matters that much more.
You definitely want real-time data on how the stock is performing or how the options are being sold, and so on, to make an informed decision about what to do next.
That's where real-time streaming is required.
So I hope you have at least formed a mental picture of where batch, micro-batching, and stream processing sit in the overall analytics pipeline landscape.
As part of this talk, I want to dig deep into streaming analytics specifically.
Batch, I think, is easily understood, and micro-batching is a little harder to grasp.
But stick with me and I promise you'll get better insight.
So now, starting with financial services: real-time fraud detection is really important here.
Of course, as the internet exploded and, frankly, as AI is currently taking over the world, you can see more and more new types of frauds and scams popping up.
So it is much more important to do real-time fraud detection, so as to make sure that banking stays secure.
Banks that adopted real-time fraud detection have reduced their losses by about 27%; for a mid-sized bank, that translates to around 15 million dollars in savings each year, which is no small amount.
Traditional fraud detection mechanisms analyze data after the event has occurred, which, as we were just discussing, is often too late, because the transaction has already happened and it is now near impossible to reverse it.
But the real-time systems we are talking about flag unusual activities as they are occurring.
So you can actually stop the fraudulent activity before the event completes, which is what is of utmost importance.
For example, think about detecting multiple purchases in distant locations.
I've had the situation where my credit card got hacked, and I assume many people have gone through the same scenario.
While I myself live in California, I've had transactions go off in, say, Tennessee or Florida.
Looking at my recent transactions, the one which occurred literally an hour ago happened in California, but the next transaction, occurring right now, is happening in Tennessee, which is practically impossible.
So flagging that particular transaction as fraud would save me a lot of money.
These are the sorts of scenarios that matter.
So how do we detect multiple purchases in distant locations happening almost simultaneously?
That is something traditional batch methods would completely miss, right?
If you took a batch processing approach to this problem, you would see that in this particular hour a transaction happened in California, and in the next hour a transaction happened in Tennessee.
You would not necessarily get the whole picture in time; even if the batch process picked it up on an hourly basis, the transactions might have already gone through.
The person who hacked my credit card might have already made thousands of dollars of purchases, and I am the one responsible for paying for them, which doesn't really make sense.
So such real-time fraud detection mechanisms are very important, and they are becoming more important day by day.
Now coming to another use case: e-commerce.
I myself use Amazon and Walmart quite frequently, and what I have observed is that personalization for customers in these e-commerce businesses has gotten way better than it used to be.
E-commerce businesses have seen major benefits; real-time personalization can boost sales by up to 18%.
Picture yourself doing this: as you are browsing, real-time analytics immediately updates product recommendations based on your current session, inventory status, pricing strategies, and so on.
This dynamic approach can increase average order values by 12%.
As I was alluding to in my previous example, I have a kid, so the thing that usually pops into my mind is buying diapers and so on.
Take the example of buying diapers: running on less sleep for quite some time now with a kid, you sometimes forget, or might not even realize, what you really need to purchase.
So as I'm browsing diapers, if the real-time personalization kicks in, the analytics pipelines say, hey, we've seen that you're buying diapers now and it's been some time since you bought diaper cream or moisturizing lotion for your kid, why don't you add that to the cart?
Or, not prompting per se, but something like "people have also bought moisturizing cream," and so on.
That would trigger something in my brain saying, hey, I forgot to buy this, let me add it.
This sort of personalization is something I would love in any new product, because it is important for the tech to make our lives better.
Along with that, real-time systems prevent customers from facing out-of-stock items.
I've had this happen: you add a couple of items to a Target or Walmart shopping cart, and by the time you are just about to check out, or about to get it delivered, it says out of stock.
Such events are problematic for the user experience, and handling them in real time is much more important.
And these matter even more during peak shopping events like holidays or major sales.
For example, when I was working at Walmart, Thanksgiving was a major season, right?
You want to buy gifts for your loved ones and make sure your family's happy.
I've seen this happen many times where a particular sale pops up.
Let's say you want to buy an iPad for your family, and suddenly a sale pops up on Walmart: 20% off an iPad.
That's not an ordinary sale, right?
So many people, along with you, are flocking to walmart.com to buy that particular thing, and as such, it makes much more sense to update them in real time on what is going on with that sale event.
As in, are you a little too late? How many of the iPads are actually still left?
That way you can provide customers with a way better shopping experience.
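Here is a very small, hypothetical sketch of that kind of behavior: recommendations are drawn from items commonly bought together with what is in the session, re-checked against live stock before being shown, and a low-stock banner is surfaced during a hot sale. The catalog, stock counts, and pairings are all made up.

```python
# Hypothetical inventory-aware, session-aware recommendations.
# The catalog, stock counts, and "bought together" pairs are invented.
bought_together = {"diapers": ["diaper_cream", "baby_lotion", "wipes"]}
live_stock = {"diapers": 120, "diaper_cream": 0, "baby_lotion": 35, "wipes": 400, "ipad": 3}

def recommend(session_items, max_items=3):
    candidates = []
    for item in session_items:
        candidates.extend(bought_together.get(item, []))
    # drop anything that just went out of stock so the shopper never adds it
    in_stock = [c for c in candidates if live_stock.get(c, 0) > 0]
    return in_stock[:max_items]

def stock_banner(item):
    left = live_stock.get(item, 0)
    return f"Only {left} left!" if 0 < left <= 5 else ""

print(recommend(["diapers"]))   # ['baby_lotion', 'wipes']
print(stock_banner("ipad"))     # 'Only 3 left!'
```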
Now let's discuss a completely different side of the coin: manufacturing.
Factories have dramatically improved operations by shifting to real-time predictive maintenance.
Machines are now equipped with advanced sensors which continuously stream data: what is the temperature, what is the vibration, what is the operating speed of the machine?
Real-time machine learning models analyze this data instantly, and they can actually predict equipment failures before they happen.
You may ask, why is this important?
Manufacturing is not something you regularly see on a day-to-day basis, so why does it really matter?
Regular maintenance of machines is important so as to streamline operations and avoid unexpected delays.
For example, an off-the-cuff example, just thinking out loud: let's say you have ordered some toys for delivery, and the order has gone back to the manufacturing plant to be produced.
Again, this is a somewhat contrived scenario, but let's say the toy manufacturing has started, and somewhere down the line the machines failed.
Now the toy, which is supposed to be delivered for your kid's birthday, has had its delivery date pushed back by, say, four to five weeks.
That is not acceptable, right?
There is a particular reason why you picked that toy, and there is a particular timeline you have in mind, but because of a manufacturing delay, all of this happened.
So equipping machines so that they can stream this data, saying, hey, my current temperature is such and such, this is the pressure going through, and this is the number of toys built so far, would help us quickly understand what's going on.
We have seen that, say, the previous day this machine was building so many toys per minute, but now that has gone down drastically; what is really going on?
So you are able to identify the issue with the machine way before something really bad happens, and as such you are able to fix it so that your operations are far more streamlined.
As more and more machines come online and more automation kicks in, this becomes even more important.
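A back-of-the-envelope sketch of that idea, with invented sensor fields and thresholds: compare each machine's latest throughput and temperature against a rolling baseline of its own recent readings and raise an alert when they drift too far.

```python
# Rolling-baseline anomaly check for machine sensors. Thresholds, field
# names, and readings are invented for illustration.
from collections import defaultdict, deque

BASELINE_SIZE = 60  # last 60 readings ~ "how the machine usually behaves"
history = defaultdict(lambda: deque(maxlen=BASELINE_SIZE))

def on_sensor_reading(machine_id, units_per_minute, temperature_c):
    readings = history[machine_id]
    alerts = []
    if len(readings) >= 10:  # need some baseline before judging
        avg_units = sum(r[0] for r in readings) / len(readings)
        avg_temp = sum(r[1] for r in readings) / len(readings)
        if units_per_minute < 0.6 * avg_units:
            alerts.append(f"{machine_id}: throughput dropped to {units_per_minute} (baseline {avg_units:.0f})")
        if temperature_c > avg_temp + 15:
            alerts.append(f"{machine_id}: running {temperature_c - avg_temp:.0f}C hotter than usual")
    readings.append((units_per_minute, temperature_c))
    for a in alerts:
        print("ALERT:", a)  # in practice this would page maintenance / open a ticket
```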
Now let's talk about the technical challenges, starting with event-time processing.
Implementing real-time machine learning is not without its own challenges.
A significant hurdle is event-time processing: accurately handling data based on when events actually happened, not when they were recorded.
It's a little hard to visualize, so let's take a moment to understand what is going on.
There is a slight delay between when an event actually occurs and when it is recorded.
Let's say you clicked the add-to-cart button on amazon.com; that event is sent to the backend server for processing.
You clicked add to cart right now, but because of a server delay or something similar, the backend processing system received that event a couple of minutes late.
That does not mean the add to cart never happened; it means the add-to-cart event was recorded late.
That is the distinction: when the event occurred versus when it was actually processed.
Around 15% of streaming data arrives out of order, because there are a lot of issues: these are all computers where the events are being recorded, so there may be a network problem, a machine that went down, and so on, which makes out-of-order streaming data a very common occurrence.
That complicates accurate analytics, and it has been one of the bigger challenges.
To overcome this, methods like sliding windows and watermarking are used to ensure data accuracy; let's take a moment to understand what these are.
With a sliding window, going back to the e-commerce example: you are a software engineer and, as part of the data, an add-to-cart event has come to you.
You see that, but weirdly enough, you don't see an event where the user landed on the homepage.
That doesn't make sense; how would you add to cart before going to the homepage?
You go to the homepage, you browse the product, you then add to cart, so there are three events, but somehow, because of some server delay or whatever, add to cart is the one you got first.
So what you say is, hey, I'll wait for a minute or two for the homepage event and the product page event; that is how you ensure data accuracy.
Once you have waited those two minutes and gotten the homepage and product page events, you bundle them together, saying, okay, now I have the full data, and I can order those events properly: homepage first, product page next, then the add-to-cart page.
So you can see the user path, or the user behavior, as it is needed.
And if you take another example, stock market trading, the precise timing of transactions is critical.
You cannot just say, I want to buy a stock right now, but because the event got delayed the stock price rose, and then tell the customer that, unfortunately, we could not buy the stock.
Incorrect ordering can cause significant financial impact.
These sorts of events have to be kept in mind, and as such, proper methods have been devised, like sliding windows and watermarking.
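Here is a hand-rolled illustration of event-time ordering with a watermark, which frameworks like Flink implement for you: events carry their own event_time, they are held briefly, and once the watermark (the maximum event time seen minus an allowed lateness) passes an event, it is emitted in event-time order. The field names and the two-minute lateness are assumptions for the example.

```python
# Hand-rolled event-time buffer with a watermark; real frameworks like
# Flink provide this. Field names and lateness are assumptions.
import heapq
import itertools

ALLOWED_LATENESS = 120  # seconds we are willing to wait for stragglers

class EventTimeBuffer:
    def __init__(self):
        self.heap = []                     # ordered by event time
        self.counter = itertools.count()   # tie-breaker for equal timestamps
        self.max_event_time = 0

    def add(self, event):
        heapq.heappush(self.heap, (event["event_time"], next(self.counter), event))
        self.max_event_time = max(self.max_event_time, event["event_time"])
        return self.flush()

    def flush(self):
        # watermark = latest event time seen minus the allowed lateness
        watermark = self.max_event_time - ALLOWED_LATENESS
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap)[2])
        return ready  # emitted in event-time order

buf = EventTimeBuffer()
# add_to_cart arrives first even though it happened after the homepage visit
for ev in [{"page": "add_to_cart", "event_time": 300},
           {"page": "homepage", "event_time": 100},
           {"page": "product", "event_time": 200},
           {"page": "checkout", "event_time": 500}]:
    for out in buf.add(ev):
        print(out["page"], out["event_time"])  # homepage, product, add_to_cart
```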
Next, let's understand model drift in continuous systems, which is another major challenge.
Over time, data patterns naturally change.
This causes the models to become less accurate, because they were trained on a particular set of data with particular patterns, but the patterns have since changed.
You usually have to update the models very frequently, meaning training has to occur very frequently, so that the models keep up with the data trend.
Take retail businesses: customer shopping behaviors change seasonally.
In any clothing business, summer clothes and winter clothes are very common, and the buying pattern for summer clothes increases over summertime and decreases as winter approaches.
The same goes for winter clothes, which increase over the wintertime and decrease as spring approaches.
Considering such seasonal behaviors, which influence what products people usually buy, real-time systems should address this by detecting changes in the data distribution and automatically retraining the model.
That retraining of models is very important to make sure you keep up with the data trend.
Think of it as keeping the system fresh and accurate, automatically adjusting to shifts without manual intervention.
There will always be some data changes which humans might not have been thinking about; for example, I've seen scenarios where flower purchases increased by quite a lot, and when some other events occurred they decreased again, and so on.
Such things are almost impossible to predict; as a human, you cannot keep track of every single data pattern, so you want your models to automatically take care of it.
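One simple way to implement that drift check, sketched below with placeholder values: compare the distribution of a feature in recent traffic against its distribution in the training data using a two-sample Kolmogorov-Smirnov test, and trigger retraining when they diverge. The retrain hook and the p-value cutoff are placeholders; production systems typically watch many features and often use metrics like PSI as well.

```python
# Drift check via a two-sample KS test; cutoff and retrain() hook are
# placeholders for illustration.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(training_values, recent_values, p_threshold=0.01):
    result = ks_2samp(training_values, recent_values)
    return result.pvalue < p_threshold, result.pvalue

def maybe_retrain(training_values, recent_values, retrain):
    drifted, p = drift_detected(training_values, recent_values)
    if drifted:
        print(f"drift detected (p={p:.4f}), kicking off retraining")
        retrain()
    else:
        print(f"no significant drift (p={p:.4f})")

# e.g. summer basket sizes vs. the winter data the model was trained on
rng = np.random.default_rng(0)
maybe_retrain(rng.normal(50, 10, 5000), rng.normal(62, 12, 5000), retrain=lambda: None)
```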
Now, talking about the architectural comparison: choosing the right streaming framework is critical.
Until now, we have established why stream processing is important and how to overcome the challenges.
Now let's discuss what streaming frameworks are actually out there, what the differences between them are, and how you compare them against each other.
For example, Apache Kafka can process about 2 million events per second with around 10 milliseconds of latency.
Apache Flink is even faster, reaching sub-millisecond latency, which is ideal for financial markets or real-time gaming.
Cloud services like AWS Kinesis and Azure Event Hubs provide simpler integration, but tend to have slightly higher latency.
So businesses often adopt hybrid approaches to balance performance, complexity, and convenience.
For instance, using Kafka for event ingestion and Flink for processing might be ideal for complex environments.
You might ask, why not just go for the fastest one?
There are a lot of reasons; when it comes to big companies, or to your own use case, it depends on what you're looking for, and it's not always ideal to go for the highest-performing option.
I'm not saying that increasing latency is what you want, but there are many things you have to keep in mind: how many teams are looking for this data, what latency is tolerated, what latency you really want to shoot for, and so on.
Some things are easy to integrate, some are very hard to integrate, and sometimes the learning curve is way too steep.
There are a lot of decisions and thought processes which go into selecting a particular streaming framework.
But at least over the span of my career, I've always worked with hybrid systems.
As I was just saying, Kafka for event ingestion is very widely used because it allows for seamless event ingestion, and Flink for processing is very widely used as well.
There are also Spark Streaming and Spark Structured Streaming, and so on, which provide similar functionality, but it really depends on what functionality you are looking for.
For instance, some frameworks provide better watermarking, some provide better latency.
If you require better watermarking, you go with the framework that supports it.
I'm just trying to lay out the details so that you can make an informed decision when choosing a particular streaming framework.
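For the ingestion half of such a hybrid setup, a minimal sketch might look like the following, assuming the kafka-python client and a broker on localhost:9092; the topic name and event fields are made up, and a separate Flink or Spark job would subscribe to the topic for processing.

```python
# Ingestion sketch: push clickstream events into a Kafka topic.
# Assumes the kafka-python client and a broker on localhost:9092;
# the topic name and event fields are illustrative.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_click(user_id, page):
    event = {"user_id": user_id, "page": page, "event_time": time.time()}
    # key by user so all of a user's events land in the same partition, in order
    producer.send("clickstream", key=user_id.encode("utf-8"), value=event)

publish_click("u-123", "homepage")
publish_click("u-123", "add_to_cart")
producer.flush()
```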
Now, let's talk about the implementation: how does it work when you are trying to implement a real-time architecture?
First, the data enters through high-speed brokers like Kafka.
Kafka is used so that if, say, a server fails and you somehow don't process the event downstream, you still have time to process it later, and it can handle millions of events per second.
Next, a processor like Flink manages complex computations and real-time analysis.
Then a real-time feature store quickly delivers essential data to the models, because that is important for creating features.
And finally, an optimized model serving infrastructure produces predictions and constantly monitors system accuracy.
This setup includes automatic retraining when accuracy drops, ensuring consistent performance and reliability.
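Tying those stages together, here is a minimal sketch of the serving side under some assumptions: the feature store is Redis (one common choice), the keys and feature names are placeholders, and the model object is hypothetical. It scores an incoming event with precomputed features and tracks a rolling accuracy estimate that can trigger retraining when it drops.

```python
# Serving-side sketch: fetch features from a Redis-backed feature store,
# score the event, and monitor rolling accuracy to trigger retraining.
# Keys, feature names, and the model object are placeholders.
from collections import deque

import redis  # assumes a Redis feature store at localhost:6379

feature_store = redis.Redis(host="localhost", port=6379, decode_responses=True)
recent_outcomes = deque(maxlen=1000)  # rolling window of (predicted, actual)

def get_features(user_id):
    raw = feature_store.hgetall(f"user_features:{user_id}")
    return {k: float(v) for k, v in raw.items()}

def score(event, model):
    features = get_features(event["user_id"])
    return model.predict(features)  # e.g. a fraud probability

def record_outcome(predicted_label, actual_label, retrain, min_accuracy=0.9):
    recent_outcomes.append((predicted_label, actual_label))
    if len(recent_outcomes) == recent_outcomes.maxlen:
        accuracy = sum(p == a for p, a in recent_outcomes) / len(recent_outcomes)
        if accuracy < min_accuracy:
            retrain()  # automatic retraining hook when accuracy drops
```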
Now let's talk about the key takeaways and next steps.
To wrap things up, implementing real-time machine learning has clear, substantial benefits: dramatically reducing fraud, significantly improving sales in e-commerce, and minimizing downtime in manufacturing.
The keys to success are to choose your architecture very carefully, continuously monitor performance, and automate model retraining.
I would say begin by targeting your most impactful, latency-sensitive applications.
From there, incrementally build your capabilities and be on the lookout for changes in your data environment.
Thank you so much for your attention.
I hope you learned a great deal about streaming data architectures and streaming frameworks, and where they are used.
Again, thanks for attending the conference.
Hope you have a really good time.