Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
Welcome to the Site Reliability Engineering Conference of 2025.
My name is Ani.
I am from Rapid Cloud Infrastructure and I'm very excited to talk about zero-downtime ML observability today.
Today we will explore how to implement observability in machine learning systems without causing any disruption or downtime to the services that rely on them.
Machine learning is no longer an experimental playground.
It has become a reality and is being used in pretty much all production systems, but it brings with it a unique set of operational challenges.
Today we'll explore some of the SRE challenges being faced and how we can make ML systems more observable, scalable, and reliable.
So let's dive right in.
As you can see, there are two different worlds today: one of ML engineers and one of SREs.
These two work on very different philosophies.
ML engineers are experts at building models, whereas SREs are experts at running systems.
ML engineers tend to be focused more on experiments and they work in an iterative process.
SREs are more focused on the service level agreements and the service level objectives of their services.
ML engineers tend to focus more on the accuracy of the models and of the predictions made by the models, whereas SREs are focused more on the uptime and the latency of their distributed system services.
ML engineers work more in an offline setting where they train their models, evaluate them, and work on inference, whereas SREs work with real-time production metrics.
As you can see, there is a gap between both of these worlds.
So now, when SREs are tasked with running machine learning services in production, they will likely face a lot of challenges, because they have not designed these models and they don't know how these models are supposed to run or what predictions come along with them.
Right?
So let's try and understand what the traditional observability metrics look like.
Today's traditional observability tools, which are equipped for distributed systems, are built around logs, metrics, and traces.
These tell us whether the distributed service is up, whether it is running into errors, whether it is one hundred percent available, and so on.
But they don't necessarily tell us whether the model being run by the service is working correctly and accurately, and whether it is predicting the correct output it was designed for, so errors tend to be pretty silent when they occur in these ML systems in production.
The predictions returned by these models could be very much, or even super, biased, but it is more than likely that an SRE won't get to notice this.
There could be drift arising in the data that is degrading the performance of the model, but when an SRE looks at all the metrics, they show that all the services are up and fine and meeting their SLAs and SLOs.
So, as we see, there are no standard SLOs defined for ML quality today through which an SRE can track the accuracy, fairness, and correctness of these models.
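Purely as an illustration of what such an ML-quality SLO could look like (the metric names and thresholds below are invented for this example, not an existing standard):

```python
# Hypothetical ML-quality SLOs an SRE team might define; names and
# thresholds are illustrative only, not an existing standard.
ML_QUALITY_SLOS = {
    "prediction_rmse": {"target": 0.15, "window": "7d"},       # error vs. ground truth
    "prediction_drift_psi": {"target": 0.2, "window": "24h"},  # population stability index
    "feature_null_rate": {"target": 0.01, "window": "1h"},     # data quality
}

def slo_violations(measured: dict) -> list[str]:
    """Return the names of SLOs whose measured value exceeds the target."""
    return [
        name for name, slo in ML_QUALITY_SLOS.items()
        if measured.get(name, 0.0) > slo["target"]
    ]
```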
So what are the other challenges that ML engineers are facing, and that the SREs running these systems are facing?
As we spoke about, there are silent failures, which often grow undetected: the system is producing incorrect output, but the SRE has no way to notice it.
There are data drifts happening, and we'll talk about the different kinds of drift in the next section, but the data distribution has changed over time.
There could be prediction drift, but there is not enough tooling around it today that an SRE could use, and it changes slowly.
The model could very well get stale if it is not retrained, so the model loses its effectiveness, and an SRE should be equipped to deal with it and to detect it early on.
A lot of debugging issues could arise, and the features tracked by the models often lack visibility.
There could be versioning problems, where the code version does not match the model version, and latency issues could also arise: the inference latency could grow, the loading of the model could take longer, which results in delay at the customer end.
For a good ML monitoring service to work, all of these challenges need to be addressed.
So an ML observability model should be able to monitor, debug, and understand these complex and complicated systems that are in production.
The data ingested by these models needs to be very accurate; it should mimic real-time traffic and not any synthetic traffic, so the data quality needs to be high and consistent over time.
There should be a way to track the distribution of the features over time, so that SREs are able to see if there is any change in this distribution and catch it early on.
The model performance should be tracked using metrics over time, so that we can detect whether the model is still performing accurately and correctly and hasn't changed its trend.
There should be a way for SREs to identify changes in the input and output data distributions, which is known as drift detection, a way to track different versions of the model across retraining, and last, the model should be compared against live traffic and not just against the data on which it was trained.
Think of observability here not just as monitoring the pipeline, but as observing the health and reliability of the data, the logic, and the outcomes.
It's like adding x-ray vision to the black box.
Let's look at a simple ML monitoring system in production.
An ideal model monitoring service should sit right next to the prediction service.
It should not just ingest the data input that is fed to the pipeline, but also the predictions emitted by the model.
With this, it'll be able to make some complex calculations and feed them to the monitoring and visualization tools.
Now, these tools could help SREs look at any alarms or metrics that get generated if the model is not behaving at par.
And whenever possible, there should be a batch metric calculation that looks at the historical data, compares it with the ground truth, and sees if there are any changes along the way.
From an SRE perspective, the system looks very much like a service, but this service's correctness depends not just on the binary, but on dynamic external input data and its own historic behavior.
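As a rough sketch of that shape, assuming a hypothetical prediction service and an out-of-band buffer (names like `monitoring_queue` and `log_prediction` are invented for this example):

```python
import json
import queue
import threading
import time

# Hypothetical out-of-band buffer; in practice this would be Kafka,
# Pub/Sub, or similar, so logging never blocks the inference path.
monitoring_queue: "queue.Queue[dict]" = queue.Queue()

def log_prediction(features: dict, prediction: float, model_version: str) -> None:
    """Called from the prediction service; enqueue and return immediately."""
    monitoring_queue.put({
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    })

def monitoring_worker() -> None:
    """Runs next to the prediction service and feeds dashboards and alerts."""
    while True:
        record = monitoring_queue.get()
        # Placeholder for the complex calculations mentioned in the talk:
        # drift statistics, rolling error once ground truth arrives, etc.
        print(json.dumps(record))

threading.Thread(target=monitoring_worker, daemon=True).start()
```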
Let's look at some simple metrics that are out there for a shopping cart.
As you can see above, there is an average latency for every region, and there is something known as RMSE, which is basically root mean squared error, a metric often tracked in the machine learning world to see how far the prediction is from the ground truth.
It should be as low as possible.
Then there is real user metric data.
For example, there are page views being tracked and the number of checkouts, and towards the end we can see an end-to-end conversion rate, which tells us how many users visited the shopping page to browse sofas, how many of them added those sofas to the cart, and how many of those sofas were actually checked out.
These are some of the good metrics that SREs could track, especially the RMSE.
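For reference, a minimal sketch of how RMSE could be computed over a batch of predictions (the sample numbers are made up for illustration):

```python
import math

def rmse(predictions: list[float], ground_truth: list[float]) -> float:
    """Root mean squared error: how far predictions are from the ground truth."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, ground_truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up example: predicted vs. actual values for a few requests.
print(rmse([4.8, 5.1, 3.9], [5.0, 5.0, 4.0]))  # ~0.14, lower is better
```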
Now let's go and take a look at the different types of drift that we spoke about earlier.
There is prediction drift, data drift, feature drift, and feature attribution drift.
Today there are tools available that can track these drifts, so an SRE can detect them early on.
As SREs supporting ML workloads, our job doesn't stop once a model is deployed.
In fact, that's where it really starts.
One of the key challenges is understanding if and when the model starts behaving differently, and this is where drift monitoring becomes very crucial and critical, even when ground truth is delayed or unavailable.
Monitoring drift helps us maintain model quality, reduce customer impact, and trigger retraining workflows at the right time.
Let's talk about prediction drift.
This is when the distribution of model predictions starts changing over time.
Let's take an example.
Let's say we have a recommendation model that suggests products.
Normally it suggests a mix of budget and premium items, but over time we have noticed the model is heavily skewed towards premium products.
So now that's a prediction drift.
Why could it be happening?
It could be because of shifts in user behavior, seasonal shopping trends like Black Friday or some festivities, or even a bug in the upstream service feeding the model.
Importantly, we don't need ground truth to detect this.
Just monitoring the shape of the predictions is enough.
Prediction drift can be the very first alarm bell an SRE sees that something is going wrong.
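A minimal sketch of that idea, comparing the shape of recent predictions against a reference window with SciPy's two-sample Kolmogorov-Smirnov test (the 0.05 threshold is just an illustrative choice):

```python
from scipy.stats import ks_2samp

def prediction_drift_detected(reference: list[float],
                              current: list[float],
                              alpha: float = 0.05) -> bool:
    """Flag drift when the current prediction distribution differs
    significantly from the reference window; no ground truth needed."""
    statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Reference week vs. current day of predicted "premium-ness" scores (made up).
reference_scores = [0.2, 0.3, 0.4, 0.5, 0.3, 0.2, 0.4]
current_scores = [0.8, 0.9, 0.7, 0.85, 0.95, 0.8, 0.9]
print(prediction_drift_detected(reference_scores, current_scores))  # True
```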
Now let's go on to the next drift: data drift.
Data drift is a typical example of a wide gap between the training data set and the real-world data the model sees in production.
For example, if a model hasn't seen any data coming out of a new geography, it is unlikely to predict the correct output.
So this data drift happens typically when the model hasn't seen this type of data.
And it could be for reasons like the business expanding into new geographies, a new product being rolled out that the model hasn't been trained on, or new marketing campaigns due to which this new data has shown up.
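A hedged sketch of one simple check for the new-geography case: flag live values of a categorical feature that never appeared in the training data (the names are illustrative):

```python
def unseen_categories(training_values: list[str], live_values: list[str]) -> set[str]:
    """Return categorical values seen in live traffic but never in training."""
    return set(live_values) - set(training_values)

# Illustrative example: the model was only trained on two geographies.
training_geos = ["us-east", "eu-west"]
live_geos = ["us-east", "eu-west", "ap-south", "ap-south"]
print(unseen_categories(training_geos, live_geos))  # {'ap-south'}
```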
Now let's take a look at one more granular drift, which is feature drift.
It is much more granular than data drift, and it also happens because of a change in data, but for a particular feature: a shift in the input values of an individual feature.
For example, let's say we have a user location feature, and most of the time it used to be 70% urban and 30% rural.
But now over time it has flipped, maybe due to a regional ad campaign.
Now, that change in the feature distribution is known as feature drift.
This often results from changes in how or where the data is collected, or shifts in the user population.
It is especially important to track the features that are heavily weighted in your model.
So you would want to track the most important features of your model and take a careful look specifically at the feature drifts that are affecting the model in a higher ratio.
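One common way to quantify this kind of shift is the population stability index (PSI); here is a minimal sketch over the urban/rural example (the 0.2 alert threshold is a widely used rule of thumb, not a standard):

```python
import math

def psi(expected: dict[str, float], actual: dict[str, float]) -> float:
    """Population stability index between two category-share distributions."""
    return sum(
        (actual[c] - expected[c]) * math.log(actual[c] / expected[c])
        for c in expected
    )

# Training-time shares vs. what we see in live traffic after the ad campaign.
training_shares = {"urban": 0.70, "rural": 0.30}
live_shares = {"urban": 0.30, "rural": 0.70}
score = psi(training_shares, live_shares)
print(score, "drift!" if score > 0.2 else "stable")  # ~0.68 -> drift!
```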
The last drift is feature attribution drift.
This is a shift in how important a feature is to the model's predictions, especially across retrainings.
Now, let's take an example.
Initially, our model ranked price as the most important feature for its predictions.
But after several retrainings and over time, suddenly brand emerges to become more influential.
Now, that's an attribution drift.
Why could it be happening?
It is usually caused by retraining on new data.
Sometimes it could be subtle label noise that was introduced, or feature correlations that have appeared, or it could be due to overfitting, so that the model has learned over time to make decisions differently than it used to.
Attribution drift doesn't always show up in the raw data, but it changes how the model thinks.
We can have very stable input data and still see a noticeable performance drop because of the attribution shift.
These are typically a little hard to catch.
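As a hedged sketch of one way to surface it, assuming you store a normalized feature-importance vector for each model version at training time (how those importances are computed, for example SHAP or permutation importance, is up to the team):

```python
def attribution_shift(old_importance: dict[str, float],
                      new_importance: dict[str, float]) -> list[tuple[str, float]]:
    """Rank features by how much their importance changed between versions."""
    deltas = {
        feature: new_importance.get(feature, 0.0) - old_importance.get(feature, 0.0)
        for feature in set(old_importance) | set(new_importance)
    }
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Illustrative importances logged for two model versions.
v1 = {"price": 0.55, "brand": 0.15, "rating": 0.30}
v2 = {"price": 0.20, "brand": 0.55, "rating": 0.25}
print(attribution_shift(v1, v2))  # price and brand have swapped places
```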
As SREs or platform engineers, we are not just monitoring CPU and memory anymore.
We are monitoring the health of our algorithms, and adding drift metrics alongside traditional observability metrics has become all the more crucial.
It brings the right attention to all the silent failures happening in ML systems, so that we can catch them very early on and hopefully mitigate any large-scale events or any ML-related incidents.
Now let's spend a little bit of time looking at what a good ML observability model looks like.
Starting with data logging: as we spoke about earlier, the monitoring service should not only capture the raw inputs, but also the predictions, so that it can make the right set of calculations.
Next is feature monitoring: each of the important features tracked by the model should be captured and its drift should be populated.
Then comes prediction monitoring: the monitoring service should have the ability to compare the predicted outcomes against the actual outcomes based on ground truth.
Then comes versioning: there should be a way for the service to tie model predictions to the model version that produced them.
And last, the most popular one: an SRE should have dashboards through which they can visualize the latency, the throughput, and most importantly, the drifts happening over time.
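Tying several of those pieces together, a hedged sketch of the kind of record such a service might log per request (the field names are invented for this example):

```python
from dataclasses import dataclass, field
import time

@dataclass
class PredictionRecord:
    """One logged inference: enough to replay, drift-check, and version-trace it."""
    model_version: str                  # which retrained model produced this output
    features: dict[str, float]          # raw inputs, for feature/data drift checks
    prediction: float                   # model output, for prediction drift checks
    ground_truth: float | None = None   # filled in later, when the label arrives
    timestamp: float = field(default_factory=time.time)

record = PredictionRecord(
    model_version="sofa-recs-2025-06-01",
    features={"price": 499.0, "rating": 4.5},
    prediction=0.87,
)
```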
So how have the SRE practices changed?
We have already started seeing that the SRE practices have evolved.
In our world, the traditional SRE practices were more around latency monitoring and health checks, whether the services are running fine, and the incident response was catered around that.
But now the ML-aware SRE practices let us monitor prediction drift and all of the drifts that we spoke about earlier.
There are enough data pipeline checks, shadow deployments are also happening, and the incident response is catered from an ML perspective.
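As a hedged sketch of what one of those data pipeline checks might look like, here is a simple null-rate gate on an incoming feature batch (the 1% threshold is an invented example):

```python
def null_rate_check(batch: list[dict], feature: str, max_null_rate: float = 0.01) -> bool:
    """Pass only if the share of missing values for `feature` stays under the threshold."""
    nulls = sum(1 for row in batch if row.get(feature) is None)
    return nulls / len(batch) <= max_null_rate

# Illustrative batch: one of fifty rows is missing the price feature (2% > 1%).
batch = [{"price": 100.0}] * 49 + [{"price": None}]
print(null_rate_check(batch, "price"))  # False -> block or alert before inference
```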
Now, what are the tools that one can use for detecting these?
There are open source tools that could be used, like Evidently, while the major cloud providers have also rolled out tools: Vertex AI is a popular one from Google, and Amazon SageMaker is another one that is quite popular amongst AWS customers.
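As one hedged example, Evidently's report API (as found in its 0.4.x releases; the library's interface has changed across versions) can compare a reference dataset against current traffic:

```python
# A sketch using Evidently's 0.4.x-era report API; the interface may
# differ in other versions of the library.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.DataFrame({"price": [100, 120, 110], "rating": [4.2, 4.5, 4.4]})
current = pd.DataFrame({"price": [350, 400, 420], "rating": [4.1, 4.3, 4.6]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")  # feed into dashboards or review
```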
Let's look at why zero downtime matters.
Zero downtime matters for any distributed system service, but especially for real-time or customer-facing systems.
Any monitoring or analysis step that disrupts availability can lead to significant business impact and revenue loss.
So zero downtime is non-negotiable.
Now, as we spoke about earlier, when we are deploying a new model, we need to send light traffic to a shadow model, compare its decisions with production, and flip over only when we are fairly confident.
Having said that, all the observability monitoring should happen asynchronously to the inferencing path.
It should not come in the way of the critical path; otherwise, it'll impact the latency.
And if at all the model is not predicting, or is giving wrong predictions, there should be graceful degradation.
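A minimal sketch of those three ideas together, with hypothetical `prod_model`, `shadow_model`, and `monitor` objects standing in for real services:

```python
import threading

def serve(request_features: dict, prod_model, shadow_model, monitor) -> float:
    """Serve from production; shadow-compare and log off the critical path."""
    try:
        prod_prediction = prod_model.predict(request_features)
    except Exception:
        # Graceful degradation: fall back to a safe default instead of failing.
        return monitor.fallback_prediction(request_features)

    def compare_in_background() -> None:
        shadow_prediction = shadow_model.predict(request_features)
        monitor.record(request_features, prod_prediction, shadow_prediction)

    # Asynchronous: the customer never waits on the shadow model or on logging.
    threading.Thread(target=compare_in_background, daemon=True).start()
    return prod_prediction
```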
Last but not least, the observability has to be part of every CI/CD pipeline in production.
Wrapping up with my very last thoughts: ML systems fail differently.
They rot silently; they don't always crash, but they decay.
As SREs, we must evolve our observability mindset to include these nuances and build systems that don't just stay up, but stay smart.
Thank you for your time.