Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
Welcome to the Site Reliability Engineering Conference of 2025.
My name is Ani.
I am from Rapid Cloud Infrastructure and I'm very excited to talk about zero-downtime ML observability today.
Today we will explore how to implement observability in machine learning systems without causing any disruption or downtime to the services that rely on them.
Machine learning is no longer an experimental playground.
It has become a reality and is being used in pretty much all production systems, but it brings with it a unique set of operational challenges.
Today we'll explore some of the SRE challenges being faced and how we can make ML systems more observable, scalable, and reliable.
So let's dive right in.
As you can see, there are two different worlds today: one of ML engineers and one of SREs.
These two work on very different philosophies.
ML engineers are experts at building models, whereas SREs are experts at running systems.
ML engineers tend to be focused more on experiments and they work in an iterative process.
SREs are more focused on the service level agreements and the service level objectives of their services.
ML engineers tend to focus more on the accuracy of the models and of the predictions made by the models, whereas SREs are focused more on the uptime and the latency of their distributed system services.
ML engineers work more in an offline setting where they train their models, evaluate them, and work on inference, whereas SREs work with real-time production metrics.
As you can see, there is a gap between both of these worlds.
So now, when SREs are tasked with running machine learning services in production, they will likely face a lot of challenges, because they have not designed these models and they don't know how these models are supposed to run or what predictions come along with them.
Right?
So let's try and understand what the traditional observability metrics look like.
Today's traditional observability tools, which are equipped for distributed systems, are built around logs, metrics, and traces.
These tell us whether the distributed service is up, whether it is running into errors, whether it is one hundred percent available, and so on.
But they don't necessarily tell us whether the model being run by the service is working correctly and accurately, and whether it is predicting the correct output it was designed for, so errors tend to be pretty silent when they occur in these ML systems in production.
The predictions returned by these models could be very much, or even super, biased, but it is more than likely that an SRE won't get to notice this.
There could be drift arising in the data that is degrading the performance of the model, but when an SRE looks at all the metrics, they show that all the services are up and fine and meeting their SLAs and SLOs.
So, as we see, there are no standard SLOs defined for ML quality today through which an SRE can track the accuracy, fairness, and correctness of these models.
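Purely as an illustration of what such an ML-quality SLO could look like (the metric names and thresholds below are invented for this example, not an existing standard):

```python
# Hypothetical ML-quality SLOs an SRE team might define; names and
# thresholds are illustrative only, not an existing standard.
ML_QUALITY_SLOS = {
    "prediction_rmse": {"target": 0.15, "window": "7d"},       # error vs. ground truth
    "prediction_drift_psi": {"target": 0.2, "window": "24h"},  # population stability index
    "feature_null_rate": {"target": 0.01, "window": "1h"},     # data quality
}

def slo_violations(measured: dict) -> list[str]:
    """Return the names of SLOs whose measured value exceeds the target."""
    return [
        name for name, slo in ML_QUALITY_SLOS.items()
        if measured.get(name, 0.0) > slo["target"]
    ]
```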
So what are the other challenges that ML engineers are facing, and that the SREs running these systems are facing?
As we spoke about, there are silent failures, which often grow undetected: the system is producing incorrect output, but the SRE has no way to notice it.
There are data drifts happening, and we'll talk about the different kinds of drift in the next section, but the data distribution has changed over time.
There could be prediction drift, but there is not enough tooling around it today that an SRE could use, and it changes slowly.
The model could very well get stale if it is not retrained, so the model loses its effectiveness, and an SRE should be equipped to deal with it and to detect it early on.
A lot of debugging issues could arise, and the features tracked by the models often lack visibility.
There could be versioning problems, where the code version does not match the model version, and latency issues could also arise: the inference latency could grow, the loading of the model could take longer, which results in delay at the customer end.
For a good ML monitoring service to work, all of these challenges need to be addressed.
So an ML observability model should be able to monitor, debug, and understand these complex and complicated systems that are in production.
The data ingested by these models needs to be very accurate; it should mimic real-time traffic and not any synthetic traffic, so the data quality needs to be high and consistent over time.
There should be a way to track the distribution of the features over time, so that SREs are able to see if there is any change in this distribution and catch it early on.
The model performance should be tracked using metrics over time, so that we can detect whether the model is still performing accurately and correctly and hasn't changed its trend.
There should be a way for SREs to identify changes in the input and output data distributions, which is known as drift detection, a way to track different versions of the model across retraining, and last, the model should be compared against live traffic and not just against the data on which it was trained.
Think of observability here not just as monitoring the pipeline, but as observing the health and reliability of the data, the logic, and the outcomes.
It's like adding x-ray vision to the black box.
Let's look at a simple ML monitoring system in production.
An ideal model monitoring service should sit right next to the prediction service.
It should not just ingest the data input that is fed to the pipeline, but also the predictions emitted by the model.
With this, it'll be able to make some complex calculations and feed them to the monitoring and visualization tools.
Now, these tools could help SREs look at any alarms or metrics that get generated if the model is not behaving at par.
And whenever possible, there should be a batch metric calculation that looks at the historical data, compares it with the ground truth, and sees if there are any changes along the way.
From an SRE perspective, the system looks very much like a service, but this service's correctness depends not just on the binary, but on dynamic external input data and its own historic behavior.
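As a rough sketch of that shape, assuming a hypothetical prediction service and an out-of-band buffer (names like `monitoring_queue` and `log_prediction` are invented for this example):

```python
import json
import queue
import threading
import time

# Hypothetical out-of-band buffer; in practice this would be Kafka,
# Pub/Sub, or similar, so logging never blocks the inference path.
monitoring_queue: "queue.Queue[dict]" = queue.Queue()

def log_prediction(features: dict, prediction: float, model_version: str) -> None:
    """Called from the prediction service; enqueue and return immediately."""
    monitoring_queue.put({
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    })

def monitoring_worker() -> None:
    """Runs next to the prediction service and feeds dashboards and alerts."""
    while True:
        record = monitoring_queue.get()
        # Placeholder for the complex calculations mentioned in the talk:
        # drift statistics, rolling error once ground truth arrives, etc.
        print(json.dumps(record))

threading.Thread(target=monitoring_worker, daemon=True).start()
```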
Let's look at some simple metrics that are out there for a shopping cart.
As you can see above, there is an average latency for every region, and there is something known as RMSE, which is basically root mean squared error, a metric often tracked in the machine learning world to see how far the prediction is from the ground truth.
It should be as low as possible.
Then there is real user metric data.
For example, there are page views being tracked and the number of checkouts, and towards the end we can see an end-to-end conversion rate, which tells us how many users visited the shopping page to browse sofas, how many of them added those sofas to the cart, and how many of those sofas were actually checked out.
These are some of the good metrics that SREs could track, especially the RMSE.
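For reference, a minimal sketch of how RMSE could be computed over a batch of predictions (the sample numbers are made up for illustration):

```python
import math

def rmse(predictions: list[float], ground_truth: list[float]) -> float:
    """Root mean squared error: how far predictions are from the ground truth."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, ground_truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up example: predicted vs. actual values for a few requests.
print(rmse([4.8, 5.1, 3.9], [5.0, 5.0, 4.0]))  # ~0.14, lower is better
```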
Now let's go and take a look at the different types of drift that we spoke about earlier.
There is prediction drift, data drift, feature drift, and feature attribution drift.
Today there are tools available that can track these drifts, so an SRE can detect them early on.
As SREs supporting ML workloads, our job doesn't stop once a model is deployed.
In fact, that's where it really starts.
One of the key challenges is understanding if and when the model starts behaving differently, and this is where drift monitoring becomes very crucial and critical, even when ground truth is delayed or unavailable.
Monitoring drift helps us maintain model quality, reduce customer impact, and trigger retraining workflows at the right time.
Let's talk about prediction drift.
This is when the distribution of model predictions starts changing over time.
Let's take an example.
Let's say we have a recommendation model that suggests products.
Normally it suggests a mix of budget and premium items, but over time we have noticed the model is heavily skewed towards premium products.
So now that's a prediction drift.
Why could it be happening?
It could be because of shifts in user behavior, seasonal shopping trends like Black Friday or some festivities, or even a bug in the upstream service feeding the model.
Importantly, we don't need ground truth to detect this.
Just monitoring the shape of the predictions is enough.
Prediction drift can be the very first alarm bell an SRE sees that something is going wrong.
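A minimal sketch of that idea, comparing the shape of recent predictions against a reference window with SciPy's two-sample Kolmogorov-Smirnov test (the 0.05 threshold is just an illustrative choice):

```python
from scipy.stats import ks_2samp

def prediction_drift_detected(reference: list[float],
                              current: list[float],
                              alpha: float = 0.05) -> bool:
    """Flag drift when the current prediction distribution differs
    significantly from the reference window; no ground truth needed."""
    statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Reference week vs. current day of predicted "premium-ness" scores (made up).
reference_scores = [0.2, 0.3, 0.4, 0.5, 0.3, 0.2, 0.4]
current_scores = [0.8, 0.9, 0.7, 0.85, 0.95, 0.8, 0.9]
print(prediction_drift_detected(reference_scores, current_scores))  # True
```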
Now let's go on to the next drift: data drift.
Data drift is a typical example of a wide gap between the training data set and the real-world data the model sees in production.
For example, if a model hasn't seen any data coming out of a new geography, it is unlikely to predict the correct output.
So this data drift happens typically when the model hasn't seen this type of data.
And it could be for reasons like the business expanding into new geographies, a new product being rolled out that the model hasn't been trained on, or new marketing campaigns due to which this new data has shown up.
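A hedged sketch of one simple check for the new-geography case: flag live values of a categorical feature that never appeared in the training data (the names are illustrative):

```python
def unseen_categories(training_values: list[str], live_values: list[str]) -> set[str]:
    """Return categorical values seen in live traffic but never in training."""
    return set(live_values) - set(training_values)

# Illustrative example: the model was only trained on two geographies.
training_geos = ["us-east", "eu-west"]
live_geos = ["us-east", "eu-west", "ap-south", "ap-south"]
print(unseen_categories(training_geos, live_geos))  # {'ap-south'}
```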
Now let's take a look at one more granular drift, which is feature drift.
It is much more granular than data drift, and it also happens because of a change in data, but for a particular feature: a shift in the input values of an individual feature.
For example, let's say we have a user location feature, and most of the time it used to be 70% urban and 30% rural.
But now over time it has flipped, maybe due to a regional ad campaign.
Now, that change in the feature distribution is known as feature drift.
This often results from changes in how or where the data is collected, or shifts in the user population.
It is especially important to track the features that are heavily weighted in your model.
So you would want to track the most important features of your model and take a careful look specifically at the feature drifts that are affecting the model in a higher ratio.
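One common way to quantify this kind of shift is the population stability index (PSI); here is a minimal sketch over the urban/rural example (the 0.2 alert threshold is a widely used rule of thumb, not a standard):

```python
import math

def psi(expected: dict[str, float], actual: dict[str, float]) -> float:
    """Population stability index between two category-share distributions."""
    return sum(
        (actual[c] - expected[c]) * math.log(actual[c] / expected[c])
        for c in expected
    )

# Training-time shares vs. what we see in live traffic after the ad campaign.
training_shares = {"urban": 0.70, "rural": 0.30}
live_shares = {"urban": 0.30, "rural": 0.70}
score = psi(training_shares, live_shares)
print(score, "drift!" if score > 0.2 else "stable")  # ~0.68 -> drift!
```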
The last drift is feature attribution drift.
This is a shift in how important a feature is to the model's predictions, especially across retrainings.
Now, let's take an example.
Initially, our model ranked price as the most important feature for its predictions.
But after several retrainings and over time, suddenly brand emerges to become more influential.
Now, that's an attribution drift.
Why could it be happening?
It is usually caused by retraining on new data.
Sometimes it could be subtle label noise that was introduced, or feature correlations that have appeared, or it could be due to overfitting, so that the model has learned over time to make decisions differently than it used to.
Attribution drift doesn't always show up in the raw data, but it changes how the model thinks.
We can have very stable input data and still see a noticeable performance drop because of the attribution shift.
These are typically a little hard to catch.
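As a hedged sketch of one way to surface it, assuming you store a normalized feature-importance vector for each model version at training time (how those importances are computed, for example SHAP or permutation importance, is up to the team):

```python
def attribution_shift(old_importance: dict[str, float],
                      new_importance: dict[str, float]) -> list[tuple[str, float]]:
    """Rank features by how much their importance changed between versions."""
    deltas = {
        feature: new_importance.get(feature, 0.0) - old_importance.get(feature, 0.0)
        for feature in set(old_importance) | set(new_importance)
    }
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Illustrative importances logged for two model versions.
v1 = {"price": 0.55, "brand": 0.15, "rating": 0.30}
v2 = {"price": 0.20, "brand": 0.55, "rating": 0.25}
print(attribution_shift(v1, v2))  # price and brand have swapped places
```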
As SREs or platform engineers, we are not just monitoring CPU and memory anymore.
We are monitoring the health of our algorithms, and adding drift metrics alongside traditional observability metrics has become all the more crucial.
It brings the right attention to all the silent failures happening in ML systems, so that we can catch them very early on and hopefully mitigate any large-scale events or any ML-related incidents.
Now let's spend a little bit of time looking at what a good ML observability model looks like.
Starting with data logging: as we spoke about earlier, the monitoring service should not only capture the raw inputs, but also the predictions, so that it can make the right set of calculations.
Next is feature monitoring: each of the important features tracked by the model should be captured and its drift should be populated.
Then comes prediction monitoring: the monitoring service should have the ability to compare the predicted outcomes against the actual outcomes based on ground truth.
Then comes versioning: there should be a way for the service to tie model predictions to the model version that produced them.
And last, the most popular one: an SRE should have dashboards through which they can visualize the latency, the throughput, and most importantly, the drifts happening over time.
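Tying several of those pieces together, a hedged sketch of the kind of record such a service might log per request (the field names are invented for this example):

```python
from dataclasses import dataclass, field
import time

@dataclass
class PredictionRecord:
    """One logged inference: enough to replay, drift-check, and version-trace it."""
    model_version: str                  # which retrained model produced this output
    features: dict[str, float]          # raw inputs, for feature/data drift checks
    prediction: float                   # model output, for prediction drift checks
    ground_truth: float | None = None   # filled in later, when the label arrives
    timestamp: float = field(default_factory=time.time)

record = PredictionRecord(
    model_version="sofa-recs-2025-06-01",
    features={"price": 499.0, "rating": 4.5},
    prediction=0.87,
)
```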
So how have the SRE practices changed?
We have already started seeing that the SRE practices have evolved.
In our world, the traditional SRE practices were more around latency monitoring and health checks, whether the services are running fine, and the incident response was catered around that.
But now the ML-aware SRE practices let us monitor prediction drift and all of the drifts that we spoke about earlier.
There are enough data pipeline checks, shadow deployments are also happening, and the incident response is catered from an ML perspective.
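As a hedged sketch of what one of those data pipeline checks might look like, here is a simple null-rate gate on an incoming feature batch (the 1% threshold is an invented example):

```python
def null_rate_check(batch: list[dict], feature: str, max_null_rate: float = 0.01) -> bool:
    """Pass only if the share of missing values for `feature` stays under the threshold."""
    nulls = sum(1 for row in batch if row.get(feature) is None)
    return nulls / len(batch) <= max_null_rate

# Illustrative batch: one of fifty rows is missing the price feature (2% > 1%).
batch = [{"price": 100.0}] * 49 + [{"price": None}]
print(null_rate_check(batch, "price"))  # False -> block or alert before inference
```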
Now, what are the tools that one can use for detecting these?
There are open source tools that could be used, like Evidently, while the major cloud providers have also rolled out tools: Vertex AI is a popular one from Google, and Amazon SageMaker is another one that is quite popular amongst AWS customers.
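As one hedged example, Evidently's report API (as found in its 0.4.x releases; the library's interface has changed across versions) can compare a reference dataset against current traffic:

```python
# A sketch using Evidently's 0.4.x-era report API; the interface may
# differ in other versions of the library.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.DataFrame({"price": [100, 120, 110], "rating": [4.2, 4.5, 4.4]})
current = pd.DataFrame({"price": [350, 400, 420], "rating": [4.1, 4.3, 4.6]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")  # feed into dashboards or review
```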
Let's look at why zero downtime matters.
Zero downtime matters for any distributed system service, but especially for real-time or customer-facing systems.
Any monitoring or analysis step that disrupts availability can lead to significant business impact and revenue loss.
So zero downtime is non-negotiable.
Now, as we spoke about earlier, when we are deploying a new model, we need to send light traffic to a shadow model, compare its decisions with production, and flip over only when we are fairly confident.
Having said that, all the observability monitoring should happen asynchronously to the inferencing path.
It should not come in the way of the critical path; otherwise, it'll impact the latency.
And if at all the model is not predicting, or is giving wrong predictions, there should be graceful degradation.
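A minimal sketch of those three ideas together, with hypothetical `prod_model`, `shadow_model`, and `monitor` objects standing in for real services:

```python
import threading

def serve(request_features: dict, prod_model, shadow_model, monitor) -> float:
    """Serve from production; shadow-compare and log off the critical path."""
    try:
        prod_prediction = prod_model.predict(request_features)
    except Exception:
        # Graceful degradation: fall back to a safe default instead of failing.
        return monitor.fallback_prediction(request_features)

    def compare_in_background() -> None:
        shadow_prediction = shadow_model.predict(request_features)
        monitor.record(request_features, prod_prediction, shadow_prediction)

    # Asynchronous: the customer never waits on the shadow model or on logging.
    threading.Thread(target=compare_in_background, daemon=True).start()
    return prod_prediction
```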
Last but not least, the observability has to be part of every CI/CD pipeline in production.
Wrapping up with my very last thoughts: ML systems fail differently.
They rot silently; they don't always crash, but they decay.
As SREs, we must evolve our observability mindset to include these nuances and build systems that don't just stay up, but stay smart.
Thank you for your time.