Conf42 Machine Learning 2022 - Online

How to detect silent failures in ML models

Abstract

The objective of the talk is to build an understanding of why and how you need to monitor ML in production.

We’ll cover the taxonomy of failures based on use cases, data, characteristics of the systems they interact with, and human involvement. You’ll learn the tools (both statistical and algorithmic) used in dealing with these failures, their applications, and their limits. The fact is, the world changes: data drift and concept drift lead to model degradation and losses for the business. We’ll leverage real-life use cases to showcase the importance of ML monitoring in one of the biggest industries. Finally, we’ll show you how to address this by monitoring ML performance.

Summary

  • Today we're going to discuss two main reasons why machine learning models can fail in production: concept drift and data drift. We'll discuss how they can impact the performance of your models and to what extent they normally do. And we'll discuss why direct performance calculation is rarely possible and why we need to estimate performance instead.
  • Just monitoring data drift is not enough, because data drift does not always lead to a performance drop. Performance is the best proxy we have for the business impact, and the only way we can quantify it easily in development is to look at technical metrics.
  • We want to figure out what features, sets of features, or segments of the data can be responsible for a drop in performance. The simplest and most interpretable option is of course univariate data drift. To alleviate its shortcomings, we can look at multivariate data drift detection approaches.
  • Data drift and concept drift are the two main reasons why performance can drop in machine learning models deployed in production. Performance estimation in the absence of target data is key to machine learning monitoring. Visit our website or add me on LinkedIn to learn more about detecting silent machine learning failures.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. Today we're going to talk about detecting silent machine learning model failures in models that are already deployed in production, so they are making business decisions. It's going to serve as a sort of introduction to machine learning monitoring. So let's get started. On the agenda today we have three things. First, we're going to discuss two main reasons why machine learning models can fail in production. These reasons are concept drift and data drift. We will define them, and we'll define how they can impact the performance of your machine learning models and to what extent they normally impact these models. Then we're going to talk about performance calculation and estimation. We'll discuss why direct calculation is rarely possible, why access to target data after you make predictions is often limited, and why we need to estimate performance instead. And then we're going to come back to data and concept drift, and we'll try to find the link between a drop in performance and potential reasons for it that we can find in data drift. Before we start doing that, let's set the stage with an example use case that most of you should already be familiar with. We're going to talk about a simple binary classification use case where we're trying to predict whether a client is going to default on their loan. So we take credit scores and customer information, and we're trying to predict whether a customer is going to default on the loan. Our target is going to be non-payment within one year, which means we'll have to wait for an entire year after making a prediction before we can get access to the ground truth data, before we can simply calculate the performance. And we're going to use a technical metric to evaluate the quality of our model, and for that we're going to use ROC AUC. Before delving into the details of data and concept drift, let's do a quick refresh and define again what we are actually trying to do when we train a machine learning model. There exists some true pattern in reality, not even in the data, but just in reality, something that relates one variable to another, or multiple variables to some other variable. In this example here, we're going to have just one input feature, x, and we're going to have a relative frequency of the positive and negative class. As x increases, some function maps this increase in x to the relative frequency: the higher the value of x, the less likely it is that the data point belongs to the positive class. Then what we'll do is sample from this pattern, from this population, and this sampled dataset is all the data we actually have access to. It can be our training, validation, and testing data, for example. That is what we take, and then we try to find this true underlying pattern by using machine learning algorithms. Let's say that we capture this pattern in some way, maybe imperfectly, maybe perfectly. Most likely it's not going to be a perfect capture of the true pattern, but it's going to be close enough. And now let's examine what happens if we experience data drift. Data drift can be defined as a change in the sampling method. The true pattern that exists in reality remains the same, but what changes is how we sample the data. So the data is going to be different, the input to the model is going to be different, but the underlying pattern between the model inputs and the targets is going to remain the same.
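To make that intuition concrete, here is a minimal, hypothetical sketch in plain NumPy (not the library discussed later in the talk). The true pattern p(y|x) is fixed; the only thing that changes between the reference period and production is how x is sampled.

```python
import numpy as np

rng = np.random.default_rng(42)

def p_positive(x):
    # The fixed "true pattern": the higher x, the less likely the positive class.
    return 1.0 / (1.0 + np.exp(x))

def sample(mean, n=10_000):
    # Data drift is a change in how we sample x; p(y|x) stays exactly the same.
    x = rng.normal(loc=mean, scale=1.0, size=n)
    y = rng.binomial(1, p_positive(x))
    return x, y

x_ref, y_ref = sample(mean=0.0)    # reference period the model was trained on
x_prod, y_prod = sample(mean=2.0)  # production inputs have shifted to higher x

print(x_ref.mean(), x_prod.mean())  # the input distribution has clearly moved
print(y_ref.mean(), y_prod.mean())  # the class balance changes as a side effect
```

Whether a shift like this actually hurts the model depends on how well the model performs in the region the data moved into, which is exactly the point made next.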
Now that we've built a bit of intuition about what data drift is, we can formally define it: it's a change in the joint model input distribution. So again, it's all about the model inputs. It has nothing to do with the targets, though of course it does affect the model outputs. And just to illustrate: if we look at the data before and after the data drift, we'll see that the class balance might actually change, because the class balance normally depends on where in the input space the data lives. So that's data drift. Data drift can, but does not have to, impact your model performance. If your data moves from one region where the model is performing very well to another region where the model is performing equally well, we do not expect to see a significant drop in performance. Now let's define concept drift. Unlike with data drift, what changes here is the true pattern. We sample our data in exactly the same way, assuming we have pure concept drift without a data drift component, but the true pattern changes. The actual underlying pattern, the underlying function we're trying to find between the model inputs and the target, changes. So our data, the model inputs, can look more or less the same, or even exactly the same, but how they map to the target will change. We can visualize it in a slightly different way: imagine that we have a two-dimensional dataset with training data and production data. The data looks very similar, the data points are basically in the same regions in space, but the class boundary is going to be completely different. That also means that if your use case experiences strong concept drift, your models are almost guaranteed to fail, because the pattern they have learned is no longer valid. Now that we've defined concept drift and data drift and know how they can potentially impact performance, let's talk about why we need to worry about performance at all. First and foremost, just monitoring data drift is not enough, because data drift does not always lead to a performance drop. And if you were to do it anyway: data changes constantly, especially if you have a lot of features, and if you look at data drift from a feature-by-feature perspective, you will get a lot of alerts, so many that they become basically useless. So because data drift does not always mean a performance drop, we cannot just monitor data drift. Another reason why we should monitor performance directly is that this is what we've been directly optimizing for in training: we chose the model that has the best performance, however we define it. In our example it's ROC AUC; it can be precision or recall for classification, root mean squared error for regression, et cetera. A final reason we need to look at performance is that it is the best proxy we have for the business impact. When we develop machine learning use cases in an industry setting, what we care about is creating value, creating business impact. And the only way we can quantify that easily in development and within a technical setting is to look at these technical metrics of machine learning performance.
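Before moving on to how performance is actually measured, here is a small companion sketch to the one above (again hypothetical, plain NumPy and scikit-learn) that contrasts concept drift with data drift: the inputs are sampled identically throughout, but the mapping from inputs to target flips, so a model trained on the old concept fails badly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 10_000
x = rng.normal(size=(n, 1))                     # inputs sampled the same way throughout

y_old = rng.binomial(1, sigmoid(-3 * x[:, 0]))  # original concept: high x -> negative class
y_new = rng.binomial(1, sigmoid(+3 * x[:, 0]))  # concept drift: the relationship inverts

model = LogisticRegression().fit(x, y_old)      # the model learns the old pattern well
print("accuracy on old concept:", model.score(x, y_old))       # roughly 0.9
print("accuracy after concept drift:", model.score(x, y_new))  # drops to roughly 0.1
```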
So now that we know we need to measure or monitor performance, how do we do this? The easiest thing would be to take the ground truth after we've obtained it, compare it with our prediction, and see what the difference is. Just measure it, literally calculate the performance. However, this is very rarely possible. Why is that? First and foremost, in some of these use cases the data is delayed. Take our example again: we'll have to wait one year, sometimes maybe even longer, to get the actual target data. So our model will always be operating for a year without us knowing its performance for a fact, which means we are exposed to a huge risk of giving loans to people who should not have received them and generally mispricing these loans. That is a risk that is not acceptable and that you should always try to minimize. Another thing is that even when we do get the labels, these labels are not complete. What I mean by that is that we do know whether the people who received a loan paid it back or not. But what we don't know is whether the people who did not receive a loan would have paid it back, had they been given one. And the last reason, or example, where we do not have the ground truth is automation use cases. These are the use cases where we try to automate some menial human task, such as document classification or most computer vision tasks, where we want to replace or augment a human. In these cases, we train a model on data that was manually processed by humans, and now we want to replace them. That means we will never get ground truth for all the data; that would literally defeat the purpose of developing such an algorithm. We can, of course, do some kind of spot checking, where we take maybe 1% of the data and have it double-checked by a human, but that will not give us a full picture of the algorithm's performance. So that means we need to estimate the performance of a machine learning model, and we arrive at performance estimation. This is possibly the most interesting part of machine learning monitoring and detecting silent model failures: you can indeed estimate the performance, and specifically you can fully estimate the impact of data drift on the performance. I'm going to give you a high-level intuition of how this algorithm works. It's an algorithm that we developed; it's part of our open source library, so feel free to check it out. What we're going to do is look at the confidence of the model. In our binary classification example, this is just going to be the model score, a number between zero and one. If this number is close to either one or zero, it means the model is very confident. If the number is close to 0.5, it means the model is not confident. The first thing we need to do is confirm that these scores actually represent probabilities. They don't always, and sometimes we need to do probability calibration to make sure that these scores are turned into something that actually represents real probabilities, so that if the model outputs 0.6, there is actually a 60% chance that the data point belongs to class one. Once we have properly calibrated probabilities, we're going to look at the expected performance of the model as a function of this uncertainty.
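To make that concrete, here is a minimal, hypothetical sketch of the intuition (not the library's actual implementation, and the `estimate_metrics` helper is made up for illustration): once the scores are calibrated, each prediction carries its own probability of being correct, so we can build an expected confusion matrix and derive expected metrics without seeing a single label.

```python
import numpy as np

def estimate_metrics(proba, threshold=0.5):
    """Estimate expected classification metrics from calibrated scores alone."""
    proba = np.asarray(proba, dtype=float)
    pred_pos = proba >= threshold

    # Expected confusion-matrix entries under the calibration assumption:
    # a prediction with score 0.9 is a true positive with probability 0.9.
    tp = proba[pred_pos].sum()
    fp = (1.0 - proba[pred_pos]).sum()
    fn = proba[~pred_pos].sum()
    tn = (1.0 - proba[~pred_pos]).sum()

    return {
        "accuracy": (tp + tn) / len(proba),
        "precision": tp / (tp + fp) if (tp + fp) > 0 else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) > 0 else float("nan"),
    }

rng = np.random.default_rng(3)
confident = np.r_[rng.uniform(0.0, 0.1, 500), rng.uniform(0.9, 1.0, 500)]
uncertain = rng.uniform(0.3, 0.7, 1_000)   # scores drifted towards the class boundary

print(estimate_metrics(confident))  # high expected accuracy / precision / recall
print(estimate_metrics(uncertain))  # noticeably lower expected performance
```

Threshold-dependent metrics like ROC AUC can be estimated along similar lines by building the expected confusion matrix at every threshold, which is roughly what the next part of the talk alludes to.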
So imagine that in training we have the picture on the left here, and in this picture you'll see that most of the data points are in the high-confidence regions. Think of it as butterfly wings: one wing is a concentration of the negative class, the other a concentration of the positive class, and the body of the butterfly is the class boundary, where the algorithm is really uncertain whether a point is positive or negative. Then imagine that some time passes and we see significant data drift, and this data drift has a specific form: the data moves from high-confidence regions to low-confidence regions. That means we would expect the model to perform worse, and the model itself expects to perform worse. We can then take that and convert it into an expected confusion matrix, and from there we can calculate the expected value of basically any metric you want, be it accuracy, precision, et cetera. I will not go into the details of how you actually go about it, as that would take too much time, but do feel free to check our docs for more explanation. So now we have estimated the performance. What is the next step? The next step is trying to identify issues, to see if there are failures. And if we see that there are issues with performance, we want to figure out why they happened. Just before we jump there, let's look at an example of how such a performance estimation algorithm performs on a real-life dataset. For the purposes of this presentation, we took the California housing dataset that's available basically everywhere, in scikit-learn among other places. We turned it into a classification problem and trained a very simple algorithm, I think it was a random forest, on the training part of the dataset, and then we evaluated it in production. And you can see that the algorithm I explained, which detects the impact of data drift on performance, works quite well: the estimated ROC AUC is very, very close to the real ROC AUC. So now let's jump into figuring out why models can fail, and for that we go back to data drift. Here we basically want to figure out what features, sets of features, or segments of the data can be responsible for the drop in performance. We can do it in two ways. First, we can look at the data feature by feature and see which features change significantly; this is univariate data drift. Or we can look at multivariate data drift, where we look at maybe all features at once, or some subsets of features, and try to figure out whether there is a significant change in the joint distribution of these features. The simplest and most interpretable option is of course univariate data drift. To detect it, we can use simple statistical tests such as the Kolmogorov-Smirnov test or the chi-square test. We take a reference dataset for which we know that performance is stable and high and all the data looks like it should, which could be, for example, our test set or the first month of production, and we compare it to our analysis dataset, which is the part of the data for which performance has decreased, and we see if there are any significant changes in the data.
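As a minimal sketch of those univariate checks on hypothetical loan-application features (the feature names and distributions are made up for illustration), SciPy's two-sample tests do the job: a Kolmogorov-Smirnov test for a continuous feature and a chi-square test on category counts for a categorical one, comparing the reference period against the analysis period.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical reference (known-good) and analysis (possibly drifted) periods.
ref_score = rng.normal(650, 50, 5_000)   # e.g. a credit-score feature
ana_score = rng.normal(620, 60, 5_000)   # mean and spread have shifted

categories = ["auto", "home", "personal"]
ref_type = rng.choice(categories, 5_000, p=[0.5, 0.3, 0.2])
ana_type = rng.choice(categories, 5_000, p=[0.3, 0.4, 0.3])

# Continuous feature: two-sample Kolmogorov-Smirnov test on the raw values.
d_stat, p_value = stats.ks_2samp(ref_score, ana_score)
print(f"KS D statistic = {d_stat:.3f}, p-value = {p_value:.2e}")

# Categorical feature: chi-square test on the contingency table of counts.
counts = np.array([[np.sum(ref_type == c) for c in categories],
                   [np.sum(ana_type == c) for c in categories]])
chi2, p_value, dof, expected = stats.chi2_contingency(counts)
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.2e}")
```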
These tests do have a few shortcomings, though. The first one is that if you have hundreds of features, you will get a lot of false positives in absolute terms. If you go into thousands of features, you will simply not be able to go through all these false positives, and you will not be able to find the real issues with the data. The other shortcoming of these approaches is that they fail to find more subtle types of data drift: if we only see a shift in the correlation between features, or some change in the internal data structure that is not really visible from univariate changes, these tests will fail to detect that kind of data drift. To alleviate this problem, we can look at multivariate data drift detection approaches, and here I'm going to present one that we developed, which is based on data reconstruction. What we're going to do is take our original data, all the features that we have, compress it by projecting it to a lower-dimensional latent space, and then do the inverse transform to reconstruct the data. Then we compare the original data with the reconstructed data and see what the compression loss is, how strongly the data differs. We can use basically any dimensionality reduction or compression algorithm that is fitted on the data, any algorithm that learns the internal structure of the data. Let's delve deeper into that. When it comes to the choice of this algorithm, there are a few requirements. First, as I already mentioned, the encoding needs to learn the internal structure of the data, because we want to track the changes in that structure. That's the entire underlying intuition: we want to track how the internal structure of the data is changing, and we can measure that using the displacement of points between the original space and the reconstructed space. Second, the encoding needs to reduce the dimensionality of the data, because we want to compress the data in such a way that the internal structure has to be learned in order to perform the compression well. Then, of course, the inverse transformation needs to be possible, because we want to reconstruct the data; that one is obvious. And there's one more important requirement: the latent structure needs to map in a stable way to the original space. If you took an autoencoder that is not variational, a traditional one, you would see that the latent space can map in a completely unpredictable way to the original space, which means the reconstruction error is not going to be a reliable metric for measuring the change in the data structure. So what we do is take, let's say, PCA, reduce the number of components, keep the top n components so that we retain, say, 95% of the variance in the data, and then do the inverse transformation. We measure the displacement of points before and after the reconstruction, and to do that we can use any distance metric; we're using the Euclidean distance between the original and reconstructed points, and we get the reconstruction error. Then we only need to keep track of one metric to see whether there is data drift.
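Here is a minimal, hypothetical sketch of that reconstruction-error approach using scikit-learn's PCA (not the library's exact implementation; the toy data and correlation values are made up). The two features keep the same marginal distributions but flip their correlation, so univariate tests would stay quiet while the reconstruction error spikes, which is exactly the kind of case discussed next.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

def make_data(n, correlation):
    # Two features whose marginals stay N(0, 1); only their correlation changes.
    cov = [[1.0, correlation], [correlation, 1.0]]
    return rng.multivariate_normal([0.0, 0.0], cov, size=n)

reference = make_data(5_000, correlation=0.9)   # period where performance is known to be good
analysis = make_data(5_000, correlation=-0.9)   # only the internal structure has changed

# Fit the compressor on the reference data only; in practice you would keep
# enough components to retain ~95% of the variance (here: 1 of 2 components).
scaler = StandardScaler().fit(reference)
pca = PCA(n_components=1).fit(scaler.transform(reference))

def reconstruction_error(X):
    latent = pca.transform(scaler.transform(X))                      # compress
    X_hat = scaler.inverse_transform(pca.inverse_transform(latent))  # reconstruct
    return np.linalg.norm(X - X_hat, axis=1).mean()                  # mean Euclidean displacement

print(reconstruction_error(reference))  # baseline error of the lossy compression (roughly 0.25)
print(reconstruction_error(analysis))   # clear spike (roughly 1.1): multivariate drift detected
```

Running the earlier `ks_2samp` check on either feature of this toy data would typically return nothing significant, since both marginals are standard normal in both periods.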
Of course, this approach has a certain shortcoming: it becomes less and less interpretable. If we look at maybe five features at a time and do the data reconstruction, it's still reasonably interpretable. If we look at the entire dataset and perform this kind of multivariate drift detection, it's not going to be interpretable. But we will still know that if there is a drop in performance and we see a spike in our multivariate data drift metric, data drift is responsible, and we'll have to dig deeper to find out what exactly changed in our data that affects performance. Just a few words about how to actually interpret this reconstruction error. We will have some kind of baseline reconstruction error, because this compression is meant to be lossy. As your model is deployed, the reconstruction error can stay roughly constant, which is the perfect-case scenario where we see no drift, or it can increase or decrease. If it increases, the internal structure that was learned by the encoding is no longer appropriate, so the compression doesn't work as well: we see an increase in the reconstruction error and we have data drift. However, if there is a drop in the reconstruction error, that means the internal structure learned by the encoding is even more appropriate to the data than it used to be, so we still see data drift. This case is rare, but it might happen. I want to give you a very quick example of where such an algorithm would be necessary to detect drift and univariate approaches would fail. Let's imagine two very simple datasets: the reference dataset in blue, where we know everything is fine, and the orange analysis dataset, where we see some kind of drop in performance and want to find out why. If we just look at the univariate data drift detection methods, we would see that there is no increase in the D statistic; there are basically no alerts, because from a univariate perspective, if you just look at the x and y axes, these datasets look basically the same. However, if we do our encoding and decoding and measure the distance between the reconstructed and the original space, we will see that there is a strong difference, because the internal structure of the data has significantly changed. So this is the simplest possible example where multivariate drift detection would be absolutely necessary. We're slowly nearing the end of the presentation, so let's summarize. The first thing is that there are two main reasons why performance can drop in machine learning models deployed in production: data drift and concept drift. Data drift does not always lead to a drop in performance, whereas concept drift almost always does. Ideally, we'd like to always calculate the performance of a machine learning model in production to really know whether there are any issues or not. However, this is rarely possible, because production targets are often either delayed, not available at all, or we are dealing with an automation use case where you can only get a very small percentage of them. That means that performance estimation in the absence of target data is key to machine learning monitoring, because it allows us to estimate the current performance of the model and whether we need to be worried, and then we go back to data drift detection to figure out why. So thanks for listening.
And if you'd like to learn more about the topic of detecting silent machine learning failures, visit our website, shoot me an email, add me on LinkedIn, or, most importantly, check out our GitHub. We've just launched our product, and it's publicly available there: you can just pip install the library and use the methods I described in the presentation. So, yeah, that's it. Thank you very much, and I'm happy to talk more on LinkedIn or anywhere else.

Wojtek Kuberski

Co-Founder @ NannyML



