Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Exploring the math of thresholds

Abstract

Thresholds are easy, until they are not. They're a complex balance between business context, engineering practice, and developer productivity. In this talk, walk with me on a journey to understand the context behind thresholds and how to manage the evolving trends of your applications.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Emmanuel Bakare, and today I'll be taking you through exploring the math of thresholds. Thresholds are bounds, upper and lower bounds, and I'll be showing you numerical methods, iterative methods from mathematics, for predicting where a threshold should be. So who am I? I'm a Senior DevOps Engineer at Twilio, and on the side I build solutions at BaselineHQ Cloud. One of them is CostCraft, where we use some of these methods for things like right-sizing requests and limits in Kubernetes, automatic right-sizing. You can check out CostCraft at BaselineHQ Cloud if you like it.

What is a threshold? A threshold is a boundary at a certain time that defines a minimum or a maximum, so it can be either an upper bound or a lower bound. Metrics define the values that you then use to check whether the threshold is exceeded or not at time T, and if it stays exceeded for an assessment period, that's usually when you take action. Here I'm showing an anomaly band, which is an upper and lower bound drawn over a scatter plot of the values; this is a predictive model. As you may notice, no two points of the band are the same, which is very different from what you would usually have defined as a threshold, because the one we're most familiar with is the static threshold: just a classic line through the noise. Static thresholds are fixed, easy to estimate and easy to diagnose; you can look at one straight away and know whether a value is inside or outside it. But they're also prone to noise, they can fail, and they don't adjust to changes in your system, things like organic growth, etc. They're also single-variate: you monitor either CPU or memory, whereas with dynamic thresholds you can use CPU and memory together to decide that the system needs to scale. Autoscaling models usually employ dynamic thresholds, and dynamic thresholds change with the metric observations. They're boundary-based, upper or lower, and in the case of the anomaly bands I just showed you, they're always causal: they rely on past and present data and they're feedback-loop driven. That's basically what causal means here, control systems and that kind of thing. They adapt well, but they're complex to align; you can't simply declare where the boundary is going to be, and they need more effort at the start because you have to understand how your system is modeled. In return, they're less prone to false positives.

So, definitions done. What does the process look like for static thresholds? You start off with a bunch of data points, so first you filter out the outliers, clean up the data, that kind of pre-processing if you want to. It's not strictly required, but you can do it with things like IQRs and Z-scores, just to see how spread out your data is. Standard deviation also comes in here, but that's inside the Z-score. The second thing you do is define your aggregate statistic: out of all the data points present, how do we reduce them to one value? You can do that with percentiles, or with a linear aggregate like the mean.
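To make that pipeline concrete, here is a minimal sketch (not from the talk) of filtering outliers with a Z-score and then collapsing the cleaned data to a single value with a percentile. The function name, the Z-score cutoff of 3, and the synthetic latency data are my own illustrative assumptions.

```python
# A minimal sketch of the static-threshold pipeline: filter outliers with a
# Z-score, then collapse the cleaned data to one value with a percentile.
import numpy as np

def static_threshold(samples, percentile=99.0, z_cutoff=3.0):
    samples = np.asarray(samples, dtype=float)
    # Z-score outlier filter: drop points more than z_cutoff standard
    # deviations from the mean before aggregating.
    z = (samples - samples.mean()) / samples.std()
    cleaned = samples[np.abs(z) < z_cutoff]
    # Aggregate the remaining points to one value (P99 here; a mean or max
    # would slot in the same way).
    return np.percentile(cleaned, percentile)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    latencies = rng.normal(200, 25, size=1_000)   # "normal" traffic
    latencies[::100] = 2_000                      # a few injected outliers
    print(f"static threshold (P99): {static_threshold(latencies):.1f} ms")
```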
The mean is just an average; there are geometric means and so on, but in this case I mean a normal linear mean. Once you've done all of that, you can reduce your mean error using something like the law of large numbers, which I'll explain in a bit. That's the entire process for static thresholds with numerical methods.

Why do we need to filter for outliers? Because static thresholds don't adapt. That means that from the start you need to be sure you've captured the normal band of your system, one that isn't really influenced by those outliers, so that the skewness of your distribution is reasonable and the kurtosis is reasonably fair, that nice bell-curve distribution. Outlier filtering basically guarantees that you reduce the chance of a false negative or a false positive, and you can keep going from there.

Interquartile ranges, as we've spoken about, make that filtering process very simple, and I recommend starting there before moving on to more involved methods. The idea is that you take the first and third quartiles and subtract them, which gives you the interquartile range, and then you define the minimum and maximum fences as the first quartile minus 1.5 times the interquartile range and the third quartile plus 1.5 times the interquartile range. It helps you stabilize those numbers before you go on to predicting where they should be.
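Here is a minimal sketch of those interquartile-range fences, assuming the standard 1.5 × IQR rule; the function names and the toy data are mine, not the speaker's.

```python
# Interquartile-range fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR.
import numpy as np

def iqr_fences(samples):
    samples = np.asarray(samples, dtype=float)
    q1, q3 = np.percentile(samples, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # minimum fence
    upper = q3 + 1.5 * iqr   # maximum fence
    return lower, upper

def drop_outliers(samples):
    lower, upper = iqr_fences(samples)
    samples = np.asarray(samples, dtype=float)
    return samples[(samples >= lower) & (samples <= upper)]

if __name__ == "__main__":
    data = [1, 1, 2, 1, 1, 1, 9]   # the 9 is the outlier
    print(iqr_fences(data))        # the fences exclude the 9
    print(drop_outliers(data))
```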
Now, the law of large numbers, as you've heard me mention, basically says that the average of a lot of results converges towards the true expected value of the experiment. Say you had a lot of data points spread out and you took the P99 of the first five minutes, then of the next five minutes, and so on in a rolling-window fashion. You might get one in the first window, one in the second, two in the third, one in the fourth, one in the fifth; the two is the outlier. But because you have a lot of samples broken down, if you took the average of everything it would converge below two and towards one, and the more samples you add, the higher the probability that it tends towards one, which is your ideal mean. That's the idea of the law of large numbers: it reduces the error.

The way you start is by defining a trend for how the system should behave. In this case it's a cron job that runs three times every minute. It can fail every time, it can pass once, it can pass twice, or it can pass every time. You take those outcomes and define a logic table for those sets of occurrences, from failing all the time to passing all the time, and then you feed that into the trend formula, because it's a static threshold. If the behavior is predefined, you can do this; if it's not, you can fall back on the P99 case I mentioned of just sampling by time. In this case I already know what the value should be at any given time, so I can use that to estimate the trend and generate my own dataset for it. Then you run multiple iterations of that, using percentiles or whichever aggregate function we've already spoken about: mean, max, or a percentile.

As you begin to average that trend, you converge towards the value that meets the threshold, and you can check the convergence using standard deviation. Here is a sample of what that looks like when I run it with one job and with ten: one in the sense that there's only one cron job running, and ten in the sense that there are ten cron jobs running those three steps every minute, whether they're health checks or whatever you use to define them. And this is what the random occurrences look like if I pick a P90 versus a P99.999. You'll notice the P90 has higher deviation, which would suit a system with a lot of spikes, one that's very jagged, while the P99.999 suits a system that is very predictable and always available, where you want very strict SLOs. The standard deviation captures this: the standard deviation for P90 is quite high, and as you add more iterations it goes down, which means we're tending towards convergence. This is where the law of large numbers plays a big part in running experiments like this, because you can define a way to aggregate that value towards the threshold, and also analyze whether you're getting closer to the actual mean you want rather than drifting away from it; in the latter case you would see the standard deviation go up, like here where it decreases at 100 and 200 iterations and then, at 250 with ten jobs, increases again. With more iterations, P99.999 obviously has more variance in the raw values, but its standard deviation is quite low, so you can use that to judge against your expectations.

Here's a graph of what the experiment looks like. You'll notice the P90 threshold is close enough at 100 to 200 iterations, and it shows what the range of values would be, so you can take all this data and define what the threshold value ought to be, and where a heuristic estimate would fit. This is basically how you apply the law of large numbers, and from it you can define that static threshold. So that's what the law of large numbers gives you: a simple method for approximating a static threshold to predefined behavior. Even if the behavior isn't predefined, you can use a percentile like the P99 to aggregate over that timeline. It gives you a clear method for deciding where the value should be at any given point in time, and you can define it with your aggregate function of choice, whether that's a sum, a percentile or an average. Convergence across multiple data points implies the limit that captures where the threshold should be, so you don't have to guess; you've proven it out over a long period of time with multiple iterations, experiments and checks.
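As a rough, hedged re-creation of that experiment (not the speaker's actual code or numbers): simulate a cron job running three checks per minute, take a percentile of the failure counts across more and more iterations, and watch the spread of the estimate shrink as the law of large numbers takes hold. The 2% failure probability, the repeat counts, and all names are assumptions.

```python
# Law-of-large-numbers style convergence check for a percentile threshold.
import numpy as np

rng = np.random.default_rng(7)

def run_experiment(iterations, jobs=1, fail_p=0.02, percentile=90):
    # Each iteration: `jobs` cron jobs, three runs per minute, count failures,
    # then take a percentile of the failure counts as the threshold estimate.
    failures = rng.binomial(n=3 * jobs, p=fail_p, size=iterations)
    return np.percentile(failures, percentile)

for iterations in (100, 200, 250, 1_000):
    # Repeat the experiment many times to measure how much the percentile
    # estimate itself wobbles (its standard deviation).
    estimates = [run_experiment(iterations, jobs=10) for _ in range(50)]
    print(f"{iterations:>5} iterations: P90 ≈ {np.mean(estimates):.2f}, "
          f"std ≈ {np.std(estimates):.3f}")
```

The printed standard deviation is the convergence signal: if it keeps shrinking as iterations grow, the estimate is settling on a usable static threshold.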
So static thresholds are fairly easy. What about dynamic thresholds? How do we get there? Dynamic thresholding on platforms is usually called anomaly detection, but it's the same thing: you want to capture whether the behavior of the system is outside what a predictive model says it should be. Again, it's dynamic, not fixed. You can build these on supervised models, which takes you into machine learning or statistical models, so things like logistic regression fall in here. You also have a priori algorithms, things like clustering and unsupervised learning, and then there's curve fitting, which is essentially heuristic: you say this is what I feel the trend should be, find parameter values that match it, and then judge the fit against whatever behavior the system actually exhibits. I'll talk about curve fitting because it's the one I've done and can share examples of; I'll touch on the other methods in a bit.

The basics of curve fitting are that you have a series that exhibits some wave-like form, and you're trying to find a trend that fits it. You can say it's a sine wave, a tan wave, a cosine wave, whichever way you want to express the waveform; even if it's exponential, it works the same way. So you start off like this: here I've generated random data showing hourly sales in millions. It goes up and down across the hours; all of it is randomly distributed, but it still follows a sinusoidal pattern. Then we find a waveform that fits it. Here you can see what the ideal waveform would be, and here you can see the fitted curve; the deviation is reasonably low and the fit is around 70%.

So how do we get there? We need a way to estimate whether the fit is good or not. You can use the chi-square goodness of fit, or standard deviation, which feeds into the chi-square goodness of fit. Chi-square is better because it also takes the number of samples into consideration, whereas standard deviation just checks how far the samples deviate from the mean. You get there by taking A, B and C, where A is the amplitude, B is the phase or frequency multiplier, and C is the constant offset. You iterate across different values of A, B and C to see which combination fits better, checking the error, or using the fit formula to judge how well it actually matches the curve you defined. For curve fitting the variables are bounded: you can bound them based on, say, the amplitude of the curve, how high it goes, and once A looks like it's tending towards a fit, you can vary B and C to see whether the fit improves. That's the regression-style, iterative model for error correction. It's very simple and straightforward.
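Here is a minimal sketch of that curve-fitting approach, modelling the series as A·sin(B·t) + C and scoring the result with a chi-square-style statistic. The synthetic "hourly sales" data, the parameter bounds, and the use of scipy.optimize.curve_fit as the iterative solver are my assumptions, not the speaker's implementation.

```python
# Fit A*sin(B*t) + C to a noisy periodic series and score the fit.
import numpy as np
from scipy.optimize import curve_fit

def wave(t, A, B, C):
    # A = amplitude, B = frequency/phase multiplier, C = constant offset.
    return A * np.sin(B * t) + C

rng = np.random.default_rng(0)
hours = np.arange(0, 48, dtype=float)
sales = 3.0 * np.sin(2 * np.pi * hours / 24) + 10.0 + rng.normal(0, 0.8, hours.size)

# Bound the parameters (e.g. the amplitude cannot exceed the observed swing),
# then let the solver iterate A, B, C towards the best fit.
popt, _ = curve_fit(
    wave, hours, sales,
    p0=[2.0, 0.25, float(sales.mean())],
    bounds=([0.1, 0.05, 0.0], [float(np.ptp(sales)), 1.0, 2 * float(sales.mean())]),
)
fitted = wave(hours, *popt)

# A simple chi-square-style goodness-of-fit score: sum of squared residuals
# scaled by the expected values, smaller is better.
chi_square = np.sum((sales - fitted) ** 2 / fitted)
print(f"A={popt[0]:.2f}, B={popt[1]:.3f}, C={popt[2]:.2f}, chi²={chi_square:.2f}")
```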
There are also other methods for dynamic thresholds on the machine learning side, whether you're building anomaly bands or, as in this case, a single curve trying to approximate what the threshold should be. You can use logistic regression, KNNs, naive Bayes; support vector machines also fit in here. For these you need classified, properly labelled data, so they sit on the supervised side. For the unsupervised case, you can use isolation forests, and the anomaly bands I showed before, with the upper and lower fits, you can also build with density-based scans (DBSCAN). That example used PCA for filtering and alignment and those sorts of bits, so it's also an example of doing anomaly detection on randomly scattered data with unsupervised learning methods like DBSCAN and PCA.

The summary of all of this is that with numerical methods you can identify the business opportunities, like the cases I've mentioned with dynamic autoscaling, or in the non-technical sense, just trying to see whether user orders have increased so you can scale operations. You can use these methods to predict what a value should be, detect when it lies outside that, and then take action. Over time, organic growth and past data will influence these models, so they'll refit themselves, but at least now you can capture those trends without getting paged every minute or two.

So, to wrap up: confirm whether you need a dynamic or a static threshold. Determine what the noise looks like; you can use scatter plots, clustering methods, IQR, Z-scores; there are various ways to check whether your dataset is fit for these methods or not. From there, just iterate until the error is minimized. A lot of these are iterative methods, Monte Carlo and the like, however you choose to define them. These are just methods; they don't dictate the approach, but they're a good way to think about thresholds and statistics when you're modeling monitoring systems, or just trying to improve the way you iterate anywhere. Thank you all. Again, you don't need to use all of these; sometimes eyeballing might be worth more than all of it, but if your systems behave in very weird ways, it might be worth digging into the statistics to save you months of headache. Thank you all. Go forward and explore thresholds. Thank you.
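As a hedged sketch of the unsupervised route mentioned above, PCA followed by a density-based scan (DBSCAN) that labels sparse points as anomalies: the two synthetic metrics and the eps/min_samples values are illustrative assumptions, not the speaker's configuration.

```python
# Unsupervised anomaly detection: project metrics with PCA, then cluster
# with DBSCAN; points labelled -1 fall outside any dense region.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two correlated metrics (say CPU and memory) plus a handful of anomalies.
normal = rng.multivariate_normal([50, 60], [[20, 15], [15, 20]], size=500)
anomalies = rng.uniform(0, 100, size=(10, 2))
metrics = np.vstack([normal, anomalies])

# PCA first, as a filtering/alignment step, then density-based clustering.
projected = PCA(n_components=2).fit_transform(metrics)
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(projected)

outliers = metrics[labels == -1]
print(f"flagged {len(outliers)} of {len(metrics)} points as anomalous")
```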
...

Emmanuel Bakare

Senior DevOps Engineer @ Twilio

Emmanuel Bakare's LinkedIn account


