Conf42 Site Reliability Engineering 2022 - Online

Smoke detectors in large scale production systems

Abstract

Static alerting thresholds no longer cut it for modern distributed systems. With production systems scaling rapidly, using static alerting to observe critical systems is a recipe for disaster. Observability tools have recognized this need and provide a way to "magically" catch deviations from normal system behavior instead.

  • What is that magic? What goes into deciding whether a spike or a drop violates a known "good" condition or not?
  • How do we avoid alert fatigue?
  • How do you factor in seasonality: low off-peak hours and high holiday traffic?

The rabbit hole goes deeper than I imagined. As a part of the core data science team at Last9, I ran into scenarios where my assumptions about building anomaly detection engines were shattered and rebuilt with every interaction with production traffic.

In this talk, I will cover:

  • What I learnt when trying to find answers to the above questions.
  • How known theoretical models map to real-world workloads, e.g. streaming services, high-frequency trading applications, etc.
  • The science behind choosing and calculating the right SLOs for different SLIs, sending out early warnings, and measuring and improving leading and lagging indicators of system health.

Summary

  • Fifteen years ago, there were no auto-scaling applications, no Kubernetes clusters and, more importantly, no YAML to write. Today, applications are spread across the world. Newer problems and variants of the older problems keep occurring, breaking systems in ways that prove difficult to diagnose and debug.
  • For an ecommerce site, 1 second is too slow and might lead to frustrated customers. Context pertaining to the nature of the systems is paramount. It is imperative that we have some kind of dynamic alerting in place that gives us information about leading indicators.
  • The size of the time frame under consideration depends on how predictable the traffic pattern is. Detecting loss of metrics is just as important as detecting anomalies in the metrics. Having this algorithmic toolbox at our disposal gives us great power to detect issues faster and resolve them.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Smoke detectors in large scale production systems. Before we begin, let's go back in time 15 years to look at the life of SREs. Well, not SREs and not DevOps. We were called sysadmins back then. Traffic was low, services were few and weekends held promise. There were tools like Nagios and Zabbix which had agents to monitor various resource usages and emitted zero if all was okay, one in case of a warning and two in case of an error. Static thresholds were good enough for these situations. When stuff broke, it broke in a fairly predictable manner. Not to say that when things broke at the time, they were easy to fix. Who wants to configure a RAID array in today's world? Well, definitely not me. The domain was fairly known and the problems repeated enough for us to know what broke and how it broke. Obviously there are quite a few war stories, but the good part is that there were no auto-scaling applications, no Kubernetes clusters, and more importantly, no YAML to write.

Now let's take a look at today and what the future looks like. Applications no longer reside in a single data center in someone's basement. They are spread across the world. We have auto-scaled applications. There are cloud native workloads spanning multiple containers and orchestrators, primarily Kubernetes clusters. There are huge number-crunching data lakes. We have functions as a service, Lambdas and whatnot. It can't be denied that the scope has increased and the domain has exploded. Newer problems and different variants of the older problems keep occurring, breaking the systems in ways that prove very difficult to diagnose and debug.

Along this transition to the new world order, we realized that static thresholds don't cut it anymore. If they were enough, there would be definitive answers to these questions. What is a good number of 5xx errors? It depends on the throughput. If I'm getting a few 5xx errors in a few minutes, that is completely acceptable if my traffic is at the scale of millions. However, if my traffic itself is in the hundreds or even thousands, it is a cause for concern and may even cause the SLOs to fail. How many container restarts in Kubernetes is too many? It depends on the number of desired containers. If the restarts are fairly small in number compared to the number of desired containers, they do not affect the availability of my systems and are of no concern to me. Coming to the last question, how slow is slow enough? Again, it depends on the workload. For an ecommerce site, 1 second is too slow and might lead to frustrated customers. However, in the case of asynchronous workloads, 1 second might be fine if the workload involves crunching data, or might even be considered blazing fast for a video editing pipeline. We can safely assume that context is king. With increasingly complex systems, context pertaining to the nature of the systems is paramount. Owing to the sheer complexity and variety of components involved in modern distributed systems, it has become harder to isolate faults and mitigate them accordingly. There are no sitting ducks anymore. The ducks are armed and dangerous.
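To make the throughput point concrete, here is a minimal sketch of what a throughput-relative check could look like. The function name and the 0.1% cutoff are illustrative assumptions for this example, not anything from an actual product: the idea is simply to alert on the ratio of 5xx responses rather than on a static count.

```python
# Illustrative sketch only: alert on the *ratio* of 5xx responses rather than a
# static count. The function name and the 0.1% cutoff are made up for this example.

def should_alert_on_5xx(error_count: int, total_requests: int,
                        max_error_ratio: float = 0.001) -> bool:
    """Return True when the share of 5xx responses exceeds the allowed ratio."""
    if total_requests == 0:
        return False  # no traffic at all; a separate loss-of-metrics check covers this
    return error_count / total_requests > max_error_ratio

# A handful of errors is noise at millions of requests...
print(should_alert_on_5xx(error_count=50, total_requests=2_000_000))  # False
# ...but the same count is alarming when traffic is only in the thousands.
print(should_alert_on_5xx(error_count=50, total_requests=3_000))      # True
```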
However, one might argue that since services have to abide by SLAs, aren't SLOs enough? Not really. SLOs are lagging indicators of system health. They indicate system performance in the past, not the future. As a consequence, we should be more concerned about leading indicators of system performance. For example, my car getting washed in the rain outside is a lagging indicator, as it can only happen after it has rained. However, cloudy weather and lightning are leading indicators that it may rain. It might not always rain, but there is a pretty good chance that it might.

SLOs are primarily of two kinds, request-based and windowed SLOs. Request-based SLOs perform some aggregation along the lines of good requests versus the total number of requests. For example, while computing the availability SLO for a compliance period, we would simply count the total number of requests and the total failed requests. However, consider this scenario: during a holiday sale on a shopping site, the system services 99% of requests successfully during the day, but during a 30-minute window all requests fail, and consequently most of the affected customers get upset and do not come back to the site. A request-based SLO of 99% would look unaffected in this case. As a result, it can be said that request-based SLOs give us no indication of consistent performance throughout the SLO window. They only serve as an indicator of overall performance. To circumvent the issues above, we must use windowed SLOs. A windowed SLO is along the lines of: a certain percentage of time intervals should satisfy a certain criterion. For example, in the past 24 hours, at least 99% of 1-minute intervals should have a success ratio of 95% or above.

Now, the question that arises is what kind of SLO should be set on what kind of service. Consider a service that caters to payments. The service only cares about successful payments. A request-based SLO would be ideal in this scenario; a sample objective could be that over the last seven days, 99.9% of the total payments should be successful. On the other hand, a streaming service would care mostly about users being uninterrupted over the weekend so that they can continue binge watching their shows. In such a scenario, window-based SLOs are ideal. A sample objective could be that over the last seven days, 99% of the time the server should have served reasonably successful 15-minute intervals, where an interval counts as successful if 95% of the users did not receive an error.
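As a rough illustration of the difference between the two SLO styles, the sketch below evaluates a request-based SLO and a windowed SLO over the same day of per-minute counts. The data layout, thresholds and the outage scenario are assumptions made for this example, not the speaker's actual tooling.

```python
# Sketch: request-based vs. windowed SLO over per-minute (total, failed) counts.
# The data layout and the 30-minute outage scenario are assumed for illustration.

def request_based_slo(minutes, target=0.99):
    """Overall good/total ratio across the whole compliance period."""
    total = sum(t for t, _ in minutes)
    failed = sum(f for _, f in minutes)
    ratio = (total - failed) / total if total else 1.0
    return ratio >= target, round(ratio, 4)

def windowed_slo(minutes, interval_target=0.95, window_target=0.99):
    """Share of 1-minute intervals whose own success ratio meets interval_target."""
    good = sum(1 for total, failed in minutes
               if ((total - failed) / total if total else 1.0) >= interval_target)
    share = good / len(minutes) if minutes else 1.0
    return share >= window_target, round(share, 4)

# A day of healthy traffic, except a 30-minute outage where every request fails
# (traffic also drops during the outage, so the overall ratio barely moves).
day = [(1000, 1)] * (24 * 60 - 30) + [(100, 100)] * 30

print(request_based_slo(day))  # (True, 0.9969): the outage is invisible overall
print(windowed_slo(day))       # (False, 0.9792): the bad half hour shows up
```

Either way, both flavours are computed after the fact and remain lagging indicators.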
Hence, it is imperative that we have some kind of dynamic alerting in place that gives us information about leading indicators. Does that mean we need to employ machine learning? Not really. Machine learning models are cost intensive and require extensive training using metrics observed in the past. Since each model needs to be trained on the metric it will observe, as the number of components and their corresponding metrics increases, the costs pertaining to training, deploying and maintaining the machine learning models go through the roof. How do we go about dynamic alerting then? Well, we have high school math to the rescue. Armed with high school math, basic statistics and other Avengers, we at Last9 built a smoke detector to monitor increasingly sophisticated distributed systems. Now the question that arises is: what does a smoke detector really do? Well, it raises an alarm when an anomaly is detected. What exactly is an anomaly then? An anomaly is any deviation from normal system behavior. It could be owing to any of the following characteristics of a metric: the rate at which the metric is increasing or decreasing, the amplitude or spikes, which basically means the absolute value of the metric, and also the time of the day at which a particular value was observed.

Let's start with rate. Rate is basically a measure of how fast a metric value changes. If something is spiking up fast enough and then goes back to normal, is it worth alerting on? Maybe, or maybe not. A simple way to go about computing rate changes in a metric could be using standard deviation across a sliding window. If there is an increase in the standard deviations being observed, that means there are sudden changes in the metric that weren't happening before. However, when exactly do we alert based on rate changes? If the percentage of 4xx errors observed in a system suddenly spikes up, it is worth alerting on, and so is the queue depth on an RDS database. However, if the CPU or memory usage suddenly increases from 20% to 50%, it should not really be alerted upon, as it is not a cause for concern. However, if it continues to increase past 50% and stays up, it could be a cause for concern, as it could indicate a shortage of resources which could degrade system performance. Now consider the case of a service which experiences very sparse traffic. The resource usage would be negligible most of the time. However, whenever there is any traffic, resource usage would increase suddenly and a rate alert would be sent out. Hence, it is imperative that we include historical context as well while checking for anomalous behavior.

As the name suggests, spike detection only detects spikes, or unusually high values which are not usually observed. Continuing the same example, whenever the service has traffic, the resource usage increases. Therefore, an alert should be sent out only when the increased value is unusually high with reference to past data, to minimize alert fatigue. A simple way to go about this would be to use percentile-based cutoffs to detect anomalies. Let's say we set the 95th percentile as the upper bound. As and when new values are observed, the value corresponding to the 95th percentile changes and the bounds get adjusted. However, consider the case of a database which has a backup job scheduled to run at 04:00 a.m. every morning. Whenever the job runs, the network I/O shows unusually high values for a minute and then goes back to normal. In such a case, both the rate-based algorithm and the spike detection algorithm are likely to send out a false alert. Therefore, it can be argued that the time at which the value of the metric was observed is just as important as the value itself.

This brings us to seasonality. Seasonality can be loosely defined as a recurring pattern in the metric depending on the time of the day, the day of the week, or the month of the year. To include the context of time while computing the bounds, we only include the data which was observed in a similar time frame in the past. For example, while checking if a value observed at 06:30 p.m. in the evening is an anomaly, I can look at values observed from 06:00 p.m. to 07:00 p.m. over the past week.
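As a rough sketch of how these three checks could fit together (a rolling standard deviation for rate changes, a percentile bound for spikes, and a seasonal filter that only compares against values seen at a similar time of day), here is one possible shape in Python. The data structures, thresholds and helper names are assumptions made for illustration, not the actual detector described in the talk.

```python
import statistics
from datetime import timedelta

def percentile(values, p):
    """Nearest-rank percentile; good enough for an illustration."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[idx]

def rate_anomaly(current_window, past_windows, factor=3.0):
    """Rate check: is the spread within the current sliding window much larger
    than the spread usually seen across historical windows?"""
    current_std = statistics.pstdev(current_window)
    usual_std = statistics.mean(statistics.pstdev(w) for w in past_windows)
    return current_std > factor * max(usual_std, 1e-9)

def spike_anomaly(value, past_values, p=95):
    """Spike check: is the absolute value above the 95th percentile of history?"""
    return value > percentile(past_values, p)

def seasonal_values(samples, at, half_width=timedelta(minutes=30)):
    """Seasonality: keep only past (timestamp, value) samples observed at a
    similar time of day, e.g. 06:00-07:00 p.m. for a value seen at 06:30 p.m.
    The nightly 4 a.m. backup spike therefore never enters this baseline."""
    target = at.hour * 60 + at.minute
    width = half_width.total_seconds() / 60
    def close(ts):
        delta = abs(ts.hour * 60 + ts.minute - target)
        return min(delta, 24 * 60 - delta) <= width
    return [v for ts, v in samples if close(ts)]
```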
This brings us to the next question: what exactly should the size of the time frame under consideration be? It depends on how predictable the traffic pattern is. For example, in the case of a streaming service which is broadcasting a sports tournament, the time window could be as small as an hour, since the times at which the event starts and ends are mostly known. However, in the case of an ecommerce site, traffic on weekends or holidays might not strictly adhere to the same pattern that was observed in the past. Therefore, a broader time window spanning 3 hours or even 5 hours should be used in this case. A combination of the previously discussed characteristics can also be used to check whether an alert is to be sent out. Depending on the use case, more and more such characteristics can be included to add more context upon which anomalies can be detected and alerts sent out accordingly. On top of this, the number of characteristics that are being flagged as anomalous can also help us gauge the severity of the situation.

Until now, all of these detection mechanisms hinged on the fact that metrics are coming into our observability setup and are being monitored continuously. However, if the metrics themselves were to stop coming in, all of these methods would be rendered useless. What do we do then? Detecting loss of metrics is just as important as detecting anomalies in the metrics. Depending on the nature of the service, it becomes difficult to gauge whether metrics have stopped flowing in or the traffic is genuinely missing. It could also be the case that the pipeline which feeds metrics into our setup has gone down. Consider a high traffic service on an ecommerce site. It is very unlikely that all of the traffic stopped coming in suddenly, hence any gap in metrics should definitely send out an alert. On the other hand, if I have a Lambda function which gets triggered every hour or so, I'd be observing metrics only once every hour. How do I know whether the Lambda has not been triggered or the metric pipeline has broken down? A simple way would be to track the time duration observed between consecutive metric values and set percentile-based cutoffs. If the duration is unusually high, an alert should be sent out. So if I have not observed metrics for the Lambda function in the last 75 minutes, I can brush it off. However, if no metrics have come in for the past 2 hours, it is definitely a cause for concern.
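A minimal sketch of that gap check, assuming a simple list of sample arrival times; the 99th-percentile cutoff and the slack factor are illustrative choices, not prescribed values.

```python
from datetime import datetime, timedelta

def percentile(values, p):
    """Nearest-rank percentile; good enough for an illustration."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[idx]

def metrics_missing(arrival_times, now, p=99, slack=1.25):
    """Alert when the silence since the last sample is unusually long compared
    to the gaps normally observed between consecutive samples."""
    gaps = [(b - a).total_seconds()
            for a, b in zip(arrival_times, arrival_times[1:])]
    if not gaps:
        return False  # not enough history to judge
    allowed = slack * percentile(gaps, p)
    return (now - arrival_times[-1]).total_seconds() > allowed

# An hourly Lambda: 75 minutes of silence is still within the learned tolerance,
# but two hours without a sample crosses the cutoff.
hourly = [datetime(2022, 1, 1) + timedelta(hours=i) for i in range(48)]
print(metrics_missing(hourly, hourly[-1] + timedelta(minutes=75)))  # False
print(metrics_missing(hourly, hourly[-1] + timedelta(hours=2)))     # True
```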
Before we conclude, let's recap. We looked at the old way of doing things, at the need for doing things in a new way, and at different ways of alerting based on rate, amplitude and even the time of the day. It is to be noted that there is no silver bullet, no one-size-fits-all approach. Even in this day and age, I find myself guilty of using static threshold-based alerts to keep track of basic things like system uptime. While uptime may not be the same as availability, having this algorithmic toolbox at our disposal gives us great power to detect issues faster and resolve them, and owing to their domain-agnostic nature, these techniques are flexible enough to address a huge variety of distributed systems. I hope this session was useful and shed some light on various ways to go about dynamic alerting. Feel free to reach out to me on the Discord server or on LinkedIn.

Abhijeet Mishra

Software Engineer @ last9.io

Abhijeet Mishra's LinkedIn account


