Transcript
Smoke detectors in large-scale production systems.
Before we begin, let's go back in time 15 years to look
at the life of SREs back then. Well, not SREs and
not DevOps. We were called sysadmins back then.
Traffic was low, servers were enough, and weekends held promise.
There were tools like Nagios and Zabbix, which had agents to monitor various
resource usages and emitted zero if all was okay,
one in case of a warning, and two in case of an error.
Static thresholds were good enough for these situations. When stuff
broke, it broke in a fairly predictable manner. That's not to say that
when things broke back then, they were easy to fix. Who wants to configure
a RAID array in today's world? Well, definitely not me.
The domain was fairly well known and the problems repeated
often enough for us to know what broke and how it broke.
Obviously there are quite a few war stories, but the good part
is that there were no auto-scaling applications, no Kubernetes clusters,
and more importantly, no YAML to write. Now let's
take a look at today and what the future looks like.
Applications no longer reside in a single data center in someone's
basement. They are spread across the world. We have auto-scaled applications.
There are cloud-native workloads spanning multiple containers and orchestrators,
primarily Kubernetes clusters. There are huge number-crunching data
lakes. We have functions as a service, Lambdas, and
whatnot. It can't be denied that the scope has increased and
the domain has exploded. Newer problems and
different variants of the older problems keep occurring, breaking the systems
in ways that prove very difficult to diagnose and debug.
Along this transition to the new world order, we realized that static
thresholds don't cut it anymore. If they were enough,
there would be definitive answers to these questions. What is a
good number of 5xx errors? It depends on the throughput.
If I'm getting a few 5xx errors over
a few minutes, it's completely acceptable if my traffic
is at the scale of millions. However, if my traffic itself is
in the hundreds or even thousands, it is a cause for concern
and may even cause the SLOs to fail. How many container restarts
in Kubernetes are too many? It depends on the number of desired containers.
If the restarts are fairly small in number compared to the
number of desired containers, they do not affect the availability of my
systems and are of no concern to me. Coming to the last
question, how slow is too slow? Again,
it depends on the workload. For an e-commerce site,
1 second is too slow and might lead to frustrated customers.
However, in the case of asynchronous workloads, 1 second might
be fine if the workload involves crunching data, or might even be
considered blazing fast for a video editing pipeline.
We can safely assume that context is king.
With increasingly complex systems, context pertaining to the
nature of the systems is paramount. Owing to the
sheer complexity and variety of components involved
in modern distributed systems, it has become harder to isolate faults
and mitigate them accordingly. There are no sitting ducks anymore.
The ducks are armed and dangerous.
However, one might argue that since services have
to abide by SLAs, aren't SLOs enough?
Not really. SLOs are lagging indicators of system health.
They indicate system performance in the past, not the
future. As a consequence, we should be more concerned about
leading indicators of system performance. For example,
my car getting washed in the rain outside is a lagging indicator,
as it can only happen after it has rained.
However, cloudy weather and lightning are leading indicators
that it may rain. It might not always rain, but there is
a pretty good chance that it might. SLOs are primarily of two kinds:
request-based and windowed SLOs. Request-based SLOs perform
some aggregation along the lines of good requests versus the
total number of requests. For example, while computing the availability SLO
for a compliance duration, we would simply count the total number of
requests and the total failed requests.
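As a rough sketch of that arithmetic (the function and the list-of-status-codes input are illustrative assumptions of mine, not how the talk's tooling ingests data), a request-based availability check boils down to a single ratio:

```python
# Minimal sketch of a request-based availability SLO. The list of HTTP
# status codes is an assumed input format; in practice the good/total
# counts would come from your metrics backend.

def request_based_slo(status_codes, objective=0.999):
    """Return (availability, met?) for the observed requests."""
    total = len(status_codes)
    if total == 0:
        return 1.0, True  # no traffic, nothing to violate
    good = sum(1 for code in status_codes if code < 500)
    availability = good / total
    return availability, availability >= objective

# Example: 2 failures out of 10,000 requests -> 99.98% availability.
print(request_based_slo([200] * 9998 + [500, 503]))
```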
However, consider this scenario: during a holiday sale on a shopping site,
the system serviced 99% of requests successfully during
the day. However, during a 30-minute window in the day,
all requests failed, and consequently
most of the customers got upset and did not come back to the site.
A request-based SLO of 99% would look unaffected
in this case. However, it can be said that
request-based SLOs give us no indication of consistent performance
throughout the SLO window. They only serve as an indicator
of overall performance. To circumvent the issues above,
we must use windowed SLOs.
A windowed SLO is along the lines
of: a certain percentage of time intervals should
satisfy a certain criterion. For example, in the past 24
hours, at least 99% of 1-minute intervals should have a
success ratio of 95% or above.
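Here's a minimal sketch of how such a windowed check could be evaluated, assuming the metrics store can hand us per-minute (good, total) counts; treating empty minutes as good is my own simplification, not a rule from the talk:

```python
# Sketch of a windowed SLO: in the past 24 hours, at least 99% of
# 1-minute intervals should have a success ratio of 95% or above.

def windowed_slo(minutes, interval_objective=0.95, window_objective=0.99):
    """minutes: list of (good_requests, total_requests) per 1-minute interval."""
    good_intervals = sum(
        1 for good, total in minutes
        if total == 0 or good / total >= interval_objective
    )
    ratio = good_intervals / len(minutes)
    return ratio, ratio >= window_objective

# Healthy traffic all day, except a 30-minute total outage at a quieter hour:
minutes = [(1000, 1000)] * (1440 - 30) + [(0, 300)] * 30
total_good = sum(g for g, _ in minutes)
total_reqs = sum(t for _, t in minutes)
print(total_good / total_reqs)   # ~0.994 -> a 99% request-based SLO still passes
print(windowed_slo(minutes))     # ~0.979 -> the 99% windowed SLO is violated
```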
Now, the question that arises is: what kind of SLO should be set on what
kind of service? Consider a service that caters to payments.
The service only cares about successful payments.
A request-based SLO would be ideal in this scenario. A sample objective
could be: over the last seven days, 99.9% of the
total payments should be successful. On the
other hand, a streaming service would care mostly about
its users being uninterrupted over the weekend so that they can
continue binge-watching their shows. In such a scenario,
window-based SLOs are ideal. A sample objective could be
that over the last seven days, 99% of 15-minute intervals should
have been served successfully, where an interval counts as successful
if 95% of the users did not receive an error.
Hence, it is imperative that we have some kind of dynamic alerting
in place that gives us information about leading indicators.
Does that mean we need to employ machine learning? Not really.
Machine learning models are cost-intensive and require extensive
training using metrics observed in the past. Since each model needs
to be trained on the metric it will observe, as the number of components
and their corresponding metrics increases, the costs of training,
deploying, and maintaining the machine learning models go
through the roof. How do we go about dynamic alerting then?
Well, we have high school math to the rescue.
Armed with high school math, basic statistics,
and other Avengers, we at Last9 built a smoke detector to monitor
increasingly sophisticated distributed systems. Now the question
that arises is: what does a smoke detector really do?
Well, it raises an alarm when an anomaly is detected.
What exactly is an anomaly, then? An anomaly is
any deviation from normal system behavior. It could be owing
to any of the following characteristics of a metric:
the rate at which the metric is increasing or decreasing;
the amplitude or spikes, which basically means the
absolute value of the metric; and the time of the day at
which a particular value was observed. Let's start with rate.
Rate is basically a measure of how fast a metric's value changes.
If something spikes up fast and then goes back to normal,
is it worth alerting on? Maybe, maybe not.
A simple way to compute rate changes in a metric could
be to use the standard deviation across a sliding window. If there is
an increase in the standard deviations being observed, that means
there are sudden changes in the metric that weren't happening before.
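A minimal sketch of that idea, assuming evenly spaced samples; the window size and the 3x multiplier are arbitrary knobs I've picked for illustration, not values from the talk:

```python
from collections import deque
from statistics import mean, stdev

# Sketch of rate-change detection: compare the standard deviation of the
# current sliding window against the average of previously observed
# window standard deviations.

def rate_alerts(values, window=30, multiplier=3.0):
    recent = deque(maxlen=window)
    past_stds = []
    alerts = []
    for i, v in enumerate(values):
        recent.append(v)
        if len(recent) < window:
            continue
        current_std = stdev(recent)
        if past_stds and current_std > multiplier * mean(past_stds):
            alerts.append(i)  # the metric is changing much faster than before
        past_stds.append(current_std)
    return alerts

# A flat metric that suddenly starts climbing:
series = [10.0] * 100 + [10.0, 40.0, 80.0, 120.0, 150.0]
print(rate_alerts(series))  # indices where the sudden change is flagged
```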
However, when exactly do we alert based on rate changes?
If the percentage of 4xx errors observed in a system suddenly spikes
up, it is worth alerting on, and so is the queue depth
on an RDS database. However, if the
CPU or memory usage suddenly increases from 20% to
50%, it should not really be alerted upon, as
it's not really a cause for concern. However, if it continues
to increase past 50% and stays up,
it could be a cause for concern, as it could indicate a shortage of resources,
which could degrade system performance.
Now consider the case of a service which experiences very sparse traffic. The resource
usage would be negligible most of the time. However,
whenever there is any traffic, resource usage would increase suddenly
and a rate alert would be sent out. Hence,
it is imperative that we include historical context as well while
checking for anomalous behavior. As the name suggests, spike detection
only detects spikes, or unusually high values which are not
usually observed. Continuing the same example,
whenever the service has traffic, the resource usage increases.
Therefore, to minimize alert fatigue, an alert should be sent out only
when the increased value is unusually high with reference to the past data.
A simple way to go about this would be to use percentile-based
cutoffs to detect anomalies. Let's say we set
the 95th percentile as the upper bound: as and when new
values are observed, the value corresponding to the 95th percentile changes
and the bounds get adjusted.
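A sketch of that percentile-based bound, using numpy for the percentile math; the history length and the 50-point warm-up are assumptions on my part:

```python
import numpy as np

# Sketch of spike detection: flag a value if it exceeds the 95th
# percentile of the recent history. Note that a plain 95th-percentile
# bound will also flag roughly 5% of perfectly normal points, which is
# one reason to combine it with the other characteristics.

def spike_alerts(values, history=500, percentile=95, warmup=50):
    alerts = []
    for i, v in enumerate(values):
        past = values[max(0, i - history):i]
        if len(past) < warmup:
            continue  # not enough history to judge against yet
        upper = np.percentile(past, percentile)
        if v > upper:
            alerts.append((i, v, round(float(upper), 2)))
    return alerts

# A quiet metric with one genuine spike at the end:
rng = np.random.default_rng(7)
series = list(rng.normal(10, 2, 300)) + [60.0]
print(spike_alerts(series)[-1])  # the 60.0 far exceeds the rolling bound
```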
However, consider the case of a database which has a backup job scheduled to run at 4:00 a.m. every morning.
Whenever the job runs, the network I/O spikes to unusually high
values for a minute and then goes back to normal.
In such a case, both the rate-based algorithm and the
spike detection algorithm are likely to send out a false alert.
Therefore, it can be argued that the time at which the value
of the metric was observed is just as important as the value itself.
This brings us to seasonality. Seasonality can be loosely
defined as a recurring pattern in a metric depending on the time of the
day, the day of the week, or the month of the year. To include the
context of time while computing the bounds, we only include the
data that was observed in a similar time frame in the past. For example,
while checking if a value observed at 6:30 p.m. is an
anomaly, I can look at values observed from 6:00 p.m. to 7:00 p.m.
over the past week.
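One way to sketch that time-of-day filtering; the one-hour window, the 95th-percentile bound, and the data layout are all illustrative assumptions, and wrap-around across midnight is ignored for simplicity:

```python
from datetime import datetime, timedelta

# Sketch of seasonality-aware bounds: when judging a value observed at,
# say, 6:30 p.m., only compare it against samples that were observed
# around the same time of day on previous days.

def seasonal_values(samples, at, window_minutes=60):
    """samples: list of (timestamp, value); keep values near `at`'s time of day."""
    half = timedelta(minutes=window_minutes / 2)
    kept = []
    for ts, value in samples:
        # Project the old timestamp onto today's date to compare times of day.
        projected = ts.replace(year=at.year, month=at.month, day=at.day)
        if abs(projected - at) <= half:
            kept.append(value)
    return kept

def is_seasonal_anomaly(samples, at, value, percentile=95):
    baseline = sorted(seasonal_values(samples, at))
    if not baseline:
        return False  # no comparable history, don't alert
    cutoff = baseline[min(len(baseline) - 1, int(len(baseline) * percentile / 100))]
    return value > cutoff

# History: readings between 6 and 7 p.m. for each of the last seven days.
now = datetime(2024, 5, 20, 18, 30)
history = [(now - timedelta(days=d, minutes=m), 100 + m)
           for d in range(1, 8) for m in range(-30, 31, 5)]
print(is_seasonal_anomaly(history, now, 500))  # True: far above the evening norm
print(is_seasonal_anomaly(history, now, 110))  # False: within the usual range
```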
This brings us to the next question:
what exactly should the size of the time frame under consideration be?
It depends on how predictable the traffic pattern is.
For example, in the case of a streaming service which is broadcasting
a sports tournament, the time window could be as small as
an hour, since the times at which the event starts and ends are
mostly known. However, in the case of an e-commerce site,
traffic on weekends or holidays might not strictly adhere to
the same pattern that was observed in the past. Therefore,
a broader time window spanning 3 hours or even 5 hours
should be used in this case. A combination of the previously
discussed characteristics can also be used to check if an alert
is to be sent out, as sketched below. Depending on the use case, more and more
such characteristics can be included to add more context upon which
anomalies can be detected and alerts
sent out accordingly. On top of this, the number of
characteristics that are being flagged as anomalies can also help us
gauge the severity of the situation.
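A tiny sketch of what that combination could look like; the detector names and the severity labels are placeholders of mine, not Last9's actual scheme:

```python
# Sketch: run each characteristic check independently and use the number
# of characteristics flagged as a rough severity score.

def evaluate(checks):
    """checks: dict mapping characteristic name -> whether it was flagged."""
    flagged = [name for name, anomalous in checks.items() if anomalous]
    severity = {0: "ok", 1: "warning", 2: "high"}.get(len(flagged), "critical")
    return severity, flagged

print(evaluate({"rate": True, "spike": True, "seasonality": False}))
# -> ('high', ['rate', 'spike'])
```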
Until now, all of these detection mechanisms have hinged on the fact that
metrics are coming into our observability setup and are being monitored
continuously. However, if the metrics themselves were to stop
coming in, all of these methods would be rendered useless.
What do we do then? Detecting loss of metrics
is just as important as detecting anomalies in the metrics.
Depending on the nature of the service, it can be difficult to gauge whether
metrics have stopped flowing in or the traffic is genuinely missing.
It could also be the case that the pipeline which feeds metrics into our setup
has gone down. Consider a high-traffic service on
an e-commerce site. It is very unlikely that all of the traffic
stopped coming in suddenly. Hence, any gap in metrics should
definitely send out an alert. On the other hand,
if I have a Lambda function which gets triggered every hour
or so, I'd be observing metrics only once every hour.
How do I know if the Lambda has not triggered or if the metric
pipeline has broken down?
A simple way would be to
track the time duration observed between consecutive metric values
and set percentile-based cutoffs. If the duration
is unusually high, an alert should be sent out.
So if I have not observed metrics for the Lambda function
in the last 75 minutes, I can brush it off.
However, if no metrics have come in for the past 2 hours,
it is definitely a cause for concern.
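A minimal sketch of that gap check; the 99th percentile and the 1.5x slack factor (to give some headroom over the largest gap seen so far) are assumptions I've added so that the 75-minute/2-hour behaviour from the example works out:

```python
# Sketch of missing-metric detection: track the durations between
# consecutive metric arrivals and alert when the current silence is
# unusually long compared to past gaps.

def metrics_gap_alert(arrival_times, now, percentile=99, slack=1.5):
    """arrival_times: sorted timestamps (in seconds) at which metrics arrived."""
    gaps = sorted(b - a for a, b in zip(arrival_times, arrival_times[1:]))
    if not gaps:
        return False
    cutoff = gaps[min(len(gaps) - 1, int(len(gaps) * percentile / 100))]
    current_gap = now - arrival_times[-1]
    return current_gap > slack * cutoff

# A Lambda that reports roughly every hour (3600 s, with a little jitter):
arrivals = [i * 3600 + (i % 7) * 60 for i in range(100)]
print(metrics_gap_alert(arrivals, now=arrivals[-1] + 75 * 60))   # False: 75 min is plausible
print(metrics_gap_alert(arrivals, now=arrivals[-1] + 120 * 60))  # True: 2 hours is a real gap
```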
Before we conclude, let's recap. We looked at the old way of doing things,
we looked at the need to do things in a new way, and we looked
at different ways of alerting based on rate, amplitude, and even
the time of the day. It is to be noted that there is no
silver bullet, no one-size-fits-all approach.
Even in this day and age, I find myself guilty of using static threshold-based
alerts to keep track of basic things like system uptime. While uptime
may not be the same as availability, having this algorithmic toolbox
at our disposal gives us great power to detect issues faster and
resolve them. Owing to their domain-agnostic nature, these techniques are
flexible enough to address a huge variety of distributed systems.
I hope this session was useful and shed some light on various ways
to go about dynamic alerting. Feel free to reach out to me on the Discord
server or on LinkedIn.