Transcript
            
            
Hi, I'm Joe. Welcome to "Don't Get Out of Bed for Anything Less Than an SLO." I'm a software engineer at Grafana, and I've been building and operating large distributed systems for nearly my whole career. I've been on call for those systems. I think there are great things about being on call: the people who are operating and fixing things usually know what to do, because they've built the system. But some companies have things more together than others. I've been in some organizations where things are incredibly noisy and stressful, and I want to talk about a tool today that can help us improve that situation markedly.
            
            
            
We're going to talk about what makes an on-call situation bad and what sort of knobs we have to improve it. We'll talk about a particular tool called a service level objective, which helps you understand what really matters in your system and how to alert on that, so that the alerts you're sending to people at three in the morning are really meaningful. They point to things that really have to be done, and they explain what's going on: what problems are people having that caused this alert to happen?
            
            
            
Burnout is a big topic in software engineering these days. Tons of people in the industry experience burnout at some point in their career, and right now huge numbers of people are experiencing some symptoms of burnout. It kills careers: people go out to the woods to build log cabins when they leave these jobs. It really hurts their teams and their organizations. Too much turnover leads to institutional knowledge walking out the door, and no matter how good your system documentation is, you can't recover from huge turnover. So mitigating, preventing, and to some degree curing burnout is really important to software companies.

Burnout comes from a lot of places, but some big external factors, factors that come from your job, are unclear expectations, a lack of autonomy and direction, an inability to unplug from work, and an inability to deliver meaningful work. And bad on call can affect all of those. There are unclear expectations: what am I supposed to do with an alert? Who's supposed to handle this? Is it meaningful? If you're repeatedly getting paged at three in the morning, especially for minor things, you really can't relax and let go of work when it's time to be done. And if you're spending all day responding to especially useless things, you don't ship anything, you don't commit code, you feel like you're just running around putting out fires, and it's very frustrating and draining. So bad alerts, poorly defined alerts, things that people don't understand what to do about, are huge contributors to burnout. To improve that situation, we need to understand our systems better, understand what's important about them, and make sure that we're responding to those important things and not to minor things, underlying things that can be addressed as part of a normal work routine.

A useful, good on-call shift looks like having the right people on call for the right systems. When alerts happen, they're meaningful and they're actionable: they are for a real problem in the system, and there's something someone can do to address it.

There's a great tool in the DevOps world that we can use to help make all of that happen, called service level objectives. They help you define what's actually important about the system, help your operations understand how to measure those important behaviors, and help you make decisions based on those measurements. Is there a problem we need to respond to right now? Is there something we need to look at in the morning? Or is there something over time where we need to prioritize work to improve things long term? How can we assess what problems we have and how serious they are?
            
            
            
We split that into two parts. There's a service level, which is not about a microservice or a given set of little computers; it's about a system and the services it provides to its users. What's the quality of service that we're giving people? And then an objective, which says: what level of quality is good enough for this system, for our customers, for my team?
            
            
            
So a service level is all about the quality of service you're providing to users and clients. To define that, you need to understand what users really want from that system and how you can measure whether they're getting what they want. We use something called a service level indicator to help us identify what people want and whether they're getting it. We start with a prose description: Can users sign in? Are they able to check out with the socks in their shopping carts? Are they able to run analytics queries that let our data scientists decide what we need to do next? Is the catalog popping up? We start with that prose, then we figure out what metrics we have, or maybe need to generate, that let us measure it. Then we do some math, and it's not terribly complicated math, to give us a number between zero and one: one being the best quality output we could expect for a given measurement, and zero being everything is broken and nothing is working right.

Some really common indicators that you might choose are the ratio of successful requests to all requests that a service is receiving over some time; this is useful for an e-commerce application or a database query kind of system. You might have a threshold measurement about data throughput being over some rate: you may know you're generating data at a certain rate and you need to be processing it fast enough to keep up, so you need to measure throughput to know if something's falling apart. Or you may want to use a percentile kind of threshold to say that, for a given set of requests, the 99th percentile latency needs to be below some value, or the average latency needs to be below some value, some statistical threshold. These are all really common indicators. They're relatively easy to compute and relatively cheap to collect.
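To make those three styles of indicator concrete, here is a minimal sketch in Python. It assumes you already have raw request counts and latency samples in hand; the function names and numbers are illustrative, not from any particular monitoring tool.

```python
# Illustrative sketches of the three indicator styles described above.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Ratio of successful requests to all requests, between 0 and 1."""
    if total_requests == 0:
        return 1.0  # no traffic: treat as "nothing is broken"
    return successful_requests / total_requests

def throughput_sli(processed_per_second: float, required_per_second: float) -> float:
    """1.0 when we keep up with the incoming data rate, lower when we fall behind."""
    return min(1.0, processed_per_second / required_per_second)

def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests faster than the latency threshold."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)

# Example: 9,990 of 10,000 checkout requests succeeded -> an SLI of 0.999.
print(availability_sli(9_990, 10_000))
```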
            
            
            
Next, you need to set an objective: what level of quality is acceptable? And of course, only the highest quality will do, right? That's what we all want to provide, but we have to be a little more realistic about our target. To measure an objective, you choose two things. You choose a time range over which you want to measure; it's better to have a moderate to long time range, like seven or 28 days, than a really short one, like a day or a few minutes. You're getting too fine-grained if you're trying to measure something like that. And then you choose a percentage of your indicator that would be acceptable over that time range. Ninety percent is a comically low number, but: are 90% of login requests succeeding over the last week? If so, we've met our objective, and if not, then we know we need to be paying attention.
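As a tiny worked example of that check, here is a sketch, with made-up counts, of comparing a week's login success ratio against a 90% objective.

```python
# Did at least 90% of login requests succeed over the last seven days?
# The counts below are invented for illustration.
OBJECTIVE = 0.90          # 90% of login requests must succeed
successful = 1_843_220    # successful logins over the window
total = 1_901_456         # all login attempts over the window

sli = successful / total
if sli >= OBJECTIVE:
    print(f"Objective met: {sli:.4f} >= {OBJECTIVE}")
else:
    print(f"Objective missed: {sli:.4f} < {OBJECTIVE} -- time to pay attention")
```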
            
            
            
You'll want to set a really high quality bar, right? Everybody does. But it's useful to think about what a really high quality bar means. Say you're measuring your objective over the last seven days and you want a 99% objective, and we're talking about total system downtime, where your indicator is hitting zero: you've got an hour and 41 minutes of total downtime every seven days to meet a 99% objective. And if you want five nines, the sort of magic number that you sometimes hear people say, over a month that's only about 25 seconds of downtime. It's really important to think these things through when you set objectives, because it's better to start low and get stricter.
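Those downtime figures are easy to sanity-check yourself. Here is a small sketch that reproduces them, assuming "downtime" means the indicator is at zero for the whole period.

```python
from datetime import timedelta

def allowed_downtime(objective: float, window: timedelta) -> timedelta:
    """How much total downtime a window can absorb while still meeting the objective."""
    return window * (1.0 - objective)

print(allowed_downtime(0.99, timedelta(days=7)))      # ~1:40:48, the hour and 41 minutes above
print(allowed_downtime(0.99999, timedelta(days=30)))  # ~26 seconds, the five-nines figure above
```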
            
            
            
Teams feel better and perform better when they're able to start at a place that's okay and improve, rather than be failing right out of the gate. And when setting objectives, you want to leave a bit of safety margin for normal operations. If you're always right up on the line of your objective, then you either need to do some work to improve your system's performance and build up some margin, or lower the objective a bit, because if you're right up on the line, people are getting alerted all the time for things they may not be able to do much about. So to choose an objective, once you've defined your indicators, it's a good idea to look back over the last few time periods you want to measure and get a sense for what a good objective would look like. Something that doesn't work is for your VP of engineering to say: look, every service needs a minimum 99% SLO target, and anything less is not acceptable. This doesn't work well, especially when you're implementing these for the first time in a system and you may not even know how well things are performing; you're going to be sabotaging yourself right out of the gate. It's better to find your indicators, start to measure them, and then think about your objectives after that.
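One possible way to do that look-back is sketched below: take the SLI from the last few windows and set the starting target a little below the worst of them, so you begin with some margin. The specific margin is an assumption for illustration, not a number from the talk.

```python
# Pick a starting objective from recent history, leaving a bit of safety margin.
recent_weekly_slis = [0.9981, 0.9993, 0.9975, 0.9989]  # last four 7-day windows (invented)

worst = min(recent_weekly_slis)      # 0.9975
margin = 0.002                       # arbitrary illustrative margin for operational noise
starting_objective = round(worst - margin, 4)
print(starting_objective)            # 0.9955 -> start here, then tighten over time
```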
            
            
            
Once we understand what our indicators and objectives look like, we can start to alert based on measurements of those indicators. We want to alert operators when they need to pay attention to a system now, not when something kind of looks bad and should be looked at in the morning. We're going to monitor those indicators because they're symptoms of what our users actually experience, and we'll prioritize how urgent things are based on how bad those indicators look. To think about this, it's useful to think about monitoring the symptoms of a problem and not the causes. A symptom of a problem would be something like "checkout is broken"; a cause of a problem might be "pods are running out of memory and crashing in my Kubernetes cluster."
            
            
            
Alerting is not about maintaining systems. It's not about looking for underlying causes before something goes wrong; that's maintenance. Alerting is for dealing with emergencies. Alerting is for triage. So you should be looking at things that are really broken, real problems for users, right now. Your systems deserve checkups, and you should be understanding and looking at those underlying causes, but that's part of your normal day-to-day work. If your team is operating systems, that should be part of what your team is measuring and understanding as part of your general work, not something that your on-call does just during the time that they're on call. You shouldn't be getting up at three in the morning for maintenance tasks. You should be getting up at three in the morning when something is broken. So you need to be alerted not too early, when there's some problem that's just a little spike, but definitely not too late to solve your problem either.
            
            
            
To think about this, it's useful to think of system performance like cash flow. Once you set an objective, you've got a budget: a budget of problems you can spend, and you're spending it like money. If you've been in the startup world, you've heard about a company's burn rate: how fast are they running through their cash? When are they going to need to raise more? You can think of this as having an error burn rate, where you're burning through your error budget. And when you start spending that budget too quickly, you need to stop things, drop things, look in on it, and figure out how to fix that.
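Here is a sketch of that budget-and-burn-rate arithmetic, assuming a 99.9% objective over a 30-day window; the observed error rate is invented for illustration.

```python
# Error budget: the fraction of requests the objective allows to fail.
# Burn rate: how fast you are spending that budget. A burn rate of 1.0 means
# you will use up exactly your budget by the end of the window.
WINDOW_DAYS = 30
OBJECTIVE = 0.999                       # 99.9% of requests should succeed

error_budget = 1.0 - OBJECTIVE          # 0.001 of requests may fail

observed_error_rate = 0.005             # suppose 0.5% of recent requests failed
burn_rate = observed_error_rate / error_budget   # 5x: spending budget five times too fast

hours_in_window = WINDOW_DAYS * 24
hours_to_exhaustion = hours_in_window / burn_rate
print(f"burn rate {burn_rate:.1f}x, budget gone in ~{hours_to_exhaustion:.0f} hours")
```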
            
            
            
So we can think about levels of urgency associated with measurements of the SLO. There are things that should wake me up, like a high burn rate that isn't going away, or an extremely high rate of errors happening over a short period of time. If there's a sustained moderate rate that's going to cause you problems over a period of days, it's something your on-call can look at in the morning, but that they must prioritize. Or if you're never burning through your error budget, but you're always using a bit more than you're comfortable with, then you should address that as part of your maintenance, checkup kind of work on your system. And if you have just transient, moderate burn-rate problems, small spikes here and there, you almost shouldn't worry about these. There are always going to be transient issues, especially in larger systems, as other services deploy or a network switch gets replaced.
            
            
            
This is why we set our objectives at a reasonable level: we shouldn't be spending teams' valuable time on minor things like that, things that can really be addressed as part of normal work.

In order to get alerted soon enough, but not too soon, and to avoid those transient, moderate problems sending alerts that are already resolved by the time somebody wakes up, we can measure our indicators over two time windows, which helps us handle that too-early, too-late kind of problem. By taking those two time windows and comparing the burn rate to a threshold, we can decide how serious a problem is. The idea is that over short time periods the burn rate needs to be very, very high to get somebody out of bed, while over longer time windows a lower burn rate will still cause you problems. If you've got a low burn rate all day, every day, you're going to run out of error budget well before the end of the month. If you have a high burn rate, you'll run out of the error budget in a matter of hours, and so somebody really needs to get up and look at it.

So for a high rating of urgency, you can have, say, a short time window of 5 minutes and a longer time window of an hour, and if the average burn rates over both of those windows are above a relatively high threshold, 14 times your error budget burn rate, then you know you need to pay attention. What these two windows buy you is that the long window helps make sure you're not alerted too early, and the short window helps make sure that when things are resolved, the alert resolves relatively quickly. So higher urgency levels mean shorter measurement windows but a higher threshold for alerting, and lower urgency levels, things you can look at in the morning, take measurements over longer periods of time but have a much more relaxed threshold. This is a way to really make sure you're getting out of bed only when something is going wrong at a pretty severe level.
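Putting the two windows together, a sketch of the paging decision might look like this, using the 5-minute and 1-hour windows and the 14x threshold mentioned above; the error rates and the 99.9% objective are assumptions for illustration, and the inputs stand in for whatever your monitoring system actually provides.

```python
# Page someone only when the burn rate over BOTH the short and long windows
# exceeds a high threshold like 14x.
ERROR_BUDGET = 1.0 - 0.999   # for an assumed 99.9% objective

def burn_rate(error_rate: float) -> float:
    return error_rate / ERROR_BUDGET

def should_page(error_rate_5m: float, error_rate_1h: float, threshold: float = 14.0) -> bool:
    """Both windows must be burning budget faster than the threshold."""
    return burn_rate(error_rate_5m) > threshold and burn_rate(error_rate_1h) > threshold

# A brief spike: the 5-minute window looks terrible, the 1-hour window does not.
print(should_page(error_rate_5m=0.05, error_rate_1h=0.004))   # False -- don't wake anyone
# A sustained problem: both windows are far over budget.
print(should_page(error_rate_5m=0.05, error_rate_1h=0.03))    # True  -- get out of bed
```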
            
            
            
Thanks for coming.