Conf42 Incident Management 2022 - Online

Relia...bility?


Abstract

Technology ecosystems are complex and it is really important to understand every change and how it affects our systems, as well as the service provided. Users expect systems to be up, responsive, fast, consistent, and reliable.

Reliability for systems means that they are doing what their users need them to do. A system's reliability is essentially how happy its users are, and we know that happy users are good for business. Reliability is one of the most important requirements of any system, users determine what reliability means, and it's okay not to be perfect all the time. We need a way of thinking that can address all of this, since we have limited resources to spend, be they financial, human, or political.

Summary

  • Ricardo Castro: Today we're going to talk about reliability. We are going to develop a framework that many of you have already heard about, which revolves around SLOs. We will then see where the real value of having such a framework in place comes from. At the end we will conclude with why all of this is important.
  • The talk develops a framework to ensure that services are doing what their users expect them to do. How can we assess whether a supermarket is being reliable or not? We're going to build this step by step.
  • A service level indicator is a quantifiable measure of service reliability. It helps us separate good events from bad events. Here are a few tips on how to create good SLOs.
  • What is an acceptable target? 90%? 95%? 100%? Of course, it depends. It will depend on your user needs. It can depend on business needs. Let's do some back-of-the-envelope calculations to see what this means.
  • An error budget is what is left from an SLO. It's effectively the percentage of reliability left. It can help us make educated decisions on whether, for example, to release a new feature. We can set alerts for when the available error budget reaches a critical level.
  • So now we have a framework that will help us assess, measure, and define reliability. This facilitates prioritization: it helps us prioritize work and makes it easier to make decisions. I hope you enjoyed my talk and have a great conference.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to my talk. My name is Ricardo Castro, and today we're going to talk about reliability. So what do we have on the menu for today? We're going to start by giving some context for this talk: we'll use an example from the real world and then translate it into our techie reality. We will then talk about reliability itself. We are going to build this step by step, developing a framework that many of you have already heard about, which revolves around SLOs. We will then see where the real value of having such a framework in place comes from. And at the end, we're going to conclude with why all of this is important.
So let's start with an example from the real world: a supermarket. If you think a little bit about it, a supermarket is kind of a microservices architecture. Why do I say this? The idea is that I, as a consumer, go to a supermarket, do my shopping, pay, and get out. But underneath the covers there's a lot that needs to happen so that my consumer experience is actually reliable. There are many different pieces that need to fit into place so that, when I go to the supermarket, everything is there for me to buy. Someone has to make purchase orders to actually get products into the supermarket. Someone has to transport those products, someone has to unload them, someone has to stock the shelves, someone needs to be at a counter if I need some kind of assistance, and someone needs to be at a counter so that I can pay. So, as we can see, there are a lot of moving pieces inside the supermarket that have to be successful so that my simple user experience is actually reliable.
So how can I assess whether a supermarket is being reliable or not? Let's see a couple of examples. Imagine that we did all of our shopping and we want to pay. How can this experience not be reliable? If it takes too long to pay, I may assess this experience as not being reliable. By the same token, if I go to a shelf, pick up a product, and the product's expiration date has passed, I might assess this supermarket as not being reliable, because I tried to buy a product and couldn't. And we can draw a parallel here to our techie world. Something taking too long to pay, we can equate to latency: I send a request to a service and it takes too long to receive a response back. And a product whose expiration date has passed, I can use as an example of an error.
As a concrete case, I'm going to use my own company. I work at a company called Anova, and we operate in the industrial IoT space. We develop solutions for customers who have industrial sensors: we ingest data from those sensors and create meaningful services that help them manage their own infrastructure. Essentially, we collect high-reliability data from high-reliability sensors, process it, and provide solutions to our customers. For all of this to be successful, we need to ensure that all of our services are actually reliable. But what does reliability mean? If we go to the Cambridge dictionary, it defines reliability as the quality of being able to be trusted or believed because of working or behaving well.
And this gets me a little bit confused, but essentially it says that something is reliable if it's behaving well, right? But I prefer the definition from Alex Hidalgo in his great book, Implementing Service Level Objectives, which essentially asks the question: is my service reliable, and how can I assess that? Basically, I can say my service is reliable if it is doing what its users need it to do. This is a bit of a shift. We're not saying that something is reliable if it's behaving well; we're saying that something, a service for example, is reliable if it is doing what its users expect it to do.
So how can we go about developing a framework to actually ensure that services are doing what their users need them to do? We're going to build this step by step, and we're going to start with the most basic component in our system: metrics, very well known to all of us. Essentially, a metric is a measurement about a system. It doesn't tell us anything other than a measurement. A few examples of what a metric can be: the amount of memory that a server is using, for example as a percentage of available memory; the time it takes for an HTTP request to be fulfilled, for example in milliseconds or seconds; the number of HTTP response errors, for example as a natural number; or the age of a message arriving at a Kafka cluster, in minutes.
With metrics in place, we can start building on this and evolving the concept. The next concept is the service level indicator. A service level indicator is a quantifiable measure of service reliability, and it helps us separate good events from bad events. So how can we define service level indicators that actually tell us whether an event is good or bad? We're going to take the example metrics from before and use them to define SLIs that tell us, when we look at an event described by a metric, whether that event is good or not. We need a binary state, good or bad, even if the underlying metric doesn't provide that binary state. Using the previous examples, we can say a request needs to be responded to within 200 milliseconds: every time we take a measurement, if it takes more than 200 milliseconds, we say that it is not good; if it takes 200 milliseconds or less, we say that it is good. The same is analogous to the other examples. A request to a service must not be responded to with a 500 code: if it's responded to with a 500 code, it's not okay; if it's responded to with a code different from 500, it is okay. And again, the same thing for messages arriving at Kafka: if the message is not older than five minutes, everything is okay; if the message is older than five minutes, things are not okay and we might need to do something.
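To make this concrete, here is a minimal sketch in Python of an SLI separating good events from bad ones, using the 200 millisecond and non-500 thresholds from the examples above. The Request record and its field names are hypothetical, not from any particular monitoring system.

```python
from dataclasses import dataclass

# Hypothetical request record; the field names are illustrative.
@dataclass
class Request:
    latency_ms: float
    status_code: int

def is_good(req: Request) -> bool:
    """SLI: an event is good if it was answered within 200 ms
    and did not return a 500 error."""
    return req.latency_ms <= 200 and req.status_code != 500

requests = [
    Request(120, 200),   # good
    Request(350, 200),   # bad: too slow
    Request(90, 500),    # bad: server error
    Request(180, 200),   # good
]
good = sum(is_good(r) for r in requests)
print(f"{good}/{len(requests)} good events "
      f"({good / len(requests):.0%})")  # 2/4 good events (50%)
```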
We started with metrics and went to SLIs. What comes next? Next is the SLO. An SLO is nothing more than how often an SLI has to be good so that I can be sure my users are happy with my service. And that, of course, needs to be measured within a time interval. Using the same examples again, we're going to evolve an SLI into an SLO. We can say 99% of requests to a service need to be responded to within 200 milliseconds within a 30-day period: we define an interval, which is 30 days, we look at all requests, and we say that 99% of them need to be responded to within 200 milliseconds. If not, I can say that my users are not satisfied with my service. Same thing with the other examples: 99% of requests to a service need to be responded to with a code different from 500 within a seven-day period, and exactly the same for messages arriving at Kafka.
But how can we create good SLOs? Here are a few tips. First and foremost, and even if you forget all the other tips, remember this one: always, always, always focus on the users. Users are the ones who define the reliability of my service, so it makes sense to actually know what the users expect and track our reliability accordingly. Going a little deeper, one good option to start defining SLOs is to list out critical user journeys and order them by business impact: maybe it's your search catalog, maybe it's checkout, whatever makes sense within your company. Then we need to determine which metrics we will use as service level indicators to most accurately track that user experience: we need to measure a few things and be sure that we have SLIs that actually track what we know the user values. Also, try not to go overboard with SLOs: for most cases, three or four SLOs should be enough, and we can use composite SLIs to get there. Also important is to have SLO documents; we'll see an example in a second, but it's essentially a document that describes what the SLO is, when it was last reviewed, and so on. Very important is to review SLOs periodically: an SLO is not set in stone, and every once in a while it needs to be reviewed to be sure it's still actually tracking user satisfaction. And finally, determine SLO targets and goals and the SLO measurement period, and don't try to be too reliable; we'll see what I mean by this in a bit.
So, a quick glance at an example of an SLO document. This example was taken from the Google SRE book, so you can use it as a reference to build your own and adapt it to your own reality. Essentially, it has a description of the SLO: this is an SLO for the example game service. It has the authors, when it was defined, who reviewed it, when it was approved, and when it should be revisited. It also has an overview of the service that the SLO applies to. Then it comes down to SLIs: these are definitions like the ones we've seen previously, the ones that actually inform this SLO. Then there's a section for the rationale: maybe there's something that you need to make clear to whoever is using this SLO, and it should be explained here. Error budgets we're going to see in a bit, but here we can describe the error budget policy for this particular SLO. And of course, you can have a section for clarifications and caveats: something that needs to be explained, or maybe some constraints that need to be taken into consideration when using this SLO.
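As a rough illustration of that structure, here is what such an SLO document could look like as plain structured data. This is loosely modeled on the Google SRE book example mentioned above, and every field value below is hypothetical.

```python
# A minimal sketch of an SLO document as structured data, loosely modeled
# on the Google SRE book example. All values are illustrative.
slo_document = {
    "service": "example-game-service",
    "authors": ["ricardo.castro"],     # hypothetical author
    "approved": "2022-10-01",          # hypothetical dates
    "revisit_by": "2023-04-01",
    "sli": "proportion of HTTP requests answered in under 200 ms, "
           "measured at the load balancer",
    "objective": 0.99,                 # 99% of events must be good
    "window_days": 30,
    "rationale": "Checkout is the highest-impact user journey.",
    "error_budget_policy": "Freeze feature releases while the budget "
                           "is exhausted.",
    "clarifications": "Internal health-check traffic is excluded.",
}
print(f"Allowed bad events: {1 - slo_document['objective']:.1%} "
      f"per {slo_document['window_days']}-day window")  # 1.0% per 30 days
```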
So, going back to our previous point, what is an acceptable target? 90%? 95%? 100%? Of course, it depends. It will depend on your user needs. It can depend on your business needs; for instance, you might be in a highly regulated business that has many constraints. It can be informed by cost and many other things. But the point here is that it depends.
Just for us to have an idea, let's look at an SLO for uptime and say that we define three nines, 99.9%. The number of nines is shorthand that has spread across the industry. For an SLO of three nines, this means that I can have roughly 8.8 hours of downtime per year. Just by adding a nine, those hours go down to about 52 minutes. And if I add another nine, five nines of reliability, this means that I can only be down for about five minutes a year. So just by adding one nine to my uptime SLO, I drastically reduce the allowed downtime, and I have to put things in place to ensure that I'm not down more than the allowed amount.
To conclude this point, let's do some back-of-the-envelope calculations to see what this means, since you need to put things in place to actually assure this type of reliability. Scenario one: imagine that I want to increase my reliability from three nines to four nines, that is, by 0.09 percentage points, and that my service generates $1 million of revenue. That means the value of the improvement is about $900. Does this make sense? That will be up to you, but it's important to do these calculations to find out. Scenario two: as a rule of thumb that we see a lot in documentation, and I believe it's also in the SRE book, each additional nine of reliability costs ten times more to achieve. So if I go from three to four nines, whatever it costs me to run my services will increase by a factor of ten: if it costs $1 million to run, adding a nine will cost me around $10 million. These are, of course, back-of-the-envelope calculations, but it's important to do this type of calculation to have an idea, if I want to increase my reliability, of how much it will cost me and how much value it can get me.
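Here is a small Python sketch of those back-of-the-envelope numbers; the $1 million revenue and running-cost figures are the hypothetical ones from the scenarios above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(nines: int) -> float:
    """Allowed downtime per year for an uptime SLO of N nines (3 -> 99.9%)."""
    return MINUTES_PER_YEAR * 10 ** -nines

for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):6.1f} min/year")
# 3 nines: ~525.6 min (~8.8 h); 4 nines: ~52.6 min; 5 nines: ~5.3 min

# Scenario 1: value of going from three to four nines on $1M revenue.
revenue = 1_000_000
print(f"Value of the extra nine: ${revenue * (0.9999 - 0.999):,.0f}")  # ~$900

# Scenario 2: rule of thumb -- each extra nine costs ~10x more to run.
run_cost = 1_000_000
print(f"Cost with the extra nine: ${run_cost * 10:,}")  # $10,000,000
```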
Moving forward, we touched briefly on error budgets before, but essentially an error budget is what is left from an SLO. If I consider 100% to mean that my service is always reliable, the error budget is 100% minus the SLO I define. So if the SLO is two nines, 99%, I have 1% of error budget. It's effectively the percentage of unreliability I'm allowed, and it can help us make educated decisions on whether, for example, to release a new feature or not, based on the amount of risk that we can take. Error budgets also help make sure the operability process, for example incident response, is appropriate to the budget available for the service being provided: the amount of error budget that we have left can inform our incident response.
And last but not least, we have service level agreements, which are well known to us. Essentially, an SLA is nothing more than an SLO that has some kind of penalty attached. Using the same examples we've been using up to this point, I can define an SLA that says that 95% of requests to a service need to be responded to within 200 milliseconds within a 30-day period, and if that doesn't happen, the customer gets a 30% discount. The same goes for the other examples. The basic idea is that an SLA is an SLO with some type of penalty attached. SLAs are usually looser than SLOs, so that if we breach an SLO, we still have some time before breaking the SLA. Looking at the first example, if we had an SLO of two nines, 99%, and breached it, we would still have 4% of unreliability to burn, so to speak, before the 95% SLA was breached.
So what can we build with all of this? Of course, we can build visualizations. For example, we can track and put on a dashboard a visualization of an SLO: the objective, for example 99%, how much of the error budget has been burned, how much I still have left, and whether something is burning right now. This is very interesting, and we can do a historical analysis of how my SLO is actually doing. But we don't want to be looking at dashboards all day. What we actually want is to be informed if something is not going okay, so that we can take the appropriate measures.
So we come to alerts. If we think about traditional alerting methods, we usually use metric thresholds: we define some kind of threshold, and if something goes above that threshold, we trigger an alert so that someone can investigate and see if everything is okay. A couple of examples: if a CPU goes above 80%, if a certain number of requests is taking more than 200 milliseconds, or if we have some amount of 500 responses, we trigger an alert and someone investigates what's going on. We can take the same approach with an SLO: if we have an SLO of 99% and compliance goes below that, we need to do something. This is better, because if we define our SLOs in a meaningful way, we are actually tracking user experience, but it only alerts us when we are already in trouble. How can we do better?
We can alert on the amount of error budget that we still have available, or the amount of error budget that we have already burned. We can set alerts for when the available error budget reaches a critical level, or when a critical amount of error budget has been burned, and we can route different trigger levels to different alert channels. For example, if I have already burned 25% of my budget, I can send an email; at 50%, I can alert someone on Teams or on Slack; and if 75% of my budget has been burned, I can trigger PagerDuty or Opsgenie for someone to actually look into it. This is better than the alternative, because we are being alerted before we get into trouble and have to do something. But we still have no idea how fast the error budget is being consumed. It begs the question: if my SLO is well defined, and if I could know that by the end of the evaluation period we would still have some error budget left, would I even want to receive this alert? Probably not, because I would still be within the bounds of the amount of unreliability I have allowed for, and I could decide to release more features or do other types of work.
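A minimal sketch of that tiered error-budget alerting follows. The 25/50/75% thresholds and channels come from the example above, while the function names and the sample compliance figure are hypothetical.

```python
def budget_consumed(objective: float, compliance: float) -> float:
    """Fraction of the error budget burned so far: with a 99% objective,
    98.5% compliance means 1.5% bad events against a 1% budget -> 150%."""
    return (1 - compliance) / (1 - objective)

# Thresholds from the talk; channel names are illustrative.
ALERT_TIERS = [
    (0.75, "page via PagerDuty/Opsgenie"),
    (0.50, "message Slack/Teams"),
    (0.25, "send email"),
]

def route_alert(consumed: float) -> str | None:
    """Return the most urgent channel whose threshold has been crossed."""
    for threshold, channel in ALERT_TIERS:
        if consumed >= threshold:
            return channel
    return None

consumed = budget_consumed(objective=0.99, compliance=0.994)
print(f"Budget consumed: {consumed:.0%} -> {route_alert(consumed)}")
# Budget consumed: 60% -> message Slack/Teams
```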
The next evolution tackles this problem, and it is the burn rate. Burn rate tells us how fast the error budget is being consumed. A burn rate of one means that all my error budget will be consumed exactly within the interval that I define, for example 30 days or a week. Let's see an example: if I have a window of evaluation of four weeks and I calculate my burn rate and it's two, this means that I will consume all my error budget in half the time, in this example two weeks. This is a lot better, but it still has a slight problem: if the error budget burns too fast, it can be fully consumed before the evaluation happens, and I might not even receive an alert.
So we can make a small tweak to have excellent alerts based on burn rate: multiwindow, multi-burn-rate alerts. We use multiple windows and multiple burn rates to inform us of different problems. We define fast-burn alerts, which alert us on sudden spikes in error budget consumption; think of a huge spike of errors in our API, for example. And we define slow-burn alerts, which alert us on less urgent issues that are nonetheless consuming a lot of error budget over time. A few example values: for a fast burn, I can define a window of evaluation of 2 hours, evaluate every five minutes, and if my burn rate is ten, I know that something is not okay. For a slow burn, I define a longer evaluation period of 4 hours, and if my burn rate is two, I actually have a problem and need to investigate.
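A minimal sketch of that multiwindow, multi-burn-rate evaluation. The window sizes and burn-rate thresholds are the example values from the talk; pairing each long window with a short confirmation window (five and thirty minutes) is an assumption on my part, not something the talk prescribes.

```python
def burn_rate(bad_fraction: float, objective: float) -> float:
    """How fast the budget burns; 1.0 exhausts it exactly at window end."""
    return bad_fraction / (1 - objective)

# (long window h, short window h, burn-rate threshold, action) -- thresholds
# from the talk; the short confirmation windows are assumed.
ALERT_RULES = [
    (2.0, 5 / 60, 10.0, "fast burn: page someone"),
    (4.0, 30 / 60, 2.0, "slow burn: open a ticket"),
]

def evaluate(bad_by_window: dict[float, float], objective: float) -> list[str]:
    """bad_by_window maps a window length in hours to its bad-event fraction.
    Requiring both windows to burn hot keeps alerts from continuing to fire
    long after the underlying problem has been fixed."""
    return [
        action
        for long_w, short_w, threshold, action in ALERT_RULES
        if burn_rate(bad_by_window[long_w], objective) >= threshold
        and burn_rate(bad_by_window[short_w], objective) >= threshold
    ]

# A sudden spike: 15% of requests failing in both the 2h and 5min windows.
observed = {2.0: 0.15, 5 / 60: 0.15, 4.0: 0.05, 30 / 60: 0.01}
print(evaluate(observed, objective=0.99))  # ['fast burn: page someone']
```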
Our last concept is the error budget policy. An error budget policy is nothing more than a document, or a set of documents, where we define what happens when certain error budget conditions have been breached. Here is an example: if a service has exceeded its error budget for the preceding four-week window, we will halt all changes and releases other than P0 issues or security fixes until the service is back within its SLO, and depending on the cause of the SLO miss, the team may devote additional resources to working on reliability instead of feature work. This is a concrete definition of what happens when my SLO is breached. It will of course be highly contextual and depend on your organization, but it's something that is predefined and agreed by everyone involved: if certain things happen, these are the actions we are going to take.
To recap, let's look at this concept, the reliability stack, in its extended form. We started with metrics, which are measurements about the system. We then evolved those into SLIs, which tell us whether an event is good or bad. We then evolved SLIs into SLOs, which tell us how often the SLI needs to be good so that my customers are actually happy. The error budget is nothing more than what is left from the SLO. And with SLOs, what can we build? We can build visualizations that we can use to assess whether the SLO is okay or not. We can build meaningful alerts that track user experience and tell us if something is not okay. We can define error budget policies that tell us what we need to do if certain conditions happen. And of course, we can use those SLOs to define SLAs, which are nothing more than SLOs with a penalty attached.
And why is all of this important? For starters, we start measuring reliability through the eyes of our users. They are the most important actors and the ones who define reliability: it doesn't matter that I think my system is performant if my users are not happy with the way it's working. Also, reliability work ties directly to business goals. Happy users are usually good for business, and if I'm ensuring that my users are happy through a reliability framework, this is good for business and can be tied directly to business goals. It also creates a shared language to talk about reliability: no more some engineers defining how reliability should be tracked or measured one way, other engineers doing it another way, product people saying it another way, and business people doing it yet another way. Now we have a framework that helps us assess, measure, and define reliability. And of course, this facilitates prioritization: if we have a reliability framework in place and an error budget policy that tells us what needs to happen when certain conditions apply, it becomes easier to decide when we should focus our efforts on reliability work or on product work, for example. And this is all from my part. I hope this was informative for you. Don't hesitate to contact me through my social links. I hope you enjoyed my talk and have a great conference.
...

Ricardo Castro

Lead SRE @ Anova



