Conf42 Site Reliability Engineering 2023 - Online

Starting the SLOs Implementation


Abstract

Based on the Google SRE Workbook, SRE is driven by SLOs; they are the core of implementing an SRE culture to improve a company's products or services. This talk guides the audience through how to start implementing SLOs, from defining the SLI and the SLO to calculating them.

Summary

  • There is confusion over the definition of "down" between your operations team and your product or engineering team. The product team should be very agile and dynamic, and tolerant of change. The operations team, on the other hand, wants to operate the software very safely, carefully, and stably. So the SLO comes in to help you.
  • What is the SLI? It is a carefully defined quantitative measure of some aspect of the level of service that is provided. The metrics can be anything, like your service latency, the non-error responses of your services, or the saturation of your instances or servers. Follow the four golden signals to start implementing the SLO.
  • You should choose your SLO time window, over which you evaluate your target. After you visualize your SLI, you should find its average movement. Put the boundary below the minimum point or the average movement. If you put the boundary above your average movement, then your system breaches the SLO from the start.
  • To build your first SLO, you should build your SLI specification. Set the boundary below the average movement of your SLI. Errors are okay as long as the SLI stays above the SLO. The next topics are error budgets and error policy.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Okay, starting the SLO implementation. I will start from the problems, guys. I have filtered the problems stated on my slide down to these three, and I think they are common problems for a tech company.
The first one is that a third party or your business partner says that the service you use will be down "for a while". This is a problem because we don't know what "for a while" means. Is it 1 minute, is it 2 minutes, or is it two years? We don't have any standard for how long our services are allowed to receive errors from the third party. This is the foundational problem, and it is why you should implement SLOs.
The second one is that your product team has a debate about whether the newly deployed service is stable or not, because how the product team works and how the operations team works are very different. Right? The product team should be very agile, very dynamic, and tolerant of change, because in order to improve the product itself, it has to change. On the other hand, the operations team wants things to be stable; it is about how to operate the software more safely. So this is also a problem: the product team wants to push changes, new products, or new services very often, but the operations team wants to keep everything stable, in a settled state. This second one is also a foundation for why you should implement SLOs.
And the third one is that there is confusion over the definition of "down" between your operations team and your product or engineering team. Like I said before, the product or engineering team wants to deploy fast, very often, and very dynamically.
But your operations team, on the other hand, wants to operate the software very safely, very carefully, and very stably, to keep maintaining the reliability of the software itself. So the SLO comes in to help you.
So what is the SLO? An SLO is a target value or range of values for a service level that is measured by an SLI. By the way, I'm quoting these definitions from the book called Site Reliability Engineering. The book is from Google, and I think it is a great book to start your site reliability engineering journey, so you should read it.
In this definition we find another buzzword: the SLI. We cannot understand the SLO completely before we learn about the SLI, so you should start with the SLI. What is the SLI? An SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided. That is a lot of buzzwords, but in my words, it is the metric that you want to observe. I mean, the metrics can be anything, right? Like your service latency, the non-error responses of your services, or the saturation of your instances or servers. The metrics can be anything, but the SLI specifically is the metric that you want to observe. I will explain the characteristics of the SLI on the next slides.
Now that you know the definition of the SLI, let's go back to the SLO definition. In my words, the SLO is the target value or range of values for the metric that you want to observe. For example, your SLI could be the latency of your services. Let's say the moving average of your latency is a bit below 200 milliseconds; that is your SLI. But your SLO should be a target for the SLI itself. Let's say my target is an average latency below 100 milliseconds. Then, by default, you are breaching your SLO, right? We will not talk about this now, but you can see the difference, right?
So the SLI is the metric that you want to observe, but the SLO is the target that your SLI should achieve. That is the definition of the SLO.
Okay, but I think I know what you're thinking now. To start implementing the SLO, you will feel like, "But I don't know what metrics I should observe for now." There are a lot of metrics that you can get from your entire system; there is a ton of metrics that you can retrieve from your services, your products, or your infrastructure. But if you look at the Google SRE workbook, when you have no idea which metrics to retrieve, you can follow the four golden signals.
First, you can start to collect data about the latency of your services or your systems. Second, you can start to collect and retrieve your traffic metrics. The third one is the error metrics. I have also stated on my slide what "errors" means: the rate of requests that fail, either explicitly, like an HTTP 5xx, or implicitly, for example an HTTP 200 success response coupled with the wrong content. In some other applications, every error or unexpected result is categorized as an error. That is the definition of errors, and you can start collecting error metrics from your services, your products, or your infrastructure as well. And the fourth one is saturation. The definition of saturation is how full your service is; I have noted the definition of saturation on the slide as well.
So, going back to your worry: I know you may be confused about which metrics to observe if you have no idea about the metrics themselves. You can follow the four golden signals, which are also stated in the Google SRE workbook. These four metrics are the least you should retrieve from your services or your infrastructure.
Okay, so you can start collecting them. Or, if you have no service to collect the metrics, you can start by deploying one, like Prometheus or a service mesh. You can also use a service mesh to retrieve these kinds of metrics if you have a Kubernetes cluster, right? There are a lot of ways to retrieve these metrics. But the point is, if you have no idea which metrics to retrieve, these four golden signals can help you define which metrics should be in your SLI.
Okay, now I know what should be measured, and then what? The next step is to start creating your SLI, once you know which metrics you should observe. There are two steps to build the SLI.
First, you should build the SLI specification. The SLI specification contains the definition of what you want to observe. You can detail it like in my previous example: "I want to measure the average latency of my service A within a month", let's say. That is the definition part of the SLI specification. The second part is that the SLI specification usually takes the form of a percentile or a percentage of some events over total events. It is the portion of target events that you want to observe, divided by all the events that have occurred. Let's say you want your latency to be below 200 milliseconds within a month. Then your SLI should be the number of requests that had a latency below 200 milliseconds divided by the total number of requests. So the SLI should be a percentile or a percentage. You can imagine that.
After you build your SLI specification, you can go on to the second step, the SLI implementation. You know what your SLI definition is, but now you should think again about where you can get the metrics and how you can get them. If you're using Prometheus, you should use PromQL, right?
And you should learn how to query it and how to aggregate it to fulfill your SLI specification. The generic formula for the SLI implementation is good (or target) events divided by total events, times 100%. Like I said before, it is usually a kind of percentile or percentage. So once again, the SLI is good events divided by total events times 100%. In my previous example, that is the number of requests that had a latency below 200 milliseconds divided by total requests, times 100%. Okay, that is my SLI specification example, and you can build yours from it.
Now another SLI example. In my SLI specification I state that I want to measure the HTTP responses that return a non-error response. I also define what a non-error response is: a 2xx or 3xx response to the client. In my implementation I will count the HTTP responses that fulfill my SLI specification, and also the total responses returned to the client. I can query this from the API gateway metrics, or from the service mesh, or from something like Cloudflare or some other cloud proxy service, which also provide counts of HTTP responses. So my SLI should be the sum of 2xx responses plus the sum of 3xx responses, divided by the total requests within a particular time window, times 100%. That way I get the percentage of my target events. Okay, so that is my SLI.
So this is the example. First things first, you should find your metrics, then build the SLI specification, and then try to build the SLI implementation.
Then, after you have an SLI, you should choose your time windows. The first time window you should choose is the evaluation or aggregation time window. Imagine that you have a normal web server that is serving HTTP requests.
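As a minimal sketch of the good-events-over-total-events formula, the non-error HTTP SLI from this example could be computed in Python as follows. The function name `availability_sli` and the sample status codes are hypothetical and only for illustration; in practice the counts would come from whatever source you query (API gateway, service mesh, or proxy metrics).

```python
from collections import Counter


def availability_sli(status_codes):
    """Return the percentage of responses that are non-error (2xx or 3xx).

    Implements: good events / total events * 100, where a "good" event
    follows the talk's definition of a non-error response.
    """
    # Group responses by status class: 200 -> 2, 301 -> 3, 500 -> 5, ...
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        raise ValueError("no requests in this window")
    good = classes[2] + classes[3]
    return 100 * good / total


# Hypothetical sample: 100 responses observed in one time window.
sample = [200] * 90 + [301] * 5 + [500] * 4 + [404]
print(availability_sli(sample))  # 95.0
```

Note that under this definition a 4xx response counts against the SLI, because only 2xx and 3xx are treated as non-error.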
You can imagine that your client requests don't arrive at a constant rate, right? The client doesn't constantly send a request every 1 second; sometimes it is every 1 second, sometimes it is one request per millisecond. There is no constant timing for client requests, so the client can send a request at any time, right? But for the SLI, you should choose the aggregation time over which you want to evaluate it. Is it requests per 1 minute, per 5 minutes, or per 10 minutes? So you should choose the aggregation time window for your metrics.
Okay. And the second one is that you should choose your SLO time window. Imagine you have a web server that has been running for ten years. Would you want to measure the SLO going back ten years? It doesn't make sense, right? So you should choose your SLO time: is it per week, per month, or per year? You should choose the SLO time window over which you evaluate your target: is it reached or not, is it breached or not? You should define the SLO time window as well.
After you have chosen your time windows and you have an SLI, you should set the boundary. These are the steps to set your SLO boundary. First things first, try to visualize it: retrieve your SLI metrics and visualize them within your selected time window. Second, after you visualize it, you should find the average movement of your SLI. For instance, the average movement might be between 90% and 95%; that is your average movement. I will show you an example on the next slide, so bear with me. And third, put the boundary below the minimum point or the average movement. These are the steps to set the first boundary of your SLO from your SLI.
Okay? There are some other points that you should pay attention to.
First things first: if you put the boundary above your average movement, then by default your system is breaching the SLO. You should note that. Say the average movement of your SLI is between 90% and 95% success, but you set your first SLO boundary at 99%. By default your system can't achieve it, right? It means that your SLI is breaching your SLO from the very first moment, which is not good. So you should find the average movement of your SLI (it is easier if you visualize it first) and put the boundary below it.
The second point is that the boundary, your first SLO, your first target for your SLI, usually starts from 90%. It is not a rule, but 90% is the usual starting point. After that you can improve your SLI by tuning your services. As the average movement of your SLI increases, you can start raising the SLO to 95%. You can use whatever number you want, but this is the general approach.
And the third one is that you can increase your SLO after your normal circumstances also go up. Once again, if you put your first boundary above your moving average, then your SLI will be breaching your SLO.
Let me go to the example part for a better understanding. I have the SLI specification, which is HTTP requests that return a non-error response, for every 5 minutes. Breaking down my SLI specification: a 5-minute metric aggregation time, which is one metric dot on the graph, and 30 days for my SLO time. That is my chosen time window. Then, going back to the set-boundaries steps, we should visualize it so we can see the graph itself.
If we look at the average movement of my metric, sometimes it is 100% for a 5-minute window, sometimes it is below 85%, and it even reaches down to 75%. Looking at the average movement, I can set the boundary at 95%. You can also calculate the median of your metrics to better set the first boundary, and set it at the median. But for this simple estimate, you can just look at the average movement on your graph. So I will set my first boundary, my SLO, to 95%.
To better understand this graph, note that every single metric dot on it represents a 5-minute aggregation: the HTTP non-error responses divided by total requests, averaged over every 5 minutes. That is one dot on this graph. You should note that.
So how do we know that our service, or our SLI in this case, breached the SLO? If the number of metric dots below the SLO is greater than (1 - SLO) times the SLO time window divided by the aggregation time window, then we breach the SLO.
For a better understanding, let's go back to our example. We have a 5-minute metric aggregation time (remember, that is one metric dot on the graph), a 30-day SLO time, and a 95% SLO. So the allowed error responses are (1 - 95%) times 30 days. What do we mean by this? Our SLO target is that non-error responses do not fall below 95%, so we allow errors for just 5% of the 30 days. You can imagine that, right? Once again, because our target is returning non-error responses to the client 95% of the time within 30 days, we allow errors for 1 - 95%, which is 5%, of the 30 days. So we tolerate errors for 5% of 30 days, which is 36 hours.
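Written out as a worked equation, the 36-hour allowance follows directly from the 95% target and the 30-day window. The symbol $T_{\mathrm{SLO}}$ for the SLO time window is notation introduced here for illustration; it does not appear on the slides.

```latex
\[
(1 - \mathrm{SLO}) \times T_{\mathrm{SLO}}
  = (1 - 0.95) \times 30\ \text{days}
  = 0.05 \times 720\ \text{hours}
  = 36\ \text{hours}
\]
```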
And because we have 5-minute metric aggregation, we have 36 hours divided by 5 minutes of allowed dots under 95%, which is 432. You got it, right? So if a metric dot on this visualized graph falls below 95%, below our target, more than 432 times within 30 days, we breach. If we count manually, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and so on (this is just an estimate), and the number of dots below our SLO is greater than 432 within 30 days, then you breach your 30-day SLO. It should make sense to you, right?
This problem can be understood better when we introduce the error budget concept. The error budget is actually 1 - SLO, but with the concepts of error budget and error rate there is a greater capability to detect an SLO breach. I will explain that later.
So, a quick recap on how to build your first SLO. First things first, you should build your SLI specification. If you're still not sure what to measure, see the four golden signals. After you have the SLI specification, you should build your SLI implementation, so start thinking about where and how you can get those metrics, and how to formulate them. Your implementation should match your SLI specification. After you have the SLI implementation, you should visualize your SLI within a time window. So the next step is also to set the time windows: set your metric aggregation time window, the time interval that you want to observe, and choose your SLO time window, over which you evaluate whether the SLO is breached or not. After you visualize it, you can see the average movement of your SLI. And the next step is to set the boundary below the average movement of your SLI.
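The breach rule from the example (count the 5-minute dots that fall below the SLO and compare against the allowed number) can be sketched in Python as below. The function names and the synthetic list of per-window SLI values are hypothetical; the sketch assumes each value is the SLI percentage of one aggregation window.

```python
from datetime import timedelta


def allowed_bad_windows(slo_percent, slo_window, aggregation_window):
    """Number of aggregation windows allowed to fall below the SLO target.

    Implements: (1 - SLO) * SLO time window / aggregation time window,
    with the SLO given as a percentage (e.g. 95 for 95%).
    """
    total_windows = slo_window / aggregation_window  # timedelta / timedelta -> float
    return (100 - slo_percent) * total_windows / 100


def is_slo_breached(window_slis, slo_percent, slo_window, aggregation_window):
    """True if more windows fell below the SLO than the rule allows."""
    bad = sum(1 for sli in window_slis if sli < slo_percent)
    return bad > allowed_bad_windows(slo_percent, slo_window, aggregation_window)


slo_window = timedelta(days=30)
aggregation_window = timedelta(minutes=5)

print(allowed_bad_windows(95, slo_window, aggregation_window))  # 432.0

# Synthetic data: 8640 five-minute windows in 30 days, 433 of them below 95%.
window_slis = [90.0] * 433 + [100.0] * (8640 - 433)
print(is_slo_breached(window_slis, 95, slo_window, aggregation_window))  # True
```

With exactly 432 windows below the target the function returns False, which matches the "greater than 432" wording of the rule.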
So I have given you the reason why it should be below the average movement of your SLI, right? That boundary is actually your first SLO. Congratulations, you have an SLO now.
Then, if we go back to our problems, they should have solutions after we implement the SLO, right? Take the first case: the third party or your business partner says that their service will be down for a while. It shouldn't be "for a while", because you have an SLO saying that you cannot return errors for more than x minutes. So you should tell your partner that you have an SLO, so they cannot go into maintenance or downtime for more than 5 minutes, let's say.
For the second problem, an error is okay. We tolerate errors as long as the SLI metric is still above the SLO.
For the third problem, it is also okay to return an unexpected result. When the product or engineering team is told that their service is intermittently returning errors or unexpected results, the operations team can say that it is okay as long as the SLI is still above the SLO. The operations team, the product team, and the engineering team now have the same definition of "down": as long as we are not breaching our SLO, it's okay.
It is the same with the second case. Whenever the product team wants to deploy very fast and be very agile, and the operations team is too scared to be that agile, the product team and the operations team can make a bargain: "Okay, our SLI is still above the SLO, so let's do some experiments." Because errors are okay as long as the SLI metric stays above the SLO.
For the next topics, I will talk about error budgets and error policy. These concepts will help you better understand when the SLO is breached or not. And next, I will talk about how to calculate the SLO for integrated services.
Let's say service A has an SLO and it communicates with service B, which also has an SLO. How can we calculate the SLO across all of those integrated services? Next, I will talk about alerting on SLOs. And the last one is the interesting part: why canary releases can improve your SLOs while maintaining agility. By the way, I have written articles on numbers one, two, and three, so visit my Medium and follow me so you can read them. The fourth one is still in progress; I am still writing the article and doing some math on it. I think that's all from me. Thanks. See you.

Muhammad Jihad

CEO @ SocketSpace
