Conf42 Site Reliability Engineering 2023 - Online

Measuring Reliability in Production

Abstract

We’ll use an example application to describe how to define SLIs and SLOs, including an overview of the architecture, a how-to for developing SLOs, and suggestions for implementing SLOs. We’ll also focus on how to identify CUJs, and give recommendations for implementing metrics to use as SLI and SLO targets.

Summary

  • Ramon Medrano is a site reliability engineer at Google. He will talk about how we measure reliability in production, through a small workshop on creating an SLO for a distributed system.
  • The most important feature of any system or service is the reliability it has towards its customers and clients. This includes security as well, because any system has to convince users to trust it with their data. SRE is what you get when you treat an operations problem, like running a system, as a software problem.
  • SLIs are service level indicators that describe what the user experience is. SLOs are the lingua franca used across the whole business cycle of a product or service. If everything goes through SLOs, the question becomes how to create one.
  • If you are an infrastructure service, your user is the product that is calling you. Metrics need to be as simple as possible, but sufficiently rich to capture exactly what users expect you to provide.
  • To create appropriate SLOs, you need to list the user journeys and order them by business impact. You also need to determine the indicators that describe the success of these CUJs. Finally, you will have to deploy some alerts.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Ramon Medrano. I'm a site reliability engineer at Google. I work in the identity team in the third site that we have, and I have been doing this for the last decade, believe it or not. I'm very thrilled to be speaking to you today about how we measure reliability in production. We are going to be doing a small workshop of creating a small SLO in a distributed system, and giving some hints about the questions we are going to need to answer when we are in the business of creating an SLI, an SLO, et cetera. So let's cover all that. The first introduction I want to make is that the most important feature of any system or service is the reliability it has towards its customers and clients. In my opinion, this includes security as well, because any system we have, for example in the cloud, or any online shopping website, et cetera, will have to convince the users to trust them with their data: for example payments, or, if you have a storage system, their own files and data. So the systems need to be reliable, in the sense that they need to be available for the users any time they need them, and they need to be secure, so we are not leaking any data to external actors. The second introduction is what SRE is. I think everyone signing up to this conference more or less knows what it is, but just as a 30-second introduction: it is what you get when you treat an operations problem, like running a system or operating a distributed system, as a software problem. Meaning that you are going to get automation; you are going to write software to manage these operations instead of just running them by hand, churning through tickets or interrupts. Going to the matter: one of the most difficult, or most nuanced, questions that we have when we are applying the SRE practice to a system, either one that is being created or one that already exists, is: what is the level of reliability we need?
Answers like 100% are not correct. Answers like "we shall see" are not correct either, for two reasons. 100% is not achievable: anything in a distributed system is subject to break, and any problem that we have is going to show up to the users at some point. "We shall see" is not good either, because then we can't have expectations of the system. So we are going to try to see a process to answer this question. Before we go to the process, here is some terminology that we are going to use through the talk. First of all, we have CUJs. CUJ stands for critical user journey. What it means is the definition of the functionality that your users care about. For example, if you have a shop, you care about people being able to browse your catalog, you care about people being able to put things in a cart or something like that, right? And you care about people being able to pay. And as well, you might care about people, for example, tracking their orders while they are being shipped to them. So this is core functionality of your service or your platform that is important to the user. Then we have SLIs. They are metrics: service level indicators that describe what the user experience is with regard to some functionality. It could be functionality as complex as checkout, or as concrete as storing one element in my Redis cache, because you might want to have SLIs for these subcomponents of your system as well. SLOs are the objectives we have for the service level, so they are the objectives we have for the user experience in different parts of the application. For example, we might want to have an objective of three nines, like 99.9% of our application is serving correctly for the functionality of storing a small value in Redis. Or 99.99% of the carts are successfully checked out after the user decides to do so. And then we have SLAs. An SLA is an agreement, it's a contract.
So an SLA basically is a contract between you, for example, as a service provider and your customers, indicating that if your service level indicator goes under the SLO for some time, you will give them a refund, or you will give them some credits for your platform, or whatever it is, however you bill your clients. So, with the terminology introduced, we can go to SLOs, and why SRE cares so much about SLOs all the time. Basically, SLOs are the lingua franca that we use across the whole business cycle of a product or a service, right? So from the concept, we have a business description: okay, we want to do machine learning, right? How do we want to do that? In which service do we want to introduce it? Is this a new service? Is this an upgrade to a service that exists, et cetera? Then, when we have a definition for the business of what this project, this service, this platform, whatever we are building, means, we go to development. So we are going to write code, design components, lay it all down in production, start to have some traffic, et cetera. Then we launch the service, and then we have operations. So we are going to have to do weekly rollouts, or daily rollouts, or whatever your cadence is. We are going to have to monitor, for example, that the new versions are correct. We are going to have to make sure that we have data integrity, running backups, et cetera. And finally, all that goes to the market. And in the market, if we have a service that produces revenue, for example, we're going to have to manage that. If we have internal infrastructure, we are going to have to discuss with our clients whether it is performing correctly, whether we need more functionality, and so on. All this gets aligned through SLOs. So SLOs get discussed with the product team, get discussed with the development team, and get discussed with the SRE or the DevOps team.
So we have an agreement: this is the level of reliability that we want, that makes sense for the business, and that is reasonable to implement within some time frame or cost. The thing is, if everything goes through SLOs, how are we going to create an SLO? This is what we're going to be looking at in the next minutes. First of all, one plug here if you want to play with SLOs: you want to have a small test bed service, which we are going to discuss in the next slide, that is our Hipster Shop. If you want to play with getting SLIs, defining SLOs, seeing how they evolve, creating some load, et cetera, you can use the Cloud Operations Sandbox. It's based on GCP, and you are going to get to, for example, deploy one small distributed system, start to get some synthetic load, and start to see, for example, if you inject faults, how that affects the SLIs, what the reliability of different things is. The system that we are going to use as the example for this is a distributed system that we call Hipster Shop. It's available on GitHub and you can deploy it in many places. For example, if you have Cloud Shell in GCP, you can just run all these services and have a small shop that will have different services written in different languages, interacting and sending RPCs to each other. There is even going to be a database, so you can play with different classes of SLIs, and you can play with different classes of distributed systems and languages as well. So, first of all, how do we start creating an SLO? The first thing that we need to think about is the CUJs, the critical user journeys. The critical user journeys are the interactions, the functionality, that our customers and users care about deeply, right? The interactions, or the APIs, or the functionality that defines the success of our product. So we need to list them and order them by business impact.
For example, in our shop there are three things that we want our shop to provide to the user: we want the users to be able to browse our catalog, check out whenever they have selected something into their cart, and add some products to their cart. If we order them by business impact, the thing that is most important for us is for people that already have a cart to check it out, so we can actually proceed to make the sale. Add to cart is the second one, because we want people to be able to create carts that we can check out later, and finally we have browsing products. This is a simple list, just an example, but in a different business the list will be different, or you might have different CUJs at the same priority, depending on how your product comes to be. So, a critical user journey: I think one word that needs emphasis is user. You need to think of the CUJs that you are defining from the point of view of your customers. It's not something internal, as in, for example, you happen to run a Redis cache and you want to have a CUJ that involves the Redis cache explicitly, because that could be leaking your abstraction to your customers, right? If you are an infrastructure provider, for example within a company, you are the one that, like me, is running the authentication service for Google, you might have CUJs that involve infrastructure in the sense that your users are going to be other products. For example, Gmail needs sign-in to work for issuing credentials to access people's mailboxes. That's fine. In this case, your user is going to be the product that is calling you as an infrastructure service. So your user in this case would be Gmail. Gmail could say: as Gmail, I want to see, for example, user credentials being generated properly for my product to continue to the mailboxes. That could be a variant for an infrastructure CUJ in this case.
Then, once we have the CUJs, we need to create indicators of the healthiness, or the success, of these CUJs. So when we say we are going to be looking at the checkout service for our customers, what are the metrics that we can use to describe how successful this service, this CUJ that we are providing to a user, is being? We need to have metrics that are as simple as possible, but sufficiently rich that they capture exactly what the users are expecting us to provide them. So there is a balance there that we need to strike well. So, SLIs. We have, first of all, different types of SLI depending on the services, the platforms, or the programs that we are running. If it's a transactional service, a classic one that is doing RPCs to other services, for example, we have an endpoint that people, even actual persons, can do transactions against. There we have the classic availability, latency and quality. We might say we want so many requests to succeed in less than x milliseconds. That's the classic SLI, right? Then we have data processing, which is, for example, when you have a pipeline that is iterating through databases, or you have processing that is asynchronous. There you might have indicators about the freshness of the data that you are producing. You are going to have indicators about coverage: for example, you might be summarizing data, so you might want an indicator like, for each run of your batch job we cover 90%, 80%, 99% of our customers. And throughput is how many rows per second, for example, you are processing. If you are running storage, you might have throughput as well: how many queries, how many rows, how much data you are processing, and what latency you are seeing to process queries.
For example, if you run an infrastructure service that is a data lake, you might want to say: okay, we are able to process this many queries per second, and each query takes so many milliseconds to optimize and execute. Then, once we have the type of the SLI, we want to write the specification, which means going into the details for this particular service instance. For an availability SLI, for example, we can say: this is the proportion of valid requests, in the sense of requests serving a 200, that are served within x latency. We might want to include latency and availability in the same SLI or not, depending on how we want to shape this description of our service. And then we need to implement it. Implementing means: given the services that we have and the components they have, how are we going to get the metric calculated, in a non-abstract way? So we are going to need to decide: are we going to use events that our application is logging into the APM, for example? Are we going to use logs, which come with slightly more latency but are more precise? Are we going to just instrument our applications to export metrics directly, so we can have a system like Prometheus, for example, scrape them and calculate the numbers? Are we going to instrument our clients, or are we going to just treat our front-end services as a proxy for that, so we have, for example, less complexity, but we don't incorporate the latency that the networking of the last mile introduces into the user experience? Those are decisions that we need to take to calculate and implement the specific SLI that we want to measure from the users and show to the teams. So in our case, for the checkout CUJ that we were discussing, we're going to be focusing on these two components of our application.
So we have many components, but this CUJ covers specifically our front end and the checkout backend service, the one that is going to be doing the business logic for the checkout, which will, in its turn, be calling other things. For example, when you do a checkout, you typically want to call the payment service, right? And an email service to confirm to the user that the order was successful. So the SLI that we're going to be implementing here is an availability SLI, since it's a transactional service. The way we are going to implement it is the classic proportion of valid checkout requests that are served successfully. A successful request is going to be one that returns a 200. And we are going to actually implement it by instrumenting the front end. So we are going to incorporate this metric, the checkout service response counts, and we are going to exclude the 500s, the server errors, since those are not successful requests. This example is using the Istio service mesh, but wherever you want to propagate that metric is where you are going to make the calculation. Then comes the SLO, once we have the SLI, and this is the hard part. Because calculating an SLI, implementing an SLI, is just a descriptive metric of a system, and your developers and your SRE team will have expertise in that, so they can get to an agreement: this is the indicator. But how high do we want the target for this metric to go? In this example, we have the classic three nines: 99.9% of the checkout requests should be successful. 99.9% is going to be the target for the SLI that we defined before. If we are over, we are good. And if we are under, we might even have some contractual SLA obligations to fulfill. I say this is the hard part because it involves cost, right?
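The availability SLI described above, the proportion of checkout requests served successfully, excluding server errors, can be sketched as a small calculation over response counts bucketed by status code. The function name and the counts are illustrative; in practice the counts would come from the request metric exported by the front end, such as the Istio request counter the talk mentions.

```python
# Sketch: computing the checkout availability SLI from response counts
# bucketed by HTTP status code. The numbers are illustrative; in a real
# system they would be scraped from the front-end request metric.

def availability_sli(counts_by_status: dict[int, int]) -> float:
    """Proportion of valid checkout requests served successfully.

    Server errors (5xx) count as failures; everything else is treated
    as a valid, successful response, matching the definition in the
    talk (exclude the 500s).
    """
    total = sum(counts_by_status.values())
    if total == 0:
        return 1.0  # no traffic: conventionally report success
    failures = sum(n for code, n in counts_by_status.items()
                   if 500 <= code < 600)
    return (total - failures) / total

# Example: 9,970 successes, 20 client errors, 10 server errors
counts = {200: 9970, 404: 20, 503: 10}
print(f"checkout availability SLI: {availability_sli(counts):.4f}")  # 0.9990
```

Note that 4xx responses are counted as "valid and successful" here; whether client errors belong in the numerator, the denominator, or neither is exactly the kind of specification decision the talk describes.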
So when setting the SLO, think that each time you add one nine, you are cutting your error budget by ten. With this three-nines SLO, we are going to have 0.1% of our requests as budget for failure. If we fail that many requests, we are still good. So we can use this margin to say, for example, we want to do our rollouts faster, or we want to take some risks on the schema changes, or whatever it is the team is prioritizing. If we add one more nine, that sounds great, because we are going to go to four nines, but our budget is going to be one tenth of that. Therefore the complexity of the operations will multiply by ten-ish, and therefore the cost of maintaining and operating the system will be ten times more expensive. So we have to be very careful to have SLOs that are achievable and that are representative of the users' expectations towards the system. Typically, one thing that has worked well for me in the past has been to put product, developers and SRE in the same room and say: okay, what SLO should we set? The business and product side will come up with: well, we need the highest possible, right? Because that's great. And then you specify the cost: sure, we can do five or six nines, but this is going to cost you this much headcount, this much development time, this much complexity in the deployment of the code, whatever that is. And then things will come to a balance: okay, we can afford this cost, for example, for developing and operating the system towards this SLO, which is something within the users' expectations. So, summarizing the process to create appropriate SLOs: first of all, we need to list the user journeys and order them by business impact. It is very important that our product teams are involved, because they are the ones that know very well what the business impact is and what the expectations of the users are.
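The error-budget arithmetic above, where each extra nine divides the budget by ten, is easy to make concrete. The monthly request volume below is a made-up number for illustration.

```python
# Sketch: how the error budget shrinks with each extra nine.
# The monthly request volume is an assumed figure for the example.

def error_budget(slo: float, total_requests: int) -> int:
    """Requests allowed to fail while still meeting the SLO."""
    return round(total_requests * (1 - slo))

monthly_requests = 10_000_000  # illustrative traffic volume

for slo in (0.99, 0.999, 0.9999):
    budget = error_budget(slo, monthly_requests)
    print(f"{slo:.4%} SLO -> {budget:>7,} failed requests/month allowed")
```

Going from three nines to four nines drops the budget from 10,000 failed requests to 1,000 on this volume, which is why the talk warns that each added nine makes operating the system roughly ten times harder.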
At this point, you can as well get some indication of the criticality of different things. The user journeys at the top of the list are probably going to receive higher SLOs, because they have more impact on the business. And critical user journeys that are down the list are probably just less relevant, accessories, et cetera, so you might want to leave more headroom in the SLOs for those. Second, you need to determine the indicators that describe the success of these CUJs. Depending on the CUJ that you are considering, or depending on the component that is involved in the CUJ, the type of the SLI is going to be different, and the implementation might be different as well. Then you need to go back to product and development and say: okay, what targets do we want for these SLIs? What are the objectives that we want to meet here? Complexity and cost are going to be an important component of that discussion as well. And you're also going to need to define a measurement period: you might want sliding windows, or to only consider the calendar month, depending on the characteristics of your services. Finally, you can just implement everything: write the code to export the metrics, to do the calculation, to have a batch pipeline that processes the logs, whatever it is you do. And finally, you're going to have to deploy some alerts. The nice thing about the alerts is that they become pretty simple to implement, because you have an SLO and an SLI. So the alert is going to be: whenever the SLI is under the SLO for a certain period, like, I don't know, 1 hour, five minutes, 15 minutes, you just trigger an alert. The higher the SLO, usually the smaller the window.
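The alerting rule sketched above, trigger whenever the SLI stays under the SLO for a certain period, might look like the following. The window length and the sample values are illustrative choices, not anything prescribed by the talk, which notes only that higher SLOs usually call for smaller windows.

```python
# Sketch: a window-based SLO alert check. Each sample is the SLI
# measured over one interval (say, one minute); we alert only when
# every sample in the window is below the SLO, i.e. the SLI has been
# under target for the whole window.

def should_alert(sli_samples: list[float], slo: float, window: int) -> bool:
    """Alert when the last `window` samples are all below the SLO."""
    if len(sli_samples) < window:
        return False  # not enough data yet to fill the window
    return all(s < slo for s in sli_samples[-window:])

# Example: per-minute availability samples, three-nines SLO,
# five-minute window. The last five samples are all below 0.999.
samples = [0.9995, 0.9991, 0.9985, 0.9987, 0.9982, 0.9989, 0.9984]
print(should_alert(samples, slo=0.999, window=5))  # True
```

Real-world implementations usually refine this into burn-rate alerts (how fast the error budget is being consumed) rather than a raw threshold, but the shape of the check, SLI versus SLO over a window, is the same.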
If you want to know more about SRE, about the practice, about how to implement these things in your company, there are these books. We now have a family of four books. The first book is the one that defines the general principles of SRE. The second one, the workbook, is really focused on how to implement the first book in existing organizations. The other ones are more specific: if you want to talk about security and reliability together, there's the third book for you. And the last one is like a version of the workbook, but specifically tailored for large organizations, for example large enterprises and so on: how you can steer the culture within the company to have SRE there. Thank you for watching, thank you for listening. I would be happy to answer any questions that you have, either in the chat or on Twitter or any other social network that you use. See you around.
...

Ramon Medrano Llamas

Senior Staff SRE @ Google



