Conf42 Site Reliability Engineering 2022 - Online

Four Golden Signals: Monitoring the health of your service


Abstract

As Site Reliability Engineers, it is our mission to ensure our services are highly available, secure, and scalable. With hundreds or thousands of different metrics across a (potentially) distributed system that we could monitor and alert on, where do we begin? How do we define what it means for a service to be “healthy”?

This lightning talk focuses on the four golden signals of monitoring that serve as a solid foundation for actionable monitoring of the health of your service.

In this talk we’ll explore what the signals are, what they mean for you and your customers, and put what we’ve learnt into action with monitoring a demo application.

Summary

  • The four golden signals in monitoring are latency, traffic, errors, and saturation. These key signals will help keep your metrics focused, helping you track down issues faster. They also monitor what's important for your users, which means you're not responding to spurious alerts.
  • Catgen is a service that I wrote that has two components, a frontend and a backend. The backend has some artificial latency introduced just so we have something interesting to plot. Now how do we take what we've learned today and apply that?

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome. My name is Michael and I'm a site reliability engineer at Teleport, and in today's talk we'll be talking about the four golden signals, what they are, and how you can use them to create a solid foundation to monitor the health of your service. We'll also take what we've learnt today and apply it to a small service that I've written and monitor it using Grafana. Let's dive in.

As site reliability engineers, it's our mission to ensure our services are highly available, secure and scalable, and a big part of how this is achieved is through monitoring. But let's take a step back and explore why it is that we monitor in the first place. Alerting and troubleshooting are obvious ones. If something is broken, I want to know about it as soon as possible so I can fix it. And I'm also going to need to see my key metrics graphed so I can pinpoint where the issue is during troubleshooting. But it can also help us in other parts of our job too. It allows us to understand our trends and answer questions like: how quickly are my daily active users growing? Or how big is my database getting over time? The answers to these questions could be key in things such as capacity planning. We also use monitoring to understand the impact of changes. How much lower is my latency after this latest software rollout? Did we hit a regression with performance? It can even be helpful for experiments as well, where a software change is rolled out to a subset of our users. In this case, I'd want to understand: was this change beneficial? Perhaps I'm running an ecommerce website and the change resulted in people finding products more easily or more quickly. Ultimately, we need to think of monitoring as a way to view the system's health, where anyone that supports the service can have a single source to determine the overall performance and availability.

When I started my career ten years ago as a system administrator, I used Nagios to collect dozens of metrics across all of my hosts, from CPU usage to how many inodes were being utilized, and I'd get paged in the middle of the night for any one of them. But there wasn't always a clear understanding of what the impact of each metric was, or whether it was even an issue in the first place. Say an in-memory database is using 70% of its available RAM. Is this an issue? Perhaps not, but maybe it is if it were an HTTP server. If I'm going to be paged, I want to make sure that it's actionable and clear what the issue is. So fast forward to today: software is more complex, and now there are hundreds or thousands of different metrics across a distributed system that I could possibly monitor and alert on. Where do I begin? Which ones should I focus on, and how do I make actionable alerting from the metrics that I do have?

This is where the four golden signals come in. The four golden signals in monitoring are latency, traffic, errors and saturation. These key signals will not only help keep your metrics focused, helping you track down issues faster, but they also monitor what's important for your users, which means you're not responding to spurious alerts that aren't meaningful. Let's dive into what each of these signals means in greater detail.

Latency is simply the time it takes to service a request. For websites or user-facing services this is especially important, or your users will get impatient and abandon their request.
If it takes more than a second or two to open a web page, I'm probably not going to wait around and see how long it actually is going to take. Similarly, if it takes too long to add an item to a shopping cart, for instance, I'm probably going to abandon my cart and look elsewhere. It's important, however, that you distinguish the latency of successful requests from unsuccessful requests. If one of the backends to my service goes down and the frontend starts serving errors, it may be serving those errors rather quickly, which would be misleading in my graphs. That's not to say that you shouldn't graph error latency at all. After all, slow errors aren't much better than fast errors, and it might also point to an issue that needs to be investigated. Latency is often the first thing you and your users will notice, and when it increases, so does their frustration.

The next one is traffic, and this is a measurement of how much demand is placed on your service. For web servers, this would be HTTP requests per second. For an image processing pipeline, this could be how many images were processed per second. Whatever your service does, this metric should encapsulate how busy it is.

Errors can be explicit, such as HTTP 500 errors or failing gRPC requests based on error codes, but they can also be implicit. Maybe your service is taking more than a second to respond to a request, and maybe your SLOs define a request taking more than a second as a failure. This would need to be measured and tracked. Or take, for instance, serving the wrong content altogether. In the context of a web server, this could still be an HTTP 200 and be considered a successful request. But if you're serving the wrong content, then it's not really successful.

And finally, saturation. This is the overall capacity of the service. This will require you to understand which resources are most constrained. For instance, maybe you're monitoring an I/O-intensive database. In this case, I'd want to take particular care to monitor the queue depth of I/O operations for the disk. It'd also be important to monitor how quickly my disk would fill up. Whilst it's typical to use indirect signals such as CPU, memory, or disk to measure saturation, you can also determine this with load testing and use static numbers too, in which case you would set alarms based on how soon your traffic approaches your limit. Whatever metric you pick, it needs to be clear where the limit is and how close you are to it.

Now that we understand a bit more about each of these signals and what they mean, let's put it into practice by monitoring a small service. Here we've got Catgen. Catgen is a service that I wrote that has two components, a frontend and a backend. The backend is responsible for serving this picture here of the cat, and the frontend serves the static assets. So if I click on "generate another", we'll see that a cat is served, and if I keep clicking, I'll keep getting cats. The backend has some artificial latency introduced just so we have something interesting to plot. So if I keep clicking this, I'll keep getting cats at varying speeds.

Now, the way this is implemented is that it's all running locally on Kubernetes. So I'll show you: I've got two pods here, a backend and a frontend. Looking at the code, it's a simple Golang binary where I register a Prometheus histogram right here, which is the HTTP request duration, and we'll use this to plot our latency.
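For reference, a minimal sketch of a backend along these lines is shown below. It is not the actual Catgen source; the metric name, image paths, and port are assumptions based on what's described in the talk.

```go
// Minimal sketch (not the actual Catgen source): a backend that serves a
// random embedded cat image with 0-50ms of artificial delay and records the
// request duration in a Prometheus histogram. Metric name, paths, and port
// are assumptions; error handling is omitted for brevity.
package main

import (
	"embed"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

//go:embed images/*.jpg
var images embed.FS // the images directory embedded into the binary

// Histogram for the latency golden signal.
var getCatDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name: "http_request_getcat_duration_seconds",
	Help: "Time taken to serve a cat image.",
})

// catHandler picks a random image, sets the JPEG content type, and adds a
// random 0-50ms delay so there is something interesting to plot.
func catHandler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() { getCatDuration.Observe(time.Since(start).Seconds()) }()

	time.Sleep(time.Duration(rand.Intn(51)) * time.Millisecond)

	entries, _ := images.ReadDir("images")
	data, _ := images.ReadFile("images/" + entries[rand.Intn(len(entries))].Name())
	w.Header().Set("Content-Type", "image/jpeg")
	w.Write(data)
}

func main() {
	http.HandleFunc("/cat.jpeg", catHandler)    // the handler shown in the demo
	http.Handle("/metrics", promhttp.Handler()) // Prometheus scrape endpoint
	http.ListenAndServe(":8080", nil)
}
```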
Now, down here is the actual handler that serves the cats. There's a directory called images that's embedded in the binary; it will pick a random index, serve that image, write the JPEG header, and then introduce a random amount of delay from zero to 50 milliseconds. And you can see that the handler is cat.jpeg. So if I was to run this backend by itself and simply visited cat.jpeg, I would just get a cat JPEG. Similarly, if I hit metrics, I'd get my Prometheus endpoint as well.

The frontend is a collection of HTML, CSS and some JavaScript that is served by an Nginx server. If I look at the Nginx config, you'll see that I define an upstream called backend, and this server, or the DNS name cat server, comes from the Kubernetes service. So if I run a kubectl get service, I'll see the cat server service. So that points at the backend. I set up my access logs and then I have two locations. The root is simply under /usr/share/nginx/html and reads the static files from disk. Beneath that, we can see that /backend/cat.jpeg is proxied to the backend. So if I look at the source of Catgen, if I view the page source, you can see that I have an image that's served as backend/cat.jpeg. So this isn't actually served by the frontend itself; it's served by the backend.

Now, how do we take what we've learned today and apply that? Today I'm going to use Grafana, and I'm going to create a dashboard and we're going to start to plot some key metrics. Given that it's an HTTP server, there are a few things that I care about. For my traffic, which is one of our golden signals, I'm interested in the number of HTTP responses I'm getting. So let's take a sum of the rate of the Nginx HTTP response count metric — we'll plot 30 seconds here — and this will give us a nice graph of the number of requests that I'm getting. So this is HTTP requests, and you'll notice that I already have quite a few requests. If I edit this and change the unit to requests per second, this is how many requests a second I'm getting. And I'm actually getting quite a number of requests already. The reason for that is I actually have curl running here, and I can run that again; this is just generating artificial load. Let's go back to Grafana now. So this is my traffic, which is one of our golden signals.

Let's work on adding in, say, errors. Because it is an HTTP server, I am primarily concerned with HTTP 200s and 500s. Those are my successes and my errors. So let's get the sum of the rate of the Nginx HTTP response count metric, and we want to break this out by status. You can see we've got the HTTP 200s here and then our non-200s, and we can see most of our requests are successful. We'll call this HTTP errors, or HTTP status codes, and let's change the unit to requests per second as well, just to make that clear. Let's change it to the last 15 minutes.

So we've got errors, we've got our throughput — what else do we need? We also need latency, which is one of our golden signals. Latency, if you remember from before, we're actually collecting in our backend. If I go back to the backend Go code, you can see I've got the HTTP request getcat duration seconds metric. So let's copy that and go back here. We also want the sum of the increase of that metric, and we'll pick 30 seconds as our resolution, or interval, for that. And we also want to plot this as a heat map. So change it to a heat map. Where have I gone wrong here? Sum, increase... there we go, 2 seconds.
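Dictated out loud, these queries are easy to mis-hear, so here is a rough sketch of the PromQL behind the panels built so far, kept as Go constants to match the rest of the demo. The exact metric and label names are assumptions (a log-based Nginx exporter for the frontend and the backend histogram sketched above), not confirmed from the talk.

```go
// Sketch only: approximate PromQL behind the dashboard panels described above.
// Metric and label names are assumptions, not taken from the actual demo.
package main

import "fmt"

const (
	// Traffic: requests per second handled by the Nginx frontend.
	trafficQuery = `sum(rate(nginx_http_response_count_total[30s]))`

	// Errors: the same rate broken out by HTTP status code, so 200s and
	// non-200s can be compared on one panel.
	errorsQuery = `sum(rate(nginx_http_response_count_total[30s])) by (status)`

	// Latency: per-bucket increase of the backend histogram, rendered as a
	// heatmap in Grafana.
	latencyQuery = `sum(increase(http_request_getcat_duration_seconds_bucket[30s])) by (le)`
)

func main() {
	fmt.Println(trafficQuery)
	fmt.Println(errorsQuery)
	fmt.Println(latencyQuery)
}
```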
And we'll call this backend latency, and change our unit to milliseconds. Let's change this to buckets. That's a lot of decimals — let's change that to two. There we go. So we've got our backend latency, we've got our HTTP requests, which is our throughput, and we've got our status codes. So this service is looking pretty healthy.

But there's one more golden signal that we're missing here, and that is saturation: how full our service is. Now, Catgen is primarily driven by CPU, so we're going to plot CPU. I already have a CPU usage panel in my library. See, here we've got CPU usage broken out by frontend and backend. One thing you'll notice is that the frontend seems to be doing a lot more work than the backend. A good reason for this is that my curl over in my terminal here is hitting the frontend, but the backend is not actually serving those cat pictures. That only happens when a client actually downloads the image. So this content is being served statically to curl. If I go to my terminal and run the curl manually, you'll see that I get the same output. What I don't get is the client actually downloading the image, which is what would hit the backend. So you're actually seeing less load on the backend because of that. This could tell me that maybe I need to scale up my frontend, or maybe I need to look at logs and see what's happening. In this instance, I'm being hit by essentially a bot, so maybe I need to throttle them, or maybe I need to block them altogether, for instance.

But you can see here that this dashboard provides a pretty good overview of what's happening with Catgen right now in terms of latency, status codes, requests, and my saturation. I could also run some benchmarking and set static high watermarks for HTTP requests, for instance. Say I do some load testing and I know that I can handle 500 requests a second; I could set that there, and I could set some alerting as well to know when I am reaching or approaching that threshold and respond appropriately.

Now, these four golden signals really help us set up a solid foundation, but they're not the only things that I should monitor. There are certainly other things, and certainly other metrics as well. If I go to add another metric, you can see all these metrics here are possibly things that I could add in, and you'll see that there are quite a number of them as well — things from the actual Kubernetes hosts themselves, the kubelets, Prometheus itself. So there are a number of things that we could be plotting, and that's not to say that we shouldn't plot them. The whole purpose of the four golden signals is really to give us a starting point to understand what's going on with the service.

Thank you for listening. I hope you've enjoyed today's talk on the four golden signals and how you might be able to apply them to your service. If you've got any questions, I'd love to chat. You can find me on LinkedIn, or feel free to reach out to me at Michaelmcalester at got.
...

Michael McAllister

SRE Team Lead @ Teleport



