Conf42 Cloud Native 2024 - Online

How to set up reliable monitoring and alerts for cloud applications

Abstract

Learn to ensure your cloud apps stay afloat! My talk covers everything you need to know: From selecting metrics to setting up robust dashboards and alerts for reliable monitoring, including how to leverage eventual failures for improvement. Don’t miss out on mastering cloud app resilience!

Summary

  • Today we'll be talking about reliable monitoring and alerts: everything you need to know about how to set up dashboards and alerts for your cloud applications.
  • Metrics are all about what we want to monitor. We have two types of metrics. First, the business metrics. And on the other side, not detached from them, we have the technical metrics. You can set up dashboards and alerts that cover both.
  • Dashboards are a visual representation of key metrics, indicators, and trends that we want to observe, giving insight into the current state of the system. They provide us with continuous monitoring and analysis. Let's take a look at an example to better understand what a dashboard is.
  • Time series display trends over time and make it easy to spot spikes and seasonality. An important point about dashboards is to plot useful guidelines. Another very important thing is that dashboards need to be actionable.
  • The idea of alerts is that they prompt immediate attention. They are usually targeted at specific individuals or teams. Usually we want to be alerted only by critical metrics. Similar to dashboards, we want alerts to be actionable and clear.
  • How can we learn from failures? How can we investigate, ask the right questions, and improve our system? This is the art of asking why problems happen. Every error, everything that happens that is significant enough to be a problem, should be investigated.
  • So to wrap up, we talked about metrics and why we should select them. We looked at some dashboard examples and useful visualizations. We talked about alerts and good practices for alerts and escalations, plus some tips to investigate failures and improve. I hope it was good for you.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. Welcome to this talk. Today we'll be talking about reliable monitoring and alerts: everything you need to know about how to set up your dashboards and alerts for your cloud applications. My name is Israel and I am Brazilian, living in the UK for the past two and a half years. This is a picture of my family. I'm married and we have a son, six years old right now. As for hobbies, I love playing games and traveling. This is a picture of us traveling across Europe. Currently I'm a software engineer at Meta.
Here's a brief outline of what we'll be talking about today. First I want to touch on why we want to talk about alerts, dashboards, metrics, and other things. Then we go a bit into metrics, how to set up dashboards, what is interesting for alerts, and how to learn from failures that will eventually happen.
So first, why is it important? We have questions like: are the systems up right now? Is everything all right? Is everything working as expected? Are our metrics stable enough? Or at least, are we keeping the expected trends? If you are expecting to grow every month, is it growing? Ultimately, can I sleep in peace? Can I rest assured that the system is healthy and in place while I'm taking a vacation, while I'm sleeping, and be sure that I will be alerted if anything goes wrong?
Well, metrics are all about what we want to monitor. It will be different depending on your scenario, depending on what you want to do. But look at this engine room of a ship. I don't know which ship it is, but we can see lots of different things to check. The ship is going well if the engine has the right pressure, the right tuning for everything. And this is important because when you are on a ship, you want to make sure the ship is working properly. And similarly, that's what we want for our cloud applications and our systems. So we have two types of metrics. First, I would say, the business metrics.
They are the KPIs, goals, user behaviors, everything that we want the product to be, everything that we want the users to see, what's affecting our business. And on the other side, not detached from it, we have the technical metrics. So we are talking about response time, we are talking about error rate, machine constraints like CPU, memory, I/O, network, things like that. For this presentation we'll be focusing more on the technical examples, but it's valid for all of them. So you can set up dashboards and alerts that cover both.
Now let's dive into dashboards: all we need to know at a glance. So dashboards, what are they? They are a visual representation of key metrics, indicators, and trends that we want to observe, insights into the current state of the system, and they provide us with continuous monitoring and analysis. Usually we want to see it over a period of time, like the past x days, hours, weeks, the past month, fortnight, and things like that. Let's take a look at an example to understand better what a dashboard is and how we want to look at it.
So this I took from the New Relic website, the introduction part of it. New Relic is one of the most used ones. This is not sponsored, it's just an example. Here we can see lots of information in the dashboard. It can be a bit overwhelming, but I want to highlight two things here. First, we have different widgets looking at different things. I already mentioned error rate; there's also throughput, transaction time, Apdex score, and we have many other things that could be observed. And we also usually have a period selection, here 30 minutes. But sometimes you want to see if something is stable based on a different seasonality. So for example, depending on your user behavior or your system distribution, it might be worth checking, for example, weekly. So you would see daily going up and down, for example, depending on if it's day or night.
And that could help identify whether the system behavior corresponds to what's expected for that seasonality.
Okay, now that we saw a dashboard, let's talk about some useful visualizations for dashboards. The first one, already mentioned, is a time series. We want time series to display trends over time, and it's easy to spot spikes and seasonalities. It could be weekly, daily, hourly, any type of thing. But most importantly, when you look at it, you can clearly see when it's going up or down and compare with the historical data. You have to check if it's something that's expected or something that's weird and not expected. For example, let's say you have an error rate that's usually stable at 0.1% and suddenly it jumps to 1%. I mean, it's a tenfold jump, it's a significant jump, and you definitely want to take a look at it, especially if you know, for example, a version has been rolled out, and it could be that this version is introducing some new errors in your system. Maybe it's even time to stop the rollout.
Another one, also usually a time series but stacked, is the stacked area chart, for when you want to see the cumulative contribution to a total value. So let's say, for example, you want to monitor the percentage of users using your application through different platforms. You want to see, for example, mobile (Android, iOS), the website, or the mobile website, depending on whether you want to make this distinction. And you can see over time if something is decreasing or increasing. Let's say, for example, you have a critical bug in your Android application; you would see that the Android percentage in the chart drops, disappears, or something like that.
There are many others: bar charts, pie charts, heat maps, scatter plots, cloud charts, histograms, waterfall charts. But don't get overwhelmed by it.
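The error-rate example above (a rate that is usually stable around 0.1% and suddenly jumps to 1%) can be sketched as a simple comparison against a baseline. This is a minimal sketch and not something shown in the talk: the function name `is_spike`, the use of the median as the baseline, and the factor of 5 are all assumptions chosen for illustration.

```python
from statistics import median


def is_spike(history, current, factor=5.0):
    """Return True when `current` is at least `factor` times the baseline.

    `history` holds recent error rates (as percentages). The baseline is
    their median, so a single earlier outlier does not skew it.
    `factor` is a hypothetical threshold; tune it for your own system.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)
    if baseline == 0:
        return current > 0  # any error is a change from a zero baseline
    return current / baseline >= factor


# Error rate (in %) sampled over the past hour, then a jump after a rollout.
recent = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10]
print(is_spike(recent, 0.11))  # False: within normal variation
print(is_spike(recent, 1.0))   # True: roughly ten times the baseline
```

A relative check like this is only one option; a fixed threshold or a comparison with the same hour last week may fit better when the metric has strong seasonality.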
We need to know how to read those charts, we need to know what they mean, and most importantly, what action we want to take from them. So if you have a chart that's showing a deviation, something weird that you can't interpret, that's probably useless. So be simple. I highlighted two of the main ones, but be simple, understand them, understand what metrics; remember, we need to define what metrics we want to observe. And some of these tools will probably already give predefined metrics to follow. And as we get used to it, as we think, hey, I should have observability for some specific things, it might be worth checking other things. So for example, bar charts: I usually use them to see the amount of each specific error in the error rate, so the contribution. Let's say we have an increase in the error rate, and I quickly look at a bar chart for the past hour to see which error is the most frequent, and we can quickly find the issue, or at least find it quicker.
An important thing about dashboards is to plot useful guidelines. So sometimes you have the expected value. Let's say, for example, you expect the error rate to be between 0.1 and 0.2. So maybe you want to have two lines plotted in the dashboard to help you see if it's quite close to one of the thresholds or not. Also plot useful limits. We'll talk about alerts later, but think about something that you want to be alerted by. Maybe it's worth also bringing this to a dashboard. Also have handy filters. You might want to filter specific surfaces or information. Let's say you identified through the dashboard that the Android application has an increased error rate. So you might want to filter: okay, now show me only events coming from the Android application. And then you can check, oh, what is happening, when did it start? And find your way through it quicker.
Another very important thing is that dashboards need to be actionable. So, let's keep the example: error rate.
Error rate is up. Okay, why? Which errors are happening? And, okay, show me the stack trace. I found that this is the most common error; just click through it, go to the stack trace, the bugs too. So it's important to have easy access to drill down into specific metrics, specific widgets that we want. But be careful. It's also ideal not to build a too overwhelming dashboard, because again, this is something we want to take a quick glance at to understand the status of the system, and if it's too complex, too overwhelming, too much information, we'll end up missing important things instead of having them clearly jump out at us. That's it about dashboards.
Let's talk about alerts a bit. Alerts, I like to say, exist because we all deserve peace of mind, and they help with that. So what are alerts? A good example is the cuckoo clock. At specific times it just says, hey, it's 5:00 p.m., it's 4:00 p.m., it's 3:00 p.m., or something like that. I don't know why it got reversed, but you got it. So alerts are basically automated notifications. But of course you don't want to be notified because it's 5:00 p.m.; you probably want to be notified that the error rate spiked by ten times. So we basically define what we want to observe: what metrics, what thresholds. You define: okay, above this, alert me, send a message to my phone, or if it's way above it, call me, or something like that.
And the idea of alerts is that they prompt immediate attention. They're usually targeted at specific individuals or teams. Some companies I know have a team responsible for monitoring, so they are usually the first level of monitoring. They would receive an alert, look at dashboards, identify which team or individual it should be escalated to, and then involve them. But most companies don't have that. Alerts are just tied to specific people or specific teams, depending on what the alert is, and then we can take action.
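The idea of "above this, send me a message; way above it, call me" can be sketched as two thresholds mapped to two notification actions. This is only an illustration under stated assumptions: the names `Action` and `decide_action` and the threshold values 0.5 and 1.0 are hypothetical and do not come from the talk or from any particular monitoring tool.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    MESSAGE = "message"  # e.g. a push notification or chat message
    CALL = "call"        # e.g. a phone call to whoever is on call


def decide_action(error_rate, warn_at=0.5, page_at=1.0):
    """Map an error rate (in %) to a notification action.

    `warn_at` and `page_at` are hypothetical thresholds; both the names
    and the default values are assumptions made for this sketch.
    """
    if warn_at > page_at:
        raise ValueError("warn_at must not exceed page_at")
    if error_rate >= page_at:
        return Action.CALL
    if error_rate >= warn_at:
        return Action.MESSAGE
    return Action.NONE


for rate in (0.1, 0.6, 1.5):
    print(f"{rate}% -> {decide_action(rate).value}")
```

Keeping the severity decision in one small function makes it easy to see, and to review, exactly when someone gets woken up.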
So alerts are very important for keeping us up to date on anything that seems to be going wrong, not only what has already gone wrong. I will explain that too. So, alerts for peace of mind. If you want to relax like this person in the picture, usually we want to be alerted only by critical metrics. If the alerts are too noisy (by noisy, I mean you get alerted every five or ten minutes, or every hour, or even every day), you will probably end up ignoring them. And that completely defeats the point of the alert. So we want to be alerted by the critical metrics, be it business metrics or technical metrics. We want to have clear severity levels. If it's a critical failure, let's say the system is not responding anymore or your main page is out, it's just returning 404, we need to know that. And ideally we have different levels of alerting for that too.
But we also want alerts set up to give us early visibility of possible issues. I remember once in a company, we almost had the database down because the disk was getting full; not too slowly, but not too fast either. So, TL;DR, the disk was getting full. The problem is that it was getting full at a pace that avoided the alerts for sudden movements, but it was also quick enough that it would happen over the weekend. On a Friday, people found out: okay, we have a problem. But if we had an alert saying, for example, at this trend we'll be out of storage next week, it could have saved us some headaches.
And similar to dashboards, we want alerts to be actionable and clear. If you receive an alert, you should know exactly what this alert means and what you should do. What you should do could be: okay, let's take a look at a specific dashboard and understand this more to see if there is a problem and what the problem is. But also, is there an escalation path? Is there someone I should call, or someone that should be working to solve this issue?
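The disk story above suggests an alert on the trend rather than on the current value. One way to sketch that, assuming daily samples of used space and roughly linear growth, is to fit a straight line and extrapolate to the disk capacity. The function name `days_until_full`, the sample numbers, and the seven-day alert window are all assumptions for illustration.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the disk is full from (day, used_gb) samples.

    Fits a least-squares line through the samples and extrapolates from
    the latest sample to `capacity_gb`. Returns None when there is not
    enough data or when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(used for _, used in samples) / n
    var_x = sum((day - mean_x) ** 2 for day, _ in samples)
    if var_x == 0:
        return None  # all samples were taken at the same time
    slope = sum((day - mean_x) * (used - mean_y) for day, used in samples) / var_x
    if slope <= 0:
        return None  # not growing, so no projected fill date
    _, last_used = samples[-1]
    return max(0.0, (capacity_gb - last_used) / slope)


# Hypothetical numbers: a 500 GB disk growing by about 20 GB per day.
usage = [(0, 400), (1, 420), (2, 440), (3, 460)]
remaining = days_until_full(usage, capacity_gb=500)
if remaining is not None and remaining < 7:
    print(f"ALERT: disk projected to fill in {remaining:.1f} days")
```

With these sample numbers the projection is two days, which would have fired well before the weekend. Real growth is rarely perfectly linear, so treat the estimate as an early warning and not as a precise deadline.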
And on the other hand, if no alert was fired, it should also be a signal that everything is right. Of course, we sometimes miss observing something. We'll talk about it later. But in an ideal alert setup everything critical is monitored, and if no alert is open, it should be fair and okay to assume everything is all right.
Okay, so moving then to the last part, how to learn from failures. We can set up dashboards, we can set up alerts, but things will break, things will happen. And this is the art of asking why problems happen. This is a very simple one. I tried to look for some accidents or things, but I thought they looked too scary and might trigger something for someone. So, okay, let's take a simple example: just a flat tire. Problems happen, and maybe this is fine. No, this is probably not fine, in the sense that we need to act. It's fine in the sense that we don't need to panic. But we need to act. If you keep the flat tire, you can't continue with the car, at least not for a long distance.
Okay, how can we learn from failures? How can we investigate, ask the right questions, and improve our system, improve our observability, improve our alerts, dashboards, and monitoring? First we have to ask: why did this incident happen? And usually it's not a matter of asking why, receiving one answer, and being happy with it. I have an example that I love. Let's say the server was shut down. And let's get out of the cloud for this example, because the cloud makes things more complex; let's say you have a data center and suddenly the server was shut down. So you check there and, oh, actually the power is out. Someone unplugged it, or it was unplugged. So you just plug it back in and the server is back. Okay, that's fine. Problem solved, right? Yeah, problem solved for that time. But we also have to ask: why was it unplugged? Did someone pass by and accidentally unplug it? Did something happen and someone, in a panic, unplugged it?
Or maybe the janitor, not knowing what was happening there, unplugged it to clean and forgot to plug it back in. I mean, there could be several reasons, and we need to keep asking why this happened until we get to the root cause. This is the first part.
And then we have to ask: what could have prevented it? Is there any alert that could have triggered? Is there any alert that actually showed us a problem? And then, is there any alert that could have triggered earlier? In the case of the server, of course not. But maybe there could be an alert in the sense of: hey, someone got access to the server room. Is it okay? Is it an authorized person? Okay, I'm probably stretching the example here, but I'm just trying to give some hints and ideas. We also have to understand: were we caught off guard? How can we detect this earlier? And so on. So every error, everything that happens that is significant enough to be a problem, I think should be investigated, and a plan should be made to improve. We can improve, of course, the system itself: fix the bug, improve the scalability, and so on. But we also need to improve the dashboards, improve the alerts, and make sure that if we get close to having the same issue, we catch it as early as possible. As I said: any missing dashboards, any missing alerts, anything that could have been better.
So to wrap up: we talked about metrics, why we should select metrics and some types of metrics we should select. On dashboards, we looked at some examples, what we should look at, and some useful visualizations. We talked about alerts, good practices for setting up alerts and escalations. And last, we talked about learning from failures, with some tips to investigate failures and improve. Thank you. That was it. I hope it was good for you and that you enjoyed it, and feel free to connect with me on LinkedIn.

Israel Heringer

Software Engineer @ Meta

Israel Heringer's LinkedIn account


