Conf42 Site Reliability Engineering 2021 - Online

Incident Response, Incident Management, and Alerts - Where do they fit in CloudOps


Abstract

Responding to incidents is not just about wiring up the right tools; it is also about having a strategy and a process for how to respond, how to record the details of incidents, and how to learn from them after everything has been resolved. There has been a lot of confusion about the relationship between incident response, incident management, and alerting. In this session we will talk about the differences and the best practices for these key processes in healthy cloud operations environments. Expect to learn in this session:

  • What is an alert, and how does it feed incidents?
  • What is the difference between incident response (IR) and incident management (IM)?
  • What are the best practices for IR and IM tooling?
  • How are incidents handled in modern dev environments?

We will also talk about on-call scheduling, shift-left, and how machine learning supports incident response strategies.

Summary

  • You can enable your DevOps for reliability with ChaosNative. Incident response, incident management, and alerts are not the same thing. Where do they fit in cloud operations?
  • Chris Riley is a senior technology advocate at Splunk. The nature of applications has changed to be similar to the nature of cars. Think of your 1998 Honda as your monolith and your 2020 Tesla as your microservices, cloud-native application. How we approach fixing those problems also has to change.
  • As we transform, we can't just think about supporting our applications the same way. Instead of one server, we have many, many services, all doing very small, very specific things. Because of the complexity of these modern applications, the life of those who are on call becomes difficult.
  • Incident management is part of an entire architecture for the lifecycle of an alert, from the alert to mobilizing a responder and potentially taking action within the incident itself. What we get with incidents over alerts is less noise.
  • Join me at another virtual event. The content here is fantastic. For you to take a little bit of your time and spend it with me means a lot. So I hope to see you at another event and have a great day.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.

Hello and welcome to my talk: incident response, incident management, and alerts, and where they fit in cloud operations. Now, if you just read that and you're thinking to yourself, "aren't all of those the same thing?", then you've come to the right place, because they are not the same thing. And that's exactly the purpose of this talk. I'm very passionate about the idea that these are not the same thing.

Before we begin, I want to use a little analogy. This analogy comes from the fact that I really want a Tesla. Some of my coworkers recently got Teslas, and I'm a little bit jealous, so maybe I want to emphasize the Tesla more than this particular analogy deserves, but really, it's applicable. When I was growing up, one of my first cars was, well, it wasn't a Honda Accord, it was actually an Acura, but it's very similar. With a 1998 Honda Accord, when something breaks, it's pretty clear, and we have a good understanding of what to do when that happens. There are a lot of things that can go wrong in the 1998 Honda, but they're nothing compared to what we see in the Tesla. Why is that? Well, technology has changed. In the Tesla, not only are we dealing with some of the same mechanical challenges you find in the Honda, we also have a whole bunch of other technologies bundled in. So, yes, things can go wrong with your brakes and your tires and your motor, even though the nature of the motor is extremely different. But the Tesla is also running a whole bunch more software than you ever found in the Honda, and the complexity of that software is such that it also runs in a cloud-native way, with containers and microservices, et cetera. So when something breaks in your Tesla, you can't just rely on an alert or a check engine light to find out exactly what went wrong. You have to dig deeper, and to dig deeper, you need better tools.

That's what this talk is all about. The nature of applications has changed to be similar to the nature of cars. Think of your 1998 Honda as your monolith and your 2020 Tesla as your microservices, cloud-native application. As a result, how we approach fixing those problems also has to change.

So my name is Chris Riley. I am a senior technology advocate at Splunk. What does that mean? Basically, I was a developer. I really enjoyed software development, but it wasn't my forte, so I'm a bad coder turned advocate. I couldn't give up the lifecycle and what it takes to build better applications faster, so now I talk about it. If you scan that QR code, you'll get access to other information about me. Please connect. I love to hear from people who have attended my sessions, both good and bad. If you have feedback on how I can do better, please let me know; that's how I improve. But also just reach out to say hi. There are a few fun little games on my social profile there.

So back to what we were talking about: transformation is a given in the technology space. It is hard, even in a six-month span, to keep current on all the things that are going on in the DevOps and application development market. So as a result, we have to think about change as a constant.
Now, we can't just expect to do what we did historically with modern applications. That also has to change, and ideally the way we monitor and support our applications should be ahead of the transformation of the application itself. Most organizations, we find, are currently in the lift-and-shift and refactor stage of application development. Some companies are born cloud native and have the practices I'm talking about today ironed out from day one, where they've actually embedded their monitoring as a feature of their application. But most people don't have that luxury. We're shifting our workloads to the cloud and then breaking them apart, for example with the strangler pattern, making them more like cloud-native applications.

Because this is happening, we have to acknowledge that as we transform, we can't just think about supporting our applications the same way. In a monolithic application architecture, we could always go to that one server, pull up the alerts, and kind of see what was going on. And generally, because these servers were set-it-and-forget-it, the things that went wrong were predictable, like the 1998 Honda. I don't know if you've experienced this, but back when I was maintaining servers in data centers, we always had that one server, the Billy Bob server, and every two weeks we had to restart it. When we restarted it, everything was magically better again. "Did you try turning it off and on?" We did, and it fixed the problem. We weren't really looking at what was happening in the application, and alerts actually were enough context to address the problem, because that infrastructure never changed. The number of things that could go wrong was fairly limited.

Now we're talking about applications with distributed, microservices-based architectures. So instead of one server, we have many, many services, all doing very small, very specific things. Each service has a contract with the rest of the organization, but the service itself doesn't really care: it takes in some sort of input and it outputs some sort of data, and anybody can consume it. Because the web of these things can interconnect in unknown ways, it's not possible for me as an operator to go to the Billy Bob server, because there is no equivalent Billy Bob service. There is no one service that I can always go to, look at its events, restart it, and be good. We are not able to conceptualize and hold in our minds the architecture of these applications, the entire nature of the application, like we could in monolithic days. So we can't expect to go and find the problem. The problem has to find us, and it has to find us with good enough context that we can resolve the incident quickly.

So when we start to look at the relationships between the services, we have a visual more like this. And again, if we were to approach it with an alerting approach, we could probably go and find an alert that would give us some sort of detail. But it may be the wrong source, because we can have an API gateway throw an alert and say things are broken, only to find out way too far down the road that it had nothing to do with the API gateway.
It wasn't the API gateway that was causing the problem at all; the problem just cascaded down, and the API gateway was the first to scream at us. So this becomes very complicated, and as you can see, alerts are not sufficient in this type of scenario. But also, because of the complexity of these modern applications, the life of those who are on call becomes more difficult. Nobody wants to be paged at night. Being on call sucks. You could be a developer who's on call, you could be an SRE, a cloud operations engineer. Everything we do, and we are very guilty of talking about features and functionality of technology, but everything we do is in the service of making the on-call experience better. That's ultimately what we're trying to do here. We can get all convoluted about monitoring and observability and all that stuff, but really all we're trying to do is make that on-call experience better. What does better mean? Well, first of all, it means hopefully you're never woken up at night. That may still happen. Second, it means if you are on call and you are woken up, you're the right person: it actually gets to the person who should be resolving the problem. And if you aren't the right person, hopefully you're given enough data that you can resolve it yourself or know exactly who to go to to resolve it. So logs simply are not enough. You cannot be on call and just go from an alert to logs. Everything has to change.

Because of that, we get this beautiful new term, and it's called observability. If you're really cool, you abbreviate it o11y, maybe even pronounce it "olly." That's the upper echelon of the terminology club there. If you're a good curmudgeon like me and you hear these terms, the first thing you do is say, "no, we don't need another term, it's just monitoring." Yes, you're right: observability is just monitoring. But the reason we use the term observability is that the nature of monitoring has changed, which means the things we need to do to support our applications better have also changed. And in the world of observability, alerts are not incidents, and incidents are not tickets. Now, if you remember, at the very beginning I said you're in the right place if you think these are the same things, because they're not. They are fundamentally different, and they need to be different. Whereas alerts potentially got us to a log, and that was good enough, we now have to live in this world of incidents. But incidents imply something completely different from ticketing and using a ticketing system.

Let's go back to the car analogy and see how this comes together. Alerts are like your check engine light. Your car has told you something is wrong; a system inside your vehicle has screamed "I am broken" or "something is not correct here." That's all you generally get with an alert: a pointer to an issue. Now, that's valuable. If you're driving down the road and the check engine light comes on, you're probably going to be more hyper-aware of your environment and maybe start to think about taking action. So you pull over to the side of the road. What's the next step? Well, unless you have smoke coming from your hood and you know you're overheating, or there's something else obvious you can address, the alert itself doesn't tell you enough to fix the problem, so you need help. That's incident response. Incident response is your roadside assistance. It's who you call to come and support you in finding the problem and resolving it.
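To make that distinction concrete, here is a minimal sketch of how an alert differs from an incident as a data shape. This is only an illustration under my own assumptions; the class and field names (Alert, Incident, responder, runbook_url, and so on) are hypothetical and do not describe the API of any particular tool.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import List, Optional

    @dataclass
    class Alert:
        # An alert is a pointer to an issue: one signal from one system,
        # with little context beyond "something tripped a threshold".
        source: str          # e.g. "api-gateway" or "checkout-service"
        metric: str          # what tripped, e.g. "error_rate"
        value: float
        fired_at: datetime

    @dataclass
    class Incident:
        # An incident aggregates related alerts and carries the context a
        # responder needs: severity, who is mobilized, linked runbooks, and a
        # timeline that later becomes the system of record (the "Carfax").
        title: str
        severity: str                      # e.g. "SEV-2"
        alerts: List[Alert] = field(default_factory=list)
        responder: Optional[str] = None    # the person actually paged
        runbook_url: Optional[str] = None  # linked context, e.g. a runbook page
        timeline: List[str] = field(default_factory=list)

        def acknowledge(self, who: str) -> None:
            # Record who took ownership and when, for the historical record.
            self.responder = who
            self.timeline.append(f"{datetime.now(timezone.utc).isoformat()} acknowledged by {who}")

The point of the sketch is simply that an alert carries one signal, while an incident carries the aggregation plus the context needed to mobilize and act.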
Now, back to the roadside assistance: maybe they get there and they can jump your battery, replace your battery, give you coolant, whatever it is. Maybe they can fix the problem; that's the ideal scenario. Or maybe, with their expertise, they understand that this has escalated and they need to bring in help. How do they do that? They take you to the shop. You go into the shop, they plug you into a computer, and they get diagnostic codes, or in the case of the Tesla, a whole flood of information coming out of the system. So incident response is that whole lifecycle of mobilization, and you want it to happen as quickly as possible so you can resolve the incident.

Well, where does that put incident management? The foundation of incident management came from ITSM practices, and it's driven with a ticketing system. In the States we have something called Carfax, and Carfax is the equivalent of incident management. It is a history of all the things that have happened to your automobile. Once your car has been brought into the shop, a Carfax record will be created, an incident will be created, and it will hold the historical details of everything that happened, which you can potentially use in the future. That is what incident management is and should be. Many organizations will try to use incident management and ticketing as their incident response tool, and that's not the most effective way to do it. The reason is that a lot happens in the lifecycle of an incident: from the alert, to the operations of the incident (which is not just an alert; as a matter of fact, an incident can be many alerts put together), to paging and escalation, getting to the right person at the right time, giving them all the context, and then finally moving that into chat ops or a ticketing system or whatever it is to keep that system of record.

Now, if the earlier view didn't look complex, this view looks even more complex. This is an entire architecture for the lifecycle of an alert: from some sort of system or tool, to creating an incident, mobilizing a responder, and potentially taking action within the incident itself. So what we get with incidents over alerts is, hopefully, less noise, depending on the nature of your systems. I'm still blown away by the number of systems I see where the team says, "oh yeah, we get a thousand alerts a day and we just kind of ignore them." Why don't you go fix them? That should be your first effort. But the idea of an incident is less noise, because incidents are aggregations of alerts. There are rules and logic associated with when something actually becomes an incident, when it is significant enough to be an incident. Incidents also give us more context. Alerts trigger off of something very specific in an application or the infrastructure, so the amount of detail is very limited, whereas an incident can draw on a history of incidents and can be linked and correlated with Confluence articles, runbooks, et cetera. Which gets us to the point where incidents can drive action: because there is more context and detail as part of the incident, there can be greater action and greater execution as part of that. Now again, all we're really trying to do here is make the life of on-call suck less, which is great because it also makes the application run better. So in a typical incident lifecycle, we have the process of acknowledgment.
You touch a whole bunch of tools, and you might reach out to a whole bunch of people. Usually there are two methods of interacting with other people. The first method is what we call spray and pray, where you blast out to the entire organization and hope somebody picks it up. Very common in the NOC-type environment. It used to work okay, but it was never super effective. What happens is there's a lot of back and forth between people, you're touching a lot of tools to find the information you need, and the number of touch points is tremendous. So it usually lasts about six hours and five reroutes, et cetera. This is totally based on the nature of your application, but it is actually a consistent average we have seen with customers. Then, once you get to the source of the data, you've found the right person. Or maybe you do what's called lazy mobilization, where you always page that one person who fixes everything, but then you're quickly on the path to burning them out. And you start towards resolution. Resolution is not just "oh, we need to restart the service"; resolution is "we restarted the service and now everything is green," because usually, and especially in a microservices world, the source of the problem will cascade into other services and cause issues down the road, and it takes time for those all to come back online.

So with true incident response, as opposed to response driven by a ticketing system, we're trying to normalize and flatten that curve. We're trying to get acknowledgment to be as quick as possible; that's the mobilization, the roadside assistance. We're trying to touch as few tools as possible; that's context and understanding, getting to the right person. And then we want the tools, like runbooks, to resolve the incident as quickly as possible. So this is how it all comes together: our observability practice, which is monitoring, gives the alerts and the context to our incident response practice, which is mobilization and action, which feeds our incident management and ticketing process, which is our record and tracking of incidents over a long period of time.

Another way to look at this is as a hierarchy of knowledge, or insight, or ultimately success. At the bottom are the alerts. We get tons of alerts; some of them are useful, some are not. These roll up into incidents, where we have more meaning and can start to mobilize, do something meaningful with the alerts, and get them in front of the right person. Then, through troubleshooting, the incident links us directly to a dashboard. Once we are at that dashboard, we start to get insights into what's actually going on. We may dive into a log at that point, or, in a microservices environment, we may be looking at traces and spans for the details of what's gone wrong. Everything we need to resolve the incident as quickly as possible should be right there. Once we have insight and we know what's going on, we take action. So we want to compress the time between all of these steps, and we also want to reduce the noise, which means shrinking the width of each of these steps. So if you came here thinking that incident response, alerts, incidents, and tickets were all kind of one and the same, hopefully you now understand that they are not.
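As a rough sketch of the kind of rules and logic described above, here is one way alert-to-incident aggregation and targeted paging could be wired up. Everything in it is a hypothetical illustration under my own assumptions: the threshold, the time window, the escalation order, and the notify stand-in are not the behavior of any specific product.

    from collections import defaultdict
    from datetime import datetime, timedelta, timezone

    # Hypothetical rule: several related alerts from the same service within a
    # short window are significant enough to become a single incident.
    ALERT_THRESHOLD = 3
    WINDOW = timedelta(minutes=5)

    # Hypothetical escalation policy: page people in order until someone acknowledges,
    # instead of spraying the whole organization.
    ESCALATION_POLICY = ["primary-oncall", "secondary-oncall", "team-lead"]

    recent_alerts = defaultdict(list)  # service name -> timestamps of recent alerts

    def notify(who: str, message: str) -> bool:
        """Stand-in for a real paging integration; returns True if acknowledged."""
        print(f"paging {who}: {message}")
        return who == "primary-oncall"  # pretend the primary answers

    def handle_alert(service: str, detail: str) -> None:
        """Aggregate alerts per service and open an incident when the rule fires."""
        now = datetime.now(timezone.utc)
        recent_alerts[service] = [t for t in recent_alerts[service] if now - t < WINDOW]
        recent_alerts[service].append(now)

        if len(recent_alerts[service]) >= ALERT_THRESHOLD:
            message = f"incident opened for {service}: {detail}"
            # Mobilize: walk the escalation policy until someone takes ownership.
            for responder in ESCALATION_POLICY:
                if notify(responder, message):
                    break
            recent_alerts[service].clear()  # the incident now owns these alerts

    # Example: the third error-rate alert within five minutes opens one incident
    # and pages one responder, rather than generating three separate pages.
    for _ in range(3):
        handle_alert("checkout-service", "error_rate above SLO")

In a real environment the aggregation rules, the linked context such as runbooks and dashboards, and the on-call schedule would live in your incident response tooling; the point here is only that incidents come from rules applied to alerts, and mobilization follows a policy rather than spray and pray.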
And here's what I'm encouraging you to do if you have not established a practice of incident response, if you're just staring at your monitoring tools and dashboards and hoping that at some point something interesting happens on them, or you're using a ticketing system and hoping that somebody is really fast at typing tickets and puts enough detail in there that somebody else will grab it and resolve the incident: build a true incident response strategy. It is necessary for the SRE practice. It's necessary for any sort of cloud operations, because the nature of applications in the cloud has changed. They're distributed. You don't have the Billy Bob server that you can always go to to resolve the problem the same way you always resolved it. You don't have the 1998 Honda. You are now driving the Tesla. Congratulations. Now you have to support it in a different way.

So hopefully that was compelling. Please, again, reach out if it makes sense, and thank you so much for joining me. Hopefully you enjoyed it. Join me at another virtual event. The content here is fantastic. I know there's a ton of information out there, so for you to take a little bit of your time and spend it with me means a lot. I hope to see you at another event. Have a great day.

Chris Riley

Senior Technology Advocate @ Splunk



