Conf42 Platform Engineering 2023 - Online

Incident Response? Let's do science instead


Abstract

Being on-call is stressful. Stress impacts our brain and the way we think, making decision-making harder, but resolving incidents relies on making good decisions quickly! This talk looks at a more scientific approach to incident decision-making using Karl Popper's theory of falsifiability.

Summary

  • Ivan Merrill: This talk is an evolution of my thinking around incidents, having spent 15 years in the financial sector working primarily in monitoring and observability, together with research that other people have undertaken, to help you.
  • The tech industry is still pretty young. Incident response can learn a lot from safety engineering in other domains. Too often it can descend into chaos. Let's make sure that we can take these ideas into the tech industry.
  • Incidents are a set of activities bounded in time that are related to an undesirable system behavior. One of the hallmarks of a complex system is the potential for a catastrophic outcome. And the fact is that incidents aren't actually very easy.
  • An over-reliance on dashboards and runbooks can lead to a low mean time to innocence for every team. Spending a long time on the wrong hypothesis is a massive time sink and can be costly for organizations. Fear of failure is a big thing, and it is really important during incidents.
  • History doesn't repeat itself, but it often rhymes. And so it's really important that we learn and we improve with every incident that we have. But there are some pitfalls that we can fall into here, too. We need to record the context of why decisions were made.
  • The complex system isn't just the technology, right? It's actually everything involved, and that's the humans as well. As we introduce change, we introduce new forms of failure. The next failure is likely to be even more catastrophic.
  • A more scientific, hypothesis-driven approach to how humans perform and document incident investigations can improve reliability. We can apply Karl Popper's theory of falsifiability to creating hypotheses within incidents. But transferring troubleshooting experience is really hard in most cases in many companies.
  • Think of lots and lots of different hypotheses. Keep it high level, though, and don't go too deep. Try and disprove as many of these as possible. Simplest is often the most common. All practitioner acts are a gamble. But hopefully by introducing more scientific method, by recording our actions, we can reduce this gamble.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Ivan Merrill, and this talk is Incident Response? Let's do science instead. This talk is an evolution, really, of my thinking around incidents, having spent 15 years in the financial sector working primarily in monitoring and observability: rolling tools out, educating people on how to use them, and getting involved in incidents. And as you can imagine, working with some fairly big financial institutions, some of those incidents can be pretty scary and have quite a large impact. So what I've tried to do is look at some things I've seen go really not so well before, some ideas around what I think we can do better, and also add in some research that other people, far cleverer than me, have undertaken, to give you, hopefully, some really practical information that can maybe help you with incidents. So first things first: incident response, I really strongly believe, can learn a lot from safety engineering in other domains. The tech industry is still pretty young, and as someone who's spent all my career, really, on the operations side, it's great to see that SREs are now getting more focus, and that DevOps brought another bit of focus onto ops roles and ops functions. But with big, complex incidents on big systems, we're still pretty new at it. We're still pretty young at it. And when we look at aviation, health, the emergency services, they've been dealing with really important incidents for an awful lot longer than us, and their incidents are often much more impactful; they are quite literally life and death. So it's absolutely right that we look at these industries, look at these domains, and understand why they're doing what they're doing. I would be horrified if I saw some firefighters turn up and everyone just went for themselves and crazily started spraying water onto random bits. That's not what they do. They have a clear structure to what they're doing and some thought behind their decision making. So let's make sure we take these ideas into the tech industry as well, because we really need to; too often incident response can descend into chaos. So firstly, a definition. I love a definition, and John Osborne and Richard Cook are far better placed to give one than I am, so I've relied on their authoritative view here to take this definition: incidents are a set of activities bounded in time that are related to an undesirable system behavior. And there are a couple of things I'd like to take away from this. One is that it says an undesirable system behavior; it doesn't just say something broke. The second is that it says a set of activities, not just one thing. So it's not just one thing, and it didn't just break: it's a set of activities bounded in time that are related to an undesirable system behavior. So we have a definition. And why is all this important? Well, the fact is that catastrophe is always around the corner. So says Richard Cook, who was an amazing person, and I strongly suggest you seek out his work. He wrote How Complex Systems Fail, and this is one of the items in that list, really: we are never far away from catastrophe. And in fact, it's actually worse than that, because of the complex systems that we are building.
One of the hallmarks of a complex system is the potential for a catastrophic outcome. So just the fact that we are building these complex systems means that we are creating the potential for catastrophe. We can never escape it. We can build systems as resilient as we can, and we can invest an awful lot of time, effort and money in making them resilient, but the fact is that there will always be a way for them to reach a catastrophic outcome. Some things are completely beyond our control, and so we need to work out how to actually deal with this catastrophe. And the fact is that incidents aren't actually very easy. They're generally pretty difficult; if they were easy, I would not be doing this talk and sharing some experience and hopefully some useful information with you. And because they're pretty hard, there are a number of common pitfalls that people generally seem to fall into, that I've certainly seen anyway. So I'd like to just cover these, and I'm sure some of them will seem quite familiar to you. The first thing is an over-reliance on dashboards and runbooks. What happens here, and I've seen it quite a lot, is that people generate dashboards, which are a predetermined set of signals. They can be great signals; that's a whole other subject. But they've got these signals and they're trying to understand what's going on in this incident based on that predetermined set of signals, and they might not be able to work out what's going on, because this incident doesn't map nicely to their dashboard. And particularly when you have several teams that all have their own dashboards, everyone's looking, their dashboards are all green, but the incident is still occurring. I found out that this is actually called the watermelon effect, which I really like: everything is green on the surface, but if you scratch that surface, everything is red underneath. And it can easily lead to a very low mean time to innocence for every team, because everyone's looking at their dashboard, their dashboard is green, they're all saying they're fine, but the incident is going on. And runbooks aren't necessarily any better. This is not saying runbooks aren't great; they can absolutely be great. But we mustn't only rely on our runbooks, because if all that ever happens is that teams only know how to respond to incidents by looking for a relevant runbook, then all they can do is: if situation X occurs, do Y. There is no room for these people to build the troubleshooting skills that are so important in incidents, and they're not gaining any experience in how the system works. They're not rebuilding their mental model of their complex system with every incident. There is no learning going on here. So we need a way to investigate incidents that allows us to build the necessary experience, learn what works and what doesn't, and helps us upgrade our mental models, because our systems change. And another thing that I have seen, which hopefully quite obviously seems bad, is guesswork. We know we need to do something, but we don't have a clue what to do, and so we sometimes guess. And there are many forms of guessing.
Some things are just really big gambles; some are a bit more of an educated guess, like immediately failing over to a second site. And the fact is that sometimes it does actually work. But that's luck, and luck is not a reliable strategy. We can't build up a troubleshooting skill based on luck. There are no learning opportunities in it, there is no hypothesis being built, there's no context to our decision making, and we're certainly not able to build a runbook based on luck. Roll the dice and take option five? That's not going to work. So we need a way to structure our thinking and help us move forward when information is low, because we do find ourselves in situations where it is really not clear, from the information we have, what to do. So how do we deal with that? Spending a long time on the wrong hypothesis is a massive time sink and can be really costly for organizations. So many times we see this: we've got a high error rate, customers are complaining, things are breaking, so we look at whatever's giving us the most errors, we follow that through, we spend an hour or so investigating, and everything looks like it's absolutely going to be this thing. Then suddenly a little bit of information appears and it completely blows our hypothesis apart: it cannot be the thing we've just spent an hour investigating. And we're human. When we have an idea of what's going on, we suffer from confirmation bias: we look for things that are going to reinforce our hypothesis. We want to believe in ourselves; we feel this is what it's going to be. And so we surround ourselves with information, we seek out the information that's going to agree with what we think. This is confirmation bias, and we need a way to prevent it as much as possible, because we do not want to be spending a lot of time investigating the wrong things. Fear of failure is a really big thing, and it's really important during incidents, because incidents, no matter how much best practice we apply, are high stress situations, and quite often good practices and good decision making can go out of the window. If we're scared of doing the wrong thing, we may never take action. In fact, that's one of the key factors in procrastination: we are generally just scared of what's going to happen if we get it wrong, and so we just never start. We need psychological safety in incidents. We need to feel good about the situations we're in and the decisions we're making. We can be 99% certain that this is the thing, or we can have all of the evidence and have done everything right, but still be fearful of the consequences. And that, again, can delay resolving the incident. It's also really not nice for the people involved in incidents, and it doesn't make people want to get involved in incidents either, which we need them to. So yes, fear of failure is a really important thing; we need to provide a way to have psychological safety in our decision making. So, I've covered a few things that I've seen quite a few times in incidents, and I really like this quote: history doesn't repeat itself, but it often rhymes.
And I think this applies to incidents as well, because we shouldn't really ever be seeing the same incident over and over again, but we can quite often see patterns of incidents. So it's really important that we learn and we improve with every incident that we have. Many organizations are now performing some kind of post mortem or post incident analysis, whatever you want to call it, but there are some pitfalls that we can fall into here too, so I'm just going to take another moment to look at a couple of those. The thing is that it seems easy to look back at an incident and determine what went wrong; the difficulty is understanding what actually happened and how to learn from it. Hindsight is 20/20, right? It's not actually very useful to look back and go, oh, this thing happened. There is no understanding when we point out something that just seems really obvious afterwards, because it didn't seem obvious at the time. If it was obvious, people would have taken that course of action. They would have seen that it was a bad deployment, redeployed, and everything would have been better. Well, that's quite a simplistic view of an event, and if it was that simple, people would have just resolved it straight away. But it wasn't, so it didn't feel that simple. And if we take that kind of approach, then we are preventing learning. So one of the things we can understand from this is that we need to record the context of why decisions were made. If it seems so obvious afterwards, why didn't it at the time? Why were people thinking it was this? What was the context of the decision that they were making? That's something that's quite often missing in post incident analysis. We can also talk about this in terms of normative language. When reviewing an incident, normative language means that a judgment is being made, and that's often based on someone's perception. It's really, really easy to do; I'm quite certain that I've done it before, and I'm sure most people here have. But the implication is that if people had done this thing, then everything would have been resolved much better, much quicker. But they didn't, and that's just silly them, right? The team missed this obvious error, which they ought to have seen. Well, what if they missed it because there was a sea of alerts that made it impossible to see any single alert? What if they had seen the alert but didn't know it was important, or knew it was important but didn't know what to do? Normative language doesn't help us. In actual fact, it removes an opportunity for learning, because this person has made the judgment that it was just this silly person, this human error, and that was the cause. That's really not helpful and doesn't help us improve in the future. Human error, to be clear, should never really be considered a major factor in an incident. It's rarely ever the case, and it just removes every opportunity for us to learn. It's very easy to say, just don't make that mistake next time. But how was that person allowed to make that mistake by the system? So we need to think of a way to explain things in a more factual way and avoid normative language. Next, mechanistic reasoning.
I almost like to think of this in a Scooby Doo type way, and bear with me, because this is stretching it a little bit: we would have gotten away with it, too, if it hadn't been for that meddling regional failure on our database service. That's my Scooby Doo villain impression, sorry. But what I'm trying to get at is that there is a temptation to reduce really complex failures to simple outcomes: this happened because of that; everything failed simply because of our regional database failure; everything can be explained by this simple thing. And this leads us to fall into the trap of there being a single root cause. But we know from our definition that it's often a set of activities. So we need to avoid that, because very, very rarely, if ever, is there actually a root cause; I'm not sure I've ever seen one. Mechanistic reasoning simplifies the issues faced. It's almost impossible for there to be a single root cause, yet this kind of reasoning can often lead to that thinking. Normally there are a series of contributing factors, and again, it's removing the opportunity for learning, because we've got that root cause, it was that, and if we know it was that, then, okay, fine, solve that. But there are normally, as I said, contributing factors, so much else going on around it. So we need to be able to think of those as well, and learn from all of those things. We need to be exploratory in our approach. We need a way to evolve our thinking as new information arrives, and to understand that, fundamentally, we may never actually know all of the causes; but hopefully we can learn from as many as we can. And Richard Cook, who I've mentioned a couple of times already, and again strongly recommend looking up, described this as above the line and below the line. This is a really important concept when we think about our systems, because the complex system isn't just the technology, right? It's actually everything involved, and that includes the humans. The technology is below the line: the code, the infrastructure, the other tech stuff. This is my simplified version of it, by the way. And we are above the line. We don't actually see the code running on the computer. We don't see zeros and ones traveling between network interfaces, or deployments going from our CI/CD platform to our infrastructure. We don't actually see that; we interact with it, and we can only infer what's going on beneath, based on our interactions. And the fact is that we are actually the adaptive part of the system. We are the ones that introduce change. We introduce change to our code by making releases, we introduce change to our architecture, we make new deployments, we make new releases, we introduce new features. We are the ones that are introducing change; we are the adaptive part of our system, and we are above the line. And this is an ongoing thing, and it's all changing the way our systems behave, changing how catastrophe occurs or can occur. I think the best way to think about this is: if everyone stopped working on your system, if there was no more human interaction whatsoever in terms of supporting it or anything else like that, no more
deployments, how long would that system survive? That service, that thing: how long before users would be complaining? So it's a really important thing to understand that we are actually part of the system, and our behavior is impacting the system. Even more than that, we are the ones that define what an incident is. Again, I mentioned our definition of an incident: undesirable behavior. Who defines what undesirable behavior is? We do. But also, as we are part of the system, if we improve our behavior, then we are actually improving the system's behavior, because again, we are part of the system. And I said we introduce change, we are the adaptive part; but as we introduce change, we introduce new forms of failure. As we resolve one incident, or one type of failure, we are potentially introducing a new one: a new weird and wonderful way in which our service can fail. And again, this is another Richard Cook point from How Complex Systems Fail. It's even worse than that, in that as we make our systems more resilient to the types of failures we've already seen, it actually takes more for the system to fail. So the next failure is likely to be even more catastrophic, because each time we're making it more resilient, we're fixing it, we're preventing all of these smaller failures; it's just going to take a bigger thing for it to go wrong. But the fact is that we can't stop change. We need to have change. It's part of our jobs, and for good reason: we do need to release new features, we need to stay ahead of the competition, we need to do all these things. We have to introduce change. Change cannot stop, but it is introducing new forms of failure, and it's important that we're aware of that. So, with these things in mind, I've been looking at research into how people actually go about troubleshooting. And the fact is that some people just seem to be able to understand what's going on much more effectively than others. They seem to be able to smell what's going on; they've got this evolved approach to troubleshooting. And the fact is that they are not working out how things are breaking in the same way as people who are new to a service or new to a technology. When we first start out troubleshooting, we rely on our system and domain knowledge. We know the technology, we have our mental model, and we try to work out what's broken based on these things: well, this thing is calling that thing, so let's look at that. We build a base hypothesis on what we think could be going on, based on our understanding of the technology and the service. But as we evolve our troubleshooting experience and become more experienced with our system, we actually move away from our understanding of these things and more towards our experience of what we've actually seen before. We can start to build hypotheses based on how similar the symptoms we see here are to symptoms we've seen in other incidents. So what's actually happening is that we're able to build, remove and essentially cycle through hypotheses much faster, because we can start to say: okay, I've seen these signals, these are similar to these incidents over here; how can I remove this hypothesis, or this hypothesis that I've seen before, from this incident?
They are literally cycling through these hypotheses really quickly until they get to one that seems to fit, and that is how people are evolving their troubleshooting experience. But it's clear that this takes a lot of time and effort and experience. You can't just turn up to a system and gain this kind of smell and understanding of what's happened before in it. Experience is really hard won, and transferring this experience is really, really hard. In most cases, in many companies, you actually need to have been involved in that incident directly to have any real knowledge of what's going on with it and how to apply it in this particular case, and that's quite difficult to do. And again, we are part of the system, but there is change in the adaptive part as well: people come and go. So what do we do? Well, I think that we can bring more science into this. And again, I love a definition, so let's take the ever reliable Wikipedia's view on science: science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe. Lovely. Worth noting, we're not talking about the universe, we're talking about our service or system, so we can reduce the scope quite drastically, which is fortunate. And something specific is that scientific understanding is quite often now based on Karl Popper's theory of falsifiability, which is really hard to say; try saying it three times. Anyway: no matter how many observations are made which confirm a theory, there is always the possibility that a future observation could refute it. And there is a quote here: induction cannot yield certainty. Science progresses when a theory is shown to be wrong and a new theory is introduced which better explains the phenomena. So essentially we learn with each theory that we disprove. And you can see this classic scientific experiment cycle going on: we've got an observation, a question, research, a hypothesis; we test with an experiment, we analyze the data and we report the conclusions. But it doesn't end there. We keep going, because we can never truly know. This is the theory of falsifiability from Karl Popper, and I think it's something that we can apply to creating hypotheses within incidents; I think this is something that will actually help us. It's quite straightforward, actually, to convert this scientific method into incident resolution behaviors. The observation is that our system is exhibiting an undesirable behavior, based on our definition. So we do some research: we look at our monitoring and observability systems, and based on what we see, we create a hypothesis on what is most likely happening. We think it's this particular thing that's going on, because of what we've seen in our system. And so we do an experiment, and this is important: we attempt to disprove the hypothesis. Disprove, not prove. Again, thinking about having the wrong hypothesis and losing a lot of time, we attempt to disprove our hypothesis. In our analysis we ask: are we able to disprove this hypothesis? If so, great, okay, that wasn't the thing. But we learned something.
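As a rough illustration of that disprove-first cycle (which continues in the next step below), here is a minimal sketch in Python. Everything in it, the Hypothesis class, the example statements and the placeholder checks, is invented for illustration; it is not a tool or format from the talk, just one way the idea could be expressed in code.

# Minimal sketch of a disprove-first hypothesis loop during an incident.
# All names and checks are hypothetical examples, not a real tool or API.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Hypothesis:
    statement: str                 # e.g. "High error rate is caused by DB latency"
    disprove: Callable[[], bool]   # returns True if the experiment disproves it
    notes: list = field(default_factory=list)

def investigate(hypotheses: list[Hypothesis]) -> Optional[Hypothesis]:
    """Work through hypotheses, most likely first, trying to DISPROVE each one.
    The first hypothesis we fail to disprove becomes the current working theory,
    not a proven cause, just the one that has survived so far."""
    for hyp in hypotheses:
        if hyp.disprove():
            hyp.notes.append("Disproved: ruled out, move to the next hypothesis.")
            continue
        hyp.notes.append("Could not disprove: treat as the working theory for now.")
        return hyp
    return None  # everything was disproved: go wide and generate more hypotheses

if __name__ == "__main__":
    # Placeholder lambdas stand in for real queries to monitoring/observability tools.
    theory = investigate([
        Hypothesis("High error rate is caused by database latency",
                   disprove=lambda: True),   # e.g. p99 latency looked normal, so disproved
        Hypothesis("High error rate started with the latest deployment",
                   disprove=lambda: False),  # could not rule it out
    ])
    print(theory.statement if theory else "All hypotheses disproved; go wide, not deep.")

The notes collected along the way are what later give reviewers the context behind each decision.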
And now we can move on to the next most likely hypothesis, and we can repeat, and we can keep going through this cycle until we're unable to disprove a hypothesis, at which point it becomes our most likely working theory, obviously until such time as we learn anything that disproves it. We can then either narrow the hypothesis or start to take action based on it. This is a way that we can apply this structured scientific thinking to an incident. And here is a theory, here is my hypothesis: a more scientific, hypothesis driven approach to how humans perform and document incident investigations can improve reliability. Because I'm talking about not just creating these hypotheses in this way, but ideally doing some kind of documentation, writing things down, providing some context, because we want to learn from all these things. And here's one I made earlier: a possible explanation for the high error rate is that there is high database latency, and so on; I'm not going to read it all out, but we can disprove this by... And you might be thinking: why would I want to write all this stuff out? Why would I want to consider this structured thinking? This seems a lot of effort. And why indeed; fair question. This is my hypothesis, and maybe I've got it wrong. Well, let's have a look at some of the things we've thought about before. Because if we're using Karl Popper's theory of falsifiability, we're removing bad avenues of investigation as soon as we can. We're attempting to disprove something, rather than just keeping on proving it, and proving it, and proving it, until such time that we don't prove it. So we're quickly removing bad avenues of investigation, because we're looking to disprove. It allows for changes as incidents change: we do get more information as incidents go on and they evolve, and so, like science, it's only ever our current understanding, until such time as it becomes disproven; and that might happen because, as I said, incidents do change over time. It can formalize the language used to explain decision making, and this is really super important: we're creating a way to communicate to others in a clear way. We can record our hypothesis, we can record the outcomes of our experiments, and this can really improve learning, because we have much more information to work from; we have the context of what's going on. And that provides us with a level of psychological safety, because there are clear, documented reasons as to why we're doing what we're doing. We have the reasoning behind the decisions, we have the proof of why we did what we did, and there is safety in that, because we can show exactly how we got here. That's really important. And also, as I said, it avoids normative language. We're talking about a very scientific approach here, about things that are based on facts: hypotheses. We are looking at this hypothesis until such time as it's disproven. There you go, our little hypothesis there. It's factual language, so we're avoiding all our normative language and our mechanistic reasoning. And hindsight bias is reduced, because there is context: if we are going through all of these hypotheses, people are able to see after the event why we did what we did.
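To make the "write it down" part concrete, here is one possible shape for such a record, again only a sketch: the schema, field names and values are invented for the example rather than a prescribed format. The point is that the hypothesis, the planned disproof, and the outcome are captured in factual language as the incident unfolds.

# Sketch of a structured hypothesis record for an incident timeline or channel.
# The schema and values are illustrative only.
import json
from datetime import datetime, timezone

hypothesis_record = {
    "incident": "INC-1234 (hypothetical)",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "hypothesis": "A possible explanation for the high error rate is high database latency.",
    "disprove_by": "Compare p99 database latency in the incident window with the previous 24h baseline.",
    "outcome": "disproved",  # "disproved" or "not_disproved" (current working theory)
    "evidence": "p99 latency is flat at roughly 12 ms across the incident window.",
    "recorded_by": "on-call engineer",
}

# Appending records like this to the incident timeline keeps the reasoning behind
# each decision visible after the event, which is what gives reviewers context
# and responders the psychological safety described above.
print(json.dumps(hypothesis_record, indent=2))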
There is context, okay? This was their hypothesis; they thought it because of this; they disproved it; okay, it makes sense that they went on to that. We can see their thinking. So you might be thinking: okay, that sounds actually pretty good, but how do I even create a hypothesis? Well, John Osborne, again a very great person in this field, looked into this in his thesis on failures in Internet services, and he produced these steps. The first is to look for changes, and I think this is something we've all seen quite often, and certainly I've seen it before in incidents: an incident is created and the first question is, what's changed? We know that change is a very common source of failure, so it's a really great place to start. If that doesn't help you, if maybe you've created some hypotheses based on what's changed and you've disproven them all, then you go wide, but you don't go deep. What this means is: think of lots and lots of different hypotheses. Maybe every team that's involved is asked to think of one or two hypotheses. Keep it high level, though, and don't go too deep. Try to disprove as many of these as possible; you can always zoom into them later. As I said earlier, once you've got a hypothesis and you can't disprove it, maybe you can narrow the scope; but for the moment, widen the net, think of lots of different things, and try to disprove them. And don't forget Occam's razor: the simplest explanation is often the most likely. So when going through these things, when starting to think of more hypotheses, think of Occam's razor. And that is my talk, really. Hopefully that's given you some food for thought on how you can improve, or start, your incident response. And I'll leave you with this, another Richard Cook quote, because he is so influential: all practitioner acts are a gamble. With science, there is no 100% guarantee, there is just a hypothesis that we can't disprove. We have to accept this, particularly during incidents. We never know for absolute certain that what we're going to do really is going to fix something. There is always the opportunity for another level of catastrophe, because the actions that we take are above the line, and we're interacting with things we don't fully understand that exist below the line. So we don't know for certain, and therefore it's a gamble. But hopefully, by introducing more scientific method, by recording our actions, by providing a more structured response, we can reduce the size of this gamble. And even if not, whatever happens, we have these actions recorded, which allows us to learn for the future. So that's it. That's me. Thank you very much and have a great conference.

Ivan Merrill

Head of Solutions Engineering @ Fiberplane



