Conf42 Site Reliability Engineering (SRE) 2024 - Online

Clinical troubleshooting: diagnose production issues with scientific precision

Abstract

When diagnosing an illness, a team of doctors will iteratively catalog symptoms, possible explanations, and a plan of action. If your team needs to troubleshoot software failures, this simple diagnostic framework can move mountains.

Summary

  • Dan Slimmon: Clinical troubleshooting is a scientific procedure for collaborative diagnosis. He says it allows groups of people to troubleshoot problems much more effectively. Slimmon says if you use it consistently, you will experience shorter incidents and less stressful incidents.
  • The throughput of the flower service drops by 40% right in the middle of the business day. Another service breaks, the honeybee service. Two independent teams now coordinate to troubleshoot the issue. It's a mess because every new person who joins the call has to get context.
  • Barb and Alice use a shared Google Doc to look at the problem. They come up with three hypotheses, and actions that can rule out or fortify any of them. They look at everything from the same point of view. That is so powerful during an incident when you're under time pressure.
  • When we talk about symptoms and hypotheses, you want your symptoms to be objective, meaning that they are statements of fact about the observed behavior of the system. You want them to be as quantitative as possible. And finally, you want to make sure you have no supposition in your symptoms.
  • You want your hypotheses to be explanatory, meaning that they explain one or more of the symptoms. And you want them to be falsifiable, which is a slightly tricky concept, but essentially means testable. It's really all about ruling things out, not proving things.
  • You can take rule-out actions, which rule out one or more of the hypotheses. You can also take research actions, which don't rule anything out but help generate more symptoms and hypotheses. Finally, you should try wherever possible to use diagnostic interventions; the overall structure, including these three kinds of action, is sketched below. If you practice clinical troubleshooting, you'll have shorter incidents.
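
The bullets above describe the shape of a clinical troubleshooting doc in prose: symptoms, hypotheses, and three kinds of actions. Purely as an illustrative sketch (the class and field names here are invented for this page, not taken from the talk), that structure might be modeled in Python like so:

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionKind(Enum):
    RULE_OUT = "rule out"            # disproves one or more hypotheses
    RESEARCH = "research"            # generates more symptoms or hypotheses
    DIAGNOSTIC_INTERVENTION = "diagnostic intervention"  # fixes the issue, or rules a hypothesis out


@dataclass
class Symptom:
    """A statement of fact about observed behavior: quantitative where possible, no supposition."""
    description: str


@dataclass
class Hypothesis:
    """Must explain at least one symptom and be falsifiable (testable)."""
    description: str
    ruled_out: bool = False


@dataclass
class Action:
    description: str
    kind: ActionKind
    targets: list[Hypothesis] = field(default_factory=list)  # hypotheses this action could rule out


@dataclass
class TroubleshootingDoc:
    """The shared doc: three headings, kept in front of everyone on the call."""
    symptoms: list[Symptom] = field(default_factory=list)
    hypotheses: list[Hypothesis] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)

    def open_hypotheses(self) -> list[Hypothesis]:
        return [h for h in self.hypotheses if not h.ruled_out]
```

Nothing about the method requires code; in the talk, the same structure lives in a shared Google Doc with three headings.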

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. So happy that you've joined me here at Conf42 SRE 2024. I'm coming at you from beautiful New Haven, Connecticut, the Elm City. And I'm going to talk to you today about clinical troubleshooting, which is a technique you can use every day in your career as an SRE to diagnose problems faster and solve them for good. And I can tell you that if you and your team use clinical troubleshooting consistently, you will experience shorter incidents and less stressful incidents. That all sounds pretty good, right? But what do I know? I am Dan Slimmon, and I've worked for about 16 years of my life in operations, SRE, and DevOps. Most recently, I've worked at companies like HashiCorp and Etsy, and I've investigated thousands of production issues during my time as an SRE. Over those 16 years and thousands of issues, I've developed this clinical troubleshooting methodology that I'm so excited to tell you about today. It allows groups of people to troubleshoot problems much more effectively. Clinical troubleshooting is a scientific procedure for collaborative diagnosis. Essentially, that means you can use it anytime you have a group that needs to work together to explain why a complex system is malfunctioning. It doesn't have to be under intense time pressure, like during an incident. But incidents are a great example of a time when you need to do collaborative diagnosis. You get a bunch of people on a call, maybe each of them only knows part of the system, they have an imperfect view of each other's thinking processes, and so you're doing diagnosis collaboratively. Clinical troubleshooting is a scientific way to do that. So let's dive into a story. It's a wonderful, calm day, and the flower service in our stack is clicking along, doing its job, serving up nectar tokens for our users, and then suddenly it catches on fire. The throughput of the flower service drops by 40% right in the middle of the business day, which should never happen. Here's a little graph showing the throughput, measured in pollen tokens per second served, dropping from about 1,000 to about 600 per second. That's bad. And it's so bad that Barb gets paged. Barb is an SRE, and she just got paged about this problem with the flower service throughput dropping from 1,000 to 600 per second. She only knows a little bit about the flower service, but she's here on the incident call. She's ready to go. She's going to figure this out. It's a chance for Barb, captain of SRE, to show her quality, so to speak. So Barb spends a few minutes looking at graphs. She's searching for the error message on Google and in the source code. She's building up context. And so she's holding a lot in her head. She's holding a lot of abstractions, a big house of cards, in her head. Then Alice joins the call. Alice is an engineer on a different team. She says, hi, Barb. This is Alice. I saw the alert in Slack. Can I help? But Barb, as we discussed, is holding a lot in her head. She's got several competing theories that are maybe not fully baked yet. She's still processing what she sees on the dashboards and on Google. And so she says, thanks, Alice. You can sit tight for a moment. I'll fill you in soon. She doesn't want to stop and explain everything to Alice when she hasn't totally explained it to herself yet. So Alice can watch Barb's screen share, watch what she's doing. But she has even less context than Barb does on this issue. 
This goes on for a few minutes, and then suddenly another service breaks: the honeybee service. Barb gets a page saying that the error rate for the honeybee service is now elevated. She doesn't have time for this. She's trying to fix flower. So Alice jumps in and says, oh, I can look at honeybee. So now you have Barb over here looking at flower and Alice over here looking at honeybee. They're both looking around for data to explain their respective issues. This is kind of good, because they can work in parallel. And then suddenly, boom, crashing straight through the brick wall comes Seth. Seth is Barb and Alice's grand boss. He wants to know the answers to lots of questions, and he's spitting out those questions like so many watermelon seeds. He's saying, how are customers affected? Do we need anyone else on the call? How are the flower and honeybee problems related? What's our plan? These are good questions, and it's reasonable that Seth wants to know the answers to them. But because all the context on these issues is stored inside the heads of our heroes, Barb and Alice, Barb now has to take her hands off the keyboard and spend her time answering Seth's very reasonable, but perhaps disruptively delivered, questions. And while she's answering those questions, guess who else shows up on the incident call? Ding, ding, ding. It's the support team. Support started getting customer reports of errors from the honeybee service, and they say, is this the right call to talk about that? I know it's supposed to be about flower, but, you know, there's an incident. They have a lot of the same questions as Seth, but the support team's questions are going to have to wait because, ding, ding, two more devs join the call, Devon and Charlie. Okay, so Barb has to put her conversation with Seth on hold for a minute so she can assign these new responders to help herself and Alice respectively. So once she's answered Seth's and support's questions, then she and Alice can spin up Devon and Charlie on context, and everybody can get back to troubleshooting the issue. It's a mess because the effort is fractured. Essentially, we have two independent teams now, both trying to coordinate on a single call. It's a mess because every new person who joins the effort has to interrupt Barb or Alice to get context on what's going on. And it's a mess because, despite having been in the incident call for 20 minutes, there's still no real plan about how we're going to diagnose and fix this problem. If you've been on incident calls, you know what this kind of mess feels like. It wastes time. People step on each other's toes, we get disorganized, we miss things, and the incident goes on way longer than it needs to. And that's why it's useful to have a process like clinical troubleshooting. Clinical troubleshooting is what's called a structured diagnostic process. Having a structured diagnostic process makes a lot of the problems that just turned our incident into a mess go away. It does this by exposing the plan, so that everyone who joins the call knows what's up and what we're doing. It does this by helping you avoid dead ends, which allows you to solve the issue faster. And most importantly, I think, it lets you audit each other's thinking. When we audit each other's thinking, by which I mean examining how our coworkers are thinking about the issue and comparing it to our own mental model, we can reason collectively, and because we're reasoning collectively, we can reason better. 
And that's how clinical troubleshooting gives us shorter incidents, fewer incidents, and less stressful incidents. So what is this structured diagnostic process that makes so many of Barb's and Alice's problems go away? It's this simple workflow. First we get a group of people together who have as diverse a set of viewpoints as possible; that'll be important later. Working together as a group, we list the symptoms of the problem under investigation. Symptoms are objective observations about the system's behavior. Next, working from those symptoms, we brainstorm a number of hypotheses to explain the symptoms we're observing. And finally, we decide what actions to take next, given the relative likelihoods and scarinesses of the hypotheses we've listed. We take those actions, and if they don't fix the problem, we go back to the symptoms. If they do fix the problem, we're done. So let's see how this works. Let's go back in time, through this time portal, to the beginning of our wonderful calm day. The flower service is going along, and suddenly it's on fire. Throughput drops by 40% in the middle of the workday. Oh, no. So Barb gets paged. Barb spends a few minutes figuring out context, just like she did in the bad timeline. But this time, when, ding, Alice joins the call and says, hi, this is Alice, how can I help? Barb remembers this phenomenal talk she saw at Conf42 SRE 2024 called clinical troubleshooting. So instead of saying, hold on, Alice, I'm figuring some stuff out, can you just wait for a minute? she says, welcome, Alice. Let's do some clinical troubleshooting. She makes a shared Google Doc, or whatever her shared document system of choice is. She makes a doc, she shares it with Alice, she shares it on her screen. And the doc has these headings: symptoms, hypotheses, actions. And they start the process. They write down the symptoms. What are the symptoms? Well, we had an alert at 8:41 UTC that the flower service's throughput dropped by about 40%. And we also know, from Barb having poked around a little bit before Alice joined the call, that the latencies of requests to the flower service dropped at the time that the throughput dropped. So it's getting less traffic, but the traffic it is serving, it's serving faster. Barb's first hypothesis, which she's been cooking up since before Alice joined the call, is that the flower service is somehow crashing when it receives certain kinds of requests. That explains symptom one, because the crashed requests aren't getting logged and so they're not showing up in the throughput data. And it would also explain symptom two, because maybe these requests that are causing flower to crash are normally requests that take longer than usual. Since they're not getting into the logs, they're not showing up in the latency data, and so the average latency drops. A reasonable hypothesis. And just like that, Barb has brought Alice over next to her. So instead of Barb digging into graphs and Alice twiddling her thumbs, Barb and Alice are looking at the same thing from the same point of view. And that is so powerful and so critical during an incident, when you're under time pressure and you have to come up with a plan and share the plan. Looking at everything from the same point of view gives you common ground. So here's their common ground: two symptoms, one hypothesis. And that means they're ready to act. They think of two things to do. The first thing they're going to do is check the local kernel logs for any crashes. 
That would be evidence for that hypothesis. And the second thing they're going to do is read the readme for the flower service, because neither of them is that familiar with what goes on inside the flower service and what it does. They assign each other to those tasks explicitly. Now, Barb's task takes a little longer, because the kernel logs aren't in the log centralization system that they have at this company. Alice finishes reading the readme pretty fast, and she comes back and says, I have a new hypothesis from having read the readme. It jogged my thinking process. So hypothesis number two: maybe some downstream process, some process that is a client of the flower service, is sending the flower service less traffic. Maybe the flower service is running faster because it's getting less traffic and it was overprovisioned, so the traffic it is still getting can be served with less resource contention. Another reasonable hypothesis. While they're discussing this hypothesis, they get that second page, the page about the honeybee service. Fine, no reason to panic. They take that page, they add it to the symptoms list. They got an alert at 8:54 UTC that the honeybee service's error rate is elevated, and that jogs their memory and makes them come up with a new, third hypothesis, which is: maybe connections to the flower service are getting dropped at the TCP level. We know there's a proxy on the flower service, the little nginx that sits there; maybe that's dropping the requests, and so fewer requests are getting to the flower service. Okay, well, now they've got three hypotheses, so they can start coming up with actions that can rule out or fortify any number of those hypotheses. For example, action three that Alice and Barb come up with is: what if we check whether the honeybee and the flower disturbances started at the same time? Look at some graphs, compare some graphs, see if these two things are actually related. And the fourth action they come up with is to get Devon on the call, because they both happen to know that Devon, another engineer on the team, knows a lot more about the honeybee service than they do. We're in such a better place now than we were in the previous incident, because we are all looking at the same plan, we all have the same information, and any discrepancies between how we're thinking about the problem are taken care of by the fact that this is all explicit. Alice can go look at those graphs, and she sees, oh, look at that: we had a linearly growing error rate from the honeybee service before the flower service's throughput dropped. That's pretty interesting. That means the honeybee problem, which showed an increasing error rate before there was any change in the pollen token counts, must be prior in the causal chain to whatever's causing the throughput of the flower service to drop. That's really important, because it lets us rule out hypothesis one, which is that the flower service is crashing on some kind of request. And it lets us rule out hypothesis three, which was that connections to the flower service are getting dropped at the TCP level, because neither of those would explain the honeybee service getting errors before the flower service saw any change. 
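
To make the state of Barb and Alice's shared doc concrete at this point in the story, here is a rough sketch in plain Python data. The facts in it come from the narration above; the representation itself, field names included, is just an assumption for the example:

```python
# A snapshot of Barb and Alice's shared doc at this point in the story.
doc = {
    "symptoms": [
        "08:41 UTC: flower throughput dropped ~40% (about 1,000 -> 600 pollen tokens/sec)",
        "08:41 UTC: latency of requests to the flower service dropped at the same time",
        "08:54 UTC: honeybee error rate elevated (rising linearly before the flower drop)",
    ],
    "hypotheses": [
        {"text": "flower crashes on certain requests, which then never reach the logs", "ruled_out": True},
        {"text": "a downstream client is sending flower less traffic", "ruled_out": False},
        {"text": "the nginx proxy in front of flower is dropping connections at the TCP level", "ruled_out": True},
    ],
    "actions": [
        "check local kernel logs for crashes (rule-out)",
        "read the flower service readme (research)",
        "compare honeybee and flower graphs: did the disturbances start together? (rule-out)",
        "get Devon on the call (research)",
    ],
}

# The honeybee error rate was rising before flower's throughput changed, so the
# hypotheses that put the flower side first in the causal chain (1 and 3) are out.
still_open = [h["text"] for h in doc["hypotheses"] if not h["ruled_out"]]
print(still_open)  # only hypothesis 2 remains open
```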
So we've ruled out two hypotheses, which is more progress toward a solution. We also get Devon on the call. There's Devon. So now, when Seth shows up and starts asking questions like, are these two alerts part of the same issue? what's our plan? many of the answers to Seth's questions are already on screen. Since Seth was mostly asking those questions because he was stressed out and not sure whether his team was on it, he can now rest easy in the corner and lurk, secure in the knowledge that his team is on it. And they do have a plan. And likewise, when, ding, ding, ding, the support team joins and they want to tell us about their honeybee problem, well, we can say: we're aware of that, we're taking it seriously, we have it here on our symptom list. If you'd like, we can add the customer reports you've got to the symptom list. So support can tell customers that we're working on their problem, they can post to the status page, and they can go to the sidelines and do their job while leaving the incident call's bandwidth mostly untouched. So compared to the chaos of the other timeline, which looked like this, we're in a much better place. We're in a much better place because we've leveraged Alice much more, despite her limited familiarity with the systems in question. We're in a much better place because anyone new who joins the call can easily spin up context. And we're much closer to understanding the problem: we've already ruled out a whole class of hypotheses, and we have more symptoms that we can use to generate more hypotheses. And that's all because we followed through on a simple commitment to clinical troubleshooting. Now, clinical troubleshooting is very simple. You can use it unilaterally today. You can just start using it, and it's simple enough to explain on the fly to whoever you're troubleshooting with. And I guarantee you'll get results immediately. You'll be blown away by how fast this scientific procedure helps you solve issues and how consistently it helps you get to the bottom of issues, even when they're very confusing. But I can give you a few tips for how to use the process as effectively as possible, starting with when you're assembling a team. When you're assembling the group that's going to do clinical troubleshooting together, you need to make sure that you're bringing in as diverse a set of perspectives as you can. So you're going to want to have engineers who are specialists in different parts of the stack. You're going to want to have maybe a support person, because they have perspective on the customers, or a solutions engineer or something, because they're going to have perspectives on how customers use the product. You're going to want to have as many different roles as you can, because when you have more roles, more perspectives on the call, you're going to generate a wider array of hypotheses, which is going to help you solve the issue faster. You also want to make sure that, as you talk about the issue, you keep bringing focus back to the doc. People are going to propose different ways of doing things, and they're maybe not going to be thinking about the clinical troubleshooting procedure unless you, as the leader of the troubleshooting effort, keep bringing focus back to the doc, adding the things they're saying to the doc if they fit, and steering in a different direction if they don't. You'll see what I mean in a second, when we talk about symptoms and hypotheses. 
So, symptoms. When you're coming up with symptoms, you want your symptoms to be objective, meaning that they are statements of fact about the observed behavior of the system. You want them to be as quantitative as possible. Sometimes you can't be, but to the extent that you can, you want to associate your symptoms with numbers and dimensions. And then finally, you want to make sure you have no supposition in your symptoms. So state the facts, state them as quantitatively as you can, and save the suppositions about what might be going on for the hypothesis column. For example, if you have the symptom "the flower service's throughput dropped by about 40% at 8:41 UTC," that's a well-formed symptom. It is an objective statement about the observed facts. It is quantitative; it has numbers on two dimensions, time and throughput. And it doesn't have any supposition about why that fact is occurring. Whereas if your symptom is just "the flower service is dropping requests," it's not quantitative, it doesn't have any numbers in it, and it contains a supposition about why the throughput number changed. It's a subtle supposition, because, yeah, if the throughput dropped, it looks like the service is dropping requests. But as we've seen from our example, maybe it's not dropping requests. That's a supposition based on the symptom, and it goes in the hypothesis column. Speaking of the hypothesis column, you want your hypotheses to be explanatory, meaning that they explain one or more of the symptoms, or could potentially explain one or more of the symptoms. And you want them to be falsifiable, which is a slightly tricky concept, but essentially means testable. Falsifiable means that if the hypothesis were wrong, you would be able to prove that it's wrong by taking some action. You should be able to imagine something you could do that would disprove that hypothesis. This is pretty important, because if you have hypotheses that are not falsifiable, then you can go chasing your tail for a long time trying to prove them. Actually, you should be trying to disprove them, or create more hypotheses that you can then disprove. It's really all about ruling things out, not proving things, which is a bit of a mental shift that you have to make if you're going to use this procedure effectively. For example, "a downstream service is bottlenecked, which results in less traffic reaching flower" is a pretty good hypothesis. It's explanatory: it explains the two symptoms that we've observed. And it's falsifiable, because you could show that the same amount of traffic is reaching flower, that it's not actually receiving less traffic, and that would disprove the hypothesis, which would be progress. If you have the hypothesis "the flower service is crashing," that's not a good hypothesis, because, first of all, it might not be falsifiable: depending on your stack, it may not be possible to prove that the flower service is not crashing. And second of all, it doesn't really provide a clear explanation of the symptoms. Okay, maybe the flower service is crashing, but why is that causing throughput and latency to drop? It's not clear from the way the hypothesis is written. 
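
To recap those criteria side by side, here is a small illustrative contrast. It is only a sketch, not something from the talk itself, and the variable names are invented for the example:

```python
# Symptoms: objective, quantitative, no supposition.
good_symptom = "flower service throughput dropped by about 40% at 08:41 UTC"  # facts plus numbers, no guess at a cause
bad_symptom = "the flower service is dropping requests"                       # no numbers, and it smuggles in a cause

# Hypotheses: explanatory and falsifiable.
good_hypothesis = "a downstream service is bottlenecked, so less traffic reaches flower"
# Falsifiable: show that flower is receiving the same amount of traffic and this is disproved.

bad_hypothesis = "the flower service is crashing"
# Hard to falsify on many stacks, and it doesn't say why throughput and latency would both drop.
```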
So, finally, that brings us to the actions column. There are a few different kinds of actions that you can take as part of a clinical troubleshooting effort. You can take rule-out actions, which rule out one or more of the hypotheses, and that's what gets you closer to the definitive diagnosis of the problem that you're seeking. So that's the main kind of action that you're going to want to take as you're going through this process. However, you can also take research actions, which don't rule anything out, precisely, but will help you generate more symptoms and more hypotheses, which will hopefully make the path forward clearer. And finally, you should try wherever possible to use what are called diagnostic interventions. Diagnostic interventions are actions that may just fix the problem if a particular hypothesis is right. But otherwise, if they don't fix the problem, then they at least rule that hypothesis out. And that sort of kills two birds with one stone, when you can find one. So, for example, a rule-out action is: check whether the flower and the honeybee disturbances started at the same time. As we saw, if you do that, you can learn something that will allow you to rule out one or more of the hypotheses. A research action is something like: read the flower service's readme. Reading the readme isn't going to rule anything out, but it may give you some ideas about symptoms that you can go check, or hypotheses that you may be able to falsify or that may explain the symptoms. And then finally, an example of a diagnostic intervention: say we had a hypothesis that the honeybee service was having this elevated error rate because of some bad object in its cache. If you were to clear the honeybee service's cache, that would be a diagnostic intervention, because if it fixes the problem, then great, we fixed the problem; we know that was something like what the problem was. If we clear the cache and it doesn't fix the problem, then we get to rule out that hypothesis. So either way, we're making progress. So those are my tips. And like I said, if you practice clinical troubleshooting, you're going to have shorter incidents, fewer incidents, and less stressful incidents. And I'd love to hear from you about that. So if you do this and you have questions about it, or you want to tell me a success story or a failure story, there's my email address. I also urge you to check out my blog, which covers topics in SRE, incident response, and observability, and it's a very good blog. So I'm excited to hear from you as you try this out. Oh, also, I teach a four-day incident response course for engineering teams. You can check that out at d2e.engineering for more info on that. Before I go, I also want to sincerely thank Miko Pawlikowski, the indefatigable host of Conf42, for making this incredible event happen and giving me the opportunity to speak to all you fine folks out there. You've been a fantastic audience.
...

Dan Slimmon

Managing Partner, SRE @ D2E



