Conf42 Incident Management 2022 - Online

One more step in Learning from Incidents: Sharing incident findings effectively

Abstract

Oftentimes post-incident activities involve a post-mortem meeting and document. These two vary in quality, from focusing only on a single root cause, maybe a 5 Whys, to, on the bright side, a thorough investigation that takes into account multiple points of view. Regardless, once the meeting is complete, the output usually ends up in a document hidden in a drive that no one ever opens, and the knowledge that was discovered during the postmortem stays only with those who attended the meeting.

If learning from incidents allows us to turn outages into opportunities, how can we make those learnings reach the most people? We do this by carefully and thoughtfully sharing our findings.

By sharing our findings, we allow for more equitable learning (accounting for people with conflicts or illnesses who couldn’t attend the review meeting), get buy-in for next steps, and help people in the org gain a more well-rounded knowledge of how things work.

This talk is meant for anyone who is involved in incidents in any way: responders, subject matter experts, impacted users, and facilitators, but mostly for technologists who want to make the most out of their incidents through learning!

Throughout this talk I will give an overview of why sharing matters and go deep through the different ways that we can share incident learnings depending on your needs and audiences as well as provide examples with the hope that folks are able to start applying these forms of communication in their own orgs. Furthermore, I want folks who watch this talk to leave with a sense that change can happen and that we aren’t meant to keep repeating the same problems over and over again.

Summary

  • We started focusing on not only learning from our incidents, but telling others about them. And then something magical happened. Folks started listening, and they engaged with what we were talking about. Little by little, our recommendations were making it into the quarterly plans.
  • Vanessa Huerta Granda talks about sharing incident findings effectively. She believes learning from incidents is the key that can help software organizations improve how we do our work. She wants to make this work more attainable and sustainable for the everyday engineer.
  • Last year, Jeli released a guide on how to learn from incidents. The guide explains that your work should not be completed to be filed. It should be completed so it can be read and shared across the business. Sharing timely information is also a sort of marketing for the work that we do.
  • Going back to incidents: who are the different audiences? Within these different audiences, folks can have different purposes for wanting or needing to see the learnings from an incident. The way that you share the information with them will be different.
  • With all of this in mind, let's take a look at the different formats in which you can share the information. We want to make it easy for people to share their learnings. So you'll see how we do that in the next few slides.
  • The report is focused on telling the bigger story of what happened. The abstract is your incident's elevator pitch. The next format is to share a recording of the actual review call. Finally, a weekly update consists of all the incidents that were analyzed that week.
  • When it comes to incidents, I'm a fan of focusing on learning rather than having a postmortem that then becomes an action items factory. The difference between this and an action items factory is that you're giving yourself and other folks the time to truly understand the learnings.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody. So I could talk about incidents forever. In fact, I do talk about incidents always, but for a while it felt like nobody was listening. Years ago, when I started working on incidents and talking about incidents always, we were doing really great work with incident retrospectives. We were learning super duper interesting stuff, but those learnings and recommendations never really went anywhere.

We then realized that in order for these recommendations to actually happen, we needed others to see why we were pushing for them. The engineers themselves couldn't push for these changes because they couldn't succinctly explain what they were requesting. Instead, they would point to some incident reports that were full of screenshots of errors and timestamped timelines that didn't specifically explain what was needed. And the non-engineering folks couldn't really understand why these incidents were impacting them. After all, we were still making money.

And that's when we started focusing on not only learning from our incidents, but telling others about them. We realized that the reports that we were creating weren't telling the whole story, so we redid the way that we would write them so they could be more complete representations of how we experienced the incidents. These long reports, though, were kind of tough for folks to read. So we started adding abstracts and summaries as on-ramps, and then we added weekly updates so folks could quickly ingest what was happening and start realizing that incidents were happening every day and that they were having a huge impact on how we do our work. And then we started synthesizing some of that information so we could go to product owners and decision makers and make a case for our long-term recommendations.

And then something magical happened. Folks started listening, and they engaged with what we were talking about.
And I'm not going to lie, this didn't happen immediately, but little by little, our recommendations were making it into the quarterly plans, and they were making it into the team cultures and the process changes. Not all incidents went away. We were still doing some really cool engineering stuff, but we were out of that cycle of infinite incidents that kept happening over and over again. We started having the bandwidth to tackle different problems, and that was great. And we were able to do that because we were able to share the incident findings.

So hi everyone. I'm Vanessa Huerta Granda. I work in solutions at Jeli.io. I've been working in technology for the past decade, focusing on incident response and learning from incidents. And I truly, truly, truly believe that learning from incidents is the key that can help software organizations improve how we do our work. And I want to help in making all of this work more attainable and sustainable for the everyday engineer. So today I want to talk to you all about sharing incident findings effectively and what we can gain from doing that.

So what do you do after a postmortem? You've worked really hard after your incident. You had a collaborative postmortem, you listened to several different points of view, and you've come up with some really rich learnings and recommendations. What do you do then? What do you do after your postmortem?

Some of us will probably write a report. I've done this usually in Jira. Some folks like to use Google Docs. You can write up your action items. You can tag folks; sometimes that is part of that report. More Jira tickets, really great stuff. Maybe you had the review meeting over Zoom, so you share that recording and you hope that people listen to that one-hour session. Or you close out the ticket now that the incident is over, and folks can access it whenever they want to. Maybe you're just done. This incident was a lot. You're tired, you just want to move forward.
You have work to do, and there's probably going to be another incident tomorrow anyway. So you move on with your life.

So while we all do different things after the postmortem, do we know if others are interacting with our learnings, or are the learnings mostly living and dying in Google Drive, right? Mine sometimes have been. I've worked very, very hard on some incident reports that have very, very complete information, but nobody ever reads them. And it's frustrating, especially when our organizations are the ones that are mandating that we spend time creating these postmortem documents. And so if we're mandated to complete some sort of incident report or a five whys template, it must mean that organizations believe that sharing incident findings is important. But why do they think that? Why do we think that?

When we really think about it, teams appreciate the culture of keeping everyone in the loop. People like transparency. Last year, Jeli released a guide on how to learn from incidents. We call it the Howie Guide. You can actually find it on our website at jeli.io. And when discussing sharing incident findings, the Howie Guide explains that your work should not be completed to be filed. It should be completed so it can be read and shared across the business, even after the learning review has taken place and the corrective actions have been taken. We work hard at arriving at our learnings, to uncover themes and takeaways. How often are reports written to be filed rather than to be read and shared?

Why else do we share information? Well, we also share just for learning's sake, and sharing timely information is also a sort of marketing for the kind of work that we do, right? Sharing that information is a marketing piece of the learning from incidents programs that we lead. Findings may also impact others in the organization. Some outcomes or insights may impact a person or a team or the way we do our work.
And letting them know of these learnings is a part of a just culture. We also do it as a TL;DR for the decision makers. Maybe we need buy-in for next steps from leadership or other stakeholders. Actually, we usually do. And these folks are probably not going to be very likely to attend every single review meeting, especially if you have a lot of incidents. And finally, you also have folks who didn't attend the review meeting. Maybe they were out that day on PTO. Maybe they needed some heads-down time to work on other stuff. Maybe they were working through another incident themselves. So there are many, many other reasons for learning. And all of that is to say that sharing incident findings can help get our point across to a wider audience.

So if we've worked hard on uncovering our learnings, but others aren't seeing them, what's the point? I've often heard engineers discuss, sadly, how they're stuck in this cycle of incidents, that they believe there are things they can do about it, but they're not in a position of power to get those things done. But sharing incident findings is a way out of that cycle of incidents.

You may say, "Vanessa, I'm already sharing my report. It's in the drive. Anyone can access it." And that's true. But my inbox is not at zero, and I bet a bunch of other people's inboxes aren't at zero either. And I can tell you that having something available to me doesn't mean that I'm going to read it. So how do we get folks to read it?

So let's think about movies. The way that we learn about movies and the way that we learn from movies are different. You can watch a 120-minute movie, but maybe then you want to hear people talk about it for 45 minutes, or you want to listen to a ten-episode podcast about this movie. Or maybe you want to share a review with someone that they can read in five minutes and decide if they want to go to the movies with you or not. Or maybe you just want to tweet some spoilers, right?
Like, spoiler alert: in the movie Titanic, the boat sinks. Sorry for that spoiler, y'all.

The truth is that different audiences need to learn different things from what you're sharing. And who are these different audiences? Going back to the movie analogy, you can have your huge fans. They're the ones who are going to listen to that ten-episode podcast. You have your casual viewers, the ones who want to read that review before they commit to spending $12 on the movie ticket and then another $12 on the popcorn. You have your studio executive, your Oscar voter. They're hopefully watching those 120-minute-long movies. The film industry is catering to different types of audiences differently, just like we will when it comes to incidents.

So, going back to incidents, who are the different audiences? You have your engineers. You have the people that you invited to the review meeting but who couldn't make it. Maybe some people who are impacted by the outcomes or the insights from the analysis, or just people who want to learn more about what your team does. You have your managers. They can provide necessary context to others. They can say, "This is why my team does this this way." You have your execs, your leadership, who can approve suggested changes. You have stakeholders, who can be technical or non-technical. The way that you share the information with them will be different. And then you have your outside parties, right? You have your customer support, the folks who are answering the phones when something is going wrong. You have your public relations folks, the folks who are writing the tweets when your site is down.

So within these different audiences, folks can have different purposes for wanting or needing to see the learnings from an incident. And the purposes can be many, right? Sometimes when I share something, I'm requesting an action, right? I'm saying, "Hey Jen, please make this change or add it to your to-dos."
Sometimes they just need to know: "Hey Jen, I'm making this change. This is how it's going to impact you." Sometimes you're just updating them, right? Like, "Hey, Jen's boss, remember the incident from the other day? This is what happened." But sometimes you want to change folks' minds. You want to say, "Hey, Jen's boss, please don't fire Jen. This wasn't her fault. Read this report, you'll find out why."

With all of this in mind, let's take a look at the different formats in which you can share the information. So these are some of the formats that I like to use. I've iterated through them throughout the years. We'll go into more detail in a bit, but we have the report, the abstract, the summary, the recording, weekly updates, and the presentation. A lot of what we've built here at Jeli was done with this in mind. We want to make it easy for people to share their learnings. So you'll see how we do that in the next few slides.

First, let's start with the report, right? That is probably the format that you're most used to, but this is different from your standard postmortem. This one is focused on telling the bigger story of what happened and the context around how the events came to be. The goal with the report is always to learn from the incident. As you can see here, I'm including a narrative timeline in my report, a visual narrative timeline, because we're telling what happened with the incident from different points of view. We're not just saying start, middle, end; we're going back and forth, trying to understand what people were experiencing at the different parts of the incident. The report is great for asynchronous communication. Again, it should be written to be read and collaborated with, not just to be filed. So when I'm writing a report, I really like to encourage folks to make comments on it, to link out to it, to include it in their PRs and in how they do their work.
The report is the most in-depth written artifact that's coming out of your incident. A report will give folks an in-depth understanding of the themes around the incident and how we found them. Reports will mostly be read by folks who were involved in the incident, or teams using similar technologies, or folks whose buy-in you need for possible action items. But they are long, right? These are the most in-depth artifacts coming out of the incident, so reading a report is probably going to require quite a time investment from your reader. So in order to catch people's attention, you probably need something else.

And here is where the abstract comes in. The abstract is my personal favorite way to share about incidents. It is your incident's elevator pitch. So it's one to two paragraphs on what happened, why we should care about this event, any contributing factors or themes, et cetera. The abstract is meant to help folks decide if they want to commit to learning more about the incident and reading that full report. You can share it with anyone. I personally love sharing it with executives and leadership. Here is the report. It's a Jeli screenshot. We're calling it the executive summary, but as you can see, it tells you when the incident started, how long it lasted, who was involved, and what the impact was.

Next up is the summary. The summary gives more context. It's a slightly more comprehensive version of the incident. You can include action items and who suggested them, and here you can see we included key takeaways as well. We usually share this with people who can be impacted by the learnings and anyone whose buy-in you need. And when I'm sharing a summary, what I like to do is tag people and explain to them why I'm sending this to them. So I can share the summary and say, "@Jen, sharing this so you can see that we're making this change." And then Jen can go in and find out more information.
The next format is to share a recording, and that's exactly what it sounds like. It's a recording of the actual review call. And when sharing it, it's really helpful to include a message with timestamps of when key moments were being discussed. So I can share my recording and say: at minute ten we discuss the impact, at minute 15 we discuss this theme, at minute 25 we discuss action items. This is the most similar format to attending the review meeting. But there are some drawbacks. Number one, it takes a while to get through it, right? If it's a one-hour meeting, it's a one-hour recording. And number two, viewers or listeners can't collaborate with it. When you're in a meeting, you can raise your hand, you can say, "Hey, this is how I experienced it." You can't do that if you're just watching the recording. But this is a great format to share with those who were involved in the incident but couldn't attend the meeting. If I'm being completely honest, leadership or other colleagues who were not involved in the incident are probably not going to watch or listen to this type of review. That's okay. They are not the audience for this format. That is fine.

And then you get the idea of a weekly update. The weekly update consists of a quick review of all the incidents that were analyzed that week. It can be a list of all incidents with their abstracts and a link to the full report for more information. You can also include additional data points, like teams impacted, for quick access to additional learnings. It's a great, great option for larger organizations that have lots and lots of incidents. Everyone can take a quick glance at the list, find the incidents that they're interested in based on keywords like services impacted or technologies involved, and then read further on whatever they're interested in. I personally had some really great luck with this. I used to send weekly updates to everybody in technology. All the managers could just skim.
If I'm a manager and I'm working with technology A, I could see: okay, these technologies were part of incidents, let's see how this could impact me. Let's see what I need to learn from this. Let's see what I should share with my engineers.

So if you've been paying attention so far, all of the formats that I've discussed are for individual incidents. And now we're moving on to when you're looking at a universe of incidents. There's a difference between the insights that we share from one incident versus multiple incidents, the micro versus the macro insights.

When it comes to incidents, I'm a fan of focusing on learning rather than having a postmortem that then becomes an action items factory. And when I say an action items factory, I mean the postmortems that we've all been at, right? We just sit there, we're not here to learn, we're just here to say, "Oh, let's fix this bug, let's change that, let's change that, let's change that." Half of those tickets are never going to be completed. We're not going anywhere. Those are the kinds of postmortems that make people lose faith in the process.

But when you have more incidents and you have more learnings, you can start proposing changes, because odds are that if you have a sample size of one, you're not going to be able to make a large structural change because of it, right? If I have one incident, I'm not going to be able to say, "Hey, we should do a reorg based on this." But if I have more incidents, then I can start suggesting things as an analyst. When you start spotting macro trends, you can and you should make a case for them. That's how you change your lives for the better. The difference between this and an action items factory is that you're giving yourself and other folks the time to truly understand the learnings. You're reflecting on your work over a span of time and you're making decisions together.
To give you an example, in the past I worked at an organization where we had a very centralized incident response system, meaning we only trusted a few people to start an incident. That's because we believed that only a few people had the understanding of our systems to make this decision. As we grew, as we DevOps'd more things, we realized that this process was actually delaying us in learning about high-impact incidents. And we started asking ourselves, what can go wrong if we change this process? We had several discussions and we were like, let's give this a try. But this was a change that was outside of our control, and when we have a change that's outside of our control, I like to focus on presentations.

So from time to time, you will get the chance to present to a wider audience, especially when you're trying to make recommendations. When I'm doing this, I usually like to walk folks through the timeline of an incident so they understand what they're dealing with. A lot of the time, the people who attend these presentations are not living incidents every day. They're not like me, who can talk about incidents forever. So I like to walk them through the incident. I like to show them that visual timeline. I present any data that I have and any themes that I want to discuss, and then I go into my recommendations. And I always, always target things to my audience, right? If I'm targeting this to a very technical audience, I include very technical details. If I'm targeting this to a business audience, I include business details.

Okay, but how do I get them to agree to my changes? When you're proposing changes that are outside of your control, I usually suggest that folks think of this workflow, ask these questions, and answer them for whoever you're proposing the changes to. So let's think back to the example.
There, I wanted to change the incident process from only a few people being able to call incidents to everyone being able to raise an incident.

What is the suggestion? In this case, it was a suggestion to change the way we run incidents. Who needs to approve it? This was a process that everyone in tech, in the tech org, knew, so some engineering leadership probably needed to approve it. Going up to the CTO, actually. What do they need to see? They need to see that we have done our due diligence. What are we basing this information off of? We're basing it off of a good number of incidents where the incident process was not the root cause, it wasn't the thing that caused the incident, but it was a contributing factor. And then these incidents led to a number of discussions, and responders agreed that this was worth pursuing. We can see all of this information because we have very thorough incident reports that we can look back to. What could go wrong? Well, if we communicate this wrong, it could cause more confusion. But we're already thinking ahead, right? In this case, the responders who are suggesting this change are suggesting that we try it out for a quarter and then revisit our progress. What is our end game? Maybe having more control over the process. We wanted to see what would happen if we opened up the floodgates. And who is doing all this work? This example is easy because I own the process, so I'm the one that's making it happen. But it also had to be part of my quarterly planning. I had to have my manager approve that I would be spending time working on the change management process.

Guess what? It worked. And many other recommendations also worked, and some didn't. That's fine. That's how the world works. And you can also make it work for yourself. Because we had done our work throughout all of the individual incidents, we were able to uncover what was happening at a macro level.
Because we had done our work, we could then confidently answer all those questions in the last slide and make a case for the changes that we were suggesting.

And so next time that you feel like you're stuck when it comes to learning from incidents, next time you feel like you're in a cycle of repeating incidents where you feel like you know the answer but you're not getting anywhere, remember that the process doesn't end in the postmortem. Sharing incident learnings is indeed a pivotal step in turning your incidents into opportunities.

Thank you very much. I'm Vanessa. If you would like to hear more about incidents, feel free to follow me on Twitter at @v_hue_g. I hope you all have a wonderful day.

Vanessa Huerta Granda

Solutions Engineer @ Jeli.io
