Conf42 Incident Management 2022 - Online

Incident Management - Talk the Talk, Walk the Walk


Abstract

Remember when we were at school, and people said - “Actively listening in class guarantees 50% prep for the upcoming test”? The same goes for being proactive at work in ways that will instantly prepare you to manage incidents better (at night or in general).

In this talk, I will lay out the foundations of incident management, including key questions that, if you can answer them, will let you easily manage incidents, no matter the time and place. I will also show the best practices I've refined over the years, which helped me form a clear vision of how to manage production incidents in the quickest and most efficient way possible.

Embracing the tips I’ll give you will guarantee you’ll not only talk the talk but also walk the walk when it comes to incident management.

Summary

  • Hila Fish is a senior DevOps engineer. She has 15 years of experience in the tech industry, which means a lot of production incidents. She will show you how to take a proactive approach to managing incidents.
  • Incident management is a set of procedures and actions taken to resolve critical incidents. A business mindset is needed to grasp the overall impact of incidents and mitigate damage. A structured process can lead to improved productivity, improved mean time to resolution, and eventually cost reduction.
  • During an incident you should keep calm and ask yourself a set of questions at each pillar: identification and categorization, notification and escalation, investigation and diagnosis, and finally resolution and recovery. Choose the fastest solution without compromising the system's health and stability.
  • Next up: are there any action items needed after the issue is resolved? Is a relevant incident runbook in place? If it's outdated, it may need to be updated. Could I help prevent similar incidents from happening again?
  • A war room is when you have more than three or four people handling an issue. There should be an incident manager who divides the work and tells people what to do. This person should be calm and collected and see things clearly. Participation should be kept minimal and dynamic.
  • There are many qualities you need when handling an incident: think on your feet, differentiate between relevant and irrelevant information, be humble, and if you're stuck, ask for help. Methodical work will get you to faster incident resolution.
  • The proactive approach after an incident takes place looks like this: document it in an on-call shift handoff, a summary posted in the team's channel; write postmortem notes; open a Jira or Monday ticket and fix the underlying problem to prevent the next incident.
  • Identifying services and escalation points on a day-to-day basis will save time and money on incident management. Learning application flows is very important. Be a go-to person, as they say. Less downtime means business success.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Thank you so much for joining me. This talk is about incident management - talk the talk and walk the walk. When I was in high school, the common belief was that if you actively listen in class, you'll have 50% of the exam prep already in your pocket. I want to show you how I adapted this belief into an actual proactive approach you can take, one that will help you manage incidents more efficiently and in a more structured way, and eventually preserve much needed hours of sleep.
So first of all, hi, my name is Hila Fish. I'm a senior DevOps engineer and I work for Wix. I have 15 years of experience in the tech industry, which means a lot of production incidents. Recently I joined the AWS Community Builders program. I live in Israel and I help organize DevOps events like DevOpsDays Tel Aviv and the StatsCraft monitoring conference. I'm a mentor in courses and communities, including communities for technical women in tech. I'm a DevOps culture fan - I think this is what helps companies achieve great things. And I'm a lead singer in a cover band, as you can see in this picture, which is a lot of fun.
So what are we going to cover today? Incident management in general, the flow that works for me - the structured flow to take - the mindset you should have while dealing with production incidents, and how you can be proactive and come prepared for incidents. So let's start.
First of all, incident management is a set of procedures and actions taken to resolve critical incidents. It's basically an end-to-end process that defines how incidents are detected and communicated, who is responsible for handling them, what tools are used for investigation and response, and what steps are taken for resolution. The thing with incidents is that we need to reframe our perspective, because after enough years in the industry we know that everything fails, right? All the time. Since failures are a given, we can't stay in an ad hoc, putting-out-fires mindset. We need to reframe the mindset to be structured and say: okay, we know this is going to happen, but at least I'm prepared to deal with it. A business mindset is needed to grasp the overall impact of incidents and mitigate damage, because without a structured incident management process - or without handling incidents properly in general - we could potentially lose valuable data, downtime could lead to reduced productivity and revenue, and the business could be held liable for breach of service level agreements. As we know, each business has its own SLAs defined, with its own number of nines, so it's very important to treat incident management with all seriousness. That's why we need to reframe our perspective, have a business mindset, and make incident management a structured process, because a structured process can lead to improved productivity, improved mean time to resolution, and eventually cost reduction, since downtime is reduced or eliminated entirely.
So wait - a structured process for an incident? How could that be? We have a lot of unknowns. Sometimes it's incident X for one reason, sometimes it's something else entirely. How can it be structured if it's not consistent? Well, yes, it can be a structured process, because there are pillars you can follow, and I'm going to cover each one of them: identification and categorization, notification and escalation, investigation and diagnosis, resolution and recovery, and eventually incident closure - and what you should do at each stage.
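To make that structure tangible, here is a minimal sketch (an illustration of my own, not something shown in the talk) of how those five pillars could be modeled and tracked in code, using a made-up incident:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Pillar(Enum):
    """The incident management pillars, in the order they are worked through."""
    IDENTIFICATION_AND_CATEGORIZATION = 1
    NOTIFICATION_AND_ESCALATION = 2
    INVESTIGATION_AND_DIAGNOSIS = 3
    RESOLVE_AND_RECOVERY = 4
    CLOSURE = 5


@dataclass
class Incident:
    summary: str
    severity: str                      # e.g. "critical" or "minor"
    pillar: Pillar = Pillar.IDENTIFICATION_AND_CATEGORIZATION
    history: list = field(default_factory=list)

    def advance(self, note: str) -> None:
        """Record what was done in the current pillar, then move on to the next one."""
        self.history.append((datetime.now(), self.pillar.name, note))
        if self.pillar is not Pillar.CLOSURE:
            self.pillar = Pillar(self.pillar.value + 1)


# Walking one made-up incident through the structured flow:
incident = Incident("checkout-service returns 500s", severity="critical")
incident.advance("Understood the full extent: all users affected")
incident.advance("Notified customer success, escalated to the payments team")
print(incident.pillar)   # Pillar.INVESTIGATION_AND_DIAGNOSIS
```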
I've also put a reference link here to an article by OnPage, so you can dive deeper into these pillars later on.
During an incident you should really keep calm and ask yourself a few questions - and I'll address the "keep calm" part further on in the presentation. In the identification and categorization pillar: do I understand the full extent of the problem? If so, awesome, I can dive right in and notify people if I need to, because sometimes, if it's a crucial issue, I know that I need to update person X, or even customer success, so they can alert users that the application is down. It depends on the issue and its full extent. If I don't understand the full extent of the problem, then I should gather more information that will help me understand what's going on and what steps I need to take based on that. Next up: can this wait and be handled in business hours? Maybe you got paged at 4:00 AM, but it's not that important, and the alert is falsely labeled as critical when it's actually minor. We should address that. If we're not sure whether this can wait for business hours, we should ask, use the information we've gathered to figure it out, and escalate if we need to. And if we saw that the incident is not really critical but minor, we should change the severity or the runbook accordingly. Another thing to check: was I notified about this alert or issue through the proper, expected channels? If so, awesome. If not, I should add a note to self to fix that, because if I heard about an issue from a user complaint and not from PagerDuty or Opsgenie or the like, then we should really handle that.
The next pillar is notification and escalation: who should be notified about this incident? Here we have two routes - during the incident and in general. During the incident, decide by incident importance: if it's critical, if the application is down and affects a lot of users, then we should alert the support or customer success teams so they can communicate with customers if need be. And in general, maybe there are other teams or key focal points that care about the system, and we need to keep them posted about the system's health and status.
The next pillar is investigation and diagnosis: what information is relevant toward incident resolution? We should really focus on what's important and relevant and put the unimportant stuff aside, because focusing on non-relevant information will throw you off course and make you lose valuable time - whether you're dealing with the incident during business hours or not. So focus on what's important right now. Did I find the root cause? Do I understand the root cause of the issue? If so, awesome, I can progress accordingly. If not, I should investigate more, and if I see it's taking a long time, I should escalate to other team members, team leaders, or other teams to help me understand the root cause. We don't want to lose valuable time and keep the system in downtime if I could prevent that by just asking for help, right?
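To tie together the identification and notification questions above, here is a small sketch of what that triage decision could look like in code. This is my own illustration, with made-up severities and notification targets, not code from the talk:

```python
from datetime import datetime

# Hypothetical routing: who gets notified per severity.
NOTIFY = {
    "critical": ["on-call engineer", "team lead", "customer success"],
    "major": ["on-call engineer", "team lead"],
    "minor": ["team channel, next business day"],
}


def in_business_hours(now: datetime) -> bool:
    """Rough check: weekdays, 09:00-18:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 18


def triage(severity: str, full_extent_understood: bool, now: datetime) -> str:
    """Walk through the questions from the first two pillars."""
    if not full_extent_understood:
        return "Gather more information before deciding anything else."
    if severity == "minor" and not in_business_hours(now):
        # The 4 AM page that was falsely labeled critical: defer, and fix the alert later.
        return "Can wait for business hours; add a task to tweak the alert severity."
    return "Handle now; notify: " + ", ".join(NOTIFY[severity])


# A minor issue paging at 4 AM gets deferred:
print(triage("minor", True, datetime(2022, 11, 24, 4, 0)))
```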
Still on investigation and diagnosis: we should prioritize root cause over surface-level symptoms. Let's say I got an alert that the service is down on server X. Okay, I restarted it, nice. Then it happens again. And then it happens again. At that point I shouldn't just restart the service and go back to sleep, or just continue with my day. If it keeps happening, I need to check what's going on, right? I need to check the root cause of why the service stopped, because our focus is the environment, the system's health. We need to make sure we know what's going on and not just "fix" it - not really fix it, but put a band-aid over the scenario.
The last pillar is resolution and recovery: which possible remediation step is the best one to take? Maybe I found the issue, and maybe there are several things I can do about it. The point is to choose the fastest solution that eliminates downtime without compromising the system's health and stability. Yes, we all want to go back to sleep when something happens, because we care a lot about our quality of sleep, but at this point in time we should care more about the system's health. We should do whatever is good for the system's health, because it will bite us later if we don't. And this is what we're here for, right? We are either DevOps engineers or SREs. What is an SRE? A site reliability engineer. If I don't take care of the site's reliability, I'm not really doing my job.
Next up: are there any action items needed after the issue got resolved? Maybe it was the middle of the night and there wasn't really time to go into the developer's code and fix the issue properly, so maybe a patch was applied. If a patch was applied and management knows about it, everyone knows about it and agreed it should be done at that point, all good - but we should still permanently fix the issue, because we want to prevent a recurring issue and we want to make sure the system's health is good. If it's just a patch and not a permanent solution, then the system's health is probably not that great.
And last but not least is closure. Once the incident got resolved, do I need to notify anyone about this incident's resolution? We need to be end-to-end communicators. If at the beginning we alerted customer success or support teams that there's an issue, we now need to tell them: okay, the issue got resolved, please communicate it to the users, and let us know if anything is still off - maybe we think the issue got resolved, but they still see users experiencing problems. They are our QA of some sort, making sure everything works okay, but they also need to tell the users that the system should now be back to normal. Were the alerts okay, or do they need tweaking? As I said, maybe we got an alert in the middle of the night that wasn't actually critical - we need to fix and tweak those alerts, so we should do that. Is a relevant incident runbook in place? If it's outdated, maybe it needs to be updated. Runbooks are things we have - or should have - during an incident that help us resolve an issue, usually when we need to make some sort of judgment call. Let's say: if this happens, I need to do this, unless the other log shows X, and then I need to do that. There are a lot of scenarios where judgment is needed, and in those cases we should really have runbooks in place.
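As an illustration of such a judgment-call runbook as lightweight code (a sketch of my own - the service name, log path, and OOM check are all hypothetical, not something from the talk), a "service is down on server X" step that looks for a root cause before blindly restarting might look like this:

```python
import subprocess


def service_is_running(name: str) -> bool:
    """Ask systemd whether the unit is active (assumes a systemd-based host)."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", name], check=False)
    return result.returncode == 0


def recent_oom_kill(log_path: str = "/var/log/syslog") -> bool:
    """Judgment-call branch: did the kernel OOM-kill the service recently?"""
    try:
        with open(log_path, errors="ignore") as log:
            return any("Out of memory" in line for line in log.readlines()[-500:])
    except FileNotFoundError:
        return False


def runbook_service_down(name: str) -> str:
    """Runbook step for 'service is down': chase the root cause, not just the restart."""
    if service_is_running(name):
        return "False alarm: the service is up - check why the alert fired instead."
    if recent_oom_kill():
        return "Likely root cause: OOM kill. Restart, then open a task to fix memory usage."
    return "No obvious root cause in the logs. Restart, keep investigating, escalate if it recurs."


print(runbook_service_down("payments-api"))   # hypothetical service name
```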
If we don't have runbooks in place, we really need to write them down. And even if we do have runbooks in place, we need to update them and make sure they stay current. Could I help prevent similar incidents from happening again? Maybe I noticed something that could be tweaked, changed, or fixed in order to prevent similar incidents. If so, create a task for yourself in Jira or Monday or whatever tool you use, and help prevent the next incident from happening. And of course, does this incident require a postmortem? If yes, then jot down the notes as soon as possible, while it's still fresh in your mind. I think we all know that, as human beings, we remember things better while they're still happening, rather than an hour or two later or even the next day. Just to emphasize that: there was a study by Bluma Zeigarnik, a Russian psychologist, who found that we remember more details during an ongoing scenario than after its completion. So, in favor of a better postmortem process, write the notes down as soon as possible. And if there isn't a need for a postmortem, still share the knowledge, either through a runbook or a daily brief, or even do a mental check with yourself to make sure everything was handled as smoothly as possible - and if not, ask what could be done better.
Okay, let's talk a bit about war room conduct. A war room is when you have, I would say, more than three or four people handling an issue. In that case, we should really have an incident manager who divides the work and tells people what to do. This person should be calm and collected, see things clearly, and not be afraid to reduce people's involvement if it doesn't serve the purpose. If, let's say, we called someone in to help with a certain thing and that thing is now finished - okay, thanks, you can go - because too many people can be too noisy. Participation should be kept minimal and dynamic.
Let me tell you a story about that. At one of my previous jobs, I was new at the company, and there was a critical AWS issue that created a lot of bad stuff for us - downtime, not good stuff. I was new, so I didn't speak up, because I didn't think I had anything to contribute - I didn't know anything yet. But I was in the war room (it was on Zoom), just sitting there quietly, and I saw that they were going in directions that weren't really helpful. One person pulls the rope this way, another is thinking about that, another is talking about something else, everyone doing their own thing and not really coming together. At that point I jumped in and said: guys, I don't see this going anywhere, let me help. And I took the liberty of being the incident manager. I told one person: okay, you check the logs, check X. I said: I don't see that we have a runbook for a proper startup of the application, with the flow that needs to happen in a specific order, so please write down the process for starting the application in the order that's needed. I took this role upon myself, and then things really started to progress toward resolution. So an incident manager is very much needed in a war room, because we need an organized way of doing things, as always.
Speaking of an incident manager, there are a lot of qualities you should have when you handle an incident, whether it's in a war room or on your own. There are many, of course - I can't mention all of them - but let's cover the ones I think are important, with some tips from me on how to perfect them.
The first one is: think on your feet, be an impromptu action taker. Sometimes the issue will be something you're familiar with, but sometimes it will be something in uncharted territory, and you need to think on your feet and be ready for anything. To practice that, you can participate in brainstorming sessions at work - whenever possible, jump in and take part, because that kind of ping-pong will help you practice this quality.
Next: differentiate between relevant and non-relevant information. As I mentioned in the war room story - this one said this, that one said that, and people talked and talked, and I said: guys, we're not progressing toward resolution, we need to focus on what matters right now. It's a very important trait, to differentiate between what's important for fixing the issue and what's not. And basically, the more you know about how a system works, the more your ability to separate the relevant from the non-relevant information increases.
Operating under pressure. Let me tell you a story from another job. I had just joined - I was new in that position as well - and I had my first on-call shift. There was a big issue: a lot of alerts on the screen, like 100 alerts. It was crazy. I looked at the screen and thought: okay, I'm new, I don't know what to do yet, let's call the guy who's been here for two years and he will help me, because he knows what's going on. So I called him, he sat next to me, and I asked: okay, what do we need to do now? And I remember he just looked at the screen and went: wow, there are so many alerts. And I'm like: dude, snap out of it. He was totally out of it, and that's not helpful - we needed to snap out of it and see what we could do to fix things. The thing is that stress is a symptom of being out of control, and collecting relevant data will help you decrease stress levels. When you know what to do, you're in control. In general, keep a cool head and snap out of the uncertainty cloudiness. With multiple-participant incidents, keep in mind that stress levels rise because everyone is stressed and wants to fix the issue - so keep a cool head, start gathering the information that will help you solve the issue, and regain your control.
Methodical work. Time is of the essence, and there's pressure to solve things fast, as I just mentioned. But methodical work will actually get you to faster incident resolution. As I showed you before: a structured process - follow the steps, ask yourself the questions you need to ask - will help you regain control and move you toward faster incident resolution.
Be humble. If you're stuck, ask for help.
It's okay not to know how to fix an issue on your own. That's okay. But you need to understand that it's not your time to shine. People say: I will fix the issue, I will be the hero, and that's that. But no - your time to shine is when you help the company not lose money and not have downtime. You will have a lot of opportunities to prove yourself in your day-to-day work. The best way to prove yourself in an incident is to take a step back and escalate the issue if you don't know what to do and can't resolve it on your own, because that way you have the business's interests and health in mind. Remember the business mindset - it's exactly that.
Problem solver. If you have a problem-solver, "whatever is needed", can-do approach, you can basically do anything, because being positive is the way to go. If you start from that point, instead of being negative - "I'm not sure this is salvageable" - your ability to get things done increases. So always have a positive, can-do approach.
Sense of ownership and initiative. If you're on call and you escalated something to another person, that's good - we just talked about it - but you are still on call. It means that even if you escalated something, you're still responsible, and you need end-to-end handling of things. So after escalating, wait ten or fifteen minutes, whatever it takes, and then ask: hey, what's going on? Do you need help? Do you know what to do? Make sure you know what's happening and that the issue is really being handled, because maybe you escalated but the other person didn't understand it correctly, isn't really handling it, and right now nobody is handling the issue. Communication is very important: make sure that if you escalated, someone really is handling it, and that it's an end-to-end process.
Good communicator. You need to be able to explain the issue to others who will help you, and to communicate the issue for escalation purposes, so being a good communicator is very important - and communication guidelines can be established. Let's say you're not great at communication: you don't know who to talk to, or you don't tend to update people. If your company or department sets communication guidelines, then you will know exactly which channels should be used, what content is expected in those channels, and how communication should be documented. If this is all laid out for you, you know exactly what should be communicated, and it will help you be a better communicator.
Lead without authority. This is mostly relevant in a war room scenario with more than two people involved. Remember that if you're nice and confident, and you make people feel at ease and project an "everything is under control" facade, people will listen to you and follow your lead.
And I think the most important thing is caring. You need to care about what's going on. You need to care about production, about your team members, about your company. If you care, you will go the extra mile and you will be able to do everything I mentioned here - the structured process, and also the proactive approach that I'm going to show you right now.
Okay, so we covered the mindset you should have - the business mindset - when working on production and handling production incidents, and we covered a structured incident flow that will help you handle an incident better.
Now let's talk about being proactive - the proactive approach you should take in order to come prepared for the incident that will happen. Because, as the Fugees song goes: ready or not, here I come, you can't hide. If you're not ready, it doesn't matter - PagerDuty or Opsgenie or VictorOps or whatever tool you use will call you anyway when you're on call. So you'd better be ready. How can we be ready? The proactive approach after the fact, after an incident took place, looks something like this, in my opinion.
First of all, on-call shift handoffs. I'm not sure this is done at every company, but let's say I finished an on-call shift and there were several issues. If they were minor, fine, but if there was something special or something recurring, I should document it in an on-call shift handoff, a summary that I post in my team's channel. Then the on-call after me can read what's going on, and that way they stay updated on what's happening in production. It will help them have a better shift of their own, because if they hit an issue that is basically recurring - because I had it - they will know better how to handle it. It's good for audit purposes, because it's documented in the Slack channel, but it's also good for your team members' success, because you want to help them do their job better and have a smoother shift.
Postmortem notes. As I mentioned before, write them down as soon as possible, and even if there's no meeting, do a mental check - a retro with yourself - and see what you could have done better.
New tasks. Again, prevent the next incident. Do you have something in mind, based on what you saw in the incident, that could help stabilize the environment? Open a Jira or Monday ticket and fix it to prevent the next incident.
Modify alerts. Maybe you saw some false-positive alerts - I think we've all seen them in our careers, alerts that fire and close themselves after a couple of minutes. Don't just leave them, and don't wait for the next on-call to fix the alerts, because maybe they will wait for the next on-call, and they will wait for the next on-call, and it will never happen, and we will all keep suffering from those alerts. Please fix them.
Incident runbooks. I mentioned this before: if you don't have incident runbooks in place, please write them down, and update them along the way. This will give you a smoother process, because you're already prepared and know what to do in a given scenario with a given issue.
Automation. Let's say you found some candidates for self-remediation - issues that could be remediated automatically by the process or the flow itself. If so, open a ticket and make it happen (a sketch of what such a candidate could look like follows at the end of this section).
And if the issue was handled, share the knowledge. People could really benefit from your line of thought and how you fixed things. This sharing of knowledge is more in-depth than an on-call handoff: in a handoff you just write down a summary, whereas actually sharing the knowledge shows people how you figured things out, what your line of thought was, what flow you followed. It will really help others understand what's going on and come better prepared for incidents.
So we covered the proactive approach for after the fact, after an incident took place.
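On the automation point above, here is the sketch mentioned there - my own illustration with a hypothetical service name and state file, not something from the talk. The idea is to restart the service automatically, but refuse to keep band-aiding a flapping one, echoing the root-cause-over-symptoms point from earlier:

```python
import json
import subprocess
import time
from pathlib import Path

STATE_FILE = Path("/var/tmp/restart-counts.json")   # hypothetical bookkeeping location
MAX_RESTARTS_PER_DAY = 3


def remediate(service: str) -> str:
    """Restart a down service, but stop band-aiding once it starts flapping."""
    counts = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    key = f"{service}:{time.strftime('%Y-%m-%d')}"
    counts[key] = counts.get(key, 0) + 1
    STATE_FILE.write_text(json.dumps(counts))

    if counts[key] > MAX_RESTARTS_PER_DAY:
        # Recurring failure: a restart is no longer a fix, it's hiding a root cause.
        return f"{service} needed {counts[key]} restarts today - open a ticket and page a human."
    subprocess.run(["systemctl", "restart", service], check=False)
    return f"Restarted {service} (attempt {counts[key]} today)."


print(remediate("payments-api"))   # hypothetical service name
```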
Now let's discuss what you can do on your day to day that will help you come prepared for incidents.
First of all, the on-call shift handoffs I mentioned before: you should read all of them - not only the handoff from the shift adjacent to yours, but your entire team's. It shouldn't take long; each one should be about a paragraph. This helps you understand what's going on in production when you're not there, when you didn't make the changes yourself. It's very important, because that way you're always up to date with what's happening in production.
Escalation points of contact. You support several services at work, and you know the pieces of information related to your realm - my realm is infrastructure - but you should know other realms as well to have the full picture. Let's say there's an issue with service X: I've checked my side of things and I don't see any issue, but I know John is the one handling X on the code side, the developer side, so I should escalate to him to check things on his end. Identifying services and escalation points on a day-to-day basis, and not only ad hoc when an incident occurs, will save time and money on incident management - and save someone else's hours of sleep. Because if I don't know who handles the developer side of service X, I need to wake my team member or my team leader and ask: hey, who's responsible for that? Not nice. I can prevent that by coming prepared and already knowing which services are handled by which people (one way to write this mapping down is sketched after this section), and then it saves time during the incident: instead of chasing my tail and figuring it out on the spot, I know exactly who I can escalate the incident to.
Understanding system architecture. If I know the weaker areas in the infrastructure, or maybe in the code, and the vulnerabilities and sensitive areas or blast-radius scopes, that helps me understand the severity of an incident and what needs to be done, whether escalation or root cause analysis. Understanding and really learning how our infrastructure works, and where its vulnerabilities are, will really help us come prepared for any incident.
Coupled with that is learning application flows, because that way we know the business impact: if something bad happens to a service, we know whether it affects a lot of users or just a few. Business impact is very important here, and it matters for escalation purposes too. If I know the application flows - this service communicates with that one, which goes to this, which goes to that - then I can do root cause analysis by following the flow: these logs look okay here, this is okay, oh, here I have some issues. If I don't know the flow, I can't go down that path. So learning application flows is very, very important.
Team members' tasks. As we know, production changes are made not only by me or by you - your team members also contribute to them. And believe me, it would be easy for me to just lay low and deal only with my own tasks, but I'm responsible for production, and I need to know what's going on, so I need to know what my team members are doing and what changes they introduce to the environment. It's very important.
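On the escalation points item above: one way to write that mapping down ahead of time, so you don't wake your team lead at 4 AM just to ask who owns a service, is a tiny, version-controlled lookup like this sketch. The service names, owners, and contacts here are all made up:

```python
# Hypothetical service catalog: owning team, escalation contact, blast radius notes.
SERVICES = {
    "checkout-api": {
        "owner": "payments team",
        "escalation": "John (developer on-call)",
        "blast_radius": "all purchasing users",
    },
    "report-generator": {
        "owner": "data team",
        "escalation": "data on-call rotation",
        "blast_radius": "internal dashboards only",
    },
}


def who_do_i_page(service: str) -> str:
    info = SERVICES.get(service)
    if info is None:
        return f"Unknown service '{service}' - add it to the catalog after the incident."
    return (f"{service}: escalate to {info['escalation']} ({info['owner']}); "
            f"impact if down: {info['blast_radius']}")


print(who_do_i_page("checkout-api"))
```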
Related to team members' tasks: deployments or changes in production. Ask about the changes and their possible impact, because, as I said on the previous slide, Opsgenie or PagerDuty doesn't care whether you made the deployments or the changes yourself - it will call you anyway if you're on call. So you'd better understand and know what happened in production in order to handle incidents better.
And last but not least, be a go-to person. As they say: if you build it, they will come. If you're known as a person who is proactive and knows what's going on in the system, people will come to you. You'll get "push notifications", and it will decrease your need to fetch updates on your own, because people will come to you.
So, in order to talk the talk and walk the walk when it comes to incident management: have your qualities in check (if you know you tend to get stressed out, work on that, and keep everything else in check too), have a structured process in place, and prevent the next incident from happening. And remember: fewer incidents means less downtime, less downtime means business success, and business success is eventually your success. Thank you so much. If you have any questions about incident management or any other SRE topics, I'll be more than happy to help. Thank you so much.
...

Hila Fish

Senior DevOps Engineer @ Wix



