Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Hi everyone. Thank you so
            
            
            
              much for joining me. This talk is about incident management: talk the
            
            
            
              talk and walk the walk. When I was in high school, the common belief was
            
            
            
              that if you actively listen in class, you'll have 50% of
            
            
            
              the exam prep already in your pocket. I want to show you how
            
            
            
              I adapted this belief into an actual proactive approach that you
            
            
            
              could take that will help you manage incidents more efficiently, in a
            
            
            
              more structured way and eventually preserve much needed hours of sleep.
            
            
            
              So first of all, hi, my name is Hila Fish. I'm a senior DevOps engineer
            
            
            
              and I work for Wix. I have 15 years of experience in the tech
            
            
            
              industry, which means a lot of production incidents. Recently I have
            
            
            
              joined the AWS Community Builders program.
            
            
            
              I live in Israel and I help organize events,
            
            
            
              DevOps events like DevOpsDays Tel Aviv and the StatsCraft monitoring
            
            
            
              conference. I'm a mentor in courses and communities,
            
            
            
              including communities for women in tech, specifically technical
            
            
            
              women in tech. I'm a DevOps culture fan. I think this is what helps companies
            
            
            
              achieve great things. And I'm a lead singer in a cover band, as you
            
            
            
              can see in this picture, which is a lot of fun. Okay, so today,
            
            
            
              what are we going to cover today? Incident management
            
            
            
              in general. The necessary flow for me,
            
            
            
              the structured flow to take, the mindset that you
            
            
            
              should have while dealing with production incidents.
            
            
            
              And how you can be proactive and come prepared for
            
            
            
              incidents. So let's start. So, first of
            
            
            
              all, incident management is a set of procedures and actions
            
            
            
              taken to resolve critical incidents. And it's
            
            
            
              basically an end to end process that defines how incidents are
            
            
            
              detected and communicated, who is responsible to handle them,
            
            
            
              what tools are used for investigation and response, and what steps
            
            
            
              are taken for resolution. And the
            
            
            
              thing with incidents is that we need
            
            
            
              to reframe our perspective. Because after working enough years in
            
            
            
              the industry, we know that everything fails, right? All the
            
            
            
              time. So since failures are a given, we can't be in an ad
            
            
            
              hoc, putting-out-fires mindset. We need to reframe
            
            
            
              the mindset to be structured and
            
            
            
              say that, okay, we know that this
            
            
            
              is going to happen, but at least I'm prepared to deal with
            
            
            
              it. So a business mindset is needed to grasp
            
            
            
              the overall impact of incidents and mitigate damages.
            
            
            
              Because without a structured incident management process,
            
            
            
              or even without handling it properly
            
            
            
              in general, then we could potentially lose valuable
            
            
            
              data. Downtime could lead to reduced production
            
            
            
              and revenues, and the business could be held liable for breach
            
            
            
              of service level agreements. Because as we know, each business
            
            
            
              has its own SLAs defined: eight nines, eleven nines.
            
            
            
              So it's very important to treat incident management with all seriousness.
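To make the "nines" concrete, here is a minimal Python sketch that converts an availability target into a yearly downtime budget. The function name `allowed_downtime_seconds` and the 365-day period are illustrative assumptions, not something stated in the talk.

```python
def allowed_downtime_seconds(nines: int, period_days: float = 365.0) -> float:
    """Return the downtime budget, in seconds, for an availability of `nines` nines.

    Three nines means 99.9% availability, so the allowed unavailability
    is 10**-3 of the period. The 365-day default is an assumption.
    """
    unavailability = 10.0 ** (-nines)
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * unavailability


# Print the yearly budget for a few common targets.
for n in (3, 4, 5):
    minutes = allowed_downtime_seconds(n) / 60
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Each extra nine divides the budget by ten, which is why a breach of a strict SLA can happen within minutes of an unhandled incident.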
            
            
            
              So that's why we need to reframe our perspective,
            
            
            
              have a business mindset and have the incident management
            
            
            
              be a structured process because a
            
            
            
              structured process can lead to incident reduction,
            
            
            
              improved mean time to resolution, and eventually to cost reduction
            
            
            
              since downtime was reduced or eliminated entirely.
            
            
            
              So wait, a structured process of an incident,
            
            
            
              but how could it be? We have a lot of unknowns.
            
            
            
              Sometimes it's incident X for this reason, sometimes for that reason.
            
            
            
              So how can it be structured if it's not consistent?
            
            
            
              So yes, it can be a structured process, because there are
            
            
            
              pillars that you can follow through.
            
            
            
              I'm going to cover each one of them: identification and categorization,
            
            
            
              notification and escalation, investigation and diagnosis,
            
            
            
              resolve and recovery and eventually incident close.
            
            
            
              And what you should do then. And I also put
            
            
            
              some reference link here for an article by
            
            
            
              OnPage. You can deep dive into these pillars later
            
            
            
              on. Okay. So during an incident you should really
            
            
            
              keep calm and ask yourselves, and I'm also going to address
            
            
            
              the keep calm further on
            
            
            
              in the presentation. So you should really ask yourselves
            
            
            
              these questions. First of all, in the identification and categorization
            
            
            
              pillar, do I understand the full extent of the problem? If so,
            
            
            
              awesome. I can dive right in and notify people if
            
            
            
              I need to. Because sometimes if it's a
            
            
            
              crucial issue, I know that I need to update person x or
            
            
            
              even customer success to alert users
            
            
            
              that the application is down or not.
            
            
            
              Depends on the issue and the full extent of it. And if I don't know
            
            
            
              or don't understand the full extent of the problem, then I should gather more information
            
            
            
              that will help me
            
            
            
              understand what's going on and what steps I
            
            
            
              need to take based on that. Next up is: can
            
            
            
              this wait and be handled in business hours? Because maybe you
            
            
            
              got paged at 4:00 a.m., but maybe it's not that important
            
            
            
              and the alert is falsely labeled
            
            
            
              as critical. But it's actually not critical, it's minor.
            
            
            
              So we should address that. And if we're not
            
            
            
              sure if this can wait for business hours
            
            
            
              or not, we should really ask, and use
            
            
            
              the information that we gathered in order to understand that and escalate
            
            
            
              if we need to. And also if
            
            
            
              we saw that the incident is not really critical, it's minor,
            
            
            
              we should really change the severity or the runbook accordingly.
            
            
            
              And also another thing that we need to check is that
            
            
            
              was I notified about this alert or this issue
            
            
            
              from the proper or expected channels? Because if so,
            
            
            
              awesome. And if not, maybe I should add
            
            
            
              a note to self to fix that, because if I heard about an issue from
            
            
            
              a user complaint and not from PagerDuty or Opsgenie
            
            
            
              or stuff like that, then we should really handle that.
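The questions in this first pillar can be sketched as a small checklist in Python. This is only an illustration under stated assumptions: the `TriageAnswers` fields, the `triage` function, and the action strings are hypothetical names invented here, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TriageAnswers:
    """Answers to the identification and categorization questions (hypothetical model)."""
    understands_full_extent: bool
    can_wait_for_business_hours: Optional[bool]  # None means "not sure"
    labeled_severity: str       # severity the alert arrived with
    assessed_severity: str      # severity after looking at the problem
    came_from_expected_channel: bool


def triage(answers: TriageAnswers) -> List[str]:
    """Turn the answers into a list of follow-up actions."""
    actions: List[str] = []
    if answers.understands_full_extent:
        actions.append("dive in and notify the relevant people")
    else:
        actions.append("gather more information")
    if answers.can_wait_for_business_hours is None:
        actions.append("ask or escalate to decide urgency")
    elif answers.can_wait_for_business_hours:
        actions.append("handle during business hours")
    if answers.assessed_severity != answers.labeled_severity:
        actions.append("update the alert severity or runbook")
    if not answers.came_from_expected_channel:
        actions.append("note to self: fix the notification channel")
    return actions


# Example: paged at night for an alert mislabeled as critical.
print(triage(TriageAnswers(True, True, "critical", "minor", True)))
```

Writing the questions down as data keeps the decision explicit, so the same checklist can be reviewed after the incident.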
            
            
            
              Next pillar is notification and escalation. So who should
            
            
            
              be notified about this incident? Here we have two routes: during
            
            
            
              an incident and in general. So during the incident
            
            
            
              you should decide by incident importance. Because again,
            
            
            
              if it's critical, if the application is down and affects a lot of
            
            
            
              users, then we should alert the support or the customer success teams
            
            
            
              to communicate to customers if need be.
            
            
            
              And in general, maybe there are other teams or key focal
            
            
            
              points that care about the system and we need to keep them posted about the
            
            
            
              system's health and status. The next pillar
            
            
            
              is investigation and diagnosis. So what information is relevant
            
            
            
              toward incident resolution? Because we should really focus
            
            
            
              on what's important and what's relevant and put the
            
            
            
              unimportant stuff aside because focusing
            
            
            
              on the non relevant information will throw you off route and
            
            
            
              make you lose valuable time. When you deal with
            
            
            
              an incident, doesn't matter if it's during business hours or not,
            
            
            
              it will lose you valuable time. So you should really focus on what's important
            
            
            
              right now. Did I find the
            
            
            
              root cause? Do I understand the root cause of the issue?
            
            
            
              If so, awesome. I can progress according to that.
            
            
            
              But if not, I should investigate more. And if
            
            
            
              I see it takes a lot of time, I should really escalate to
            
            
            
              other team members or team leaders or other teams to help
            
            
            
              me understand the root cause. Because we don't want to lose
            
            
            
              valuable time and have the system in
            
            
            
              downtime. If I could prevent that by just asking for help.
            
            
            
              Right? And also we should really prioritize root cause
            
            
            
              over surface level symptoms because let's say
            
            
            
              I got an alert that a service is down on
            
            
            
              server X. Okay, I started it. Nice.
            
            
            
              Then it happens again. And then it happens again. And then I shouldn't
            
            
            
              just start the service and go back to sleep or just continue
            
            
            
              with my day. If it keeps happening, I need to check what's going
            
            
            
              on, right? I need to check the root cause of why the service got stopped
            
            
            
              and what's going on. Because our focus is
            
            
            
              the environment, the system's health. So we need to make sure
            
            
            
              that we know what's going on and not just fix, not really
            
            
            
              fix, but put a band aid over
            
            
            
              the scenario. And the last pillar is resolve
            
            
            
              and recovery. So which possible remediation step is
            
            
            
              the best one to take? Maybe I found the issue. Maybe there are
            
            
            
              a lot of things I can do about it, right?
            
            
            
              So the thing is that you should choose the fastest
            
            
            
              solution to eliminate downtime without compromising
            
            
            
              the system's health and stability. Because, yeah,
            
            
            
              we all want to go back to sleep when something happens because
            
            
            
              we care a lot about our quality of sleep,
            
            
            
              but at this point in time, we should really care
            
            
            
              more about the system's health. So we should do whatever
            
            
            
              is good for the system's health because it will
            
            
            
              come back to bite us if we don't.
            
            
            
              And this is what we're here for, right? We are either DevOps
            
            
            
              engineers or SREs. What is SRE? Site reliability
            
            
            
              engineer. If I don't care about or don't take care of the site
            
            
            
              reliability, I'm not really doing my job.
            
            
            
              So there's that. Next up is are
            
            
            
              there any action items needed after the issue got
            
            
            
              resolved? Maybe it was the middle
            
            
            
              of the night and there wasn't really time to
            
            
            
              go to the developer's code and fix the
            
            
            
              issue properly. So maybe there was a patch done.
            
            
            
              Okay, if there was a patch done and management knows about it,
            
            
            
              everyone knows about it and agreed that it should be done. At that
            
            
            
              point, all good, but we should permanently
            
            
            
              fix the issue because again, we want to, a, prevent a
            
            
            
              recurring issue and b, we want to make sure the system health is
            
            
            
              good. And if it's just a patch and not a permanent solution,
            
            
            
              then probably the system health is not that great.
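One way to notice that a patch is only a band-aid is to count how often the same alert comes back. The following Python sketch assumes alerts are recorded as hypothetical `(alert_name, host)` pairs and uses an arbitrary threshold of three. Both are assumptions for illustration, not a feature of any specific alerting tool.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Alert = Tuple[str, str]  # (alert name, host), a hypothetical record format


def recurring_alerts(alerts: Iterable[Alert], threshold: int = 3) -> List[Alert]:
    """Return alerts seen at least `threshold` times, sorted for stable output.

    These are candidates for root cause investigation and a permanent fix,
    rather than another restart.
    """
    counts = Counter(alerts)
    return sorted(alert for alert, seen in counts.items() if seen >= threshold)


history = [
    ("service-down", "server-x"),
    ("disk-usage-high", "server-y"),
    ("service-down", "server-x"),
    ("service-down", "server-x"),
]
for name, host in recurring_alerts(history):
    print(f"{name} on {host} keeps recurring: open a task for a permanent fix")
```

The threshold and the time range of `history` are choices to tune per system. The point is only that repeated symptoms should become a tracked action item.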
            
            
            
              And last but not least is closure. So once the incident
            
            
            
              got resolved, do I need to notify anyone about this incident's resolution?
            
            
            
              We need to be end-to-end communicators. So if at the beginning we alerted
            
            
            
              customer success or support teams that there's an issue,
            
            
            
              we now need to tell them, okay, issue got resolved. Please make
            
            
            
              sure that, a, you communicate it to the users and b,
            
            
            
              let us know. Maybe we think the issue got resolved,
            
            
            
              but something happens and they still experience issues.
            
            
            
              Right. So they are our QA of
            
            
            
              some sort to make sure that everything works okay.
            
            
            
              But they also need to communicate to the users that now
            
            
            
              the system should be back to normal. Were the alerts
            
            
            
              okay, or do they need tweaking? Because as I said, maybe we
            
            
            
              got an alert in the middle of the night and it's not that critical.
            
            
            
              We need to fix the alerts and tweak them. So we should do
            
            
            
              that. Is a relevant incident runbook in place?
            
            
            
              If it's outdated, maybe it needs to be updated.
            
            
            
              Right. So runbooks are things that
            
            
            
              we have during an incident or should have during an incident that helps
            
            
            
              us resolve an issue usually, or mostly
            
            
            
              when we need to have some sort of judgment on an incident.
            
            
            
              So let's say, if this happens, I need to do this,
            
            
            
              unless the other log shows X, and then I need
            
            
            
              to do that, right? So there are a lot of things and scenarios
            
            
            
              where judgment is needed. In that case, we should really have runbooks
            
            
            
              in place. If we don't have runbooks in place, please,
            
            
            
              we all need to write them down. And even if we do have runbooks
            
            
            
              in place, we need to update them to make sure that they are up to
            
            
            
              date. Could I help prevent
            
            
            
              similar incidents from happening again? Maybe I noticed something that could
            
            
            
              be tweaked or changed or fixed or edited in order to
            
            
            
              prevent similar incidents from happening. If so, create a
            
            
            
              task for you in Jira or Monday or whatever tool that you use,
            
            
            
              and help prevent the next incident from happening.
            
            
            
              And also, of course, does this incident require a postmortem?
            
            
            
              If yes, then okay, jot down the notes as soon as possible while
            
            
            
              it's still fresh in your mind. Because I think we all
            
            
            
              know that we are human beings and we remember
            
            
            
              things better while they're still happening, and not like after
            
            
            
              an hour or two or even the next day. But just to
            
            
            
              make an emphasis on that, there was a study conducted by Bluma
            
            
            
              Zeigarnik, who's a Russian psychologist. She found that we
            
            
            
              remember more details during an ongoing scenario rather than upon
            
            
            
              their completion. So, in favor of a
            
            
            
              better post mortem process, write the notes down as soon as possible.
            
            
            
              And if there isn't a need for postmortem,
            
            
            
              still share the knowledge, either through a runbook or
            
            
            
              a daily brief, or even do a mental check with yourself to
            
            
            
              make sure that everything was handled as smoothly as
            
            
            
              possible. And if not, what could be done better?
            
            
            
              Okay, let's talk a bit about war room conduct.
            
            
            
              So, war room is when you have, I would say, more than
            
            
            
              three, four people handling
            
            
            
              an issue, right? So in that case, we should really
            
            
            
              have an incident manager that really
            
            
            
              divides the work and tells people what to do. This person should be
            
            
            
              calm and collected and see things clearly
            
            
            
              and not be afraid to reduce people's involvement if it doesn't serve the purpose.
            
            
            
              Because if, let's say we call this guy to help
            
            
            
              with a certain thing. Now this certain thing is finished.
            
            
            
              Okay, now, guy, please go away. Because we don't
            
            
            
              need the extra noise, because too many people can be too noisy.
            
            
            
              It should be kept minimal and dynamic.
            
            
            
              And there's that. And I want to tell you even a story about
            
            
            
              that. At one of my previous jobs,
            
            
            
              I was new at the company, and there was a critical
            
            
            
              AWS issue that created a lot of bad stuff
            
            
            
              for us: downtime, not good stuff. And I was
            
            
            
              new, right? So I didn't speak up because I didn't think that
            
            
            
              I have something to contribute because I don't know anything yet.
            
            
            
              I'm new, but I was there in the war
            
            
            
              room; it was on Zoom. So I was just there quietly and
            
            
            
              I saw that they are just going places
            
            
            
              that are not really helpful. He pulls
            
            
            
              this rope like this way and
            
            
            
              he thinks about that and he talks about that and everyone is just
            
            
            
              doing their own thing and not really coming together. So at that point
            
            
            
              of time, I jumped the gun and said, guys,
            
            
            
              I don't see that this is coming or going anywhere.
            
            
            
              Let me help. And then I took the liberty to be the
            
            
            
              incident manager. And then I told this guy, okay,
            
            
            
              you check the logs, check x. I don't see that we have
            
            
            
              a runbook for a proper startup of
            
            
            
              the application to see the flow that needed
            
            
            
              to be in a specified order. So please write down
            
            
            
              the process for us to start the application
            
            
            
              in the order that is needed. Right. So I took this
            
            
            
              role upon me and then stuff started to really
            
            
            
              progress towards resolution. So incident manager is
            
            
            
              very needed in a war room conduct because we
            
            
            
              need to have an organized way of doing
            
            
            
              things, as always. Okay. And speaking
            
            
            
              of an incident manager, there are a lot of
            
            
            
              qualities that you should have when
            
            
            
              you handle an incident. It doesn't matter if it's in a war room or
            
            
            
              on your own, but there are a lot of qualities. Of course there are a
            
            
            
              lot. I'm not going to tell everything here
            
            
            
              or mention everyone because I can't mention all qualities, but let's cover
            
            
            
              the ones that I think that are important and some tips
            
            
            
              from me on how to perfect them. So the first one is
            
            
            
              thinking on your feet, being an impromptu action
            
            
            
              taker. So sometimes the
            
            
            
              issue will be something that you are familiar with, but sometimes it
            
            
            
              will be something in an uncharted territory. And you need to think on
            
            
            
              your feet and be ready for anything. And in order to practice that,
            
            
            
              you can participate in brainstorming sessions at work.
            
            
            
              So whenever possible, you can jump the gun and participate
            
            
            
              in these sessions, because these kinds of ping-pong scenarios
            
            
            
              will help you practice this quality.
            
            
            
              Next one is differentiate between relevant and non relevant
            
            
            
              information. As I just mentioned before in the
            
            
            
              war room story, he said that,
            
            
            
              he said that. And people talked and talked and talked.
            
            
            
              I'm like, guys, we don't progress towards resolution. We need
            
            
            
              to really focus on what matters right now.
            
            
            
              So that's a very important trait to have,
            
            
            
              to differentiate between what's important for fixing
            
            
            
              the issue and what's not. And basically, the more
            
            
            
              you know how a system works, the more your ability
            
            
            
              to separate the relevant from the non-relevant information increases.
            
            
            
              Next is operating under pressure. So let me also
            
            
            
              tell you a story from another job
            
            
            
              that I joined.
            
            
            
              I was new at that position as well. And I
            
            
            
              had my first on call.
            
            
            
              My first on-call shift. And there
            
            
            
              was a big issue. I mean, a lot of alerts on the screen.
            
            
            
              Like 100 alerts. It was crazy.
            
            
            
              And then I looked at the screen, I'm like, okay, I'm new. I don't know
            
            
            
              what to do yet. Let's call the guy that's there for two years,
            
            
            
              and he will help me, right? Because he knows what to do.
            
            
            
              He's familiar with what's going on there. So I called him, and then
            
            
            
              he sat next to me, and I'm like, okay, now what do we need to
            
            
            
              do? And then I remember he just looked at the screen. He was like,
            
            
            
              wow, there are so many alerts.
            
            
            
              And I'm like, dude, dude, snap out of it.
            
            
            
              So he was totally out of it. And I'm like,
            
            
            
              guy, dude, it's not helpful.
            
            
            
              We need to snap out of it and see what we can do
            
            
            
              to fix that. Right? And the thing is that
            
            
            
              stress is a symptom of being out of control, and collection
            
            
            
              of relevant data will help you decrease stress
            
            
            
              levels. So when you know what to do, you're in control.
            
            
            
              And in general, you should really keep a cool head.
            
            
            
              Snap out of the cloud of uncertainty, because when
            
            
            
              it comes to multi-participant incidents,
            
            
            
              it's also something that you should think about, that the
            
            
            
              stress level increases because everyone is stressed and they want
            
            
            
              to fix the issue, right? So keep a cool head and
            
            
            
              just start to gather information that will help you
            
            
            
              solve the issue and then regain your control.
            
            
            
              Methodical work. So time
            
            
            
              is of the essence, right? And there is pressure to solve things fast,
            
            
            
              as I just mentioned in the previous bullet.
            
            
            
              But the thing is that methodical work will help you reach faster
            
            
            
              incident resolution. As I showed you before: a structured process,
            
            
            
              follow the rules, follow the questions that you need to ask yourselves,
            
            
            
              and then it will help you regain control and progress
            
            
            
              you towards faster incident resolution. Be humble.
            
            
            
              If you're stuck, ask for help. So it's okay not to
            
            
            
              know how to fix an issue on your own. That's okay.
            
            
            
              But you need to understand that it's not your time to shine.
            
            
            
              People say, I will fix the issue. I will be the hero,
            
            
            
              and that's that. But no, your time to shine will be if
            
            
            
              you help the company not lose
            
            
            
              money and not have downtime, right. So you will have a lot of
            
            
            
              opportunities to prove yourself on your day to day. The best way to prove
            
            
            
              yourself in an incident is to take a step back and escalate an
            
            
            
              issue. If you don't know what to do and you can't resolve it on
            
            
            
              your own, because that way you have the business's interest at heart.
            
            
            
              Remember the business mindset? It's exactly that.
            
            
            
              Problem solver. So if you have a problem solver approach,
            
            
            
              a whatever-is-needed, can-do approach, you can basically do anything, because
            
            
            
              being positive is the way to go. And if you start from that
            
            
            
              point and not from a negative one, like,
            
            
            
              I'm not sure it's salvageable or whatever,
            
            
            
              it means that your ability to do stuff increases.
            
            
            
              So always have a positive, can-do approach. Sense of
            
            
            
              ownership and initiative. So if
            
            
            
              you're on call and you escalated something to another person, that's good,
            
            
            
              right? We just talked about it, but you are still on call. It means
            
            
            
              that if you escalated something, you're still responsible and you
            
            
            
              need to have end-to-end handling of things.
            
            
            
              So after escalation, wait ten minutes,
            
            
            
              15 minutes, whatever it takes, and then ask, hey, what's going on? Do you need
            
            
            
              help? Do you know what to do? And be sure that you know
            
            
            
              what's going on and it's really handled, because maybe you escalated but the
            
            
            
              other person, I don't know, didn't understand it correctly
            
            
            
              and he's not really handling it, and nobody's handling the issue right now.
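The follow-up rule described here (escalate, wait ten or fifteen minutes, then check that someone is really handling it) could be sketched roughly as below. This is only an illustration: the `Escalation` record, the `needs_follow_up` helper, and the ten-minute default are hypothetical names and values, not part of any paging tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Escalation:
    """Hypothetical record of one escalation made by the on-call engineer."""
    incident: str
    escalated_to: str
    escalated_at: datetime
    # Set to True once the other person confirms they are working on it.
    confirmed_handling: bool = False


def needs_follow_up(esc: Escalation, now: datetime,
                    wait: timedelta = timedelta(minutes=10)) -> bool:
    """Return True when the wait has passed and nobody has confirmed handling.

    The on-call engineer still owns the incident after escalating, so an
    unconfirmed escalation is a prompt to ask "what's going on, do you need help?"
    """
    return (not esc.confirmed_handling) and (now - esc.escalated_at >= wait)
```

The point of the sketch is that ownership stays with whoever is on call: the check only stops firing once the other person has explicitly confirmed they are handling the issue.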
            
            
            
              So communication is very important. And make sure that if you escalated,
            
            
            
              it's really being handled, someone is really handling it,
            
            
            
              and it's an end-to-end process.
            
            
            
              Good communicator. So you really need to explain
            
            
            
              the issue to others who will help you, and communicate the issue
            
            
            
              for escalation purposes. So being a good communicator is very
            
            
            
              important and communication guidelines can be established.
            
            
            
              So let's say you're not good at communication, you're not great with
            
            
            
              that, you don't know who to talk to, or you
            
            
            
              don't have the tendency of updating people, right? So if
            
            
            
              your company or your department sets communication guidelines,
            
            
            
              then you will know exactly what channels should be used and what
            
            
            
              content is expected in those channels and how communication
            
            
            
              should be documented. And if you have this laid
            
            
            
              all out for you, then you know exactly what should be communicated and
            
            
            
              it will help you be a better communicator. Lead
            
            
            
              without authority. It's mostly relevant in a war
            
            
            
              room scenario with more than two people involved.
            
            
            
              And remember that if you're nice and confident and you
            
            
            
              make people feel at ease and project an "everything is under
            
            
            
              control" facade, then people will listen to you and follow your
            
            
            
              lead. And I think that the most important thing is
            
            
            
              caring. You need to care about what's going on. You need to care
            
            
            
              about production. You need to care about your team
            
            
            
              members, your company. If you care, then you will go the extra mile
            
            
            
              and you will be able to do anything that
            
            
            
              I mentioned here, and the structured process, and also the proactive approach that
            
            
            
              I'm going to show you right now. Okay, so we
            
            
            
              covered the mindset that you should have, the business mindset, when working
            
            
            
              on production and handling production incidents. We covered
            
            
            
              an incident flow, a structured one that will help you handle
            
            
            
              an incident better. Now let's talk about being proactive.
            
            
            
              The proactive approach that you should have in order
            
            
            
              to come prepared for an incident that will happen. Because as
            
            
            
              the song by the Fugees goes, right? Ready or not, here I
            
            
            
              come. You can't hide. So if you're not ready,
            
            
            
              it doesn't matter. PagerDuty or Opsgenie or VictorOps or whatever
            
            
            
              tool that you use will call you anyway when you're on
            
            
            
              call. So you better be ready. So how can we be
            
            
            
              ready? Right? The proactive approach after the fact, after an
            
            
            
              incident took place, looks something like this, in my opinion.
            
            
            
              So first of all, on call shift handoffs. I'm not
            
            
            
              sure if it's something that is done at every company,
            
            
            
              but let's say I finished the shift. By the way,
            
            
            
              this is the word that I was looking for before:
            
            
            
              on call shift. So after an on call shift,
            
            
            
              there were several issues. If they were minor, then okay,
            
            
            
              but if there was something special or something recurring,
            
            
            
              I should document it in an on call shift handoff, which is
            
            
            
              a summary that I will post in my team's channel.
            
            
            
              And then the on call after me can read what's going on,
            
            
            
              and that way he or she or they can be updated
            
            
            
              on what's going on in production. And it will help them
            
            
            
              have a better shift of their own, because if they have
            
            
            
              an issue, and the issue is basically recurring
            
            
            
              because I had it too, then they will know better how to handle it.
            
            
            
              So it's good for audit purposes because it is documented
            
            
            
              in the Slack channel, but it's also good for your team
            
            
            
              members' success because you want to help them do their job better and have
            
            
            
              a smoother shift.
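As a rough sketch of what such a handoff summary might look like when generated from a list of shift issues, the snippet below skips minor issues and lists only the recurring or unusual ones. The `ShiftIssue` fields, the `build_handoff` helper, and the message wording are assumptions made for illustration; posting the text to a team channel is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShiftIssue:
    """Hypothetical record of one issue seen during an on call shift."""
    title: str
    recurring: bool = False
    unusual: bool = False
    notes: str = ""


def build_handoff(engineer: str, issues: List[ShiftIssue]) -> str:
    """Build a short plain-text handoff: minor issues are counted, not listed."""
    notable = [i for i in issues if i.recurring or i.unusual]
    lines = [
        f"On-call handoff from {engineer}: "
        f"{len(issues)} issue(s), {len(notable)} worth noting."
    ]
    for issue in notable:
        flags = (("recurring", issue.recurring), ("unusual", issue.unusual))
        tags = ", ".join(name for name, is_set in flags if is_set)
        lines.append(f"- {issue.title} [{tags}]: {issue.notes}")
    if not notable:
        lines.append("- Nothing special or recurring this shift.")
    return "\n".join(lines)
```

Keeping the output to a few lines matches the idea that a handoff is a summary the next person can read in under a minute.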
            
            
            
              Postmortem notes. So as I mentioned before, write them down as
            
            
            
              soon as possible. And even if there's no meeting,
            
            
            
              do a mental check. Do a retro with yourself, see what you
            
            
            
              could have done better. New tasks.
            
            
            
              So again, prevent the next incident. Do you have
            
            
            
              something in mind that could help, based on what you saw in the incident
            
            
            
              that could help stabilize the environment?
            
            
            
              Open a Jira or a Monday ticket and fix it to prevent
            
            
            
              the next incident. Modify alerts. So maybe you
            
            
            
              saw some false positive alerts, and I think we've
            
            
            
              all seen it in our careers, alerts that come up and after
            
            
            
              a couple of minutes get closed. So don't just leave that,
            
            
            
              right? And don't just wait for the next on call to
            
            
            
              fix the alerts because maybe they will wait for the next on call and they
            
            
            
              will wait for the next on call and then it will never
            
            
            
              happen and we will all suffer from these alerts.
            
            
            
              So please fix them. Incident runbooks. So
            
            
            
              I mentioned it before, if you don't have incident runbooks
            
            
            
              in place, please write them down and update
            
            
            
              them along the way. And this will help you to have a
            
            
            
              smoother process, right. Because you are already prepared, you know what to do in
            
            
            
              a certain scenario and with certain issues.
            
            
            
              Automation. Let's say you found some candidates
            
            
            
              for self-remediation, some issues that could be
            
            
            
              self-remediated by the process or the flow itself.
            
            
            
              So if so, open a ticket and make it happen.
            
            
            
              And once the issue is handled, share the knowledge. I mean, people could really
            
            
            
              benefit from your line of thought and how you fix things.
            
            
            
              And this knowledge sharing is more in depth than in an on
            
            
            
              call handoff, because in an on call handoff, you just write
            
            
            
              down a summary. Here I mean actual knowledge sharing,
            
            
            
              to show people how you figured things out. What was
            
            
            
              your line of thought? What was the flow that you had? It will really help
            
            
            
              others understand what's going on and come prepared better
            
            
            
              for incidents. So we covered the proactive approach
            
            
            
              for after the fact, after an incident took place. Now let's discuss
            
            
            
              what you can do on your day to day that will help you come prepared
            
            
            
              for incidents. So, first of all, the on call shift
            
            
            
              handoffs that I mentioned before. You should read everything, okay?
            
            
            
              Not only the handoffs
            
            
            
              that the person after you wrote,
            
            
            
              but your entire team's. It shouldn't take long. It should just
            
            
            
              be like a paragraph. And this could help you understand
            
            
            
              what's going on in production when you're not there, when you didn't do the changes
            
            
            
              yourself. So it's very important because that way you will always be
            
            
            
              up to date with what's going on in production.
            
            
            
              Escalation, a point of contact. So you support
            
            
            
              several services at work, right? And you know the
            
            
            
              needed pieces of information related to your realm. My realm is infrastructure,
            
            
            
              so that's great, but you should also know
            
            
            
              other realms as well to have the full picture. So let's say
            
            
            
              there's an issue with X. I've checked my side of things,
            
            
            
              I don't see any issue, but I know that John is the one
            
            
            
              that is handling X from the code side, from the
            
            
            
              developer side. So I should escalate to him to check things
            
            
            
              on his end. Identifying services' escalation points
            
            
            
              on a day to day basis, and not only ad hoc
            
            
            
              when an incident occurs, will save time and money
            
            
            
              on incident management and save someone else's hours of sleep.
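One way to keep escalation points ready ahead of time is a small lookup table maintained as part of day to day work. The sketch below is only an illustration: the service names, realms, and contact names in `ESCALATION_POINTS` are hypothetical placeholders, and a real team might keep this in a wiki or in its paging tool instead.

```python
from typing import Dict, Optional

# Hypothetical map: service -> realm -> who to escalate to.
# Filled in during normal working hours, not during an incident.
ESCALATION_POINTS: Dict[str, Dict[str, str]] = {
    "service-x": {"infrastructure": "infra-on-call", "code": "john"},
    "service-y": {"infrastructure": "infra-on-call", "code": "dana"},
}


def escalation_contact(service: str, realm: str) -> Optional[str]:
    """Return the known contact for a service and realm, or None if unknown.

    A None result is the gap to close before the next incident, so that
    nobody has to be woken up just to answer "who owns this?".
    """
    return ESCALATION_POINTS.get(service, {}).get(realm)
```

For example, `escalation_contact("service-x", "code")` returns the developer-side contact directly, while an unknown service returns `None` instead of raising, which makes missing entries easy to spot and fill in.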
            
            
            
              Right? Because if I don't know who's handling a developer side
            
            
            
              of service x, then I need to wake my team member. I need to
            
            
            
              wake my team leader and ask, hey, who's responsible
            
            
            
              for that, right? So not nice. So I
            
            
            
              can prevent that and already come prepared and know that these services are
            
            
            
              handled by these guys or these women or whatever.
            
            
            
              And then it will save time for me during
            
            
            
              the incident. And rather than just
            
            
            
              chasing my tail and figuring this out on the spot,
            
            
            
              I know exactly who I can escalate the incident
            
            
            
              to. Understanding system architecture. So if
            
            
            
              I know weaker areas in the infrastructure or
            
            
            
              in the code maybe, and vulnerabilities and sensitive or
            
            
            
              blast radius scopes, then it helps me understand the
            
            
            
              severity of an incident and helps me understand what
            
            
            
              needs to be done, either escalation or root cause analysis.
            
            
            
              So understanding and really learning how our
            
            
            
              infrastructure works and its vulnerabilities will really
            
            
            
              help us come prepared for any incidents.
            
            
            
              And coupled with that is learning application flows, because that
            
            
            
              way we know the business impact. If something bad
            
            
            
              happens, we have a service, we know if this is a service
            
            
            
              that affects a lot of users or maybe a few. So business impact
            
            
            
              is very important in that case and also for escalation purposes.
            
            
            
              If I know application flows, I know that this service communicates with this and
            
            
            
              goes to this and goes to that, then I can do
            
            
            
              a root cause analysis and go by the flow and see,
            
            
            
              okay, these logs look okay. Here it's okay.
            
            
            
              Oh, here I have some issues. If I didn't know the flow, I wouldn't be
            
            
            
              able to go in this path. So learning application flows is very,
            
            
            
              very important. Team members' tasks. So as we
            
            
            
              know, production is changed not only by me or by
            
            
            
              you; your team members also contribute
            
            
            
              to the production changes that happen.
            
            
            
              And believe me, it really is easy for me to just lay low
            
            
            
              and deal only with my tasks. But I'm responsible
            
            
            
              for production. I need to know what's going on so I need to know what
            
            
            
              my other team members are doing and what changes they introduce to the
            
            
            
              environment. Because I'm responsible for the environment and I need to know what's going
            
            
            
              on. So it's very important. And again,
            
            
            
              coupled with that: deployments or changes in production.
            
            
            
              so ask about the changes and their possible impact.
            
            
            
              And as I said on the previous slide, Opsgenie
            
            
            
              or PagerDuty doesn't care if you didn't do the
            
            
            
              deployments or the changes by yourself. It will call you anyway if
            
            
            
              you're on call. So you better understand and know what happened
            
            
            
              in production in order to handle incidents better.
            
            
            
              And last but not least, be a go-to person.
            
            
            
              As they say, if you build it, they will come.
            
            
            
              If you are a person that is known to
            
            
            
              be proactive and know what's going on in the
            
            
            
              system, people will come to you. You'll get push notifications
            
            
            
              and it will decrease your need to fetch the updates on your own because people
            
            
            
              will come to you. So there's that.
            
            
            
              And I say that in order to talk the talk
            
            
            
              and walk the walk when it comes to incident management, have your qualities
            
            
            
              in check, so if you know that you're going to be stressed out,
            
            
            
              work on that and the others. Once that is
            
            
            
              in check, put a structured process in place.
            
            
            
              Then prevent
            
            
            
              the next incident from happening. And remember, fewer incidents
            
            
            
              means less downtime, which means business success, and business success
            
            
            
              is eventually your success. So thank you so much.
            
            
            
              If you have any questions about incident management or
            
            
            
              any other SRE topics, I will be more than happy to help. Thank you
            
            
            
              so much.