Conf42 Incident Management 2023 - Online

A Team That Rises From Crisis: Leveraging Incidents for Growth and Improvement

Abstract

In a rapidly changing business landscape, survival isn’t enough; it’s time to thrive. Join me as I share real-world strategies that have helped my team to turn crises into catalysts for growth. Get actionable insights to elevate your team’s resilience. Don’t weather the storm—dance in the rain.

Summary

  • Dol: Today we're going to talk about a team that rises from a bug. Do you know when your next bug will occur? Dol: It doesn't have to be a bug, it can be a pivot you have to do in the product. The essence of this talk is how we are going to deal with it.
  • Growth culture is the foundation of every growing organization. Focus on learning means that you want to make sure every day that tomorrow you'll be a bit better than you were today. If your company does not have a guild, start one today.
  • System design, or let's say system documentation. Have you ever stopped and read its documentation? This could be a great way to look for the next storm.
  • A design review is a way for you to tell people what you plan to do before you do it. You address an issue, you find a solution, and before you start to implement it, you present it to your team, to your manager, to everyone relevant to this issue. Do that before you implement, not after, because then it gets much harder.
  • Tests are your best way to connect with the mind of the person who wrote the code you're dealing with. Read the tests, don't just pass them. And finally, schedule time for it. A few hours every week that you invest in these areas will make you better.
  • The second point is continuous communication. Open communication means that you feel free to tell people what you think. The second thing is clear goals. You don't want to hide the bad stuff. You want to be as transparent as possible.
  • The final thing in this subject is regular check-ins. Talk with the people in your team and ask them how they are. You'll be surprised by some of the answers you may hear. Make sure you do this on a regular basis and not just once a year.
  • The next point I want to talk about is a supportive environment. You want to encourage collaboration. The next thing is to forgo blame. In order to improve your product, you need to learn from these mistakes and not look for the person who did it.
  • The next point I want to address is experimentation, or space for experimentation. Just because you have something that works does not mean it cannot work better. The best way to find out is to be creative.
  • The next point I want to address is resources and time. Like in the last point, you want to provide your researchers, your team, with the means to do so. Give people the tools they need in order for them to provide a full solution. The last thing I want to talk about is the culture of the team and the continuous improvement of it.
  • The next point I want to talk about is empowering champions. How often do you talk about the good things, the achievements, the progress, the people who made them? First of all, celebrate victories. You want to share success stories. It will inspire the people in your organization.
  • The next point I want to talk about is task pitching. What if your employees could pitch some of the tasks to you? By giving your employees the opportunity to pitch their tasks, you're basically giving them a sense of responsibility. The ownership of this task is much higher.
  • The next part I want to talk about is leading hackathons. You want to give your employees a stronger sense of ownership. It could be any initiative someone in your team is trying to advance. We want to enable space for experimentation. Finally, we want to empower the champions.
  • The first thing you want to do is to identify the incident. After that, activate your response team and have them start working on it. Make sure the incident does not spread to a different part of the system. Throughout this entire process, communicate with the stakeholders.
  • If your postmortem does not have action items, you did not do a postmortem. You simply wasted an hour of your team's time. Instead of a postmortem, let's do a premortem. Let's handle the incidents before they happen and not after.
  • Increase your resiliency. First of all, you can upgrade your system. Do that before something happens, not after. Also invest in your transparency and discoverability. Share your knowledge. Let other people rise from your incidents.
  • So let's recap. You want to increase your resiliency. You don't want any tiny bug to disrupt your system entirely. When the winds of change blow, some build walls while others build windmills. Use it as your tool. Look for it. Evolve from it.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Dol. And today we're going to talk about a team that rises from a crisis. And in order to do that, I want to start with a small story. I'm sure you're familiar with this. The sky is clear, the grass is green, everything is great. You're in your car, traveling to work, everything is just superb. Your new feature went live yesterday, and the customers are using it and are happy. Your product manager is happy, your own manager is happy, everyone is just happy. Until you have a bug. Now, you don't know exactly when it started, you don't know exactly where, but you're seeing a tremendous increase in support calls. And you notice when you arrive at the office that three of your teammates are looking at the same screen. And you know that's bad news. So when did it happen? Okay. I mean, yesterday you were okay, everything was great. But now this is what you're dealing with, right?
And think about it. We all know we're going to have bugs in the future, right? We all know we had bugs in the past. But let me ask you this. Do you know when your next bug will occur? Do you know where? Do you know what kind of effect it will have on your product? Weird, right? Because you know that there's going to be a bug. You're certain about it, but you don't know when. You don't know where. You don't know anything, right? But it doesn't have to be a bug. It can be your best employee saying goodbye and leaving for his startup. It can be a pivot you have to do in the product because the market demands it. It could be a million things. My point is, shit happens, right? And today, this is just what we're going to talk about. We don't know when, we don't know where, but we know it's going to happen. And the essence of this talk is how we're going to deal with it.
So, hi, I'm Dol. And today we're going to talk about your team and how the next crisis will affect it. So, a few words about me. I'm married to Elle and father of Ayala. I'm a senior software engineer.
I've been around for nine years, give or take, in a number of teams. And this is the amount of incidents I've had to deal with, so to say. Okay, so now that we've been introduced, let's talk about the reason why we're here. So basically, we're sailing in rough waters, right? We know that, and we don't know when the storm is going to hit us. We don't know where, but we know it will. It's out there waiting for us. And the question is not when it's going to be. The question is, how will this affect your team? And by asking this question, I actually want to ask another question. The lead question I want to ask is how resilient your team is to this kind of storm. Okay. Because it's out there. We know that. So in order to answer this question, I want to answer three smaller questions, or maybe address three issues. The first is growth culture, which stands for: can your team really absorb the shock? Okay. The second: can your team handle the incident? I.e., can the team respond to the shock? And the third, rise: can your team evolve from the shock? Okay, three questions.
So the first one, growth culture. Growth culture is the foundation of every growing organization. It's a set of skills and principles that every organization depends on. And basically, this is the first thing that determines how your team will react to the incident and how the incident will affect it. There are a few principles I want to discuss, and the first thing is to focus on learning. What does it actually mean? Focus on learning means that you want to make sure every day that tomorrow you'll be a bit better than you were today. It's true for yourself, it's true for your team, it's true for your company. It's true basically everywhere you look. But how do you do that? So the first thing you might think about is online resources, right? You have courses, you have books, you have lectures, you have meetups, you have whatever. Everything is available nowadays for free or for a minor price.
And it's out there just waiting for you to reach out and grab it. Do it. You don't know the tremendous effect it will have both on you and on your team. The second thing I want to talk about is guilds. Now, what is a guild, actually? A guild is a group of people from a certain domain who come together and talk about a certain subject. They ask questions, they discuss things. They try to improve. Now, if your company does not have a guild, start one today. Don't wait. The people around you are waiting for this kind of platform, and the sooner you do that, the better. Now, this doesn't necessarily have to be with people from your own team or your own company. This could be in the form of meetups where you can meet with people from other companies and discuss their issues, which I'm guessing are quite similar to your issues, and think about the way you can resolve things. Learn from the experience of others, share some of your own great work.
Now these are all pretty standard ideas, but I want to talk about a few more. For starters, system design, or let's say system documentation. Okay. Think about it. You're in a company. You have a product, a big one, a massive one, a complex one. It does tremendous work. It does all kinds of things. But have you ever stopped and read its documentation? Have you ever stopped and read about all of its capabilities, its weak spots, its strong spots, what it can do, what it cannot do? Think about it. This could be a great way to look for the next storm.
Okay, the next thing I want to talk about is design reviews. A design review is a way for you to tell people what you plan to do before you do it. You address an issue, you find a solution, and before you start to implement it, you present it to your team, to your manager, to everyone relevant to this issue. You're saying: this is the problem and this is how we're going to face it. These are the drawbacks. These are the strong parts. This is my solution. And then you wait.
You wait for the feedback. You wait for people to tell you that you should look at this part. You wait for people to tell you, I don't think this is possible. You wait for the drawbacks, and they improve your solution. Do that before you start to implement, not after, because then it gets much harder.
The last but not least thing I want to talk about in this subject is tests. Think about it. Tests are your best way to connect with the mind of the person who wrote the code you're dealing with. It's as if he's telling you, this is what I meant for this code to do. This is what I want to check. This is what's considered good behavior. This is what's considered bad behavior, okay? Whether this person is from one of your teams, one of your teammates, or even you from the past, when you don't remember why you did what you did. Read the tests, don't just pass them. Okay? It will give you a clear view of what the writer meant.
And finally, schedule time for it. I cannot stress this enough: schedule time for this. If it's not in your calendar, it does not exist. Okay? Make a slot every week, every day, I don't know. It doesn't matter. A few hours every week that you invest in these areas will make you better. It will help you improve. The person you will be tomorrow is much better than the person you are today. Please schedule time for it.
The second point is continuous communication. Continuous communication starts with, basically, open communication. What does it mean? Open communication means that you feel free to tell people what you think. I know it sounds simple. You'll be surprised in how many teams that's not the case. Tell people what you think. You think this feature is wrong? Tell them. You think this solution is too complex? Tell them. You think we have a problem that we don't see today but we'll see tomorrow? Tell them. Open communication is one of the pillars of a team that evolves and grows and improves. Okay?
You have to feel safe to voice your ideas. The second thing is clear goals. When you are given a goal or when you give a goal, both you and your manager should know what the outcome should look like. Again, simple, I know, but you'll be surprised. Both you and your manager should know what your goal is, how you plan to implement it, and what the final outcome will look like. It's as if he's telling you to draw a horse and you draw a dog instead. You don't want to do that. Okay? You want to set clear goals, and you want everyone to know what it is that you are trying to do and how you will achieve it.
Another thing is transparency. This is super important. You don't want to hide the bad stuff. You don't want to avoid talking about the weak spots. You want to be as transparent as possible. If something is too complex, tell people about it. If you made an error, if your data is corrupted, if your system is not going to withstand the load, talk about it. If something happens, communicate it. Okay? These are the ground rules of communication. You need to talk about the bad things as well as the good things. And it is super important to be as transparent as possible. Not just in your own team, but with other teams as well. If your team did something that affects another team, tell them about it before they find out themselves. Okay?
And the final thing in this subject is regular check-ins. By regular check-ins I mean talk with the people in your team and ask them how they are. Are they okay? Are they feeling well? Is the subject they've been dealing with for the last week or so bothering them? Do they need any help? Okay. You'll be surprised by some of the answers you may hear. Make sure you do this on a regular basis and not just once a year when you have to. Okay? This is super crucial. This is super important. You want the people in your team to be happy. You want to know that they enjoy the work they're doing.
You want to know they understand the work they're doing and why they're doing it. Okay. And why not something else? This is super important, and you want to make sure every once in a while that everything is okay, as you expect it to be.
The next point I want to talk about is a supportive environment. Now, what does that actually mean? A supportive environment, again, is about the culture of your team. It takes a long time to build it, but once you have it, it is a strong tool that you want to hold on to. How can you do that? First, you want to encourage collaboration. You want people to work with each other. You want people to understand each other's ideas. You don't want one person working on something and, the next day, another person trying to work with it who doesn't understand what the hell is going on and why that person did what he did. You want people working together, talking the same language, understanding the same problems, and solving them together. Okay?
The next thing you want to do is provide support and resources. If you have a teammate who struggles with some task and needs help, help them. If someone needs extra time or an extra tool or skill or whatever, help them. Make them better, because if they're better, your team is better. Okay, straight up. If they are better, your team is better, and you want to give them all the tools and resources you can in order to make that happen.
The next thing is to forgo blame. Let's say something did happen. Something broke, something was corrupted, something, I don't know. Things happen all the time. You don't want to look for the person responsible for it. You don't want to look at the past. You want to look at the future. What will we learn from it? How can we use this to improve next time? How can we make sure this will not happen again? Okay. These are the questions that will help you lower the chances that this will happen again. Not "who did this, and should we fire him?"
This is not helping anyone, especially not your product. Okay. In order to improve your product, you need to learn from these mistakes and not look for the person who did it.
The next point I want to address is experimentation, or space for experimentation. What does that mean? So you have your team, and you're working on your solution, and you're advancing with it, and it's going well, but you want to keep making room for this part of the work as well. What does that mean? You want to encourage creativity. Just because you have something that works does not mean it cannot work better, that you cannot improve it. Okay? Now, beware of premature optimization, because that's a lesson we all should learn. But basically, just because something works does not mean it cannot work better or faster or cheaper or, I don't know, maybe there's a way to improve it. And the best way to find out is to be creative. Try to look for new angles, new solutions. Maybe we can do something today that we could not do before. Think about it.
The next point I want to address is resources and time. Again, like I said in the last point, you want to provide your researchers, your team, the people who are in charge of this task, with the means to do so. Okay? If something is a bit complex, if something requires a new approach, give people time to investigate. Give people the tools they need in order for them to provide a full solution and not something they made just to make you happy. Okay? If you give people the confidence and the feeling that they can try whatever they want and explore new boundaries, within limits, of course, but explore new boundaries, you'll be amazed how much progress it will achieve for you.
The last thing I want to talk about is the culture of the team and the continuous improvement of it, okay? Again, just because something is working does not mean it could not work better. Now, you want to do it on your own level, on the team level, on the company level. Okay?
You want to always ask yourself, can we improve? Can we do better? Why are we doing what we're doing today? Are we sure this is the best solution? Or was this the quickest solution we could find, and we quickly moved on? You'll be surprised. Sometimes you'll go and find something in your code or your project or whatever that's been there for years because someone meant to change it one day and that simply didn't happen. Okay? And you can gain from it. You can grow from it. A continuous improvement culture is a major tool in your toolbox that enables you to move on and progress.
The next point I want to talk about is empowering champions. Okay? What does that mean? Traditionally, we talk a lot about the bad things: when things break, when things need to be removed, fixed, changed, improved. How often do you talk about the good things, the achievements, the progress, the people who made them? Okay. Now, empowering champions can be done in a lot of ways. First of all, celebrate victories. Okay? If someone scored a goal, he deserves to dance after it. Okay? Messi can tell you that. You need to recognize the team's achievements and talk about them, and talk with other teams about them, for them to learn and improve. You want to share success stories. You want the people who've been in the trenches to talk about it and how they made it out. Okay? It will inspire the people in your organization. It will give great motivation. It will give people the thing they need. You want to take time to reflect on it. It could be once a week, once a month, once a quarter, once in whatever. Okay? But you want to share and celebrate the success stories. You deserve it. You made it. It's yours.
The next point I want to talk about is task pitching. Now, okay, think about this scenario. You're at the beginning of your sprint, of your quarter, I don't know.
But at some point you think about the new tasks and goals you want to achieve, and you hand them out. Whether your manager is giving you the tasks or you're giving the tasks to your employees, basically orders come from above. But what if it worked the other way around? What if your employees could pitch some of the tasks to you? Okay, what does that mean? Let's say every sprint, you talk with your employees and you decide what to do. You want to do A, B and C. But what if you save a place, D, where your employees can say: we noticed that there is a problem in this area, and we know that people are afraid of touching it, and maybe we can change that. Or: you remember the bug we had earlier that caused some disruptions in something? Let's take a look at that. Or: remember that algorithm we said we wanted to explore? Let's do that. Now, by giving your employees the opportunity to pitch their tasks, you're basically giving them a sense of responsibility, because it's their own task. They pitched it, they made it, they presented it to the team. And if the team or you decide to accept that pitch, great. It goes on the board and it's a proper task. But it's not a task that you gave and your employees now need to do. It's a task that your employees gave themselves. Okay? The feeling of commitment, the feeling of responsibility, the ownership of this task is much higher. Okay, think about it.
The next part I want to talk about is leading hackathons. And I'm pointing this out with a dot here because it does not always have to be hackathons. It can be initiatives in general. Okay. It relates to the point we just talked about. But basically you want to give your employees a stronger sense of ownership, a stronger sense of responsibility. Now, hackathons are a great way because you can take the junior employee and make him the team lead of your manager, for example. But it doesn't have to be a hackathon.
It could be any initiative someone in your team is trying to advance. Help them with that, teach them with that, make them become more responsible for that action. It will make them feel much more empowered in the organization, with a much stronger sense of: this is done because I wanted it done. I wanted it to be done, and this is what we came up with. Great, a great success story to celebrate. Think about it.
So, let's recap. We said we want to focus on learning. We want to keep advancing and enriching our knowledge. We want to have continuous communication. We want the people in the team to be able to talk with each other and to have open communication with other teams as well. We want to have a supportive environment. We want people to not be afraid to make mistakes. We want people to know that it's a safe place they can advance in, that they can grow in. Okay, we want to enable space for experimentation. We need to enable people to make mistakes and to go the wrong way in order to find the right way. Okay. We don't just simply land on the right way. We find it after spending time in the wrong ways. Okay. We want to enable space for experimentation. And we want people to know it's safe to experiment. Super crucial. Finally, we want to empower the champions. We want people to have a strong sense of ownership about the things they're doing. We want people to advance more initiatives. We want people to not just do what they're told, but also to feel part of the company, part of the solution. We want people moving new ideas forward.
Great. So you have a growth culture. But the second thing we want to ask: can you handle the incident? Okay, so let's go back to our story from before. The first thing you want to do is to identify the incident. What does that mean? Remember earlier when you were driving your car? Everything was great. Suddenly, you reach the office, and three people are looking at the same screen with sad faces. The first thing you want to do is to identify what just happened.
Let's say that this is your system, okay? This is your algorithm, this is your whatever. And you see that people are looking at one of the monitors with long faces, and you understand something went wrong. The next thing you want to do is activate the response team. You need to understand who the best people are to address this type of issue. They could be from your team, they could be from other teams. It could be a collaboration, I don't know. But you want to activate the response team, because this is what you're doing right now. We have an issue in production. We need to fix this now. Activate your response team.
Great, so you did that. But let's say that in the time it took you, your bug is no longer only in this component, but you also have it in other components, okay? It's spreading, and you don't know why or where it's going to spread next. The next thing you want to do once you've activated your team is containment. You want to make sure that this error, this mistake, does not affect other teams, other services, other processes, okay? Containment might mean shifting traffic to another service. It might mean using different data than you used to, because this data is corrupted. I don't know, every bug has its own solution. But the first thing you want to do is to contain the bug so you can control it, okay? So it doesn't spread to your entire system.
Great. So now that you've contained the problem and you have a team working on a solution, what you want to do is to investigate the root cause. You want to understand what happened that caused this issue. Why didn't you see it before? Okay? You want to address the issue and figure out what exactly is broken in your system right now in order for you to fix it.
So let's say you did that, and you saw this small part in your data that is causing something that starts a chain reaction, and this is the outcome you're seeing. Now that you have people working on the incident and you have contained the issue, you can try to restore and recover. You can fix the bug, make sure it works, make sure you can reproduce the problem and that your fix actually fixes it. And after all that is done, you can now deploy and make your system smile again. Okay? So this is basically the process.
Now, you have to make sure throughout this entire process that you always communicate with the stakeholders. If there's an SLA issue, if something is broken, if something is not doing what it's meant to do, you want the people who could be affected by it, whether it's other teams or your customers, to know about it, okay? People are relying on the information you're giving. You want to give them the right information. You want them to act accordingly. Okay? Super important. Especially while you're still handling the case, okay? You don't want to wait for the case to be solved, which sometimes can take hours, maybe even more, for people to know that something happened, for people to suddenly see that the thing you're saying is not actually right, and to lose their trust in you. That's the worst possible thing that can happen, and you don't want to be there. So communicate, always.
So let's recap. We said you first want to identify the incident. After that, you want to activate your response team and have them start working on it. The first thing they want to do is to contain the incident. Make sure the incident does not spread to a different part of the system or a different process that could affect other teams and other logic. After you have contained the incident and you know that it's not going anywhere, you can start to find the root cause of the problem and try to solve it.
Once you do and you feel confident with your solution, you can restore the system to its working state and recover from the bug. And finally, always, throughout this entire process, communicate with the stakeholders. Super important.
Okay, so we know now that your team can, A, absorb the incident, right? And B, solve it. But there's something missing, right? I mean, we had a bug, we handled it, right? But what's next? Now, Churchill told us, never let a good crisis go to waste. And this is exactly the part I'm talking about. We already have that incident. We already had that bug. Let's use it. We've learned something today, right? Let's use it. Let's rise.
Okay, so going back to our story, you came in in the morning, you saw the people in your team looking at the monitor. Long faces. Activate the response team. Great. You solved the bug. Everyone is happy. The first thing you do is a postmortem, right? You want to understand what happened. You want to make sure that this will not happen again, right? Great. Now, there are two important things I want to talk about regarding postmortems: action items and follow-ups. Every postmortem should have action items. I cannot stress this enough. If your postmortem does not have action items, you did not do a postmortem. You simply wasted an hour of your team's time. Okay? If you don't come out of that postmortem with action items, you're simply saying: we had an issue, we solved it, let's just wait for the next issue. You need to have action items that will make sure that next time, this will not happen to you. Now, action items are great, but if you don't follow up on them, if you don't make sure that you actually did them, they're useless. Okay, let's say you came out of the postmortem, you wrote your action items. We need to fix this. We need to check that. We need to make sure that this happens before that. Great. But did you do it?
Is it actually done? So what I'm saying is, the minute you go out of the postmortem, you want to schedule a new meeting in a week, in two weeks, in a month, I don't know. But you want to schedule a new meeting where you will follow up on the action items and check what has been done and what hasn't. Okay. If you don't do these two things, there's no point in doing a postmortem. It's simply a waste of an hour.
Okay, great. So you did a postmortem, you wrote action items, and you followed up on them. Great. But I want to suggest a new idea. Think about it. Instead of being reactive to the incident that just happened, like something happened and you react to it, let's become proactive. Let's try to handle the incidents before they happen and not after. What do I mean? I mean, instead of a postmortem, let's do a premortem. Okay, bear with me. So basically, what just happened? You had an incident, okay? You fixed it. Great. And you postmortem it: you understand why it happened and how the fix worked. And now you're just waiting for the next incident, which you will fix, postmortem, and so on. But what if we reversed that cycle? What if, instead of waiting for the incident, we do a premortem? We try to understand what the next incident might be. Because we know this area of our system is a weak area. We know that this area of our system is a relatively new area. I don't know. We know something. Instead of waiting for the next incident to occur, fixing it, and then discussing it, let's discuss what might happen. The stuff that we are afraid of, the stuff that we are not so comfortable with, that we're not so sure about. Let's understand why. Let's make a plan for how we're going to fix that, how we're going to improve that, and let's upgrade our system. Okay. Instead of waiting for the storm to come to us, let us come to the storm, but better prepared.
By doing that, you can control some of the variables of your equation and not just wait for your equation to find you. So, great. We talked about this. Let's talk about another thing: increase your resiliency. Think about it. Something happened. Usually something small happened that caused a chain reaction that made something big break or become corrupted. If your system were more resilient, maybe the small thing would still have happened, but maybe the effect would not have been so big. If your system is resilient, it can absorb some of the shocks itself, like we talked about in the first chapter. Now, what does it mean to make your system more resilient? First of all, you can upgrade it. You can find the areas in your code that need to be upgraded, that need to be shaken up, that need to be improved. And do it. Do that before something happens, not after. Don't wait for the lessons you learn from the incident. Try to explore and find the things in your system that are broken; you just don't know it yet. Upgrade those. The next thing is to invest in your toolbox. What does that mean? Do you have enough documentation? Do you have enough tests? Do you have enough visibility, so that when something breaks, you will be alerted? You don't want the customer calling you to tell you there is a problem in your product, in your system. You want to know about it first, ideally in a different environment, and fix it before it ever gets to the customer. I mean, you do stuff all the time. You break and repair stuff all the time, you improve stuff all the time. Sometimes you don't know its effect until it's too late. You don't want to be there; you don't want to be in that spot. You want to understand the actual effect of the change you're making now, before it reaches the customer, and address it before it's too late. You want to have playbooks.
When something breaks, you don't want people scratching their heads, not knowing what to do. You want to write a fixed protocol of what to do if something happens. Who are the people that need to be alerted? What should be checked? What should be backed up? I don't know your system, but you want your response team to have the right tools and the right playbooks to solve the incident. You want to make sure that during an incident, which is a stressful situation, they thought about everything they needed to think about. They touched all the points; they covered everything. You don't want them to miss something because they were stressed, because they didn't think about it. Otherwise you now have a new bug somewhere else, and it will take you some time to get to it. Write playbooks. Also invest in your transparency and discoverability. When something breaks, when something doesn't act the way it should, you want a notification to pop up, a dashboard to go red, something to happen. You want to be as visible as possible so that people, yourself included, will understand as quickly as possible what happened, what effect it had, and what needs to be done to contain the incident. The better the transparency you have, the better the visibility you have, the sooner you will be able to answer these questions. Finally, you want to share your knowledge. You've been in the trenches. You've suffered an incident. You fixed it. You came up with a new solution for a problem you hadn't thought about before. Tell people about it. People want to know. People want to learn from your mistakes. Go learn from other people's mistakes. You want to share your knowledge about how your people handled these incidents, in order to make everyone better, in order to make everyone smarter. Share your knowledge.
Share the stuff you've learned. Make other people rise from your incidents. So let's recap. We said we want to do a postmortem. We want to investigate what caused the incident and how we can fix it. We want to do a premortem. We want to imagine what could go wrong, what could break. We want to fix it before it breaks. We want to test the new behavior before it reaches something critical. Do a premortem even though nothing is currently broken, as far as you know. You want to increase your resiliency. You want your system to be stronger. You don't want a tiny bug to disrupt your system entirely. You want it to be fully contained. You want your system to be resilient. You want to share your knowledge. You want people to learn from your mistakes. You want to learn from others' mistakes. Again, you don't want everyone to have to fall into the same pit in order to learn how to avoid it. If you step on something, tell people how to avoid it so that they don't step on it as well. So we now have a team that can absorb the incident, solve the incident, and benefit from the incident. If you have a team that's doing all three of these things, your team is no longer afraid of bugs, is no longer afraid of incidents. Your team, a bit, waits for it. I know it's weird, but it recognizes there's a chance for growth in these things that we all know will occur one day or another. Again, think about what we said earlier. We don't know when, we don't know where, we don't know what it will affect, but we know the storm is coming. Use it. Use it as your tool. Don't be afraid of it. Look for it. Evolve from it. And finally, I want to finish with a few words. When the winds of change blow, some build walls while others build windmills. Think about it. Harness the wind as your power. Thank you.

Dor Amram

Data / Server Engineer @ SimilarWeb
