Conf42 Incident Management 2022 - Online

Create a learning culture


Abstract

Building and maintaining a five-9s system isn't just about the tools and technologies. Development culture plays a big part in how you keep a system available while scaling it up and supporting more features, users, and locations. A healthy learning culture, one that supports development rather than just repairing mistakes and that identifies weak points, is another tool in the engineering toolbox. In this talk, we will discuss how to create a learning culture using debriefs, what to avoid, and how to instill change in an engineering organization.

Summary

  • Amir Shaked: Every production environment fails; we too see severe failures all the time. What do you do afterwards? This is how we can use effective debriefing to create both a learning culture and improve the stability of our systems. A good DevOps culture can be a competitive edge for your company.
  • Have we identified the issue in under a certain amount of time, let's say five minutes? Once we answer no to any of these questions, the follow-up should be: what do we need to do and change? The repetition and the specific questions really help drive a process of change and learning.
  • Any culture change takes time. A lack of trust could be in the process. Not following up on action items can also be annoying. Another critical move was implementing proper communication on a wide scale.
  • The key to creating a process and proving you really do mean what you say is consistency in the process, the questions, and everything else. Establishing crisis-mode processes ahead of time really helps everyone understand what they should do in the moment. Treat config as code.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my name is Amir Shaked and I'm excited to talk to you about a very important part of incident management, which is: what do you do afterwards? This is how we can use effective debriefing to create both a learning culture and improve the stability of our systems over time. Every production environment fails; we too see severe failures all the time. The question is, what do you do afterwards? Today I'm going to cover our journey through the process of change towards creating a healthy and supportive learning culture by taking those failures and building on them. Every production system I have worked on experienced issues, sometimes due to code changes, other times due to third-party providers having their entire infrastructure crash, leading us to constantly seek ways to improve how we do things, how we protect the system, and, at the end of it, how we provide a better service to our customers. When I joined the company, we set a destination: we wanted rapid deployment and rapid changes, being able to provide the most adequate and up-to-date solution. In our case, it's a world of moving target defense, where the scope of features changes all the time due to the threat actors. So being able to deploy changes quickly is a major factor in our ability to provide a competitive product. In fact, oftentimes a good DevOps culture can be the differentiator and act as a competitive edge for your company. We wanted to have zero downtime and have errors or mistakes happen only once, so we would have a chance to learn from them, but never twice. However, the starting point we got off to wasn't that bright. We saw a few repeating issues: for example, minor things causing failures in production that shouldn't have been happening, due to code changes or configuration changes, and being too prone to incidents because the underlying cloud environment we were using was affecting our stability. Now, those two factors were very concerning at the time, when we looked at where we were and how fast we expected to grow and scale, since what is a minor issue today can potentially be catastrophic, and that future comes very fast when everything is growing rapidly. But while those two were concerning, the real issue, the last item, was what was preventing us from improving the most, and that was fear of judgment. Every time we dove into trying to understand an issue we had, there was pushback. Why do we ask so many questions? Do we not trust people? Why don't we just let them do their job? They know what they're doing. And that's a problem: if you have team members who are afraid, who feel they're being judged, or who are just generally insecure in their work environment, they are going to underperform, and as a team you will not be able to learn and adapt. With that starting point and the destination we had in mind, we set off to establish a new process within the team for how we analyze every kind of failure: what we do during the analysis, how to conduct the debrief we're talking about, and the follow-up, which is just as important. Now, why am I focusing so much on the process? The reason is that a bad system will beat a good person every time, just like the great quote says. Assuming you have the right foundation of engineers on your team, if you fix the process, good things will follow. So I'll start with an example of an issue that we had. We had a customer complaining that something was misbehaving in their environment, and they thought it might be related to us.
So they called support. Support tried to analyze and understand, and after a while they realized they didn't know what to do with it. So they paged the engineering team, as they should have done. The engineering team woke up in the middle of the night, since they are in a different time zone, and worked to analyze what was going on. They found the problem and fixed it, with a bit of resentment at having to do it in the middle of the night, then went back to sleep and moved on to other tasks the next day. Now, if that were the end of it, we would be certain to experience similar issues again and again from similar root causes in the future. So we did a debrief, and I'll explain in a bit how we did it and what exactly we asked, but apparently code was deployed into production by mistake. How can that be? An engineer merged their code into the main branch. It failed some tests and it was late, so he decided to leave it as is and keep working on it tomorrow, knowing it wouldn't be deployed. What he didn't know was that there was a process added earlier that week by a DevOps engineer that automatically deployed that microservice when there was a need to auto-scale. That night we had a significant increase in usage, spinning up more services automatically with the new auto-scale feature, and they came up with the buggy code. Therefore, we can ask: why is this happening? What can we do better? How can we avoid such an issue, and any similar issues, in the future? The way to do that is to set aside time to analyze after the fact and to look for the actual root cause that initiated this chain of events, so lessons can be taken and learned from it. Back to our example: we could focus a lot on why the buggy code was merged into production, or why the auto-scale was added while different engineers didn't know everything that was changing. But if we focus too much on why somebody did something, we will miss the bigger picture, which is: wait a minute, the entire deployment process was flawed. How can it be that merging code into production doesn't ring a bell that it's going to be deployed? And when I say merging into production, I mean merging into the main branch. Every branch and every repository name should have some kind of meaning, and that was the root cause in this case. So fix the process: going back to our example, align all engineers on the fact that the main branch always equals production, regardless of whether it's automatically deployed or not. That will change the way they approach merging code and how it's tested, and stop buggy code from being left in the main branch. So if we fix the process, and we don't overjudge why a specific employee did or didn't do something, since they were just trying to do their job, we will see improvement later on.
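To make that lesson concrete before moving on, here is a minimal sketch of what a "main always equals production" gate could look like. It is not the exact process from the talk, just an illustration in Python; the build-status endpoint and the deploy step are hypothetical placeholders.

```python
import json
import sys
import urllib.request

# Hypothetical internal endpoint that reports the CI result for a commit on main.
CI_STATUS_URL = "https://ci.example.internal/status/{sha}"


def main_build_passed(sha: str) -> bool:
    """Return True only if the CI run for this commit finished green."""
    with urllib.request.urlopen(CI_STATUS_URL.format(sha=sha)) as resp:
        status = json.load(resp)
    return status.get("state") == "passed"


def deploy_if_green(sha: str) -> None:
    """Gate used by the auto-scale hook: never ship a red commit, even under load."""
    if not main_build_passed(sha):
        # Fail loudly instead of silently scaling out buggy code at 3 a.m.
        sys.exit(f"refusing to deploy {sha}: the build on main is not green")
    print(f"deploying {sha}")  # placeholder for the real rollout step


if __name__ == "__main__":
    deploy_if_green(sys.argv[1])
```

The point is the same as in the example: the auto-scale path asks one question before shipping a commit, so a red build on main can never be scaled out by accident.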
So how do you learn from the incident? There are four steps. It starts with the incident itself, and the more mature you become as an organization, the more the team will open an incident even for supposedly minor things, just to follow up and learn from them. You provide the immediate resolution to the issue, which obviously differs for every issue. Then, 24 to 72 hours afterwards, really depending on how much time people had to sleep or whether there was a weekend in the middle, you do the debrief, and we're going to talk about what happens exactly within that meeting on the next slide. And then, two weeks after the debrief is conducted, we do a checkpoint to review the action items that came from it and make sure things were incorporated. This is especially important for the immediate tasks that came out of it, as a follow-up to see that they were implemented right away. Now, what is so special about conducting a debrief? We need to remember it's not just a standard retrospective, since it usually follows an incident that may have been very severe or had a lot of impact; tensions are high, and people are worried or concerned, maybe even judgy. We need to keep that in mind when going into the meeting. When do we do the debrief? Every time we think the process or the system did not perform well enough, so any incident is a good case to pick up and learn from. We ask a lot of questions. The questions I like to ask are, first of all: what happened? Let's have a detailed timeline of events from the moment it really started, not just the moment somebody complained or the customer raised the alarm, but from the moment the issue started. Going back to our example, it's when somebody merged the code; that's when the story really started to unfold that day. Or maybe the third-party provider failed, or something changed in the system. Try to go back to the most relevant starting point, not the entire history, obviously. What is the impact? Another very important factor in creating a learning environment is trying to understand the full impact. It helps convey a message to the entire engineering team. Understanding the impact in terms of cost, customers affected, or complaints, getting as full a scope as you can, is vital to help everyone understand why we're delving into the problem and why it is so important. It's not just that somebody woke up or we disturbed their regular work schedule; it really brings everyone together on why it is important to find the root cause and fix it. Now, after we have the story and the facts, we want to start to analyze and brainstorm how to handle it better in the future. The first two questions I like to ask here are: have we identified the issue in under a certain amount of time, let's say five minutes? Why five minutes? It's not just an arbitrary number. We want a specific goal for how fast we do things; having that in mind helps us understand whether we achieved it or not. So, did we identify the issue in under five minutes? Sometimes we did, sometimes we didn't. Did we fix the problem in under an hour? Did we completely fix it? Maybe we fixed it in under ten minutes. Did we need to do anything at all? Maybe it resolved itself automatically and there was no point in us analyzing anything. Once we answer no to any of these questions, the follow-up should be: okay, we understand the full picture. What do we need to do? What do we need to change? What do we need to develop so that we will be able to answer yes to those two questions of identifying and fixing? Now, this part, while seemingly simple, led to a drastic culture change over time. It set a framework in the sense that it helps convey to everyone what the most important thing is and what we should focus on: the process and the system. It's not about anyone specific. Whoever caused the incident is irrelevant; tomorrow it can be somebody else. So it's the repetition and the specific questions, the way they are constructed, that really help drive a process of change and learning.
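Those recurring questions fit naturally into a fixed template, so every debrief captures the same fields in the same order. Below is a minimal sketch of such a record in Python; the field names and thresholds are illustrative, not the exact format used by the speaker's team (their actual template is linked later in the talk).

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Debrief:
    """One debrief record: the same questions, asked the same way, every time."""
    incident_title: str
    timeline: list[tuple[datetime, str]]   # events from the moment the issue really started
    impact: str                            # cost, customers affected, complaints
    identified_within_5_min: bool          # did we spot the issue fast enough?
    fixed_within_1_hour: bool              # did we resolve it fast enough?
    action_items: list[str] = field(default_factory=list)

    def needs_follow_up(self) -> bool:
        # Any "no" answer means the process or the tooling has to change.
        return not (self.identified_within_5_min and self.fixed_within_1_hour)
```

The exact shape matters less than the consistency: the same questions, in the same order, for every incident.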
Now, any culture change takes time. It's never simple. We had some things to resolve along the way, and I've already mentioned a few of them throughout this talk, but I want to jump into a few more. First is lack of trust. You have a new manager on board coming in, trying to instill a new process, trying to change the culture. It takes time. A lack of trust could be in the process or in the questions; people could ask themselves, maybe there is a hidden agenda behind this? Maybe it's just another way to play the blame game, but in a different format? This can be completely avoided and resolved if we do it properly and consistently. Another thing that can happen when you're trying to understand a problem is people saying he or she did something and they are at fault, going into the blame game. When you see that happen, you need to stop that direction of discussion and put everybody back on track towards what you are trying to achieve and the specific questions you are asking. It's not why they did something; it is what happened and what we need to change. Not following up on action items is something that can really be annoying. We do the process, we do the review, we set up action items to fix everything and get better, and then we have the same problem all over again a few weeks or months later. How did that happen? Basically, we set the action items, but nobody followed up on them. The resolution there can be very simple: after the debrief, we set a checkpoint every two or three weeks, or whatever time frame is relevant for you, to make sure those items get handled. What we also did specifically was label each Jira ticket with "debrief", so we could do a monthly review of all the debrief items to see what was left open, what is still relevant, and what can be moved or pushed further down the backlog (a rough sketch of such a query appears below). It gave us the ability to track whether things were actually being followed up on or not. Another critical move we made was implementing proper communication on a wide scale. You need to make sure everybody knows that there was a debrief. We published all our debriefs internally, very widely inside the company, exposed to everyone, with the details of what happened and what we are going to do to make it better. This helps bridge the gap of trust and shows that everything is transparent and visible while we try to fix problems. We also saw that if we're not asking the right questions during the process, we might be narrowing in too much on the problem, when we should look more broadly at how things could be connected and relevant in a bigger picture, not just focusing on the smaller things. Again, going back to the example I gave: not focusing only on the repository that was changed, but asking, do we have other repositories? Is this a global issue across every repository that we manage?
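The "debrief" label makes that monthly review easy to script. A rough sketch against Jira's REST search endpoint is shown below; the instance URL, credentials, and JQL are placeholders to adapt to your own workflow, not something taken from the talk.

```python
import requests

JIRA_URL = "https://your-company.atlassian.net"  # placeholder instance
AUTH = ("review-bot@example.com", "api-token")   # placeholder credentials

# All debrief-labeled tickets that are still open, oldest first.
JQL = "labels = debrief AND statusCategory != Done ORDER BY created ASC"


def open_debrief_items() -> list[dict]:
    """Fetch every open action item that came out of a debrief."""
    resp = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={"jql": JQL, "fields": "summary,assignee,created"},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["issues"]


if __name__ == "__main__":
    # Print a short list for the monthly review meeting.
    for issue in open_debrief_items():
        print(issue["key"], "-", issue["fields"]["summary"])
```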
Now, there are four main things I would like you to take from this session about how to conduct a debrief, besides the questions and the repetition. The first is: avoid blame. Avoid it altogether, and if you see blame starting to happen within the meeting, you have to step in and stop it, politely. Always be calm, but you need to stop it so the meeting stays on track and goes the way you want it to, because if there is a clear vision of how it should go, then later on, as the change takes hold, it will happen on its own without you even needing to be involved in those meetings. Second, go easy on the why questions. There is the model of the five whys; it's a risky one, because when you're trying to understand why somebody did something, the deeper you dive and the more times you ask why, the more you risk creating a sense of resentment or self-doubt in the employees. It can sound like you're being critical and judging them on how they behaved and why they did certain things. Third, be consistent; this is crucial. The key to creating a process and proving you really do mean what you say is consistency in the process, the questions, and everything else. And fourth, like I said, keep calm. It's always important, especially when you're looking at things that failed, to stay calm and show there is a path forward; it always helps in creating a better environment for change. Now, these are some of the most notable learnings we had from those processes over the years. I'll touch briefly on a few of them, but they really are the highlights of the action items we learned along the way. You can also scan the template of our debrief format at this URL if you want to see it. I'll go over the examples briefly. First one: obviously, humans make mistakes. You need to fix the process and not try to fix people; that won't work, because they work hard and they're smart, but everybody makes mistakes. So assume mistakes will happen, and fix the process to protect against potential mistakes. Gradual rollout, while it often appears to be some holy grail of microservices and large-scale production systems, is a great tool, but it's not a silver bullet and it will not resolve everything. Even with gradual rollout you will still miss a lot of potential problems, so don't try to fix everything with this hammer. Establish crisis mode processes; this is very important. There are levels of issues, and when the shit hits the fan, you want to know who to call, who to wake up, and how things should operate. Having those crisis mode processes in place ahead of time really helps everyone understand what they should do in the moment. Feature flagging is really helpful for handling errors quickly. One of the things you can do is disable a certain feature instead of rolling out or rolling back an entire system; especially when you have thousands of containers, feature flagging and changing config on all of them can be much faster than redeploying thousands of servers. We also always try to avoid replacing something when we need to change it, since a lot of components are loosely coupled. Every time we want to change something, we look into whether it is really one change or a dual change: if you're adding and subtracting at the same time, you're actually doing two changes at once. So try to limit that. Add something, see that everything works, subtract later, and see that everything still works. Splitting those can really help focus and narrow down potential issues. Looking for the 10x breaking point is also very important: looking ahead and trying to understand when something can fail and what the breaking point of that specific part of the system will be. And last but not least, treat config as code. Configuration, metadata, everything that can affect the system: if you treat it as code, with PRs, reviews, and deployment and rollback processes around it, it really helps stabilize everything around the system.
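To make the feature-flagging and config-as-code points concrete, here is a minimal sketch of the idea in Python: the service periodically re-reads a version-controlled flags file, so turning off a misbehaving feature is a reviewed config change rolled out to the fleet rather than a redeploy of thousands of containers. The file path and flag name are illustrative, not taken from the talk.

```python
import json
import time
from pathlib import Path

# Flags live in a version-controlled file and change only through a reviewed PR,
# exactly like code ("treat config as code").
FLAGS_PATH = Path("/etc/myservice/feature_flags.json")  # illustrative path
_cache = {"flags": {}, "loaded_at": 0.0}


def flag_enabled(name: str, default: bool = False, ttl: float = 30.0) -> bool:
    """Return the current value of a feature flag, re-reading the file every `ttl` seconds."""
    now = time.monotonic()
    if now - _cache["loaded_at"] > ttl:
        try:
            _cache["flags"] = json.loads(FLAGS_PATH.read_text())
        except (OSError, json.JSONDecodeError):
            pass  # keep the last known-good flags if the file is missing or malformed
        _cache["loaded_at"] = now
    return _cache["flags"].get(name, default)


# Usage: wrap the risky code path behind the flag so it can be disabled
# across the whole fleet by changing the file, not by redeploying.
if flag_enabled("new_detection_model"):
    pass  # new behaviour
else:
    pass  # safe fallback
```

Rolling the flag file out is a single, reversible change, which is much faster to apply and to undo than redeploying the services themselves.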
Thank you for listening. It was a pleasure talking about this, and I hope I gave you something to work with and a new process that you can use for your production environment. Feel free to ping me on any of these mediums and ask questions about the subject.

Amir Shaked

Senior VP R&D @ HUMAN

Amir Shaked's LinkedIn account Amir Shaked's twitter account


