Conf42 Chaos Engineering 2021 - Online

Creating a learning culture

Abstract

Building and maintaining a five-9s system isn't just about the tools and technologies. Development culture has a big part in how you keep a system available while scaling it up and supporting more features, users, and locations.

A healthy learning culture that supports development rather than just repairing mistakes, and that identifies weak points, is another tool in the engineering toolbox. In this talk, we will discuss how to create a learning culture using debriefs, what to avoid, and how to instill change in an engineering organization.

Summary

  • Amir Shaked: PerimeterX is a software-as-a-service company providing solutions to protect modern web apps at scale behind the scenes. He talks about creating a healthy and supportive learning culture by taking failures and building on them. A good DevOps culture can be the differentiator and act as a competitive edge for your company.
  • When do you debrief? Every time you think the process and/or the system did not perform well enough. It helps to convey a message to your entire engineering team. Lack of trust could be in the process; this can be completely resolved if you do it properly and consistently.
  • There are four main things I would want you to take from this session on how to conduct a debrief. The first would be to avoid blame altogether. If you see blame starting to happen within a debrief meeting, you need to intervene and stop it politely. Always be calm.
  • We need to fix the process, not try to fix people; that will not work. Establishing a crisis-mode process is also very important, as is feature flagging. Treating config as code was also a crucial element in how we do things. Feel free to ping me on any of these mediums and ask questions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, thank you for joining me today for a session on how we created a learning culture here at PerimeterX. And when I say a learning culture, I mean instilling a healthy culture of debriefs and positive discussions that ultimately minimize issues and backfires internally. I'm excited to share our journey with you here at Conf42. My name is Amir Shaked. I lead the research and engineering team here at PerimeterX, and I'm an experienced breaker of production environments.

We are a software-as-a-service company providing solutions to protect modern web apps at scale behind the scenes. We have a cloud-based microservices environment on a large scale, around 15,000 cores and 300 microservices. And like any other production environment, we too see failures all the time. Today we're going to cover our journey through the process of change towards creating a healthy and supportive learning culture by taking those failures and building on them. This is the essence of chaos engineering, and while a lot can be said on the technical aspects of randomly breaking things to find gaps, reality can always surprise you. Every production environment I worked on experienced issues, sometimes due to code changes, other times due to third-party providers having their entire infra crash, leading us to seek ways to constantly learn and improve how we do things, how we protect the system, and, at the end of it, how we provide the best service to our customers.

When I joined the company, I set a destination of wishing to see rapid deployments, being able to provide the most adequate and up-to-date solution. In our case, in a world of moving-target defense, where the scope of features changes all the time due to threat actors, being able to deploy changes quickly is a major factor in our ability to provide a competitive product. In fact, oftentimes a good DevOps culture can be the differentiator and act as a competitive edge for your company. We wanted to have zero downtime, and to have errors or mistakes happen only once, so we'd have a chance to learn, but not twice, which would mean we didn't learn the first time.

However, the starting point wasn't that bright. We saw a few repeating issues: minor things causing failures in production due to code or configuration changes, and being too prone to incidents in the underlying cloud environments we were using, affecting our stability. Those two factors were very concerning when we looked at how we were going to grow and scale, looking 10x and 100x ahead. What may be a minor risk today will likely be catastrophic in the future, and that future can be next week if you're in a fast-growing company. While those were concerning, the last one really prevented us from improving, and that's the fear of judgment. Whenever we dove into trying to understand issues we had, there was pushback. Why do we ask so many questions? Do we not trust our people? Why don't we just let them do their job? They know what they're doing. And that's a problem. If you have team members afraid or feeling they're being judged, or generally insecure in their work environment, they're going to underperform, and as a team you will not be able to learn and adapt. In essence, this is the whole point of this exercise.

So with that starting point and the destination in mind, we set off to establish a new process within the team for how we analyze every kind of failure: what we do during the analysis, how to conduct a debrief, and the follow-up. Why do we focus on the process?
Because a bad system will beat a good person every time. And assuming you have the right foundation of engineers, if you fix the process, good things will happen.

So let's start with an example, which I'm sure any of us who own production environments have experienced, either the same or similar, or can relate to in some way, and I'll show how it relates to the process. So you have an incident. A customer is complaining about something misbehaving in their environment, they think it might be related to you, and they're calling support. Support is trying to analyze and understand, and after a while, realizing they don't know what to do with it, they page the engineering team, as they should. The engineering team wakes up, because it's the middle of the night and they're in another time zone. They work to analyze what's going on, they find a problem, they fix it. They might resent the fact that they had to fix it in the middle of the night, obviously, and they go back to sleep and move on to other tasks the next day. If that's the end of it, you are certain to experience similar issues again from similar root causes. So you should ask yourselves: why is it happening? What can we do better? What can we do to avoid seeing this issue in any potentially similar case in the future? You have to set time to analyze after the fact. This is the only way to make sure root causes are found, process problems are improved, and lessons are taken and learned from the case at hand.

We actually had a case where code was deployed into production by mistake. How can that happen? Well, we had an engineer merging code into the main branch. The code failed some of the tests, but it was late at night, and they decided to leave it as it was and keep working the next day, knowing that code would not be deployed from main to production. What they didn't know was that a process had been added by a DevOps engineer earlier that week that automatically deployed that specific code to production whenever that specific microservice needed to auto-scale. And that night we had a usage increase, spinning up more services with the buggy code.

Now here lies the issue. We could focus a lot on why the buggy code was merged into main, or why the auto-scaling was added. But if we focus too much on why a certain person did or didn't do something, and whether they understood what they were doing, we're going to miss the entire issue: wait a minute, the process is flawed. How could an engineer merge code toward production without understanding that it was going to be deployed? There is a meaning behind specific repositories, the specific ways you manage the code, branches, naming conventions, and all of that, in our case. So fix the process. In this specific case, once all engineers are aligned on the understanding that the main branch equals deployment to production, no matter which service, the way they approach merging branches into main changes drastically. Fixing this with the process, not over-judging what a specific employee did or didn't do when they were just trying to do their job, will prevent this from happening again.
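The process fix above boils down to a guard: nothing reaches the tag that auto-scaling pulls from unless its tests passed on main. Here is a minimal sketch of such a gate, not the actual pipeline described in the talk; the `ci-passed/<sha>` tag convention, the image names, and the helper functions are illustrative assumptions.

```python
import subprocess
import sys

def tests_passed(commit_sha: str) -> bool:
    """Placeholder CI check: assumes the CI system pushes a 'ci-passed/<sha>'
    git tag for every commit whose test suite succeeded."""
    result = subprocess.run(
        ["git", "tag", "--list", f"ci-passed/{commit_sha}"],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())

def release_image(commit_sha: str) -> None:
    """Only promote an image to the tag the autoscaler pulls from
    ('prod-latest' here, a made-up name) if the commit is green."""
    if not tests_passed(commit_sha):
        sys.exit(f"refusing to promote {commit_sha}: tests did not pass on main")
    subprocess.run(["docker", "tag", f"service:{commit_sha}", "service:prod-latest"], check=True)
    subprocess.run(["docker", "push", "service:prod-latest"], check=True)

if __name__ == "__main__":
    release_image(sys.argv[1])
```

With a gate like this in the promotion step, an auto-scaling event can only ever pull an image built from a green commit, so a failing test left on main overnight stays out of production.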
So how do we learn from such an incident? Well, there are four steps to the process. We start with the incident, obviously, and the more mature you become as an organization and a learning culture, the more the team will create incidents from supposedly minor things just for the follow-up and the learning from them, which is a really healthy stage to be at. You provide an immediate resolution to the issue, and then 24 to 72 hours afterwards, depending really on how much time people had to sleep and on working hours, you do a debrief; we'll talk a bit more about that meeting and how to conduct it in the next slide. Two weeks after the meeting, we do a checkpoint to review the action items that came out of the debrief and make sure things are incorporated, especially the immediate tasks.

So let's talk about conducting a debrief now. This isn't the standard retrospective, as it usually follows an incident that may have been very severe in impact. When do you debrief? Every time you think the process and/or the system did not perform well enough. I ask a lot of questions. The first is: what happened? Let's have a detailed timeline of events from the moment it really started. Not the moment somebody complained, not when the customer or somebody else raised the alarm, but from the moment the issue really started to roll into production: when the code was merged, when we changed the query, when the third-party provider we were using started to crash and updated their own status page.

What's the impact? That is also a very important factor in creating a learning environment. It helps to convey a message to your entire engineering team by understanding what the actual impact is, be it cost, customers affected, or complaints; get as full a scope as you can. That is vital to help everyone understand why we're delving into the problem and why it is so important. It's not just that somebody woke up in the middle of the night being paged and it's bothering them; you have to understand the full picture of how everything is related and connected.

Now, after you have the story and the facts, you start to analyze and brainstorm how to handle this better in the future. There are two questions I use to lead into the discussion. Did we identify the issue in under a certain amount of time, let's say 5 minutes? Why 5 minutes? Well, it's not arbitrary; we want a specific goal for how fast we do things. So did we identify the issue in under 5 minutes? Sometimes we did, sometimes we didn't. Did we fix the problem in under an hour, completely fixed? Did we do it in under 10 minutes? Did we need to do anything at all, or was it completely resolved automatically and there was no point in us analyzing anything? Once you answer no to any of these, the follow-up should be: okay, we understand the full picture. What do we need to do? What do we need to change? What do we need to develop so that we will be able to answer yes to those two questions next time?

This part, as seemingly simple as it is, led to a drastic culture change over time. Creating the framework in that way helps convey to everyone that the focus is on the process and the system. It's not about anyone specific. Whoever caused the incident today is irrelevant; tomorrow it can be a different employee entirely.
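To make those two leading questions concrete, here is a minimal sketch that derives time-to-detect and time-to-resolve from an incident timeline; the field names are made up, while the 5-minute and 1-hour targets are the ones from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Targets mentioned in the talk: detect within 5 minutes, resolve within 1 hour.
DETECT_TARGET = timedelta(minutes=5)
RESOLVE_TARGET = timedelta(hours=1)

@dataclass
class Incident:
    started_at: datetime    # when the issue really began (merge, config change, ...)
    detected_at: datetime   # when we noticed, not when the customer complained
    resolved_at: datetime   # when service was fully restored

def debrief_questions(incident: Incident) -> dict:
    """Answer the two leading debrief questions for one incident.
    Resolution time is counted from detection here, an assumption."""
    time_to_detect = incident.detected_at - incident.started_at
    time_to_resolve = incident.resolved_at - incident.detected_at
    return {
        "time_to_detect": time_to_detect,
        "time_to_resolve": time_to_resolve,
        "identified_in_under_5_minutes": time_to_detect <= DETECT_TARGET,
        "fixed_in_under_an_hour": time_to_resolve <= RESOLVE_TARGET,
    }

# Hypothetical incident that was detected too slowly.
incident = Incident(
    started_at=datetime(2021, 1, 10, 2, 0),
    detected_at=datetime(2021, 1, 10, 2, 25),
    resolved_at=datetime(2021, 1, 10, 2, 50),
)
print(debrief_questions(incident))
```

Every "no" in that output is the seed of an action item: what do we need to change or build so the answer is yes next time.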
Now, any culture change takes time. We had some things we had to resolve along the way, and I already mentioned a few of the solutions and how we got there. First of all was lack of trust, especially if you have a new manager coming in. Trying to instill a new process, trying to change the culture, takes time. Lack of trust could be in the process, or it could be in the questions people ask themselves: is there perhaps an agenda behind this? How will it not become the blame game we had before? This can be completely resolved if you do it properly and consistently. What often happens when you're trying to understand why a problem occurred is that people say he or she did something at fault, when the real issue is something else.

Also important to notice: not following up on action items, something that's really annoying. You do the process, you review everything, you set action items, and then you have the same problem all over again a few weeks or a few months later. How did that happen? You see that the action items that were set weren't being followed up on. The resolution we had was very simple: we established the checkpoints. You have the debrief, and you set a checkpoint every two or three weeks, whatever time frame is relevant for you, to make sure that the immediate action items are handled. Personally, what I also do is label each Jira ticket from a debrief and do a monthly review of all the debrief items to see what is left open, be it no longer relevant or something that has to be escalated (a query for this is sketched below).

And another critical move we've made to resolve future issues is implementing proper communication on a wide scale. Make sure everybody knows there was a debrief. We publish our debriefs very widely within the company, exposed to all the employees, with the details of what happened and what we're going to do to make it better. This helps bridge the gap of trust, if you have one, and shows that everything is transparent and visible. We also saw that if you're not asking the right questions, the focus might be the problem, and a wide audience can help give another view, with more eyes to identify gaps that might have been missed in the bigger picture.
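For the monthly review of debrief-labeled tickets, a query against the Jira REST API is enough. This is a minimal sketch, assuming Jira Cloud with an API token and a label literally called "debrief"; the domain and field choices are placeholders.

```python
import os
import requests  # third-party HTTP client

JIRA_BASE = "https://your-company.atlassian.net"   # placeholder domain
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])

# Every debrief action item that is still not done, oldest first.
JQL = 'labels = "debrief" AND statusCategory != Done ORDER BY created ASC'

def open_debrief_items():
    """Page through Jira search results for open debrief-labeled tickets."""
    start_at, issues = 0, []
    while True:
        resp = requests.get(
            f"{JIRA_BASE}/rest/api/2/search",
            params={"jql": JQL, "startAt": start_at, "maxResults": 50,
                    "fields": "summary,assignee,created"},
            auth=AUTH,
        )
        resp.raise_for_status()
        page = resp.json()
        issues.extend(page["issues"])
        start_at += len(page["issues"])
        if not page["issues"] or start_at >= page["total"]:
            return issues

for issue in open_debrief_items():
    print(issue["key"], issue["fields"]["summary"])
```

Anything that shows up in this list each month either gets re-prioritized or is explicitly closed as no longer relevant, so debrief action items cannot quietly go stale.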
Now, there are four main things I would want you to take from this session on how to conduct a debrief, listed here. The first would be to avoid blame. Avoid it altogether, and if you see blame starting to happen within a debrief meeting, you need to intervene and stop it, politely. Always be calm, but you need to stop it, to make sure the meeting stays on track and goes the way you want it to go. Because if there is a vision of how it should happen, later on, once the change is instilled, it will happen on its own without you needing to be involved in the process. Go easy on the why questions. It's important to understand why somebody did something, but the more you dive into it, asking somebody why they did something can create resentment or self-doubt for employees; it can sound to them like you're being critical and judging how they behaved and why they do certain things. Be consistent, like I said before. And keep calm. It's always important, especially when you're looking into things that failed, to stay calm and show there is a path forward; it especially helps create a better environment for change.

Now, some of our most notable learnings from all of these processes are listed here, and I'll touch briefly on a few of them. You can also see the format of the debrief meeting in the QR code here, which you can follow up on. So first of all, humans make mistakes. We need to fix the process, not try to fix people; that will not work, because they work hard and they're smart, but everybody makes mistakes. Another thing I've often heard about is gradual rollout. It often appears to be some holy grail, perhaps, of microservices and large-scale production systems. It's a great tool, but it's not a silver bullet, and it will not resolve everything. Assuming it will means missing a lot of problems that you need to address with different processes or tools.

Establishing a crisis-mode process is also a very important one. Feature flagging, especially if I connect it to the second point, in terms of handling errors quickly, was really important for us. It was one of the things that let us disable certain features instead of rolling back a lot of services, sometimes thousands of Docker containers, and that helps revert the change much faster and understand whether it is the cause of the problem or not; maybe something else in the infrastructure is the issue, and not even the code. When we need to change something in the system, we always try to avoid in-place changes. When you have a lot of loosely coupled microservices, there is a lot of communication going on under the hood, and changing something in place can cause a lot of harm. So we try to split it: adding the new behavior, verifying it behaves as we expected, and then removing the old behavior, essentially splitting the process into two. We are also always trying to look 10x ahead at the breaking points of the system, wherever things can break. And treating config as code was a crucial element in how we do things.
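The feature-flag approach can be as simple as a value in a version-controlled config file (config as code) that services re-read at runtime; flipping it off disables the feature without rolling back thousands of containers. Here is a minimal sketch with a made-up flag name and file path, not the actual mechanism used in the talk.

```python
import json
import time
from pathlib import Path

# Flags live in a version-controlled file deployed with the service
# ("config as code"); reloading it lets us disable a feature at runtime
# instead of rolling back the code.
FLAGS_PATH = Path("/etc/service/feature_flags.json")  # placeholder path
_RELOAD_SECONDS = 30
_cache = {"flags": {}, "loaded_at": None}

def flag_enabled(name: str, default: bool = False) -> bool:
    """Return the current value of a feature flag, re-reading the file
    at most every _RELOAD_SECONDS."""
    now = time.monotonic()
    if _cache["loaded_at"] is None or now - _cache["loaded_at"] > _RELOAD_SECONDS:
        try:
            _cache["flags"] = json.loads(FLAGS_PATH.read_text())
        except (OSError, json.JSONDecodeError):
            pass  # keep the last known-good flags on read errors
        _cache["loaded_at"] = now
    return bool(_cache["flags"].get(name, default))

def handle_request(payload: dict) -> str:
    # Hypothetical call site: the new behavior is added behind a flag first,
    # and the old behavior is removed only after the new one is verified.
    if flag_enabled("new_scoring_model"):
        return f"scored {payload['id']} with the new model"
    return f"scored {payload['id']} with the old model"

if __name__ == "__main__":
    print(handle_request({"id": 42}))
```

Flipping "new_scoring_model" to false in the flags file turns the feature off across the fleet without redeploying or rolling back containers, and the flag check also matches the add-new-behavior-then-remove-old split described above.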
So that's it. Thank you for listening. I hope I gave you something new to use. Feel free to ping me on any of these mediums and ask questions, and I'd love to discuss with you more.

Amir Shaked

Senior VP R&D @ PerimeterX

Amir Shaked's LinkedIn account Amir Shaked's twitter account


