Conf42 Chaos Engineering 2020 - Online

- premiere 5PM GMT

Chaos Engineering: Getting out of the Starting Blocks


Abstract

Architectures are growing increasingly distributed and hard to understand. As a result, software systems have become extremely difficult to debug and test, which increases the risk of failure. With these new challenges, chaos engineering has become attractive to many organizations as a mechanism for understanding the behavior of systems under unexpected circumstances.

Whilst interest is growing, few have managed to build sustainable chaos engineering practices. In this talk, I will review the state of chaos engineering and the issues customers are facing, based on my learnings as an AWS Solutions Architect and Technologist focusing on chaos engineering, and explain why I started to build tools to help with failure injection.

Summary

  • There are patterns in what's blocking people from actually adopting some of these cool practices, and one of them is chaos engineering. This talk is really a collection of tips and tricks to help you get started in your journey.
  • The first problem with chaos engineering is in the title itself. Don't call it chaos engineering; as has been said today, just call it engineering. Good intentions don't work. We need to care more about our customers.
  • The andon cord was invented by Toyota in the early 1900s. It lets anyone on the factory floor who sees a defect pull the cord. We took that principle into customer service. If good intentions don't work, what does?
  • A lot of it has nothing to do with humans. It's about processes and detecting things, and the same goes for third-party providers. Technology is a lot more complicated than just tools and processes. Have a way to really measure and to look at the past.
  • When you adopt chaos engineering practices, you're never going to go from zero to 100 in your company. Never choose the best team. You have to find the right Trojan horse, which is going to spread the good virus inside the company. And the essence of chaos engineering is confidence.
  • The fifth tip is to introduce chaos engineering early. It changes the way developers think, because it triggers something in the mind: there's more than just working. When you hire people, teach them how to break stuff. This is probably going to be the wow moment in your company.
  • People assume managed services just work. On AWS you have a region with three AZs so your application can be fault tolerant. People assume an AZ never has problems, but it does. Even though they're isolated, sometimes they do.
  • Chaos engineering is an intersection between culture, tools, and processes. The only thing I can tell you is to try to make these tools uniform for the entire company. Focus on a few, and these are the ones that are going to give you the most gains.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good afternoon, everyone. It's nice to be here. I know it's been a long day. There's been a lot of good information, a lot of talking, a lot of testing, a lot of things. Why I'm here is a good question that you haven't asked yet, but I'm going to answer it, because it explains why this talk is here. I've been working with AWS technology for about twelve years. I joined AWS four years ago, and I worked a lot as a solutions architect, actually the first year with some of the biggest gaming companies in the Nordics, and we were working a lot on scalability and some of their resiliency issues, trying to find solutions and also trying to get some of those chaos engineering teams started. And I realized there are patterns you can see in what's blocking people from actually adopting some of those cool practices, and one of them is chaos engineering. So this talk is really a collection of tips and tricks to help you get started in your journey. If you've already started, it doesn't mean this talk is going to be totally useless. Hopefully it might actually enable you even further, so you can accelerate the adoption in your journey. So how many of you have active chaos engineering practices currently? Right, and we're talking chaos engineering, right, not just chaos, because that we're all pretty good at. And how many of you have a hard time actually accelerating the adoption within the company? All right, good. So there are three people here. This talk is for you. All right. So in November, actually, before re:Invent, I asked a question on Twitter, because I was having a lot of conversations around chaos engineering, and I was really trying to confirm some of those private conversations with the rest of the world and see what's happening. The question was: what's preventing you from having chaos engineering widely adopted in your organization? And quite interestingly, the first answer was "enough chaos in prod already", which I think everyone can agree with. The reason the joke of "is it chaos engineering or just chaos?" is so funny is because we all really have chaos in production, or we tend to call production chaos, because it's never really a stable system. A follow-up question was: why do you have chaos in production? So what do you think is the biggest reason for having chaos? I guess that might be a factor, but quite interestingly, it's mostly firefighting. People are pretty much reacting to outages and trying to fix things rather than improving things. And chaos engineering is often seen as an extra burden to implement on top of what we have. I think this is probably one of the first reasons why chaos engineering is very hard to implement: people see it as something extra to do on top of what you already do, which is not necessarily bad, but it's actually slowing down your adoption. So this is basically what people think about chaos engineering. Actually, the first problem with chaos engineering is in the title itself. Don't call it chaos engineering. As has been said today, just call it engineering. And I'll give you a simple reason for that. Every single customer that has trouble with chaos engineering, with having it widely adopted inside the company, it's because the leadership team or the C-level team see chaos engineering as chaos. If you just say the word chaos to your leadership team, you'll never get their trust.
So it's a bit similar to one of the talks we had today about fixing your company in the shadows. Chaos engineering kind of works in the shadows. You're doing engineering. It's actually engineering. So don't call it chaos engineering to start with. And this is not really a tip, it's really a requirement. That's why I call it tip number zero. The second really important thing, in my opinion, is to change the focus of chaos engineering and not focus on the discipline itself. Actually, you have to have a wider view to start with on what you're trying to do and how chaos engineering works. I'll give you an example. This is one of my favorite stories, an internal story of Amazon, which really, really reflects that statement to have a wider view. The thing is, good intentions don't work. An intention is people wanting to do good. How many of you go to work every morning saying, today I'm going to do shitty work? How many? Right, so you all want to do good. As people, and as engineers, we often want to do good. So good intentions don't work. But let me show you why. It's because people already want to do good. There's a story behind all that. At Amazon, there's something we call the customer service training. Every Amazonian that joins the company has the opportunity, after maybe three or four months, to go and shadow customer service reps and see what customers are really dealing with, what problems they have. It's a way not to lose touch with our customers and get immediate feedback. So Jeff Bezos actually did that, and does that regularly. In one of those sessions, he was sitting next to a rep, a very experienced rep, and she pulled one of the tickets. She looked at it very quickly, turned it to Jeff Bezos and said, I'm sure she's going to return that table. Jeff was like, what? You're like a magician. And she said, I just know. Okay, so they go through the call and the order, and I kid you not, the person actually wanted to return the table. It happened that the table had scratches and she wanted to return it. So Jeff Bezos asked the rep after the call, how did you know that she wanted to return the table? And she said, well, we've just had thousands of those in the last two weeks. And he looks at her like, what? How come we can have thousands of similar cases and not act on it, right? It doesn't make sense. So he did what every manager does. You go to the team, to the leadership, and you say, come on guys, we need to do better. We need to care more about our customers. We need to try harder. Well, guess what happened? Nothing happened, actually. It just didn't work, because good intentions don't work, because she, and every other rep, was already trying to do their best, because they go to work trying to do their best. So if good intentions don't work, what does? Before answering that question, I want to tell you a little story, and that's the story of Toyota. Do you know what an andon cord is? The andon cord was invented by Toyota in the early 1900s, on their silk weaving looms. Basically, Toyota, before making cars, was making cloth for geishas. And geishas have high quality standards. If there's any defect in the silk on the production line, it has to be stopped immediately so that geishas get the top quality clothing. So what they did is they invented a little button on the side.
You see the little cord on the left that anybody in production could pull if they saw a problem on the silk, and everyone could pull it. And they took this practice a little further on manufacturing lines. This is actually a manufacturing line for Toyota, and you see those cords up there, cords that run all along the manufacturing line. Anyone on the factory floor, if they see a defect, can pull the cord. Now, if you do this in Europe, you get your leadership coming to yell at you: you are stopping the production line, it had better be good, because we are losing millions and millions of dollars. Well, Toyota has a different way of doing it. When anyone pulls the cord, the leadership comes and says, thank you very much for pulling the cord. That means you care about our customers. So they empower anyone on the production line to care even more. It's a cultural thing. So that's called the andon cord. And we took that principle and tried to apply it to customer service, because of course our rep had seen that issue a thousand times. She could have pulled the cord and said, oh, that's the tenth time I'm seeing this item, let's take this item out of the catalog. And so we did. For one item that had been put on the catalog by mistake, contacts fell from 33% to 3% within a couple of days. Right. And these are kind of interesting ideas. So if good intentions don't work, what does? There's another andon cord mechanism that worked on Prime Video, which is quite funny. You can sometimes receive emails telling you, we are refunding you because we noticed that the quality of the movie was not the best you could get. And these are all automatic emails, right? There's no one looking at the logs saying, oh, the quality of the delivery for that person was bad. Those are systems that analyze logs in real time and say, hey, if you are experiencing issues, we're refunding the movie. All those are mechanisms to actually improve customer service. And when you think about chaos engineering, it's actually very similar. All your company, all your developers, they want to do good, right? So if you have an engineering practice that is stalling, the first thing you need to think of is not necessarily the people, and not asking them to try harder at testing or to do better. It's because you're missing a mechanism. And this is very, very important, because it changes the focal point of where the problem is, from people who already want to do good to a lack of mechanism. And if you think about a mechanism, there are three things in a mechanism. The tool, obviously: you need the tool to implement a mechanism. But then you need an adoption framework: how are people going to use your tool, and how do you enforce that people use your tool? Or better, how do you make it not enforced but subconscious, so they can't do without it, or it's automatic? And then you need to audit, of course, because you create a tool, you have adoption, and then you need to audit it. So when you think about chaos engineering, don't only think about the tools, because very often when you do, it means you're looking at the wrong problem. We'll talk about mechanisms a little bit later. Another tip: change begins with understanding. And that's both personal and technical.
If you want to change as a human being, you need to understand what you're doing, and that's the same for your system. We talked about monitoring a lot. But I could ask you this: what are the top five most painful outages that you've had in the last two years? Are you actually able to give me data that backs up your claim, or do you have a gut feeling? Well, you'd be surprised. Pretty much nine companies out of ten, it's going to be gut feeling. And very often it's time biased: you remember the last two or three outages that were really painful. But if you look over a long time, you can see a very, very different set of things. So very often people will tell you what they think it is, but it's not really what it is. So have a way to really measure and to look at past outages, a tool that analyzes your COEs, for example, our postmortems, which we call COEs, corrections of error. And I'll talk about that a little bit. Your postmortem, it's good to write it, but do some analysis on it. Try to find patterns. And a pattern is not "Bob deleted a database in production" or "Adrian deleted a database in production." I did it twice. For real. Never got fired. You know why I never got fired? Because I was not blamed. Because I was set free to be able to do it: there were no guardrails. I could run any command without a confirmation on the command line asking, do you really want to delete the database in production? It was not like that. And my terminal was exactly the same for test and production environments. There were no different colors, there was no such thing. So it's not really my fault, because even though I deleted the database, I was being pulled every three seconds to answer questions. I was not left alone to concentrate. There were fires everywhere. So yeah, the consequence is that my stress level and my attention made me delete that database, but I should not have been able to do it. So, some of those painful things. How many of you have ever had an outage because of SSL certificates? I love that one. This has nothing to do with coding skills, right? It's about process. It's a mechanism, actually, to enforce that your SSL certificates are either rotated or have alarms on them. So it's again a lack of process. It's always DNS. Well, definitely. Configuration drift, right? We all have infrastructure as code, right? But how many of you are allowed to log into a machine over SSH and make configuration changes? Then why do you use infrastructure as code? Why? Because you think immutable infrastructure is good, but then you let people mutate it. Right? So all this is about good intentions, and you see where I'm going here, right? A lot of it has nothing to do with humans. It's about processes and detecting things, and the same goes for third-party providers. Very often you have a dependency that's going to fail. Well, again, it's about having a process that monitors it. Maybe it's a circuit breaker, maybe it's a higher level process, whatever it is. Or it's having multiple providers; don't rely on one. Imagine if Tesco relied on only one provider of Twix bars or Snickers, and that provider got bought by its competitor. Would it mean that for the next six months they can't stock that particular product? No, of course not; they have a diversity of providers. So this is the same for you. You have to have a mechanism to identify those problems.
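As a rough illustration of the kind of mechanism described above, here is a minimal sketch of a scheduled check that alarms on expiring TLS certificates instead of relying on good intentions. The endpoint list and the 30-day threshold are made-up placeholders; only the Python standard library is used.

```python
# Hypothetical sketch: a scheduled job that checks TLS certificate expiry
# and raises an alarm well before the certificate lapses.
import socket
import ssl
from datetime import datetime, timezone

ENDPOINTS = ["example.com", "api.example.com"]  # assumed endpoints
WARN_DAYS = 30                                  # assumed alert threshold

def days_until_expiry(hostname: str, port: int = 443) -> int:
    """Return the number of days before the certificate of hostname expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like 'Jun  1 12:00:00 2025 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ENDPOINTS:
        remaining = days_until_expiry(host)
        if remaining < WARN_DAYS:
            # In a real mechanism this would page someone or open a ticket.
            print(f"ALERT: {host} certificate expires in {remaining} days")
        else:
            print(f"OK: {host} certificate expires in {remaining} days")
```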
Every time we have an outage at Amazon or AWS or in the business, we go through what is called a postmortem. We call this internally a correction of errors, a COE. Basically, and I'm going to piss people off, it's five whys, but not five whys, because within our culture we don't stop at five whys. It's just the name we gave to the process, because initially it was called five whys. But basically it's an analysis of what happened, what was the impact on the customer, how many of your customers were impacted. Basically it's the blast radius, really understanding the blast radius, and what were the contributing factors. And this is very, very important, and this is one of the problems with the five whys, and why we have to stop calling it five whys, and I'll explain this at the end. Technology is a lot more complicated than just tools. It's actually culture, processes, and tools. When you do a root cause analysis, very often people stop at the tools or the processes, but actually you have to look at the entire picture. You have to ask a lot of whys in many different directions, in many different universes: the culture universe, the tool universe, the process universe. Right. If you find one contributing factor, it's not enough. You have to dig deeper, you have to ask more questions, more complicated questions. Also, we have this discussion about being blameless, right? It's actually not blameless, because you want to know who made the mistake; you want to understand. It means it's consequence-less. You're not going to fire someone because of a mistake, but you want to understand why that person did it and how they were even allowed to cause that problem, if it's an operator problem. So never, ever stop at an operator problem. It's a false sense of responsibility; the responsibility is not there. The responsibility is often in the mechanism, in the culture, or in the wrong set of tools, never really the people. Then you have to have data to support that, right? If you don't have data, you're basically navigating in the assumed world, and assuming is death. What were the lessons learned? Usually there are a lot of them. Again, it's the realm of culture, tools, and processes; always think about that, because this is where technology lives. And what are the corrective actions? For corrective actions, we always assign a date. By default, it's two weeks, but you can have faster dates or a bit longer, depending on the task at hand. And this is very important because it defines the auditing. So then we can have weekly reviews, and those weekly reviews, interestingly enough, are with the upper leadership team. Actually, Andy Jassy regularly goes to those technology reviews. Andy Jassy is the CEO of AWS. And now there are a lot of service teams, right? So we use a wheel of fortune, where every team, if they get selected, has to go on stage and present their metrics, their operations, what they have done to fix the particular problems that were outlined in a COE, or something like that. And it's really meant to share things. So all the service teams are around the table like this, the management is here, and everyone has to present. It needs to be random. The wheel of fortune is a weighted random wheel of fortune, so if you get called two or three times, eventually it's going to go to someone else, because your weight is going to be higher.
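For illustration, here is a minimal sketch of that weighted "wheel of fortune" idea. The team names and the halving rule are invented; the point is simply that every team can be picked for the weekly operational review, while teams picked recently become less likely next time.

```python
# Minimal, hypothetical sketch of a weighted wheel of fortune for review duty.
import random

weights = {"team-a": 1.0, "team-b": 1.0, "team-c": 1.0}

def spin(weights: dict) -> str:
    """Pick a team proportionally to its current weight, then reduce that weight."""
    teams = list(weights)
    picked = random.choices(teams, weights=[weights[t] for t in teams], k=1)[0]
    # Halve the weight of the picked team so the wheel drifts to other teams.
    weights[picked] *= 0.5
    return picked

if __name__ == "__main__":
    for week in range(1, 6):
        print(f"week {week}: {spin(weights)} presents their ops metrics")
```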
It really is a process also to spread knowledge across the different teams and identify maybe what could be the next experiment to run, or continuous experiments, continuous verification, to avoid having that problem again. And on the adoption part, so we have the tools, we have the auditing, and for adoption we use something called the policy engine. Basically, the policy engine is a tool to amplify social pressure. It's quite an interesting tool. It collects all the data of an environment against the best practices that were understood. We codify those best practices into scripts that scan environments within AWS for every team and then return a score for a particular service team, architecture, or software practice. If a team doesn't implement those best practices, which are continuously monitored, its score goes down, and those dashboards appear in the leadership team's weekly meeting. Right. So then you have to justify why you are not following your own best practices. And this is not because people don't want to do it, but because we tend, as people, to push things off, and if we have pressure from our peers, we always want to come across as a good kid, right? Whereas if you don't have those reviews, and if you don't have social pressure, we tend to take a shortcut and just say, maybe later, maybe later. I'm sure you've all been in that situation: when someone is going to verify what you do, and that someone is actually a peer at your level, it pushes you to do a little bit better. Not that you don't want to do good, but you're still human, right? So these are some of the mechanisms that we use. Another tip that is super important: when you adopt chaos engineering practices, you're never, ever going to go from zero to 100 in your company. You have to find the right Trojan horse, and that Trojan horse is going to be the team that spreads the good virus inside the company. Never choose the best team. You know why? Because they're already doing great, and when they do great, it's hard to justify the work that you're going to do with chaos engineering. So very often I see companies say, oh, this is the best team, they have the best practices, let's start chaos engineering with them, and then there's very little noticeable difference because they're already doing great. And then the other teams are not really inspired, because nothing magical happened. That's one thing: people expect change to be magical, or wow, or something that is going to transcend their developer experience. If you take the worst team, they have other problems to deal with before chaos engineering, like getting infrastructure as code in place, or shutting down port 22 so you can't log into your instances, things like that. So choose a team in the middle, whichever team is going to serve your interest best. And when you have that team, you have to find the right metric. I'm often asked, okay, which metric is best? If I start with a team and I have to choose one metric, it's going to be MTTR, mean time to recovery. That means how fast the team is able to deal with an outage and recover from it. And that's all about confidence. This is, for me, the essence of chaos engineering: training the team to be confident. That's confidence in the application, but also in the tools and in their processes to actually deal with outages.
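To make the suggested metric concrete, here is a small sketch that computes MTTR from a list of past incidents. The incident timestamps are made up; the idea is just the average time from detection to recovery.

```python
# Minimal sketch (with made-up incident data) of computing MTTR,
# the mean time to recovery across past incidents.
from datetime import datetime

# Hypothetical incident log: (detected_at, recovered_at)
incidents = [
    ("2020-01-04 10:02", "2020-01-04 10:47"),
    ("2020-02-11 23:15", "2020-02-12 01:05"),
    ("2020-03-02 08:30", "2020-03-02 08:52"),
]

def mttr_minutes(incidents) -> float:
    """Average recovery time, in minutes, over all recorded incidents."""
    fmt = "%Y-%m-%d %H:%M"
    durations = [
        (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60
        for start, end in incidents
    ]
    return sum(durations) / len(durations)

if __name__ == "__main__":
    print(f"MTTR: {mttr_minutes(incidents):.1f} minutes")
```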
How many of you have never had an outage in production? You've never had an outage in production? You're lying, right? Okay. Now, how many of you have had outages in production? Right. How many of you, during those outages, felt like you'd lost all your capacity to think? You started sweating, you started swearing, you basically became really stupid. That's what outages in production do to you. Chaos engineering kind of limits that, right? You're still going to be scared, because it's still an outage in production. There are still real customers, and you care. Actually, you're scared because you care for the customer and you care for your work, which is great. So MTTR is a great metric, because the less scared you are, the faster you get into action, the better you're thinking, and the faster you're able to recover. And chaos engineering definitely helps with that. So if there's one metric to follow at the beginning, before you understand or add others, like availability or things like that, MTTR is always a good default. And you don't have to have 20 metrics. Have one which is solid, and that one is actually quite solid because it directly impacts your availability: when you're down, your availability is only improved by your mean time to recovery. That's super important to understand. There's another thing which is quite interesting as well. During the lifecycle of chaos engineering, you basically start from the steady state, you make a hypothesis, you run your experiment, you verify your experiment, and then you improve. That's the continuous learning cycle we talked about all day. When you start, you actually don't need anything other than the hypothesis. I'll tell you how I run hypothesis meetings in my company, or in companies I work with. I put people in a room, and that's not just the engineers, it's the project manager, the CTO, the CEO, everyone that is somehow related to the project, to the thing we want to test, the service, the product, everyone in the room, from designers to... I love your helmet. It's not distracting at all. It's quite funny, it's good. So I put everyone in the room, right? Not just the engineers. And what is interesting to do here is to ask them to write on paper what they think the result of the hypothesis is going to be. So, for example, my hypothesis is: what happens if my database goes down? You write it on the paper within the time limit, you don't talk to anyone else, and you write what you think happens. And you know why it's super important to write it on paper? Because when people talk to each other, there's something called convergence of ideas, and that's related to diversity, to the diversity of people. There are strong-minded people who will push their ideas, and introverted people, like me, who naturally won't. You'd be surprised, I'm really an introvert. I need to step back and think a lot before I can say anything in a meeting. Whoever is extroverted will start talking a lot, and if they are really convincing, you get convergence of ideas, and everyone tends to think, oh yeah, that sounds about right, yeah, that's what's going to happen. This is zero information for you. It's useless. What you want is to write it down, because then you get a divergence of ideas, and you realize that everyone has a different idea of what happens if something goes down. And that's 100% of the time. I have yet to run a meeting where I've done that and everyone knew exactly what was happening. And you could just stop there.
This is the beauty of getting started with chaos engineering, because you have to understand: how on earth is it possible that everyone thinks differently? It probably means that the specifications were not complete, the documentation was not right, or things have changed and the changes haven't propagated. If you have a product that takes months to build, some developers might have gotten stuck with the specification from the beginning and haven't necessarily caught up with the new one. Or you are new there and you have assumptions, right? So over-emphasize the hypothesis. I kid you not, this is probably going to be the wow moment in your company. You're going to surface most of these issues there; you don't really have to run any experiments, because this alone is going to fix a lot of issues, since it's going to trigger questions which you're then going to investigate, which is actually quite beautiful. The fifth tip is: do introduce chaos engineering very early in the process. I see companies building beautiful processes, focusing on CI/CD pipelines, and only then thinking about chaos engineering. It's the same as creating an application and then saying, oh, maybe I should make it secure. So chaos engineering is actually job one. Well, it's not job zero, because security is, but it should be job one or two; it should be really high in the process. And that means when you hire people, teach them how to break stuff. Create dev environments where you can let people docker kill stuff and see what happens. Try just running your system in your local environment, playing with docker kill, and seeing what happens. That will teach you a lot. Actually, the first thing I did when I was hiring teams in my previous company is that in the first week they had to build a small program with three APIs. It was a product API, so get, post, and delete, plus a health check, and then they had to try to make it as resilient as possible. That was the only guideline. They could use all the tricks within the dev environment; the only constraint was that it was a Docker environment, things like that. It changes the way developers think, because it triggers something in the mind: there's more than just working, there's also breaking. And it goes back to when you tried to understand how a radio or a television worked. You often had to break it to see how it works. Right? And this is the same. There are a few commands I love to get developers to start with. docker kill, docker stop is just beautiful. Another one is to DDoS yourself. It's great to run tons of requests, for example, against your health check API and see how it behaves, or against your API and see: does my health check still answer? Because if you have an API and a health check API, and they are hammered and your CPU is at 100%, which one do you favor? Do you want to answer the API or the health check? Well, you should actually think about it, because one of the big problems in distributed systems is prioritization of requests when systems are congested. If you don't answer the health check and you answer your API, the load balancer is going to take the machine out of the auto scaling group, for example, and then you have nothing left. So degradation is a good thing to practice. Burning CPUs, for example: that's stress-ng, which was talked about today; it's an evolution of the stress tool. tc is definitely a good one to add latency. So there's a bunch of tools you can start playing with locally in your environment.
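As a small sketch of that local "break stuff and watch" exercise, here is a script that picks a random running container and kills it. The docker CLI flags used are standard; everything else (and the idea of running this only against a local dev environment) is illustrative.

```python
# Illustrative sketch: kill a random local Docker container, then watch how
# the rest of the system (health checks, retries, dashboards) behaves.
import random
import subprocess

def running_containers() -> list:
    """Return the names of all currently running Docker containers."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return [name for name in out.stdout.splitlines() if name]

def kill_random_container() -> None:
    containers = running_containers()
    if not containers:
        print("nothing to kill: no running containers")
        return
    victim = random.choice(containers)
    print(f"killing {victim}, now watch your health checks and dashboards")
    subprocess.run(["docker", "kill", victim], check=True)

if __name__ == "__main__":
    kill_random_container()
```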
You don't have to do this in prod. This will teach you a whole lot of things, especially how your system behaves. Another very important thing: when things go down, have a reflex to look at the blast radius and understand it. That way, when people start building, they will start to have this mindset of thinking about the potential blast radius, in the architecture and in the APIs. So it's really about creating a culture where everything is about blast radius reduction, because if you do things with as minimal a blast radius as possible, fewer customers are going to be affected. Chaos engineering is exactly the same. When you do your experiment, think: what is the smallest blast radius possible that I can use to prove or disprove my hypothesis? We've seen this today. Never assume, right? And if you assume something, it's probably broken. If you haven't verified it, assume it's broken. That's basically the rule. And there are a few things in the cloud that I continuously see people assume just work, and that goes with managed services. Actually, I think that's a problem with managed services. Even though they're super important for you to innovate faster, they give you a sense that everything is going to be okay, because the burden of managing that service is on someone else. And when it's AWS, yeah, we manage it pretty well most of the time. But failures happen, and they will happen. S3 did fail a couple of times since 2006, and the effect was dramatic, because people had never experienced an S3 outage before. They discovered new failure modes that they were not familiar with. And these are some of the failure modes that I've seen the most, at least on AWS, because that's the audience I talk to. Assuming multi-AZ, for example: you have a region, you have three AZs. How many of you use AWS? Just to get a sense, right. So on AWS you have a region, you have three AZs so your application can be fault tolerant. And people assume an AZ just never has problems. But it does. Even though they're isolated, sometimes they do. So test that you're resilient to a one-AZ failure. We had a discussion today: how do we do this? Well, you can push your subnets to have zero traffic in the network, for example. You also saw "people" on the slide here. I know, usually people are like, what the hell are you doing, verifying people? I'm not breaking people's necks, that's not what I do. I identify the top people in the teams, and then I take them out of the equation, because people over-rely on the ten-x developer, right? And you'd be surprised. I do this experiment very, very often. The last one was last year, October or November. I took the guy, it was great, I took his laptop and sent him home. Six hours later, we had to bring him back urgently. He was the only one who knew how to do a particular magic trick to get the database back up and running, or he had a key that no one else had. So don't stop at infrastructure or computer experiments; think about people, because this will highlight weaknesses in your processes and mechanisms. The problem here was just that they didn't have a mechanism to ensure that everyone on the team had the same level of knowledge, and they couldn't share it, because there was one guy who was just doing it all himself, and even though he was great, he never really told anyone what he was doing, and that was fine for everyone else. And this is just to get started with all those things.
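To illustrate the "push your subnets to have zero traffic" idea mentioned above, here is a hedged sketch (not Adrian's actual script) that associates a deny-all network ACL with every subnet in one availability zone to simulate that AZ failing. The VPC ID and AZ name are placeholders, and something like this belongs in a test account only; it also returns the ACL ID so the change can be rolled back.

```python
# Illustrative AZ-failure simulation: blackhole all subnets in one AZ by
# swapping their network ACL for a freshly created (default deny-all) ACL.
import boto3

VPC_ID = "vpc-0123456789abcdef0"   # hypothetical
TARGET_AZ = "eu-west-1a"           # hypothetical

ec2 = boto3.client("ec2")

def simulate_az_failure(vpc_id: str, az: str) -> str:
    # A newly created network ACL denies all inbound and outbound traffic.
    blackhole_acl = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]

    subnets = ec2.describe_subnets(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "availability-zone", "Values": [az]},
        ]
    )["Subnets"]

    for subnet in subnets:
        # Find the subnet's current NACL association and replace it with the deny-all ACL.
        acls = ec2.describe_network_acls(
            Filters=[{"Name": "association.subnet-id", "Values": [subnet["SubnetId"]]}]
        )["NetworkAcls"]
        for acl in acls:
            for assoc in acl["Associations"]:
                if assoc["SubnetId"] == subnet["SubnetId"]:
                    ec2.replace_network_acl_association(
                        AssociationId=assoc["NetworkAclAssociationId"],
                        NetworkAclId=blackhole_acl,
                    )
    return blackhole_acl  # keep this ID so the experiment can be reverted

if __name__ == "__main__":
    acl_id = simulate_az_failure(VPC_ID, TARGET_AZ)
    print(f"AZ {TARGET_AZ} blackholed with ACL {acl_id}; remember to restore!")
```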
Actually, I realized that even though there is Gremlin, and I use Gremlin a lot, and I use the Chaos Toolkit a lot, and Pumba, and all those things, there was some stuff missing. So I went and wrote a bunch of open source tools to help people do failure injection. This talk is not about failure injection, because failure injection, in my opinion, is just a tiny part of chaos engineering. But these are the tools I wrote to be able to do the verification I was talking about. So the AZ failure is a chaos script. You can randomly kill database instances, ElastiCache clusters, the full AWS set. If you're using serverless and you do Python, I wrote an error injection library for Lambda that can return different HTTP codes and things like that. I can do demos later, because I'm already a bit out of time. So if you're interested in any of that, I'll be hanging out here and I can show you some demos, but you'll find everything on my GitHub. I just didn't want to push too much AWS stuff, because not everyone in the audience is on AWS, and it doesn't make sense to focus too much on that. So in summary, my biggest suggestion for getting out of the starting blocks with chaos engineering is to take a step back, remember that good intentions alone don't work, and realize that chaos engineering really is an intersection between culture, tools, and processes; it really sits in the middle. And if you have problems with the adoption of it, take a step back and try to analyze: maybe there's something in your culture that is missing. Chaos engineering needs very strong ownership, really strong deep dive characteristics, very high standards, and bias for action. That's how we phrase our leadership principles, for example, inside Amazon. But that will define who you're going to hire, right? So maybe you're just not hiring the right people. Maybe your culture is not set right yet, and maybe you can transform it to match a little bit more what chaos engineering in your company needs. And I think there's no blueprint. I can't tell you exactly what sort of people you need to hire to run chaos engineering, because it's going to depend on your company culture. Then you're going to have to define tools, and there are plenty of tools; there are new tools coming up every day. Which ones to use really depends on your environment and what you're doing, but there's never going to be one; it's going to be a set of them. The only thing I can tell you is to try to make these tools uniform for the entire company. Most of the failures I've seen are because people use so many different sets of tools, and then the adoption of those tools is very hard because there are so many of them. So choose the right one, whatever works for the particular verification you want to set up, and start with that. Then at the end, once your chaos engineering practice is really established, you can have more. But at least to start with, focus on a few, and these are the ones that are going to give you the most gains. So understand your past outages, figure out the patterns, and then it's not about the low-hanging fruit: try to fix the things that have the biggest blast radius first, and get the right tools and the right culture for that. But don't forget processes, because this is super, super important, and we talked about those mechanisms, right? It's a complete process.
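In the spirit of the Lambda error injection library mentioned above (this is not its actual API, just a hypothetical sketch of the idea), a decorator can make a handler occasionally return a failure HTTP status so you can verify how callers and alarms react. The injection rate and status code here are arbitrary.

```python
# Hypothetical sketch of error injection for a Python Lambda handler:
# with probability `rate`, short-circuit and return `status_code`.
import random
from functools import wraps

def inject_http_errors(rate: float = 0.2, status_code: int = 502):
    """Wrap a Lambda handler so that `rate` of invocations fail with `status_code`."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(event, context):
            if random.random() < rate:
                # Simulated failure: the real handler never runs for this invocation.
                return {"statusCode": status_code, "body": "injected failure"}
            return handler(event, context)
        return wrapper
    return decorator

@inject_http_errors(rate=0.2, status_code=502)
def handler(event, context):
    # Normal behavior of the (hypothetical) product API.
    return {"statusCode": 200, "body": "ok"}
```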
So you have the tools, the adoption, and the auditing part of it, and that's pretty much it. Thank you very much. I write a lot on Medium, you can follow me on Twitter, and I'm going to be hanging out here. Remember, I'm a bit of an introvert, so even though I speak on stage, I'm actually an introvert, but I'm happy to talk with you. Thank you very much.

Adrian Hornsby

Principal Technologist, Architecture @ Amazon Web Services (AWS)



