This transcript was autogenerated. To make changes, submit a PR.
I'm Vilas. I am a director of engineering at Walmart.
I focus on enabling developers to deploy their code
in a resilient, performant way into the
cloud of their choice. That's what I focus on. What I'm going to talk about
today is a little bit about what we saw in our journey
towards becoming more resilient as a software
company and what were the challenges we faced, what are the mistakes
we made and all of that stuff. So before I go there,
obviously everyone in the room knows what this is.
I'm not going to repeat, everyone else is going to talk about this.
Suffice to say, the way we looked at this was we realized the importance
of this at Walmart, but we also realized that to
execute something like this at Walmart scale was going to be
a huge challenge. Right. What is Walmart scale?
These are some numbers. We have more than 11,000
stores worldwide that are supplied with software that actually does everything in the store. All of the management on the supply chain side, all of the retail business is managed by software.
The annual revenue of all of this combined is more than half a trillion dollars, which speaks to the amount of goods that are moved across the world for it. Right. That number, 270 million, is the number of folks who transact on our omni services in a week, meaning the website and the stores combined. So that's the amount of foot traffic or
transactions that we see in a week. So if you think about scales
like that, any kind of disruption could cause
a massive amount of damage. So that is something that we wanted to sort of
think about when we even thought about doing this kind of journey
at Walmart. So to begin,
initially, we had to establish some truths, right? So we said,
okay, the following things are true. One,
reliability is no longer just a function of redundancy
and overscaled hardware, right? So we are not going to just throw some servers at the problem and assume that fixes it.
That's no longer the case. Specifically the way
that we were thinking about it. We wanted to exist in a hybrid cloud environment.
If we wanted to do that, we had to acknowledge that external cloud providers, no matter how strong their guarantees are, are still a dependency, a variable. And if something bad happens, we still have to be ready and able to serve what the customer needs.
Customers do not like the idea of scheduled downtimes, right? There was a time when that used to happen. That's no longer true, which means customers expect you to be on all the time and have the best service, no matter how many parallel connections are open or how many parallel transactions are running. They want everything to be just as smooth at peak as it is at non-peak hours. Right. Any user that's actually performing
any transaction on the service does not expect any loss of functionality. Essentially, your cart being unavailable or your items being unavailable is unacceptable; they want it all the time. And that's the expectation. And that's not wrong.
That's how our customers are today.
And a direct corollary of that is that users could lose trust in a brand because of a single moment of bad experience, a glitch, something bad happening in the back end. Right? That loss could be temporary, where they feel, oh, yeah, this is not a good place for me right now, and I'll come back later. Or it could be for a lifetime: this is it, I don't trust this brand at all. And they could communicate that to their families. You could lose their entire business for their entire lifetime. Right. These are pretty big numbers that we had to
consider. So these are truths. We held them as truths. And using this, we said,
okay, fine. So what is the goal? So obviously you will see a
lot of connections to the principles of chaos engineering directly.
But this was our goal to maintain an application ecosystem
where if there are failures in infrastructure and dependencies, they cause
minimal disruption to the end user experience. And what is minimal disruption
is something that we sort of defined over time. We refined it from being
a very macro level, very sort of amorphous
detail to making it more and more fine grained over time.
Right? So the
first thing that we started talking about is, how do we inculcate the idea of
running a chaos exercise? Right? So we said, fine, let's talk
to them about an outage. You had an outage.
There were probably a lot of things going wrong. How did you manage it? And by talking about it, we realized that every outage was essentially a chaos exercise: completely unintentional, but happening nonetheless, and causing people to be reactive.
It was exposing gaps in our systems. It was exposing things where we
were not as good as we could be. Obviously, the revenue impact
could be huge, but essentially this entire thing constituted
a chaos exercise. So you were sort of in that mode already.
The only idea was to say, let's not have it unintentionally.
Let's prepare for this from the start.
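A sketch of what "intentional" looks like: a chaos exercise framed as an experiment with a steady-state hypothesis that is always rolled back. Everything here, the probe, the fault injector, the threshold, is hypothetical.

```python
# A minimal sketch of an intentional chaos exercise. All names here are
# hypothetical; in a real exercise the lambdas would hit real systems.
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str        # e.g. "cart stays available if the cache is down"
    max_error_rate: float  # steady-state threshold to hold under the fault

def run_experiment(exp, inject_fault, rollback_fault, probe_error_rate):
    """Inject a fault on purpose, measure, and always roll back (do no harm)."""
    inject_fault()
    try:
        observed = probe_error_rate()
    finally:
        rollback_fault()  # undo the fault no matter what the probe does
    return {"observed": observed, "holds": observed <= exp.max_error_rate}

# Stubbed-out run with a 1% steady-state error budget:
exp = Experiment("cart stays available if the cache is down", 0.01)
result = run_experiment(exp,
                        inject_fault=lambda: None,
                        rollback_fault=lambda: None,
                        probe_error_rate=lambda: 0.004)
```

The point is the shape, not the stubs: the fault is injected deliberately, the hypothesis is checked against a measured value, and the rollback runs in a `finally` block so the exercise cannot leave the fault behind.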
Right? Obviously, the measure of all
of these exercises that we did at the start was we said, okay, we want
to calculate what the downtime looks like. Right? So we wanted to calculate a
per incident cost. Any incident that happens, we want to calculate exactly how
much it costs the company. We want to break it down by
quarter and see trends, find exactly where we are
super efficient and where we are not. And then we wanted to track this and find a path forward, essentially, right: to build a culture of software resilience, instead of saying, yeah, just fix this for now and it'll be fine.
So we started doing calculations for an incident. This should be familiar to anyone who's dealt with incidents; there is a path, right? You identify the incident, you page out, there is an on-call that receives it, they file an issue, or maybe your L1 support files an issue. There is some
logging that also is sending out some alerts. There is
initial triage with the folks who are on call. They try out some stuff.
They figure out exactly what the playbooks look like. They will try some things out.
If that doesn't work, you will assign this to a subject matter expert
for a fix that could take some time. You could escalate if necessary.
Finally, you would resolve it, you would close it. If you find a bug,
you would deploy it, or you would do a hot deploy. Or if there is
something that requires you to rearchitect something because fundamentally something is wrong,
then you would write that feature in and then you would release it.
Right. This is expensive if you look at it.
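A back-of-the-envelope sketch of tallying that path stage by stage. The stages mirror the ones just described, but every rate and dollar figure here is invented for illustration:

```python
# Illustrative per-incident cost model; all numbers are made up.
HOURLY_RATE = 120.0            # assumed loaded engineer cost, $/hour
REVENUE_LOSS_PER_MIN = 500.0   # assumed revenue impact while degraded, $/min

# (stage, minutes spent, engineers involved)
STAGES = [
    ("identify + page out",  10, 1),
    ("initial triage",       30, 2),
    ("SME fix",              90, 3),
    ("escalation",           20, 4),
    ("resolve + deploy",     45, 2),
]

def incident_cost(stages, degraded_minutes):
    """Labor across all stages plus revenue lost while degraded."""
    labor = sum(mins / 60 * people * HOURLY_RATE for _, mins, people in stages)
    return labor + degraded_minutes * REVENUE_LOSS_PER_MIN

full_cost = incident_cost(STAGES, degraded_minutes=195)
# Stopping the incident at initial triage drops every later stage:
triage_cost = incident_cost(STAGES[:2], degraded_minutes=40)
```

Even with toy numbers, the shape of the answer is the same as in the talk: the revenue term dominates, so cutting off the tail of the path at triage is where the savings are.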
We want to try and solve problems, or reduce the costs, by stopping it at that third block: right where there is initial triage. We want that on-call person to have enough experience and enough information to be able to solve it at that third level. Right? So this is
what we called an incident cost, and this is what we calculated for all of
the incidents that we had. So once we had done this
as the initial step, we still had a long way to go because we had
to do a lot of homework. The teams that were running
this had to do some homework. Right? And this
is where I would start talking about how others could
also institute this in your companies. Right.
The homework that is needed is very generic. Everyone who's trying to
run chaos exercises should be doing this. First is
observability. This is non-negotiable. If you cannot keep a pulse on the system, understand when something goes wrong, and identify it within a specific period of time, then you are not observable,
right? So obviously everyone relies on logging. There's a lot of ways
to do logging. We have Splunk dashboards, Kibana dashboards, things like that. But that's not enough. That's only solving part of the problem. Right? You need to have intelligent second-level metrics to figure out exactly what changed, right? You want those deltas,
you want those trends and the changes. You obviously
need alerts set up. But I would say the last thing is important, right? The measure is: for any team that you run, can an on-call engineer who gets paged for an incident resolve the issue successfully within the SLA of a P1? Typically, the SLAs for P1s are short, both for acknowledgment and for incident resolution. If that person can essentially find the issue, root-cause it to a successful degree, and then push in a fix within that time, that, to me, indicates that there is significantly good observability in the system. Right? Of course, it's not perfect.
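A rough sketch of tracking that measure over incidents. The 60-minute P1 SLA here is an assumed value for illustration, not an actual Walmart SLA:

```python
# Did on-call root-cause and push a fix within the P1 window?
from datetime import datetime, timedelta

P1_RESOLUTION_SLA = timedelta(minutes=60)  # assumed P1 resolution SLA

def resolved_within_sla(paged_at, fixed_at, sla=P1_RESOLUTION_SLA):
    return (fixed_at - paged_at) <= sla

# Two hypothetical incidents: one resolved in 40 minutes, one in 90.
incidents = [
    (datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 2, 40)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
]
hit_rate = sum(resolved_within_sla(p, f) for p, f in incidents) / len(incidents)
```

A consistently high hit rate would suggest the observability is good enough; a low one points at gaps before any chaos exercise is run.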
I'm sure there are better ways to measure this, but this is something that I
think gives us a good measure of how the system
works. The other prerequisite is on-call readiness. Imagine your team has an on-call, and they wake up in the middle of the night. Let's say they wake up at 2:00 a.m. and they're on a call with the VP and the CTO, who are saying, we're losing millions every second, how can we fix this? Right? At that point, you do not want them
stumbling to try out things. You want them to have a specific playbook
for all of the issues that can go wrong, right? So you need to have
a disaster recovery playbook, and you need to have this playbook well tested.
Having a playbook is not the answer. Testing a playbook continuously is
the answer. Right. You need to make sure that every application or every microservice, however you define it, defines its own critical dependencies: the dependencies which, when they go down, impact the application to a degree where it's not functional, it cannot do its job.
Those are critical dependencies, right. For every critical dependency failure,
there needs to be a detailed playbook or an automated way to
have fallbacks instituted whenever the critical dependency is
not available. Non-critical dependencies also have to be defined, right? It's possible that you are writing to a log that's asynchronously shipped to another service. Initially, a failure there may not affect you, but over time, the accumulation of those logs may impact you in some way. Right. Maybe the space on your system is being eaten up and you eventually run out of disk space. Right. That kind of non-critical dependency needs to be defined along with the thresholds at which it will start impacting the service.
Right. If you have this, then an on-call engineer, even in the middle
of the night on a call with a high pressure situation,
will be able to navigate this in a systematic way.
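One way to make those definitions concrete is to declare them as data per service. All service names, thresholds, and runbook paths below are hypothetical; the shape is what matters:

```python
# A sketch of declaring a service's dependencies up front (names made up).
CART_SERVICE_DEPS = {
    "critical": {
        # the service is not functional without these; each one needs a
        # tested playbook or an automated fallback
        "inventory-db": {"fallback": "serve cached availability",
                         "playbook": "runbooks/inventory-db-down.md"},
        "payment-gateway": {"fallback": "queue orders for retry",
                            "playbook": "runbooks/payments-down.md"},
    },
    "non_critical": {
        # failures accumulate over time; declare when they start to hurt
        "async-log-shipper": {"threshold": "disk usage > 80%",
                              "playbook": "runbooks/log-backlog.md"},
    },
}

def playbook_for(dependency):
    """Give the on-call a playbook to follow for any declared dependency."""
    for tier in CART_SERVICE_DEPS.values():
        if dependency in tier:
            return tier[dependency]["playbook"]
    raise KeyError(f"{dependency} has no declared playbook")
```

With a structure like this, the 2:00 a.m. on-call looks up the failing dependency and gets a tested playbook instead of improvising.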
So that's what we think is required homework.
Because we realized that anyone who goes through this exercise automatically discovers a lot of gaps in the system, even without running a chaos exercise.
So these are the two sort of key homeworks. I would say if you do
not have observability and you do not have these prerequisites, you will not be
able to get the most out of a chaos exercise.
The other thing I would say is there needs to be a way for you
to be able to understand your production load, what it looks like,
and be able to generate it in a sensible way. Right.
This is crucial for two reasons. One,
you do not ever want to run in prod
unless you really are confident. So you have to always have
a do no harm approach. Right. You cannot knowingly cause harm to
prod, so you want to know how to do this in a pre prod environment,
but using prod traffic so you can be confident of the results.
There is no point in running failure injection tests if you can't really verify
what will happen if there is a failure and what happens to prod traffic.
Right? And you want to try and do that as much as possible early enough
in the cycle, so that you're not causing the company expense.
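A minimal sketch of generating production-shaped load for a pre-prod run: sample request types in the proportions observed in prod. The traffic mix and paths here are made up for illustration:

```python
# Deterministically sample requests matching an assumed prod traffic mix.
import random

prod_mix = {"GET /cart": 0.6, "GET /search": 0.3, "POST /checkout": 0.1}

def synthetic_requests(n, mix, seed=7):
    """Sample n request types weighted by their observed prod frequency."""
    rng = random.Random(seed)
    paths, weights = zip(*mix.items())
    return [rng.choices(paths, weights=weights)[0] for _ in range(n)]

batch = synthetic_requests(1000, prod_mix)
cart_share = batch.count("GET /cart") / len(batch)  # should land near 0.6
```

The seed keeps the run reproducible, so a failure injection test can be replayed against the same synthetic traffic.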
By all means, if you're confident, you should be pushing your testing to prod as
well. But if you're not mature enough, this is something that's important,
right? The build or buy question.
This is something that I think has been answered by lots of folks
that have more knowledge about this than me. I don't think there is any preference.
If you have a system that can generate production-like load internally,
that's great. If you have proprietary needs to do that, that's fine. But there is
also buy options which other teams and other companies have used.
Right? So I don't have a strong opinion on that, but it is something that has been debated. The other thing I would say needs investment is a CI/CD workflow, right? The diagram that I
have on the screen essentially demonstrates what a real CI/CD pipeline looks
like in this day and age. There is a lot of automation.
You'll notice some of the stuff is very familiar to you, right? Which is plan,
code, build, test and deploy. Right. In between there
is a profile stage, right? A profile could be something like run a
performance test, find out exactly how much utilization I've been
using, recommend the best solution for me to exist in.
Like, give me the right kind of CPU and memory allocations for my Kubernetes pods. It could be, give me the right flavor for my VMs on Azure, whatever those are, right. That has now essentially become part of the CI/CD cycle. And having that enables you to
solve a lot of these reliability issues way earlier in
the pipeline, rather than wait for some signal to come out of production.
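A toy version of that profile stage: take peak utilization from a performance-test run and recommend resource requests with some headroom. The 40% headroom and the numbers are assumptions for illustration:

```python
# Recommend Kubernetes-style resource requests from measured peaks.
def recommend_resources(peak_cpu_millicores, peak_memory_mib, headroom=0.4):
    """Add headroom above observed peaks so the service isn't over-provisioned
    by guesswork or under-provisioned by the perf test's exact numbers."""
    return {
        "cpu_request_m": round(peak_cpu_millicores * (1 + headroom)),
        "memory_request_mib": round(peak_memory_mib * (1 + headroom)),
    }

# e.g. a perf test showed the service peaking at 250m CPU and 512 MiB:
rec = recommend_resources(peak_cpu_millicores=250, peak_memory_mib=512)
```

The same shape works for picking VM flavors: measure in the pipeline, recommend before deploy, instead of waiting for a production signal.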
You will also notice two key things here. There is a multi-cloud environment; so hybrid cloud, I obviously represented that. The first key thing is the inference layer, right? The inference layer is essentially a back channel out from prod coming into the CI/CD pipeline. This is where I say observability is important: if you have an observable system, you can read what's happening in production and then feed that back in to make your code do better, right? The other thing you'll see is a decision engine that's between
the CI/CD pipeline and the clouds. This is something
that many teams have started investing in, many big companies
have started investing in, which is figuring out what's the best
kind of cloud configuration,
if you will, for my application. Would it be better on
prem because it has latency restrictions? Would it
be better on a certain cloud provider, because that provider offers a certain kind of platform service? These decisions have now started to get automated, right? And you'll see more of this in the coming years. But this is an investment that I feel enables you to
have a better system in production.
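A toy version of such a decision engine: pick a deployment target from the application's declared requirements. The targets, rules, and thresholds are all hypothetical:

```python
# Choose a deployment target from declared application requirements.
def choose_target(app):
    if app.get("max_latency_ms", 1000) < 10:
        return "on-prem"        # tight latency budget: stay close to the stores
    if app.get("needs_managed_platform_service"):
        return "public-cloud"   # a provider offering that platform service
    return "private-cloud"      # default hybrid-cloud landing zone

# Example application profiles:
store_app = {"max_latency_ms": 5}
web_app = {"needs_managed_platform_service": True}
batch_job = {}
```

Real engines weigh many more signals (cost, data residency, capacity), but the idea is the same: placement becomes a function of declared requirements rather than a manual choice.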
Building a maturity model. So we did this at Walmart.
We did not allow teams to just come in and run chaos engineering tests,
right? We have a detailed maturity model from level one of
maturity to level five, with requirements at each level. And this is all detailed
in that blog post. It talks about
how as a team, you can enable yourself to grow from one level to
another. And that also enables the team to become more and more confident.
It enables management to be more confident on the team running these exercises.
So that maturity model I have seen, it really helps. Right.
What happens essentially, because of that maturity model, is that over time,
as you go from red, which is level one, to green,
at level five, the support costs go down. Right.
What also happens is that the revenue lost per second or per minute, however you calculate it, also goes down, and the resiliency obviously shoots up.
At level one, I would say the support costs could be a five-digit dollar number per minute, whereas at level five, you're probably looking at a few hundred dollars per minute. Right. So imagine a couple of engineers working with the observability systems and just getting all the answers at level five, because they have everything in place, versus a whole team of engineers trying to figure out, what do I do next and what do I shut down to solve this? The other
thing is build the right tools. So we invested a lot of time in making
sure that we have the right tools to enable our teams to run the resiliency
test. One of the tools that we did invest in is Resiliency Doctor. You can read all about it in the
article that Vijita has published. So investing in
tools is important because I believe the ecosystem is still
sort of getting up to speed on that killer product.
So we all need to sort of contribute, chime in and do
our best. So this is essentially something that, for all companies, I would say is good value for money.
And this has really helped us enable that maturity model in all the teams.
So apart from the maturity model for the teams, the other thing that I would
focus on also is building the right mindset of each and every individual.
Right? So what we suggest all companies do, because we've seen this succeed, is a lot of training.
Right. The path is not easy. The journey is
not easy. You have to make sure that people understand what they're trying to
accomplish before they do it. So trainings, certifications, brown bag sessions, all of the usual stuff. Make sure that you're building resilience after an
outage. Right. And resilience doesn't just mean software resilience.
And I know there is other speakers who will talk about this,
which is about human resilience as well. Right. You want to make sure
that the person who was on call doesn't treat this as a traumatic experience.
They actually learn from it because the postmortem itself was
blameless. You're not really pointing fingers. You're trying to figure out what
the system looks like and how to make it better. And teaching that
is also something that doesn't come naturally to human beings.
Right. We tend to want to find blame in someone, and blameless postmortems are something that I think really takes a culture change for companies to accept.
And the last thing is carrots, not sticks: you treat teams with rewards, not punishments. Basically, the idea is
you provide some kind of an incentive for folks doing these kind
of resiliency exercises. It's a hard thing to do. The fact that they're
accepting it, doing it means that they're committed to it, they're passionate about
it. You want to make sure that you enable them and you incentivize them.
So what did we learn on our journey? All of this stuff, I think, works for any company anywhere. But there are some things that I would say you should not do, and that's what I'm going to talk about next. These are the don'ts.
So these are learnings. The first one was that we mistakenly created vanity positions, right? And this is something that other folks have also done: this person is the designated chaos practitioner, or this person is the resiliency expert, and such and such. That doesn't work because it shifts the responsibility for getting something done to that one person, instead of democratizing it to everyone, saying, oh, yeah, we are all enabled and empowered to do this. It shouldn't be this one person. So we
learned that pretty early on; we realized that it was not helping our cause. The second thing was, the exercises cannot
be conducted without complete participation. When I say complete participation,
it doesn't mean complete participation of just the team. It has to include the team's dependencies: maybe the SRE teams, maybe the other teams that support you in production. All of these folks need to be signed up; they need to be committed to this cause. And that also takes effort.
It does also take interest from their side and passion from their side.
So this is something that we sort of found out over time, and we realized
that we have to sort of fix this. The other thing was
don't assume, verify. So this is sort of a
take on the same thing, right? So like trust but verify kind of thing.
But it always starts with assumptions, right? All of us in this room,
we know what the value prop of chaos engineering is, but you can't
go in with that assumption with everyone, right? So, for example, observability,
you cannot go in on trust. If you ask a team, hey, do you think you can find an issue in 15 minutes? They'll probably say yes. I don't think there is any team that has said no to us or to anyone else. Right. But you have to verify that by checking whether the observability truly is as they say it is. Right. No team, I would say, is as well instrumented as they think they are.
With on-call, being reactive is always the problem. Right. You have to make sure that you test the on-call systems; you want to make sure that the on-call person has everything. That doesn't mean actually testing the on-call person, like, hey, I just created an issue, did you catch it? Not that way. You want to make sure that they are empowered before they get into a bad situation.
The other thing is that specifying what the team's goals around resiliency are is crucial. Not every team sees it the same way. It will change team to team, and it will change application to application. And those goals will be prioritized by that team; it is not something that you can centrally prioritize. You can say loss of revenue should be minimized, but that is interpreted by different teams differently.
And I think this is something that we learned, and it was a pretty painful
learning for the simple reason that we didn't really understand
that teams had certain plans of their own for how resilient they wanted to be. Instead, we were imposing a certain metric which for them was
meaningless. Right. So that's something I would suggest folks to have discussions
about. And obviously their deployment pipelines have to be verified.
But this, I would say, is something that is part of the exercise.
So, are teams ready for exercises? If you just start your chaos engineering group and you go to someone and ask, okay, are you ready to do this? They are going to be very reluctant. Maybe someone will say, yes, we will try it, but in reality, none of them are ready. So you have to make sure that you build the
training, you teach them that it's okay to fail and learn, but don't fail in such a way that you cause trauma to the team, where they say, I'm never going to do this again, this was a bad idea. So in order to teach the right way to do things,
you have to treat chaos engineering as it's supposed to be. It's an experimentation process.
It's a process where you establish a hypothesis and you
prove or disprove it. It's a scientific experiment. It's not a,
let's just hammer out some of these nails into this server
here, and let's see what happens. That's not what chaos engineering is. So I just
want to give you guys a sense of where we are now.
The report card reads really well. I would say that all of the learnings that you saw have obviously given us a lot to think about in terms of how we want to progress in the future, but they have only given us more will to do more. The application
teams are eager to run these tests, and because of the maturity model,
they've understood that we do not have to just go in and be glamorous
overachievers overnight. It has to be a slow process.
Management confidence is good because management
invests in this.
Again, in order for management to invest in this,
they need to understand that chaos is not the goal. The goal
is resiliency, but the way to get to that is chaos. Right. And this
has been repeated by many folks, not just me. There's other speakers
also who will repeat this.
Increased resiliency in engineering tends to compound: if your engineering team itself is resilient, then it opens the way to subsequent learning. You start learning more, and it encourages them to test things out even more.
And frankly speaking, everyone defines an end state. All of the application owners define, okay, if something bad happens, this is what I want, right? But without verification, that's all in their dreams. This now allows them to stand in meetings
where maybe there are multiple high-level tech leads or senior management
and be able to say, okay, this is what my application does, and I'm confident
about it. And that's actually, I think, the best thing to come out of
it, because you want to empower your tech leads, your engineering managers and such,
so that they can stand their ground whenever they are talking to folks about
how to do this thing better or whatever. Right? Like, good design, good architecture,
all of that is improved by this. So this is where we are today.
And all of the learnings that we have gathered have led us to this.
And going forward, what we want to do is build on these
models, right? The maturity models that we see today that I
just described, those were rough around
the edges, right? So there were things that we had not defined very well.
Like I said, resilience of the individual person as well.
Right. The engineer. So we are trying to work on those things and make sure
that we commit to this in a way that is sustainable
and that's actually important for this kind of exercise.
That's the entire story that I wanted to share with you guys today,
and that's all I had. Thank you so much. If you have questions, I'm ready to hear them.