Conf42 Chaos Engineering 2022 - Online

Learnings from Chaos experiments

During a chaos experiment, the system might withstand the injected failure condition, or a weakness might be exposed. This talk covers what I have learnt from conducting chaos experiments on different types of applications and infrastructure.

  • Commonly identified scenarios
  • Best practices to handle the identified weaknesses
  • Knowledge sharing of the results from the chaos experiments


  • Vishnu Vardhan Chikoti has 17 years of experience across IT, which includes chaos engineering, product development and business analysis. Currently he's a Senior Manager, SRE at Fanatics. He will talk about his learnings from chaos experiments.
  • Chaos engineering has three main parts: experiments on a system, building confidence, and the turbulent conditions that can actually occur in production. What turbulent conditions can actually occur? There are various tools available for injecting them, open source or from a vendor.
  • There can be various fixes and various things to look at based on the weakness that has been identified. Other things to consider include observability (whether these kinds of errors are being captured when they happen) and error budgets (whether the errors are recorded against the budget).


This transcript was autogenerated. To make changes, submit a PR.
Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions and errors in real time, allows you to not only experiment with confidence but respond instantly to get things working again.

Hi everyone, my name is Vishnu Vardhan Chikoti, and in this session I'm going to talk about my learnings from chaos experiments. About me: I have about 17 years of experience across IT, which includes chaos engineering, product development and business analysis, so it's quite a diverse experience. Currently I'm a Senior Manager, SRE at Fanatics, and prior to Fanatics I have worked at Broadridge, Goldman Sachs, Bank of America, Tektora Consulting and DBS Bank. Most of this experience is in investment banking, product development, or in the SRE areas. Other than this talk, I have also done a few other things. I'm a co-author of a book, Hands-On Site Reliability Engineering, which was published recently, in July 2021. I have done a couple of tech talks: one at Conf42 SRE 2021, about a new enterprise SRE adoption framework called Arctic, and I have also spoken about chaos engineering and how it relates to error budgets at Chaos Carnival 2022. I also have a blog; it has content across capital markets, technology and some other things, so check it out if you want. From a location perspective, I have been in Hyderabad, India almost all the time for the last 20 years.

Before I get into the main topic, let's do a quick recap on what chaos engineering is. This is the definition from the Principles of Chaos. I'm not going to read the full thing, but there are three main parts to it: one, it is about experiments on a system; two, it is about building confidence; and three, it is about the turbulent conditions in production.
Now, when an experiment is done, it can result in a failure, meaning the system is not able to withstand that condition, or it can result in a situation where the system has actually withstood it. If the system is able to handle it, that builds confidence that this condition can be handled. If it doesn't, then we need to go back and look at how to fix it. The next thing is the turbulent conditions themselves. What turbulent conditions can actually occur in production? The first one is the famous application or service unavailability. We all know there can be a Chaos Monkey deployed that randomly shuts down services or applications. Coming to the modern way we develop, deploy and expose applications to users, there are various components: we might deploy on an on-prem system, on cloud, multi-cloud, hybrid cloud, in one region or multiple regions. We replicate across database instances. And the users themselves are now at a very high scale, spread across the globe, accessing applications over their WiFi, broadband or some wireless connectivity. In such a scenario there are many network hops between a user requesting something and getting back what they want. So there can be delays in the network, and there can be failures in the network, like packet loss or packet corruption kinds of scenarios. Or we can have a resource utilization problem with respect to CPU, memory or I/O. Now, when we consider an application running on, say, a VM, it's not just that one process running. There can be other things running as well: an agent for infrastructure automation, an agent for observability purposes, an agent for security purposes.
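The turbulent conditions described here (added network latency, random failures) can be sketched as a tiny fault-injection wrapper. This is only an illustrative sketch, not any particular chaos tool's API; `inject_faults`, `fetch_order` and all parameter names are hypothetical.

```python
import random
import time

def inject_faults(delay_s=0.0, failure_rate=0.0):
    """Decorator simulating turbulent conditions: added latency and
    random connection failures, like a chaos tool would inject."""
    def wrap(fn):
        def inner(*args, **kwargs):
            time.sleep(delay_s)                 # simulate network delay
            if random.random() < failure_rate:  # simulate packet loss / crash
                raise ConnectionError("injected failure")
            return fn(*args, **kwargs)
        return inner
    return wrap

# hypothetical service call wrapped with a 50 ms injected delay
@inject_faults(delay_s=0.05, failure_rate=0.0)
def fetch_order(order_id):
    return {"id": order_id, "status": "filled"}
```

Raising `failure_rate` above zero then exercises the caller's error handling the same way a real network fault would.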
So there can be various other agents, and they might cause a problem that creates, say, a high-CPU situation. However much you test your application, when you deploy it in production it runs in an ecosystem with other processes, and those other things can also cause a problem. Then there are configuration errors. We have all seen that a configuration error can cause a massive incident, so it's better to try that on a small blast radius by injecting it and verifying how the system behaves. And then database failures: with a database we can have locks, we can have transaction log fill-ups, and there can be other failure conditions in the database. How is your overall system working or behaving in those scenarios? So that's a quick recap on what chaos is. Now coming to the actual learnings, what I have learned from my experience with chaos engineering over the last four or five years. The first is about the tools themselves. Being technical people, we are all very interested in what tool to use the moment we know there is chaos engineering to be done. There are various tools now available, open source or from a vendor. Some of the things we need to look at are whether we are going to use a particular tool just for a team or a department, or at an entire organization level. As you broaden your scope, what you observe is that there is a variation in runtime: how the applications are actually running. Are they running on VMs, on-prem, on cloud, or in a Kubernetes kind of environment? Where are they running, and how are they connecting with each other? That is one variation.
The next variation you see is in the architectures themselves: how the applications are deployed, how you maintain high availability for them, how you replicate. There can be differences in that. Then there is application maturity. Given some standard architecture patterns and design patterns, how each application or service has implemented them can also vary, so the overall maturity of what they have considered and at what level they have implemented it also differs. If you run chaos experiments on a particular blast radius and think that the same thing will also work for another set of applications, that might not exactly match; there will be differences in what has been implemented. Next is about policies. Take an organization: there will be various constraints, like what your infrastructure team will actually allow you to do on your applications as you keep expanding the scope of where you want to run the chaos experiments. There will be certain policies from a security perspective on what experiments you can run and in what way you have to run them; even though the tool has a capability, using it might not be straightforward given those policies. And there will be change-related policies and incident-related policies, so you have to make sure that what you are going to run is in line with all of these. So that's one thing I have learned about tools in general. The second thing is about the actual tools themselves. As I said, there are many tools available. The first one is obviously the very famous Chaos Monkey; we all know that, and that's where it started. Then there is the Simian Army, which I didn't mention here. Then there is Vaurien, which basically injects failures based on a protocol like HTTP or MySQL.
An error condition can be returning an error or injecting a delay, those kinds of conditions. Then there is Chaos Monkey for Spring Boot: you add certain configuration to the POM file, build your Spring Boot application, and it will randomly start injecting faults. ChaosBlade, from Alibaba, has various chaos experiment scenarios, including fault injections at the JVM level. Then there is Pumba, which is for Docker containers; it can run as a standalone binary, or you can inject some libraries into your container and use Pumba that way. There is Chaos Mesh for Kubernetes environments; similarly, we have Litmus for Kubernetes and cloud native environments, and there is chaoskube for Kubernetes environments as well. Then there is Cthulhu, and we have Monarch from T-Mobile, which is used for Pivotal Cloud Foundry based applications, so if you have deployments there, Monarch can be used. There is Mangle from VMware, with various fault injections available for VMware environments or at a VM level. Then there is AT&T Resiliency Studio. And there is Muxy, which is also a proxy like Vaurien: you can put Muxy in as a proxy and inject response code failures, delays, or network-related failures through it. And there is Chaos Toolkit, with various integrations available. Again, which tool you want depends on where your application ecosystem is and what your blast radius is; based on that, it can be chosen. There are also some operating-system-native features that can actually be used. If you look at how these tools are implemented internally, there are some common things that are used.
Some common Linux features or some common Windows features are used underneath. Now coming to the actual scenarios I have seen while running chaos experiments. One is reconnect failures. Most of us who have worked for a long time can relate to this: let's say we have an application running and suddenly it can't connect to the database. How is it actually behaving? Is it not able to reconnect anymore, or is it hung, or what exactly is happening? That's one thing I have seen, and we need to look at how to fix it. Then there are timeout problems. If you inject a network delay, how is the timeout being handled in your service or UI or whatever it is you are verifying? If it's a UI, is it actually catching the timeout and showing a proper message, or is it just throwing a 404? Then there can be crashes. Whether it's a service, the UI, a mobile app or whatever, it can simply crash based on the injected failure condition, so that is something that needs to be checked. Then there are master-slave setups. Let's say you bring down the master instance and you have multiple slaves available: is the election actually working fine, and is one of the slaves actually promoting itself to master? That's something that can be checked. Another thing that can be done is to disconnect the master from a slave through a network failure condition. Does the slave now think that the master no longer exists and promote itself to master, so that we have a split-brain scenario? Those are other kinds of scenarios that can happen.
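The reconnect-failure scenario above (an application that suddenly can't reach its database and either hangs or gives up) is typically fixed with bounded retries and exponential backoff. A minimal sketch, assuming a hypothetical `connect()` callable that raises `ConnectionError` on failure:

```python
import time

def connect_with_backoff(connect, retries=5, base_delay=0.1, max_delay=5.0):
    """Retry a failing connect() with exponential backoff instead of
    hanging or failing permanently after the first reconnect attempt."""
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == retries:
                raise  # budget exhausted: surface the failure
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # back off, but cap the wait

# usage: a hypothetical connection that fails twice, then succeeds
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("db unreachable")
    return "connection"
```

Capping the delay and the retry count matters: unbounded retries are exactly the "hung application" behaviour the experiment is meant to expose.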
Then again, depending on what actually happened, you will have a fix. For example, you may need three nodes, one master and two slaves, so that even if the master disconnects from one slave through some network problem, that slave can connect to the other slave and learn that the master is still alive. So there can be various fixes, various things to look at, based on the problem that has been identified. Then there is auto scaling. If you inject a high-CPU condition, or you bring down a few instances, is your auto scaling actually working properly? Is scaling happening to bring back the required instances, or to bring up a new instance based on the CPU or memory injection that was done? Then the HA setup: how is your high availability working when one of the instances is not available? Is it kicking in correctly? Then exception handling. Let's say a user is doing some transactions and there is a failure at the network level, or one of those timeout issues. How is it being handled on the user's side? Say the user has done a buy transaction, it has actually hit the server and the buy got processed, but the response was not properly handled in the UI. That will be a problem: the user will not know what exactly has happened. So when there is a timeout or some other error coming from the back end, are you taking the user back to some kind of screen where they know what has happened overall? That kind of user experience matters. And then observability. This is not directly related to whether the application has withstood the turbulent condition, but other things that can be considered include observability: when these kinds of errors are happening, are you actually logging them, or are you capturing them within your observability data?
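The three-node fix described above can be expressed as a quorum rule: a replica promotes itself only when the master is unreachable both directly and via the peers it can still reach, and it is still part of a majority partition. This is an illustrative sketch, not any particular database's election algorithm; the function and parameter names are hypothetical.

```python
def has_quorum(reachable_peers, cluster_size=3):
    """True if this node plus the peers it can reach form a strict
    majority of the cluster."""
    return (1 + reachable_peers) * 2 > cluster_size

def should_promote(master_reachable, peer_sees_master, reachable_peers,
                   cluster_size=3):
    """Promote a replica only when the master is unreachable from this
    node AND from the peers we can still talk to, AND we hold quorum.
    This avoids the split-brain scenario from a mere network partition."""
    if master_reachable or peer_sees_master:
        return False  # master is still alive somewhere we can see
    return has_quorum(reachable_peers, cluster_size)
```

With one master and two slaves, a slave cut off only from the master asks the other slave first; an isolated slave (zero reachable peers) never promotes, because it cannot hold quorum.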
Observability is the data. Once you have that data, do you have the right monitors in place? When these kinds of things happened, did the monitoring actually pick them up or not? In certain cases, yes, we have set up the observability, but we haven't set up the monitors. And once the monitors are set up, the next thing is whether the alerting is being done through the right channels. Are you messaging someone, emailing someone, actually paging someone? That kind of alerting also comes into the picture, and through the chaos experiment you can verify whether it worked fine or not. Then error budgets. When these errors are happening, are the error rates, or whatever errors are being sent back, being recorded and adjusted against the error budget? You don't create a major impact to the error budget with chaos experiments because of the small blast radius, but you can still verify whether they are reflected there or not. Then auto recovery. If you have auto recovery set up, you can check whether it actually triggered correctly and recovered the instance. For example, you bring down an instance: did the monitoring trigger the auto recovery, and did it actually bring that instance back? Another learning I have is about injecting multiple failures. We don't inject multiple failures at the same time across the ecosystem; we inject one failure at a time. If you inject too many failures in parallel, you don't know what went wrong and what would have caused it. And then, how do you actually handle the identified weaknesses? When you have identified some problem through chaos experiments, it comes back to the product development backlog.
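Recording chaos-induced errors against the error budget, as described above, amounts to simple accounting: the SLO defines how many failures are allowed over a window, and every failure (injected or organic) draws down the remainder. A minimal sketch; the function name and parameters are illustrative, not from any SRE library.

```python
def remaining_error_budget(slo_target, total_requests, failed_requests):
    """Error budget = failures allowed under the SLO minus failures seen.
    Chaos-injected errors should show up here even with a small blast
    radius, which confirms the budget accounting is wired up correctly."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return allowed_failures - failed_requests

# e.g. a 99.9% SLO over 100,000 requests allows 100 failures;
# 40 observed failures leave a budget of 60
budget_left = remaining_error_budget(0.999, 100_000, 40)
```

A small chaos experiment should move this number by roughly the count of injected failures; if it doesn't, the errors are not being recorded against the budget.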
It basically goes into the technical backlog, because this is some kind of technical problem that you need to fix. Now, how do you prioritize it and get it fixed? It depends on the scale of the problem. You would have injected the chaos experiment on a small blast radius; what happens if such a problem occurs at a widespread level? Basically, what would be the impact, and what level of problem would it be creating? You need to prioritize it accordingly, and either use the 20% of the tech backlog that can be allocated in a sprint, or, if it is a bigger problem, allocate more time than that 20% to fix it. And then knowledge sharing. Once you have done the chaos experiments and learned something from them, it is important that you go back and share within the organization: these are the experiments that were run, and these are the results that we saw. If there are any observations, like the system did not withstand a condition, then what did you actually do to fix it? In certain cases similar patterns or architecture patterns will be used across the enterprise, and it's important that you go back and fix the different places where the same problem might occur. That's my talk. Any questions you have, send them on Discord, and thank you.

Vishnu Vardhan Chikoti

Senior SRE Manager @ Fanatics

