Conf42 Chaos Engineering 2024 - Online

Chaos Engineering for Developers: Breaking Systems for Resilience

Abstract

This talk explores developer techniques to run controlled experiments that build confidence. It will discuss why chaos engineering is not just an SRE thing and how developers should incorporate chaos into software development, making it a practice in their day-to-day work.

Summary

  • Dheeraj: This talk is about chaos engineering for developers. He says even software developers could think chaos while designing or writing code, and that this helps us build resilience through chaos. A survey shows 47% of companies, after adopting chaos engineering as a habit, have seen increased availability.
  • We are still lacking awareness of how chaos engineering can help. The biggest inhibitor to adopting or expanding chaos engineering is lack of awareness and experience. There are two major modes of chaos experimentation. Start in a pre-prod environment before your changes roll out to production.
  • For every dollar spent in failure, you learn a dollar's worth of lessons. Here are some of the popular open source tools you can use for chaos engineering. No one tool is suited for everyone.
  • How developers can benefit from chaos engineering: think about external dependency failures. What if the database I am relying on goes bad or goes down? How will I react to it? Will I be able to give a consistent customer experience?
  • Always test using real-world conditions. Always conduct post-incident analysis after each experiment. If you run regular experiments, you'll be able to increase your resiliency scores.
  • Chaos engineering is the way: 63% of 400+ surveyed IT professionals say that they have performed chaos experiments, and 30% claim they run them in production. GitHub has over 200 chaos-experiment-related projects with 16k+ stars. Build your resilience score, increase your nines, and inculcate chaos engineering as a habit.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to Conf42. This talk is about chaos engineering for developers, because we believe that chaos engineering is not just an SRE thing: even software developers, while designing or writing code, could think chaos, or rather should think chaos during software development, because that helps us build resilience through chaos. I'll give a quick intro about myself. My name is Dheeraj. I work as a software engineer at Amazon Web Services, popularly known as AWS. I work with the Aurora DB team, and within Aurora I work on the storage part, which is a multi-tenant, distributed, auto scale-out storage platform. Apart from work, I am a contributor and maintainer in the OpenSearch project. I mostly work on how we can run OpenSearch on Kubernetes, and I build tools, charts, and operators to help the community run OpenSearch and OpenSearch Dashboards on Kubernetes. So what is chaos engineering? There is a very thin line between chaos testing and chaos engineering. Chaos testing is when you are intentionally introducing failures into a system, but you are not doing anything with them afterwards. Chaos testing plus observability is chaos engineering: you are identifying, proactively monitoring, and then addressing potential issues. So when you are simulating real-world scenarios, proactively monitoring, creating your hypotheses, and then proactively resolving issues before they cause any impact, that is called chaos engineering. Now, like I said in the beginning, chaos engineering is not an SRE-only thing. It is equally important for developers as well. Why do I think so? Because data speaks for itself, and we'll take data-driven decisions in this session. The Gremlin survey shows that 47% of the companies, after adopting chaos engineering as a habit, have seen increased availability. Then if you see the next one, their MTTR, the mean time to resolution, has decreased by 45%. Same goes for MTTD, which has decreased by 41%. And the last two are very significant as well: the number of outages and the number of pages. Now, how can we achieve this at an organization level? It can only be achieved when you have the habit of chaos engineering imbibed, or rather plugged in, in every part of your software development lifecycle. Generally the trend is that after your entire software is developed and you are ready to go to production, you do some kind of chaos experiments and game days in order to validate your resiliency, and if you catch any bugs, then you go ahead and fix those. Now we are suggesting the reverse. When you are designing, when you are coding, even for small modules, you need to think chaos. When I say think chaos: if you have built a small module, just go ahead and inject some failures, see how your system behaves, then validate, then fix those issues. This way, when you think chaos incrementally during software development, it bubbles up for the entire system and leads to an increase in the number of nines, meaning increased resiliency or availability of the system. So in very layman terms, you can say you are building resilience through chaos.
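To make that module-level idea concrete, here is a minimal sketch of a fault-injection test; the OrderService and payment-client names are assumptions for illustration and are not from the talk. The idea is simply: stub a dependency to fail, then validate that the module still gives a consistent response.

```python
# Minimal sketch (assumed names): inject a dependency failure into a small
# module and validate that it degrades gracefully instead of erroring out.
import unittest
from unittest.mock import MagicMock


class PaymentUnavailable(Exception):
    """Raised by the (hypothetical) payment dependency when it is down."""


class OrderService:
    def __init__(self, payment_client):
        self.payment_client = payment_client

    def place_order(self, order_id):
        try:
            self.payment_client.charge(order_id)
            return {"status": "confirmed"}
        except PaymentUnavailable:
            # Degrade gracefully: keep the customer experience consistent.
            return {"status": "pending", "reason": "payment retry scheduled"}


class TestOrderServiceChaos(unittest.TestCase):
    def test_payment_outage_degrades_gracefully(self):
        # Inject the failure: the dependency always raises.
        failing_payments = MagicMock()
        failing_payments.charge.side_effect = PaymentUnavailable()

        result = OrderService(failing_payments).place_order("order-42")

        # Validate: the module still returns a consistent, non-error response.
        self.assertEqual(result["status"], "pending")


if __name__ == "__main__":
    unittest.main()
```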
Next: I talk with multiple people and multiple companies, and the biggest inhibitor to adopting or expanding chaos engineering is lack of awareness and experience. We are still lacking awareness of how chaos engineering can help. People are aware of the term chaos engineering, but what a true chaos engineering experiment does, people are not so aware of. Maybe they are injecting failures, but they are not building a proper failure model. So if you see on the right-hand side of the slide, there is a cycle. Let's start from the steady state of the system. When I say steady state, it is the state where we have no failures injected in the system; it is the normal state where everything is behaving as expected. Based on the steady state, we make some hypotheses: after, say, a power outage, or a cloud service outage, or a DB outage, how will my system behave? We create a bunch of hypotheses around that. Then we run the chaos experiment, which will actually inject the failures we wanted, plus some randomness as well. There should be some randomness in your experiments, otherwise they become more like integration or unit tests where you just start asserting things. Here, apart from asserting, you need to observe and find unknowns, which is the most important part of chaos engineering. Once you have run the experiment, you validate your hypotheses: whatever hypotheses I created from the steady state, do they hold true or not? And if you see that some of the hypotheses do not hold good, then you improve on those hypotheses, or rather improve your systems, and then you get an improved steady state and ultimately a good, resilient service. The other inhibitor, following closely, is other priorities: chaos engineering, most of the time, will take a backseat. If your organization is not practicing chaos engineering and tomorrow you go to your manager and say, okay, I want to do chaos engineering, it's very difficult to convince them, because it will certainly not add value to your service tomorrow, even if you start practicing it today. It is a journey, and it starts from a pre-prod environment. The last point: greater than 10% of engineers feel that something might go wrong. But listen, you need not start in production. You start in a pre-prod environment. Go to a prod-like environment and practice everything there before your changes roll out to production. In your beta stage, run these chaos experiments; that will give you enough confidence. Then during this chaos engineering journey, you will reach a point where you run these chaos experiments in production, and that is where you will see the real benefit: that even during outages, you are able to handle all these things. There are two major modes of chaos experimentation. The first one, which I would say is the start of the journey, is manual experiments; the second one is when you have already advanced in the journey and automated everything. Manual experiments are what I'm describing right now: tomorrow I built a module, I want to test it out, I just inject some failures and see how it breaks. That is ad hoc.
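Putting the cycle from the slide into code form, here is a minimal sketch of a single manual experiment run: check the steady state, inject a failure with some randomness, validate the hypothesis, and feed the result back into improvements. The three hooks passed in are assumptions; you would wire them to your own service and monitoring, not to any specific chaos framework.

```python
# Minimal sketch of the cycle: steady state -> hypothesis -> inject failure
# (with randomness) -> validate -> improve. The hooks are placeholders for
# your own tooling.
import random
import time


def run_experiment(check_steady_state, inject_db_outage, restore_db):
    # 1. Confirm the steady state before injecting anything.
    if not check_steady_state():
        raise RuntimeError("System is not healthy; abort the experiment.")

    # 2. Hypothesis: during a short DB outage, the system stays within its SLOs.
    outage_seconds = random.randint(30, 300)  # randomness, not a fixed script

    # 3. Run the experiment: inject the failure, then recover.
    inject_db_outage()
    try:
        time.sleep(outage_seconds)
    finally:
        restore_db()

    # 4. Validate the hypothesis against what you actually observed.
    hypothesis_holds = check_steady_state()

    # 5. If it does not hold, fix the system and re-run for an improved steady state.
    return hypothesis_holds
```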
Then we have game days. When I say game days, it means you run full-day outages. Suppose you bring your database down for, say, six to eight hours, and then you see how your systems are behaving and how your customer experience is getting impacted. That is what is called a game day. The second mode of chaos experiments is automated experiments, which live in your CI/CD pipelines. Whenever your code changes are checked in, you run some automated experiments which inject failures and validate a bunch of things. This way, you need not be manually involved in running all these experiments. You can just look at the report and say: this is my resiliency score, and it is holding good even with the next release I'm going to roll out. So this is continuous experimentation, and this is the next phase of chaos engineering we should target: once everything is automated, once you know your system's failure model, you can easily create this automation in your CI/CD pipelines. This way, every time code is checked in you get a resilience score, and you know that none of my code changes will compromise the availability of my system. A very simple example: suppose you create alarms using a Terraform template or CloudFormation. You make some changes to that CloudFormation, and as part of your CI/CD pipeline it is going to synthesize some new alarms. Now, your CI/CD pipeline also has a step which validates whether all the alarms are going off or not. Imagine you introduced a bug in your alarm creation code, so it updated an alarm which it should not have updated. Your CI/CD pipeline will catch it, because as part of chaos experiments we do two things: we inject failure and we validate. You injected a failure and you validated it; you evaluated via metrics or via alarming. Here we validated by alarming: we saw that the alarms are not going off, so something is wrong, and we pause that release. So this is a very good benefit of integrating chaos into your CI/CD pipeline.
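As an illustration of that kind of pipeline step, here is a minimal sketch of a chaos gate; the two helper functions are stand-ins (assumptions, not a real API) for calls to your chaos tool and your monitoring system.

```python
# Minimal sketch of a CI/CD chaos gate: inject a fault, check that the
# expected alarm fires, and fail the pipeline step if it does not.
import sys


def inject_fault_and_wait() -> None:
    """Stand-in: inject a fault into the pre-prod stack and wait for impact."""
    print("injecting fault into pre-prod environment ...")


def alarm_fired(alarm_name: str) -> bool:
    """Stand-in: query the monitoring system for the alarm state."""
    print(f"checking alarm {alarm_name} ...")
    return True  # replace with a real lookup against your alarming service


def chaos_gate() -> None:
    inject_fault_and_wait()
    if not alarm_fired("db-availability-breach"):
        # The failure was not detected: alarm creation or routing is broken,
        # so pause the release instead of shipping it.
        print("chaos gate FAILED: expected alarm did not fire")
        sys.exit(1)
    print("chaos gate passed: failure was detected by alarming")


if __name__ == "__main__":
    chaos_gate()
```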
A very important and famous quote comes from Jesse Robbins, who was also known as the master of disaster; that was his official title at Amazon, where he managed resiliency for everything carrying the Amazon.com tag: for every dollar spent in failure, you learn a dollar's worth of lessons. Whatever time and effort you spend injecting a failure, there will always be new learnings that come out of it. It will not go in vain, because every time you inject a failure, you look at the system with a different perspective, and that perspective surfaces more unknowns and helps build the resiliency of your system. Here are a few popular open source tools which you can use for chaos engineering: the very famous LitmusChaos; the legacy Chaos Monkey, which can bring down servers and create randomness in your system; ChaosBlade; Chaos Mesh, which is very prominent for Kubernetes-based environments; Chaos Toolkit; and Sto. There are many repositories with tools that will help you practice chaos engineering, but these are the most popular ones, so give them a try and start exploring. I'll also say that no one kind of tool is suited for everyone. Everyone will have their own failure model and their own resiliency model, and every service, even in the same company, has its own success metrics. So it really depends on your use case which chaos engineering tool you choose. Now the main topic for today: how developers can benefit from chaos engineering. My idea is that developers need to think about failures while they are writing code and designing their system, and do some failure-driven development; that is what will benefit the entire lifecycle of the product. So when you are designing, think about external dependency failures. What if the database I am relying on goes bad or goes down? What if the server I am running on starts failing in another availability zone? What if my sister services start failing? What if my upstream services start failing? How will I react? Will I be able to give a consistent user experience, a consistent customer experience? And when you are doing your code testing, when you are writing unit tests and integration tests, make sure you also write some automated chaos tests, the failure tests: do chaos testing on your module. If the other module fails, how will your module behave? These kinds of things will help you think chaos during software development. The next question is: you told me how to think chaos during software development, how to build my failure model and my resilience model, what chaos engineering is, what tools to use, and how to do it in CI/CD pipelines; now how do I run these controlled experiments at a very modular level? Here is how. First, you identify the boundary and the scope of the experiment. If you have written one module, you know what the use case of that module is and which components it will interact with. That is your boundary. Second, you build the failure model for your service. If your service A depends on service B, and service A also depends on a database D, then what happens if A fails? What happens if D fails? What happens if A and D both fail, simultaneously or sequentially? You build that model. Third, you think about dependency failures, external and internal. External is something like any cloud service or managed service you are using, or a service running on premises: suppose you have installed MongoDB on premises, what will happen if MongoDB goes down? Internal, or intra-dependency, failures are things like: what if my sister systems fail? Then, step four, you inject the failure, you monitor, and then you evaluate the results. That is one controlled experiment for a module. If you do this, you know that your module behaves well, and as you bubble up several modules into a service, on a service level also we can do the same five steps. If you go up to the entire product, on a product level the boundary and scope will increase, but the set of five steps still remains the same.
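As a toy illustration of step two, here is a minimal sketch that enumerates a failure model for a service A depending on a service B and a database D; the dependency names and modes are assumptions for illustration only.

```python
# Minimal sketch: enumerate the failure scenarios for service A, which depends
# on service B and database D, including combined failures.
from itertools import combinations

dependencies = ["service-B", "database-D"]

failure_model = []
for size in range(1, len(dependencies) + 1):
    for failed in combinations(dependencies, size):
        # Combined failures can be injected simultaneously or sequentially.
        modes = ["simultaneous", "sequential"] if size > 1 else ["single"]
        failure_model.append({"fail": list(failed), "modes": modes})

for scenario in failure_model:
    print(scenario)
# -> single failure of B, single failure of D, then B and D together
```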
Let's do something practical now. Say you need to design a microservice which is responsible for some CRUD operations and basic computations. Let's design it very simply: we have a microservice running on a virtual machine, it does some CRUD operations against a database, let's say a SQL database, and once we get the data we do some computations. How can we think chaos here? Based on the previous steps, the first thing that should come to mind is the boundary and scope of the experiment. The boundary and scope here is my microservice: it will return some results after CRUD operations and basic computation. That is my end result. Now, what if my external dependencies go down? At a very high level, my external dependencies are one database and one virtual machine. So how will I react to a database failure? I need to return a consistent experience to my customer. Thinking about it, maybe I can make my database global: I'll replicate it across availability zones and across regions, so that in case of a region outage or a disaster, at least my database can survive. This thought process helps you strengthen your database infrastructure. What if you still feel something could go wrong, that even a global database can fail? How do I cater to that? Maybe I can put a cache in front of whatever I want to query from my database, so I store results in a cache. My data may be stale until the database recovers, but I'll still be able to give a consistent customer experience; I will not suddenly start throwing errors. My data is stale, but I am still able to survive. The next thing to think about is the virtual machine I'm running on: should I deploy it in one AZ, two AZs, or three AZs? If we want to sustain an AZ+1 failure, which means one AZ is fully down plus one more instance is down, we need to replicate this service across three AZs, so we need a bare minimum of three boxes, one in each AZ. Only then can we say that in case of an AZ outage we will survive. This is the kind of thought process we need when designing a microservice responsible for these CRUD operations. Now, as part of failure experiments, what can you do? A very basic thing: just shut down your database and see how your system behaves. Create some network latencies and see how your customer experience is impacted and what you can do to improve it. You can even find the bottleneck of your system this way: this is the maximum network latency at which our customer experience won't be degraded. So this also helps you identify your resiliency bottlenecks. Then, what if two of my microservice instances go down? Will I be able to sustain a good customer experience, and will my success metrics remain the same even when two instances are down? Will the one remaining instance be able to take the load? So this is how we design a microservice for CRUD operations: while designing, we thought about the different failures that can happen in a real-world scenario, and then we designed the system accordingly. That was just an example.
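To make the "stale but available" idea concrete, here is a minimal sketch of such a read path; the ProfileStore name and the in-memory dict standing in for a real cache are assumptions for illustration, not from the talk.

```python
# Minimal sketch: read from the database when it is healthy, fall back to a
# possibly stale cache when it is not, so the customer never sees a raw error.
class ProfileStore:
    def __init__(self, db):
        self.db = db
        self.cache = {}  # stand-in for an external cache such as Redis

    def get_profile(self, user_id):
        try:
            profile = self.db.fetch(user_id)
            self.cache[user_id] = profile  # refresh the cache on every read
            return profile, "fresh"
        except Exception:
            # Database (even a global one) is unreachable: serve stale data
            # instead of suddenly throwing errors at the customer.
            if user_id in self.cache:
                return self.cache[user_id], "stale"
            raise  # nothing cached yet, so surface the failure
```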
Now let's go into some of the best practices in chaos engineering. I have reiterated this multiple times, and I'll reiterate it once more: understand the steady state of the system. Until you know what the correct state of the system is, you will not be able to identify when the system goes wrong; you can only judge whether abrupt behaviour is right or wrong when you know the steady state. The failure model is very important. Like in this example, we built a failure model: what if the database goes down, what if the virtual machine on which my system runs goes down? Third, control the blast radius of your experiments. When I say blast radius, I mean the boundary I was talking about: my microservice interacts with certain components and a database, so I restrict the experiment to those kinds of failures. Then, introduce randomness or jitter in your failure injections, as in the sketch below. Maybe I won't say, shut it down for 30 minutes and then let the server come up; maybe do some intermittent stuff instead: shut it down for, say, five minutes, bring it up again, then shut it down for 20 minutes. Take the example of a fire outage. Think of how a fire happens in a data center: there are multiple racks, and when the fire starts, maybe one rack catches fire, so some virtual machines are impacted, then a second rack, then a third rack, and ultimately the whole data center is down, ultimately the whole AZ is down. There is randomness in how the fire spreads and creates failures and disasters in the system.
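A minimal sketch of that jitter idea, assuming hypothetical stop_server / start_server hooks into your environment: rather than one fixed 30-minute outage, run several intermittent outages of random length.

```python
# Minimal sketch: intermittent, randomly sized outages instead of a single
# fixed-length one, mimicking how a real incident spreads unpredictably.
import random
import time


def intermittent_outages(stop_server, start_server, rounds=3):
    for _ in range(rounds):
        down_for = random.randint(60, 20 * 60)   # 1 to 20 minutes down
        up_for = random.randint(60, 10 * 60)     # 1 to 10 minutes recovered
        stop_server()
        time.sleep(down_for)
        start_server()
        time.sleep(up_for)
```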
Always test using real-world conditions, and don't think, okay, I'll just go and test in prod. It is a journey; it starts in pre-prod. Always conduct post-incident analysis after each experiment. This is very important: until you conduct post-incident analysis, you will not be able to reap the benefits of chaos engineering. As important as failure injection is, post-incident analysis matters even more, because that is where you learn about the bugs or issues that surfaced as part of the experiment. Then there is extensive monitoring and logging: your system needs a good observability posture so that you can identify issues while you are running these chaos experiments. Last but not least, start today, and run these experiments often. If you run regular experiments, you'll be able to increase your resiliency scores. Chaos engineering is being used a lot these days wherever we are focused on speed. For systems which promise one-day delivery, grocery delivery, or food delivery, availability is essential, and to check that availability, chaos engineering is the way: 63% of 400+ surveyed IT professionals say that they have performed chaos experiments, which is a good number, and 30% claim that they run them in production. This gives us good confidence to go tomorrow and write these chaos experiments: if people are running them in production, why can't we start in a pre-prod environment and test our resiliency? GitHub has over 200 chaos-experiment-related projects with 16k+ stars, so you can imagine how many people are into chaos engineering. And there is a stat that teams who run frequent chaos experiments are seeing a minimum of three nines of availability, which is very good. All major cloud providers, like AWS and Azure, have their own managed services for doing chaos experiments, and apart from those there are many other managed chaos services as well, like LitmusChaos, which is provided by Harness. So do check them out, see how you can plug chaos into your existing software development lifecycle, and break systems for resilience. Build your resilience score, increase your nines, and inculcate chaos engineering as a habit. Feel free to reach out to me on Twitter or LinkedIn. On LinkedIn my alias is the algo without the underscores, and on Twitter you can just scan this QR code, which will take you to my Twitter page. You can DM me or tag me with any follow-up questions regarding this talk. I hope you enjoyed the session, and do go through the other talks at Conf42 as well; there are many interesting topics where people are talking about chaos engineering and its different aspects.
...

Dhiraj Kumar Jain

Software Engineer @ AWS



