Conf42 Cloud Native 2021 - Online

Chaos Engineering for Cloud Native

Reliability is key to adoption and scale of cloud native systems. Chaos Engineering is a way of measuring how reliable the services are. There are a host of ways to use Chaos Engineering in DevOps.

This session describes the importance of Chaos Engineering and the best practices around it, specifically for cloud native environments.


  • Uma Mukkara: What is reliability, and what does it mean in cloud native environments? How can cloud native developers and SREs take control of reliability? What are the good practices for both SREs and developers? I will also touch upon LitmusChaos, the chaos engineering project for achieving reliability in cloud native.
  • Reliability is very important because there are many applications that you're dealing with in cloud native. How do you achieve such reliability in cloud native? One answer is to implement chaos engineering from the beginning and do it at scale.
  • Chaos engineering can accelerate the journey to containerization. You can benchmark, measure, and scale your service. There are many sectors where chaos engineering has proven helpful. As an SRE, you have to start believing in chaos engineering.
  • LitmusChaos is a project that we started a few years ago with these chaos engineering principles as its core goal. The latest release of Litmus supports GitOps and open observability as well. Litmus is well adopted, stable, and ready for enterprise adoption.


This transcript was autogenerated. To make changes, submit a PR.
Folks, welcome to Conf42 Cloud Native. I'm Uma Mukkara, founder and CEO of Chaos Native, and I'm also the co-creator and a maintainer of LitmusChaos, which is a CNCF project for chaos engineering. Today in this session I'm going to talk about what chaos engineering means in cloud native environments, and how cloud native developers and SREs can take control of reliability. So in this session we're going to touch upon what reliability is and what it means to achieve reliability in cloud native environments. We will look at chaos engineering as a way to achieve that reliability. What are the good practices for both SREs and developers? And I will also touch upon LitmusChaos, which is the chaos engineering project for achieving reliability in cloud native.
What is reliability, and what does it mean in cloud native environments? Generally, reliability means you run your services without any outage. Then you are called very reliable. But it does not end with that. It also means that you need to have your SLOs or business SLAs met even when you are running without an outage: for example, latency, or service performance under scale, et cetera. So you sometimes need to measure reliability when you are asked to ramp up your services. There are certain days on which you are going to scale your services to a high degree, and your SLOs need to be met on such days. So that's also a measure of reliability. Then there is the upgrade scenario. You will end up upgrading the services in production, and they need to keep adhering to your SLOs. You put all these things together and you call that a measure of reliability. If you're satisfying all these criteria, your services are said to be reliable.
Why is this important in cloud native? Primarily, in cloud native your application is now split into multiple microservices. That means you have more services, or more applications, to manage within your larger service.
And those applications are changing very fast in your environment, primarily because of the advances in your CI/CD pipelines and in the CI/CD pipelines of the applications that you're consuming as a service. It could be some other cloud native service, for example. Those changes are coming fast into your environments. Earlier, it was very common to see big changes happen in production environments every quarter or two, but in cloud native they are happening almost every week. Because you've got so many changes, you don't want to schedule them all for a fixed date; you want to automate them so that the upgrades happen as soon as possible. And then you have a system to make sure that, irrespective of these upgrades, your systems are reliable. So that's your target, and that brings us to the topic of how to achieve that reliability in cloud native.
In summary, reliability is very important because there are many applications that you're dealing with in cloud native and the changes to them are coming very fast. How do you achieve such reliability? Or what is your strategy for achieving it in cloud native? One answer is that you implement the practice of chaos engineering from the beginning and you do it at scale. Then you have at least a good, proven way of achieving reliability.
Let's look at chaos engineering: what is it, why do it, and how do you practice it? On the what part: it's about breaking things on purpose. Or you can say, don't wait for the failure to happen; you inject it. You can also say it is the practice of resilience engineering, and it's about being better prepared for disasters. When some large failure or an outage happens, how well are you really prepared to bring your services back online? If you practice chaos engineering, you would have dealt with such an outage already, and you are now better prepared. So why chaos engineering? Because big outages are expensive.
A smaller outage can also be expensive, depending on your SLAs to your end users. And you cannot really prevent outages. No matter how well you are prepared and tested, outages will happen, so you had better be prepared for them. And there are too many unknowns and too many changes happening; we just discussed why reliability is important in cloud native. So chaos engineering is needed because you don't know everything about your knowns and unknowns. And you should do it because there are tools in place now. There is so much knowledge available in the chaos engineering space, especially in cloud native, that you can actually do it easily, avert bigger financial losses, and be in control of your reliability. That's one reason why you need to do chaos engineering.
How do you do it? Primarily, it is a culture. Many people are still grasping the need for chaos engineering, starting from developers and SREs all the way to the management who are responsible for operational reliability. So you start with advocating or learning chaos engineering. That's really how you start. Then you create a strategy, choose a platform that suits your needs, and really build a service within your environment rather than just a set of chaos experiments. You need to look at the bigger picture: chaos engineering has to have the goal of increasing reliability over a period of time. One way you can start, and keep repeating, is doing game days. These have proven very helpful for building a culture, as well as for building a practice around checking whether your chaos engineering is working well or not. And it's always difficult to go and break things in production. So you start small, you keep fixing things, and then you slowly move on to pre-production and production. I'll talk about that later in this session.
So what are the business benefits? You cannot avoid outages, so you shorten your outages.
That means you are better prepared, and you can prevent large revenue losses by finding weaknesses early in pre-production and fixing them. Your overall customer satisfaction will go up; your customers are happy, and you'll be able to retain them because you fixed problems before they actually caused losses to your customers. That is the business side. You can also move to your bigger and better new architecture faster, because now you have a way of finding out how resilient your systems are. That's definitely a good benefit. Then there is scaling your services: you implement your larger service at the optimal size and you scale up as you need to. You can test that with chaos engineering, so you know that you're going to scale well when the need arises, and you don't need to run at a scale bigger than what's needed. That's definitely a benefit. The other benefit is knowing how well your team is prepared for a given fault. You don't need to guess. You know that your team can respond well because you just experienced it by injecting a similar fault.
Now that we know the benefits, what are the business use cases where chaos engineering can be considered? As digital transformation is happening, we are all moving to microservices. So you would want to see what your reliability is today and what benchmark you need after you move to a microservices architecture. Chaos engineering can be put in place to benchmark that. And you can also accelerate, because now you have a confident way of measuring reliability. So you can accelerate the journey to containerization, and you can benchmark, measure, and scale your service. There are many sectors where chaos engineering has proven helpful, especially when the services are large scale and critical.
For example, in the banking, retail, and e-commerce sectors, these services are all already in production. They are very critical as far as the user experience is concerned, and any outage will cause big financial losses. So chaos engineering will really help in these sectors. Then there is edge computing: we are moving there very fast, and there are many such services in place. You want to automate your failure testing in edge computing, so that's another area where chaos engineering can be very helpful.
Where do you do chaos engineering? In many places. You find it in game days. You find developers using it in CI pipelines, and SREs using it as a trigger in continuous deployment, or after a continuous deployment to measure whether things are okay or not. There are various times when the failure testing in your pipelines or staging environments is not good enough, and you want to automate more corner-case failure scenarios. And there is another advanced use case, where an application has been upgraded in production and you want to trigger some failure testing against it in a random way. That's triggering chaos on an application change. So these are the various use cases in which chaos engineering can be very helpful.
Let's look at what cloud native chaos engineering is. We talked about why chaos engineering is important in cloud native. When you are doing chaos engineering in cloud native, you can consider certain principles. Cloud native is a reality right now; Kubernetes has crossed the chasm. Chaos engineering, on the other hand, is in the early days of implementation, or of being considered a must for reliability. There are a lot of options available today to do chaos engineering in cloud native at scale, and generally you can follow these principles while choosing an implementation.
It's always good to go with a technology that is open source, proven, and community collaborated. Chaos experiments that are developed through community collaboration have less chance of false positives or false negatives because they're well tested, and you are in control of exactly what fault is being injected. These chaos experiments, or chaos workflows, or chaos scenarios, whatever you call them, also go through changes and need to be maintained. So it's better to have a good API or operators to do the lifecycle management of such chaos experiments. Scaling is very important too. When you scale your services, chaos engineering has to scale as well. Think of killing containers where there are thousands of them and you want to bring half of them down, for whatever reason. Your infrastructure should scale well to induce that chaos. And observability should be open. It's very important to be able to observe exactly what is happening when chaos is introduced. You are most likely using observability platforms that are based on Prometheus, so your chaos interleaving should also be open in nature. You should have a clear idea of when chaos was injected, what that chaos was, and how it was injected. So consider all these principles while choosing your platform for chaos engineering in cloud native.
Let's talk about what this means for SREs, and for developers later on. For SREs there are many ways to start, and primarily it starts in staging. Then you move on to pre-production, and then to production. As an SRE, you have to start believing that chaos engineering is a helper tool and that it brings a lot of the business and operational benefits that we talked about earlier in this session. And you need to be able to convey these benefits to your teammates and to your management.
How you do that is by running some simple chaos experiments in staging. Try to inject failures and see whether your autoscaling works or not on Kubernetes, et cetera. You also generally do a simple game day as a way to build confidence in, and a culture around, chaos engineering. In summary, you start in staging, or as a trigger to your CD, with some simple experiments. That can go on for a quarter, and you can slowly increase the complexity of the experiments. You need to gain your own confidence as well as your team's confidence, and then slowly move on to the other areas. After a quarter, move to pre-production. Generally it takes more than a couple of quarters before you do any real failure testing in production, because you should really convince yourself and your fellow team members that your chaos engineering infrastructure is stable, that you are not producing false positives or negatives, that you have seen some good benefits from injecting faults, and that you are able to respond to small or big outages. Then you plan and move on to production. That's really about being better prepared.
Do you really need chaos engineering as a developer in a cloud native environment? We've been seeing a lot of positive response to chaos engineering from the developer community, and it is not really tied to whether your SREs are practicing chaos engineering or not. It's really an extension to your existing CI pipelines. So why do you need chaos engineering in your CI pipelines? Primarily because the changes are happening fast. You are supposed to be developing and shipping your services fast. At the same time, in your CI pipelines there are a lot of other microservices that are not developed by you. You depend on them, and there are many such microservices, which makes your pipelines more dynamic, more complex, and bigger.
You need to have a defined strategy not only to test your code, but also to test the other microservices and platform changes inside your pipelines. Typically, in your regular pipeline, you're taking care of your own code. In addition, you need to consider continuous verification of the underlying platform. It may be a good idea to run your pipelines on multiple platforms, such as different cloud platforms or on-prem. Because it is a microservice, you don't know where it is going to run. It is better to inject failures in the pipeline on such platforms and see whether your code behaves well. Similarly, other microservices can fail, and you had better test right inside the pipeline how your code responds to such a failure. This continuous verification against failures of either the platform or your dependent microservices is really what is called chaos engineering for developers, primarily in cloud native.
So with that introduction to why chaos engineering, and why chaos engineering for cloud native, let me introduce LitmusChaos, a project that we started a few years ago with these chaos engineering principles as its core goal. Litmus supports all these principles very well. The latest release of Litmus supports GitOps and open observability as well. It is a CNCF project, currently at the Sandbox stage, and we are hoping to move to incubation very soon. It has got great adoption, with more than 50,000 installations, or operators running. And we have built a good community around the usage of Litmus, primarily around Kubernetes chaos engineering. At the outset, Litmus is really a simple Helm chart, whether for one developer, one SRE, an entire team, or an enterprise. Litmus is a Kubernetes application that scales very well.
All the experiments, or chaos workflows, are published in the public ChaosHub, and you can pull them into your private environments and set them up in a completely air-gapped environment. When you install Litmus, you get a centralized chaos control plane called Litmus Portal. You can either run predefined chaos workflows or construct your own chaos workflows very seamlessly, and you target them against any Kubernetes resource. You can also target them at non-Kubernetes resources such as VMs, bare metal, and other cloud platforms. And you can do all of this with integrated GitOps, such as Argo CD or Flux CD. When integrated, chaos can be triggered as a change happens to your application.
So in a nutshell, Litmus has a chaos control plane and the actual chaos plane. You target your chaos from the centralized portal or through GitOps-controlled infrastructure. The chaos workflows can be directed towards any Kubernetes resource or any non-Kubernetes resource as well. It is highly declarative, there is a scalable API, and it is open source, so you are in control of your chaos.
There are a lot of good examples of how you can use Litmus in CI pipelines. There are known examples of GitLab, GitHub Actions, Spinnaker, and Keptn using Litmus to introduce a chaos stage. At the outset, all the chaos logic is bundled into a library, and with a simple API call to that library you can inject chaos and then get the result of that chaos experiment. And Litmus is not only for Kubernetes. It is a Kubernetes application, but it can inject failures into non-Kubernetes platforms, and it can scale very easily to a large scale as well. We have certain examples of how you can inject failures into cloud platforms such as AWS, GCP, or Azure. We also have some initial experiments showing how you can inject chaos into the VMware platform.
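Editor's sketch (not part of the talk): to make "highly declarative" concrete, the snippet below builds a ChaosEngine custom resource for the pod-delete experiment as a plain Python dictionary and prints it as JSON. The field names follow the Litmus `litmuschaos.io/v1alpha1` schema as documented around the time of the talk and should be checked against the CRD version you have installed. The helper name `pod_delete_engine`, the engine name, the target label, and the service account name are all hypothetical.

```python
import json


def pod_delete_engine(app_namespace: str, app_label: str, duration_s: int = 30) -> dict:
    """Build a ChaosEngine manifest that runs the pod-delete experiment.

    Assumes the pod-delete ChaosExperiment and a suitable service account
    are already installed in the target namespace.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "demo-pod-delete", "namespace": app_namespace},
        "spec": {
            "engineState": "active",
            # Which application the chaos is targeted at.
            "appinfo": {
                "appns": app_namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "pod-delete-sa",  # hypothetical name
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                # Total time, in seconds, to keep injecting the fault.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                # Seconds between successive pod deletions.
                                {"name": "CHAOS_INTERVAL", "value": "10"},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(pod_delete_engine("default", "app=nginx"), indent=2))
```

Because `kubectl apply -f -` accepts JSON as well as YAML, the printed manifest could be piped to a cluster, or committed to a Git repository for a GitOps tool to pick up.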
The cloud and VMware experiments are expected to grow heavily in the months and quarters to come. Litmus is well adopted and stable, and also ready for enterprise adoption. I am part of the Chaos Native team, and we provide enterprise support for enterprises that are deploying Litmus in production or non-production environments. So with that, I would like to thank you for watching this session. You can reach me on Twitter or on our channel in the Kubernetes Slack. Thank you very much.
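Editor's sketch (not part of the talk): the developer-side practice described above, failing a dependent microservice inside the pipeline and checking how your code responds, can be illustrated in miniature with an in-process fake. This does not use Litmus; in a real pipeline the fault would be injected by a chaos tool against a deployed dependency. All class and function names here are hypothetical.

```python
class DependencyDown(Exception):
    """Raised by the fake dependency while a fault is injected."""


class FlakyInventoryClient:
    """Stand-in for a dependent microservice whose failures we control."""

    def __init__(self, stock):
        self.stock = stock
        self.fail = False  # set to True to inject the fault

    def get_stock(self, item):
        if self.fail:
            raise DependencyDown("inventory service unavailable")
        return self.stock.get(item, 0)


def product_page(item, inventory):
    """Code under test: it should degrade gracefully if inventory is down."""
    try:
        in_stock = inventory.get_stock(item) > 0
        return {"item": item, "in_stock": in_stock, "degraded": False}
    except DependencyDown:
        # Steady-state hypothesis: the page still renders, without stock info.
        return {"item": item, "in_stock": None, "degraded": True}


def run_chaos_check():
    """Observe the baseline, inject the fault, then verify recovery."""
    inventory = FlakyInventoryClient({"book": 3})
    baseline = product_page("book", inventory)

    inventory.fail = True  # inject the dependency failure
    during_fault = product_page("book", inventory)

    inventory.fail = False  # remove the fault
    recovered = product_page("book", inventory)
    return baseline, during_fault, recovered


if __name__ == "__main__":
    for label, result in zip(("baseline", "during fault", "recovered"), run_chaos_check()):
        print(label, result)
```

A CI stage could fail the build when the response during the fault is not a degraded but valid page, or when the response after recovery differs from the baseline.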

Uma Mukkara

CEO @ Chaos Native

