Conf42 Chaos Engineering 2020 - Online

- premiere 5PM GMT

Cloud Native Chaos Engineering

The cloud native approach has taken the DevOps world by pleasant surprise, with the welcome adoption of Kubernetes across all categories - from developers to SREs to VPs of digital transformation. As the huge mass of legacy applications moves to cloud-native platforms, an important problem arises: how do SREs make sure the systems do not have weaknesses and have the required level of resilience? A well thought out chaos engineering methodology is the right answer. And for a large number of fast-changing applications and infrastructure, finding the right set of chaos experiments and identifying whether the impact of chaos has exposed a weakness in the system is an almost impossible task.

In cloud native chaos engineering, developers write chaos tests as an extension of the development process. These tests are built using standard Kubernetes Custom Resources (CRs), so that they are easy to adapt to the environment. These chaos experiments are groomed in CI pipelines and finally published to the Chaos Hub, so that they are available to the SREs running the cloud-native applications in production. SREs use such chaos experiments against various microservices, scheduling chaos in a random fashion to find weaknesses in their deployments, which leads to increased reliability.


  • Uma Mukkara: Most of this day, we have been talking about chaos engineering. This talk is a little bit more about how to practice it, rather than preaching why it is needed and what it is.
  • Cloud native chaos engineering is about practicing chaos engineering for specific environments, which are called cloud native environments. It's all about how Kubernetes is changing our lives as developers and SREs. Such environments require continuous verification.
  • The principles of cloud native chaos engineering are the following. Whatever you try to do, it has to be open source. It should be usable, and it should be community driven. The Litmus project is a manifestation of our effort to practice chaos engineering with Kubernetes.
  • You can orchestrate all your experiments using Litmus; the way they are written, you just need to convert them into a Docker image. And the community is developing more tools to observe chaos. You will slowly see that the entire practice of Kubernetes chaos engineering can be done just by installing Litmus.
  • So that's a summary of how a cloud native chaos engineering framework can work. It's on the Kubernetes Slack itself. And hopefully more contributions will come from the community.


We have a really interesting topic here. Most of this day, we have been talking about chaos engineering in general. This talk is a little bit more about how to practice it, rather than preaching why it is needed and what it is. Before we start, about me: I go by Uma Mukkara. You can call me Uma. I'm co-founder and CEO of a company called MayaData. We do cloud native data management, primarily for Kubernetes, so you will see me talking a lot about Kubernetes today. And I'm a co-creator of the following open source projects. The genesis of this entire cloud native chaos engineering is really about trying to do chaos engineering for an open source project called OpenEBS. We ended up creating a cloud native chaos engineering infrastructure called Litmus while trying to do chaos engineering as a practice for OpenEBS. I'll talk a little bit more about that. So we've been talking chaos engineering. It's a need, everybody knows, right? So Russ Miles started with why chaos engineering is a need. And we've also been hearing it's a culture you need to practice; it's not a thing. Right, but I got all that. How do I actually get started? Right? This is the question that we got into. I'm in this Kubernetes world, and I'm developing an application, or I'm operating a big infrastructure or deployment. I need chaos engineering. How do I do it? Right. As part of the operations work I do, I got into a situation where we needed to practice chaos engineering for a very different reason. We were building a data management project called OpenEBS, where the community has to rely on this software for their data. Right? Storage is a very difficult subject. We all know that. So you don't want to lose your data at any point of time. So how do you convince your users that you have done enough testing, or that it has enough resilience? The best way is for them to find out: okay, I can actually break something and still see that it is good. Right?
So we started working on that, and then realized: okay, Kubernetes itself is new, awareness of chaos engineering is growing, but we need to start creating some infrastructure for our own use. And then slowly we realized this can be useful for the community as a whole, and then we created Litmus. Right? So today I just want to touch upon the following topics. We all know what chaos engineering is, but what is cloud native chaos engineering? And then, if we understand why cloud native chaos engineering is a different subject, or a little bit of a deviation from what we've been talking about, what are those principles? We commonly observed some principles that we can say are the cloud native chaos engineering principles. And then we created a project called Litmus, and we can talk about whether it really follows all these principles. And then, how many of you know OperatorHub in Kubernetes? This is very similar. Or Helm charts, right? Helm charts is a popular concept because you pull up a chart and bring an application to life. The same concept applies to a chaos experiment as well. Right. Then we'll go to some examples, and then we can talk about what the next steps are, and so forth. Right? So what is cloud native chaos engineering? Are there any cloud native developers here? Yeah, what is a cloud native developer? If you can docker pull a developer, then you're a cloud native developer, right? So it's really about practicing chaos engineering for specific environments, which are called cloud native environments. We'll just see what these environments are. And you can also say that chaos engineering done in a cloud native way is called cloud native chaos engineering. To simplify even further: chaos engineering done in a Kubernetes native way is called cloud native chaos engineering.
So it's all about this general concept of how Kubernetes is changing our lives as developers and SREs, and you might have observed this many, many times: the way we practice DevOps has changed. Adrian talked about infrastructure as code, which has become a primary component of Kubernetes deployments. Right. So I'll just touch upon what a cloud native environment looks like. I've taken this concept, or picture, from one of the GitLab Commit conferences where Dan Kohn from CNCF was presenting about why a CI pipeline is useful in a cloud native environment. I'll just repurpose that for a different need here. Right, so you are writing a lot of code if you're a cloud native developer, and then after you develop, you run it through CI pipelines and put it into production, or you release it and somebody's going to pull that and use it, right? So where are they going to deploy it? Let's assume it's a Linux environment; that's a lot of code. And you will have Kubernetes running on it, right? That's much more code than Linux itself: 35 million source LOCs, compared to Linux. And then you will have a lot of microservices infrastructure applications that help you run your code, and then there are a lot of small libraries or databases that run on top. And finally your code, right? So that's how small your code is; it's about 1%. Right? And the common factor is, think how often your Linux upgrades happen, right? Maybe eighteen months, twelve months. And can you guess how often Kubernetes upgrades happen? Before you've understood one version of Kubernetes, another version comes in, right? So it's very dynamic, and that's the purpose of actually adopting Kubernetes, because it runs anywhere, and everybody runs everything inside containers. So Kubernetes upgrades happen very frequently.
And your apps (not your apps, but the applications and microservices surrounding your application) also need to be updated very frequently, and you don't control all of it, right? And you are always thinking about your own code: how do I make sure the code that I bring into deployments is working fine, right? So the summary is that your environment, the cloud native environment, is very dynamic, and it requires continuous verification. I mean, it's the same concept: don't assume, but verify. Right? But what do you verify? Do you verify your 50,000 lines of code, or the remaining 99%? How do you verify? You cannot go and say that everything was working fine, I verified everything, if I'm an SRE, when it's a bug in the Kubernetes service that is provided by the cloud provider, right? You cannot assume that the cloud provider has really hardened the Kubernetes stack or any other microservices. So it's really up to the SRE, or whoever is running the ops, to verify that things are going to be fine. Right? So that's one thing. So how do you do this? Obviously it's chaos engineering: chaos engineering not just for your code, but for the entire stack, right? Because your application depends on the entire pyramid. So that's one concept. The other big difference is: okay, I need to do chaos engineering, but how? What's this cloud native environment? The big difference is everything is YAML. You need to practice this YAML engineering, right? I think Andrew demonstrated very well today how to do chaos engineering for a MySQL server on Kubernetes, where we really killed some pods, or KubeInvaders. That's good for demos. But how do you really practice, if you are an SRE in production, right? You have certain principles. If I am putting something into an infrastructure, it should be in YAML, and I need to put it into Git.
And even if I'm doing some chaos experiment, it should be in Git, because I am making an infrastructure change. I'm killing a pod: who gave you permission? Right? So it has to be recorded, and that's GitOps. Right? So that's what the cloud native environment dictates. If you are a cloud native engineer or developer, you have to follow these infrastructure principles. So what you need is not just chaos engineering, but cloud native chaos engineering. That's the concept I want to drive here. In a simple way, this is my definition: if you can put chaos engineering as an intent into a YAML manifest, then you can say that you are practicing cloud native chaos engineering. So that's how we started, right? We are going to give OpenEBS as an application to our users, and they're going to take it and run it. How do they verify? The way they verify an application should be the same as the way they deploy an application. How do you deploy an application? You pull some resources into a YAML file, you do a kubectl apply, and your application comes up. How do you kill it? It has to be the same way, right? So that's the primary difference that we found, and then we started building up these infrastructure pieces to do the same thing, right? So let's look at the Kubernetes resources that we have for developers. We have a lot of resources provided by Kubernetes as infrastructure components, right? Pods, Deployments, PVCs, Services, StatefulSets. And you can write a lot of CRDs, right? So you use all those things to define an application and manage an application, and this concept of CRDs has brought in operators. You can write your own operators to manage the lifecycle of the CRs themselves, right? That's good for development. If I'm practicing chaos engineering, or chaos testing, for Kubernetes, I need similar resources. I need to be able to have some chaos resources. So what are those chaos resources that I would need?
So you would need some chaos CRs in general, right? Just like an application is defined by a Pod, a Service, and some other things, how do I define chaos engineering? You need to be able to express your entire practice through some CRs. And obviously you need a chaos operator to manage the CRs, to put things into perspective, and to upgrade each test, right? And it's all about observability: if you don't have observability, you can introduce chaos and then not be able to interpret what's going on. And it's all about history as well, right? So metrics are very key in chaos engineering. So these are the chaos resources that we figured out in general; these three resources are needed. Right? So then, looking at all those things, we can summarize. The principles of cloud native chaos engineering are the following, right? First of all, whatever you try to do, it has to be open source, because now we are trying to generalize. Right? I can't just write some closed source code and say this is it; you're not going to accept it. Right? Kubernetes was adopted, and the entire new stack is becoming adoptable so fast, because it's open source. So that's just a summary of it. And then you need to have acceptable APIs or CRs, just like Pod and Service, for chaos engineering. And then, what's the actual logic? How do you kill something? That, again, should not be predetermined. It should be pluggable. Right? Maybe my project does it in a particular way, but somebody else also has a particular way of killing. For example, today we talked about Pumba, right? How do you introduce network delays? It does it in a certain way. You should be able to take Pumba and plug it into this cloud native chaos engineering. And it should be usable, and it should be community driven. Right? So when will you do this chaos engineering as a practice? It's not just preaching about chaos engineering, that it is a culture, it's a good thing to do.
What are the best practices? We all need to be able to build these chaos experiments together, right? And that's when you can call them principles. Right? So I've written a blog about whatever I just said, with the same concepts, and published it on the CNCF site. Now, with these principles, this is how we actually started practicing chaos engineering, and we named it Litmus. And the Litmus project is exactly a manifestation of our effort to practice chaos engineering with Kubernetes. We turned it into a project so that it's not useful only for our own project; anybody who is developing on Kubernetes and wants to practice chaos engineering can use Litmus. Right? So this is just a brief introduction. It's totally Apache 2.0 licensed. There are some good contributions coming in. It recently went GA (1.0) a couple of weeks ago, which really means it's got all the tools and infrastructure pieces that are needed for you to start taking a look at it. And it's open source, obviously. It's got some APIs, or CRs; I will explain that in a bit. And whatever chaos logic you're using, you can just wrap it up in a Docker container and put it into Litmus, right? You don't need to change anything. And it is, obviously, community driven. I'll explain that in a bit. So let's see, what are the CRs or CRDs that Litmus has? Right? The first one is the ChaosExperiment. You want to do something: take the minimal thing that you want to introduce as chaos, and define that as a ChaosExperiment, right? Killing a particular application might include three or four different chaos experiments, but a chaos experiment is something minimalistic, the killing of one thing, right? And then you need to be able to drive this chaos experiment for an application; we call that the ChaosEngine. This is where you tell: here are the chaos experiments, this is the application they belong to, here is your service account, who can kill, et cetera. So you define that. And after you do, you need to be able to put your results into another CR called ChaosResult, so that Prometheus or some other metrics system can come and pull them up, and then somebody, or some tool, can make sense of what exactly has happened, right? And there can be multiple chaos experiments that you keep adding into a ChaosEngine, right? So those are the CRs that you have. And then there is pluggable chaos: for example, Litmus already has Pumba and PowerfulSeal as built-in libraries, and you can pull in your own library. So how do you pull your library into this infrastructure? For example, here I'm explaining PowerfulSeal and how the community did it. All you need to do is put whatever killing you do into a Docker image, right? If you just do a docker run of that image, and it goes and kills something or introduces chaos, then it can be used with this infrastructure. Once you have the Docker image, you just create a new experiment CR; a new experiment really points to a new custom resource. And inside that custom resource definition, you just say: here is your chaos library. Right? It's as simple as that. The reason I'm trying to emphasize this is: think of Litmus as a way to use your chaos experiments in a more acceptable, GitOps way. You can orchestrate all your experiments using Litmus; the way they are written, you just need to convert them into a Docker image. The Litmus chaos runner automatically picks up this Docker image, runs it, puts the outcome into a ChaosResult CR, and then you're good to go. And the community is developing more tools to observe chaos; all that will work very naturally. Right? So it's community driven.
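To make the CRs described above concrete, here is a hedged sketch of what a ChaosEngine manifest can look like. The field names follow the Litmus 1.x `litmuschaos.io/v1alpha1` API as I understand it, and the target application (`nginx`) and service account are hypothetical; treat the exact schema as an assumption and check the Litmus documentation for your version.

```yaml
# Illustrative ChaosEngine (Litmus 1.x-style schema; verify field names
# against the Litmus version you install).
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos          # hypothetical engine name
  namespace: default
spec:
  # The application under test, identified by namespace, label, and kind.
  appinfo:
    appns: default
    applabel: "app=nginx"    # hypothetical app label
    appkind: deployment
  # Service account with permission to run the experiment.
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete       # experiment pulled from the Chaos Hub
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"    # seconds of chaos
```

Applying a manifest like this with `kubectl apply -f` is what hands the experiment to the chaos operator, which then records the outcome in a ChaosResult CR.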
What it really means is that we have something like OperatorHub: a Chaos Hub. It's got a lot of experiments already, and when you as a developer create a chaos experiment, if you want this chaos experiment to be used in production or pre-production by your users, you push it to the Chaos Hub. And your SREs, whoever is practicing your chaos engineering, are going to pull this experiment and use it, right? So imagine if Andrew had published that experiment: you would just need to create a YAML file, put in some key-value pairs, and it runs. So this is how the cloud native chaos orchestrator architecture looks. You will have a lot of CRs defined for your chaos users, and you have a lot of experiments that are used by these CRs, but you can develop more, right? As more and more people develop chaos experiments for various applications, you will slowly see that the entire practice of Kubernetes chaos engineering can be done just by installing Litmus, right? So how do you get started? Just imagine that you have on the hub a lot of chaos experiments for various different applications, and you are running your app in a container. All you need to do is pull the Litmus Helm chart, or just install it; it runs in a container. The moment you install it, you get the chaos libraries and the operator on your Kubernetes cluster. And then you need to pull: the assumption here is that there are so many charts and you may not need all of them, so you pull whatever you need onto your Kubernetes cluster. And then you just inject chaos by creating a CR, the ChaosEngine, which points to various different experiments; the operator goes and runs chaos against a given application and creates a ChaosResult, right? It's as simple as that. So let's see an example of how this changes a cloud native developer's everyday life, right? How does a developer create a resource?
You create a Pod, and then you create more resources for an application, for example a PV or a Service, et cetera. And usually that's where it ends. You want to test something, right? So how do you do it? You inject chaos by simply creating one more CR, just like you've been using Kubernetes for everything else, right? So you can actually create a ChaosEngine and say what needs to be killed and where, and then it's all done; you get your results. So it's extending the concept of your experience with Kubernetes to do chaos engineering also. That's cloud native chaos engineering, just to summarize. Right, so on the Chaos Hub we generally have two types of experiments. One is generic: generic chaos experiments for Kubernetes resources in general. And then there are application-specific ones. This is where it gets interesting. Sorry to take the same example again and again: Andrew showed a container kill of a MySQL server pod. Right. So then you have to go and verify what exactly happened, right? We verified that the pods came back; that's all we saw. In KubeInvaders: hey, more pods are coming up. But how do we verify, automatically, whether the MySQL server application is really working well or not? That's where you can write more logic into your application-specific chaos experiment, and then use it in production. Some of the experiments are already available. You can do a pod delete, or a container kill: a pod can have multiple containers, and you can kill just one of them. And then you can do a CPU hog inside a pod; network latency, network loss, network packet corruption, where you introduce some corruption into the packets going into your pod and see what happens. Right? So that could be the granularity level that you want. Then there is some infrastructure chaos. Of course, these are specific to the cloud providers.
For example, how you take down a node on AWS is different from how you take down a node on Google. Right? So there are disk losses, which is a very common thing, right? I suddenly lose a disk: what happens? And then disk fill is one more common thing. So these are the initial set of things that are already there; you can start practicing. And for the application-specific things that are available here: an application really constitutes a pod or multiple pods, and it's got some service, and some data. Right? So what do you mean by attacking an application? You need to define the logic of what it is that you're going to kill. Am I going to kill the MySQL server, or am I going to kill a part of it, et cetera, et cetera. That's the definition. And then you verify, before killing, whether everything is good or not, right? So that's the hypothesis. And then you use the generic experiments to actually do the chaos, and then verify your post-checks, right? All this can be put into an experiment, and you don't need to write it again and again every time. Just put that into a YAML file and your application chaos happens, right? And the result CR gets created. So, for example, I'll quickly take the example of OpenEBS and how it's done. OpenEBS has multiple components. Now I'm verifying that OpenEBS as an application works well when I kill something, right? I cannot just say, okay, OpenEBS is a cloud native app; that means it's microservices and it's got multiple pods, and I can kill a container that belongs to OpenEBS and see what happens. That's one way of saying it. The other way is: I can kill a controller target of OpenEBS and see what happens, right? So you will end up having multiple different chaos experiments that are specific to that application, and then you can go and start using them. For example, I can kill an iSCSI target of OpenEBS and see what happens. You don't really need to know what should happen; it's all defined by the OpenEBS developer.
As an OpenEBS user, you will be able to say: okay, my OpenEBS is functioning properly, because I just killed a target and it is behaving as expected. Or you can kill a replica and see. So you don't need to learn the nitty-gritty, the complexities of how the application should behave when something happens in production. All that is coded up by your developers and pushed onto the Chaos Hub, and then you can just use it. Right? So that's a summary of how a cloud native chaos engineering framework can work, and Litmus is just one such framework. You can contribute; it's on the Kubernetes Slack itself. And if you find some issues, file them. But primarily, if you are practicing chaos engineering, take a look at the Chaos Hub. There are some things already there, and it's just the beginning; we opened it up, and hopefully more contributions will come from the community, and from the CNCF itself.
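As a companion to the flow described in the talk, this is a hedged sketch of the ChaosResult CR that the operator populates after a run. The exact field names are an assumption based on the Litmus 1.x API, but the idea is that the verdict lands in a CR that SREs, or a Prometheus exporter, can read.

```yaml
# Illustrative ChaosResult as the operator might populate it after a
# pod-delete run (field names are an assumption; check your Litmus version).
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: nginx-chaos-pod-delete   # typically <engine-name>-<experiment-name>
  namespace: default
status:
  experimentstatus:
    phase: Completed   # the experiment finished running
    verdict: Pass      # post-chaos checks passed: no weakness found
```

A `Fail` verdict here is the signal that the chaos has exposed a weakness worth investigating.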

Uma Mukkara

CEO @ Chaos Native
