Conf42 Site Reliability Engineering 2020 - Online

- premiere 5PM GMT

Increasing Kubernetes Resilience for an SRE


An SRE's main task is to keep operations up and running. An SRE dealing with Kubernetes faces many challenges in keeping resilience at the desired level and improving it over time. In this talk we will go through techniques to measure and improve the resilience of Kubernetes platforms in a cloud-native way.


  • Uma Mukkara, co-founder and CEO of MayaData, speaks at SRE 2020 about increasing Kubernetes resilience. What we need right now is more complete solutions around Kubernetes.
  • Resilience is a system's ability to adapt to chaos. Whenever a fault happens, the system recovers automatically without affecting any user-facing services. 90% of the resilience of your application really depends on components that you are not developing or don't own. How do you check resilience?
  • Litmus Chaos is a complete framework for finding weaknesses in Kubernetes platforms. Using it, developers and SREs should be able to automate chaos in a cloud-native way. At the moment there are more than 10,000 installations of Litmus.
  • Scaling means being able to run multiple chaos experiments in sequence, in parallel, or in a combination of both. Argo Workflows goes very well with chaos workflows. You can scale this to hundreds of experiments in a system with hundreds or even thousands of Kubernetes nodes.
  • In admin mode, Litmus can inject chaos into any application because the service account has the permissions to do that. The idea is that you'll be able to schedule the experiments and thereby achieve higher resilience. The Litmus portal, which is under development, will have more detailed views for chaos itself.
  • The Litmus community is on the litmus channel of the Kubernetes Slack. Have a great time with Litmus Chaos, and do try it out.


This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, this is Uma Mukkara, co-founder and CEO of MayaData. Today I'm here at the SRE 2020 show, organized by the Conf42 folks, to talk about increasing Kubernetes resilience. This should be a topic of interest for SREs, whether you are planning to practice chaos engineering or are already practicing it. We're going to talk about this topic especially as it applies to Kubernetes. Before we delve deep into it, let's look at what we do at MayaData. At MayaData we sponsor two open source projects, OpenEBS and Litmus: OpenEBS for cloud native data management, Litmus for cloud native chaos engineering. We also have a commercial SaaS solution for cloud native data management called Kubera. With Kubera, SREs can turn Kubernetes into a data plane and get a complete solution around cloud native data management. In this talk we are going to cover the importance of resilience, how you get resilience on Kubernetes, an introduction to Litmus Chaos, how it works, how to automate chaos using Litmus Chaos, and how you end up getting higher resilience in that process. We'll also do a brief demo of Litmus and how it works when you use these automation tools. So what is the state of Kubernetes today? Kubernetes is seeing greater and greater adoption. It is believed that in terms of adoption it has crossed the chasm, and most IT organizations are either planning to adopt it or already have. So what we need right now is more complete solutions around Kubernetes, so that the choice of adopting Kubernetes becomes a valuable one. One of the more important tools is one that helps you keep resilience high. You obviously want to keep your Kubernetes clusters and the applications on them running all the time and meet your SLAs.
So for that you need a tool, or a set of infrastructure tools, practices, or processes, that keeps this resilience higher than what your SLAs demand. So what is resilience? Resilience is a system's ability to adapt to chaos. Whenever a fault happens, the system recovers automatically without affecting any user-facing services. So what are some examples of this resilience? And when you don't have resilience, you have what we call a weakness. One example that is common in Kubernetes is pods being evicted for various reasons. If your system is resilient, if your implementation is resilient, your pods are automatically rescheduled and your services are not affected. That is a sign of resilience. On the other hand, if your services become slower than a certain threshold, or they're completely down, then that's not healthy. That means there is a weakness that you need to fix. Another example is nodes going into the NotReady state, which is a little more common on large infrastructure setups on cloud service providers. When nodes go into the NotReady state, depending on the application you're using, the blast radius can be very high, and that's not healthy. So you have to implement your services in such a way that when these nodes go down or go into the NotReady state, you are able to survive that situation. That's the resilience that you would want. Similarly, as you have hundreds or thousands of containers, it's possible that some of them do not behave exactly the way you expect. Memory leaks are also among the commonly heard examples in large scale deployments. So these are some examples. Of course, you always expect faults not to happen. But you also know that a fault can happen anytime, irrespective of how careful you are. So it is inevitable that some fault will happen at some point in time. You have to stay afloat through that. And that's resilience.
And resilience is more important now in the Kubernetes environment because there is this promise of a common API, and everyone has adopted it. If you look at the CNCF landscape, you realize how many vendors and users across the spectrum are adopting Kubernetes in some form, delivering or building solutions based on the services available. And because of the new solutions available, they themselves are building stronger CI/CD to meet this demand for agility, right? So with so much adoption, and with the stronger CI practices that these users and vendors are following, you will see that the applications on Kubernetes are changing pretty fast. It's a good sign: you don't need to wait many months for a fix to arrive. So it is good news that the system is more dynamic and changes are coming many times faster than before. For example, earlier, upgrades to databases might happen once in six months or a year, but now it may be every quarter because they have moved to a microservices model, right? And resilience also depends on various other infrastructure items or services in your Kubernetes environment. Let's look at the resilience dependency stack. At the bottom you have platform services. You can expect that some faults will occur in that area. For example, a node going into the NotReady state is a famous example. And Kubernetes services themselves can be fragile sometimes if they are not implemented properly. And the other cloud native services that surround Kubernetes, like DNS, Prometheus, Envoy, and cloud native databases, can all get into some kind of unexpected state. And on top of all these things is your app. So if you're looking at SLAs at your application level, the resilience of your application really depends on a lot of other services in your environment. So to summarize, 90% of the resilience of your application really depends on components that you are not developing or don't own.
So this is something very important: the dependency on other applications, together with the fact that they are changing much faster. Something that you really need to do to keep your resilience high is to keep checking: how is my system's resilience? You have to constantly validate that. Do I have a weakness, or is my system resilient enough? So how do you check resilience? Well, you have to practice chaos engineering. That's the topic here, right? So how do you do that? Typically, you have to know what your steady state is, for either a service or an application, and then you yourself induce a fault. Don't wait for a fault to happen. You introduce a fault and then verify whether the steady state condition is still met. We talked about some examples of resilience. The same thing applies here, right? If the system is resilient, you're good; if not, you have found a weakness. It's as simple as that. So in other words, for two reasons, the dynamism in the services and the dependency of your application on those services, you need to practice chaos engineering in order to achieve the required levels of resilience. That's good. Now let's also talk about how to do chaos engineering. Before we talk about that, it's important to know how configuration and operations are managed in the cloud native world. Today it is by using GitOps. Git's functionality is helpful in managing your configuration as well, as versioned configuration, and that concept has led to using declarative YAML for operations to automate your config changes. So this is the cloud native way of managing the configuration of your application. It is being applied across the spectrum now to manage Kubernetes services, the various applications on them, various resource management strategies, and policy management. Everything is being done through GitOps. That is the new way of doing things for DevOps. And you can also apply the same GitOps to resilience checks.
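
As a rough illustration of the loop described above (know the steady state, induce a fault yourself, then verify the steady state again), here is a minimal Python sketch. It is not Litmus code: `ToyService`, `run_experiment`, and the verdict strings are hypothetical names invented for this example, and the "cluster" is just an in-memory object.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    name: str
    steady_before: bool
    steady_after: bool

    @property
    def verdict(self) -> str:
        if not self.steady_before:
            # Never inject a fault into a system that is already unhealthy.
            return "skipped"
        return "resilient" if self.steady_after else "weakness found"


def run_experiment(
    name: str,
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
) -> ExperimentResult:
    """Check steady state, induce the fault, then check steady state again."""
    if not steady_state():
        return ExperimentResult(name, False, False)
    inject_fault()
    return ExperimentResult(name, True, steady_state())


class ToyService:
    """Hypothetical stand-in for a deployment with a desired replica count."""

    def __init__(self, replicas: int, self_healing: bool) -> None:
        self.desired = replicas
        self.running = replicas
        self.self_healing = self_healing

    def kill_one_pod(self) -> None:
        self.running -= 1
        if self.self_healing:
            # A controller notices the missing pod and reschedules it.
            self.running = self.desired

    def is_steady(self) -> bool:
        return self.running == self.desired


if __name__ == "__main__":
    for service in (ToyService(3, self_healing=True), ToyService(3, self_healing=False)):
        result = run_experiment("pod-delete", service.is_steady, service.kill_one_pod)
        print(result.name, "->", result.verdict)
```

In a real setup the steady-state check would be a probe against a service or a metric threshold, and the fault would be injected by a chaos tool rather than by a method call.
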
You don't need to start doing something new, or a new way of doing things, to bring in your resilience checks. So how would you do that? To get that resilience check into place, you bring in a chaos operator and some chaos experiments. You make a change to a chaos experiment, something picks up that change, the chaos experiments are run or the resilience checks are done, and then you can observe the results. So this is a simple way of doing chaos management, chaos execution, or fault induction in the Kubernetes style, using GitOps and using operators. So let's summarize what cloud native chaos engineering is. You want to keep your resilience high, you want to practice chaos engineering, and you want to do it the Kubernetes way, the cloud native way. So what does it mean to do it in that fashion? First, you pick an open source infrastructure component. Because it is built in open source, there's more community around it, and you can depend on the survival of that technology for a long time. That is very important. The promise of Kubernetes has really been realized because it is neutrally governed and it is open source. Large vendors have come to depend on it purely because it is open source and well governed in an open way. The next one is that you have to have chaos APIs, CRDs, and that's very important to do it the Kubernetes way. The third one is bring your own chaos model. You don't want to be tied to a particular way of doing chaos. It can keep improving: you can improve a certain type of experiment, or you may already be developing new experiments and want to use this framework for those new experiments as well. So you should be able to plug them into that framework. And finally, chaos itself has to be community oriented, primarily because you will not be able to develop all sorts of chaos experiments yourself.
You have to depend on application owners, vendors, and practitioners: whenever they learn a new way of introducing a fault, they can upstream that experiment. So that is the fourth principle. That, in general, is what cloud native chaos engineering is. With that, let me introduce Litmus Chaos. Litmus Chaos is a complete framework for finding weaknesses in a Kubernetes platform, its implementations, and the applications running on Kubernetes. Using it, developers and SREs should be able to automate chaos in a cloud native way, right? We talked about what the cloud native way is. It's all about declarative chaos and being able to automate the entire chaos using GitOps. Litmus Chaos is in the CNCF sandbox. The mission statement, as I just described, is all about helping SREs and developers bring out weaknesses and keep resilience high. MayaData is the prime sponsor of the project. But because of the vendor neutral governance and it being completely open source, a lot of other companies have joined as maintainers and contributors. Some of them include Intuit, Amazon, RingCentral, Container Solutions, et cetera. One of the biggest assets of the Litmus Chaos project is the hub itself. We expect that more and more community members will upstream their chaos experiments onto this hub, and the community is really spread out. At the moment there are more than 10,000 installations of Litmus. It is starting to get wider adoption, and Litmus Chaos, as we described it, is completely cloud native. The four principles that we described a little while ago apply to Litmus Chaos. The hub is the community way of doing chaos, and there are some good examples of people bringing their own chaos onto the Litmus infrastructure and scaling it up using this infrastructure. It is cloud native. So to summarize the features, there are the chaos API CRDs.
You can manage the entire chaos using declarative manifests, including the chaos scheduler. The next is the hub itself. We expect you will have more than 50% of your chaos needs already available on the hub. All you need to do is learn how to use those experiments, and then learn how to write the new experiments that are specific to your application. For that you have a chaos SDK available in Go, Python, and Ansible. Using this SDK you'll be able to bring up the required skeleton of a chaos experiment very easily, then put in your chaos logic, and your experiment is ready. The other feature, which is a very important one, is the chaos portal. The Litmus portal is very important because chaos engineering does not stop at the introduction of faults. It's also about monitoring and helping SREs and developers take the right actions to fix the weakness. And you have to be able to do it at scale. Right. We'll talk about chaos workflows in a bit. The hub is about getting your experiments together in one place; the chaos portal is about using those experiments, managing your chaos workflows, executing them, then monitoring them and seeing what's happening. So this portal is about managing your chaos engineering end to end. That's the chaos portal. It's under development. Early versions are there if someone wants to try them out, but it's not formally released to the community. We are hoping that will happen maybe by the end of the year. Litmus has many experiments. Right now we have 30-plus experiments, and that is expected to grow as the community grows. You see some stars here. Those mean the experiment is about inducing a fault into the infrastructure: disk, node, node CPU, node memory, all that stuff. The other ones act on Kubernetes resources themselves. As you can see, there are network duplication, network loss, and kubelet service kill. These are some of the important experiments. We have heard many stories where everything is fine.
Then Kubernetes itself goes down, and the blast radius is very, very high. So don't wait for that to happen. Use the kubelet service kill experiment and see what your resilience is. That's a good one to have. In fact, this one really came from the community using Litmus, and the team was able to add this experiment. Those are what we call generic experiments. They're all grouped under Kubernetes. There are also application specific experiments, which are about inducing a fault at the application level. It could be about causing database unavailability or about bringing down a Kafka broker. The chaos logic speaks the language of the application rather than of Kubernetes. We believe that this is very important as you scale up to the deeper faults that you want to induce. These application specific experiments will help. So where can you use Litmus in DevOps? Of course, in CI pipelines, where starting with very small chaos experiments is pretty easy. Chaos cannot be executed to its full extent on pipelines because they are short lived. So deeper chaos can be executed on long running test beds, which is where the code typically moves after the pipeline. Then there is staging, which is closer to production. There are lots of people churning out code there, and you can increase the number of tests there. Then, just before production, you can increase the number of tests further. But production is where you want to start small and try to cover more scenarios. They can be spread out over time, but you may want to cover all the scenarios that are possible in terms of failure injection so that you stay resilient to those faults. It all starts small. You need buy-in from management for chaos engineering. Many people are scared, to be precise. SREs want to do it; developers do not want you to do it. It takes time in any organization to roll out chaos engineering at a large scale. So you would want the developers themselves to see it.
Tell them: hey, this is how you can test it yourselves. Get the automation and integration teams to use chaos, and then SREs will be able to convince them easily. As time goes on, people will find it acceptable to run chaos in production as well, on the entire system. As you increase the number of chaos tests that you can run in production, the overall resilience increases, but it typically takes a step by step approach. The good news is that it is cloud native. That means you can start automating chaos right in the development lifecycle. So how do you automate it? It's a very important thing, and it's kind of one of the promises of Kubernetes. You would want to generate not scripts but YAML manifests. That's the fundamental building block for automation. You take your chaos experiment, which is itself a custom resource YAML spec, and you attach it to the application where the chaos is going to run. The management of chaos on an application is also another CR, another YAML spec, called the ChaosEngine YAML. Then you attach a schedule for how often you want to run this. That's also a YAML spec. All three of these, the experiment, the application that is taking this chaos experiment, and how often you want to run it, are put into a manifest, and we call that the litmus YAML, for example. And you want to automate this, so you put this experiment in Git and use auto deployment tools like Flux or Argo CD. Whenever a change to it happens, when a PR with the change is merged, your chaos starts running automatically. Litmus gives many outputs, not just the chaos result, but also chaos metrics. You can upload them to your Prometheus and then automate some of the alerts and the corresponding actions and notifications. Once you get the notification, you want to go and see what exactly happened. You want to debug, you want to fix it.
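
To make the "manifest, not script" idea concrete, here is a minimal Python sketch that builds a ChaosEngine-style manifest as plain data and prints it as JSON (which is also valid YAML), so it could be committed to Git. The field names (`appinfo`, `chaosServiceAccount`, `experiments`, the `TOTAL_CHAOS_DURATION` and `CPU_CORES` variables) follow the Litmus `v1alpha1` ChaosEngine schema as I recall it and should be checked against the Litmus version you run. The `chaos_engine` helper, the `name=catalogue` label, and the `litmus-admin` service account name are assumptions made for this example.

```python
import json


def chaos_engine(name, engine_ns, app_ns, app_label, experiment, env,
                 service_account="litmus-admin"):
    """Build a ChaosEngine-style manifest as a plain dict.

    The helper and its defaults are illustrative; verify field names
    against the Litmus CRDs installed in your cluster.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": engine_ns},
        "spec": {
            # Which application the chaos is attached to.
            "appinfo": {
                "appns": app_ns,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "engineState": "active",
            # Which experiment to run, with its tunables as env vars.
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            "env": [
                                {"name": key, "value": str(value)}
                                for key, value in env.items()
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    engine = chaos_engine(
        name="catalogue-cpu-hog",
        engine_ns="litmus",          # admin mode: chaos resources live in the litmus namespace
        app_ns="sock-shop",
        app_label="name=catalogue",  # assumed label of the target deployment
        experiment="pod-cpu-hog",
        env={"TOTAL_CHAOS_DURATION": 60, "CPU_CORES": 1},
    )
    print(json.dumps(engine, indent=2))
```

Tuning the experiment then becomes a one-line change to the `env` values in a pull request, which a GitOps tool can pick up after the merge.
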
So you get chaos events for correlation and for taking the right action to fix the weakness. So how do you scale this up? You are able to automate chaos with this fundamental concept, but how can you scale it? Scaling is being able to run multiple chaos experiments in sequence, in parallel, or in a combination of both. A typical example is that there are multiple namespaces, applications are spread out across these namespaces, and you are managing all of them. End users are really being served by services that are spread across these namespaces, right? So faults can happen anywhere. So you want to simulate a chaos workflow where you introduce two faults into two different namespaces in parallel, then wait for them, then do two more faults in parallel, and then drain a node, or multiple combinations of that, right? So this is one simple chaos workflow. And how do you do that? You develop chaos experiments as different YAML manifests and keep them ready. Then you apply a workflow using tools like Argo. The Argo workflow that we've been using goes very well with chaos workflows. So using an Argo workflow, you have your experiments ready, you embed them into a Workflow CR, and Argo also has a scheduler, so you can attach a schedule to it, and you develop a bigger YAML manifest that manages these Litmus experiments. Then the same thing happens: you put them into Git and use Argo CD or Flux to manage the auto deployment and change management. Then you add more of them. You can scale this to hundreds of experiments in a system where there are hundreds or even thousands of Kubernetes nodes. So you get an infrastructure to automate your chaos in a very natural way, you can scale it, and you can put it into your existing DevOps system, which is GitOps. So with that, let's look at a very short demo of Litmus.
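
The workflow shape described above (two faults in parallel, then two more in parallel, then a node drain, all on a schedule) can be sketched as data too. The Python below builds an Argo `CronWorkflow`-style manifest in which the outer list of `steps` runs in sequence and each inner list runs in parallel, which is how Argo's `steps` templates are structured. The `chaos_workflow` helper, the placeholder container image, and the `/manifests/...` paths are hypothetical; the demo's actual workflow may be laid out differently.

```python
import json


def chaos_workflow(name, schedule, stages):
    """Build an Argo CronWorkflow-style manifest as a plain dict.

    stages is a list of lists of experiment names: stages run one after
    another, and the experiments inside one stage run in parallel.
    """
    steps = [
        [
            # Step names must be unique, so the stage index is appended.
            {"name": f"{experiment}-{index}", "template": experiment}
            for experiment in stage
        ]
        for index, stage in enumerate(stages)
    ]
    templates = [{"name": "chaos-flow", "steps": steps}]
    for experiment in sorted({e for stage in stages for e in stage}):
        templates.append({
            "name": experiment,
            "container": {
                # Placeholder image and path: in practice this step would
                # apply the experiment's ChaosEngine manifest.
                "image": "example.invalid/kubectl:latest",
                "command": ["kubectl", "apply", "-f", f"/manifests/{experiment}.yaml"],
            },
        })
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "CronWorkflow",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "workflowSpec": {"entrypoint": "chaos-flow", "templates": templates},
        },
    }


if __name__ == "__main__":
    workflow = chaos_workflow(
        name="sock-shop-chaos",
        schedule="*/5 * * * *",  # every fifth minute, as in the demo
        stages=[
            ["pod-cpu-hog", "pod-memory-hog"],
            ["pod-network-loss", "pod-delete"],
            ["node-drain"],
        ],
    )
    print(json.dumps(workflow, indent=2))
```

Keeping the stage layout as a small list of lists makes it easy to review in a pull request how many faults run at once, which matters when the blast radius needs to stay small.
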
So I have the following setup: a two-node Amazon EKS cluster on which I have deployed the microservices demo application, Sock Shop, which you may all be aware of, and installed Litmus. A couple of experiments are being run as a workflow on it, and Litmus is set up in admin mode. Admin mode is where Litmus can inject chaos into any application because the service account has the permissions to do that. I have also set up a monitoring infrastructure to receive the chaos metrics, and we can see through Grafana what's happening. So with that, let's actually see how it works. I have two nodes. There's only one application running on this cluster, and that is the Sock Shop. The pods have been running for quite some time, about four or five days. You might see that some of them went down recently. That's because we've been continuously introducing chaos. That's a sign that things are being meddled with; somebody is watching its resilience continuously. So let me show you. We put Litmus in admin mode, which means everything runs within the litmus namespace. The other mode is where you create service accounts in such a way that your chaos actually runs within the namespace of your application. So here you have a chaos operator. You also have the Litmus portal running, the early version of it. You also have a chaos monitor, which is exporting the metrics to a Prometheus server. Then there are chaos events being exported through an event router. And as you can see, there are some tests being run, and they are being run in the litmus namespace. There are CPU and memory hog experiments. Let me also show the monitoring: we have Prometheus and Grafana running. So let's look at the workflows that are running against your application. We have two chaos workflows configured within Argo workflow that are running every fifth minute and every tenth minute. So every five minutes you have pod CPU chaos.
Pod CPU chaos and pod memory chaos are running on two different applications, or containers. So let's look at one. Here you have the Argo workflow of chaos, as we named it, and it embeds the chaos experiment. This is what we talked about. You have your entire Litmus chaos experiment, and you embed that within a workflow. The ChaosEngine calls the chaos experiments. For example, this is a ChaosExperiment CR on the system, and you can tune the behavior of these chaos experiments through GitOps. For example, you may want to run the chaos for a longer or shorter duration, or increase the number of cores. Everything is possible through GitOps. Let's look at the other workflow. Similarly, this one is memory, and it's being run on a different pod, the orders pod; the previous one was on the catalog pod. You're running in the same namespace: the application is Sock Shop, and you're running memory chaos on the orders application. And this is what it looks like. You can specify it declaratively like that, and it is inside the ChaosEngine CR. The experiment is memory hog. You can tune the behavior of your chaos, and it's scheduled every fifth minute. As you can see, the workflow itself was executed within 5 seconds, and you can go and see the results of that. This is the Argo way of seeing what is going to happen. The Litmus portal that is under development will have more detailed views for chaos itself. It uses Argo workflow concepts. Now let's look at what's happening. This is a Grafana dashboard that we put together for Sock Shop, where these red lines indicate the chaos induction. As you can see, every fifth minute you have the green and yellow lines: green is the catalog CPU hog, and this one is the memory hog, and they are being induced into different containers. Whenever a CPU hog is induced into catalog, the performance goes down. You can see that orders are also going down.
Similarly, whenever a memory hog is introduced into this one, the performance of the catalog pod also goes down. Of course, in reality you're not going to do this again and again every five minutes, but let's say some fault happens every day, and a larger fault can happen randomly in a week. All those possible combinations can be implemented in an automated way. So that's how Litmus Chaos can help in automating chaos engineering, thereby helping you achieve higher resilience, right? Now a small introduction to the Litmus portal. It's very much under development. The idea here is that you'll be able to schedule the workflows themselves, and you get your own hub. So it's all about bringing in experiments from your hub and adding more experiments, right? You have the public chaos hub, but you want to share it, along with new experiments, with your team. The Litmus portal will help you bring more team members together, develop new experiments, create more chaos workflows, and monitor them. So it provides an end to end infrastructure, or a tool set, to practice chaos engineering and resilience engineering. That's it for the demo. In summary, do practice chaos engineering, and do it in a cloud native way. Litmus Chaos can help you do that. And overall, as time goes by and you add more and more chaos tests in your production, you will end up having higher and higher resilience. This is definitely a preferred way to increase your resilience on Kubernetes. So with that, thank you very much. Do try out Litmus. There are very easy-to-use getting started guides available. There is a Litmus demo application workshop, and with it you can set up the demo that I showed within a few minutes. Thank you very much, folks. Have a great time with Litmus Chaos, and do try it out. Join the Litmus Chaos community on Slack: it's the litmus channel on the Kubernetes Slack. So with that, thank you very much. Thanks for your attention.

Uma Mukkara

CEO @ Chaos Native

