Conf42 Chaos Engineering 2021 - Online

Increasing Kubernetes Resilience for an SRE


Abstract

An SRE's main task is to keep operations up and running. An SRE dealing with Kubernetes faces many challenges in keeping resilience at the desired level and improving it over time. In this talk we will go through techniques to measure and improve the resilience of Kubernetes platforms in a cloud-native way.

The number of microservices in a Kubernetes environment can easily grow into the hundreds. Continuous upgrades of these microservices, and of the Kubernetes platform itself, call for a system to measure the resilience of the deployment. SREs need to practice chaos engineering in a cloud-native way that is easily manageable and reuses as many chaos experiments and workflows as possible. This talk is intended for SREs who would like to practice, or are already practising, chaos engineering in their environments. We will introduce the chaos hub for SREs and discuss how to construct complex chaos workflows using the Litmus and Argo projects. A live demo will take the audience through the construction of an end-to-end chaos workflow involving a Kubernetes node failure, a CPU hog, and network slowness in an e-commerce application, and show how resilience is measured and monitored during this process.

Summary

  • Uma Mukkara is CEO at ChaosNative and a maintainer of the LitmusChaos CNCF sandbox project. He has been working as an entrepreneur for the last decade. ChaosNative recently spun off from MayaData to focus on LitmusChaos and cloud-native chaos engineering.
  • Reliability is about achieving resilience in operations using chaos engineering. In effect, cloud native is nothing but microservices being delivered faster. Cloud-native chaos engineering asks: why is it different, what is it, and how do you do it?
  • Traditional chaos engineering is all about avoiding expensive downtime. It has largely been limited to experts, enthusiasts, and teams operating large deployments. Until around 2019 or 2020 there were no well-defined, standard ways of doing chaos engineering.
  • There are two reasons to look at chaos engineering differently in cloud native: greater dynamism, and the way DevOps has changed with respect to infrastructure provisioning. So how do you do chaos engineering differently in the cloud-native ecosystem?
  • Litmus is a complete tool set for cloud-native chaos engineering. It is installed with a simple Helm chart and lets you build and run chaos workflows. Litmus 2.0 is releasing very soon with GitOps and open observability features.
  • A Litmus chaos workflow consists of multiple chaos experiments. You can arrange them in sequence, in parallel, or in a combination of the two. Chaos workflows can be run against multiple targets, making Litmus a multi-cloud chaos engineering ecosystem.
  • Litmus provides a chaos CI library for pipeline integrations and can execute experiments against non-Kubernetes targets such as various clouds. Experiment management and monitoring remain on Kubernetes. Many more examples of chaos for non-Kubernetes resources are coming later this year.
  • Litmus integrates with other CNCF projects: it has recently been integrated with Argo workflows and certified with Argo CD and Keptn, among others. The talk closes with a quick summary of the Litmus roadmap.
  • The idea behind spinning off from MayaData is to provide more resources for the success of the Litmus project. Accelerating enterprise adoption of Litmus comes from building a stronger community around it. Go ahead and give Litmus a try.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi there. In this session we're going to talk about cloud native chaos engineering and how to do it at scale. I am Uma Mukkara, CEO at ChaosNative, and I'm also a maintainer of the LitmusChaos CNCF sandbox project. I live in Bangalore with my two boys and my wife. I've been doing a lot of entrepreneurial work in the last decade, starting with CloudByte, a storage startup for service providers. Later on I started OpenEBS and MayaData, where I also started a project called Litmus for cloud native chaos engineering, which is now a CNCF project. Recently we spun off from MayaData to focus completely on LitmusChaos and cloud native chaos engineering. I'm happy to be here talking about cloud native chaos engineering at this Conf42 chaos engineering conference.
Let's talk a little bit about reliability. What is reliability? It's about achieving resilience in operations using chaos engineering; that's the usual definition we've been hearing. And what about cloud native? Cloud native is usually associated with containerization, where you rearchitect your applications into a microservices architecture. You also apply the advancements of CI/CD, so your code gets tested pretty well, and you make use of the declarative nature of these services and applications and apply GitOps to manage them at scale. So in effect, cloud native is nothing but microservices being delivered faster. Your applications, which are now multiple microservices, are being delivered faster; that's the net effect of the cloud native ecosystem. What about reliability in cloud native? We've covered what cloud native is and what reliability is, so how do you apply them together? It's applying chaos engineering to achieve resilience in operations in the cloud native ecosystem, while still getting faster application delivery. In essence, it's about applying chaos engineering for the resilience of cloud native applications, and that's what we're going to talk about in this session: why it is different, what it is, and how you do it.
Before we jump in, let's talk a little bit about traditional chaos engineering. Traditional chaos engineering is all about avoiding expensive downtime. We all know that downtime is not easy to deal with and usually results in expensive losses. As part of practising chaos engineering, you don't wait for a failure to occur. You keep testing in production, you keep introducing faults in production and see if your services can hold up. If not, you tune the system, and then you learn. That's the feedback loop we keep talking about in chaos engineering. The general state of chaos engineering until recently, I would say until 2019 or 2020, is that we all understood what chaos engineering is. It's supposed to be a standard practice, but it has been limited to experts and enthusiasts, and to people operating large deployments. Chaos engineering is generally started after you burn your hands with a downtime: as part of the root cause analysis you resolve that you need to practice chaos engineering, so you start doing it. That's how it has typically been done until a year or so ago. And how has it been done? Through game days and the occasional integration into CI/CD, but not as a thumb rule; it's being done pretty rarely, not as a standard practice.
And so far it has typically been limited to SREs. Developers limit their testing to CI pipelines: regular functional testing, or a little bit of negative testing. It's not deep chaos testing. The measurement of results is also not standardized, and there are no tools to do that. Observability is done with whatever tools are available, plus manual operations and manual scripting. So overall, the "why" of chaos engineering is well understood, and many large companies operating large data centers and applications have been doing it, but it is still about manually planning and manually executing it. There are no well-defined ways of doing chaos engineering that people with common operations knowledge can simply pick up and practice. That, in my opinion, is the state of chaos engineering until recently.
Before we go into cloud native and chaos engineering together, let's look at where each of them stands with respect to crossing the chasm. Kubernetes is pretty well adopted now and is believed to be in the mainstream market, whereas chaos engineering is believed to still be in its early days in the cloud native ecosystem. So why can't you just practice chaos engineering the regular way in the cloud native ecosystem? Chaos engineering done in a cloud native environment is called cloud native chaos engineering. Why should practising chaos engineering be any different in the cloud native ecosystem? I would say there are primarily two reasons to look at chaos engineering differently: one is more dynamism, the other is the way DevOps has changed with respect to infrastructure provisioning.
Let's talk a little bit about dynamism. It started with containerization, where an application is split into multiple microservices. Instead of dealing with one large application, you have multiple smaller microservices to deal with, and they are being tested with pretty well-built CI/CD pipelines. The goal is to deliver them faster. There have been a lot of advancements in CI and CD; new tools are available and they keep getting easier. The net effect is that instead of the typical releases of large systems happening every 90 or 180 days, you get releases almost every week. You have multiple microservices, and something or other is being changed all the time, at least once a week in a very large system. Look at the dynamism of the cloud native ecosystem: there are so many players working together under the leadership of CNCF, it is working very well, and a lot of coordination is going on. The net effect of all this is that, if you are in the cloud native ecosystem, you get very dynamic changes into your system all the time. Now imagine doing chaos engineering in that system: you also have to be very dynamic in how you do chaos engineering.
The other reason is that DevOps has changed, and it's not just about shift-left or containerization coming faster; it's about infrastructure provisioning. How has infrastructure provisioning changed? First of all, it is now 100% declarative. Everybody provides a declarative API where developers can write declarative code or syntax to get the infrastructure they want.
They're not waiting for anyone to provision the infrastructure for them; they're doing it themselves. By following the practice of GitOps, which is on the rise, developers are getting the infrastructure they want through APIs. It is happening already. And because infrastructure is getting provisioned left and right, the system could be less resilient, because faults can very well happen in the infrastructure, and typically they do. So your applications are coming into production faster as microservices updates, and developers are also frequently changing the infrastructure underneath that these applications run on. Both of these things together create a problem for resilience, and that is what cloud native chaos engineering has to be aware of. In summary, there is more dynamism and there are more infrastructure changes, so you need to be doing cloud native chaos engineering.
So how do you do this differently in the cloud native ecosystem? I have come up with a set of general principles for practising chaos engineering; I'm setting out around five of them here. Until recently I had written about four in my blogs, but lately I've observed that open observability is a big deal: you have to have a common layer of observability to do chaos engineering in the cloud native ecosystem. I'll come back to that. First of all, it has to be open source. By having the infrastructure or architecture for chaos engineering in open source, you get a more reliable stack for doing chaos engineering; Kubernetes and the entire CNCF ecosystem are an example of that, with projects that are well architected, well designed, and well reviewed. Chaos experiments, which in my opinion will become a commodity at some point, have to be community collaborated so that no false alarms come out of them, and so you don't have to spend a lot of time writing the most common experiments; they can be hosted somewhere and be available for the most commonly required chaos operations. Chaos itself has to be managed at scale, which means chaos experiments need to be versioned and follow a lifecycle of their own, so you need open APIs and versioning support for chaos experiments. And how do you do this at scale? Doing anything at scale is a challenge, and the same applies to chaos engineering in cloud native ecosystems. When you're doing it at big scale, you need to automate everything, and the right way to do that is GitOps: the entire chaos engineering construction has to be declarative and has to be supported by well-known tools like Argo CD or Flux or Spinnaker and so forth. And open observability is important: as I mentioned a little while ago, when you introduce a fault you need to be able to debug whether a certain change in behavior is because of the fault you introduced or because of something else that happened coincidentally. These are the principles of cloud native chaos engineering. You don't need to follow all of them all the time, but in my opinion it's good to have all of them in place. A minimal sketch of what the GitOps principle looks like in practice follows.
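To make the GitOps principle concrete: a minimal sketch, assuming a hypothetical Git repository of declarative chaos definitions, of an Argo CD Application that keeps those definitions continuously synced to a cluster. The repository URL, path, and namespaces are illustrative placeholders rather than anything Litmus ships.

```yaml
# Hypothetical sketch: an Argo CD Application that keeps chaos definitions
# stored in Git (chaos engines, workflows) in sync with the cluster.
# Repo URL, path, and namespaces are placeholders for illustration.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-definitions
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/chaos-definitions   # placeholder repo
    targetRevision: main
    path: workflows/production
  destination:
    server: https://kubernetes.default.svc
    namespace: litmus
  syncPolicy:
    automated:
      prune: true       # remove chaos resources that are deleted from Git
      selfHeal: true    # revert manual drift back to the Git state
```

With a setup like this, adding or tuning a chaos experiment becomes a pull request, with the same review and rollback discipline you already apply to application manifests.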
With those principles in mind, I want to introduce the Litmus project, which is built on them. We've been building it for about three years now, and I'm happy to say that we are almost there with all of the principles; the first three in particular have been in place for more than a year. We are releasing Litmus 2.0 very soon with GitOps and open observability features, so all of these cloud native chaos engineering principles are well adhered to by the Litmus project.
Let's look at Litmus in a little more detail. It's a complete tool set for doing cloud native chaos engineering, and it comes with a simple Helm chart for installing it and running chaos workflows. Basically, it all starts with a simple Kubernetes application called Litmus, which you install using Helm. All the experiments needed for your chaos workflows are already available in a public hub, and you will end up with your own private hub for coordinating with your team the new experiments that you write or the ones you tune. Once you install Litmus through the Helm chart, you will have something called the Litmus portal, a centralized place where all your chaos engineering efforts are coordinated and from which you pull in or refer to experiments in public or private hubs. You can run chaos workflows anywhere, against Kubernetes in any cloud or on-prem, and it is not limited to Kubernetes: you run the chaos experiments from the Litmus portal in the Kubernetes ecosystem, but the targets can be non-Kubernetes as well. And if you're doing it at scale, you had better do it through GitOps, so Litmus gives you the option of storing all your configuration either in the local database or in Git. Once it is stored in Git, integration with any of the CD tools becomes much easier.
Looking at it in a little more detail: you have the central Litmus portal, and all you do is pick a predefined chaos workflow, or create one, and run it against a target; the target is where the chaos operator is spun up and the experiments are run. You can have multiple targets connected to the same portal, so you don't install Litmus again and again; you install it once per enterprise or team and then you're good to go. You have RBAC and everything else in the Litmus portal. Once the experiments run, the metrics are exported to Prometheus, the observability system, and the analytics are pushed back into the portal.
Out of all this, the workflow is a key element. It's one of the innovations from the Litmus team in the last six months, so I want to talk a little bit about what a Litmus chaos workflow is. It is basically an Argo workflow consisting of multiple chaos experiments, which you can arrange in sequence, in parallel, or in a combination of the two, and they get run. A Litmus chaos workflow also carries consolidated results and status, so a workflow is a unit of execution and management for you within Litmus. By keeping all the configuration of a complex workflow declarative, chaos engineering and GitOps can be put together. Argo Workflows is pretty stable, in fact it is at the incubation stage within CNCF and is a very widely used tool, so I'm pretty sure you will have a great experience using Argo and Litmus together. A simplified sketch of such a workflow follows.
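This hypothetical sketch mirrors the demo described in the abstract: a node fault first, then a CPU hog and network slowness in parallel. The runner image and the referenced ChaosEngine manifests are placeholders, and the real predefined Litmus workflows wire up additional steps (installing experiments, creating chaos engines, checking results); treat this as an illustration of the sequence/parallel structure only.

```yaml
# Simplified sketch of a chaos workflow in the Argo Workflow format: one
# fault runs first, then two faults run in parallel. The runner image and
# the ChaosEngine manifests referenced in the arguments are placeholders
# and are assumed to be available to the container (e.g. via a ConfigMap).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ecommerce-resilience-
  namespace: litmus
spec:
  entrypoint: chaos-steps
  serviceAccountName: argo-chaos
  templates:
    - name: chaos-steps
      steps:
        - - name: node-drain                  # outer list entries run in sequence
            template: run-chaos
            arguments:
              parameters:
                - name: engine
                  value: engines/node-drain.yaml
        - - name: cpu-hog                     # entries in the same inner list run in parallel
            template: run-chaos
            arguments:
              parameters:
                - name: engine
                  value: engines/pod-cpu-hog.yaml
          - name: network-latency
            template: run-chaos
            arguments:
              parameters:
                - name: engine
                  value: engines/pod-network-latency.yaml
    - name: run-chaos
      inputs:
        parameters:
          - name: engine
      container:
        image: bitnami/kubectl               # placeholder kubectl image
        command: ["sh", "-c"]
        args: ["kubectl apply -f {{inputs.parameters.engine}}"]
```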
If you want to go a little deeper into how a chaos workflow works: your experiments live in a chaos hub, and the chaos workflow refers to the experiments on that hub. You always keep them in a hub; you can refer to experiments in the public hub and tune them through a YAML file, or, if you are changing some experiments or creating new ones using the SDKs, you can keep them in a private hub. Ultimately a workflow refers to an experiment in a hub somewhere. To kick-start a workflow, you create it and push it, and the change is recognized and executed, either manually or through GitOps. Finally, the chaos engine is the resource responsible for kick-starting the chaos: the chaos operator watches for changes in the chaos engine, the experiments are run, and the Litmus chaos exporter takes the result metrics and pushes them into Prometheus. A chaos result CRD is created, and the results of the chaos experiments are pushed back into Litmus for analysis, debugging, and monitoring purposes. That's how a chaos workflow happens. Many of these workflows are available as predefined workflows; you just need to configure and tune them and you're good to go. These chaos workflows can be run against multiple targets, so it's basically a multi-cloud chaos engineering ecosystem we're talking about.
In terms of the experiments list, Litmus provides a lot of experiments of all types, with a few more in the works, such as IO chaos and DNS chaos, but you've got pretty much everything you need to start today. Another thing Litmus provides is a way to define your hypothesis using probes: you can define the steady-state hypothesis in a declarative way using probes, and using probes and annotations you can mark the chaos duration on your regular Grafana graphs, for example. The Litmus portal also provides a good deal of analytics for comparing resilience across workflows run at the same time or at different times, so you have the beginnings of built-in observability rather than depending on external tools. A sketch of a chaos engine carrying such a probe appears after this part.
Interleaving chaos into CI pipelines is another important concept; there has been a lot of advancement and interest in CI. What we have done is create a Litmus chaos CI library, and you can create a chaos stage that uses this library through your existing tools. For example, if you're using GitLab, you can create a remote template for chaos that uses this CI library wrapper, and then your remote template is ready; you don't really need to worry about the execution of the underlying chaos workflow, it all happens automatically. So far we have done this for GitLab, GitHub Actions, and Spinnaker, and most recently for the Keptn project. These are the integrations available today, and many more may be coming later this year.
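Returning to the probes mentioned a moment ago, here is a sketch of how the steady-state hypothesis can be declared on the chaos engine itself. It assumes a hypothetical e-commerce front-end deployment and the pod-delete experiment from the public hub, and checks that the storefront keeps answering HTTP 200 while the fault runs; exact probe field names and units vary a little between Litmus versions, so treat it as illustrative rather than authoritative.

```yaml
# Sketch of a ChaosEngine that injects pod-delete into a (hypothetical)
# front-end deployment while an HTTP probe continuously verifies the
# steady-state hypothesis. Verify field names against your Litmus version.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: storefront-chaos
  namespace: shop
spec:
  engineState: active
  appinfo:
    appns: shop
    applabel: app=front-end          # placeholder label for the demo app
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete               # experiment pulled from the chaos hub
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
        probe:
          - name: storefront-availability
            type: httpProbe
            mode: Continuous         # evaluated throughout the chaos duration
            httpProbe/inputs:
              url: http://front-end.shop.svc.cluster.local:80
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 1
```

If the probe fails, the experiment verdict is recorded as failed in the chaos result CRD and surfaces in the portal analytics described above.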
Litmus is known for being very strong at chaos engineering on Kubernetes, but what about non-Kubernetes? Does Litmus support chaos engineering for non-Kubernetes targets? The answer is yes. Experiment management and monitoring all remain on Kubernetes, and you can still execute experiments against targets such as the various clouds, or your own VMware or OpenStack on-prem, et cetera. The way it works is that your experiments run all the way to the last leg within Kubernetes, but the actual chaos is executed on the target by reaching it over the network through its control APIs. You need to write the logic of how to kill that resource, and the rest of the chaos engineering control plane stays on Kubernetes. We already have experiments like EC2 terminate, EBS detach, and GPD detach, and many more are coming, so later this year I'm pretty sure we'll have many more examples of chaos for non-Kubernetes resources. As long as you have an API to reach the resource on the other side, you can kill that resource using that API; a hedged sketch of such an experiment appears at the end of this section.
The other thing I want to touch upon is CNCF projects and Litmus: what integrations are available? We have recently integrated with Argo workflows, and we have tested very well with Argo CD and certified it. We have also done a close integration with Keptn by working with their team, a fantastic team, I would say. And of course there is the OpenEBS team; OpenEBS is the project that started the Litmus integration to begin with, a lot of community members use chaos testing with OpenEBS, and it is certified for runtimes like CRI-O and containerd. That's where we are so far, and on our short- to medium-term roadmap we have integrations with Flux, Crossplane, database projects, and, on the security side, Open Policy Agent. Those are the projects we have in mind for some kind of integration. Here is a quick summary of the Litmus roadmap; I think I have talked through most of it, and you can pause the recording and look at the roadmap in detail if you want.
Let's talk about what we do at ChaosNative and what ChaosNative can do for chaos engineering, both in cloud native and non-cloud-native environments. The idea of spinning off from MayaData is to provide more resources for the success of the Litmus project and to accelerate the adoption of Litmus by enterprise users. A lot of big enterprise users have been using Litmus and have been asking whether the Litmus team can support them. We had been doing that as part of MayaData, but we felt it was time to put more focus and more resources around Litmus, and that's how the company was launched a few days ago. How are we going to do that? Accelerating enterprise adoption of Litmus really comes from building a stronger community around it. Community is very important, and we want to encourage and demonstrate the open governance of Litmus, so we will be putting more resources into working with more community members and large companies in the cloud native ecosystem to build a stronger community and increase adoption. I just talked about the plans to integrate Litmus with other CNCF projects; all of that requires resources that we will now be able to allocate. Apart from enterprise support, we are also thinking about integrations with many more CD tools, Kubernetes distribution testing, and making sure chaos engineering is easy to do in air-gapped environments. Some customers are asking for managed services around chaos engineering, and we also have plans to launch chaos as a service at some point to make chaos easy for developers.
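Coming back to the non-Kubernetes chaos described above, here is a hedged sketch of what driving an AWS fault can look like: the chaos engine and its runner stay on Kubernetes, while the fault itself is applied to EC2 through the cloud API. The experiment name and environment variables follow the public chaos hub's AWS experiments of this period, and the instance ID, region, and credential handling are placeholders; check the hub entry for the exact fields and permissions your version expects.

```yaml
# Sketch: a ChaosEngine driving an EC2 instance termination. The runner
# stays on Kubernetes; the fault is applied through the AWS API using
# credentials expected to come from a Kubernetes secret per the hub docs.
# Instance ID and region below are placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ec2-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: ec2-terminate            # AWS experiment from the chaos hub
      spec:
        components:
          env:
            - name: EC2_INSTANCE_ID
              value: i-0123456789abcdef0   # placeholder instance ID
            - name: REGION
              value: us-east-1
            - name: TOTAL_CHAOS_DURATION
              value: "120"
```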
So that's it for this session, where we talked about cloud native chaos engineering, the need for it, and how to do it. I encourage you to go ahead and give Litmus a try, and if you have any questions or need support, come to the Litmus channel on the Kubernetes Slack. With that, I hope you have a great Conf42 conference and a fantastic day or evening. Thank you.
...

Uma Mukkara

CEO @ ChaosNative



