Conf42 Site Reliability Engineering 2022 - Online

The Freedom of Kubernetes requires Chaos Engineering to shine in production

Abstract

Like any other technology transformation, k8s adoption typically starts with small “pet projects”: one k8s cluster here, another one over there. If you don’t pay attention, you may end up like many organizations these days, with something that spreads like wildfire: hundreds or thousands of k8s clusters, owned by different teams, spread across on-premises and cloud environments, some shared, some very isolated.

When we start building applications for k8s, we often lose sight of the larger picture of where they will be deployed and, more importantly, what the technical constraints of our target environment are.

Sometimes, we even think that k8s is that magician that will make all our hardware constraints disappear.

In reality, Kubernetes requires you to define quotas on nodes and namespaces, and resource limits on your pods, to make sure that your workload will be reliable. Under heavy pressure, k8s will evict pods to relieve the pressure on your nodes, but eviction can have a significant impact on your end users.

How can we proactively test our settings and measure the impact of k8s events on our users? The simple answer to this question is chaos engineering.

During this presentation we will use real production stories to explain:

  • The various Kubernetes settings that we can implement to avoid major production outages.
  • How to define the chaos experiments that will help us validate our settings.
  • The importance of combining load testing and chaos engineering.
  • The observability pillars that will help us validate our experiments.

Summary

  • The freedom of Kubernetes requires chaos engineering to shine in production. If you're connected to this session, there is a big chance that you are interested in Kubernetes and also in chaos engineering. Enjoy yourself, relax and let's spend half an hour together.
  • Henrik Rexed is a Cloud Native Advocate at Dynatrace. He is one of the producers of PerfBytes, a YouTube channel dedicated to performance engineers. He has also started a fresh new YouTube channel called Is It Observable?
  • To validate our settings we will use chaos engineering. Because we do some testing, we also need to have observability in place. And last, we will briefly explain how we could automate that process.
  • In Kubernetes there are two types of nodes. The master node is the one at the top; worker nodes are there to host our workload. When we deploy any workload in our cluster, Kubernetes moves that workload through different states.
  • Kubernetes eviction works by quality of service, so we need to define requests and limits on our pods. The second recommendation is to put resource quotas on our namespaces. This avoids node pressure events and infrastructure issues.
  • Setting requests is very helpful because it helps Kubernetes properly orchestrate your workload. The second concept is limits. How do you express them? CPU in millicores and memory in bytes. If the CPU limit is too low, we get throttled and the workload becomes very slow.
  • What is chaos engineering? Chaos engineering is a process to discover vulnerabilities by injecting failures and errors. Don't do it directly in production: first do it in a non-production environment, and once you're mature enough, move closer to production environments.
  • So what are the hypotheses related to Kubernetes? My expectation is that my Kubernetes settings are good: I expect my app to be stable, with no performance impact and no errors. Then you have the maintenance scenario, which will happen either during production hours or at night.
  • What are the observability pillars that we need to collect? Of course, observability has plenty of pillars, but in our particular case, because this is very much an infrastructure topic, we will focus on metrics and Kubernetes events.
  • Keptn allows you to deploy, manage and test similar to a CI/CD process. It can also manage production use cases like auto-remediation. Keptn is a framework based on CloudEvents. At the end it provides a score, and the evaluation gives a single overall result based on that score.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to this session: the freedom of Kubernetes requires chaos engineering to shine in production. Obviously, if you're connected to this session, there is a big chance that you are interested in Kubernetes and also in chaos engineering. So grab a glass of water or a cup of coffee, enjoy yourself, relax, and let's spend half an hour together. Before we actually get to the main content, I would like to briefly introduce myself. My name is Henrik Rexed. I'm a Cloud Native Advocate at Dynatrace, but prior to Dynatrace I was involved in the performance engineering market for more than 15 years. As a result of that, I have become one of the producers of a YouTube channel dedicated to performance engineers called PerfBytes. Check it out if you're looking for content for performance engineers. On the other hand, last year, in July 2021, I started a fresh new YouTube channel called Is It Observable? It's a channel dedicated to observability in general, so if you're looking for tutorials, content that explains a given framework or technology, check it out. It will be really helpful, and I'm also looking for feedback, so please connect and send me your thoughts. So what are we going to learn if you stay with me for the next 30 minutes? A couple of things. Because we're going to talk a lot about Kubernetes and the challenges and problems that we could face in production, it makes sense to start with a few reminders about Kubernetes. We will of course present the challenges themselves. Then, to validate our settings, we will use chaos engineering: we'll introduce what chaos engineering is and then see which experiments we will need to design to validate our Kubernetes settings. Because we are doing some testing, we also need to have observability in place, so we will see what type of metrics and events we need to collect to validate our experiments. And last, we will briefly explain how we could automate that process. So, the dark side of Kubernetes. Kubernetes is an orchestration framework, everyone knows it, no surprises. And in Kubernetes there are two types of nodes. In fact, Kubernetes relies on nodes, and nodes in the end are physical or virtual servers. The master node is the one at the top of this slide; as you can see, there are various icons: the scheduler, etcd, the controller manager and the API server. If you're using a managed Kubernetes environment provided by one of the hyperscalers, AWS, Azure or GCP, then you probably don't see that master node. If you fully manage the cluster yourself, you will have to manage the master node as well. On the bottom you have the worker nodes. The worker nodes are there to host our workload. When we deploy any workload in our cluster, Kubernetes will move that workload through different states, and behind that there are a lot of events. I need to remind you of those concepts because they are very important for understanding the various challenges that we're going to talk about in a few minutes. So when we deploy, using kubectl or any other system, the first state of our workload will be the pending state. Pending means Kubernetes knows it has to deploy a new workload, so it will try to identify a node that is able to host that new workload.
Based on resources, taints and tolerations, and various policies, once it has identified the right node, our workload moves to the creating state. Creating state means: I know which node is going to host my workload. And because our workload relies on containers, at that stage Kubernetes will interact with our container registry to pull the images and validate that all the requirements are there. If we have any volumes, config maps or secrets, it will check that those exist so it can deploy. Then, once we have all the requirements in place, our workload is in the running state. It doesn't mean that the application is officially up; it just means that Kubernetes has started the pod with the various containers in it. If you want to check that the app is really running, you need to check the readiness probes or health probes that Kubernetes provides. All right, so now you know the various states of our workload. Once upon a time, Kubernetes killed my workload. So here, let's say we have a cluster and I have two applications, so I've decided to have two pods. Let's pretend that it's not two pods but two namespaces with a lot of pods inside of them, and I have more than one node; here we only see one worker node. During that time frame, one of the apps was consuming more resources, and as a consequence there were almost no resources left for the other workload. To avoid any node pressure events or infrastructure issues, Kubernetes will try to resolve that problem, and for that it will start an eviction process. The eviction process means it is going to select one of the existing pods running on that node and evict it. Evicted means: first I kill the pod on that node, then I reschedule it on an available node that can take that pod or workload. So here, in a few minutes we were able to resolve our pressure situation and our users were almost not impacted. But it could be worse, it could be very, very bad. Imagine that all your nodes are pretty much saturated, or you have designed taint and toleration policies such that there is no node left for your workload anymore. So you have killed the workload on the previous node and the new workload cannot be scheduled anywhere else. Basically: no app, nothing responding to our users. That is pretty critical for us, so we need to figure out how to avoid that type of situation. How can we do that? Well, there is the recommended Kubernetes approach. First, eviction works by quality of service, so we need to define requests and limits in our pods. By doing this, Kubernetes will determine a specific quality of service class based on our requests and limits: if our requests equal our limits, our workload is considered Guaranteed; if the requests are under the limits, we are Burstable; and if nothing is defined, it's BestEffort. When eviction happens, it happens in that order: first Kubernetes will try to delete the BestEffort pods, then Burstable, and last Guaranteed. The second recommendation, of course, is to put resource quotas on our namespaces. Remember our situation where one app was eating the resources of the other; since we usually isolate our apps in separate namespaces, quotas are the natural fit there.
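To make those two recommendations a bit more concrete, here is a minimal sketch; the names and numbers are made up for illustration. A pod whose requests equal its limits lands in the Guaranteed QoS class, and a ResourceQuota caps what a whole namespace can ask for:

```yaml
# Sketch only: hypothetical names and values.
# requests == limits for every container => QoS class "Guaranteed",
# so this pod is evicted last under node pressure.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  namespace: team-a
spec:
  containers:
    - name: demo-app
      image: example/demo-app:1.0
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 200m
          memory: 256Mi
---
# Second recommendation: cap the whole namespace so one team's app
# cannot eat the resources of the other.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
```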
If I define resource quotas on my namespace, then I'm pretty sure I won't end up in that situation, because my app will have dedicated resources and it won't be able to eat more. So, requests and limits: a hot topic in Kubernetes. Let's have a look at requests and limits, what their value is and how you can express them. First, requests. Requests are a bit like a game of Tetris; Kubernetes behaves like a Tetris game. Remember, you have shapes coming down the screen, we move them and place them on available spots, and the idea is to make lines. It's the same thing with Kubernetes: we deploy a new workload, Kubernetes sees the workload coming in, and based on the size of the workload, the request, it will try to place it on the right node. So basically the request is there to tell Kubernetes: okay, I need at least 100 megs to run my workload. It then knows the shape of your pod and can play Tetris. If you don't specify any request, Kubernetes doesn't know the shape of your workload: it sees a small square, places it somewhere, and then suddenly that square becomes a huge shape, and it's impossible to play the game properly. We can also imagine that Kubernetes is a bit like a box of chocolates. Remember when you were a kid and you received a box of chocolates? Of course you don't read the manual, I mean, who reads the manual today? You pick a chocolate, you eat it, and suddenly you discover there's liqueur inside, and remember the faces we made. I don't hate liqueur, whatever, maybe today it's different, but it's the same thing: if you don't specify the requests and limits, Kubernetes takes the chocolate, assumes it's a chocolate without liqueur, starts to deploy it, and then realizes there's liqueur inside. So setting requests is very helpful because it helps Kubernetes properly orchestrate your workload. How do you express it? CPU in millicores and memory in bytes, nothing complicated. If you do the right tests, you know the minimal resources you need to run your application properly. Of course you can put very high values if you don't want to test and you prefer to guess; yes, you can do that, but keep in mind that those resources will be allocated by Kubernetes and never used, so in the end you're not optimizing the usage of your nodes. The second concept is limits. So now we have said that we need a three-bedroom apartment, and we have a contract with Kubernetes saying: okay, you have the three-bedroom apartment, that's fine, but you won't be able to consume more power or more water during some periods of the day. That's basically the limit: a contract with Kubernetes on how much resource I can utilize at maximum, what will be tolerated. For memory, we express it in bytes. This is very easy: I do a load test, I see the maximum value that I need for memory, and I can set it. On the CPU side it's more difficult, and this is due to a heritage from the Docker world. The way Docker shares resources within a host is through CFS, the Completely Fair Scheduler, and with CFS the CPU is basically split into slices of time. Let's say that one core has work periods of 100 milliseconds, and we determine the quota that we can use within that period. So we have a 100 millisecond CPU period, and if I define, let's say, 20, I will be able to consume only 20 milliseconds: I do some work, I consume my 20 milliseconds, then the node, or Docker, pauses my work and waits until the next CPU cycle, then I can resume my work for another 20 milliseconds, and so on and so forth. That mechanism of being paused in the middle of our work is called throttling: CPU throttling.
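To put numbers on that, here is a back-of-the-envelope illustration, assuming the default CFS period of 100 milliseconds; the values themselves are hypothetical:

```yaml
# Sketch only: hypothetical values, default 100ms CFS period assumed.
resources:
  limits:
    cpu: 200m       # 200 millicores = a quota of 20ms of CPU time per
                    # 100ms CFS period; once the container has burned
                    # its 20ms, it is paused (throttled) until the next
                    # period starts.
    memory: 256Mi   # exceeding this does not throttle: the container is
                    # killed and Kubernetes emits an OOMKilled event.
```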
So if I define my CPU limit too low, as a consequence I will get a lot of CPU throttling. On the memory side, if my value is too low, Kubernetes will kill my workload and send an event called OOMKill. And if we set the values too high, in the end we are not optimizing the resources of our cluster properly. That's why it's really important to define those limits properly: on the memory side Kubernetes can kill our workload, and on the CPU side we get throttled, and as a consequence our workload will be very slow. In fact, it's funny, because when I started to work with requests and limits, I said: let's do a test. Let's remove all the limits from our workload, run it, and measure the response times on one hand and the resources on the other. On the graph at the top you can see the CPU usage: without limits I was able to consume almost 650 millicores, and when I applied limits, you can see I'm consuming way, way fewer resources. That's normal, because I have a limit defined. But now let's have a look at the actual experience we're giving to our users. At the bottom you can see the response times: without limits we have a response time of about 100 milliseconds or less, pretty good. But when I apply the limits, boom, we have almost 3.6 seconds of response time, which is very, very high, and this is only due to CPU throttling. So why do I need to put CPU resource limits in my cluster if, in the end, it seems to work better without them? Well, because it's a best practice for the industry: at least on the memory side you need to define requests and limits, otherwise you don't utilize your resources and your nodes properly. And on this topic of requests and limits there are tons of horror stories available in various presentations done at KubeCon. You can scan the QR code here; it points to a website listing all those horror stories, and I definitely recommend watching them. There are plenty of interesting stories: performance issues related to CPU throttling, with Airbnb and Zalando talking about it, and stability issues due to OOMKill, again from Airbnb and Zalando. So check it out, you can learn a lot of things, and this topic is very important because we know it can have a major impact on stability and on the behavior experienced by our users. So how do we validate and avoid that type of situation? Well, obviously, chaos engineering. What is chaos engineering? If I take the definition from Wikipedia, it says chaos engineering is a process to discover vulnerabilities by injecting failures and errors, and it even says in production. So first of all, don't do it directly in production; you don't improvise in production. First you do it in a non-production environment, and then, once you're mature enough, you move closer to a production environment.
So let's have a look at the workflow: how do you find those errors and vulnerabilities in your environment? The process goes through a few steps. The first step is that we need to define hypotheses. We can say: okay, I've designed my app, I know the architecture, I know the system, what could go wrong, what could fail? For example, I have a connection to a database, so I may assume that I could have network connectivity issues between my system and the database. That could be a problem. Then I need to predict how my system will react: either it will handle it properly, because I've designed an awesome architecture and an awesome piece of software, or I can predict that my system is going to fail or have problems writing to the database. So we need to list what we expect from that situation. Then we need to define which metrics and events we need to collect to validate our experiments. Then we define our experiments. Usually an experiment is a workflow of tasks: first I want to inject latency between my app and my database, or I want to inject, let's say, packet loss, or whatever, basically to simulate network problems; I can restart my app; you define the workflow for the specific situation that you want to test. Then we need to define how to roll back. Why? Because keep in mind that we will eventually run this in production, so if there is a problem, we need to have a described and automated process to come back to a normal state. Then we also need to figure out how we're going to collect the various KPIs required to validate our experiment. And last, we run our test. All right, so what are the hypotheses related to Kubernetes? Remember, we talked about requests and limits. So what are the various hypotheses? Well, I have some ideas. First, we have the Kubernetes settings. We know that if I change requests and limits, I probably have an impact on my users, so I want to validate that those settings are working fine. My expectation is that my Kubernetes setup is good, I have already defined the right requests and limits, so I expect that my app is stable, with no performance impact and no error rate increase. Okay, fine, fair enough. Then you have the maintenance scenario. What do we mean by maintenance, by the way? Remember, you have to upgrade your nodes, because nodes in the end are physical or virtual servers, like I mentioned. If you want to upgrade the version of Kubernetes, because you're upgrading your master nodes, then to do that you will have to drain the node: you remove that node from the actual work, you do your maintenance task, and then you reattach it to the cluster. That is a maintenance task, and it will happen, either during your production hours or at night. My expectation is that if I have designed my cluster well, I should have no impact: stable, no users impacted, everything works smoothly. And last, there is the story we had before: eviction. We run into node pressure, a situation where there is an eviction. My expectation is that, because I have defined the right priorities on my pods, there is no downtime, stable performance, no impact on my users. Now, what are the observability pillars that we need to collect? Of course in observability there are plenty of pillars: you have logs, you have traces, you have metrics, and Kubernetes events are also one of the pillars.
You can also add profiling and others, but in our particular case, because this is very much an infrastructure topic, we will focus on metrics and Kubernetes events. So let's have a look at the metrics and events that make sense for us. First, the metrics. Keep in mind that Kubernetes applications are a bit like an onion: there are different layers. First you have the outside layer, which is the user. The user is interacting with our app, so I will collect some metrics about that user: response times, failure rate, basically user experience. Then I go to the next layer. Because Kubernetes relies on nodes, I need to figure out how my nodes are behaving: in terms of resources, CPU, memory, the number of pods running on that specific node, and maybe also the number of pods allowed, the IP addresses available, and so on. Then the pod layer: within that node I have pods running, so here I will keep track of what I have defined in terms of requests and limits for CPU and memory, what the limits are and what the actual usage is, so I can figure out whether I'm far from the limits or not and optimize those settings. And last, in the pod I have containers, and remember, CPU throttling is a Docker-level concept, so we need to measure it from the container perspective: we'll look at CPU throttling, memory usage and CPU usage. On the events side, as we saw at the beginning, there are various states in Kubernetes, and in those states there are different types of events sent by Kubernetes. Of course the user won't send any major events, except maybe tweets or support cases, but we are focused mainly on the events coming from the cluster. On the nodes: node pressure, which is a sign of a problem. On the pods: FailedScheduling, which means we are not able to place that workload on any node anymore, and that could be a really important sign; Evicted, of course; OOMKill; Unhealthy; that type of event. So what are the experiments that we need for Kubernetes? First, we're going to test the eviction and the maintenance scenarios in the same experiment. Why? Because I probably have a lot of nodes, so to get to the situation where I'm putting high pressure on a node, I want to remove some nodes: I'm going to have fewer nodes than expected because I want to reach a node pressure situation. So first a node drain, then I will simulate CPU stress on the node and memory stress on the node, and I will generate some load. I'm not going to run this type of experiment in production; I want to run it in a non-production environment, and I need to measure the impact on the user. I will run a constant load, no spike test, nothing fancy; the load test will only be there to report the actual response times from the user perspective, the error rates, and the stability of the application. For the Kubernetes settings, I may not necessarily need chaos experiments. If I want, I can use them to get to the right situation, but usually a standard load test is enough: you run a stable load again, you measure the response times and the failure rates, you compare with requests and limits against without, you measure the CPU throttling, and you tweak those settings until you get the configuration that gives the best response times with the right stability for our apps.
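As one concrete sketch of such an experiment, here is roughly what a node-drain chaos run could look like with Litmus (introduced just below); the node name, namespace and service account are hypothetical, and field names can differ slightly between Litmus versions:

```yaml
# Sketch only: hypothetical target node and service account,
# assuming the Litmus ChaosEngine CRD (litmuschaos.io/v1alpha1).
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-chaos
  namespace: litmus
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain            # experiment pulled from the ChaosHub
      spec:
        components:
          env:
            - name: TARGET_NODE   # the app node to drain
              value: "app-node-1"
            - name: TOTAL_CHAOS_DURATION
              value: "120"        # seconds
```

In the full workflow this would run alongside node-cpu-hog and node-memory-hog experiments and a constant k6 load, as described next.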
So what do we need for this? First, I need a Kubernetes cluster, and there are various tools that are going to be deployed on it. You can see that I have two colors here for the nodes; I've separated them. If I run the experiments on the same cluster as my app, I want to make sure that the experiments are not impacting my tooling. So I need an observability backend solution, Prometheus or Dynatrace; I need a chaos engineering tool, in my case Litmus; and a load testing product, in my case k6. For this, I have labeled the nodes to make sure that all the tooling I need for my experiments is placed on dedicated nodes, and then I have other nodes that are dedicated to my app. That way I know that my experiments will only impact the nodes for my app and not my testing tools. Then, like I said, I need some tools. I need a chaos engineering product: I use Litmus Chaos, which is part of the CNCF, a really good product by the way. There's a web UI called the ChaosCenter. Litmus Chaos can be installed either in the same cluster as your app, or on a dedicated cluster from which it interacts with the cluster where your app is running. In any case, you will need to deploy an agent, either on the same cluster as Litmus Chaos or, if you have a remote cluster, on that cluster; the agent comes with the chaos exporter. The chaos exporter exposes a couple of metrics about our experiments in Prometheus format, and then there are chaos workflows and so on. I'm not going to go into detail on the architecture, but keep in mind that the components I'm showing here on the screen are the main ones. What is great with Litmus Chaos is that there is a ChaosHub. The ChaosHub provides 50-plus experiments for Kubernetes, and all the right experiments for our case, node drain, node CPU hog, node memory hog, are already there in the ChaosHub. So it's perfect, I don't need to reinvent the wheel, I can simply use them. The other advantage of Litmus Chaos is that it relies on Argo Workflows, so I can define a very specific workflow combining pure chaos experiments with k6 load tests in parallel. Why am I picking k6? Because k6 has an output extension that I personally like: its Prometheus integration. k6 provides results on the command line, on stdout, in JSON or other formats, but in my case I want k6 to write its statistics into Prometheus. That way I will have all the observability in Prometheus or Dynatrace, because Dynatrace is able to scrape the data from Prometheus. At a minimum I need the response times, the request rates and the failure rates, and I also want all the data related to the health of the cluster itself. For this I will also have a Prometheus in my cluster. If you install the Prometheus operator, it comes with several components: of course the Prometheus stack itself, but also a couple of exporters. An exporter is a component producing metrics. So we'll have kube-state-metrics to see how the various Kubernetes objects behave, the node exporter for anything related to how healthy my nodes are, cAdvisor to collect CPU throttling metrics at the container level, an exporter for Litmus to collect metrics from the Litmus perspective, and an exporter for k6 to collect the metrics from the k6 perspective.
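As an illustration of how those container-level throttling metrics can be watched during an experiment, here is a sketch of an alerting rule, assuming the Prometheus operator's PrometheusRule CRD and the standard cAdvisor counters; the 25% threshold is arbitrary:

```yaml
# Sketch only: namespace, rule name and threshold are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling-rules
  namespace: monitoring
spec:
  groups:
    - name: chaos-experiment-health
      rules:
        - alert: HighCPUThrottling
          # Fraction of CFS periods in which the container was throttled.
          expr: |
            sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod)
              /
            sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod)
              > 0.25
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod is CPU throttled in more than 25% of CFS periods"
```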
With those exporters, Dynatrace and its components, we will be able to collect the right metrics and push them to Dynatrace. So now you know all the tooling; let's have a look at how we could automate this process. To automate this, we can obviously define a pipeline in Jenkins or any CI/CD system that we have: we're going to build, deploy, deploy our exporters, configure Dynatrace or whatever else, because all those tools can be configured with an API or with command lines, and then I'm going to run my test. Fair enough. But after the test, someone needs to approve, someone needs to look at the results. So it's not automation anymore, because we actually have a pause in this automation. How can I remove that pause? For this I'm going to use another CNCF project called Keptn, an open source solution contributed by Dynatrace. Keptn provides several use cases. First you have progressive delivery: you can give Keptn the power to deploy, manage and test, similar to a CI/CD process, and also manage production use cases like auto-remediation and so on. Or, if I don't want to use everything and I just want to rely on my traditional CI/CD system, I can use Keptn only for quality gates, and this is the use case I'm going to use. Or, last, I can use the pure SRE and production use case, which is auto-remediation. Keptn is very easy to configure. It's based on files, on YAML files: you ship your SLO file, so we need to define the SLIs and SLOs in Keptn, and those will be used for the quality gate. And then we can connect our tools. Keptn is a framework based on CloudEvents, so it's an event-driven framework, and it's very easy to connect and disconnect tools; the beauty is that all the tools I'm going to use integrate with Keptn. So if we look at the pipeline that we had a few minutes ago: I deploy, I run my test, and right after the test itself I will send an event to Keptn, there's an API for that, to say: hey Keptn, I just finished the test, could you evaluate the environment during that time frame? I've already expressed a couple of SLIs and SLOs, so Keptn will look up the SLIs and SLOs that we've defined for that particular service. You say, for example: pod failures need to stay at zero, node pressure under 1%, and so on; you define what you expect. Once it has looked at all the SLIs, it reaches out to the observability backend that has the values, either Prometheus or Dynatrace, looks at the values and matches them against the objectives that we have defined. Then it presents the result of each individual SLO we've defined in a heat map, like you can see here. And the great thing is that at the end it provides a score, and I get a single overall result based on that score: you can say my experiment was fine if we have a score of 90%, for example. And all of that happens basically within a minute: as soon as my test ends, Keptn triggers the whole workflow I just described, gets back the score, and based on the score says okay, everything is green, or, the other way around, everything is red.
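As a sketch of what such an slo.yaml could look like for this use case, assuming the standard Keptn quality-gate SLO format; the SLI names must match whatever is defined in the corresponding sli.yaml, and the thresholds here are invented:

```yaml
# Sketch only: SLI names and thresholds are illustrative.
spec_version: "1.0"
comparison:
  aggregate_function: avg
  compare_with: single_result
  include_result_with_score: pass
  number_of_comparison_results: 1
objectives:
  - sli: response_time_p95
    pass:
      - criteria:
          - "<=600"        # p95 stays under 600 ms during the chaos run
    warning:
      - criteria:
          - "<=800"
  - sli: error_rate
    pass:
      - criteria:
          - "<1"           # less than 1% failed requests
total_score:
  pass: "90%"              # overall evaluation is green at 90% or more
  warning: "75%"
```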
So let's close with a few quick takeaways, a couple of things. First, in Kubernetes, define quotas: resource quotas to separate your namespaces. That is a really strong recommendation from the Kubernetes world. Define QoS, so requests and limits, which we talked about at length. Observability: of course, we need to collect metrics, logs and traces, we need to understand what's going on in our environment, so make sure to have all of that in place. SREs: of course, define SLIs and SLOs; we want to be smart, we want to automate, and without SLIs and SLOs it will be very difficult to be efficient. And last, we know that we could face problems related to Kubernetes, so let's test for them using load tests and chaos engineering, to validate that our cluster is stable, our users are happy, and there are no surprises in production. Before we actually finish this presentation, I will do another quick promo for my YouTube channel, Is It Observable? Check out the various episodes there; there is one dedicated to Litmus Chaos, chaos engineering, performance testing and so on. So check it out, and yes, I'm trying to improve the content, so if you can send some feedback, that would be great; then I can continue producing content and help you in your projects. All right, thank you for your time and enjoy the conference.
...

Henrik Rexed

Cloud Native Advocate @ Dynatrace



