Conf42 Chaos Engineering 2024 - Online

Chaos Validation Made Easy: Plug & Play with Resilience Probes

Abstract

Observability is crucial in Chaos Engineering. Beyond injecting faults, measuring impact, and ensuring SLA compliance are vital. Resilience Probes in Harness Chaos Engineering streamline chaos impact measurement and integrate with various data sources and observability platforms for SLA validation.

Summary

  • This talk is on chaos validation made easy: plug and play with resilience probes. Neelanjan is a software engineer working at Harness and Sayan is a senior software engineer at Harness; both are LitmusChaos maintainers. Thanks to Conf42 for having us and really looking forward to you guys enjoying the talk.
  • Downtime has many adverse effects for an organization. Chaos engineering can help in uncovering the weaknesses in a system. It is the way to go for all those enterprises that want to prioritize resiliency and reduce downtime.
  • Resilience probes are reusable, pluggable checks that can be used in your experiments. There are multiple ways you can configure and use a probe for your specific application, with different modes for how these probes are executed.
  • app.harness.io allows you to set up and test your chaos infrastructure. The same features are also available in the open source LitmusChaos, so the approach is platform agnostic. Of course there is a free trial, so definitely go ahead and check it out.
  • You can upload your own YAML, create a template, or start from a blank canvas. This is the section where you actually add the resilience probes. Then we run an experiment and see how the observations come out.
  • For monitoring, a Grafana and Prometheus integration is set up. Once we run the experiment, we check if certain rules are met or not, install the chaos faults, and then perform the pod delete. The aim is to determine the resilience percentage of an application.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone, and welcome to this talk on chaos validation made easy: plug and play with resilience probes. My name is Neelanjan, and I'm a software engineer working at Harness, and I'm also a LitmusChaos maintainer. Joining me today as my co-speaker is Sayan. Hey guys, what's up? My name is Sayan. I'm a senior software engineer at Harness, and I've also been maintaining LitmusChaos, the open source project. Thanks to Conf42 for having us, and really looking forward to you guys enjoying the talk. All right, with that out of the way, let's get going with chaos.

Let's start the talk with a very important question: what causes downtime? Sure, we all have been there and experienced it, perhaps multiple times. However, it never gets easy. Downtime has many adverse effects for an organization. Take the instance of Slack, whose SLA violations led to an $8 million payout and gravely impacted the company's revenue. Wells Fargo, the financial giant, suffered a power shutdown in a data center due to smoke, which caused loss of transactions, and some direct deposit checks were not reflected in its accounts. In this instance, a single hour of downtime cost them over $100,000. Lastly, British Airways had to cancel 400 flights, which left 75,000 passengers stranded and cost them over 100 million in losses. In this case, it was an issue in one server that cascaded to other servers, impacting the billing systems. Therefore, downtime is often the result of a combination of issues in a system. With the ever increasing complexity of cloud native microservice applications, the question remains: how can we ensure that our distributed systems always withstand adverse and potentially unforeseen situations?

So why are we not better prepared at managing downtime? First of all, microservices are prone to downtime. While one can prepare for the apparent causes that need attention, no one can fully anticipate an overwhelming downtime before it takes place, as there are a plethora of ways in which things can go wrong. And that's where chaos engineering can help in uncovering the weaknesses in a system and becoming better prepared at managing the various downtime scenarios. Also, chaos and failure scenarios can be difficult to run while ensuring the safety of the target resources, and often there isn't a good culture around it, which makes it difficult to implement and scale. Lastly, as more volume of code gets pushed over time in any organization, it becomes difficult to assess the system against its weaknesses at scale, due to the lack of chaos integration in the CI/CD pipeline at the development stage; and by failing to effectively measure the impact of the faults automatically at scale, it becomes difficult to assess the resilience of any application.

To better understand it, consider this: your applications, now being cloud native, stand atop a plethora of other services that determine their functioning and resiliency. You have your application dependencies, then the other cloud native services provisioning the underlying infrastructure, the Kubernetes layer itself, and lastly the platform on which your application is deployed. Failure in any one of these services can cause your entire application to not be able to cope. The problem is only accentuated as more code is now shipped more frequently, at a weekly or even shorter cadence, and is expected to run in multiple different environments.
This unpredictability of the application behavior is the prime cause of service outages, since there is no reliable way to know how our application will behave when subjected to an unanticipated situation. Therefore, chaos engineering is the way to go for all those enterprises who want to prioritize resiliency and reduce downtime for their customers. The key to successfully practicing chaos engineering is to understand the complexity involved in your system through realistic experiments and hypothesis conditions, and then slowly scale them up so that all parts of your application can be assessed.

However, what does a good chaos engineering practice look like, and how can you implement one? As far as general best practices go, chaos engineering is a culture oriented approach which finds its place as part of DevOps practices, and hence developers and SREs should work together for the best results. While developers should run chaos experiments from an early stage in development and slowly scale up their tests to cover all the different kinds of chaos scenarios, SREs should focus on how to make chaos engineering practices scalable enough to run in their CI/CD pipelines, as well as execute these tests within the staging and eventually the production environment. Also, it is paramount to have a robust set of chaos experiments that can cover all the different types of failures that might potentially affect the application. Lastly, you need good observability to assess the impact of the chaos throughout the system, and hence your chaos engineering tool should provide enough insights to help you understand if the application is deviating from its steady state in an unanticipated manner.

So how do you implement a great chaos engineering practice within your organization? Well, Harness Chaos Engineering can help you get there. Harness Chaos Engineering tackles the problem by providing a streamlined platform with powerful features which help you get started. It provides simplified experiment creation: instead of writing complex scripts, Harness Chaos Engineering offers a declarative approach, allowing you to define experiments as code, version control them, and easily integrate them into your workflow. It provides you with an extensive fault library: whether you are targeting Kubernetes, AWS, Azure, GCP, VMware, Linux, or even your custom services, Harness Chaos Engineering provides a rich library of prebuilt faults, on top of which you can create your own chaos experiments. You can select, customize and combine these faults to create realistic scenarios that can stress your system in various ways. Also, you can leverage the ChaosHubs to store and provide access to these faults and experiments throughout your organization. It also provides you with real time monitoring and metrics, wherein Harness Chaos Engineering leverages Prometheus to provide real time insights into your system's behavior during the experiment runs. You can visualize the metrics, correlate the impacts, and gain a deeper understanding of your chaos effectiveness. And you can automate your experimentation with Harness Chaos Engineering: once you have defined your experiments, it's time to automate them. You can schedule regular runs, integrate them with your CI/CD pipelines, and continuously assess your system's resilience. This proactive approach helps you identify the weaknesses before they can cause downtime.
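
As a flavor of what the declarative approach mentioned above means in practice, a pod-delete fault in the open source LitmusChaos is expressed as a ChaosEngine manifest roughly like the sketch below. This is illustrative only: the namespace, labels and durations are placeholders, and the Harness platform generates and manages the equivalent manifests for you.

```yaml
# Illustrative LitmusChaos ChaosEngine for a pod-delete fault.
# Values are placeholders; Harness Chaos Studio generates this for you.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: boutique-pod-delete
  namespace: boutique
spec:
  engineState: active
  appinfo:
    appns: boutique              # namespace of the target application
    applabel: app=cartservice    # label selecting the cart-service deployment
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"        # inject chaos for 30 seconds in total
            - name: CHAOS_INTERVAL
              value: "10"        # delete a target pod every 10 seconds
```
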
Lastly, Harness Chaos Engineering also provides you with advanced features such as resilience scoring, private ChaosHubs and security chaos faults, so you can tailor your chaos experiments and gain a deeper understanding of your system's vulnerabilities. In short, Harness Chaos Engineering provides you a plethora of benefits, including reduced downtime: by proactively identifying the weaknesses, you can fix them before they can cause outages, which leads to improved uptime and user experience. It also helps you with faster recovery, in the sense that Harness Chaos Engineering helps you build systems that can automatically recover from failures, and therefore it can help you minimize the downtime and the impact on your business. It can also aid you with the validation and optimization of your disaster recovery setup. Lastly, it helps you reduce costs by avoiding unplanned downtime, which translates to cost savings, as your resources aren't wasted on recovery efforts.

Getting started with Harness Chaos Engineering is as simple as choosing your platform of choice, that is, SaaS or on-premise. The SaaS platform is deployed on the cloud, while the on-premise platform can be deployed into your own environment. Once you have selected the platform of your choice, you can pick an experiment, and depending on that experiment, be it a Kubernetes, AWS, GCP, or any other type of chaos engineering fault, you can select the blast radius to which you want to affect your target application and your target environment, and then choose to execute the chaos experiment. Upon executing any chaos experiment, you'll be able to see and observe the chaos impact, as well as measure the critical metrics which give you an insight into what is happening throughout your system when a chaos experiment is run. Finally, when you have found enough confidence with your chaos experiment runs, you can automate with the CI/CD tooling of your choice. Harness Chaos Engineering experiments integrate out of the box with Harness CI and CD; however, you can also leverage the APIs provided by Harness Chaos Engineering to integrate it with any CI/CD tooling of your choice.

So, observing the impact of chaos at scale can be difficult, especially if you are performing chaos experiments in CI/CD pipelines. To overcome this, Harness provides resilience probes. Let's hear from Sayan how they work.

Thanks, Neelanjan, for talking about chaos engineering, its practices, and how SLAs are important in this practice. As mentioned, I'm Sayan. I'll be talking about resilience probes and giving you guys a hands-on demo as well on how you can practically approach probes and use them in your regular day-to-day applications. So before jumping right into the actual hands-on approach, let's first understand what probes are. What is a resilience probe? What is this term that we are coining? Resilience probes are nothing but reusable, pluggable checks that can be used in your experiment. Let's say you have an application where you want to run some kind of query, or there are some monitoring parameters that you want to check or assert based on certain criteria; you can put a probe in that specific fault. What that will do is go and query or check the aspect that you have configured the probe on, and then return some values based on which you can do your chaos validation. To understand this in further depth, we'll of course take a deeper look into it, but yeah, that's the general gist of it.
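
To make that gist a bit more concrete, a resilience probe definition boils down to a name, a type, a mode and some run properties, roughly as in the sketch below. This is a hedged illustration based on the LitmusChaos probe schema: the probe name and values are hypothetical, the exact field names differ between versions, and the Harness UI normally generates this for you.

```yaml
# Hypothetical resilience probe definition (LitmusChaos-style schema).
name: frontend-availability-check   # unique name; defined once, reused across experiments
type: httpProbe                     # other types: cmdProbe, promProbe, k8sProbe, ...
mode: Edge                          # when it runs relative to the chaos; in newer versions
                                    # the mode is chosen when you attach the probe to a fault
runProperties:
  probeTimeout: 10s                 # fail a single check if it takes longer than this
  interval: 2s                      # wait between retries
  attempt: 3                        # retry a failed check up to this many times
  initialDelay: 5s                  # wait before the first check runs
```
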
So it's basically a write once, use anywhere kind of paradigm, which means you just create the probe once and then you are free to use it in as many chaos experiments or faults as you want to attach it to. That's, in brief, what a resilience probe is. Now, how do you use this probe? You basically have to configure a resilience probe globally. For example, let's configure a health check probe which checks the health of my application, my pod, my container or anything else. Once I generalize and create this health check probe, I have to add the necessary probes to my specific faults and then observe the impact they report for my specific experiment.

So what are the different types of probes we have? Right now we have two infrastructure-based sets of probes. For the Kubernetes infrastructure we have HTTP, CMD, Kubernetes, Prometheus, Datadog, Dynatrace and SLO probes as of today, and for the Linux one we have support for HTTP, CMD, Datadog and Dynatrace. So what are the typical use cases that you would normally see for probes? This is definitely not an exhaustive list, but it is something we came up with. Some of the use cases would be to query health or downstream URIs, or to execute user-defined health check functions, or any user-defined functions for that matter. You can perform CRUD operations on your custom Kubernetes resource definitions, you can execute PromQL queries using Prometheus probes, or you can validate your error budget using the SLO probe. You can also do entry and exit criteria checks dynamically with probes. So there are multiple ways you can configure and use a probe in your specific application; as I mentioned, this is just not an exhaustive list.

There are different modes for how you might want to execute these probes, and this depends on what behavior you are trying to achieve. For example, SOT is start of test and EOT is end of test. If you want to execute your probes just before the chaos execution starts, you can use the SOT mode; for EOT, the assertion runs after your chaos finishes. OnChaos does the assertion while the chaos is happening, Continuous runs throughout the entire chaos execution flow, and Edge runs both before and after the chaos: before the chaos happens it runs the assertion, and after the chaos finishes it runs another assertion. So yeah, these are the different modes that are available for probes as of today.

Now let's jump right into the hands-on demo. So now I'm in the Harness platform; as you can see in the URL, it's actually app.harness.io. What it would look like normally is something like this: you might have to sign in, or if you're new you can click on sign up and create an account. You can use social sign-in as well, depending on your choice. Once you are logged in, you would definitely get a free trial as well as some free-to-use modules which you can give a try. And of course you have the free trial, so definitely go ahead and check it out. Once you're inside, you would see all these different modules. You can quickly navigate to the chaos module and then create a project just for testing or just to explore. I already have a project selected in here, and in this I've gone to the resilience probes tab. This is where I can see all my different resilience probes. Currently I've filtered them via the conf42 tag.
That's why you're only seeing the four probes that I pre-created a couple of hours ago. This is how the probes would look, irrespective of which platform you're in. Let's say you're not trying this on Harness: these features and functionalities are also available in the open source LitmusChaos, so it does not matter, it's platform agnostic. You'll get the same level of features in the open source version as well. So moving forward, these are some of the probes I've pre-configured: there's a Prometheus probe, there's an HTTP probe, a CMD probe, and one Kubernetes probe that I've also configured. But for the demo we'll mostly be using three of the probes, and we'll be trying to assert certain criteria and validate our microservice application, which is the Online Boutique, and we'll be doing some probe validations on top of that application.

Just to give you a brief tour of the setup that I have: I have a GKE cluster running, in which I have monitoring set up with Prometheus and Grafana. I have my Boutique application set up; this is the microservice demo application that I'm going to use and do chaos on. And this is the infrastructure setup that I have for Harness. Harness requires you to have an environment in which you can deploy your chaos infrastructure, so this is the infrastructure that I've connected, which is nothing but the GKE cluster. Cool. All right, now let's move on and actually see the application. This is the Online Boutique application. As you can see, there are multiple items which I can select, and I can add things to the cart. Let's say I want to add some sunglasses; I can add two of them to the cart. Once I do that, I can go to my cart and see that the cart is functional. If you go to the microservice list, you would see that there's a service called cart service; this is what's responsible for handling all the cart-related activity. So what we'll do is try to break this service. We'll do a simple pod delete, but on this specific service, and we'll take it down, we'll kind of disrupt this service. This is just a very simple application, but what we want to do is all these different kinds of validations on top of that disruption.

For example, let's take the HTTP probe for now. If we go over to the HTTP probe and see the probe configuration, we can see that we have a certain set of timeouts, attempts, interval and initial delay. So we want it to start after a certain delay, we want it to have a certain interval between retries if it fails, and we specify how many times we want it to attempt again if the first one doesn't succeed. And what we are doing in the probe is in the probe details, where you get all the information: we are trying to check or connect to this specific URI, which is the FQDN link for the front end of the Boutique app, and we are checking if it is accessible. That is, whether it's actually returning a response code of 200 or not; we are checking if it's actually live, if this FQDN link is reachable and we can navigate to that specific endpoint or not.

Now coming back to the probes screen again, a lot of the other probes would also be listed, because we just got rid of the filter. But if we check the CMD probe, what it's actually doing, if we go to the configuration, is a kubectl get pod in the boutique namespace.
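
Before going deeper into the CMD probe, here is roughly what the HTTP probe just described boils down to in probe-spec terms. The URL is a placeholder and the field names follow the LitmusChaos httpProbe schema, which may differ slightly from what the Harness UI stores internally.

```yaml
# Illustrative httpProbe inputs for the front-end availability check
# described above (the URL is a placeholder for the frontend's FQDN).
httpProbe/inputs:
  url: http://frontend.boutique.svc.cluster.local:80
  method:
    get:
      criteria: "=="        # compare the response code...
      responseCode: "200"   # ...against 200, i.e. the front end is reachable and live
```
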
So if you see over here, this is in the boutique namespace: it's trying to check, in the boutique namespace, for the cart service, which is the microservice we want to target and take down, whether it's actually in the running state, and if so, what its count is. We want at least one cart service pod to always be present in the running, healthy state. So we are asserting in the comparator, with the integer criteria, that the value should be greater than zero: there should always be at least one cart service pod. That is what this specific probe is doing. And the third is the Prometheus probe, which is asserting on a specific Prometheus endpoint. It's checking the average over time, with a PromQL query, of the probe success percentage, and it's checking that, according to our evaluation criteria, this should be greater than or equal to 90. So we are saying that only if the probe success percentage is greater than or equal to 90 do we consider this fault run resilient. That's our assertion, that's our hypothesis of what we want to do.

To configure any new resilience probe, you can go over to the new probe button and choose which infra type you want, Kubernetes or Linux. For Kubernetes you can go ahead and select any of the probe types, let's say HTTP, and give it a name. This is a unique name, so once you assign it you can't really get rid of it; be mindful about that. You can force delete it, of course. But yeah, let me just call it HTTP probe one one one or something. You can configure it, click next, and set up the timeouts for this one, for example something like this. And if I go next, this is where you give your probe details, so similar things: you can choose the GET or POST method, and if I choose POST, you can choose the HTTP criteria, whether you want to compare the response code or the response body. So yeah, this is just an example of how you can go ahead and configure whatever probes you want, and once you do that, they'll show up like this.

Now let's come to the scheduling part and actually try to run an experiment and see the observations, see how it goes. Let's create a new experiment. I'll call it boutique app conf42 and I'll select the Kubernetes infrastructure type. In here I would select the conf42 infra that I created, and I'll just apply. You have a few options: you can upload your own YAML, you can create your template, or you can just start from a blank canvas. If I start from a blank canvas, I would just filter from the ChaosHub what I want to do. In this case I want to do a pod delete. If I select pod delete, I'm given certain choices of where I want to do the pod delete. In this case I want to select the app namespace, which is the boutique namespace where my app is currently present, and I'll select the kind, which is nothing but a deployment, and the target, which is nothing but the cart service. These are all the labels that are present in my boutique namespace, but I just want to target the cart service. Now that I've selected this, I can go ahead and tune the fault if I want to; I don't really have a use case for that now, so I'll just leave it, let it be. And this is the section where you actually add the resilience probes. In the probe section, currently I don't have any probes added to my specific fault, but they are configured in the resilience probe section.
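
For reference, the CMD and Prometheus probes described above map roughly onto inputs like the following. This is an illustrative sketch based on the LitmusChaos cmdProbe and promProbe schema; the exact command, Prometheus endpoint, metric name and query are placeholders inferred from the narration, not the literal values used in the demo.

```yaml
# Illustrative cmdProbe inputs: count running cart-service pods and
# assert that at least one is present (mirrors the check described above).
cmdProbe/inputs:
  command: "kubectl get pods -n boutique | grep cartservice | grep -c Running"
  comparator:
    type: int
    criteria: ">"     # actual count must be strictly greater than...
    value: "0"        # ...zero, i.e. at least one healthy cart-service pod

# Illustrative promProbe inputs: average the probe success percentage
# over time and require it to stay at or above 90%.
promProbe/inputs:
  endpoint: http://prometheus.monitoring.svc.cluster.local:9090   # placeholder endpoint
  query: "avg_over_time(probe_success_percentage[2m])"            # placeholder PromQL query
  comparator:
    type: float
    criteria: ">="
    value: "90"
```
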
So what I'll do is click select, and all the different probes which are eligible to be added to your specific experiment and faults will show up. I'll just select the HTTP one and add it to the fault. For this one I want an SOT check, in the sense that whenever the chaos starts, before that I want to do an assertion and check if the front end is actually live or not. So I'll apply that. Secondly, I want to add one more probe, which is the CMD probe. The CMD probe is doing nothing but checking if the cart service is in the running state. We want to assert this before the start of the chaos and after the end of the chaos. What that means is: before the start it was already running, then chaos happened, so it might have gone down, but after the chaos finished, did it come back up again or not? That kind of check I can do with the CMD one, so I'll just run it in the Edge mode and apply that. Next I'll do the Prometheus one. For Prometheus we are again checking the probe success percentage, so for this I would want it to go in Continuous mode: keep checking throughout, within the polling interval that you specify, so that I get constant verification. Now I apply the changes.

Now if you look at the YAML, it might scare you, because it's a big YAML. So where are the probes? The resilience probes are added here, in the annotation. Since we have certain probes configured in the hub, things like the health check might pop up, which is another probe right here, but this is not considered a resilience probe; this is something we keep for backward compatibility. You can also go ahead and remove it, it should not affect your application or your fault. But yeah, these are the three probes we have added. And if you want a little more information on where you can add probes, in the documentation at developer.harness.io you can go to any of the probes, let's say the CMD probe, and you can see the exact place where you have to define your probes; this is the old legacy way, if you want more information on that. But this is not something you're doing by hand; it's already pre-created if you're using the Chaos Studio UI, so you don't have to worry too much about it. Cool. So that's that. Now let's save, give it a minute, and let's just run it.

Once we run it, we are checking if certain rules are met or not, then we are actually installing the chaos faults, and then we'll actually do the pod delete. If I go back to my application, to my microservices, you can see that the cart service's age is 70 minutes; it has been running for 70 minutes as of now. When the chaos happens this pod will actually terminate, so the age will come down to seconds; we'll see that as well. As of now you can see that some things pop up here, like this one running for 2 seconds, since the boutique app experiment just started. If I go back to my application, currently everything works well and good, but once the chaos actually happens and I click on the cart, things should start breaking. For the monitoring I have actually set up the Grafana and Prometheus integration, so you can see that the chaos injection is actually starting, based on the annotations at the bottom, and this is the cart QPS that's going to be affected, because the cart service is the one we are targeting.
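
To recap how the three probes ended up attached to the pod-delete fault, the older inline style referenced in the documentation above would look roughly like this. The probe names are placeholders and the structure is a simplified sketch of the legacy LitmusChaos probe list; Chaos Studio generates the equivalent probe references in the experiment manifest for you.

```yaml
# Illustrative legacy-style probe list on the pod-delete fault,
# mirroring the modes chosen in the demo (names are placeholders).
probe:
  - name: conf42-http-probe
    type: httpProbe
    mode: SOT          # assert the front end is reachable before the chaos starts
  - name: conf42-cmd-probe
    type: cmdProbe
    mode: Edge         # assert cart-service pods are Running before and after the chaos
  - name: conf42-prom-probe
    type: promProbe
    mode: Continuous   # keep polling the success-percentage query throughout the chaos
```
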
So you can see the QPS go down, which means the cart service is actually starting to get affected. And if you come over and check the logs of this run, you would see the probe logs as well. If I go down, you can see the health check probe has passed; this is the default legacy one. You can see the conf42 HTTP probe has also passed; maybe I can zoom in. Yeah, the conf42 HTTP probe has also passed, which is just doing the assertion. This is an SOT one, so before the chaos started, it did this assertion. Now the chaos is going on, and you can see things like the CMD probe, which runs before and after the chaos, pop up. In here you can see the conf42 CMD probe has passed because, where is it, yeah, its criteria is to be greater than zero and its actual value is one. So that means it did find something in the running state; the count is one. And now you can see the Prometheus probe is actually failing, because its actual value is 88.33%, whereas it should be greater than or equal to 90. So in this case our specific application is not really that resilient against this specific fault, because according to our criteria it should have been greater than or equal to 90. If your application is resilient, you would have configured it in such a way that the value actually stays greater than or equal to 90, so that you can term it resilient.

Now if we go back to the Boutique and click on the cart, I think it has actually restarted. Yeah, as you can see, the cart service has restarted and its age is 89 seconds. That's why you did not see the chaos in the UI, because it restarted that quickly. But you can see that this pod did terminate, and because it terminated and came back up, you can see the age difference. If I go back to my Boutique app now, you can see the fault injection; you can see that the chaos injection is finished and the annotation has stopped going further. So yeah, that is just a brief assertion of what I wanted to show you. And if I come back to the probes section, you can see all the different probes mentioned here as well. The HTTP probe passed because its expected code was 200 and we received 200. The CMD one passed because we wanted a value greater than zero and its actual value was one. But unfortunately the Prometheus one failed because it received 88, while we wanted greater than or equal to 90. So yeah, that's how we can determine the resilience percentage of our application. Coming back to the experiment, I think this one also finished, so we will get the resilience score for this. Yeah, so it's 75. You can still make it better, but it's actually okay; you should definitely look into what's wrong with your application and change it.

So yeah, that's all from me and Neelanjan on resilience probes and chaos engineering. If you have any questions, you can use our social handles to chat with us. I hope you guys enjoyed it. Thanks for watching.
...

Neelanjan Manna

Software Engineer @ Harness

Neelanjan Manna's LinkedIn account Neelanjan Manna's twitter account

Sayan Mondal

Senior Software Engineer @ Harness

Sayan Mondal's LinkedIn account


