Conf42 Chaos Engineering 2022 - Online

Chaos Engineering alongside Litmus and Jenkins


Abstract

Today, chaos engineering is becoming more and more prevalent, aiming for stronger resilience in information systems. The questions about its implementation, integration, and automation are numerous and of interest to everyone. In this talk, I am going to show you how to integrate chaos engineering using Litmus 2 within a Jenkins pipeline and test process, in order to promote a resilient application image to production and get notified of its chaos results via Slack.

Summary

  • Real-time feedback into the behavior of your distributed systems, with errors surfaced in real time, allows you to not only experiment with confidence, but respond instantly to get things working again. I'm delighted to be part of Conf42 Chaos Engineering 2022 as a speaker.
  • Akram Riahi is an SRE and chaos engineer at WeScale. The Chaos Week event is aimed at the cloud native community in France. We will see how chaos engineering is made easier with Litmus for developers and SREs to improve their resilience.
  • The infrastructure is based on AWS EKS and has been built via Terraform for the sake of the demo. The pipeline notifies Slack with the QA test and chaos results, such as the chaos report. If the chaos injection fails, it means that a chaos experiment failed.
  • The Litmus chaos workflow app is very simple code. ChaosCenter can be easily integrated with GitHub. The demo shows how it works. We can run experiments to test different workflows, and we can schedule them.
  • Starting chaos injection is a must in order to improve our app's resilience. We also have to keep enhancing one of the main requirements of chaos engineering, the alerting systems. All of that will reveal a lot of failures and expose many things.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again.

Hello, chaos folks. I'm delighted to be part of Conf42 Chaos Engineering 2022 as a speaker, alongside such great speakers. Today I will be presenting chaos engineering alongside Litmus and Jenkins. It's a REX, a return on experience, of what I have done with my client to improve the resilience of its app before promoting it to production, and to make it easier for developers and SREs to execute chaos. I'll explain it in a deep dive in a bit. I'm sure you have already seen many great talks today about the why and when of chaos engineering, so I will focus on the how, and it will also be shown in a demo.

Before that, I would like to introduce myself. I'm Akram Riahi. I'm an SRE and chaos engineer at WeScale. I'm the author of several blog posts related to chaos engineering and Litmus, and I'm an organizer of the Chaos Week, a week-long chaos engineering fest with great speakers such as Uma Mukkara from Litmus, Yury Niño, and speakers from Gremlin. Many other chaos engineering folks have also been part of the Chaos Week. This event is aimed at the cloud native community in France. I'm also part of the Paris Chaos Engineering Meetup, and I recently participated in the Chaos Carnival as a speaker.

Who are we? We are WeScale, created in 2015. We are already more than 50 experts whose mission is to help you become cloud native. We help you think, build, and master your cloud native architecture, continuously adapting to your maturity. What sets us apart is our high level of expertise and personalized support on cloud native technologies, with the conviction that know-how only has value if it is shared.
That's why we try to write many blog posts, run meetups, and so on. We are also a member of the CNCF and a partner of HashiCorp, as well as AWS and GCP.

Here is our menu for today. We will have an introduction. Then we will see together how chaos engineering is made easier with Litmus for developers and SREs to improve their resilience. We will also discover the harmony between Litmus 2, Jenkins, and Slack. We will deep dive into that via an amazing demo at the end.

With my clients, it's always hard to find a solution to test and improve our application's resilience before promoting it to production. It's also hard to ensure that it will be highly resilient, that we are not going to get an incident early in the morning, and that we are not going to get surprised. Today we are faced with two major choices. Either we create scripts and tests, which can take a lot of time and investment, and also a lot of consultants. Or we go with the chaos engineering discipline, with its scientific approach based on hypothesis and experimentation. But here the questions are: is it difficult? Do we have enough knowledge to do chaos? How can we deal with it on a daily basis, knowing that we are producing a lot of code that has to be tested in terms of resilience? And we are also going to face this question: if we say, okay, we can do chaos, is it going to be easy for developers and SREs to do that? Well, I'm very certain that the answer is yes. But how is chaos engineering made easier?

Now the delivery process has more steps, from dev all the way through the CI/CD. Each time a developer pushes code, it is tested by a classical approach called QA tests, with all types of tests that look for things we already know. It means that we are going to test things that we already know.
We are not going to see what we can't expect, in other words the unknowns. But we also have the right to get surprised sometimes by a problem we don't know and don't expect, in order to improve the app's resilience. For that reason, we have to enable developers to inject chaos in their DevOps pipeline as often as they want. So today our talk will be about this enablement, and how they can easily inject chaos via a simple push or a pull request.

To do that, we are going to present the environment. Our environment is based on AWS as the cloud provider and Kubernetes as the container orchestrator. We also have Jenkins for the CI/CD part, Terraform for the infra configuration, and Slack for notification, alerting, and communication. As we all know, communication in chaos engineering is very important: it lets people know that we are going to inject chaos so they don't get surprised, and it helps us collaborate to improve the app's resilience and, in other words, the system's results. We also have GitHub for our source code management. For the chaos injection, we will use our famous framework, LitmusChaos.

So what is LitmusChaos? LitmusChaos is an open source framework for chaos engineering that helps Kubernetes SREs and developers practice it in a Kubernetes-native way. Litmus was in the CNCF sandbox and is now in incubation, thanks to the great, big community behind it that supports Litmus and tries to improve it continuously. You can find it in the GitHub repo litmuschaos/litmus. It is now at version 2.6.0. The importance of Litmus lies in the chaos experiments it provides. Litmus provides a lot of chaos experiments for Kubernetes, AWS, et cetera.
It helps us inject various scenarios, such as CPU hog and memory hog, that target resources. We can also have experiments that target the network, such as network latency. That is at the pod level; we can also enlarge the blast radius in order to attack, for example, nodes. These experiments are available in the ChaosHub, which groups all these chaos experiments. The chaos experiments can be organized and executed, in other words orchestrated, within a chaos workflow.

So here's a question: what is a chaos workflow? A chaos workflow is a set of different operations, or chaos experiments, coupled together to achieve a desired chaos impact on a Kubernetes cluster. A chaos workflow is very useful for automating a series of preconditioning steps or actions that need to be performed before triggering the chaos injection. A chaos workflow can also be used to perform different operations in parallel to achieve a desired chaos injection scenario.

For example, say I see that my application is very affected by CPU and memory. If I want to test the impact of chaos injection on these two resources, I can create a workflow with two experiments, CPU hog and memory hog, and run them in parallel. Or say I have noticed that lately I have been getting a lot of network latency on this app, and I would like to reproduce it, randomly, on the different dependencies as well. I can create a workflow that chains, for example, two experiments running in parallel, CPU hog and memory hog, and then continues in series with network latency. This chaos workflow can be created through ChaosCenter, a portal that helps us see, observe, and monitor our workflows, and even create them.
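The parallel-then-serial scenario described above can be sketched as data. This is a minimal illustration, assuming the workflow is an Argo-style `Workflow` whose `steps` field is a list of lists (outer entries run in sequence, inner entries run in parallel). The workflow name, the entrypoint name, and the `litmus` namespace are hypothetical, and the per-experiment templates that a real workflow needs are omitted.

```python
import json


def chaos_step(name):
    """One workflow step that refers to a template of the same name."""
    return {"name": name, "template": name}


def build_workflow(name, stages):
    """Build an Argo-style Workflow dict.

    `stages` is a list of lists: stages run one after another, and the
    experiments inside one stage run in parallel.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # Name and namespace are assumptions made for this sketch.
        "metadata": {"name": name, "namespace": "litmus"},
        "spec": {
            "entrypoint": "chaos-scenario",
            "templates": [
                {
                    "name": "chaos-scenario",
                    "steps": [[chaos_step(e) for e in stage] for stage in stages],
                }
            ],
        },
    }


# CPU hog and memory hog in parallel, then network latency in series.
workflow = build_workflow(
    "cpu-memory-then-latency",
    [["pod-cpu-hog", "pod-memory-hog"], ["pod-network-latency"]],
)
steps = workflow["spec"]["templates"][0]["steps"]
print(json.dumps(steps, indent=2))
```

Keeping the scenario as a list of stages makes it easy to reorder experiments or widen a parallel stage without touching the rest of the structure.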
We will see ChaosCenter later in the demo.

So how is chaos engineering made easier with Litmus and Jenkins? To show that, we begin by presenting the infrastructure, which is based on AWS EKS. It has been built via Terraform for the sake of the demo. As for the chaos engineering requirements, we have deployed and configured the monitoring stack, composed of Grafana and Prometheus. We have also configured Slack to get notified with the necessary information and to communicate before executing chaos. I always insist that communication is one of the most important keys of the chaos engineering discipline. We have also configured Jenkins and GitHub so that the pipeline is triggered via pull request. And we have a container registry, such as Docker Hub or Artifactory.

So here the developer updates the app code and pushes, or creates a pull request, which triggers the pipeline. The pipeline notifies Slack that it has started, prepares the environment, then builds the application image and pushes it to dev, that is, to the Docker Hub container registry. Then the QA tests start. I'm not going to present the QA tests because they are not very important for our demo and presentation. Then the pipeline updates the deployed app with the new image, and then it injects chaos. Injecting chaos is done by applying the workflow we have already talked about; it's a Kubernetes custom resource that gets applied. Here we face two results, pass or fail. If it fails, it means that a chaos experiment failed and our app is not resilient, so we get notified via Slack. If it passes, our app is resilient, so it is promoted to production: tagged and pushed to the container registry with a prod tag.
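The gating logic of this pipeline can be sketched in a few lines. This is a simplified model, not the Jenkinsfile from the demo: the stage names follow the talk, every stage except the chaos injection is stubbed to succeed, and `run_pipeline`, `demo_stages`, and the message texts are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage], notify: Callable[[str], None]) -> bool:
    """Run stages in order and stop at the first failure.

    Returns True only if every stage passed, which includes promotion.
    """
    notify("pipeline started")
    for name, stage in stages:
        if not stage():
            notify(f"pipeline failed at stage: {name}")
            return False
    notify("pipeline succeeded")
    return True


def demo_stages(chaos_verdict: str) -> List[Stage]:
    """Stages from the talk; only the chaos stage depends on an input."""
    ok = lambda: True
    return [
        ("prepare env", ok),
        ("build image and push to dev", ok),
        ("QA tests", ok),
        ("update app deployment", ok),
        ("inject chaos", lambda: chaos_verdict == "Pass"),
        ("promote to production", ok),
        ("clean up", ok),
    ]


messages: List[str] = []
promoted = run_pipeline(demo_stages("Fail"), messages.append)
print(promoted, messages)
```

Because promotion sits after the chaos stage, a failed experiment stops the run before the image can receive a prod tag. Whether cleanup should also run after a failure is not covered in the talk, so this sketch leaves it out.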
Then we clean up the resources that were created with the chaos workflow, and also the chaos results, which are custom resources for Litmus. Then the pipeline notifies Slack with the QA test and chaos results, such as the chaos report.

Now we are going to move to the amazing part, which is the demo. So get ready for it. To begin with, I will present the code, which is very simple. Let's look together at the Litmus chaos workflow app. Our app is running and says "Hello chaos folks"; I will update it later. Here is our app, with the Dockerfile of the app, which is an Apache PHP app, and also the Jenkinsfile for the pipeline. We have the prepare stage, the build image and push to dev stage, the QA test, and the update of the app deployment; our app is deployed as a Deployment. Then comes the chaos injection, done via a script that applies a workflow. And if everything goes fine, we promote the app. Here is the manifest of the deployment for the app.

We also have different scripts: the chaos script and the cleanup script. The chaos script applies the workflow, here for example the CPU hog workflow. The cleanup script cleans up the workflow that was created and also the chaos result, which is where we get the report that is sent to Slack. In the workflow, we can find three steps: install chaos experiment; pod CPU hog, which is the experiment we are going to run; and revert chaos, which deletes the runners and the resources that were created through this workflow. In order to target our app, we have to update the app info, which is here.
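As a rough idea of what the chaos and cleanup scripts do, the sketch below builds the `kubectl` command lines without running them. The actual scripts are not shown in the transcript, so the file name, workflow name, and the `litmus` namespace are assumptions; only standard `kubectl apply -f` and `kubectl delete` usage is relied on.

```python
import shlex


def chaos_commands(workflow_file, namespace="litmus"):
    """Command that applies the chaos workflow manifest."""
    return [["kubectl", "apply", "-f", workflow_file, "-n", namespace]]


def cleanup_commands(workflow_name, namespace="litmus"):
    """Commands that remove the workflow and the chaos results."""
    return [
        ["kubectl", "delete", "workflow", workflow_name, "-n", namespace],
        ["kubectl", "delete", "chaosresult", "--all", "-n", namespace],
    ]


# Hypothetical file and workflow names, for illustration only.
commands = chaos_commands("pod-cpu-hog-workflow.yaml") + cleanup_commands(
    "pod-cpu-hog-workflow"
)
for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as argument lists keeps quoting safe if they are later passed to `subprocess.run`. In a real pipeline the chaos result would have to be read, for the Slack report, before the cleanup deletes it.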
In the ChaosEngine resource here, for example, we target the app namespace, which is app, with the app label, which is the chaos carnival demo one. The app kind is deployment. For the CPU hog, we run the chaos for a 60-second chaos duration and target one CPU core. This will trigger the chaos workflow. And this is the revert part. If we don't need to revert, we can delete this part; that keeps the different runners up so we can check the logs. We also have, for example, memory hog, and the pod delete experiment, which deletes pods randomly, generally for a deployment.

Okay, so we have seen that the workflow is defined as code. You can also get it from ChaosCenter. Here, for example, is the Litmus ChaosCenter, where we can see the different workflows that have been run, the chaos agents, which connect the clusters, and the ChaosHub, which contains many experiments for Azure, AWS, et cetera. We have the observability part, where we use our own Grafana. In the settings, we have team management and user management, and we can also integrate ChaosCenter with GitHub. So, for example, if I create a workflow here, it is pushed directly to the GitHub repository.

So here's the question: how can I get this CPU hog workflow? It's very easy. From ChaosCenter I create a new one. For example, I call it conf42. Next, I add a pod CPU hog experiment. Here I target the app namespace, the kind is deployment, and the app label is chaos carnival. I can also define the steady state of the app with a probe. It's very important.
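The targeting and tuning described above map onto a ChaosEngine manifest roughly as follows. This is a sketch under assumptions: the field names (`appinfo`, `appns`, `applabel`, `appkind`, `TOTAL_CHAOS_DURATION`, `CPU_CORES`) follow the Litmus ChaosEngine schema as I understand it, while the engine name, the service account, and the label value `app=chaos-demo` are placeholders, since the exact label used in the demo is not recoverable from the transcript.

```python
import json


def pod_cpu_hog_engine(app_ns, app_label, app_kind, duration_s, cpu_cores):
    """Build a ChaosEngine dict that targets one application."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        # Engine name and namespace are assumptions made for this sketch.
        "metadata": {"name": "pod-cpu-hog-engine", "namespace": "litmus"},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": app_ns,
                "applabel": app_label,
                "appkind": app_kind,
            },
            "chaosServiceAccount": "litmus-admin",  # assumed account name
            "experiments": [
                {
                    "name": "pod-cpu-hog",
                    "spec": {
                        "components": {
                            "env": [
                                # Env values are strings in the manifest.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                {"name": "CPU_CORES", "value": str(cpu_cores)},
                            ]
                        }
                    },
                }
            ],
        },
    }


engine = pod_cpu_hog_engine("app", "app=chaos-demo", "deployment", 60, 1)
print(json.dumps(engine["spec"]["appinfo"], indent=2))
```

Only the three `appinfo` values and the two tunables change from one target to another, which is why the talk describes updating the app info as the main step before reusing a workflow.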
For the sake of the demo, we are not going to use it. We could use HTTP, CMD, Prometheus probes, et cetera. Then, to tune the experiment, I go for a 60-second chaos duration and one CPU core, and click Finish. Here we have the resiliency score, or reliability score. For me, the CPU hog is very important, so I give it ten. If it were less important, I could give it six, or even something from zero to three. I give it ten and schedule it now. Before scheduling, I can click View YAML, copy this YAML, paste it into VS Code, and push it to our GitHub repo.

Here, for example, I have already created a pull request called Trigger Chaos Conf. This pipeline run prepared the app for me, built the image and pushed it to dev, ran the QA tests, and then updated the app. At the beginning, we received a notification saying that the pipeline had started. At the end, after the cleanup, we get a notification saying that the pipeline has succeeded, along with the chaos result: the experiment name, which is pod-cpu-hog-exec, the verdict, Pass, and a resilience score of 100, which means that our app is fully resilient and doing great.

Now I will update the app. I will write "Hello chaos folks from around the world" and push it: git add, git commit, git push origin. This triggers the pipeline on master. Here the master build is triggered; it's pending. Then we get a notification saying that it has started. It's waiting in the queue, and here it has started.
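The weights given to experiments (ten, six, zero to three) feed into the resiliency score. A minimal sketch, assuming the score is the weighted average of each experiment's probe success percentage, which is how I understand Litmus to compute it; the function name and the second example are illustrative only.

```python
def resiliency_score(experiments):
    """Weighted average of probe success percentages.

    `experiments` is a list of (weight, probe_success_percentage) pairs,
    with weights from 0 to 10 as chosen in ChaosCenter.
    """
    total_weight = sum(weight for weight, _ in experiments)
    if total_weight == 0:
        raise ValueError("at least one experiment needs a non-zero weight")
    return sum(weight * pct for weight, pct in experiments) / total_weight


# The demo case: a single experiment with weight 10 that fully passes.
print(resiliency_score([(10, 100)]))

# A hypothetical mix: an important experiment passes, a less important
# one passes only half of its probes.
print(resiliency_score([(10, 100), (6, 50)]))
```

With a single fully passing experiment the score is 100, which matches the result reported in the demo. In the second case the lower-weighted experiment pulls the score down to 81.25 rather than to the plain average of 75.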
Then it will inject the chaos, and we go all the way through the different steps shown here: the QA tests, then updating the app. Here it has updated the app. Here it has injected the chaos, and we wait for the chaos to finish. When the chaos finishes, we will see that the app has been updated, and we will see the CPU hog. We can see that several resources are created in the litmus namespace, for example the runners that execute the experiment, which is the pod CPU hog. Once it's finished, it creates the chaos result for us, which is updated with the verdict and the reliability score. We also have the workflow resources that are created in the litmus namespace, in running state, which run our experiment.

Here the pipeline is waiting for chaos to finish, and it has certainly finished on master. If we take a look at the Slack notifications, we are notified that the app has been updated with the new image. It sometimes takes a while to get notified; it looks like a connection issue. We will wait for it to finish, and then we will get the chaos result, with the experiment name, CPU hog, et cetera. It might take some time to get notified. Going back to our pipeline while it finishes: let's see, succeeded. So it's a notification delay; the Internet connection is lagging. If all is good, we will see the promotion. Yes, the pipeline has already finished, so the app is updated. And here we would normally get the experiment name with the result and the report.

Going back to our presentation, I hope that you enjoyed the demo.
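Turning a chaos result into the Slack notification shown in the demo could look like the sketch below. It assumes a ChaosResult object shaped like the Litmus 2.x resource (`spec.experiment`, `status.experimentStatus.verdict`, `status.experimentStatus.probeSuccessPercentage`) and a Slack incoming-webhook payload with a single `text` field. The message wording and the sample object are made up for illustration, and nothing is sent over the network.

```python
import json


def slack_message(chaos_result):
    """Build a Slack incoming-webhook payload from a ChaosResult dict."""
    experiment = chaos_result["spec"]["experiment"]
    status = chaos_result["status"]["experimentStatus"]
    verdict = status["verdict"]
    score = status["probeSuccessPercentage"]
    return {
        "text": (
            f"Chaos result: experiment={experiment}, "
            f"verdict={verdict}, probe success={score}%"
        )
    }


def is_resilient(chaos_result):
    """The pipeline promotes the image only when the verdict is Pass."""
    return chaos_result["status"]["experimentStatus"]["verdict"] == "Pass"


# Sample object written by hand for this sketch, not real cluster output.
sample_result = {
    "spec": {"engine": "pod-cpu-hog-engine", "experiment": "pod-cpu-hog-exec"},
    "status": {
        "experimentStatus": {
            "phase": "Completed",
            "verdict": "Pass",
            "probeSuccessPercentage": "100",
        }
    },
}

payload = slack_message(sample_result)
print(json.dumps(payload))
print(is_resilient(sample_result))
```

The same verdict check can drive both the notification and the promotion decision, so the Slack report and the pipeline outcome cannot disagree.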
So, as we have seen, starting chaos injection is a must in order to improve our app's resilience. Before chaos injection, we always have to communicate what we are going to do. It's very important for the other teams, in order to keep everyone posted that there may be downtime or something like that. We also have to automate chaos more and more, for example with Jenkins, as we have seen, or with other tools, in order to improve continuously. We have to keep enhancing one of the main requirements of chaos engineering, the alerting systems, in order to get notified when we have errors or incidents. And we have to keep enhancing the monitoring systems. All of that will reveal a lot of failures and expose many things that we had forgotten or didn't have the chance to take into account in our infrastructure or our system. So we don't have to be afraid, and we have to keep moving forward. I believe the key to success is that we need to hack failure before it happens, and it's very important to learn from our failures. I hope you enjoyed it. I would like to thank you very much for attending this session. See you soon.

Akram Riahi

LitmusChaos Leader @ WeScale



