Conf42 Incident Management 2022 - Online

On Call Like a King - How we utilize Chaos Engineering to improve incident response


Abstract

As engineers, we used to write code that interacted with a well-defined set of other applications. You usually had a set of services that were running in well-defined environments. The evolution of cloud native technologies and the need to move fast led organizations to redesign their structure. Engineers are now required to write services that are just one of many other services that together solve a certain customer problem. Your services are smaller than they used to be, they aren't alone in a vacuum, and you have to understand the problem space your service lives in. These days engineers aren't just writing code. They are expected to know how to deal with Kubernetes and Helm, containerize their service, ship to different environments and debug in a distributed cloud environment.

In order to enhance engineers' cloud native knowledge and best practices for dealing with production incidents, we started a series of workshops called "On-Call Like a King", which aims to enhance engineers' knowledge while responding to production incidents. Every workshop is a set of chaos engineering experiments that simulate real production incidents, and the engineers practice investigating, resolving and finding the root cause.

In this talk I will share how we got there, what we are doing and how it improves our engineering teams' expertise.

Summary

  • Eran Levy: How we utilize chaos engineering to become better cloud native engineers and improve incident response. It's a series of workshops I composed at Cyren which at the beginning was meant to bring more confidence to engineers. Later on it became a great playground to train cloud native practices and share knowledge around them.
  • The evolution of cloud native technologies and the need to scale engineering are leading organizations to restructure their teams. Engineers these days are closer to the product and to the customer needs. Make an impact, and as a result you will have happy customers. We should embrace these changes.
  • As part of transitioning into being more cloud native and distributed, engineers face more challenges. Deep systems are complex and you should know how to deal with them. Being a cloud native engineer is fun, but also challenging. We utilize chaos engineering to cope with these challenges.
  • Workshop sessions are composed of three parts, starting with the introduction and the goal setting. You should try to keep people focused and concentrated. Overall session time shouldn't be longer than 60 minutes. If you work hybrid, it's better to run these sessions when you are in the same workspace.
  • Our On-Call Like a King workshop sessions usually try to be as close to real-life production scenarios as possible. Such real-life scenarios enable engineers to build confidence while taking care of real production incidents. The discussions around the incidents are a great place for knowledge sharing.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, thanks for joining my talk. I'm happy to be here with other great folks who share their knowledge. Our talk today is how we utilize chaos engineering to become better cloud native engineers and improve incident response. Let me first introduce myself. My name is Eran and I'm leading security engineering at Cyren. I'm an engineer and a problem solver, and I love sharing my knowledge with others. So before we start, I would like to start with this one: what are you going to gain out of this talk? I would like to share with you how we leverage chaos engineering principles to achieve other things besides its main goal. At the beginning we wanted to bring more confidence to our engineers who respond to production incidents and, in addition to that, train them to become better cloud native engineers, as that requires expertise beyond the actual programming needed to ship your code. Moreover, I would like to share with you how we got there, what we are doing and how it improves our engineering teams' expertise. And one more thing, you might ask yourself, why is the title "On-Call Like a King"? It's a series of workshops I composed at Cyren which, at the beginning, I must admit, was meant to bring more confidence to engineers during their on-call shifts. But later on it became a great playground to train cloud native engineering practices and share knowledge around them. During this session I'm going to share with you what we are doing in such workshops, so stay with me. Let's start with the buzzword cloud native. The definition here is copied from the CNCF documentation, which is what I call the cloud native definition. I've highlighted some of the words there. While you read the definition, you see the words scalable, dynamic, loosely coupled, resilient, manageable, observable. At the end you see something that I truly believe in and that I try to make part of the culture of every engineering team that I join. As engineers, we deliver products. Per the definition, this is what cloud native brings: as a result, engineers can make high-impact changes. This, in my opinion, is what every engineering culture should believe in: make an impact, and as a result you will have happy customers. The evolution of cloud native technologies and the need to scale engineering are leading organizations to restructure their teams and embrace new architectural approaches such as microservices. We are using cloud environments that are pretty dynamic, and we might choose to build microservices to achieve better engineering scale. Just as a side note, you should remember that microservices are not the goal; we use them, the cloud and other tools to scale our engineering and our end product. As your system scales, it probably becomes more and more distributed, and distributed systems are by nature challenging. They are not easy to debug, they are not easy to maintain. Why isn't it easy? Because we ship small pieces of a larger puzzle. Two years ago I wrote a blog post that tries to describe the engineer's role in production. At a glance, I feel that the role of the engineer has grown to be much bigger. We're not just shipping code anymore. We design it, we develop it, we release it and we support it in production. The days when we threw artifacts over the wall to operations are over. As engineers we are accountable for the full release cycle. If you think about it, it's mind blowing: we brought so much power to the engineers, and with that power we should be much more responsible.
You might be interested in the blog post that I just mentioned, which I wrote back in 2020. It talks about these changes and the complexity, but we should embrace these changes. They enable teams to take end-to-end ownership of their deliveries and enhance their velocity. As a result of this evolution, engineers these days are closer to the product and to the customer needs. In my opinion, there is still a long way to go, and companies are still struggling with how to get engineers closer to the customer so they understand in depth what their business impact is. We talked about impact, but what is this impact? Engineers need to know what they solve, how they influence the customer and what their impact on the product is. There is a transition in the engineering mindset: we ship products, not just code. We embrace this transition, which brings so many benefits to the companies that adopt it. On the other hand, as the system scales, it becomes challenging for a team to write new features that solve a certain business problem, and even understanding the service behavior is much more complex. Let's see why it's complex. The approaches that I've just mentioned bring great value, but as engineers we are now writing apps that are part of a wider collection of other services built on a certain platform in the cloud. I really like what Ben shares in his slides, and I would like to share it with you. He calls them deep systems. Images are better than words, and the diagram in the slide explains it all. You can see that as your service scales, you become responsible for a deep chain of other services that your service actually depends on. This is what it means: we are maintaining deep systems. Obviously microservices and distributed systems are deep. Let's try to imagine just a certain service you have, let's say the order service. What do you do in this order service? You fetch data from one service, then you need to fetch data from another service, and you might produce an event to a third service. The storage is your own, but you understand the concept: it's just complex. Deep systems are complex and you should know how to deal with them. As part of transitioning into being more cloud native and distributed, and relying on orchestrators such as Kubernetes at your foundation, engineers face more and more challenges that they didn't have to deal with before. Just imagine this scenario. You are on call, there is some back pressure that is hurting your SLO target, there is some issue with one of your availability zones, and a third of your deployment couldn't be rescheduled due to some node availability issues. What do you do? You need to figure it out, and you might need to point it out to your on-call DevOps colleague. By the way, DevOps may be working on that already, as it might trigger their SLOs as well. This kind of incident happens, and as a cloud native engineer you should be aware of the platform you are running on. What does it mean to be aware of that? You should know that there are availability zones in every region, that your pod affinities are defined in a certain way, that the pods that are scheduled have some status, and what the cluster events are and how to read them in case of such a failure. This was just one particular scenario that happened to me and has probably happened to many of you before. As you see, it is not just your service anymore, it's more than that, and this is what it means to be a cloud native engineer.
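To make that scenario a bit more concrete, here is a minimal triage sketch using the official Kubernetes Python client. It is not part of our playbooks, just an illustration of the kind of checks an on-call engineer runs; the namespace and label selector are made-up examples.

```python
# Minimal on-call triage sketch (illustrative): find pods that are not Running
# and print recent cluster events, to spot scheduling/AZ problems like the
# scenario described above. Namespace and selector are hypothetical.
from kubernetes import client, config

config.load_kube_config()           # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()

NAMESPACE = "orders"                # hypothetical namespace
SELECTOR = "app=order-service"      # hypothetical label selector

# Pods stuck in Pending/Failed often point at node or affinity issues.
pods = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR)
for pod in pods.items:
    if pod.status.phase != "Running":
        print(f"{pod.metadata.name}: {pod.status.phase} (node: {pod.spec.node_name})")

# Cluster events usually explain *why* a pod could not be scheduled.
events = v1.list_namespaced_event(NAMESPACE)
for ev in events.items[-10:]:
    print(f"{ev.last_timestamp} {ev.type} {ev.reason}: {ev.message}")
```

The same checks map one-to-one to the kubectl get pods and kubectl get events commands, which is usually how it happens in practice during a real shift.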
As I already said, being a cloud native engineer is fun, but also challenging. These days engineers aren't just writing code and building packages, but are expected to know how to write their own Kubernetes resource YAMLs, use Helm, containerize their app and ship it to a variety of environments. It is not enough to know it at a high level. Being a cloud native engineer means that it's not enough to just know the programming language you are working in well; you should also keep growing your knowledge and understanding of the cloud native technologies that you depend on, besides the tools you are using. Building cloud native applications involves taking into account many moving parts, such as the platform you are building on, the database you are using, and much more. Obviously there are great tools and frameworks out there that abstract some of the complexity away from you as an engineer, but being blind to them might hurt you someday, or maybe some night. If you haven't heard of the fallacies of distributed computing, I really suggest you read further about them. They are here to stay and you should be aware of them and be prepared. In the cloud things will happen, things will fail. Don't assume you know what to expect; just make sure you understand failures, handle them carefully and embrace them. As I said, they will just happen. We talked a lot about the great benefits and also the challenges, so we had to deal with those challenges. Let me explain what we did to cope with them: we utilized chaos engineering for that purpose. We have found this method pretty useful, and I think it's nice to share the practices with you and with others. Let's first give a quick brief: what is chaos engineering? The main goal of chaos engineering is as explained in the slide, which I copied from the Principles of Chaos Engineering website. The idea of chaos engineering is to identify weaknesses and reduce uncertainty when building a distributed system. As I already mentioned in previous slides, building distributed systems at scale is challenging, and since such systems tend to be composed of many moving parts, leveraging chaos engineering practices to reduce the blast radius of such failures has proved to be a great method for that purpose. So I created a series of workshops called On-Call Like a King. These workshops intend to achieve two main objectives: train engineers on production failures that we had recently, and train engineers on cloud native practices and how to become better cloud native engineers. A bit on our on-call procedure before we proceed: we have weekly engineer shifts and a NOC team that monitors our systems 24/7. There are alert severities defined, severity one, severity two and severity three, which range from business-impact alerts down to alerts that only the service owner monitors. We have alert playbooks that assist the on-call engineer responding to an event; I will elaborate on them a bit later. In the case of a severity one, the first priority is to get the system back to a normal state. The on-call engineer leading the incident shall understand the high-level business impact in order to communicate it. In any case where specific expertise is needed to bring the system back into a functional state, the engineer makes sure the relevant team or service owner is on the keyboard to lead it. These are the tools that the engineer has in the toolbox to utilize in case of an incident. A pretty nice toolset.
Now that we understand the picture, let's drill down into the workshop itself. The workshop sessions are composed of three parts: the introduction and the goal setting, sharing some important things that have changed lately, and then the challenges themselves. Let's dive into each one of them. The session starts with a quick introduction of the motivation: why do we have the session, what are we going to do in the upcoming session, and making sure the audience is aligned on the flow and agenda. It's very important to do that every time, as it makes people more connected to the motivation and aware of what is going to happen. This is part of the main goal. You should try to keep people focused and concentrated, so make sure things are clear and concise. Sometimes we utilize the session as a great opportunity to communicate some architectural aspects, platform improvements or process changes that we had recently, for example changes to the on-call process, core service flow adaptations and much more. We work on a maximum of two production incident simulations, and the overall session time shouldn't be longer than 60 minutes; we have found that we lose engineers' concentration in longer sessions. If you work hybrid, it's better to run these sessions when you are in the same workspace, as we have found this much more productive; the communication makes a great difference. Let me share with you what we are doing specifically in this part, which is the core of the workshop. I think this is one of the most important things. Our On-Call Like a King workshop sessions usually try to be as close to real-life production scenarios as possible by simulating real production scenarios in one of our environments. Such real-life scenarios enable engineers to build confidence while taking care of real production incidents. Try to have an environment on which you can simulate the incident, and let people play in real time. As we always say, no environment is identical to production, but since we are doing specific experiments, it's not necessary to have a production environment in place. Obviously, the more you advance, the better it might be to work on production, but it's much more complex and we have never done that. Since we utilize chaos engineering here, I suggest having real experiments that you can execute within a few clicks. We are using one of our load test environments for that purpose. We started manually. If you don't have any tool, I suggest you don't spend time on that; don't rush to a specific chaos engineering tool. Just recently we started using LitmusChaos to run the chaos experiments, but you can use anything else you would like, or you can just simulate the incident manually. I think the most important thing is, as I said before, to have a playground where the engineers actually exercise instead of just hearing someone talk over a presentation slide. You will be convinced: when they are practicing, and not just listening to someone explaining something on a slide, the session is much more productive. Right after the introduction slides we drill down into the first challenge. It starts with a slide explaining a certain incident that we are going to simulate. We usually give some background of what is going to happen, for example: there is some back pressure that we couldn't handle since a specific UTC time. We present some metrics of the current behavior, for instance the alerts and the corresponding Grafana dashboard.
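As a rough idea of what simulating an incident manually can look like before adopting a dedicated tool, here is a minimal sketch that kills a random pod of a target service with the official Kubernetes Python client. The namespace and label selector are hypothetical, and this is an illustration rather than our actual experiment code; a tool such as LitmusChaos packages a comparable pod-delete experiment declaratively, with scheduling and blast-radius controls.

```python
# Manual chaos sketch (illustrative): delete a random pod of the target service
# in a load-test environment to trigger the incident scenario for the workshop.
import random
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "load-test"             # hypothetical load-test environment
SELECTOR = "app=order-service"      # hypothetical service under experiment

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
if not pods:
    raise SystemExit("no target pods found - nothing to experiment on")

victim = random.choice(pods)
print(f"Deleting pod {victim.metadata.name} to start the scenario")
v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
```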
When presenting the incident, you should present something very minimal, because this is how it actually happens during a real production incident. Then we give engineers some time to review the incident by themselves. Giving them time to think about it is crucial: they are exercising alone, thinking whether they have encountered something similar before. This is a very important step; it encourages them to try to find more information and utilize their know-how to get it, such as gathering cluster metrics, viewing the relevant dashboards, and reading the logs and the service status. There is another very important aspect: you should understand the customer impact, and it's even more important when you are on call in case of a severity one. You should communicate the impact on the customers and see if there is any workaround until the incident is resolved completely. Engineers are not always aware of the actual customer impact, so it's a very good time to discuss it. Ask for their analysis from time to time and encourage them to ask questions. We have found that the discussions around the incidents are a great place for knowledge sharing. Knowledge sharing can be anything from design diagrams to a specific Kubernetes command line. If you are sitting together in the same space it can be pretty nice, because you can see who is doing what, and then you can ask them to show which tools they use and how they got there. What I really like about those sessions is that they trigger conversations: engineers ask each other to send some of the CLIs or tools that make their lives easier while debugging an incident. The workshop sessions will teach you a lot about the know-how that people have, and I encourage you to update the playbooks based on that. If you don't have such playbooks, I really recommend creating them. We have a variety of on-call playbooks; most of them are composed for major services' severity-one alerts, and they provide on-call engineers with some gotchas and high-level flows that are important to look at when dealing with different scenarios. This is how our playbook template looks. Drive the conversation by asking questions that let you touch on the topics that you would like to train on. Some examples that have proved to be efficient: ask an engineer to present a Grafana dashboard to look at, ask an engineer to share his Kibana log queries, or ask someone else to present distributed tracing and how to find such a trace. You sometimes need to moderate the conversation a bit, as time flies pretty fast and you need to bring back the focus during the discussion. Point your finger at interesting architectural aspects that you would like the engineers to know about; maybe you can talk about a specific async channel that you want to share your thoughts on. Encourage the audience to speak by asking questions around the areas of interest, which will enable them to even suggest new design approaches or highlight challenges they were thinking about lately. You might be surprised, and even add items to your technical debt. At the end of every challenge, ask somebody to present the end-to-end analysis. It makes things clear for people that might not feel comfortable enough to ask questions in large forums, engineers that have just joined the team, or junior engineers that might want to learn more.
It's a great resource for people to get back to what has been done, and also a fantastic part of the knowledge base that you can share as onboarding training with new engineers that have just joined the team. I found out that people sometimes just watch the recording afterwards. It becomes handy even just for engineers to get a view of the tools that are in their toolbox, so just make sure you record the session and share the meeting notes right after it. As you can see, chaos engineering for training is pretty cool. Leverage it to invest in your engineering team's knowledge and skills; it seems to be successful, and at least it was successful for us. So to summarize some of the key takeaways: we found that these sessions are an awesome playground for engineers. I must admit that I didn't think about chaos engineering for these simulations in the first place. We started with just a manual simulation of our incidents, or just presented some of the evidence we gathered during a time of failure to drive a conversation around it. As we moved forward, we leveraged chaos tools for that purpose. Besides the training to become better cloud native engineers, on-call engineers feel much more comfortable in their shifts and understand the tools available to them to respond quickly. I thought it would be good to share this because we always talk about chaos engineering experiments as a way to make systems more reliable, but you can also leverage them to invest in your engineering teams' training. Thanks for your time, and I hope it was a fruitful session. Feel free to ask me any questions anytime, I will be very happy to share more. Thank you.
...

Eran Levy

Director Of Engineering @ Cyren



