Conf42 Chaos Engineering 2024 - Online

Bring Chaos Engineering to Your Organization in a Fun and Continuous Way

Abstract

In your organization’s journey to adopting chaos engineering, do you face challenges such as developers’ lack of understanding about the infrastructure, limited usage of observability tools, and their unfamiliarity with Chaos Engineering? Come and listen to the story of how we address them!

Summary

  • Electrolux has a highly complex system and by this I mean IoT. All of our developer teams use different tools for monitoring. We decided to consolidate everyone into one observability platform. The first step towards gamification was when we created a game.
  • We want to make it fun for our developers to use chaos engineering to learn and to improve troubleshooting capabilities. We decided to conduct seven experiments in total, which are all related to troubleshooting. If you want to adopt chaos engineering in your organization, we are super happy to share more about this journey.
  • For the Chaos game day, the goal is to bring infrastructure knowledge and also to enable developers to do troubleshooting using one observability platform. There are lots of nice chaos engineering frameworks in the world. Many operations or experiments can be automated.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, today we are going to talk about how we brought chaos engineering to our organization in a fun and continuous way. My name is Kristina, I am a product manager at Electrolux, and I'm here in the Stockholm office together with my colleague Long. He is a senior SRE engineer with a PhD in chaos engineering in particular. So let's start. Here are the three sections we are going to discuss today. For the first one I would like to jump into the context: we work at Electrolux with a highly complex system, and by this I mean IoT. To explain what we are doing, we work in the digital experience organization, and this is where connectivity at Electrolux starts and is developed. Our appliances are divided into three categories: taste, wellbeing and care. These are connected appliances. To give you a simple use case, here is a lady, and let's imagine that she bought an oven and connected it to the Internet via our mobile app. Then she places some food in it, and the camera recognizes the food and suggests the cooking mode, the temperature and the duration. She starts cooking and goes out for a run. The oven keeps monitoring the cooking status and turns off the heating when it's done. She also receives a notification from the app, so she knows she can go home and enjoy the meal. To build all possible and impossible use cases, we have a firmware team that makes the brain of the appliances smarter. We have a team of brilliant backend engineers that develops a connectivity cloud to send data from the appliance to a mobile phone. And of course we have a team of mobile developers, together with designers, building the meaningful digital experience for our consumers. And there is also us, the SRE team. To provide more context, we have many different types of appliances like fridges, vacuum cleaners and dishwashers, many third-party company integrations, and we currently launch in over 60 countries and support over 20 languages. We also run our services from several different regions, on top of the general complexity of IoT. If you join the Electrolux IoT domain as a developer, on your first day you can be overwhelmed by the range of various tools and cloud vendors we use. We have three main connectivity cloud platforms. The majority of services use AWS, but we have some services running on Azure, IBM and Google Cloud. It means that all of our developer teams use different tools for monitoring. When it comes to an incident, our SRE team needs to look into all the observability tools that are used and identify the root cause, which can take a while. There are some historical reasons for this, but before we move to the next section, it's important to note that when we started our journey, we didn't have a plan to solve all this complexity via chaos engineering. That came afterwards. As a first step, we decided to address the obvious challenge and bring everyone and every application we have into one observability platform, so we would have end-to-end tracing from the appliance via our connectivity platform to the app and would be able to trace and troubleshoot much better. As I said, for historical reasons every team had their own preferences and their own observability platform, but we decided to consolidate everyone onto one. We selected Datadog as this platform and started to onboard team by team.
Here are a few dashboards from the different teams. I remember how one day an engineering lead reached out to our team and said that they had an ongoing incident and needed our help. It was a big change, because usually it's us who notice the incident, and now we were sometimes not even involved, though they still sometimes needed our help to identify the root cause and fully resolve the issue. I would say the first step towards gamification was when we created this page, just to understand the differences among our developer teams, because some teams still use only basic functions, like only checking logs. And I just wanted to highlight with this example that if you make some changes in your organization, you cannot expect that everyone will adapt quickly. I also remember there were product people who reached out to me and asked how to get all the stars, which was quite fun, because I would say that was the first moment they wanted to play this game with us and start to learn about the new observability tool. In general, we started to brainstorm what we could do about this big difference among the teams and how to help them learn more about the observability tool we provide. We got the idea to run an internal chaos engineering game day to promote our toolset and improve developers' knowledge about monitoring. So we told our developers that we were going to attack their services with many use cases that could happen in real life and have an impact on our end users. We also tried to cover different topics of troubleshooting, apart from log searching: metrics analysis for performance and traffic, and trace analysis for latency. But let me hand over to the expert in chaos engineering so you can hear a bit more about it. Long? Perfect. Hello everyone. Thanks, Kristina, for the nice introduction. This is Long, and I'm super proud to be here for the third time to talk about chaos engineering at Conf42. This time we are going to share more about how we are adopting chaos engineering in a fun and continuous way. As you already know, we have this one observability platform and we want to onboard developers and enable them to learn more about how to use these observability tools. We also think it's good to extend this purpose with a chaos game day, so that they can learn more about the infrastructure and improve their troubleshooting knowledge. So, as step two: how do we make it fun for our developers to use chaos engineering to learn and to improve troubleshooting capabilities? We came up with this idea: Chaos game day. If you have a similar need and you want to adopt chaos engineering in your organization, we are super happy to share more about this journey. First of all, I would like to share how we prepared for the Chaos game day. Of course it's very important to communicate and to decide the target environment beforehand, because as we know, the ultimate goal of chaos engineering is to improve resilience and to conduct experiments directly in production. However, we are not there yet, and we think it's good to begin with our staging environment, because the first Chaos game day was focused on education and on the onboarding experience for developers. So we needed lots of communication with different teams, and we finally decided to use the staging environment. Second, we wanted to make it a fun way to learn and to improve troubleshooting capabilities.
And here we think it's good to use the CTF format for the Chaos game day: capture the flag. This format is more commonly used by security engineers, I would say, but it's also a perfect option for Chaos game days. In the context of chaos engineering experiments, everything can be a flag, for example a piece of logs, the name of a metric, or the method that raises an exception, et cetera. We can predefine a set of flags, then trigger or inject some exceptions on purpose and invite developers to figure out what flag we injected into the target environment. Based on this format, we decided to conduct seven experiments in total, all related to troubleshooting and to different aspects of observability platform knowledge. I will share more about this in the next couple of slides. Finally, we also needed some preparation for the logistics, the player registration and access control. This is an example of the flow for one experiment execution. Before conducting any chaos engineering experiments, we need to prepare the experiment and place the flag. For example, if I want to invite developers to figure out what abnormal behavior outputs lots of error logs, we can pre-inject some errors and trigger a service to output some specific error logs. Then we mark these error logs as a flag and ask developers to dig into their infrastructure and services to figure out which service outputs the errors. Of course, we can also conduct experiments on the Chaos game day itself, then report the error and invite developers to do the troubleshooting. On the Chaos game day, for each experiment we release a set of instructions for developers, so they know what the abnormal behavior is from the end user's perspective. For example, if I'm using the mobile app to control my fridge, I will report as an end user: I cannot change the temperature of my fridge, and I got this error in my mobile app. Then developers will use everything they can on the observability platform to do the troubleshooting. Of course, considering this is an educational process, we also prepare a set of hints for each experiment. The only difference is that developers or teams get a higher score if they succeed in finding the flag before the hint is released. After that we close the experiment, and after the Chaos game day we do follow-up sessions with all the participants. Here we would like to share a bit more about the tricks of experiment design, because chaos engineering experiments should always start from the hypothesis design, or we should always consider the goal of this set of chaos engineering experiments. For us, for this specific Chaos game day, the goal is to bring infrastructure knowledge and also to enable developers to do troubleshooting using this one observability platform. So we always consider what kind of observability information or metrics we can use for the game day experiments. For example, we can inject error logs, or we can inject some abnormal behavior visible in metrics or traces, et cetera. Then we enable developers to do the troubleshooting and at the same time improve their knowledge. Secondly, it's totally fine to trigger a failure at different levels, because we would like to report the error or the abnormal behavior from the end user's perspective, no matter where the error is injected.
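To make the flag-and-hint mechanics concrete, here is a minimal, hypothetical sketch in Python of how a game-day scoreboard could track predefined flags and award fewer points once a hint has been released. The class names, point values and example flag are purely illustrative assumptions, not the actual tooling used for this game day.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    """One CTF-style flag tied to a chaos experiment."""
    experiment_id: str
    answer: str               # e.g. the name of the service emitting the injected error logs
    full_points: int = 100    # score when the flag is found before the hint is released
    hinted_points: int = 60   # reduced score once the hint is out
    hint: str = ""
    hint_released: bool = False


@dataclass
class Scoreboard:
    flags: dict[str, Flag]
    scores: dict[str, int] = field(default_factory=dict)

    def release_hint(self, experiment_id: str) -> str:
        # Releasing the hint lowers the reward for subsequent correct submissions.
        flag = self.flags[experiment_id]
        flag.hint_released = True
        return flag.hint

    def submit(self, team: str, experiment_id: str, answer: str) -> bool:
        """Check a team's submission and award points based on hint status."""
        flag = self.flags[experiment_id]
        if answer.strip().lower() != flag.answer.lower():
            return False
        points = flag.hinted_points if flag.hint_released else flag.full_points
        self.scores[team] = self.scores.get(team, 0) + points
        return True


if __name__ == "__main__":
    board = Scoreboard(flags={
        "exp-1-error-logs": Flag(
            experiment_id="exp-1-error-logs",
            answer="fridge-control-service",            # hypothetical service name
            hint="Check the error rate per service in the APM view.",
        ),
    })
    board.submit("team-taste", "exp-1-error-logs", "fridge-control-service")
    print(board.scores)  # {'team-taste': 100}
```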
We can always report that, from the consumer's perspective, I cannot control something from my mobile app, and then it will feel more natural for developers: it's like receiving a ticket from the support team and having to figure out what is happening in their backend services. So it's okay, for example, to trigger a latency error at the database infrastructure level, or to inject some exceptions in a microservice, and then report the error from the end user's perspective. Finally, it's suggested to take advantage of various frameworks. You don't need to write or implement everything on your own; there are lots of nice chaos engineering frameworks in the world, for example AWS Fault Injection Simulator, or LitmusChaos as an open-source chaos engineering framework. Both of them will be very helpful for conducting experiments. As a summary of this Chaos game day, we had 41 developers from twelve different teams and four countries participating, and we conducted seven experiments. In total we received 181 submissions, which was a lot and caused some issues for us; I will share more later on. The good thing is that we received lots of positive feedback, and we even found something extra as a surprise in our infrastructure: we didn't inject any failures there, but with the help of the Chaos game day, or rather the participants from different teams, we managed to find some additional resilience issues in our infrastructure. Regarding feedback, we got lots of positive comments from players, and here I would like to share the example that I liked the most. There was one developer who was actually a bit upset because the team she belonged to didn't win the match: she couldn't see any logs or metrics from the mobile app, and she was kind of angry at herself because it was their team that hadn't prioritized having it. After the Chaos game day, the team actually started the real user monitoring integration on Datadog, and all the other participants also started to set up their monitors and alerts. So the Chaos game day became a nice motivation for teams to improve their services' observability. From the SRE team's perspective, we think it's a very good approach to shift operations responsibilities to developers, because with the help of chaos engineering experiments they gain more knowledge about their infrastructure and also improve their capability of troubleshooting using our observability platform. We also checked the number of incidents before and after having chaos engineering game days: it's 33 percent lower. Of course that is not only because of the effort we made conducting chaos engineering experiments; we also improved the incident management process with the help of the Chaos game days. Now developers and also different team leads have requested that we continuously conduct chaos engineering experiments, but what is the price for that? We considered again the feedback and the effort we made for the first Chaos game day, and we think there are many things that can be further improved. For example, many operations or experiments can be automated, and some of the review or checking of the submitted flags can be automated as well. We can also provide a platform to conduct chaos engineering experiments instead of organizing game days with lots of effort on logistics. So how do we do it in a continuous way?
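As an illustration of leaning on an existing framework instead of building everything yourself, the sketch below shows one way to start a pre-defined AWS Fault Injection Simulator experiment template from Python using boto3. The template ID and tag values are placeholders, and this is only a minimal sketch, not the exact setup used on the game day.

```python
# Minimal sketch: start a pre-defined AWS Fault Injection Simulator experiment
# template with boto3. The template ID and tags below are placeholders.
import uuid

import boto3


def start_fis_experiment(template_id: str) -> str:
    """Start an FIS experiment from an existing template and return its ID."""
    fis = boto3.client("fis")
    response = fis.start_experiment(
        clientToken=str(uuid.uuid4()),      # idempotency token
        experimentTemplateId=template_id,   # the template defines targets and fault actions
        tags={"purpose": "chaos-game-day"},
    )
    return response["experiment"]["id"]


if __name__ == "__main__":
    # Hypothetical template that injects latency into a staging database layer.
    experiment_id = start_fis_experiment("EXTexampletemplate")
    print(f"Started FIS experiment: {experiment_id}")
```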
We came up with the idea that chaos engineering operations can actually be integrated with platform engineering practices. We developed our internal developer platform, the IDP, around two years ago, and we think chaos engineering is just one more good feature for it. To give you a bit more context on our IDP, this is its overall design. As SRE team members, we define a set of templates for our infrastructure, and these templates are currently implemented using Terraform. We define a set of standardized options for different resources like EKS, databases, et cetera. Then we also provide one single entry point for our developers to create and manage all the infrastructure using Backstage. So imagine a new joiner to the Electrolux IoT team who needs to create some infrastructure for daily tasks, for example an EKS cluster. Instead of exploring the AWS console or asking around about the configurations for different infrastructure, she can simply visit the Backstage IDP plugin and select EKS, and she will get all the recommendations and all the options ready for creating her EKS cluster. This is a screenshot from our IDP, and the first picture shows the Electrolux catalog of all our infrastructure resources. Here you can see a list of resources like databases, EKS clusters, MSK clusters, et cetera. As long as the developer has access to these resources, the developer is able to check all the details on this single platform, like the details of the infrastructure configuration. And if it is an EKS cluster, developers are also able to check, for example, the deployments in that cluster. In order to integrate chaos engineering operations with our IDP, the first version we implemented is a chaos engineering experiment shadowing plugin. There is a button for developers on different resources, for example for a microservice deployed in a Kubernetes cluster, and there is a dialog for configuring chaos engineering experiments. Developers can choose the fault models and some configuration for the specific fault model, like the value of the latency or the type of errors, et cetera. Then developers are able to trigger these experiments in a specific environment. They are also required to document the hypotheses, because the experiment is executed by the IDP and most of the information is analyzed on the observability platform, so developers need to cross-compare the findings with the predefined hypotheses. This is a nice approach, but to make it more scalable and more flexible, we think it's even better to build on existing chaos engineering frameworks and Backstage plugins, so that we can provide a tab for different resources and also richer fault models for chaos engineering experiments. This is the current plugin we have: we use the LitmusChaos Backstage plugin together with the LitmusChaos framework version 3. In this way, we provide a multi-tenancy setup for our chaos engineering experiments, and we can also adopt different fault models from different layers, like cloud provider fault models, Kubernetes fault models, et cetera. Considering the future plans, we think it's better to provide a multilevel and automated experiment platform for developers.
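To illustrate the kind of configuration such a plugin dialog could collect, here is a hypothetical sketch of the experiment request a Backstage form might assemble: the fault model, its parameters such as latency, the target environment, and the documented hypothesis to compare against observed behavior. The field names, values and structure are assumptions for illustration only, not Electrolux's actual plugin API.

```python
# Illustrative sketch only: a hypothetical payload an IDP plugin could build when
# a developer configures a chaos experiment from Backstage.
from dataclasses import dataclass, asdict
import json


@dataclass
class ChaosExperimentRequest:
    target_service: str   # resource selected in the IDP catalog
    environment: str      # e.g. "staging"; production would require explicit approval
    fault_model: str      # e.g. "latency", "exception", "pod-delete"
    fault_config: dict    # fault-specific settings, e.g. latency in milliseconds
    hypothesis: str       # documented expectation to cross-compare with observations


def build_request() -> str:
    request = ChaosExperimentRequest(
        target_service="fridge-control-service",   # hypothetical service name
        environment="staging",
        fault_model="latency",
        fault_config={"latency_ms": 800, "duration_s": 300},
        hypothesis="p99 latency alerts fire within 5 minutes and requests are retried",
    )
    return json.dumps(asdict(request), indent=2)


if __name__ == "__main__":
    print(build_request())
```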
For example, we can provide more fault models based on the type of infrastructure or the type of service. We can also improve the full feedback loop with the help of the IDP: currently we have one single entry point for infrastructure and service management and another platform for observability management, and we can further automate the loop of these chaos engineering experiments. Okay, that is the talk for today. Considering the complexity of IoT systems, chaos engineering is definitely a good approach for resiliency improvements. However, we don't want to overwhelm our developers, and we don't want to add an extra task for them to do chaos engineering. So we came up with the idea of chaos engineering gamification, and also the integration with platform engineering approaches. If you would like to go deeper into platform engineering, feel free to give us a thumbs up and we will give another talk, maybe in the near future. Thank you, and enjoy Conf42. Bye.
...

Long Zhang

Senior SRE @ Electrolux

Kristina Kondrashevich

Product Manager, SRE products @ Electrolux
