Conf42 Chaos Engineering 2024 - Online

Smart Chaos : Leveraging Generative AI in Chaos Engineering

Abstract

The contemporary distributed ecosystem, characterized by speed, scale, and complexity, has surpassed the capacity of human mental modeling. Generative AI is now offering a solution to this intricate problem by enabling the implementation of Smart Chaos, reducing the dependency on human involvement.

Summary

  • This talk is about leveraging generative AI to build autonomous chaos engineering workflows. I have used AWS PartyRock to implement some of these use cases. Finally, I'll wrap up with some best practices and pitfalls based on my experience.
  • Distributed systems are very complex. If there's an issue in any of their layers, it has a ripple effect on the other layers. Bugs can appear anytime, right? So why are we not able to make managing distributed systems easy?
  • Generative AI is nothing but the ability of AI models to create new, original content. One of the top use cases is text generation. These models are really helping us make a difference in our operations.
  • We want to do chaos engineering in a methodical way. Doing chaos engineering in a chaotic way will not give you any value. The first task is to discover your services. When you go through the chaos engineering process, you have to measure everything.
  • Here I'm looking at ten stages of the chaos engineering workflow and how we can leverage generative AI in each of them. One is discovery. Another is understanding the blast radius. And then there is monitoring and analysis.
  • This is why we need generative AI: we don't need SMEs to be involved all the time. This is the architecture diagram of an electric vehicle charging system. We are going to use it as our base and see how we can automate some of the stages.
  • AWS PartyRock is an innovative AI playground that AWS has released. I asked generative AI to come up with the system dependencies. When using large language models, make sure you use the right model, and it will give you more accurate data.
  • The next aspect is understanding the steady state of this application or system. Generative AI is able to look at the architecture diagram and define what good means without any human involvement.
  • Moving on, this is about hypothesis creation. I give the architecture diagram link and ask the generative AI model to generate hypotheses. Once you have the experiments, it's about understanding the blast radius. This again helps us massively cut down human involvement.
  • We are able to use generative AI in every aspect of our chaos engineering workflow. We can use it to discover our services and understand the dependencies, and it will come up with hypotheses. Once the experiments have been executed, we can monitor using observability tools. This is what I call Smart Chaos.
  • I'd like to finish my session. It was wonderful being part of Chaos Engineering 2024. I hope you enjoy my presentation and there are a lot of other presentations. Have a nice day.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Hello. Welcome to Chaos Engineering 2024. My name is Indika Wimalasuriya, and in this presentation I will walk you through Smart Chaos, which is about leveraging generative AI to build autonomous chaos engineering workflows. I will talk about distributed systems and why we have to make them resilient. Then we will discuss, at a high level, what chaos engineering and generative AI are. An important part of my presentation covers the methodology of chaos engineering and how we can apply generative AI to the different stages of a chaos engineering workflow. We will discuss quite a lot of use cases, and I have used AWS PartyRock to implement some of them. Then we will look at how we can build an autonomous chaos engineering workflow that can learn and act, generating chaos experiments with generative AI without human intervention. Finally, I'll wrap up with some best practices and pitfalls based on my experience.
So, moving on. I hope you all understand that distributed systems are very complex. One of the main reasons is that we have layers of architecture, right? We have the hardware layers, the serverless or compute layers, and the front-end layers as well. Because of these layers, an issue in any one of them has a ripple effect on the others. Another challenge is that a bug in one layer can cause different behaviors in the other layers as it ripples through. And we also know that bugs can appear anytime, right? You may have done a deployment last week, but the bug might only appear today. So managing these complex distributed systems is a challenge, because there are a lot of unknowns.
Even though we think we know everything and are on top of our development rigor, test rigor, CI/CD pipelines and all the automation we are bringing in, we still tend to encounter issues. That is one of the challenges, and you will probably ask why. Why are we not able to make managing distributed systems easy? Is it down to the skill level? Is it down to the people aspect? What is it? That is something you probably want to find out and answer. And if you think these issues are bound to companies or engagements of a certain size, I think you are wrong, and I'll give you a couple of examples. In 2021, Facebook had a massive outage that also impacted Instagram and WhatsApp and lasted around 5 hours. It hit Facebook badly and had an impact on its stock as well. Similarly, Fastly, one of the leading content delivery networks, had an outage that lasted around 1 hour and impacted a series of other applications and systems. That is mainly because Fastly, as one of the leading CDNs, is used widely across the industry. So it impacted Amazon, eBay, Reddit, Twitch, The Guardian, The New York Times and even some UK government websites. The cause was identified as a software bug that had made its way into production. This is a classic example of one issue impacting multiple systems, and of the complexity that distributed systems bring. And last year we saw it with Datadog, one of the leading observability tools, probably number two in the Gartner Magic Quadrant for observability. Datadog experienced a substantial outage that impacted most of its customers.
Because Datadog is a SaaS solution, people depend on it being up and running to get their alerts and do other observability-related work. The cause was again identified as a route-related issue, and a restart was required. So this is again a classic example: even companies that have the money to invest in the proper tools, processes and people are not immune to outages. Now you will question this, right? If we are spending money, bringing in the right tools and the right people and building the right processes, why do we keep ending up in these situations?
One key thing to understand is that we are doing a lot of testing. These systems go through a rigorous testing cycle: developer testing, regression testing, performance testing, load testing and security testing. And once you deploy your code to production, you do post-deployment testing. So there is a lot of testing happening, and we use a lot of tools and automation to improve our testing capability. Why are we still encountering major issues? One of the answers is that we only test what we know. For example, when a customer gives a requirement, the developer and the business analyst go through it and convert it into stories. The developers start building from these stories and, in parallel, the QA team, the quality assurance team, goes through the requirements as well and comes up with test cases. This is a very important point to remember: when the QA team comes up with test cases, they refer to the requirements, apply their own thinking, and write the test cases from that.
So what we have to understand is that the QA team comes up with what they know, right? We are good at testing what we know. That is one of the main things. Most of the time, the issues we run into come from the unknowns. Because distributed systems are very complex, vast and difficult to manage, there are a lot of unknowns, and these unknowns get missed in testing because we test what is known to us. As a result, these issues regularly creep into our production systems. That is the challenge we are planning to address using chaos engineering.
Chaos engineering is pretty much about trying to understand the unknowns. As I said, our quality assurance teams only test the knowns, what is known to them. But assume a distributed system is deployed in a data center. Are we thinking about someone in the data center pulling a cable, someone switching off a machine accidentally, or a router problem causing traffic to fail? Those scenarios are usually not considered or covered as part of typical quality assurance testing, and one of the main reasons is that there are a lot of unknowns here. So the point of chaos engineering is to develop a mechanism for testing the unknown. This is nothing new. Chaos engineering has been in the industry for some time now. It was pioneered by Netflix when they were moving from on-premises to the cloud. They developed Chaos Monkey, which goes and creates chaos in their production environment. This allowed them to understand the reliability and resilience issues in their system, so that they were able to develop a world-class streaming media platform.
One of the reasons chaos engineering is so important is that it tests your resilience, and sometimes it does things we have not even thought of. That is the advantage. Chaos engineering allows us to improve reliability and build systems with resilience in mind, and it helps us achieve our service level objectives, our mean time to resolve and other targets. So it is very important for any distributed system to take chaos engineering seriously.
With this, I'll park the topic of chaos engineering for a moment, and let's move into generative AI. I'm sure you are all aware of generative AI. With the hype around ChatGPT, everyone is aware of it and everyone is using it. Generative AI is nothing but the ability of AI models to create new, original content. By learning from large amounts of data, these large language models are able to come up with innovative solutions. When it comes to creating new content, it can be images, video, or any other media form. Some of these models are really helping us make a difference in our operations, in how we work and how we approach our day-to-day work. If you look at the applications of generative AI, one of the top use cases is text generation. It is not just a chatbot or a standard bot you can communicate with. You can apply the same text generation concept to coding, to writing, to production support and to managing knowledge bases. The opportunities are endless, as good as your imagination. It is also capable of image generation, video generation and many other things. So this helps us come up with new, innovative solutions for some of the problems and challenges we have. And again, coming back to chaos engineering.
Even though we say chaos, we want to do chaos engineering in a methodical way. There is no value in doing chaos engineering in a chaotic way. What we want is a proper methodology that lets you develop your workflows and build a proper chaos engineering mechanism. And even though Netflix pioneered this and is able to do it in production, we currently suggest you start in your non-production environments and expand later.
So what are the key steps of chaos engineering? The first task is to discover your services: what the components of your application are, where your application is deployed, and what the upstream and downstream dependencies are. Those are very important because they are the places where things can go wrong. Then you have to understand your steady state. Steady state is nothing but what good means for you. For example, for a web application, the application has to be up and running and has to serve customer requests within a certain time. That is what good means for you. For every system we have to understand the steady state, so that we have one definition of good. Good can't be based on different people's opinions. It has to be written down and acceptable to everyone. Then we build our hypotheses: what are the failure scenarios for this application? The hypotheses help us come up with experiments and run them. While running them we verify, improve, and continuously repeat the cycle. We are obviously able to integrate this with CI/CD pipelines, but at the moment this is happening manually.
One important thing to remember is that when you follow this process, you must be able to measure everything. There's no point in doing chaos engineering or chaos testing if you are not looking at how your systems behave. I'm sure you have heard the term wisdom of production. It's about gaining knowledge of how your systems run in production, or in environments that are identical to production. Remember, I said you are not encouraged to start your chaos engineering testing in production. What you can do is build an identical environment and gain the experience and knowledge there. The key thing is that, in parallel with your chaos testing, you have to have observability in place. Observability means looking at the external outputs of your system to infer its internal state. That is very important: when the chaos is happening, you want to know how your systems are behaving. You also have to have your SLOs defined and in place, so that you can understand the impact of these chaos tests on your service level objectives. That correlates directly with the impact on customer experience. At the moment we probably don't have the ability to know what customers are feeling, but we have service level objectives that are closely aligned with customer experience, and they let us understand how this impacts our customers. Finally, you can look at latency, traffic, errors and saturation. The combination of all of this allows you to measure everything. This is very important. When you go through the chaos engineering process, you have to measure everything, otherwise it is a waste of time. So now let's look at how we can bring generative AI into each stage of chaos engineering. That is the key.
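To make the measurement point above concrete, here is a minimal Python sketch of comparing a baseline window with a chaos window using the four golden signals and an SLO. It is an illustration only: the class, the function names, the 99.9% target and all the numbers are made up for this example and do not come from the talk or from any particular observability tool.

```python
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    """One observation window of the four golden signals (hypothetical shape)."""
    latency_p95_ms: float   # latency
    requests: int           # traffic
    errors: int             # errors
    cpu_utilization: float  # saturation, from 0.0 to 1.0


def availability(signals: GoldenSignals) -> float:
    """Fraction of requests that succeeded in the window."""
    if signals.requests == 0:
        return 1.0
    return 1.0 - signals.errors / signals.requests


def error_budget_consumed(signals: GoldenSignals, slo_target: float) -> float:
    """Share of the window's error budget that was used (1.0 = fully spent)."""
    allowed_failures = (1.0 - slo_target) * signals.requests
    if allowed_failures == 0:
        return 0.0 if signals.errors == 0 else float("inf")
    return signals.errors / allowed_failures


# Made-up numbers: one window before the experiment, one while it runs.
baseline = GoldenSignals(latency_p95_ms=180.0, requests=10_000, errors=5, cpu_utilization=0.40)
during_chaos = GoldenSignals(latency_p95_ms=420.0, requests=10_000, errors=30, cpu_utilization=0.85)

SLO_TARGET = 0.999  # assumed availability objective

for name, window in (("baseline", baseline), ("during chaos", during_chaos)):
    print(name, round(availability(window), 4), round(error_budget_consumed(window, SLO_TARGET), 2))
```

With these invented numbers the baseline window uses about half of its error budget, while the chaos window uses about three times the budget, which is the kind of signal a workflow could use to stop or shrink an experiment.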
Here I'm looking at ten stages, ten steps of the chaos engineering workflow, and we'll see how we can leverage generative AI in each of them. The first is discovery. Traditionally, people use a manual approach, or something very close to manual, to discover things. Sometimes we have observability tools and APMs, application performance monitoring tools, that can create service maps, and those help with discovery. But most of the time this happens manually. Next is dependency identification. This is again something that currently happens manually, and we can bring in generative AI solutions, which we will look at shortly. Then there is steady state definition. Here you look at your architecture, your services and everything else, and you define what good means to you. At the moment this happens pretty much manually, and it is again something where you can leverage generative AI. Hypotheses are nothing but your failure scenarios. At the moment your entire team looks at the services, architecture diagrams, system dependencies, bottlenecks and probable causes, and then comes up with the failure scenarios, the hypotheses. But that requires human intervention, and where there is human intervention, humans will work from what is known. So we miss the unknowns. That is an area where we can again leverage generative AI. Experiment design is again something that happens manually or with partial automation, but with generative AI and its content creation capabilities we can fully automate it and come up with these experiments.
Once you have the experiments, you have to understand the blast radius. Once you come up with a chaos engineering test, you don't just go and execute it. You want to understand up front what impact it is going to cause, right? That is what we call the blast radius. After running the experiment, you can see whether your estimated blast radius was correct, or whether the impact was smaller or wider. That is again a learning. Understanding the blast radius usually happens manually, and you can leverage generative AI to automate it. There is also something called the Rumsfeld matrix, which is about known knowns, known unknowns, unknown knowns and unknown unknowns, and which we will cover in a subsequent slide. It's about coming up with hypotheses for the known knowns, what you know in your area, and likewise for the other quadrants. This is again a place where we can use gen AI. Then there is monitoring and analysis. We can plug our gen AI solutions into observability tools and leverage the capabilities that large language models bring to the table. Documentation and reporting is bread and butter for generative AI, because it is what it is supposed to do, the basics. And then, obviously, we can go through this iteratively. We can take all the learnings, the observability data, the service level objective data, and the four golden signals (latency, traffic, saturation and error rates), and feed them into generative AI so it can come up with a holistic approach to improving your systems or your chaos engineering workflows.
So, moving on, I will now go through the stages we have discussed and how we can leverage generative AI for each. I am using an architecture diagram here. I pulled these architecture diagrams from the Amazon.com website, from one of their case studies.
At a glance you might understand what this architecture diagram shows, or it might take you a little time. And that is why we need generative AI, so we don't need SMEs to be involved all the time. This is the architecture diagram of an electric vehicle charging system. It is hosted in AWS and has components like Route 53, a Network Load Balancer, Fargate and some IoT components, along with Lambda, SQS, Step Functions, Athena, DynamoDB, S3 and visualization tools. So this is a very comprehensive, probably highly complex distributed system. We are going to use it as our base and see how we can automate some of the stages I discussed earlier.
So, moving on, one of our first tasks is: can we identify the dependencies here? I have tested all of these examples using AWS PartyRock. PartyRock is an innovative AI playground that AWS has released. As you may know, AWS has SageMaker, which is the hard way of deploying and managing your AI models. Then you have Bedrock, which is API based, so it is more of a serverless kind of experience. And PartyRock is where you plug and play and start using models. So if you want, go to AWS PartyRock, where you can experiment and create all the things I'm discussing here. Here I used the architecture diagram I showed you earlier. I created a small app, gave it the architecture diagram, and asked generative AI to come up with the system dependencies. As you can see, it says: based on the architecture diagram, here are some of the key dependencies.
It identifies the EV charging station, the OCPP protocol handler, the charging station management system, the payment provider, the notification component, telemetry ingestion, the billing system and authentication as the key system dependencies. This is good: without us, someone from the team or an SME looking at it, generative AI is able to derive this just from the architecture diagram. What you have to understand is that this is not just image reading. It reads the image, then understands those components using the massive amount of data it has learned from, and makes a story that makes sense. That is the beauty of large language models. So here it is easily able to come up with the dependency list. Here's another example. When you generate the dependency list, you can change the model type. Here, if I'm not mistaken, I am using the model called Command, and my output is more accurate, or I would say clearer, compared to the previous one. So here is a pro tip: if you are using a large language model, make sure you use the right model, and it will give you more accurate data.
Moving on, the next aspect is understanding the steady state of this application or system. Here again I'm using PartyRock. I gave it the architecture diagram link and asked it to come up with the steady state definition. What PartyRock, or rather the large language model, produces is a set of steady state definitions. For the API gateway it says: the API gateway is available and returns the correct response code, a 200, when OCPP requests are sent to the API. Likewise it lists all the steady state definitions. For example, for the payment gateway it says: the payment gateway is available and successfully processing payments for charging sessions. So this is the steady state.
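The steady state definitions quoted above are plain sentences. One way a workflow might make them checkable (an assumption for illustration, not something shown in the talk) is to pair each sentence with a small predicate over observed metrics. In this Python sketch the component names, metric keys and thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class SteadyStateCheck:
    """A written steady state definition plus a predicate that tests it."""
    component: str
    description: str
    holds: Callable[[Metrics], bool]


# Hypothetical checks modelled on the two definitions quoted in the talk.
CHECKS: List[SteadyStateCheck] = [
    SteadyStateCheck(
        component="api_gateway",
        description="API gateway is available and returns HTTP 200 for OCPP requests",
        holds=lambda m: m["api_gateway_success_ratio"] >= 0.999,
    ),
    SteadyStateCheck(
        component="payment_gateway",
        description="Payment gateway successfully processes payments for charging sessions",
        holds=lambda m: m["payment_success_ratio"] >= 0.99,
    ),
]


def violated(checks: List[SteadyStateCheck], metrics: Metrics) -> List[str]:
    """Return the components whose steady state does not hold for these metrics."""
    return [check.component for check in checks if not check.holds(metrics)]


# Made-up observation: the API gateway is healthy, payments are degraded.
observed = {"api_gateway_success_ratio": 1.0, "payment_success_ratio": 0.95}
print(violated(CHECKS, observed))
```

Keeping the human-readable description next to the predicate means the written definition of good and the automated check cannot silently drift apart.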
So our generative AI is able to look at the architecture diagram and define what good means without any human involvement. Moving on, this is hypothesis creation: what are the failure scenarios? Again, I gave it the architecture diagram link and asked the generative AI model to generate hypotheses, and it comes up with meaningful, relevant and accurate ones. The first one says: if the OCPP handler goes down, electric vehicle charging stations will not be able to start or stop charging sessions, leading to an inability to charge vehicles. Likewise, if the billing system goes down, new charging sessions cannot be started, as authorization and payments cannot be processed. So it comes up with the failure scenarios and also the potential impact. As you see here, I am simply giving generative AI the architecture diagram and asking it to come up with hypotheses. But you have probably understood by now that I can improve this massively. Beyond the architecture diagram, I can give it observability data, other services, a live service map taken from an observability tool, and a lot of other data, so that generative AI can improve its answers.
Then there is experiment design. Again, I give it the architecture diagram and just ask generative AI to come up with experiments. As you can see, it comes up with the hypothesis and the steady state, and even lists the test case. Here it states the test case: simulate a failure of the OCPP handler by powering off the server or disrupting the network connection to the OCPP handler, then observe the behavior of the system and the impact on the EV charging process. Another test case: simulate a database disruption by stopping the database service or corrupting the database files, then observe whether the charging stations can continue charging and whether the charging station data is accurately recorded and updated.
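In the talk, the model's experiments come back as free text. For an automated workflow, one possible approach (an assumption here, not something the talk demonstrates) is to ask the model to answer in JSON and validate the result before anything is executed. The field names, the `ChaosExperiment` class and the sample output below are all hypothetical; the sample simply restates the two test cases quoted above.

```python
import json
from dataclasses import dataclass
from typing import List

REQUIRED_FIELDS = ("target", "hypothesis", "steady_state", "action")


@dataclass(frozen=True)
class ChaosExperiment:
    """One generated experiment, in a shape the rest of a workflow could consume."""
    target: str
    hypothesis: str
    steady_state: str
    action: str


def parse_experiments(model_output: str) -> List[ChaosExperiment]:
    """Parse a JSON list of experiments and reject entries with missing fields."""
    experiments = []
    for item in json.loads(model_output):
        missing = [field for field in REQUIRED_FIELDS if not item.get(field)]
        if missing:
            raise ValueError(f"experiment is missing fields: {missing}")
        experiments.append(ChaosExperiment(**{f: item[f] for f in REQUIRED_FIELDS}))
    return experiments


# Stand-in for a model response; no model is called in this sketch.
SAMPLE_OUTPUT = json.dumps([
    {
        "target": "ocpp_handler",
        "hypothesis": "If the OCPP handler goes down, stations cannot start or stop sessions",
        "steady_state": "Charging sessions can be started and stopped",
        "action": "Disrupt the network connection to the OCPP handler",
    },
    {
        "target": "database",
        "hypothesis": "If the database is unavailable, session data may not be recorded",
        "steady_state": "Charging station data is accurately recorded and updated",
        "action": "Stop the database service",
    },
])

for experiment in parse_experiments(SAMPLE_OUTPUT):
    print(experiment.target, "->", experiment.action)
```

Validating before execution matters because generated text can omit or rename fields, and a chaos action should never run without a stated hypothesis and steady state to compare against.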
As I mentioned, generative AI is able to smartly come up with these experiments, which again helps us massively cut down human involvement. Moving on, once you have the experiments, it's about understanding the blast radius. As I mentioned, when you execute your chaos tests or experiments, you need some understanding of the impact. So here again I give it the architecture diagram, so that generative AI has the big picture, and then I give it a test case and ask it to come up with the blast radius. It says: this is the blast radius for the test case. So we know it is going to impact the OCPP handler, the EV charging stations, the back-end services and the users. This is a massive advantage: generative AI is able to come up with this kind of data without us even being involved.
Now, if you remember, I used the architecture diagram to get all this data. You might think that architecture diagrams are sometimes outdated, and ask whether we can improve on this. Of course. This is an example of a very simple application whose service map was generated by CloudWatch X-Ray once the application was in use, so I can see the client, the API gateway, the microservices and the databases. I can feed in this service map together with the architecture diagram, so that the generative AI solution can compare what is in the architecture diagram with the services that are actually live in operations, or in whatever environment you have deployed to. The more data, the more accurate data and the more information we feed into the generative AI tool that is doing this, the more accurate its answers will be. Next, we will quickly touch on the Rumsfeld matrix. It starts with the known knowns.
Known knowns are about evaluating components of your system that are familiar and thoroughly understood, such as the system architecture, infrastructure, identified failure points and CI/CD tests. Then we have the known unknowns. This is about investigating potential issues and vulnerabilities in your system that are known but haven't undergone rigorous testing or validation, such as theoretical vulnerabilities or unverified failure scenarios. Then we have the unknown knowns: reviewing issues that were once considered but may have been forgotten or overlooked with the passing of time, such as adherence to best practices, documented procedures, or insights from historical incidents. And finally there are the unknown unknowns: conducting comprehensive chaos testing to discover unforeseen and unanticipated vulnerabilities that may emerge unexpectedly, leading to surprises, often of an unpleasant nature. So the Rumsfeld matrix gives you an approach for planning your chaos testing. What we can do is feed the Rumsfeld matrix framework into our generative AI. Here I gave it the architecture diagram and then just said: come up with the known knowns. I could have improved this by asking it to come up with test cases or hypotheses mapped to the known knowns. Moving on, I was able to do the same thing for the known unknowns as well. So I'm sure you have understood by now that we can use generative AI in every aspect of our chaos engineering workflow. We can use it to discover our services, to understand the dependencies, to define the steady state, to come up with our hypotheses, and to come up with our test cases or experiments.
And then we can use it to come up with what follows the test cases: the blast radius. Those are the ingredients, the pieces of our chaos engineering workflow. If you look at a typical CI/CD pipeline, you have developers coding and committing code, then you build it, deploy it for testing, and probably do the production deploy as well. Then you have the observability tools. If you are using AWS, you can leverage CloudWatch metrics, CloudWatch Logs or X-Ray. We can plug our chaos engineering pipeline in here as well, so that as part of this pipeline our chaos engineering pipeline gets invoked and starts triggering a workflow.
This workflow is what I call Smart Chaos. It is about autonomous chaos engineering. First, our generative AI gets access to the training data set: your architecture diagrams, your service maps, your observability data and all the other inputs we can give generative AI so it comes up with a proper solution. Then it defines the steady state, works out the dependencies and comes up with hypotheses. Based on that, it tries to come up with experiments. When we create experiments, we want to create templates, small experiment templates, so that we can build a collection and reuse them. We can provide some templates up front, such as API failures, instance termination or system resources filling up, for the workflow to use. The workflow can then create the experiments from these and start executing them. Once the experiments have been executed, we can monitor them using our observability tools, looking at the observability or telemetry data.
And we will start monitoring our service level objectives and the traffic, saturation, errors, and other signals. That again gives more data points to this workflow. What we have to do is come up with a small blast radius; we'll ask the generative AI to use a small blast radius and then increase it in subsequent runs. So this is a highly automated workflow that leverages what I mentioned earlier. For each stage, I have shown you how we can use generative AI, and now we are bringing it all together into a proper workflow.

Now let's look at how this can actually look in a production environment or any other typical environment. You will have your DevOps engineers and SREs who are making your changes. They will ship these changes using a CI/CD pipeline, which will deploy them into your different environments, and in parallel with that we can trigger smart chaos. Smart chaos is also integrated with observability, service level objectives, and error budgets.

What smart chaos will do is pull from the observability tools to get the actual service maps, and then refer to the knowledge it has about the architecture diagrams and all the other diagrams, the logs, the metrics, the traces, and everything else. Based on that, it will try to work out what the steady states are and what the dependencies are. It will come up with hypotheses, and it will design experiments. After it designs the experiments, it will look at a library of templates, which we call experiment templates, and by combining those template scripts it can create the actual chaos workflows. Then it will start pushing these into the relevant environments to run. While doing that, it can monitor the runs, look at the telemetry data, and improve. Obviously, it will control the blast radius and then increase it in subsequent runs.
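The "start with a small blast radius and increase it in subsequent runs" idea can be sketched as a simple ramp that stops as soon as a service level objective is breached. This is a hedged sketch, not the workflow from the talk: `ramp_blast_radius`, its parameters, and the use of a single error-rate SLO are assumptions, and `run_experiment` stands in for whatever actually executes the experiment and reads telemetry.

```python
from typing import Callable, Dict, List


def ramp_blast_radius(
    run_experiment: Callable[[float], float],
    slo_max_error_rate: float,
    start: float = 0.05,
    factor: float = 2.0,
    ceiling: float = 0.5,
) -> List[Dict[str, object]]:
    """Run an experiment at a growing blast radius until the SLO is breached.

    run_experiment takes the fraction of targets affected (0.0 to 1.0) and
    returns the observed error rate. All names and defaults are hypothetical.
    """
    history: List[Dict[str, object]] = []
    radius = start
    while radius <= ceiling:
        error_rate = run_experiment(radius)
        within_slo = error_rate <= slo_max_error_rate
        history.append(
            {"radius": radius, "error_rate": error_rate, "within_slo": within_slo}
        )
        if not within_slo:
            # Stop expanding as soon as the objective is violated.
            break
        radius *= factor
    return history


if __name__ == "__main__":
    # Fake experiment for illustration: error rate grows with the blast radius.
    for run in ramp_blast_radius(lambda r: r * 0.1, slo_max_error_rate=0.03):
        print(run)
```

The ceiling keeps even a fully healthy system from being exposed to an unbounded blast radius without a human raising the limit.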
So this happens fully autonomously: we don't have to spend time, and we don't have to get our people involved. We can let the generative AI learn about our application setup and everything else, and then build smart chaos, the chaos engineering automation, for us. This is what I like to call autonomous chaos workflows.

Before I wrap up: if you are seriously trying to get into this, I have a couple of best practices you should consider. First, as with any other generative AI solution, smart chaos depends on providing good-quality data to our generative AI models. If you provide unclean data, or data that might not be relevant, then your model will struggle. Second, have a big template library ready with small sub-units of experiments. This allows our workflows to quickly pick up these templates and bundle them into one template. Third, let smart chaos expand gradually. Don't go in with a big bang; that is never the advisable approach. Let's do it in a more controlled way so that we can expand. And one other thing: have a feedback loop. You can introduce feedback loops so that generative AI can provide notifications and insights at runtime, you can look at how things are happening, and based on that you can fine-tune some of these workflows as well.

And what are the pitfalls to look out for? One of the big pitfalls with generative AI is that sometimes we see data bias. So when you are providing this data, you have to ensure that the model will not end up in a data bias situation. When I say data bias here, it's about balancing the flows, for example. The model can identify some workflows as critical and others as non-critical. What we don't want is for the model to be biased toward the critical workflows, where it will go and look only at the critical workflows.
But we also want good coverage of the non-critical workflows as well. We want this approach to expand in a controlled fashion and to avoid rapid chaos expansion, because that is not recommended. And we want to feed it all the observability and telemetry data, so that the experiments, the monitoring, and the measuring can all be comprehensive, and it can learn and iterate in a nice fashion.

And finally, what I want to tell you is that even though this is a very good approach, and I'm pretty sure we are going to see this happening in real time, there is still a need for human expertise. Don't eliminate your humans completely; keep roping them in. Look at the areas, try to look at the bigger picture, look at things holistically, and then see how we can improve this in a holistic way.

With this, I'd like to finish my session. It was wonderful being part of Chaos Engineering 2024. I hope you enjoyed my presentation. There are a lot of other presentations, so please go and check them out as well. I'm very happy if you are still here and listening, and I'm privileged to be part of this. I hope you have learned something that you can take back and use in your day-to-day work. This is nice, us doing this presentation. Thank you very much. Have a nice day.

Indika Wimalasuriya

Senior Systems Engineering Manager @ Virtusa

Indika Wimalasuriya's LinkedIn account


