Conf42 Site Reliability Engineering (SRE) 2024 - Online

- Premiere: 5PM GMT

Stress-testing Azure Resources using Chaos Studio

Abstract

Azure Chaos Studio allows organizations to run disaster exercises, stress-test their environments for outages, and learn from them. This session will guide you through the concepts, using an action-packed demo scenario to blow you out of your seat.

Summary

  • Peter De Tender is a technical trainer at Microsoft, providing technical training to top customers and partners across the globe. In the bit of free time that he has, he likes to share knowledge on any topic that's Azure related. Feel free to reach out on Twitter, by email or LinkedIn.
  • Site reliability engineering stands for site reliability engineer, or engineering. We refer to the site as any possible workload that is business critical and runs 24/7. The engineering piece means applying the principles of computer science. It can also mean figuring out how to apply existing solutions to new problems.
  • Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. The last part included here is the DevOps engineer, because human beings are still important.
  • Azure Chaos Studio is an Azure service offering chaos engineering as a service. Chaos Studio allows you to bring a full set of faults into your scenario. Faults can be organized to run in parallel or in sequence, depending on the needs. Let's have a look at a couple of demo scenarios.
  • A managed identity is an Azure AD service principal, a security object that lets one Azure service interact with other parts of the platform: virtual machines, Kubernetes clusters, App Services, Redis Cache, Cosmos DB, and many other services. Here's how to set one up from scratch.
  • You can run the same performance testing against multiple endpoints, adding VM scale sets, web apps, Kubernetes clusters and anything else already covered. From here you can add multiple steps, making it more complex. Some of these tests fail, some are successful.
  • The first link here is the official Chaos Studio documentation on the Microsoft Learn platform. The second one points to additional Learn resources. Enjoy the rest of the Conf42 Site Reliability Engineering conference, and I hope to see you again.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, welcome to my session on stress testing Azure resources using Azure Chaos Studio. To share a bit about myself, I'm Peter De Tender, originally from Belgium but moved to Redmond, Washington about two years ago. I'm a technical trainer at Microsoft, providing technical training (how hard can it be to come up with a job description, right?) to our top customers and partners across the globe. In the bit of free time that I have, I still like to go back to Azure, sharing knowledge, presenting at virtual conferences like this one or in person, on any topic that's Azure related: Azure DevOps, site reliability engineering or app modernization. I also like to write articles on my blog website, 007FFFLearning.com, or publish books, where the latest one was a bit more than two years ago, on the art of site reliability engineering. Feel free to reach out on Twitter, by email or LinkedIn. Now, with the personal marketing out of the way, let's jump straight into the technical piece of the session, starting with describing what site reliability engineering is about. In short, site reliability engineering stands for site reliability engineer, or engineering. Initially it comes from Google, where it pointed to running their main application, the www.google.com search website, which should have been available at all times. When the practice moved out of Google and became a public practice, we started referring to the site as any possible workload that is business critical and runs 24/7. The other part is the reliability piece, where reliability means that you want to guarantee as a team that any running application you need to support is available no matter what's happening, or maybe even better, according to business requirements. And the engineering piece is applying the principles of computer science and using engineering concepts to build and maintain your systems and applications, all the way from development into monitoring. Now, drilling down a bit more on the specifics would probably take me, I don't know, two or three days, maybe more, but you could simplify it a little bit into these core responsibilities. First of all, when you're wearing your developer hat, it means that you're working on writing software for typically larger scaled workloads. Sometimes you also take responsibility for side pieces of running your application like backup, monitoring, load balancing, and, if you like, even moving into operations. And last, it could also mean figuring out how to apply existing solutions to new problems. Good. Now with that, let me move a little bit away from site reliability engineering into chaos engineering. What is chaos engineering? I could summarize it as the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. This is the official definition coming from the Principles of Chaos, which, if you ask me, could be the name of a rock band. There are three core keywords I want to emphasize. First of all, it's experimenting, which, if you know a bit about DevOps, also means failing fast, because the faster you fail, the faster you're forced to recover, and the more you're going to learn how to make your systems more reliable, more resilient.
So I compare experimenting to licking a fuse as a kid, right, where your hair would spike, or maybe, again as a kid (don't ask me how I know), going downhill with your bike. You're not super experienced in it yet, and you go super downhill, super fast, and maybe you fall and break your arm, and then you go, oh my God, this was so cool, I'm gonna do this again. Now, the more loopholes we can identify up front, the more confidence (which is the next part in the definition) we can have in the system's reliability. By introducing a series of event simulations, based on real incidents or on imaginary outages that could happen, you can target your workloads and learn from the impact. And then the last piece is overall withstanding any possible turbulent conditions. Think of CPU pressure, or unplanned load, or maybe an unplanned outage; all of those could qualify as chaos engineering issues. One example I would like to start with is what I call the curious case of CPU pressure. What does it mean? Imagine you have a workload (could be anything, could run in cloud, could run on-prem, could be hybrid) that has been running fine for months, and you know the average CPU load. Why do you know that? Because if it's running in Azure, you're going to use Azure monitoring, and if it's running on-prem, you're going to use on-prem monitoring tools. In the end the tool is not too important, as long as you integrate monitoring. But then suddenly there is a spike, and eventually, when you hammer your system, it probably goes down or crashes. Right, now the application stops, the database goes down, the web app is no longer available, and so on. Apart from troubleshooting the data piece, it also goes hand in hand with testing your engineering team: how can we rebuild the system, how can we get it up and running again as fast as possible? It might also be that you don't even know the reason why, and that's why you want to use chaos engineering, because what you're going to do is integrate functional testing to make sure that any possible outage, planned or unplanned, is not going to happen anymore, or at least, I would say, to minimize the risk. That's the main thing. You can see here that I'm using a couple of examples like a virtual machine, a Kubernetes cluster, Key Vault, network security groups. Why? Because all of these are supported in the Azure Chaos Studio service that I'll talk about later. The last part I included here is the DevOps engineer. Why? Because human beings are still important, right? There's still a huge amount of issues, unfortunately, when running environments, because of human interaction. And don't forget we're mainly talking about production environments here. Now you might go, wait a minute, Peter, why are you not that happy with human beings? Or don't you like DevOps engineers? I provide training, and DevOps is one of the main technologies I'm providing training on. So why is it so important to include the DevOps engineer in the curious case of CPU pressure? Because we all know what happens: we publish applications maybe on a Friday afternoon, because we have the whole weekend to recover in case something goes down. But going back to production environments, it also means that we need to make sure that everything keeps up and running. And if it's not because of the platform, and not because of the load on the platform, in some cases, unfortunately, it's still the human being.
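As a side note, and purely as a hedged sketch with hypothetical resource names, this is roughly how you could pull that baseline "Percentage CPU" metric with the Azure CLI instead of the portal:

```
# Pull the average "Percentage CPU" platform metric for a VM in 5-minute intervals.
# Subscription ID, resource group and VM name are hypothetical placeholders.
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/rg-chaos-demo/providers/Microsoft.Compute/virtualMachines/web-vm" \
  --metric "Percentage CPU" \
  --interval PT5M \
  --aggregation Average \
  --output table
```

The same numbers show up in Azure Monitor in the portal; the point is simply to know your baseline before you start injecting faults.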
We are still in the curious case of CPU pressure. What's important here is that we're actually trying to step away from one individual component. Why is that? Because if you think about virtual machines or Kubernetes clusters, then yes, we look at CPU, but, to reuse CPU as the example, it's typically not the main root cause. Why not? Because there's a lot more going on in keeping your systems up and running besides monitoring CPU load. So it might be that CPU is spiking because of latency in your database operations, running, I don't know, some complex calculation or a database update. Or it might be that there is a network connectivity issue by which the operation cannot write to the database, and the fact that it cannot write to the database is actually what's causing the CPU pressure. So my analogy here is that systems are complex. Virtual machine scale sets: yes, it's running a virtual machine, but it's running a few more than just a single one. Or a more complex architecture like Kubernetes. Overall you're validating your IaaS, PaaS and serverless workloads like Azure Functions, or maybe even the latest one, Service Bus, as part of your architecture, Cosmos DB, and so many other examples. And then, to add even more complexity, think of all of those in one single scenario, where you're running virtual machines for part of the workload, and next to that, short-running container tasks inside Kubernetes clusters, maybe Kubernetes clusters across a hybrid scenario, partly Azure, AWS, Google Cloud, and why not on-prem, bringing all that together and then still having the human being as a potential weak spot. Imagine you need to manage your DevOps teams and they're active all over the globe, in different time zones, managing the permissions and so on. Now you might think that chaos engineering is the next big thing, maybe even following site reliability engineering, which you could say was following DevOps, but that it's too revolutionary for your cloud environments. I think nothing is more wrong. In fact, chaos engineering has been around for more than ten years now, initially developed by software engineers from Netflix back in 2008, when they started migrating from on-prem data centers into public cloud data centers. While there are a lot of similarities between managing your own data center and using public cloud, there are also quite some big differences, and it was mainly those differences that forced Netflix engineers to create service architectures with higher reliability. To be clear, I think that chaos engineering is not DevOps 3.0, but it definitely should be part of a DevOps team's arsenal of tools to meet your business requirements and validate how your applications are running. So with that, let's make it a little bit more technology focused, on one specific service called Azure Chaos Studio. Azure Chaos Studio is, as you can probably figure out, an Azure service offering chaos engineering as a service, which means that you can inject faults into your Azure workloads. Thinking back about the definition, preferably you're going to use chaos engineering against your production environment. But honestly, trust me, you can do this against test and development environments as well.
Now, whether you're testing how applications will run in Azure, or you're migrating applications to Azure, or maybe you're already running production workloads in Azure, Chaos Studio allows you to bring a full set of faults into your scenario, ranging from virtual machines, which we call agent-based chaos testing, to serverless: Kubernetes clusters, targeting Azure Key Vault, targeting network security groups, and one of the latest services we added is Service Bus. The core of Chaos Studio is chaos experiments. A chaos experiment is an Azure resource that describes the faults that should be run and the resources those faults should be run against. Faults can be organized to run in parallel or in sequence, depending on the needs, and I'll show you that in an upcoming demo. Chaos Studio supports two types of faults. I already talked about service-direct, which means you're going to target a service that doesn't require an agent. Next to that you've got agent-based faults, which means you're going to target a virtual machine workload, Windows or Linux, and Kubernetes clusters as well. The core is a chaos experiment. When you build a chaos experiment, what you're doing is defining one or more steps that execute sequentially, each step containing one or more branches, as we call them, that run in parallel within the step, and each branch containing one or more actions, such as injecting a fault, waiting for a certain duration, or anything else you could come up with. Finally, you organize the resources (which we call targets) that each fault will be run against. You can move them into a group called a selector, and that's where you can easily reference a group of resources. So, in short, you start with experimenting: you create an experiment and define the step-by-step process. For example, hammering CPU load; next to that, simulating latency; next to that, firing off a crash of a web server, or running some heavily loaded database task, or anything, again, that's running inside a virtual machine or a Kubernetes cluster, or targeting a network security group, or simulating an outage, or not having the correct permissions to connect to Key Vault, App Services, Functions, and so many other examples. And then, in the next step, you define the actual actions. So, reflecting on this, this is what I want you to do: I want you to simulate an action called CPU pressure. Think back to one of the examples I shared before. Within the CPU pressure, I want to run a 99% CPU load, maybe 20, maybe 50, whatever number works and is relevant for your outage testing. And then you want to run it for x amount of time: run this for five minutes, ten minutes, and repeat it for the next hour, although that would actually be a pretty long test. Most probably you don't need the full three hours or one hour; maybe even a couple of minutes is enough to validate and figure out if your virtual machine, or any of the other services I already mentioned, can handle the load. Those two previous slides are technically all you really need to know about managing Azure Chaos Studio: you deploy resources, you create experiments, and you run them. So with that, let's have a look at a couple of demo scenarios and what it actually looks like in real life. This is my Azure environment, where I already enabled Chaos Studio.
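Before diving into the portal, to make that step, branch and action structure a bit more tangible, here is a minimal sketch of roughly what a chaos experiment resource looks like underneath: one step, one branch, and one continuous CPU-pressure action pointed at a selector. The names, resource IDs and exact schema details are illustrative assumptions, not the authoritative shape; check the current Microsoft.Chaos/experiments samples in the docs before using it.

```json
{
  "type": "Microsoft.Chaos/experiments",
  "name": "pdt-cpu-experiment",
  "location": "centralus",
  "identity": { "type": "SystemAssigned" },
  "properties": {
    "selectors": [
      {
        "id": "Selector1",
        "type": "List",
        "targets": [
          {
            "type": "ChaosTarget",
            "id": "/subscriptions/<sub-id>/resourceGroups/rg-chaos-demo/providers/Microsoft.Compute/virtualMachines/ubuntu-vm/providers/Microsoft.Chaos/targets/Microsoft-Agent"
          }
        ]
      }
    ],
    "steps": [
      {
        "name": "Step 1: CPU pressure",
        "branches": [
          {
            "name": "Branch 1",
            "actions": [
              {
                "type": "continuous",
                "name": "urn:csci:microsoft:agent:cpuPressure/1.0",
                "duration": "PT10M",
                "selectorId": "Selector1",
                "parameters": [
                  { "key": "pressureLevel", "value": "99" }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
```

Extra steps run in sequence, extra branches within a step run in parallel, and a delay is just another action type inside a branch.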
Now, if you don't really know how to enable Chaos Studio in the first place: you go into your subscription, and within your subscription you search for resource providers. So go a little bit down here into the resource providers section, which is where you literally enable the Azure fabric features of the platform, and search for Chaos. In my case it's obviously already registered, because otherwise I could not really demo anything; otherwise you click that register option here on top, give it a couple of seconds, worst case a few minutes, and once it shows a green registered status, it means you can start using your chaos environment, or the Chaos Studio service, I should say. So let's jump back to Chaos Studio, and the first thing we're going to do is define a target. A target is, again, anything that's already running in your environment; quite important, it needs to be up and running, because you need to define how you're going to manage your target. As you can see here, I've got two scenarios already enabled: my Ubuntu Linux virtual machine, where I can manage actions, and my AKS cluster. You can see that there are a lot of other scenarios available. I can literally target my virtual machine scale set over here, I can test against an NSG. I don't have an example for App Services, although technically you can actually stop the App Service itself. The virtual machine is pretty obvious, but you can also validate interactions against, say, a Key Vault. So it's not just about virtual machines, it's not just about Kubernetes; it's expanding the target environment. How do we install that agent? That would be the next step, where again you've got service-direct and you have agent-based. For a virtual machine you're probably going to go for agent-based, so it's nothing harder than selecting your target virtual machine, going up here, enable target, and since it's a VM, we're going to install that package. The next thing you need is a managed identity. What is a managed identity? For Chaos Studio, the first time, you need to create that managed identity. If you don't really know the details: a managed identity is an Azure AD service principal, a security object that allows one Azure service, Azure Chaos Studio in this case, to interact with other parts of the platform: virtual machines, NSGs, Kubernetes clusters, App Services, Redis Cache, Cosmos DB, and so many other services. So that would be step one. I already created my managed identity; very important, it's a managed identity for the Chaos Studio service, not a managed identity for the virtual machine that you're going to use as a target. The second dependency component is Application Insights. As you already know, we need Application Insights for our observability, providing the metrics and showing you the output you need to dive into, in your Azure portal, or again using some automation engine, Terraform, PowerShell, Azure CLI, it doesn't really matter, to deploy your Application Insights resource. From there, you just need to define which one you want to use. As you can see, I've got quite a lot of them because I take monitoring quite seriously, and we're going to enable it as well. And that's all it takes; from here it's going to install that agent. Now, to speed up my demo a little bit, I already have this done for my Ubuntu VM, and you can validate your deployment.
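If you prefer scripting those one-time setup steps over clicking through the portal, here is a minimal Azure CLI sketch; the resource group and identity name are hypothetical placeholders:

```
# One-time: register the Chaos Studio resource provider on the subscription,
# then check that it reports "Registered".
az provider register --namespace Microsoft.Chaos
az provider show --namespace Microsoft.Chaos --query registrationState --output tsv

# Create a user-assigned managed identity for Chaos Studio's agent-based targets
# (this is the identity for the Chaos Studio agent, not for the target VM itself).
az identity create --name chaos-studio-identity --resource-group rg-chaos-demo
```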
To validate that agent deployment you don't really have to wait in the portal; it's nothing more than any traditional extension. Maybe you're already familiar with using Chef, Puppet, or some anti-malware scenario like Microsoft Defender, and installing it as an extension, adding a little piece of software, like an agent, inside your virtual machine. My portal seems to not be refreshing, so let's try that again. So I've got my chaos agent extension and provisioning succeeded. Again, this takes just a couple of minutes, but I didn't really want to wait, to show you how it works on the web VM itself. Keep in mind, if it's a Linux back end, you also need to install the stress-ng package inside it, and you can do that using the traditional approach for your Linux VM, like apt-get, depending a bit on the Linux flavor you're using. So what we have right now is our target; we have the virtual machines defined, and if you want, we could also go back, where the deployment is still running, totally fine. Taking a little step back to my Chaos Studio, I could now target a similar concept but using a different service. So I'm going to give it a couple of seconds before it pulls up the capable, or compatible, resources, and maybe use my Cosmos database, where this time I'm going to enable it for a service-direct model, which means I don't need to deploy an agent. That's really how easy it is. It's going to flip back, but we're not going to wait for it; you probably get the idea how to do that. The next step is defining an experiment. I already have a few experiments up and running that I'm going to reuse, just to keep it a little bit entertaining and not waste time on a lot of stuff happening in the backend, but nothing blocks me from showing you how to create a new experiment. So once you have this, it opens up an experiment's configuration. Interestingly enough, an experiment by itself is nothing more than a standalone Azure resource, which also means you can automate the deployment using ARM templates or Bicep, just to give you an example. We're going to call this one the lord chaos. My targets are running in Central US; in the end it's not really that important, but I typically like deploying my experiments in the same region where I'm going to run my testing, or run my engineering experiment itself. From here, as I outlined in the presentation, the highest level is a step, within the step you've got a branch, and out of the branch you define a fault, where a fault can be multiple actions. You can totally customize the step and branch names and add multiple branches. I'm going to keep it a little bit easy for now, because I don't want to run overtime, and just focus on creating a new fault, where a fault is, I would say, picked from a library of fault injections; a lot of them are already available, and if needed you can insert some custom ones out of a fault library that I'll show you in a minute. Since we have different target endpoints, we can go for different fault scenarios. A couple of obvious ones, like validating the shutdown of a virtual machine, or shutting down the full scale set and finding out what the impact is for my application workload, or specifically targeting Cosmos DB, where, again, I didn't need to deploy an agent: it's that service-direct model.
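For the service-direct onboarding, like the Cosmos DB account a moment ago, the portal click boils down to a single management-plane call. The sketch below uses az rest; the target type name, api-version and resource IDs are assumptions on my part, so verify them against the current Chaos Studio onboarding docs:

```
# Enable a Cosmos DB account as a service-direct Chaos Studio target (no agent needed).
# Resource ID, target type name and api-version are illustrative assumptions.
COSMOS_ID="/subscriptions/<sub-id>/resourceGroups/rg-chaos-demo/providers/Microsoft.DocumentDB/databaseAccounts/demo-cosmos"
az rest --method put \
  --url "https://management.azure.com${COSMOS_ID}/providers/Microsoft.Chaos/targets/Microsoft-CosmosDB?api-version=2024-01-01" \
  --body '{"properties":{}}'
```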
From there I can start hammering my Cosmos DB: running a failover from one region to another, shutting it down, moving down the RUs, the request units, and causing some kind of latency, and again, testing how my web front end responds to that. Right, there's a nice list of AKS-specific scenarios, where all of them are based on that open-source Chaos Mesh project, and then, from there, a whole collection of standalone ones. Let me change the color to highlight that it's slightly different: interacting with Key Vault, like no longer giving you access to your secrets and certificates, CPU pressure, physical memory testing, virtual memory, CPU load, stopping a service, killing a process, and so many other scenarios. So let's check out what that fault library is about. In the official Microsoft docs there's a pretty long list of potential faults and specific actions that you can use. The way it works for the easier tasks, the easier fault validations that I talked about, is that you can just select them and you don't really have to do anything; CPU pressure, for example, is quite easy to understand and doesn't really require a lot of settings. But you might also end up in a more complex scenario, and then what you need is a JSON file. You could have an open-source library, again like Chaos Mesh, giving you YAML syntax for Kubernetes environments; for the AKS-specific chaos engineering testing that we offer out of Chaos Studio in Azure, you need to translate those YAML files into JSON, and that's also documented in this article. Don't worry too much about where to find that fault library: you have the link in the Chaos Studio portal, and I also have it in my slides all the way at the end. Good. So let's go back to our scenario, and we're going to simulate CPU pressure. Now I can define some of my settings, right? I'm going to run this for, let's say, ten minutes, and I'm going to hammer my server with 99% CPU. Don't ask me why it's not possible to select 100%, but you probably get the idea: once it reaches 100% CPU, it typically crashes the server. Next, we define the target resource, where I'm going to hammer my Ubuntu and my web VM. The nice thing here is that you can run the same performance testing against multiple endpoints. It could be the same server workload, it could be different ones: my web VM is running my web engine, my Ubuntu VM is running a MySQL or Postgres database, and I'm going to test the behavior when I hammer those servers with CPU load. And that's literally all it takes. If you want, you can add a delay, like testing something for a few minutes, then waiting a couple of minutes, testing it again, waiting a few minutes, testing it again; that's another, maybe more complex, scenario. And then adding an additional step is nothing more than doing the same thing, where it runs through that step-by-step scenario, through that sequence. Let's say for now we're going to kill a process, and the process name is cmd.exe. Not the most fancy one, but this could be any super important, business-critical workload where you just stop and kill that process. That's the idea: next, while my CPU load is hammering, I'm going to trigger some other process stop. And from here you can add multiple steps, making it more complex, integrating the time-waiting scenario; you probably get the idea. From here you review and create, and that's basically it.
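On that earlier point about translating Chaos Mesh YAML into JSON for the AKS faults, here is a hedged illustration of what that translation could look like for a simple PodChaos pod-failure fault. The namespace and duration are hypothetical, and this is not the exact spec used in the demo:

```
# Chaos Mesh PodChaos spec in its native YAML form (illustrative values):
#
#   action: pod-failure
#   mode: all
#   duration: "600s"
#   selector:
#     namespaces:
#       - demo-app
#
# The same spec flattened into the single-line JSON that the AKS Chaos Mesh
# fault in Chaos Studio expects in its jsonSpec parameter:

{"action":"pod-failure","mode":"all","duration":"600s","selector":{"namespaces":["demo-app"]}}
```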
Now, again, I already ran this process, and what I have is my CPU experiment, and later on I've got something for my AKS cluster as well. So, from here: we have Chaos Studio, we enabled the service in the platform; next, we defined targets, service-direct or using that agent-based deployment; step number three, we defined experiments, as easy or as complex as you want, targeting a single step with a single fault against a single VM target object, or making it more complex, adding VM scale sets, web apps, testing Kubernetes clusters and anything else that I already talked about. From here, the next step is obviously triggering an experiment. Before this is capable of kicking off, you need to define role-based access control permissions; you need to provide the correct permissions. And interestingly enough, it only needs Reader permissions, so there's no real security violation there, I would say. The CPU experiment by itself becomes a standalone Azure resource, which means it has a service principal in the back, and we need to define our RBAC permissions: defining Reader role-based access for my PDT CPU experiment towards that specific resource. So we go into our target scenario, in this case my Ubuntu VM, we go into access control, and we're going to define that my PDT CPU experiment gets Reader permissions. So: add a new role assignment, Reader permissions is all it needs, and we specify the member, where my member is the PDT CPU experiment. That should be able to find something; not right now, for some weird reason, but you probably get the idea, so I'm not going to wait for it. Now the next step is running your testing. As you can see, I've got a few other runs from other demos that I did; some of them failed, some of them were successful, and we're obviously aiming for successful ones. So what I'm going to do here is kick it off, starting my process, just to show you what it's about. It's going to move this into the Chaos Studio processing queue, so nothing really to wait for, and while it is running you can validate the details. Looks like my portal is running behind a little bit. Here it tells me, because I was not able to actually define that RBAC, that it's not working: I don't have permissions to run this. Which is totally fine, I explicitly wanted to show you that. But I can go back to another run and show you the outcome of the exact same scenario: I simulated CPU pressure, running CPU testing for ten minutes. You can nicely see here it started at 4:46 and finished at 4:56, so it ran for about ten minutes, and the outcome is complete. The part that I cannot really show you, but you can probably figure out, is that you could go back to your Azure Monitor, go back to Log Analytics, go to your virtual machine if you want to run this live and validate it, go into the metrics, check the CPU load, and see that the CPU spikes up to 99% and then after ten minutes totally drops back. And at the same time, and that's obviously more important, you're going to validate the impact on your workload. Another scenario, for my AKS cluster, why not, is first of all defining your Chaos Studio, which we already did, defining the target agentless, so the service-direct option, and creating an experiment, where what I did with this one is defining a step where I'm going to run a predefined Chaos Mesh fault.
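Circling back to that RBAC step for a second: scripted, the same role assignment would look roughly like this. The experiment's principal ID, subscription and resource names are placeholders, and Reader is the only role the experiment's identity needed on the target in this demo:

```
# Give the experiment's system-assigned identity Reader on the target VM.
# <experiment-principal-id>, subscription and resource names are placeholders.
az role assignment create \
  --assignee-object-id "<experiment-principal-id>" \
  --assignee-principal-type ServicePrincipal \
  --role "Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/rg-chaos-demo/providers/Microsoft.Compute/virtualMachines/ubuntu-vm"
```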
That predefined fault comes out of that open-sourced library, the fault library family that I showed you earlier in the docs; I'm just grabbing one of those parameters, and you can see here parts of that JSON. I don't really know if it's related to the preview, but if I go into edit mode, it's not going to show me the details of that JSON. That's something I noticed while going through the configuration: from here it's not really, I would say, publishing or showing you the details of that JSON file experiment. So maybe something to keep in mind during your preview testing: copy your fault library snippets aside if you don't just want to rely on what's already available over here in that fault library when you want to spin up a new one. So what I could do here is target my pods and grab one of those JSON specs; that's literally what I did, and you can see it down here. So let's run this experiment as well. We're going to start it and kick it off. The best way to validate your running pods and how they're behaving is running kubectl, right? You can validate a little bit of it using Azure Monitor as well, but why not show you kubectl get pods? Now, out of my experiment, it should run: some of my pods are running and some other ones are getting stopped. That's literally what we're testing here. I can see that already: in just a couple of minutes one pod is terminating and it is restarting once, and for the other ones, because there's a little bit of time I kept between kicking off the task in the recording and moving to the behavior of the pods, you can see that I'm literally stopping a pod, starting it again, stopping it, starting it again. That's the main scenario we're testing. I'm only running one node in the backend, but you probably get the idea from here. Awesome. I'm so excited that all my demos actually worked. Now, to make sure I'm not running too much out of time here for my session, let's wrap it up with sharing some resources. The first link here is our official Chaos Studio documentation on the Microsoft Learn platform. The second one is a pointer to additional Learn resources that could be helpful if you're a little bit teased by the demos that I did, because those are the step-by-step instructions on how to integrate Chaos Studio for your Kubernetes clusters and for virtual machines with the agent-based approach. With that, I would like to thank you for watching my session. I hope you learned from it, and obviously, even more, I hope you enjoyed it. Don't hesitate to reach out in case of any questions on the session, on Azure, and maybe even more so on Azure Chaos Studio specifically. Enjoy the rest of the Conf42 Site Reliability Engineering conference, and I hope to see you again in any of my other online sessions. Thank you so much and enjoy the rest of the day.
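If you want to watch the same pod behavior yourself while an experiment runs, the demo used plain kubectl get pods; adding a watch makes the churn a bit easier to follow. The namespace below is a hypothetical placeholder:

```
# Watch pods terminate and get recreated while the Chaos Mesh pod fault is running.
kubectl get pods --namespace demo-app --watch
```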
...

Peter De Tender

Business Program Manager - Azure Technical Trainer @ Microsoft



