Conf42 Chaos Engineering 2022 - Online

Chaos engineering at Microsoft with Azure Chaos Studio


Resilience of cloud applications requires collaboration between the cloud provider and the cloud consumer. At Microsoft, we are embodying this ethos with Azure Chaos Studio, a fully-managed chaos engineering experimentation platform for accelerating discovery of hard-to-find problems, from late-stage development through production.

In this session, we’ll discuss how Chaos Studio is being used by both Microsoft cloud service teams and Azure customers to improve resilience and share some key learnings from putting chaos engineering into action at enterprise scale.


  • John Engel-Kemnetz is a principal program manager on the Azure Chaos Studio team. He talks about the concept of resilience and what it means to establish and maintain quality in the cloud. Building quality into the entire service development and operation lifecycle is the right way to tackle this challenge.
  • There are two interesting aspects of cloud applications that make building resilience a little bit more challenging than it may have been on premises. Using cloud native applications can increase resilience by virtue of the fact that these are built to be resilient to failure. There is a shared responsibility between the cloud provider and the cloud consumer.
  • Azure Chaos Studio is what we're using within Azure across Microsoft cloud service teams to enable us to improve our availability. Microsoft's approach to chaos engineering fits very well with the models and approach that are out in the industry. Chaos engineering can be used in a wide variety of scenarios.
  • Azure Chaos Studio enables you to do chaos engineering natively within Azure. One of our aspirational goals is to provide experiment templates for the most severe Azure outages. Having this be fully managed means you can focus on the outcomes rather than the implementation.
  • Microsoft has been using chaos engineering for several years to improve the resilience of its own cloud services. The company is investing heavily in failure scenarios over adding specific faults. Over 50 cloud service teams at Microsoft are using Chaos Studio today.
  • Chaos engineering really needs to start with maturing your tooling and processes before you go to introduce any amount of chaos. A great way to understand where to apply chaos engineering is to quantify and analyze past outages. It is important to build confidence in pre production before heading into production.


This transcript was autogenerated. To make changes, submit a PR.
Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence but respond instantly to get things working again.
Hi, my name is John Engel-Kemnetz and I'm a principal program manager on the Azure Chaos Studio team here at Microsoft. I'm here to talk to you a little bit about how we do chaos engineering at Microsoft using our new service, Azure Chaos Studio, and to tell you a little bit about some of our learnings in doing chaos engineering. So to start us off, I'd like to talk a little bit about this concept of resilience, as well as what it means to establish and maintain quality in the cloud. We know that resilience is the capability of a system to handle and recover from disruptions. A disruption can be anything from a major outage that drops availability to 0% for a long time window to something much more minor: a slight deviation in availability, a sudden high amount of stress, higher latency, et cetera. These are all examples of brownout-type cases, where there is still some disruption to the service even though it is not a complete and utter unavailability of that service. Now, regardless of whether it is a brownout or a true blackout, a major outage event, any sort of impact that is disruptive to the availability and performance of your service is going to impact customer experience. And we know that when there are outages that impact availability, there is business impact. You can have upset customers, you can lose revenue. A key thing with chaos engineering is being able to measure the impact to your business when there is an outage, in terms of the cost to the business, whether it be lost revenue or lost sales or anything that might fit into that category.
But beyond the simple business impact of an outage, what we've found running a major cloud provider is that our customers are running mission-critical apps on Azure, and that means that beyond lost revenue, there can be major legal consequences of an outage, and even in some cases, life-or-death consequences. As a legal example, many financial institutions need to provide audit evidence that they can successfully recover from a disaster. If they cannot, there can be legal consequences from a government. Another example, in the life and safety area, is emergency services. Increasingly, emergency services operate on top of cloud providers, and an outage in an emergency service might be the difference between an ambulance getting to where it needs to go on time and that ambulance not being able to respond appropriately in an emergency. So we take this really seriously, knowing that businesses, the stock market, and finances can be affected, and that life-and-death scenarios and legal consequences can follow, when a service outage causes unavailability. Now at Microsoft, we think that building quality into the entire service development and operation lifecycle is the right way to tackle this challenge. And when we talk about building that quality into the entire service development and operation lifecycle, we really mean two things. The first is thinking about quality from the beginning of the ideation of a new service through the development of that service, and through the deployment and operation of the service. That continues through to the continuous deployment and development of that service, and even to maintaining quality through deprecation of a legacy service. The other thing that this means to us at Microsoft is that beyond simply making quality something that our site reliability engineers and DevOps engineers think about, quality has to be something that is a part of the culture of the entire company.
And that means including leaders, managers, as well as other folks involved in the building and development of applications. So product managers, testers, folks who are doing marketing: even getting them involved in thinking about the quality of the services from a business perspective helps to reinforce the importance of quality. As a product manager, my accountability is not just to additional users or additional revenue, it is also to having a service that is high quality. Quality is a customer requirement, and that means that both me as an individual contributor and my management chain all have to be thinking about quality and prioritizing it as a fundamental, similar to security, that everyone takes seriously and contributes to as a new service is being built or while operating an existing service. At Microsoft, one initiative we're doing to tackle this is making sure that quality is included in the core priorities for every employee who works in our Azure division, and then we're measuring ourselves so that everyone is accountable for contributing to improvements in quality in some way, shape or form. Now, all of this becomes particularly important when we're talking about cloud applications. There are two interesting aspects of cloud applications that make building resilience a little bit more challenging than it may have been on premises. The first aspect is that the architecture types for cloud applications tend to be highly distributed, highly complex, and oftentimes less familiar to the folks who are using them. So while there are enormous benefits in leveraging services like Azure Kubernetes Service or Azure App Service, there is a slight drawback in that the patterns for building resilience into those services may be a little bit less mature. And certainly knowledge of how to leverage those patterns within any given organization might be lower.
So using cloud-native applications can increase resilience by virtue of the fact that these are built to be resilient to failure. But there is this consequence of potentially lower knowledge and lower ability to execute on having those best practices built in when developing a cloud-native application. The other part about migration to the cloud that can be challenging is the sudden increased distance between the cloud consumer and the application that you've written, and the underlying compute that's going to run that code. Depending on the service type you choose, let's say you're developing a serverless application, there may be three, four, five layers of compute between the code you've written and the actual code that is running in our physical data center. Now, that benefits you as a cloud consumer because you benefit from the scale and cost efficiency of the cloud. You also benefit from the resilience that can get built at scale by a large-scale cloud provider like Azure. However, it does mean that there are these abstractions, and sometimes you're at the mercy of the cloud provider when it comes to resilience. There's plenty that can be done to defend against a failure in your underlying cloud provider, but sometimes you're really just sort of hoping that the platform is stable, because if there is an issue in the underlying platform, there's not much you can do to avoid that becoming an issue for your upstream service. And this is why we believe that, much like the security pillar of cloud development, with resilience there is a shared responsibility between the cloud provider and the cloud consumer. When we use this term shared responsibility, what we mean is that we both have a shared accountability for ensuring that our applications are robust, redundant, and reliable, so that your downstream customer, the consumer of your applications, doesn't see downtime.
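As a rough illustration of why the responsibility is shared, here is a small Python sketch. It assumes, purely for illustration, that the application depends serially on the platform and that the two fail independently, so the availability the end customer sees is approximately the product of the two. The function names and the figures are hypothetical and are not Azure SLA numbers.

```python
def end_to_end_availability(provider: float, consumer: float) -> float:
    """Availability seen by the end customer, assuming the application
    depends serially on the platform and the two fail independently."""
    for value in (provider, consumer):
        if not 0.0 <= value <= 1.0:
            raise ValueError("availability must be between 0 and 1")
    return provider * consumer


def downtime_minutes_per_30_days(availability: float) -> float:
    """Convert an availability fraction into minutes of downtime per 30 days."""
    return (1.0 - availability) * 30 * 24 * 60


# Hypothetical figures: a perfect platform cannot compensate for a weak
# application, and a perfect application cannot compensate for a weak platform.
scenarios = {
    "perfect provider, weak consumer": end_to_end_availability(1.0, 0.99),
    "weak provider, perfect consumer": end_to_end_availability(0.99, 1.0),
    "both strong": end_to_end_availability(0.9995, 0.9995),
}
for name, availability in scenarios.items():
    minutes = downtime_minutes_per_30_days(availability)
    print(f"{name}: {availability:.4%} available, about {minutes:.0f} min down per 30 days")
```

Under these assumptions the weaker side bounds what the customer experiences, which is the practical meaning of shared accountability.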
Now, consider two cases. In one, the cloud provider has 100% availability, but the cloud consumer has not implemented best practices in terms of resilience. In the other, the cloud consumer has implemented every best practice available to defend against any sort of failure, but the cloud provider simply has horrible SLA attainment and constant outages. In either of those scenarios, there will still be downstream impact to a customer. And that's why we believe that as the cloud provider, we need a solution that helps our customers to become resilient and to defend against an issue that can happen either within their own service or in the underlying platform that their service depends on. So we believe that we need to provide that sort of solution, as well as continue to meet our part of the shared responsibility by continuing to raise the availability and resilience of the services that you depend on. All of this is highly relevant to Azure Chaos Studio. Azure Chaos Studio, the exact same product that we make available to customers that run on Azure, is what we're using within Azure across Microsoft cloud service teams to improve our availability by doing chaos engineering. So let's talk a little bit about this concept of chaos engineering. Microsoft's approach to chaos engineering fits very well with the models and approaches that are out in the industry and that many of the experts in site reliability engineering have developed over time. One thing we do like to emphasize is the importance of leveraging the scientific method when doing chaos engineering, so that your chaos isn't simply chaos for chaos's sake, but rather is controlled chaos, structured chaos, chaos that has a definitive outcome and results in some sort of tangible improvement.
So if you're familiar with chaos engineering, you're likely familiar with the idea of starting with an understanding of your steady state: having appropriate observability and health modeling such that you can identify an SLI, a service level indicator, and an SLO, a service level objective, that are going to be your bar for availability. You then leverage that steady state to formulate a hypothesis about your system, where we say we believe that we won't deviate from the steady state by more than a certain percentage given some particular failure scenario happening within our application. With that hypothesis, you can then go and create a chaos experiment and run that experiment to assess whether your hypothesis was valid or invalid, and that allows you to do some analysis to understand: were we resilient to that failure? If not, we have some work to do to dig deeper into the logs and the traces to understand exactly why our hypothesis was invalid, and that inevitably leads to some sort of improvement in the service. And this is cyclical, both because you're going to continuously want to raise the bar in terms of the quality of your service, and because services are going to continue to change. It may be a service that is growing, with continuous development happening on it, or it may simply be that over time the platform your service depends on has mandatory upgrades, say upgrades to your version of Kubernetes, upgrades to your operating system version, upgrades to your version of .NET or Python or whatever libraries you depend on. Some of those will be forced on your service. And so that means that maintaining resilience against certain scenarios requires that you're thinking about this cyclically, and not just as a one-time activity.
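The steady-state and hypothesis steps can be sketched in a few lines of Python. This is a minimal illustration and not part of Azure Chaos Studio: the `Hypothesis` class, the `success_rate` SLI, and the thresholds are all hypothetical, and the request outcomes are hard-coded where a real experiment would read them from observability tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hypothesis:
    sli_name: str
    slo: float                # minimum acceptable SLI value (the availability bar)
    max_relative_drop: float  # allowed deviation from steady state, e.g. 0.005 = 0.5%


def success_rate(outcomes: Sequence[bool]) -> float:
    """SLI: fraction of requests that succeeded."""
    if not outcomes:
        raise ValueError("no requests observed")
    return sum(outcomes) / len(outcomes)


def evaluate(hypothesis: Hypothesis, steady: float, during_fault: float) -> bool:
    """The hypothesis holds if the SLI stays at or above the SLO and does not
    drop from the steady state by more than the allowed fraction."""
    if steady <= 0:
        raise ValueError("steady-state SLI must be positive")
    relative_drop = (steady - during_fault) / steady
    return during_fault >= hypothesis.slo and relative_drop <= hypothesis.max_relative_drop


# Hypothetical observations: True means the request succeeded.
hypothesis = Hypothesis("success_rate", slo=0.995, max_relative_drop=0.005)
steady_state = success_rate([True] * 999 + [False] * 1)
during_experiment = success_rate([True] * 990 + [False] * 10)

if evaluate(hypothesis, steady_state, during_experiment):
    print("Hypothesis valid: the service was resilient to this failure.")
else:
    print("Hypothesis invalid: dig into logs and traces, improve, and rerun.")
```

Because the loop is cyclical, the same check would be rerun after each improvement and after each platform upgrade.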
Now at Microsoft, we also believe that chaos engineering can be used in a wide variety of scenarios, from those that we hear of as shift-right scenarios, where we're doing game days, business continuity and disaster recovery drills, and ensuring that our live site tooling and observability data cover all of our key scenarios. But we also believe strongly in pulling those quality learnings earlier into the cycle, so that we prevent any regressions in what we've done in shift-right quality validation. Now, when should you use shift-right quality validation versus pulling something into your CI/CD pipeline and leveraging that as a gate before a deployment can be flighted outwards? Well, I think a major factor here is whether or not you need real customer traffic or really well-simulated customer traffic. If there is a certain scenario where you can generate load on demand and you only need load for a specific amount of time, and that load doesn't have to be as random or fit the exact patterns or scale of true production customer traffic, well, that's something that you can generate via a load test, and then you can perform chaos engineering in your CI/CD pipeline. But there are plenty of cases where shifting right means having some percentage of real customer traffic, or really well-simulated synthetic traffic that would make your service appear to be undergoing real stress from users. Now, on shift right, an interesting thing we've found since we introduced Azure Chaos Studio is that the conventional wisdom that chaos engineering needs to happen in production may not apply very well to the mission-critical services that a number of our customers are building and running on the Azure cloud. So when it comes to a shift-right scenario being done in production versus pre-production, we believe that there is a very useful and valid case for when you should be doing chaos in production.
But there are also plenty of cases where chaos should really be done, or start, in pre-production. The first case we hear from customers is simply, hey, we have not built up enough confidence yet in a particular failure scenario to go cause that failure in production. So if it's the beginning of your journey with availability zones, or you're just beginning to stress test a new application, chances are you're going to want to start in pre-production, where there's less risk, before moving that sort of test out into the production environment. The production test then becomes more of a final checkbox validation that everything's working as expected. The other thing that comes up with shift right being in production versus pre-production is risk tolerance when it comes to a particular failure. If you are a mission-critical application, if you are that healthcare provider that is determining whether prescriptions are issued for emergency medical needs, chances are you may say that the risk of an injected fault in production causing an outage is too great and that production simply is not a suitable environment to be doing chaos engineering. So keeping those factors in mind can help you determine when and where you might do chaos engineering. Now, a brief word about Azure Chaos Studio. Chaos Studio is our new product, available as part of the Azure platform, that enables you to do chaos engineering natively within Azure. It's a fully managed service, which means that there is no need to install any utility, make updates, or maintain a platform. Those can be expensive, and it can be a challenge for any service team to have to go and operate, maintain and secure those tools. So having this be fully managed means you can focus on the outcomes rather than the implementation. We're well integrated with Azure's management tools, including Azure Resource Manager, Azure Policy, Project Bicep, and several of the other aspects of Azure, so that things fit very naturally in your ecosystem.
The way you deploy your infrastructure is how you can deploy your chaos experiments, and you can manage and secure access to your chaos experiments exactly as you're doing with any other part of your infrastructure estate that exists in Azure. We have integration with observability tooling to ensure you can do that analysis when a chaos experiment happens. And we have an expanding library of faults that covers a lot of the common Azure service issues. One of our aspirational goals is to provide experiment templates for the most severe Azure outages that happen on the platform. That's something we pay a lot of attention to: when there is a high-severity Azure outage that impacts a customer, how can we transform that into a chaos experiment template that would allow a customer, a cloud consumer, to go and replicate that failure to ensure that they are well defended against having an impact to their availability should any similar sort of outage occur? And the final thing I'll mention about Chaos Studio is that safety is very important to us. We're not a simulator, we're not simulating faults, the faults are really getting injected. That means that when we shut down a virtual machine, the virtual machine is getting shut down. When we apply CPU pressure, that CPU pressure on an AKS cluster is really happening. What this means is that whether it be unintentional, accidental fault injection, or something a little bit more malicious, we want to help make sure that you can defend against those by having appropriate permissions built into the system: restrictions and administrative operations on what resources can be targeted for fault injection and what faults can be used on a particular resource, as well as permissions for the experiment to access those resources. So there's plenty of safety built into the mechanisms in Chaos Studio.
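The idea of restricting which resources can be targeted, and which faults can be used on each, can be illustrated with a small allow-list check. The Python sketch below is a generic illustration of that safety pattern only. It is not Chaos Studio's API, and `FaultPolicy`, the resource names, and the fault names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FaultPolicy:
    """Hypothetical allow-list: resource id -> faults enabled on that resource."""
    allowed: Dict[str, Set[str]] = field(default_factory=dict)

    def enable(self, resource_id: str, fault: str) -> None:
        """Administrative operation: opt a resource in to one specific fault."""
        self.allowed.setdefault(resource_id, set()).add(fault)

    def check(self, resource_id: str, fault: str) -> None:
        """Raise PermissionError unless this fault is enabled on this resource."""
        if resource_id not in self.allowed:
            raise PermissionError(f"{resource_id} is not onboarded for fault injection")
        if fault not in self.allowed[resource_id]:
            raise PermissionError(f"fault '{fault}' is not enabled on {resource_id}")


def inject(policy: FaultPolicy, resource_id: str, fault: str) -> str:
    # The fault is only injected after the policy check passes.
    policy.check(resource_id, fault)
    return f"injecting {fault} into {resource_id}"


policy = FaultPolicy()
policy.enable("vm-frontend-01", "shutdown")

print(inject(policy, "vm-frontend-01", "shutdown"))
try:
    inject(policy, "vm-database-01", "shutdown")
except PermissionError as error:
    print(f"blocked: {error}")
```

Denying by default means an accidental or malicious experiment fails closed: anything not explicitly enabled is rejected before any fault runs.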
Now let's talk a little bit about chaos engineering at Microsoft. At Microsoft, we've been using chaos engineering for several years to improve the resilience of our own cloud services, and the majority of those learnings have contributed to us building Chaos Studio, both as a central service that all of our cloud service teams can use within Microsoft and as an offering that we can make available to our customers. It's currently in public preview. There are over 50 teams at Microsoft, 50 cloud services, that are using Chaos Studio today across a range of Microsoft products, from the Power Platform to the Office suite to the Azure cloud services, we believe. And there are two areas of particular focus in Microsoft right now when it comes to chaos engineering. The first is investing heavily in failure scenarios over adding specific faults. We've learned in analyzing our incidents and in looking at past resilience challenges that oftentimes it's a broader scenario, say a region failure, or an inability to scale with load, or a network configuration change, that is the real scenario you want to be able to replicate when doing chaos engineering, and oftentimes you're leveraging a set of faults to recreate that failure scenario. But at the end of the day, it's the failure scenario that matters, not the individual faults that contribute to it. So rather than focusing on delivering faults for every single option, we like to deliver faults, and encourage our teams to build experiments, around those scenarios, and light up the correct faults for those major scenarios. Take an availability zone going down, an Azure Active Directory outage, a DNS outage, and focus on those. The other thing that we've been investing heavily in is shifting this left, particularly when it comes to high-blast-radius outages.
In the past we've known that there are a couple of places, things like DNS or Azure Active Directory, where any sort of outage in those services can have impact on the majority of Microsoft cloud services. And so while we've done a lot to defend against those dependencies having impact on every other cloud service when they do see impact, we now want to pull that from "we validated it for services that are going out into production" and shift that left into preventing any regression in our ability to be resilient against those particular types of failures. In fact, one thing we're looking forward to doing at Microsoft is ensuring that, at least for the Azure division, every single new deployment of every service has specific failure scenarios validated as part of pre-production, as a pre-production gate before that build is suitable to go out to production, to ensure that we're never regressing a scenario and our resilience to high-blast-radius outages. Now, two great examples of using chaos engineering within Microsoft. The first is the Microsoft Power Platform team. They've been doing region failure experiments with Chaos Studio and have identified several opportunities. When a data center went down, they were able to recover from the failure, but they said, hey, we also want to be able to fail over to a secondary region. So when there's a failure in a region, also have that failover. They tested that by leveraging Chaos Studio to shut down all of their compute and all of their services in a region to validate that the backup would come up, and when it didn't, they were able to go and identify the issue that was keeping that backup from coming up. Another example from that team was simply, during an outage event, acknowledging that they didn't have the appropriate observability to detect the outage early and respond quickly.
Now with Chaos Studio they were able to recreate the conditions and find new places where they needed to instrument further in their monitoring so that they could identify and mitigate those failures and automate responses to them quickly. Another great example is our Azure Key Vault team. That team has been running several availability-zone-down experiments as well as experiments on scaling up the service. A great learning from that team was that while collaboration is important and validating configuration is great, small teams and changes over time might mean that an original configuration in an autoscale rule might not have the same effects over time that it originally did. So in this case, in a pre-production environment, they were doing some chaos engineering and discovered that for a pre-production service, the autoscale rule was misconfigured such that when stress was applied, the virtual machine scale set they ran on was not scaling up further. And so they were actually able to identify that in pre-production and mitigate it before it ever became an issue in production. Now, from what we've learned doing chaos engineering within Microsoft, as well as partnering with some of our big customers who are leveraging Chaos Studio to do chaos engineering in their environments, there are a couple of insights that I'd like to share. The first is that chaos engineering really needs to start with maturing your tooling and processes before you go to introduce any amount of chaos. So ensuring that you have great, robust observability, that you've already built backup mechanisms and you've made your service respond correctly to outages, making sure that you have a great live site process in place, and that you have troubleshooting guides and automatic mitigations in place. Those have to be there before you start to do chaos engineering, because chaos engineering is not suitable for a case where you're learning something you already knew.
Chaos engineering should reveal something new about your service, something unexpected. That's when chaos engineering is at its best. The second insight we've had is that a great way to understand where to apply chaos engineering is to quantify and analyze past outages. Now, the quantification is something I talked about a little bit earlier, where being able to put a monetary amount on an outage can help create that visibility across a larger company and across different sets of stakeholders to make them more invested in the importance of reliability. Putting a dollar amount or a rupee amount, any sort of currency amount, on your service when there is an outage, and on what that particular outage cost, is a wonderful way of keeping everyone's head centered around the importance of resilience. And then once you've built that, go back and analyze past outages to identify where you've had high-blast-radius outages and/or high-frequency outages. Those are two great places to start: looking back at your previous incidents and your responses to them, and then deciding from there which chaos experiments you might want to start with. A third insight for us has been that it is important to build confidence in pre-production before heading out into production. Now, this requires that you've built a great pre-production environment. Within Azure, we have the concept of our canary regions. These are two dedicated regions within Azure Resource Manager that are unavailable to our customers. But almost every Azure service has a stamp in those regions, and services have to go in and bake for a certain amount of time in those regions before they can move into production. Now, the fact that the services being deployed in our canary regions are dependent on other services in canary regions helps us to proactively identify any dependency issues, any failures, and mitigate those before anything hits production.
In fact, for certain services like Microsoft Teams, we're the dogfooders: our own Microsoft Teams traffic for our company goes through the Teams dogfood environment in a canary environment. And that helps us to make sure that we are building quality in those environments before something goes out to the general public who rely on Microsoft Teams. A final insight for us is just that quality can't happen if it's only on one person's back or only on one team's back. For a large-scale organization, you really have to create a culture of quality where everyone believes that this is important. We talked about this a little bit earlier in the presentation, but it is critical to remind ourselves that that culture of quality has to come before any investment in chaos engineering. So with that, I'd like to say thank you very much for your time, and I hope you enjoy the rest of the conference. If you'd like to learn more about Azure Chaos Studio, you can go to aka.ms/AzureChaosStudio. You'll be able to learn more about our service, get started, read our documentation, as well as see some of our customer studies. So enjoy the rest of the conference, and thanks very much.

John Engel-Kemnetz

Principal Program Manager @ Microsoft
