Conf42 Chaos Engineering 2022 - Online

Confidence in Chaos - How properly applied observability practices can take the ‘chaos’ out of chaos testing

Everyone wants to use observability because it can reduce the time between a problem and a solution, but there is a difference between applying traditional observability practices generically and tailoring them to each project and scenario. Narmatha Bala is a Senior Software Engineering Manager on the Commercial Software Engineering (CSE) team and runs the Observability chapter of the CSE Engineering Playbook, a best-practices guide put together by development teams to accelerate the sharing of how best to work together, to quickly transfer learnings across teams and customers (particularly on new technologies and projects), and to enable customers to carry the work forward and continue development. CSE is a global engineering organization that works directly with the largest companies and not-for-profits in the world to tackle their most significant technical challenges.

In this talk, Narmatha will share her experiences instilling observability best practices, contrasting engineers who truly make use of observability in their applications with engineers who only think they do. She will cover why actionable failures are a good thing in system design, how to best define Service Level Agreements and Objectives (SLAs, SLOs), and where monitoring fits into the process. This talk is best suited for intermediate practitioners who want to improve their chaos testing by improving their observability practices.


  • Narmatha Balasundram is a software engineering manager on a commercial software engineering team at Microsoft. In today's talk, we will see how to increase confidence while doing chaos testing. Having good observability can help take the chaos out of chaos testing, she says.
  • Chaos testing is about deviating from what normal looks like. For an SRE engineer coming in, what are the key metrics they should be aware of? Creating a dashboard for that particular stakeholder, or for the SRE team, makes much more sense than just throwing everything into one combined dashboard.
  • SLAs are service level agreements between two parties about what the services are going to be. An error budget is the maximum amount of time a technical system can fail. Knowing the SLOs helps identify critical issues before chaos testing even starts. Monitoring and alerts are a great place to get an overall view of how the system is doing.
  • As we wrap up, here are a few closing thoughts on observability: always have distributed logs whose information is correlated with each other, and iterate on instrumentation. As long as we treat this as an iterative process, we keep learning.


This transcript was autogenerated.
Get real-time feedback into the behavior of your distributed systems: observing changes, exceptions, and errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again. Hi, I'm Narmatha Balasundram. I'm a software engineering manager on a commercial software engineering team at Microsoft. In today's talk, we are going to see how to increase confidence as you're doing your chaos testing, and how having good observability can help take the chaos out of chaos testing. So what is chaos engineering? Chaos engineering is the art of intentionally breaking the system with the sole purpose of making the system more stable. In a typical chaos testing environment, we start with a steady state, understand how the system behaves during that steady state, and come up with a hypothesis. So if we say the system is supposed to scale during high traffic, then we run tests to see if the system behaves as we intend it to. And when we find that the system does not behave as intended, we make changes and test again. This is a very iterative process. So what is observability? Observability is the ability to answer new questions with existing data. Having observability while doing chaos testing helps us understand what the base state is, how normal looks for the system, and what the deviated state is: when there is unusual activity, an unusual event, or an unusual response in the system, how does that change things? Chaos engineering without observability will just lead to chaos. In the talk today, we'll see how we can take that chaos out of testing with observability. First, we'll see what true observability in a system is and what the different attributes are. Secondly, we'll see what the golden signals of monitoring are and how they can help us have actionable failures.
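The steady-state, hypothesis, test, iterate loop described above can be sketched as a tiny harness. This is an illustrative sketch, not a real chaos framework: the function names are invented here, and it assumes p95 latency is the metric under test and a 1.5x tolerance as the hypothesis.

```python
def p95_latency_ms(samples):
    """p95 latency from a list of per-request latencies in milliseconds."""
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

def chaos_hypothesis_holds(steady_samples, chaos_samples, max_degradation=1.5):
    """Hypothesis: under injected stress, p95 latency stays within
    max_degradation times the steady-state baseline."""
    baseline = p95_latency_ms(steady_samples)
    under_chaos = p95_latency_ms(chaos_samples)
    return under_chaos <= baseline * max_degradation

# Steady state ~500 ms; chaos run ~600 ms: within 1.5x, hypothesis holds.
print(chaos_hypothesis_holds([500] * 100, [600] * 100))  # True
# Chaos run ~900 ms: exceeds 1.5x baseline; make changes and test again.
print(chaos_hypothesis_holds([500] * 100, [900] * 100))  # False
```

When the hypothesis fails, that is the signal to change the system and re-run the experiment, which is exactly the iterative loop the talk describes.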
Number three, we'll see what service level agreements and service level objectives are, and how we can use them in our chaos testing to understand how the system is behaving. And lastly, monitoring and alerting: based on the service level objectives, we will see how monitoring can help during chaos testing, and even before the chaos testing process starts. True observability of a system is made up of what the different attributes of the application look like. We talked a little earlier about what chaos testing is: it is about deviating from what normal looks like, and seeing, when the test is run, what the deviated state looks like. For us to understand normal, we need to understand what the health of the components of the system looks like. When requests going against a service return a 200 OK, the service health of the API is doing well. And when there is additional stress on the system, the resource health of the system could also be constrained: things like CPU, disk I/O, or memory could be a constraint. Additionally, alongside these two, we look at the business transaction KPIs. Let's say we are looking at the number of logins per second, or the number of searches per second; these are the key business indicators we should look for as we look at this data across the different components. Now, all this data collected by itself (service health, resource health, and business transaction KPIs separately) does not give a holistic view of the system. Creating a dashboard that represents the values the different stakeholders of the business are looking for is what makes it cohesive. And let's say it's time for chaos testing.
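The three views above (service health, resource health, and a business KPI) can be merged into one dashboard-style record. A minimal sketch; the thresholds (99% OK ratio, 80% CPU) and the logins-per-second KPI are arbitrary illustrative choices, not values from the talk:

```python
def health_snapshot(ok_ratio, cpu_percent, logins_per_sec):
    """Combine service health, resource health, and a business KPI
    into one record a stakeholder dashboard could render."""
    return {
        "service_health": "healthy" if ok_ratio >= 0.99 else "degraded",
        "resource_health": "ok" if cpu_percent < 80 else "constrained",
        "logins_per_sec": logins_per_sec,  # business transaction KPI
    }

# The API is returning 200 OK, yet the resource view reveals CPU pressure:
print(health_snapshot(ok_ratio=0.999, cpu_percent=95, logins_per_sec=42))
```

The point of combining them is visible in the example: each signal alone looks explainable, but side by side they tell a more complete story.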
For an SRE engineer coming in, what are the key metrics they should be aware of and need to look at? Creating a dashboard for that particular stakeholder, or for the SRE team, makes much more sense than just throwing everything into a combined dashboard. Then, alerts. Alerts are the end result of a dashboard. So let's say we see the current state change into an abnormal state: creating alerts, and identifying the parties that need to be alerted in each of these scenarios, is very important. We talked about what these different attributes are, but how do we ensure that these attributes are what gets measured? Google's SRE team came up with the four golden signals, namely traffic, latency, errors, and saturation. Let's take the example of a scenario where there is high stress on the system. It could be because of increased traffic, or it could be because VMs are down. We start off with normal traffic: in a normal scenario it looks like 200 requests per second. Increasing the traffic during the chaos test to, let's say, 500 or even 600 requests per second, how does that affect the latency? Latency during a normal state looks like 500 milliseconds per request; with increased traffic, how does that deviate from what the normal state looks like? And how do traffic and latency play a role in errors? Are you seeing more timeout errors? And because of the high traffic coming into the system, are the resources, like the CPU, the memory, and the disk I/O, constrained? These are the key things to watch out for as you look at the signals coming in from the system. So we talked about the attributes, and we talked about what the golden signals are. How do we identify what these actionable failures are, and what makes a good actionable failure?
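Comparing a chaos-run snapshot of the four golden signals against the steady-state baseline can be sketched as follows. The 25% tolerance is an arbitrary illustrative choice, and the baseline numbers echo the talk's examples (200 requests/sec, 500 ms):

```python
from dataclasses import dataclass, fields

@dataclass
class GoldenSignals:
    requests_per_sec: float  # traffic
    p95_latency_ms: float    # latency
    error_rate: float        # errors (failing fraction of requests)
    cpu_utilization: float   # saturation

def deviated_signals(baseline, current, tolerance=0.25):
    """Names of golden signals that moved more than `tolerance`
    (relative) away from the steady-state baseline."""
    out = []
    for f in fields(GoldenSignals):
        base = getattr(baseline, f.name)
        cur = getattr(current, f.name)
        if base > 0 and abs(cur - base) / base > tolerance:
            out.append(f.name)
    return out

steady = GoldenSignals(200, 500, 0.001, 0.40)  # normal state from the talk
chaos = GoldenSignals(600, 1200, 0.05, 0.95)   # during injected load
print(deviated_signals(steady, chaos))         # all four signals deviate
```

In a real system the snapshots would come from a metrics backend rather than hand-typed values, but the comparison logic is the same.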
An actionable failure is one where the time to recovery is minimal: from the time you identify a problem to the time it gets recovered needs to be minimal. That means any logs we collect in the system should carry enough contextual information that we get to the problem area faster. This reminds me of a scenario from my previous experience, where building an observable system just meant creating logs. We've all done that: we go in and log, and we feel pretty confident that we did all the good things. We had good logs, we had an alerting system in place, things seemed fine. And then we realized, as we started looking at production troubleshooting, that the logs were very atomic: with no correlation between the logs from different components, and no contextual data, it took us longer to identify what may have caused the issue. A point for you to remember: there are a lot of logs, a huge volume of logs. As you think about what your observability looks like, make sure these logs are ingested well, and that there are good analytical engines at the back end that can crunch through these logs and give you the data you're looking for. Next, we look at service level agreements and objectives. SLAs are service level agreements: typically agreements between two parties about what the services are going to be, and what the uptime and the response time between the services look like. For instance, consider an agreement between a mapping provider and a ride-sharing application. The kind of agreement they would get into would be the mapping provider committing that the maps will be available 99.99% of the time.
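The correlation problem described here, atomic logs with no shared context across components, can be avoided by stamping every entry with a shared correlation ID. A minimal standard-library sketch (the component names and fields are invented for illustration; real systems would send these to a log pipeline rather than stdout):

```python
import json
import uuid

def log_event(correlation_id, component, message, **context):
    """Emit one structured log line; the shared correlation_id lets a
    log analytics engine stitch together entries from different components."""
    record = json.dumps({
        "correlation_id": correlation_id,
        "component": component,
        "message": message,
        **context,
    })
    print(record)  # stand-in for a real log sink
    return record

# One user request, traceable across two components via the same ID:
cid = str(uuid.uuid4())
log_event(cid, "api-gateway", "request received", path="/checkout")
log_event(cid, "payment-service", "charge failed", upstream_status=502)
```

Because both lines share `correlation_id`, a query for that one ID reconstructs the whole request path, which is exactly the contextual linkage the atomic logs in the story lacked.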
And when they make API changes, it could be something like giving two weeks' notice to the ride-sharing company. This is what is called an SLA. When the mapping provider company takes that agreement back, they need to understand what they must do to be able to meet the SLA. That is the SLO, the service level objective: the objectives a team must hit to meet the agreement they just made with their client. This boils down to what they need to monitor to be able to meet that agreement. And we cannot talk about SLAs and SLOs without talking about error budgets. An error budget is the maximum amount of time a technical system can fail without triggering contractual consequences. So if the agreement is 99.99%, the error budget for the mapping provider company is about 52 minutes per year. That is the maximum amount of time the technical system can fail. Now we'll look at how knowing what these SLAs and SLOs are can help us with our chaos testing. Number one, it helps before the chaos testing even starts: it helps us understand what a critical issue for the user experience looks like. Let's say it's a streaming provider company, and there's a little bit of buffering as users view the content. Is this something that needs to be identified as a critical issue, or is it something that resolves by itself? Knowing the criticality of the issue helps us make a decision on whether to fix it or not, for example when the issue is very intermittent, a reload of the page fixes it, or it's very short-lived. These are the scenarios that help with chaos testing. During the chaos testing, we want to be careful about when to do the chaos testing.
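The 52-minutes figure follows directly from the arithmetic: a year has about 525,600 minutes, and a 99.99% SLO leaves 0.01% of them as budget. A quick check:

```python
def error_budget_minutes_per_year(slo_percent, days_per_year=365):
    """Maximum annual downtime allowed while still meeting the SLO."""
    total_minutes = days_per_year * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

print(round(error_budget_minutes_per_year(99.99), 1))  # 52.6 minutes/year
print(round(error_budget_minutes_per_year(99.9), 1))   # 525.6 minutes/year
```

Each extra "nine" in the SLO cuts the budget by a factor of ten, which is why agreeing to 99.99% rather than 99.9% is a much bigger operational commitment.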
We do not want to introduce more uncertainty into the network when the user experience is already deteriorating, or when there is a system performance problem and things are slow; we want to be very informed about when we start the chaos testing. And once the chaos testing starts, we measure how the system is doing with the chaos and without the chaos, and what the difference is. This helps us decide whether to increase the load on the system, because we are seeing real-time feedback on how the system is doing by watching our golden signals. Then we figure out: is it good to turn up the traffic, or are we hurting the system and should we turn the traffic back down? Then come monitoring and alerts. Monitoring and alerts are a great place to get an overall view of how we are doing: how are the attributes doing with respect to the golden signals, and when things do go bad, what do we do? While we are doing the chaos testing, when the system is bound to break or deteriorate, we evaluate: what are the missing alarms? Are the alarms even in the right place? Are we looking at the right things, or are we just looking at symptoms that are not truly the causes? Are we measuring the right things: the right latency, or the right error numbers, for instance? Once we do that, we take a step back and look at the thresholds of the alerts. This is a key component, because if the threshold is too low, that may result in alert fatigue: with too many alerts, the folks responsible for fixing them may become immune to the fact that there is an alert. So the thresholds of the alerts are key. And lastly, the alerts need to be sent to the right team, so we need to identify who owns, or who is responsible for, fixing the alerts for each segment.
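One common way to tie the alert threshold discussed above to the SLO, rather than to a raw error count, is burn-rate alerting. The 14.4x figure below is the fast-burn threshold suggested in Google's SRE Workbook (burning at 14.4x for one hour consumes 2% of a 30-day budget); a sketch:

```python
def burn_rate(observed_error_rate, slo_percent):
    """How many times faster than sustainable the error budget is burning
    (1.0 = exactly on pace to spend the whole budget over the SLO window)."""
    allowed_error_rate = 1 - slo_percent / 100
    return observed_error_rate / allowed_error_rate

def should_page(observed_error_rate, slo_percent, fast_burn_threshold=14.4):
    """Page a human only on a fast burn; slower burns can go to a ticket
    queue instead, which keeps alert fatigue down."""
    return burn_rate(observed_error_rate, slo_percent) >= fast_burn_threshold

print(should_page(0.005, 99.9))  # burn rate 5x: no page
print(should_page(0.02, 99.9))   # burn rate 20x: page
```

Because the threshold derives from the SLO, it only fires when the error budget, and therefore the contractual agreement, is genuinely at risk, which addresses the "too many alerts" fatigue problem the talk raises.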
Doing this practice while doing chaos testing helps us make sure all these different things are aligned before we start seeing things going bad in production: we've already done this, we've already tested this, and we know, when things go bad, how to identify them, how to fix them, and whether the alerts are going to the right folks. As we wrap this up, here are a few closing thoughts on observability. Depending on where you are on the observability track, I would say always start small. Start with the auto-instrumentation that's available out of the box from the tool you're going to use, and keep adding information on top of it. In a distributed environment, like the way our tech stacks are built, always have distributed logs whose information is correlated with each other, with sufficient traces so you can track how things move along, and with enough context in the logs as you write them. Secondly, iterate on instrumentation. It's a rinse-and-repeat process; there are things you discover that need to be added, and that is all right. As long as we treat this as an iterative process, we learn, and as we learn, we go and make it better. And lastly, celebrate learnings. Once you figure out something doesn't work, it's quite all right; go back in and fix it. With that, I just want to say thank you for listening to this presentation, and I'm very happy to be a part of this. Thank you. Bye.

Narmatha Bala

Senior Software Engineering Manager @ Microsoft CSE
