Conf42 Chaos Engineering 2021 - Online

Chaos Engineering + Generic Mitigations: The Path to Self-Healing Systems


Abstract

Chaos Engineering is a powerful tool to test your resilience when the unexpected occurs. But you also need to build out the practice for what comes next - how you rectify issues caused by unexpected conditions.

In this talk, Leonid Belkind, StackPulse CTO and Co-Founder, discusses how to use Chaos Engineering together with generic mitigations to deliver resilience that can’t be accomplished with either alone. He’ll share examples of how to fit these practices together, what’s needed to get started, and how to use both together to begin building out self-healing systems.

Summary

  • How chaos engineering paired with a method called generic mitigations can serve as a really strong foundation for building resilient systems, and could lead us to the dream of self-healing systems. If any of this is of interest to you, stick around.
  • Chaos engineering is defined as a discipline of experimenting on a system. The goal is to build confidence in the system's capability to withstand turbulent conditions in production. Chaos engineering helps us shine a spotlight on where we are not resilient enough and prioritize our work.
  • When talking about resilience, we define proficiency as the ability to deliver more resilient services for less cost. Usually this takes a combination of different tools and different methodologies. To achieve better results here, we need to be familiar with different methods.
  • A mitigation is defined as any action taken to prevent certain impact of an outage or breakage on our service. Generic mitigations are a concept born in the world's most advanced engineering organizations. Sometimes a mitigation is applied when an outage happens, before the source of the outage is fully analyzed.
  • The talk walks through a typical timeline of an outage. Instead of diving into analyzing it further, we apply a mitigation strategy, and only then perform root cause analysis, develop a fix, and deploy it to production. This is the gain for our service users, because the service becomes operational earlier.
  • Rollback of any kind: business logic, the binary executing it, configuration changes, data state, rolling back to the last known working state. Scaling out without taking into consideration the relationships between different components may actually introduce more noise. How you scale things up and down should be well thought out, well rehearsed, and, when needed, applied.
  • Chaos engineering on one end, generic mitigations on the other. Chaos engineering can recreate unexpected, irregular, turbulent conditions. Using generic mitigations, we prepare ourselves for keeping the service operational under unexpected conditions. Together, they really help us manage our investment.
  • What do you need to develop and use generic mitigations? First, you need a platform for developing these logical flows. It is very important that the infrastructure for orchestrating them is separate from your production infrastructure. Picking the right platform here is, again, critical to success.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, I'm Leonid from StackPulse. Today I'd like to talk to you about how chaos engineering, paired with a method called generic mitigations, can serve as a really strong foundation for building resilient systems, maybe even leading us to the dream of self-healing systems. So if any of this is of interest to you, stick around; it's going to be an interesting talk.
In this conversation, we will talk about what chaos engineering is and what it usually does for us. We'll focus a bit on the resilience of our systems, and maybe define how much of it we really want, or how much of it we really need. Then we'll look at what it means to be proficient, to be really good and effective at building resilient systems. And then we'll dive into the world of generic mitigations, figure out what they are, how they can help us, and how this combination actually serves as a very strong foundation for what everybody delivering services wants, which is resilient services.
We are at a chaos engineering conference, so I bet everyone here has an idea of what chaos engineering is and why we need it. Still, to make a structurally well delivered point, let's remember that chaos engineering is defined as a discipline of experimenting on a system, and the goal of that experimentation is to build confidence in the system's capability to withstand turbulent conditions in production. What does it actually mean to withstand turbulent conditions, and why is the experimentation so important? Why can't I plan my system ahead of time to be very resilient? Turbulence, as a word, is defined by the Cambridge dictionary as something that involves a lot of sudden changes, while the Merriam-Webster definition talks about irregular commotion: things that are both sudden and unexpected in their occurrence and can also be very radical in their impact. This is the foundation of chaos engineering, because this turbulence means that even if we plan for it, even if we build for certain variations, production will still surprise us, with changes that are sudden and challenges that are not exactly how we expected them to be.
Let's look at chaos engineering from a different perspective and try to understand what it does for us. Say we invested the effort, built a foundation, and started running these experiments. Why are we doing all that? When we are building a software system, and especially a software service, its fault tolerance, its resilience, is a very important piece, because for those who consume the service, the quality of service, in many cases even protected by service level agreements between the consumer and the provider, is a very important goal. Now, this fault tolerance or resilience gets constantly tested by changes in infrastructure, changes in networking, changes in the application, changes in user consumption patterns. All of these can destabilize our system, and chaos engineering experiments can simulate these things, so that we can look at the outcomes of the experiments and build our systems to be resilient to similar conditions: not exactly the same things we've seen in the experiments, but similar things that will happen in production. So, a very short summary for now:
Chaos engineering experiments help us surface how our system behaves under various unexpected conditions that can occur in production. But the end goal is not to run experiments; the end goal is to achieve resilience. In order to do that, we need to take the deliverables, whatever the experiments surfaced, prioritize those findings, and then invest in modifying our systems, sometimes their architecture, sometimes the infrastructure they run on, and so on, in order to become more resilient to the particular types of interference that were surfaced. So the goal is resilience; chaos engineering helps us shine a spotlight on where we are not resilient enough and prioritize our work.
How much resilience do we actually want in our products? At first this may sound like a very strange question. Imagine sitting in a restaurant, ordering a dish, and the waiter asks you: dear madam or sir, how tasty, how delicious would you like us to make this dish? Would you like a really delicious, once-in-a-lifetime culinary experience, or would you like it to be passable, decent, but nothing to write home about? Naturally, every one of us would answer: what do you mean? I came to this restaurant to enjoy myself; of course I want you to cook it to the absolute best of your culinary ability. Similarly, when you're about to consume a service, any kind of service, an email service, a monitoring service, an HR system, whatever, how resilient would you like it to be? Well, you want it to be very resilient, because you need it every time; if it's not there for you, why would you compromise for anything less than the absolute best resilience?
In theory, that sounds great. In practice, it costs those of us who operate services to make them more resilient. If I were to plot improvement in resilience as a function of investment cost, the graph would look like this: at first, introducing an improvement in resilience, even quite a significant one, can be done at a bearable cost. But as we become more and more resilient, bumping the resilience up yet again costs us more and more each time. Adding that last little bit of resilience, however you measure it, may cost you twice as much. And here's the deal: at the end of the day, not every investment makes sense. If we're building a commercial service, it has consumers who are willing to pay us something for it, and we have our cost of operating and delivering the service; everything in the middle is our margin. That is, if we are a commercial service provider. Even if we're talking about an internal service, there is a certain point of acceptable cost that we are willing to invest. In reality, the amount of resilience that we want is the intersection of these two: what is acceptable both as resilience and as cost.
Let's look at it from a slightly more complex but still very similar perspective. We take our resilience, turn it into the customer-facing aspect, and call it service level. We have a low service level, then we invest additional cost in improving it, and then we have a high service level.
And now there are consumers for our service. For them, the acceptable service level may vary depending on cost, but on this line there are two very important points we should all consider. There is a point we call the minimal acceptable service level; anything below this point just doesn't make sense, because your consumers would not want it. And there is a second very important point: the point where the cost of delivering a more resilient service is actually higher than what the service consumers would be willing to pay. Naturally, cost sensitivity may differ from one prospective consumer to another, so there is a range here. That's what gives rise to phenomena such as premium products, where people or organizations are willing to pay more for something that is more resilient, more luxurious. But still, no matter how the model looks, the conclusions will be the same. For us to operate with a margin, we need to be somewhere within this area: a certain number of consumers are willing to acquire the service at this service level, we operate it for costs that are lower than that, and what's in the middle is what we make. In essence, this is the area where we would end up operating our service, depending on its different properties. This is how much resilience we want.
What does it mean to be proficient in building resilience? Because we would like to be efficient, we would like to be proficient in whatever we are building. Here's a very important point that may differentiate a successful, proficient service provider from a less successful one. When talking about resilience, we define proficiency as the ability to deliver more resilient services for less cost. Now, how do we do that? As a matter of fact, there are many ways. Usually it would not be a single strategy that differentiates, in otherwise identical conditions, between someone delivering a certain resilience level or service level for cost A and somebody else delivering exactly the same resilience level for a higher cost. Usually it is a combination of different tools and different methodologies, chosen in a smart way, each applied to a specific problem; the combination delivers a proficient solution where a certain resilience level is achieved for reduced cost. If I'm proficient, I can build a solution that is very resilient for this much; if somebody is less proficient, the same level will cost them that much.
To sum up this part: as we already established, our end goal is to improve the resilience of our services, to keep our customers happy, to keep ourselves within our service level agreements and service level objectives, and to enable our digital services. To achieve better results here, we need to be familiar with different methods, different technologies, different methodologies, so that we can combine them into the right mix of approaches to resilience for us and for our particular use case.
So far about resilience. Generic mitigations: what are they, and how are they even related? A mitigation is defined as any action taken to prevent certain impact of an outage or breakage on our service, on our system, in production.
Applying an emergency hotfix to production because something has broken down, that's a mitigation. Connecting to a production machine to clean something up or restart something, as bad as it may sound, is also a mitigation. A generic mitigation is a mitigation action, not just the two examples I previously mentioned, that can be applied to a wide variety of outages. As a matter of fact, it is sometimes applied when an outage happens before the source of the outage has been fully analyzed and identified.
Wait, whoa, whoa. Am I talking about bad software engineering? Am I trying to convince you to apply band-aids and sweep the real problems in your software under the rug? Something bloats in memory? Well, let's restart it once a day and nobody will notice. Something fills up a partition? Let's clean it once a day or once a week, and again, maybe nobody will notice. No, this is definitely not what I'm talking about. Generic mitigations are a concept that was born in the organizations most advanced in software architecture and operational quality, organizations such as Google and Facebook.
To explain the generic mitigation concept, let's look at a typical timeline of an outage, where a certain problem happens to your system or service. How does it go? Naturally, it begins with a source: something bad happens. Then, hopefully, a monitoring system identifies a symptom of that bad thing and raises an alert, maybe more than one. Then something or someone looks at these alerts and tries to understand the context, to figure out the exact impact and boundaries. Then a triage occurs, because remember, alerts often show us symptoms of a problem; triage tries to figure out, okay, these are the symptoms, but what is the actual problem? Then we usually perform root cause analysis to understand what causes the problem, we implement, test, and review the fix, and we deploy it to production. This span is marked in a different color, because this is where the users of our system are impacted; this is where the impact ends. And this chain takes time. How much time? It really varies. We've seen outages resolved within mere minutes, and unfortunately we've all seen outages, even at the world's leading digital services, that take hours, I'm afraid to say sometimes even days.
How about considering an alternative timeline, where at the triage stage, when we start understanding where the real problem is, instead of diving into analyzing it further, we apply a mitigation strategy that restores the service to operation, and only then perform root cause analysis, develop a fix, and deploy it to production? As you can see, there is a time difference in the outage. This is our gain in our service level objectives, and the gain of our service users, because the service becomes operational earlier. How much earlier? Let's take just one small piece of this whole chain: the time it takes for a fix, once implemented and verified, to be deployed to production. As recently as a couple of weeks ago, there was a discussion thread on Twitter between leading practitioners.
How quickly should our code be delivered to production, in modern continuous deployment and progressive delivery environments, for it to be considered good enough? The consensus there was that anything around or within the boundary of 15 minutes is considered good. So just that one small piece is 15 minutes of an outage, maybe complete, maybe partial, of your service to its customers. And that's not even counting how much time it can take to understand the root cause, implement the fix, verify and review it, and make sure it's the correct fix, assuming the right people to do the analysis, the fix, and the review are even available. Maybe they're not; maybe it's the middle of the night. This gain is absolutely critical, and as you can clearly see, it doesn't come at the expense of understanding what went wrong, fixing it, and making sure that it never ever happens again, not only that exact failure but anything like it. It is all about having a set of tools that allows us to return our service to operation earlier. That's what the purpose of generic mitigations is.
Interested? Now you'd like to ask me for examples of generic mitigations. I'm glad you asked. Let's look at a couple of patterns to explain what I'm talking about.
Rollback. Rollback of any kind: business logic, the binary executing it, a configuration change, data state, rolling back to the last known working state. Some people may say, well, rollback is simple, of course we support rollback, we supported rolling out the update in the first place. Unfortunately, it is not as simple as it may sound. In a multi-component system with dependencies and data schema changes, being confident enough to perform a multi-component rollback, testing it from time to time, and being able to run it without hesitation in production to return to a solid state, that's not simple. It requires preparation, thought, and testing.
Let's look at a different generic mitigation pattern: upsizing or downsizing. More and more systems built on cloud-native architectures support horizontal or vertical auto-scaling, usually within certain boundaries. Sometimes, especially when scaling things down, human interaction and intervention are very much desired. Scaling up a single stateless component, if you're using a modern orchestrator, probably doesn't sound that complicated. But scaling out without taking into consideration the relationships between different components may actually introduce more noise and just shift the problem elsewhere in your architecture. This is where a strategy for how you scale things up and how you scale things down should be well thought out, well rehearsed, and, when needed, applied. So again, not as simple as just launching a couple more pod replicas.
Draining traffic from certain instances and flipping it over to a different cluster, a different member, a different region: again, a great tool when delivering multi-region, multi-location systems. The point is managing the flip so that there is as little impact as possible on the service users; the smallest amount of impact possible is, of course, zero, and that's the desired one. That's what they would want you to do; let's hope that we can do it.
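To make one of these patterns a bit more concrete, here is a minimal sketch, under my own assumptions, of what a "scale out" generic mitigation might look like using the official Kubernetes Python client. The deployment name, namespace, and replica ceiling are illustrative, not anything prescribed in the talk.

```python
# Minimal sketch of a "scale out" generic mitigation using the official
# Kubernetes Python client. Names and limits below are illustrative only.
from kubernetes import client, config

MAX_REPLICAS = 20  # guardrail: never scale past this ceiling (assumed value)

def scale_out(deployment: str, namespace: str, factor: float = 1.5) -> int:
    """Increase a deployment's replica count, respecting a hard ceiling."""
    config.load_kube_config()          # or load_incluster_config() on a cluster
    apps = client.AppsV1Api()

    scale = apps.read_namespaced_deployment_scale(deployment, namespace)
    current = scale.spec.replicas or 1
    desired = min(MAX_REPLICAS, max(current + 1, int(current * factor)))

    apps.patch_namespaced_deployment_scale(
        deployment, namespace, body={"spec": {"replicas": desired}}
    )
    return desired

if __name__ == "__main__":
    # Hypothetical target; in practice it would come from the triage context.
    new_count = scale_out("checkout-api", "production")
    print(f"scaled checkout-api to {new_count} replicas")
```

As the talk stresses, the interesting engineering is not the API call itself, but the guardrails around it and knowing when it is safe to scale back down.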
Rehearsing these strategies, making sure they are operational so that the least technical person in your organization, in the middle of the night, can execute them, that needs to be thought through as well. Is that all? No, of course not. There are many, many more. Let's give a couple more examples just to open our eyes to the various possibilities.
Quarantining a certain instance, a certain binary, a certain cluster member: removing it from the rotation so it stops handling production traffic, which gets rebalanced among the other, healthy peers, and then investigating the root cause of the problem on that particular instance. A block list: being able to block a specific user, account, or session that is creating a challenging, problematic series of requests or queries, being able to do this in real time and granularly, not just black-and-white block or unblock, but maybe actually introducing guardrails or quotas. Preparing a strategy for that could again be a lifesaver in production situations, if it is mature, well tested, and applicable by the least technical member of your staff in the middle of the night if needed. Disabling a noisy neighbor: imagine a shared resource, a database for example, where extreme pressure from one set of components impacts the ability of another set of components to operate. Identifying the source of the noise and applying guardrails or quotas to it, or maybe pausing it for a certain period of time to give room to more critical processes, that's a strategy. Thinking about it, making it repeatable, making it usable in real time: this is definitely a generic mitigation pattern that many world-leading organizations use today.
So, to sum up this part: generic mitigations are not a practice of applying patches and band-aids to production and sweeping the real problems in your products under the rug. They are a practice of building strategies, and then tools, for improving your ability to meet your own service level objectives and get your service back to an operational state faster, without compromising on root cause analysis, the quality of your code, or the management of your technical debt. Building them, testing them, keeping them in a warm state so that you're not afraid to use them: this is a very important tool in the toolbox of being proficient at building resilient systems.
So how do the two connect? Chaos engineering on one end, generic mitigations on the other. When they are used together, does the end result become greater than the impact of each one on its own, or not? Chaos engineering can recreate unexpected, irregular, turbulent conditions similar to those we will encounter in production, and prepare us for those production challenges. Generic mitigations are a very important tool for meeting those production challenges and keeping our service level objectives, while keeping our reliability expenses at bay. How does the combination of the two provide greater value than each of them alone? Here's how. With generic mitigations, we prepare ourselves for keeping the service operational under unexpected conditions. And how do we know that the investment was made prudently? Because we tested it with chaos engineering experiments, and we can see that the different kinds of failures those experiments recreate are now remediated by the generic mitigations.
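As a rough illustration of that feedback loop, the sketch below pairs a fault injection with a mitigation and then checks whether the service level objective recovers within a budget. The helper functions, service name, and SLO numbers are hypothetical placeholders for whatever chaos tooling, runbook automation, and monitoring stack you actually use.

```python
# Sketch: verify that a generic mitigation actually remediates the failure mode
# a chaos experiment recreates. The helpers below are simulated placeholders.
import time

SLO_ERROR_RATE = 0.01      # assumed objective: fewer than 1% failed requests
RECOVERY_BUDGET_S = 300    # assumed: service must recover within 5 minutes

def inject_pod_failure(service: str) -> None:
    print(f"[chaos] killing a fraction of {service} pods")      # placeholder

def run_mitigation(name: str, service: str) -> None:
    print(f"[mitigation] running '{name}' against {service}")   # placeholder

def error_rate(service: str) -> float:
    return 0.0  # placeholder: would query your monitoring system

def experiment(service: str) -> bool:
    inject_pod_failure(service)
    run_mitigation("scale_out", service)
    deadline = time.time() + RECOVERY_BUDGET_S
    while time.time() < deadline:
        if error_rate(service) < SLO_ERROR_RATE:
            return True    # the mitigation restored the SLO within budget
        time.sleep(10)
    return False           # a resilience gap to prioritize and fix

if __name__ == "__main__":
    ok = experiment("checkout-api")
    print("mitigation verified" if ok else "gap surfaced")
```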
Furthermore, the chaos engineering experiments surface the points where we are still not ready for production fault tolerance, and then it's a continuous cycle of strengthening those areas, again potentially with generic mitigations. Together, this really helps us manage our investment. Remember the cost of delivering a very resilient service: how do we do this proficiently? By surfacing where our biggest problems are, by providing cost-effective tools to resolve those problems, by immediately verifying that those tools have indeed resolved them, and by building a cycle that surfaces our next prioritized investment, and so on. This is where the combination of the two is extremely powerful.
Now, what would happen if we invested only in chaos engineering? We would still need to resolve the problems the experiments surfaced, and without a rich toolset for dealing with them, we would end up investing quite a lot in rearchitecture and many other expensive things, where that may not be the only or even the best solution to the problem. If we invest only in generic mitigations, our ability to test them in real life is very restricted. Sure, we can recreate surrogate scenarios where they would be called into action, but our confidence in the ability to withstand real production turbulence will be much lower, and it will be more difficult, not impossible, but much more difficult, to figure out where to start: which particular mechanism will give us the highest return on investment in terms of resilience gained for the cost, and then the second highest, and so on. This is where the combination of the two is really helpful as a continuous improvement process for the resilience of our software services.
Well, this is pretty much what I wanted to cover in this talk; here are some afterthoughts and maybe some suggested action items if you're considering doing something about it. What do you need in order to develop and use generic mitigations? First, you need a platform for developing these logical flows. You might ask, wait, what's wrong with just using the programming languages I use today? Nothing is wrong, but you need to verify that you're using the right tool for the purpose. These mitigations should be modular: you should be able to write them once and then reuse them in different conditions, with different strategies, in different environments. You should be able to share them between users of similar components that face similar challenges. And just like any software, they need to undergo a software development lifecycle: you need to version them, review them, test them, and so on. Thinking about the right way to develop these flows is very important. Secondly, you also need a platform that will trigger, or in a broader sense orchestrate, these generic mitigations, monitor how successfully they are applied, involve humans in the loop if required, and so on. It is very important that the infrastructure for orchestrating these mitigations is separate from your production infrastructure; otherwise, when that production infrastructure is impacted, you will not be able to use it to mitigate the impact. This gives rise to things such as the observer cluster pattern, which may be a good discussion for another time.
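To ground those first two requirements, here is a minimal sketch, under my own assumptions, of how a generic mitigation could be modeled as a modular, versioned flow that a separate orchestrator executes, with an optional human approval gate. The step names and the structure are illustrative and do not reflect any specific product's API.

```python
# Illustrative sketch only: a generic mitigation expressed as a modular,
# versioned flow that an orchestrator (running outside the production
# infrastructure) could execute, with an optional human approval gate.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    name: str
    action: Callable[[], bool]      # returns True on success
    needs_approval: bool = False    # pause for a human before running

@dataclass
class MitigationFlow:
    name: str
    version: str
    steps: List[Step] = field(default_factory=list)

    def run(self, approve: Callable[[str], bool]) -> bool:
        for step in self.steps:
            if step.needs_approval and not approve(step.name):
                print(f"[{self.name}] '{step.name}' rejected by operator")
                return False
            ok = step.action()
            print(f"[{self.name}] step '{step.name}' -> {'ok' if ok else 'failed'}")
            if not ok:
                return False
        return True

# Hypothetical usage: quarantine a noisy instance, then rebalance traffic.
flow = MitigationFlow(
    name="quarantine-noisy-instance",
    version="1.2.0",
    steps=[
        Step("remove-from-rotation", lambda: True),
        Step("rebalance-traffic", lambda: True, needs_approval=True),
    ],
)
flow.run(approve=lambda step: True)  # a real flow would page a human here
```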
The third thing you need is visibility. Every time these mitigations are invoked, it is of the utmost importance to be able to analyze what exactly happened, how successful their application was, and so on, and to collect and process the data for learning. So if you're seriously considering going into generic mitigations, these are the main areas I would recommend looking into.
Similarly, when talking about chaos engineering experiments: what do I need to perform them successfully and run them efficiently in my environment? Very similar things. I need a platform for injecting chaos variance into the various layers of my architecture; a good set of experiments should target different layers, at the infrastructure level, the network level, the user simulation level, because these are the different directions from which the variance in real production will come, and my experiments should be as close to that as possible. Secondly, I need a platform for conducting experiments in a responsible manner in my environments, something that comes with a lot of guardrails and an ability to contain, maybe to compartmentalize, the experiment. Data collection and learning are extremely important, as is the ability to stop the experiment at any given point where things are having too much of an impact, if containment has failed. Picking the right platform here is, again, critical to success and to getting a return on your investment in chaos engineering.
As a final afterthought, if what I've been telling you makes sense and you are thinking of applying the combination of these two methodologies, I'd like to leave you with this: how would a platform, or a methodology, for continuously applying that combination look? Just a thought.
To sum up, thank you so much for listening in. I hope that what we've seen here makes sense; maybe you've picked up certain ideas, maybe you'll be able to implement some of them in your environment. For further reading, there is a growing amount of information about generic mitigations, and a fair amount of information about chaos engineering and various ways to apply it. I'm Leonid from StackPulse. Thank you so much for tuning in.

Leonid Belkind

CTO @ StackPulse



