Conf42 Chaos Engineering 2022 - Online

Self Healing Systems

SaaS systems enable all kinds of cool new ways of designing and building software applications. One way this is manifested is through the notion of self-healing systems, that is, systems that monitor and fix themselves.

Through the use of intelligent error monitoring and feature flags, we can start to see applications that can do just that.


  • Nick Hodges is the developer advocate at Rollbar. He talks about building self healing systems. Rollbar's mission is to help developers build software quickly and painlessly. Its values are honesty, transparency, pragmatism, and dependability.
  • A self healing system is one that can recognize that it's not working right and fix itself without any external assistance. With SaaS software, you have control over the entire application. There are three requirements for a system to be self healing.
  • For a self healing system to work, it needs the ability to deploy to a cattle based system. The ability to accurately track and report errors is critical to self healing systems. Finally, you need to be able to predict the future and take actions based upon your knowledge of the future.
  • Errors are everywhere. They occur in every application. Sometimes they're a problem, sometimes they're not a big deal. A better metaphor for a self healing system is the notion of a circuit breaker. This allows you to respond quickly and algorithmically to the particular problem.
  • Rollbar allows you to track errors by grouping them and seeing them as they occur in very specific patterns. Instead of limiting change, we fix the bugs that matter the most first, and achieve fast iterations with automated releases. This is the method and the means by which automated systems are possible.


This transcript was autogenerated. To make changes, submit a PR.
Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again.

Good morning, good afternoon, good evening, wherever you are in the world. My name is Nick Hodges. I'm the developer advocate at Rollbar, and I'm here today to talk to you about building self healing systems and how they work, all that fun stuff about automating the process of managing your application.

Before we get rolling, I'll talk a little bit about myself. As I said, I'm a developer advocate. I preach the good news about Rollbar throughout the developer community. I'm a longtime developer and development manager. I'm a Delphi guy from way back. Anybody out there remember Delphi? I've written a couple of books on Delphi. I did Turbo Pascal to start out, and lately I've been working on Angular and TypeScript, a really cool language and platform for developing web applications. I'm a former naval intelligence officer; I did that for twelve years. I will tell you that it is about 142 times more boring than they show it on TV and in the movies. And finally, I'm obsessed with pistachio nuts. If you come over to our house on a Sunday afternoon to watch NFL football, don't get between me and the pistachio nut bowl.

Before we get started, I just want to talk a little bit about Rollbar's mission and values. Our mission is to help developers build software quickly and painlessly so they can get back to building cool new features instead of spending a lot of time fixing bugs and whatnot. We like to think about ways to make developers more productive and to make it as fun and interesting as possible. Our values are honesty, transparency, pragmatism, and dependability. We like to show those in everything we do, from the sales process to presentations like this. So if you see us not following those values, hold us to it.
So the topic for the day, of course, is self healing systems. Self healing systems is kind of a self explanatory term, but I scoured around the Internet and this is the best definition I found. It's the idea that a system can recognize that it's not working right and fix itself without any external assistance. So this could mean something like turning off a feature flag or rolling back to a previous release, or somehow, some way, getting itself into a working state.

Self healing systems are something relatively new, mainly because at this point, with SaaS software, we have control over the entire application. In the old days you used to have to distribute executables out to your customers. And of course everybody's system was different. Everybody had their own individual application that they were running. And sometimes a bug would appear on that system out there and there was no way for it to self heal, because you would have to reproduce the customer's environment as best you could in order to reproduce the bug. Self healing systems are possible because SaaS systems are internal to our environment. They're built inside our own cloud. The client application runs on the web and you have complete control over that client application. Now it's still true that you deploy mobile applications and they will suffer the same problem with versioning and bugs, but at least with your web system and your back end, you have complete control over the system, enabling it to self heal.

Before we talk about self healing systems specifically, we can talk about the levels of self healing. In other words, what are the three basic levels of ways that systems get repaired? One, of course, is the manual change. A bug is reported to you by a customer.
You open up the product in the debugger, in the old school way, you find the bug, you fix it, and you return a fixed version to the customer. That happens entirely manually. There's no automation going on. Level two, if you will, is some automation plus manual steps. For instance, you may have a system that monitors your website to let you know when there's an error, but it doesn't really point you in the direction of where that error is. You have to fix the error yourself and deploy the system manually, or deploy it through your CI/CD system. Level three is fully automated systems, and those are self healing systems. That is a system that is able to detect errors and resolve them, not in the sense of fixing the bug, but in the sense of dealing with it in such a way that the system is returned to equilibrium so that it works properly. That's a fully automated system, and self healing is the term we use to describe level three.

So what are the requirements for a system to be self healing? I've talked a little bit about it already. One is that it has to always be up. That is, the application, the front end, has to be available and working properly at all times. You don't want a system to be down for maintenance, because then it really isn't self healing. If a change to the deployment process, or a change to the way in which the application is run, requires the application to be brought down, then it can't really be self healing. In addition, a self healing system needs monitoring. Without monitoring, you can't self heal because you don't even know that there's a problem. You need to see that problem in order to be able to fix it. And obviously, for a self healing system to work, it needs to have a restoration mechanism. It needs to have a way to repair itself or get itself back into a known working state.
These three things are what a self healing system needs in order to actually function as a self healing system.

So now we can talk about the characteristics of a self healing system. First, we need the ability to deploy to any server, not just the specific server that you're running and managing yourself, but any server. In other words, we need to deploy to a cloud environment. So your servers need to be cattle, not pets. If you're not familiar with the terms cattle and pets, they refer to the ways servers are handled. If you have a pet server, that's a server you own: you take care of it, you feed it, you manage it, you give it a name, all that kind of stuff. It's a pet. Cattle servers, by contrast, are just a herd of servers where you don't care so much about any individual member, but about the herd as a whole. You deploy your application to cattle when you deploy it to a cloud environment that has many, many systems running your application instead of just one specific pet server. So for a self healing system to work, it needs the ability to deploy to a cattle based system.

You also need the ability to deploy without downtime. This is normally done through blue green deployments, where, in the simplest form, you have a blue machine and a green machine. The blue machine is running while the green machine is updated. Then the load balancer, or whatever type of system you're using to manage your traffic flow, switches from blue to green in one instant, and no downtime occurs as a result. Then the blue machine can be updated to match the green machine, and you can turn both of those machines on. The load balancer knows about everything that's up and automatically feeds traffic to working systems.

Third, we need the ability to monitor the hardware needs and adjust dynamically. This is where cattle versus pets comes in, right?
If you monitor the hardware needs and see that your hardware is being stressed, say your CPUs are overloaded or there's too much network traffic going through any specific machine or number of machines, you can spin up new machines, new systems, in order to dynamically adjust and respond as the demand for your application increases.

Next, you need the ability to monitor services in real time and react to failure, i.e., the chaos engineering portion. You need to be able to track the services that your application requires, and if any of them go down, you need to be able to react to the failure of any one given system. You need to be able to respond correctly if one of your services suddenly fails. In order to be self healing, you need to be able to deal with the failure of any of your given microsystems.

And finally, you need the ability to predict the future and take actions based upon your knowledge of the future. Of course, we don't really know the future, but I'm talking about things like knowing that every Monday morning at 9:00 your system demand increases drastically. You know that. And so you can schedule the increase of your cattle. You can buy more cattle, i.e., more servers, and spin them up at any particular time when you know that time is going to require more processing power because of the demand on your application.

So what is the common thread between the problem and the solution? Error tracking. That's what's required in order to both create a self healing system and to solve the problem of an error that happens. In order to know that you need to self heal, you need to know that there's something to heal from. You need to know that there's an error. The ability to accurately track and report errors is critical to self healing systems. Errors are part of the problem, because errors create the need for self healing systems.
And they're part of the solution, because by knowing and understanding your errors, you can properly respond to them, know whether the system needs to react to any particular error, and decide how it might react to different kinds of errors using automation.

Now, errors are everywhere. They occur in every application. Sometimes they're a problem, sometimes they're not a big deal. In a SaaS application, errors probably occur almost constantly. There may be errors that are not a big deal, and you don't need to worry about them. Someday you'll take care of them. Maybe you don't even need to take care of them. Maybe there's an error that pops up that is inconsequential and doesn't have any effect on your application at all. Maybe there's an error that's critical and needs to be dealt with immediately. But in any event, errors are everywhere. So let's talk a little bit about the nature of errors and the nature of error tracking, because obviously that's the critical portion of what we're talking about with self healing systems.

So here's an example of a relatively crude and imperfect error detection system. When I was a kid, we had Christmas lights that looked like this instead of the little teeny ones that we have now. They were connected, I believe, in what you'd call series, so that each bulb's ability to light up depended on the bulb before it. If any one bulb on the string failed, the entire string failed to light up because the electricity couldn't flow through every single bulb. This was a very crude way to detect an error. You would know that there was an error in one of the bulbs by the fact that the light string didn't work, but you wouldn't know which bulb. So you'd literally have to get a known good bulb and swap it in for each bulb in the string, one at a time, in order to find the one that was bad. Sometimes more than one bulb was bad, so the string still wouldn't light up when you replaced one of them with a good bulb.
And so this is a very crude and difficult way to detect errors. And this might be analogous to the ability of you to detect errors on a remote machine. For instance, customers machine, the classic works on my machine, but it doesn't work on the customer's machine problem. And those are gone, and those issues are mostly gone today with the advent of SaaS applications. But to a large degree, this was a way that errors were detected in software systems. Now, here's another system that reports a problem, and it tells you that your car is overheating. It doesn't tell you why, but it does at least tell you that your car is overheating. Overheating is a symptom of any number of different issues with your car, but it is a slightly better way of reporting can error in that you know your car is overheating, and you know that there's a problem, so you can turn the car off, fix the problem. But again, this doesn't work for a self healing system. Probably a better metaphor for a self healing system is the notion of a circuit breaker. We all understand circuit breakers. We have them in our homes. A particular circuit on your house, electrical system becomes overloaded. The amperage is too high because you've plugged in 700 different items into it, and the circuit breaker blows. Well, this is a very effective technique for error reporting, because specifically, first it solves the problem of the error. It prevents a house fire, for instance, whereas if the system did not have a circuit breaker or a fuse, if you will, then the electricity in your house might overload in that particular circuit and start a fire, which is very bad. So the circuit breaker prevents a fire. It also tells you exactly which circuit the error is occurring on and allows you to go and make changes to the system that you've got plugged in to that particular circuit. 
So circuit breakers are a good symbol for self healing systems, because an error reported by some type of circuit breaker mechanism, a critical error that is fired, allows you to respond quickly and algorithmically to the particular problem. For instance, with a SaaS application, the firing of a critical error, i.e., a circuit breaker snapping, may result in you rolling back to a previous version, or in you turning off the feature flag of the feature that the error occurred in. So the circuit breaker is actually a pretty good error reporting mechanism, allowing you to deal with errors in an automated way.

All too often, this is kind of what happens: everything's going along great, and all of a sudden, boom, you've got a problem. Response times get much worse, there's something wrong, but you don't know what to do. You have no idea why that particular thing is happening. So you need an error reporting system that gives you clean, clear information about why something like your transaction time here has suddenly jumped from a very acceptable level to levels that are not at all acceptable. That type of error reporting is critical to building a self healing system.

So this is the problem that we face, right? We have this sudden influx of errors. We have this sudden problem that we notice, and we don't necessarily know why it's occurring. Traditional monitoring, i.e., log files and different types of monitoring systems, doesn't necessarily give you what you need to understand and respond, because there's a huge amount of data coming in, and that data needs to be processed in a way that human beings aren't very good at. Nobody has ever enjoyed spending time poring over log files, looking for patterns or trying to figure out exactly where the error occurred, why it occurred, and what was going on when it did.
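The circuit breaker metaphor from the talk can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FeatureBreaker`, the threshold parameter, and the level strings are all hypothetical, and the sketch does not use any Rollbar API. It counts critical errors for one feature and switches the feature off once a threshold is reached.

```python
from dataclasses import dataclass


@dataclass
class FeatureBreaker:
    """Hypothetical breaker: disables a feature after too many critical errors."""
    feature: str
    max_critical_errors: int = 3
    critical_errors: int = 0
    enabled: bool = True

    def record_error(self, level: str) -> None:
        # Only critical errors count toward tripping the breaker.
        if level != "critical":
            return
        self.critical_errors += 1
        if self.critical_errors >= self.max_critical_errors:
            self.enabled = False  # the breaker "snaps": the feature is turned off

    def reset(self) -> None:
        # A human re-enables the feature once the underlying bug is fixed.
        self.critical_errors = 0
        self.enabled = True


breaker = FeatureBreaker(feature="new_checkout", max_critical_errors=2)
for level in ["warning", "critical", "info", "critical"]:
    breaker.record_error(level)
print(breaker.enabled)  # False
```

Like a household breaker, this one both contains the damage (the feature is off) and tells you where the problem is (which feature tripped), and it stays off until someone deliberately resets it.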
So we need a better way to see where we're headed when we want to find out where an error occurred, why it occurred, exactly what the problem is, and where we can go to fix it. You can't automate the response for every error, but you can automate the response for some errors, if you have that level of intelligence and knowledge and trust in the way a specific error is tracked, managed, and reported to you. If you trust that the error is truly critical, and that it does require a feature flag being turned off, then you can do that automatically, which is very powerful and keeps people from being woken up in the middle of the night, which is a very good thing.

Now, one thing that factors into all of this is that a developer's job has changed over the years. It used to be that our process limited change. We did waterfall. We manufactured the software and actually put it in a box, and it had to be perfect before release. When we were done was the point when we said, okay, we're going to ship this thing. And shipping actually involved burning it onto a CD and mailing that CD, the box, and the manual out to customers, who then opened the box, pulled out the CD, and loaded it onto their individual machines one at a time. So perfection was sought there through compiled languages, unit testing, and strong QA, and the desire to fix every single known bug before shipping was maintained.

But now we're much more agile. We do research into what the issue is and we fix it on demand. We can fix a bug immediately with a SaaS system. We improve every release, every release is slightly better, and we're never done, because the application is constantly iterating, on a minute by minute basis, really. A company like Amazon will deploy thousands of times a day to improve the quality and features of their system. The process actually encourages change instead of limiting it. We fix the bugs that matter the most first, and we achieve fast iterations with automated releases and automated feedback. This is the method and the means by which automated systems are possible.

So these new types of systems have error reporting that's very different from previous methods. The errors come in with different types, different categories, different pieces of information, and they're reported to us in a very generic way. They may have colors to them, for example, when they happen, but they come into a log file in a very colorless way. What you really want is for those errors to be grouped together and classified in such a way that you can tell whether they're new, reactivated, resolved, critical, not so critical, information only, whatever. So in order to respond in a self healing manner, you need to know which errors are critical versus which are just informational. You need to group and organize and manage them. That's something that's very difficult for humans to do as they look through log files, but it's something that's very easy for a system like Rollbar to do using machine learning. It's a very powerful capability to give each error a fingerprint and group errors according to those fingerprints.

So this is a real basic look at the way Rollbar works and allows for self healing systems. Real straightforward. First, it sees errors. It allows you to track errors by grouping them and seeing them as they occur in very specific patterns. So if there are ten different errors coming in to your system, Rollbar can see the distinction between all ten of them and can report on the frequency and characteristics of those errors.
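To make the idea of fingerprinting concrete, here is a deliberately simple sketch. It is not Rollbar's algorithm, which the talk describes as machine learning based; it is a stand-in that assumes a fingerprint can be approximated by hashing the exception type together with the message after replacing numbers with a placeholder. The function names and sample errors are made up for illustration.

```python
import hashlib
import re
from collections import Counter


def fingerprint(exc_type: str, message: str) -> str:
    # Replace digit runs so "Service 12" and "Service 7" land in the same group.
    normalized = re.sub(r"\d+", "<n>", message.lower())
    return hashlib.sha1(f"{exc_type}:{normalized}".encode()).hexdigest()[:12]


def group_errors(errors):
    """Count occurrences per fingerprint for (exc_type, message) pairs."""
    return Counter(fingerprint(t, m) for t, m in errors)


errors = [
    ("TimeoutError", "Service 12 timed out after 30s"),
    ("TimeoutError", "Service 7 timed out after 45s"),
    ("KeyError", "missing key 'price'"),
]
groups = group_errors(errors)
print(len(groups))  # 2
```

The two timeouts collapse into one group with a count of two, while the `KeyError` stays separate. That per-group frequency is the kind of signal an automated response can act on, which a flat log file does not give you directly.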
So if you have an exception that you've marked as a critical exception, one that you know you need to respond to immediately, Rollbar can identify that type of error for you and allow a hook to take place, a webhook or some other type of hook, allowing you to respond to that particular type of error because of that grouping and matching of errors and the fingerprint that each error gets. So say Rollbar determines that your error is a critical error. There's a service that's down, and the reason you're not able to access that service is an exception saying that the service isn't getting what it expects. Some change in your most recent deployment has caused a service to not respond correctly, because you're not asking correctly for information. That change in your system has caused an error, and you know that perhaps the way to deal with that is to send an email, a Slack message, or something in PagerDuty to the developer in question, and roll back to the previous known working state, thus healing the system and reporting back that there was a problem.

This is a great example of saving somebody a phone call from support in the middle of the night because of an error. Back in the olden days, say five or ten years ago, one of the things you would have to do when an error occurred was call somebody to get them to fix it, particularly if it was a critical error where customers were, say, losing money or weren't able to spend money, which of course would be a big problem for your business. But because you can now, through tools like Rollbar, identify and recognize specific errors as critical, you can have the self healing system return things to a known good state, and then have somebody just come in in the morning and deal with the problem.
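The hook-driven response described above might look roughly like the following. Everything here is an assumption made for illustration: the `event` dictionary is a hypothetical shape and not a real Rollbar webhook payload, and the rollback and notification steps are recorded as strings rather than calling a deployment system or a messaging service.

```python
def handle_error_event(event, releases, flags):
    """Decide how to heal after an error event; returns the actions taken.

    `event` is a hypothetical dict, not a real Rollbar payload.
    `releases` is a list of deployed versions, newest last.
    `flags` maps feature-flag names to on/off booleans.
    """
    actions = []
    if event.get("level") != "critical":
        return actions  # non-critical errors can wait for a human

    feature = event.get("feature")
    if feature in flags and flags[feature]:
        flags[feature] = False  # cheapest fix: switch the feature off
        actions.append(f"disabled flag {feature}")
    elif len(releases) > 1:
        bad = releases.pop()  # drop the newest release
        actions.append(f"rolled back {bad} -> {releases[-1]}")

    # Stand-in for an email, Slack message, or PagerDuty alert.
    actions.append("notified on-call developer")
    return actions


releases = ["v1.4.0", "v1.5.0"]
flags = {"new_checkout": True}
print(handle_error_event({"level": "critical", "feature": "search"}, releases, flags))
# ['rolled back v1.5.0 -> v1.4.0', 'notified on-call developer']
```

The ordering is a design choice in this sketch: turning off a flag is tried first because it affects only one feature, and a full rollback is the fallback when no flag guards the failing code.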
So finally, yes, Rollbar can automatically roll back a release. It can automatically turn off a feature flag, and it can automatically restore an environment that you know was working correctly before the deploy happened. Now, you might wonder, why would you not know about a problem right away? You might. You might know immediately that a problem occurs with a new feature or a new release, but you might not. Maybe the problem occurs because of a time zone or a character set that you haven't tested with. Maybe that happened in a time zone halfway around the world, in Japan as seen from the United States, or vice versa. And maybe the error occurs simply because a customer does something in a specific way that no one has ever thought of. Customers are known for being chaotic, chaotic neutrals probably would be my guess, and there might even be some chaotic evils out there. But in any event, you want to be able to deal with those errors and roll back to a known good state, whatever that means, whether it be a feature flag or a previous release, in order to properly respond and heal the system without human intervention and without having to call somebody at 2:00 in the morning.

So, in summary, self healing systems are possible. They're only going to get more capable as we move forward. Tracking and learning about errors is the critical part of a self healing system. Knowing what the nature of the problem is, how often it's occurring, and where it's occurring is all very, very critical. And Rollbar is the system that lets you self heal. So I'd like to thank you for your attention today. Please feel free to reach out to me. You can contact me on my contact page at try nick. And thank you for your attendance and your attention. I'm happy to answer any questions.

Nick Hodges

Developer Advocate @ Rollbar

