Conf42 DevOps 2024 - Online

Architecting for Resilience: Strategies for Fault-Tolerant Systems

Abstract

Resilience expert guiding you to fault-tolerant systems. Boost your stability and success!

Summary

  • Maria: Today I'm going to talk about architecting for resilience: strategies for fault-tolerant systems. It's not enough to design the system; you also need to maintain it. The talk touches on both aspects.
  • Do we really need a fault-tolerant system? In an ideal world we would have an unlimited amount of resources; in the real world we need to prioritize which types of failures to take care of first. List all the system features and prioritize them against each other.
  • The first tool is monitoring: a fault-tolerant system needs automatic detection. Next is testing, with a test environment as identical to production as possible. Then chaos engineering. Invest in your tools and invest in automation.
  • Make sure the system is modular and distributed. Also keep dependency management and single points of failure in mind. Another concept is graceful degradation and failover: when some parts of the system are failing, users are still able to use it.
  • The next concept is diversity. Running different systems, different technologies, and different clouds is usually a problem because it means more maintenance, but for fault tolerance it can be beneficial. Redundancy is duplication of part of the system to withstand problems.
  • Sometimes we need to think about very low-probability problems. On 13 August 2015, four successive lightning strikes hit a Google data center. Twitter and Instagram both had problems on July 14, 2022. The probability of this is pretty low, but it happened.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello! My name is Maria, and today I'm going to talk about architecting for resilience: strategies for fault-tolerant systems. Before we go into details, it's important to mention that it's not enough to design the system; you also need to maintain it, and to have proper tools to maintain it properly. Otherwise, even with the best possible system design, your system still might fail. So in my talk today I will touch on both of those aspects.

Let's go into detail then. First, let's review the types of failures we might face while designing and operating fault-tolerant systems. The first and most obvious one is hardware: a hard drive fails, or something like that. Second is network: our servers are fine, but the connections between them, or to them, are not. Another is human error, and this happens quite often. There are also natural disasters such as tsunamis, floods, or lightning; they might look like low-probability events, but depending on our requirements we might need to take care of them too. And the last one is security breaches.

Okay, so we've defined the types of failures, and the next thing I suggest thinking about is requirements. Here we need to ask ourselves a question and answer it honestly: do we really want to protect ourselves from every possible failure? Do we really need a fault-tolerant system? In an ideal world, with an unlimited amount of resources, time, and engineers to work on our system, we could design a truly perfect system, but in the real world those things are limited. So we will need to prioritize which types of failures we take care of first, decide what we want from our system, and ask whether it is really required and really worth it from a business perspective.

One exercise I suggest in the context of system requirements is to list all the system's features and prioritize them against each other. Let me give you an example. Say we have an online food-delivery shop. If our customers can't check out, it's probably a big problem: people can't order food, and the business just stops. But if search is not working, or if search suggestions are not working, is that the same severity of problem as a checkout outage? Ranking features like this helps prioritize the effort and understand which parts of the system we need to duplicate and which we don't, because, as I said, we don't have unlimited resources, and defining our requirements is important here.

Okay, now let's talk a bit about tools. The first thing is monitoring. Monitoring is important because without it you can't tell whether your system is actually working or not. And you don't only want monitoring; you want automatic detection on top of it. That means you will need to set up thresholds, deal with false positives, and so on. It does take time, but if we want fault-tolerant systems, we need automatic detection, because we can't sit watching dashboards all day and night.
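As an illustration of that last point, here is a minimal sketch of threshold-based automatic detection in Python. Everything in it (the metric, the threshold, the "three consecutive breaches" rule for damping false positives) is an assumption for the example, not a recommendation:

    # Alert only after several consecutive breaches, to damp one-off
    # spikes that would otherwise cause false-positive pages.
    from collections import deque

    class ThresholdDetector:
        def __init__(self, threshold: float, consecutive: int = 3):
            self.threshold = threshold
            self.breaches = deque(maxlen=consecutive)

        def observe(self, value: float) -> bool:
            """Feed one metric sample; return True when an alert should fire."""
            self.breaches.append(value > self.threshold)
            return len(self.breaches) == self.breaches.maxlen and all(self.breaches)

    detector = ThresholdDetector(threshold=500.0)  # e.g. p99 latency in ms
    for sample in [420, 510, 530, 550, 480]:
        if detector.observe(sample):
            print(f"ALERT: latency {sample}ms over threshold 3 samples in a row")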
The next thing is testing. Testing is important; everybody understands this, but I think it's worth repeating. Looking at it from the fault-tolerance perspective, there are several things I'd like to mention. First is the testing environment. We need a testing environment as identical to production as possible, like a small copy of the cluster, because we don't want to touch production, but we want to run all our tests in conditions as close to production as we can. I would say: invest in your test environment.

Another thing is stress testing. For fault tolerance it is important too. For example, if we are talking about an online shop again, and about Black Friday, we need to make sure our system will withstand all the load, and stress testing is what can identify the bottlenecks.

And another thing I found rarely mentioned, but worth mentioning: test your tools. Test that your monitoring system is actually monitoring; test that your recovery script actually works. Your tools are what you will rely on, so invest in them.

That brings us to the next point about tools, and I would say here: invest in your tools and invest in automation. Automate as many things as possible, and think about auto-recovery. If, say, a database is corrupted and you need to switch to the backup: can that be automated? How can it be automated? Invest in this, and invest in runbooks and documentation. Documentation is important in general, but runbooks matter most when you have a disaster going on and need to fix the system ASAP: you don't want to spend time looking for some specific command, or trying to recall how many parameters a particular function takes. And if a new engineer joined recently, they might not know all the details; when they need to act, a runbook is helpful.

The next thing is chaos engineering. Chaos engineering is the practice of injecting errors and failures into the system to see how it recovers. All the big companies do this, and they do it for one reason: it is useful. It helps identify problems that were not obvious before and fix them. And because chaos engineering is a controlled process, if you do something that causes your system to fail, you can roll it back. Let me give you an example: if you have a system distributed among three data centers but designed to keep working with two, a good test is to disconnect one data center and check whether the system still works.

And the next thing is incremental rollout. This protects us when we have a problem in a software build. The idea is to roll out to a small percentage of users, wait for some time depending on your system and on feedback, and if all the metrics are okay, roll out to more users. The rollout takes longer this way, but if there is a critical bug in the build, it will be detected on a small number of users and the system will not go down fully.
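Before moving on to system design, here are a few minimal sketches of the tooling ideas above. First, stress testing: a bare-bones load generator that fires concurrent requests and reports latency percentiles. The URL and the worker and request counts are placeholder assumptions; a real test would use a proper load-testing tool:

    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URL = "http://localhost:8080/checkout"  # hypothetical endpoint under test

    def one_request(_):
        # Time a single request; count any exception as an error.
        start = time.monotonic()
        try:
            urlopen(URL, timeout=5).read()
            ok = True
        except Exception:
            ok = False
        return ok, time.monotonic() - start

    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(one_request, range(500)))

    latencies = sorted(t for ok, t in results if ok)
    errors = sum(1 for ok, _ in results if not ok)
    if latencies:
        p50 = statistics.median(latencies)
        p99 = latencies[max(0, int(len(latencies) * 0.99) - 1)]
        print(f"p50={p50 * 1000:.0f}ms  p99={p99 * 1000:.0f}ms  errors={errors}")
    else:
        print(f"all {errors} requests failed")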
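Next, the chaos-engineering example of disconnecting one data center out of three. This is a toy in-process version of that experiment; real chaos tooling would inject the failure at the network or infrastructure level:

    import random

    class Replica:
        """Stand-in for the service running in one data center."""
        def __init__(self, name: str):
            self.name, self.up = name, True

        def read(self, key: str) -> str:
            if not self.up:
                raise ConnectionError(self.name)
            return f"value-of-{key}"

    replicas = [Replica("dc1"), Replica("dc2"), Replica("dc3")]

    def read_with_failover(key: str) -> str:
        # Try each data center in turn, skipping the unreachable one.
        for replica in replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue
        raise RuntimeError("all replicas down")

    victim = random.choice(replicas)
    victim.up = False                      # inject the failure
    assert read_with_failover("cart:42")   # must survive the loss of one DC
    victim.up = True                       # controlled process: roll it back
    print(f"survived loss of {victim.name}")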
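And for incremental rollout, a sketch of stable percentage bucketing: a hash of the user id keeps each user in the same bucket while the percentage is ramped up. The hash scheme and ramp steps are arbitrary assumptions for illustration:

    import hashlib

    def in_rollout(user_id: str, percent: int) -> bool:
        # A stable hash puts each user in a fixed bucket 0-99, so raising
        # the percentage only ever adds users, never flip-flops them.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < percent

    # Ramp 1% -> 10% -> 50% -> 100%, pausing at each step to watch metrics.
    for percent in (1, 10, 50, 100):
        exposed = sum(in_rollout(f"user{i}", percent) for i in range(10_000))
        print(f"{percent:>3}% target -> {exposed / 100:.1f}% of users exposed")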
Okay, so we've talked about tools and processes; now let's talk about system design. Here are some concepts to keep in mind when designing a fault-tolerant system. First of all, the architecture: make sure it is modular and distributed; for example, if we are talking about different data centers, the system should be designed for that specifically. And make sure it is scalable: yes, our favorite Black Friday example.

Another thing to keep in mind is dependency management and single points of failure. Here I would suggest an exercise: after you finish the draft design for your system, look at it and ask yourself what happens if each of the components goes down. What happens if the database is down? What happens if this part is down, or that process? The answer should be that it would be okay. If the answer is that the whole system goes down, you probably need to think of some possible way of recovery or duplication there. So make sure all the dependencies of your system are taken care of, and I'm not only talking about internal dependencies but also external ones. For example, if you're using some external API, think about what happens if that API starts returning errors: will your system still work or not? Try to design your system in a way that it won't fail.

Another concept is graceful degradation and failover. Failover is when some process is down and a duplicate of this process takes over and serves the requests. Graceful degradation I find quite an interesting concept: it's when some parts of your system are failing but users are still able to use it; they just might have a reduced user experience. For example, if your system is running out of memory and a request is too heavy, you might have the system return a simplified version of the response at that point. Maybe the results won't be as nice, but the system will still work. One thing to pay attention to here is adaptive UI when your system is partially working. For example, in an online shop where each item has a "find similar" button showing similar items: if this feature is not working, consider what you want to do with that button. Do you want to gray it out? Do you want to hide it? What do you want to show to your users when some part of the system is not working? (There's a small sketch of this fallback pattern below.)

The next concept is diversity. Usually, running different systems, different technologies, different clouds is a problem because it means more maintenance, but if we are talking about fault tolerance, it might be beneficial. Let me give you an example. Say we have a company with internal resources and an internal chat, and for some reason the internal resources go down. What happens? Employees would not be able to communicate with each other to actually recover the system. In this case it is important to have a backup channel, so that employees can communicate, prepare a recovery plan, and restore the system. Another example: if you're running your servers in some cloud, maybe you want to duplicate something in another cloud, and I mean not only another data center. Yes, it will give you additional overhead, but it will protect you from the edge case where this specific cloud goes down.

And here is redundancy, probably the main topic for fault-tolerant systems. Redundancy is duplication of a part of the system so it can withstand problems. Let's see what kinds of redundancy we might have. There is hardware redundancy: the well-known RAID, or hot standby, where we have a second server totally identical to the first, and if the first goes down, the second takes over all the work. In software, we can have process replication: again, if a process goes down, another one serves the requests. There is also N-version programming: a critical system implemented in several independent ways, using different programming languages or the like, where we compare the results of all the implementations; if they are not identical, it means there is an error somewhere. Next is the network: backup network paths, so if the main path goes down, we switch to the backup. For data, mirroring is when we have a second database identical to the first, with all the data in sync, and if the first goes down, we switch to the second; and backups, of course, are always useful. And there is geographical redundancy: multi-data-center deployment, or maybe you even want to deploy in different countries; it's up to you. Remember the tsunami.
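Here is the small fallback sketch promised above: if an external dependency fails or times out, the page serves a cheap precomputed result and flags the degraded mode so the UI can adapt. The recommendations call and the best-sellers list are invented stand-ins:

    def fetch_recommendations(item_id: str) -> list:
        # Stand-in for a call to an external recommendations API.
        raise TimeoutError("recommendation service unreachable")

    BEST_SELLERS = ["pizza", "sushi", "pasta"]  # cheap precomputed fallback

    def similar_items(item_id: str) -> dict:
        try:
            return {"items": fetch_recommendations(item_id), "degraded": False}
        except (TimeoutError, ConnectionError):
            # Degraded mode: the page still renders; the UI can gray out
            # or relabel the "find similar" widget based on this flag.
            return {"items": BEST_SELLERS, "degraded": True}

    print(similar_items("pizza-margherita"))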
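And a miniature of the N-version idea from the redundancy list above: two independently written implementations of the same critical computation, cross-checked before the result is used. The functions are toy examples:

    import math

    def total_v1(prices: list) -> float:
        # First implementation: plain accumulation.
        result = 0.0
        for price in prices:
            result += price
        return round(result, 2)

    def total_v2(prices: list) -> float:
        # Independent implementation: exact floating-point summation.
        return round(math.fsum(prices), 2)

    def checked_total(prices: list) -> float:
        a, b = total_v1(prices), total_v2(prices)
        if a != b:
            # A disagreement signals a bug in one version instead of
            # silently serving a wrong answer.
            raise RuntimeError(f"implementations disagree: {a} != {b}")
        return a

    print(checked_total([9.99, 4.50, 12.30]))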
At the end, I want to mention a few cases that really happened and that I found really interesting, because they are the kind of thing where you ask: what's the probability of this? Is it really possible? And actually, yes. When we are talking about fault-tolerant systems, sometimes we need to think about very low-probability problems. The first is Google: on 13 August 2015, four successive lightning strikes hit a Google data center. Another example: Twitter and Instagram both had problems on July 14, 2022, for totally unrelated reasons; two totally independent systems still went down at the same time. So what's the probability of that? It is pretty low, but it happened. And something that happened recently: a lot of item descriptions on Amazon were replaced with "I apologize, but I can't fulfill this request." It happened just a few days ago, because people had automated the descriptions of their items using ChatGPT. External APIs: it is something we talked about. Okay, I will stop here. I hope it was useful for you. Thank you for your time,
...

Maria Rogova

Software Engineer @ Meta



