Conf42 Chaos Engineering 2021 - Online

Normalizing Chaos: A Cloud Architecture for Embracing Failover

Abstract

What if instead of designing cloud architectures where failover is an exceptional case, we embraced failover as a normal part of running a system and failed over all the time? Let’s deep dive into an architecture currently in production doing just that and share lessons learned along the way. This talk will use production examples and real-world experiences to showcase an architecture where failover is the norm instead of something that happens only in exceptional situations.

Early in my career, I envied those who answered calls at 2am to jump in and heroically save mission-critical systems. I saw the late nights as a badge of honor. After participating in my fair share of on-call events, I started to wonder whether we could optimize for events happening at 2pm instead of 2am. This evolved into the thought: what if failover were handled as part of the normal running of the system and not only in exceptional situations?

I’ll dig into an architecture where we are able to artificially inject chaos as part of the normal running of the system, and discuss the trade-offs and where an approach like this makes sense.

Summary

  • Ryan Guest is a software architect at Salesforce. He'll talk about normalizing chaos: a cloud architecture for embracing failover. His email and Twitter handle are included if you'd like to chat.
  • Aims to optimize for getting paged at 2:00 p.m. instead of 2:00 or 3:00 a.m. in the middle of the night. Taking some of the principles from chaos engineering, could we apply them to the architectures we build?
  • Ryan has been at Salesforce for 13 years. Data privacy and security are the two areas that interest him most. Salesforce just completed a complete rearchitecture of its most important infrastructure. The company wanted to interoperate with the latest and greatest, but also with legacy systems.
  • Given the challenges of running a large-scale cloud operation, it's easy for engineers to fall into a firefighter mentality. Always fighting fires to keep things up can hurt innovation and product growth. Ryan shows how Salesforce's architecture evolved to address this.
  • The approach isn't a good fit if your data changes often; you really want data where changes don't happen frequently. Similarly, if you value performance over reliability, this isn't the approach for you. Like all engineering decisions, consider the trade-offs and choose what's best.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. It's great to be with you all here today at Conf42, talking about chaos engineering with some of my friends. Today I'm going to talk about normalizing chaos: a cloud architecture for embracing failover. My name is Ryan Guest. I'm a software architect at Salesforce. If you'd like to chat, my email and Twitter handle are included, and I'd love to follow up after this conference and talk about some of these topics.

So to begin, I'm going to start with a premise, and that is: what if we could optimize for getting paged at 2:00 p.m. instead of 2:00 a.m.? I think it's kind of funny that every seasoned engineer I talk to has a story about getting paged at 2:00 or 3:00 in the morning. It seems like it's almost become a ritual in going from junior engineer to senior engineer: you have to have a situation like that. But I'd like to throw out there that it doesn't have to be this way. As we design systems for reliability and other aspects, one of the things we can look at is whether we could optimize so that pages, or the equivalent system errors, happen during normal business hours and not in the evening or the middle of the night. Taking a look at some of the principles from chaos engineering, could we use some of those and apply them to the architectures we build? That's exactly what we did, and I want to share one reference architecture that we used at Salesforce, pretty successfully, to optimize for failures happening not at off hours in the middle of the night, but during normal business hours, with full cognitive ability and the full team on staff to respond to them and make changes.

Before I go into that, I think it's important to give you a little background on myself. Like I said, my name is Ryan. I'm a software architect and I've been at Salesforce for 13 years. Over those 13 years, I've spent most of my time on the core platform. Data privacy and security are probably the two areas that interest me the most. Salesforce is at an interesting time right now when we talk about infrastructure and reliability, in that we just completed a complete rearchitecture of our most important infrastructure. It goes by the name Hyperforce. With Hyperforce we had a couple of key values that we focused on. One is local data storage, so our customers, or tenants, can know where their data is stored. Next is performance at both B2B and B2C scale, each of which has different requirements. We also want built-in trust; we know that as a cloud provider, trust is key to our success. Then there's backwards compatibility: at Salesforce we've been running cloud systems for 20-plus years, so we wanted to interoperate with the latest and greatest, but we also have legacy systems that are mission critical for our business and we want to keep those running. Another important value of Hyperforce is that we wanted to be multi-substrate: whatever cloud provider we use under the covers to provide infrastructure, we want to be able to migrate easily and let customers choose which one to use. Given those challenges in running a large-scale cloud operation, I think it's easy for engineers to fall into the firefighter mentality.
And like I was talking about earlier, most seasoned engineers have one of those stories, or multiple stories, about getting woken up in the middle of the night by a page or a phone call or a text and having to respond to an issue. We feel good about being able to jump in and save the day. There's a certain badge of honor, like I said, that folks wear, and they can recall the times they've done this: "I was the only one that could solve the problem," or "I had the magic keystrokes to bring things back to normal." So I want to acknowledge that it does feel good. But on the other hand, it can lead to a firefighter mentality, where we spend a lot of our time fighting fires and not enough time asking how we can prevent them in the first place. This is not something you can do forever; you don't want to be constantly fighting fires. It's one of those obvious things that leads to burnout if you don't handle it well. It can also take away from innovation: if you're always trying to keep things up and fighting fires, it can hurt product growth. So you have to have a balance between the two.

At some organizations this is compounded by organizational problems, where we end up rewarding the people that save the day. Say you have two engineers. One engineer builds a system that is reliable by design, so nobody has to SSH into production at 3:00 a.m. and run a magic script because a disk log is overflowing. The other one does exactly that and is always on call, always available. This is a bit facetious, but we need to think about which is better for your career. I think there's a natural tendency for the spot bonuses and the rewards to go to the person fighting the fire, because that work is more visible. But it's also important to recognize the folks who are building maintainable, reliable systems; that work matters too, and long term it's better for the company. So I think we need to balance both. And although it is okay to appreciate folks when they do fight fires, we need to think in a bigger context about what we can do to prevent them in the first place. So I want to show you how our architecture evolved to get there, so we can minimize the number of fires that do happen, and when fires do happen, choose where they go.

When you think about failover, I think most services start like this. With traditional failover, and this is pseudocode, you try to connect to server A, and if that doesn't work, you try to connect to server B. This is circa-2001 failover logic, a very basic example, but you see it all over the place; it's used in production at a bunch of companies and it's pretty straightforward. For a lot of places, for simple architectures, this works really well. But when we start to apply some of those chaos engineering tools, we can see that this architecture can be exploited, and as things get bigger and more complex, it can be hard to keep up with.
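A minimal sketch of that traditional primary/fallback pattern might look like this (Python used for illustration; the server names and timeout are hypothetical, not the actual production code):

```python
import socket

PRIMARY = ("server-a.example.com", 443)    # hypothetical primary
FALLBACK = ("server-b.example.com", 443)   # hypothetical fallback

def connect():
    """Classic failover: only touch server B when server A is already down."""
    try:
        return socket.create_connection(PRIMARY, timeout=2)
    except OSError:
        # This branch only runs once things have already gone wrong --
        # exactly the error-handling path that rarely gets exercised or tested.
        return socket.create_connection(FALLBACK, timeout=2)
```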
Now, one of my all-time favorite distributed systems papers is called "An Analysis of Production Failures in Distributed Data-Intensive Systems." It came out of the University of Toronto a few years ago. What the paper did is dig into popular open source systems, Cassandra, Hadoop, Redis and a few others, and ask: where are failures coming from? They found that the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code. They did the research and found that error handling is where a lot of the big bugs occur. And if we think back to the architecture I was just talking about, we only really get to the error handling scenarios if things have already gone bad in the first place, right? If the happy path, the normal server I connect to, is down, then we start to exercise all this exceptional logic. Their conclusion was that simple testing can prevent most critical failures, which begs the question: why don't we test these systems more? So think about that: how can we test the exceptional cases more? Have we focused on that? As I mentioned, this paper is one of my favorites. They sampled the highest severity bugs from the public bug trackers of these open source systems and did some great research on failure modes. What I also found interesting is that they scanned the code looking for error handling code marked with TODO or FIXME, the comments developers leave when they know it's an exceptional case and they probably need to do more, but they're focused on the happy path. That was a great way to identify the key places where a failure is going to be an issue.

So looking at that research, and thinking about why I always get paged at 2:00 a.m. and not p.m., I started thinking: what if we just failed over all the time, so there is no exceptional case? What if we were in failover mode all the time and just operated like that? You could call it a sort of chaos engineering mode, but really, what if that's our standard? What if we normalized it and said, hey, this is how it works, so that the happy path and the non-happy path don't diverge? Here's a very rudimentary version, more pseudocode: we just randomly pick between two servers. Flip a coin; if it's heads, go to server one, if it's tails, go to server two, and just bounce back and forth all the time. If server one goes down, then we just keep going to server two.
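A minimal sketch of that "fail over all the time" coin flip, using the same hypothetical server names as above:

```python
import random
import socket

SERVERS = [
    ("server-a.example.com", 443),   # hypothetical
    ("server-b.example.com", 443),   # hypothetical
]

def connect():
    """Pick a server at random on every request, so the failover path
    is exercised constantly rather than only during an outage."""
    first, second = random.sample(SERVERS, 2)  # random order, no fixed primary
    try:
        return socket.create_connection(first, timeout=2)
    except OSError:
        # One server being down is no longer an exceptional code path;
        # requests simply keep landing on the healthy one, and the errors
        # can be reviewed during business hours.
        return socket.create_connection(second, timeout=2)
```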
And this works great, because if one server goes down, it's not the end of the world. We can check the logs, or look at alerts or warnings, however you've instrumented things, whatever kind of observability you have, and do that during normal business hours and say: okay, these are the things that went wrong, these are the errors we saw. To the end user, things are still working, but we know we have to fix these things, change this design, or even just bring the server back up. And I would much rather analyze those and think about them in the afternoon than in the middle of the night.

Expanding on this, we can do other things. If our main service and the servers we talk to are all in one cloud provider, we can split them between two different cloud providers to offer more reliability. This type of thinking changes our mindset: we're no longer thinking, okay, this is the main server, this is the failover, this is the primary. We're now thinking there's a pool of servers, and there's no difference between having failed over and the normal happy path. Both the client and the server don't care, and just behave the same either way.

There were also some issues that cropped up, so there are some pragmatic things that we did. We ended up signing requests, so you could say: this is the type of data I'm looking for. You can get into situations, and you have to manage this for your environment, where different services may be out of date or need a certain version, and you want to account for that.

Expanding on our architecture, we eventually went a step further, and let me dig into that. When we talk about failing over everywhere, we realized that 50/50 is kind of a naive approach and we could do better, so we did some probabilistic modeling. Essentially, the difference is that the choice of which service you go to becomes a function, and in that function we can have logic and decide what to do. Maybe 90% of the time you do want to go to the closest server, because in some cases you're limited by the speed of light: how fast can I get around the world or bounce between data centers? So you have a pool of outside servers, and you may want to hit them a smaller percentage of the time. This is just a fictional example, but it's similar to what we're doing right now: we favor the primary pool of servers over secondary servers, but during the course of a normal day of operations, we do send requests to them, and we do ping them and see how things are going.

We came up with a sort of formalized model for this, and what's really important is that all these probabilities add up to one. You can have individual functions for each service and say: 90% of the time I want to go to my local service; this 2% of the time I want to go to a service on another cloud provider, just to make sure that failover case works; maybe this 2% of the time I want to go to a service in another geographic region, if the service is built like that; and this 3% of the time I want to go to another service that might be an on-premise service. As I mentioned earlier, in our future architecture we're expanding to be multi-substrate, and one of the substrates may be a colocated or on-premise substrate, so you have to balance that in as well.
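A sketch of what that weighted-choice function could look like. The tiers and weights here are made up for illustration (chosen so they sum to one), not Salesforce's actual numbers:

```python
import random

# Hypothetical destination tiers and weights; the weights must sum to 1.0.
DESTINATIONS = [
    ("local-service",           0.90),  # closest server, lowest latency
    ("other-cloud-provider",    0.04),  # keeps the cross-cloud failover path warm
    ("other-geographic-region", 0.03),  # keeps the cross-region path warm
    ("on-premise-substrate",    0.03),  # keeps the on-prem/colo path warm
]

def choose_destination() -> str:
    """Weighted random choice: the failover paths get exercised every day
    during normal operations, not just during an outage."""
    names, weights = zip(*DESTINATIONS)
    return random.choices(names, weights=weights, k=1)[0]
```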
Now, this architecture is good for some use cases, but there are major trade-offs, and you're probably already thinking: oh, this is good for this, but not good for that. One thing it's not good for is data that changes often. If you're constantly updating data, if you have a write-heavy database, this is not the failover architecture for that. You really want something where changes don't happen that often, because you don't want to worry about replication or keeping data up to date if you're bouncing around all over the place. Similarly, it isn't a good fit if you have a very chatty protocol. If you're constantly talking back and forth, you don't want to be sending those messages all around the world; you probably want to keep that traffic within a smaller region, and that makes sense. And similarly, if you value performance over reliability, if you'd say, hey, I'd rather have this request happen really quickly or fail, then this isn't the approach for you. This has worked best in our systems where it's the opposite: reliability is the key, and if a request takes ten times as long, that's fine. I'd rather get an answer than no answer at all, and I can wait a little bit. That's key.

So like I said, if your data changes infrequently, if stale data isn't a big deal or you have a versioning system to handle updates, and you really value reliability as one of your key tenets over things like performance, then this is a great architecture to choose. But like all engineering decisions, it's important to consider the trade-offs and look at what's best.

Thanks. I really appreciated talking to y'all. Feel free to hit me up on Twitter or email, or both, if you have questions or would like to talk about this further. This is an architecture that we've evolved here, and I'd love to hear your thoughts and hear from folks doing similar things to really optimize for the chaos mindset, so that failover isn't something that just happens at 2:00 a.m. in exceptional circumstances; it can happen all the time. I would love to hear more about that.
...

Ryan Guest

Software Architect @ Salesforce



