Conf42 Chaos Engineering 2021 - Online

Software won't make a noise when it breaks


Abstract

Systems fail, but the real failures are the ones from which we learn nothing. This talk is a tale of a few such failures that slipped right under our noses, and what we did to prevent them. The topics covered range from heterogeneous systems and unordered events to missing correlations and human errors.

Every time there is a failure, there is a root cause analysis, and there is a vow not to repeat the mistake. I will take some curious shortcomings that I have dealt with in the past decade of my work with infrastructure systems and walk through the steps we had to undertake to:

  1. Isolate
  2. Limit the spread
  3. Prevent it from happening again

Failure 1

An unreplicated Consul configuration results in data loss 25 hours before a countrywide launch. It took a staggering five engineers and 20 hours to find one single line of change.

Failure 2

A failed etcd lock forced us to rewrite the whole storage on Consul, with hours of migration, only to find out later that it was a clock issue.

The above isolation and immediate fixes were painfully long yet doable. The real ambition was to prevent similar incidents from repeating. I will share a sample of some of our RCAs and what was missing in each version. This section touches briefly upon blameless RCA, but the real focus is the actionability of an RCA.

In this section, I will showcase some of the in-house frameworks and technologies (easy to replicate) built to turn the prevention/alert section of RCAs into lines of code rather than blurbs of text. This section aims to advocate the need to build or adopt toolchains that promise early detection, not just faster resolution.

Summary

  • Failures come in flavors: software, human, network, process and culture. A software failure is the easiest to identify. A culture failure, on the other hand, is something that you identify pretty late.
  • Software is a byproduct of the practices and cultures that we follow. If something is failing in one place, there's a very high likelihood that it fails in another place as well. Unlike a product, reliability cannot be improved one feature at a time.
  • A DevOps person had manually altered a security group, accidentally deleting the 443 rule as well. The real root cause here is a culture that allows exceptions where things are edited manually. The real impact was larger, to a degree where we can't even tell how big the outage was.
  • Around 25 hours before a new country launch that we're supposed to do, PagerDuty goes off. The most important question that we ask when something fails is: who changed what? What's the real root cause here?
  • Outage number three is about a distributed lock that was going wrong in production. We were using etcd for distributed lock maintenance. We replaced it with an entirely new clustered solution based on consensus.
  • What we have built is the Last9 graph. All these systems are actually connected, just like the World Wide Web is. It goes through the system and builds the components and their relationships so that any cascading impact is understood. We haven't open sourced this yet, but we are more than happy to.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Piyush Verma. I am CTO and co-founder at Last9.io. This talk is an SRE diary. As SREs, we are trained to be on pager all the time; I've held a pager for almost a decade now, in fact from before the role was even called SRE. But this talk is about the opposite end of it: all the times when the pager did not ring. And that's why I say software won't make a sound when it breaks.
Why do I say that? If you look at these photos, most of the things in the upper half would make a sound when they break: from the sonic boom of an aircraft, to a balloon, to a heart, and then the RAM and the BIOS. Not sure how many of us would associate a RAM failure with a sound, because I haven't seen modern computers do that; I've not even seen a BIOS screen in ages. So that could also be a case where people don't associate with it. But in the era prior to cloud computing, when these things broke they made a sound, so it was very easy to diagnose that something had gone wrong. That's probably not the case anymore.
Most failures come in a few flavors: software, human, network, process and culture. I'm going to talk about everything but software failures. A software failure is the easiest to identify. There will be a 500 error somewhere which will be caught by some regular expression, some forwarding rule. It will reach a PagerDuty, a VictorOps, or some other incident management system that has been set up, and that will ring your phone, your pager, your email, which works pretty well. All the other flavors of failure, starting from human, to network, to process, to culture, are the ones which make really interesting RCAs. A software failure is easy to identify; you find something in a log line. A culture failure, on the other hand, is something that you identify pretty late. And as you go from top to bottom, you realize that the failure that eventually shows up is more and more latent: a culture failure is the slowest to identify, a software failure the easiest, a human failure slower than a software failure but way faster than a culture failure.
I'm going to split this talk into a few parts where I speak of a few outages. Some of you may have heard this talk; this is a slightly different variation of my earlier one. These outages traverse very simple scenarios which otherwise, as SREs, we would bypass thinking, hey, I could use tool X for it, tool Y for it, tool Z for it. But those tools do not address the real underlying cause of a failure. I want to dissect these outages and go one level further to understand why they failed in the first place. Because as SREs, we have three principles that we follow. The very first thing is we rush towards a fire: we say we've got to bring this down, we're going to mitigate and fix as fast as possible. That's the first one, obviously. The second question, a very important one that we have to ask ourselves, is: where else is this breaking? Because chances are, our software, our situation, our framework is a byproduct of the practices and cultures that we follow. So if something is failing in one place, there's a very high likelihood that it fails in another place as well.
That's the second question we have to ask, because the intent is to prevent another fire from happening. And the third one, the really important one that we mostly fail to answer, is: how do I prevent this from happening one more time? Unlike a product, reliability cannot be improved one feature at a time. It cannot be done one failure at a time: a failure happens, I fix something, then another failure happens, I fix something. Sadly, actionability doesn't work that way, because customers do not give us that many chances.
The first outage I want to speak about: customer service is reporting login to be down. We check Datadog, Papertrail, New Relic, CloudWatch. Everything looks perfectly okay. All of them hint that things look okay: servers look okay, load looks okay, errors look okay. But we can't figure out why login is down. Now here's an interesting fact: login being down means that it is not reachable. Requests that never reach us won't trigger our inside-out monitoring systems, whether they sit in the form of a sophisticated Prometheus chain, a Datadog chain or a New Relic chain, because something that doesn't hit you doesn't create an alarm on failure. You don't know what you don't know. So what gives? Meanwhile, on Twitter, there's a lot of noise being made. Surprisingly, I've really seen that my automated alerts are usually slower than people manually identifying that something is broken. So on Twitter, there's a lot of noise happening.
What was the real cause? After a bit of debugging, we identified that a DevOps person had manually altered a security group, accidentally deleting the 443 rule as well. Quite possible to happen, because ever since the COVID lockdown started, people started working from home, and when they work from home you are always whitelisting some IP address or the other. And those cloud security group consoles usually let you delete a rule with a single click, so you can accidentally remove a rule you should not have, which results in a failure.
What was the real root cause here? How do we prevent this from happening again? The obvious second question that we asked was: okay, where else is this happening? We may have deleted other rules as well. But what's the real root cause? If we have to dissect this, it's not setting up another tool. It's certainly not setting up an audit trail policy, because somebody may miss having a look at that as well. The answer to a human problem cannot be another human reviewing it every time; if one human has missed it, another human is going to miss something as well. Then what is the real root cause? The real root cause here is that we do not have the culture, or rather that we allow exceptions where things are edited manually. If these cloud states were being maintained religiously using a simple Terraform script or a Pulumi script, it is highly likely that the change would have been recorded somewhere and a rollback would also have been possible. But that wasn't the case, and we had these exceptions of being able to manually go in and change something, even if it was: hey, I just have one small change to make, why do I really need to go via the entire Terraform route? It takes longer; every time I have to run an apply operation, it takes around ten minutes just to sync all the data providers, fetch the real state and then show me the diff. So it comes in the way of quickly applying a change. The side effect is that every once in a while we are going to make a mistake like this, which ends up resulting in a bigger outage.
In this particular case, it wasn't just that /login was down. Imagine deleting the 443 rule from the inbound side of a load balancer's security group: it's not just login that is down, everything was down. Login was just the endpoint that got reported at that point in time, because one of the customers had complained about it. So the real impact was larger, to a degree where we can't even tell how big the outage was, unless we now start auditing the request logs from the client-side agents as well, which themselves are at times missing. We don't even know how much business we lost; we just don't know anything about it. The only way to overcome this situation is to build a practice where we say that no matter what happens, we are not going to allow ourselves to take these manual shortcuts.
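As an illustration of the "where else is this breaking" question, here is a minimal sketch of a scheduled check that verifies critical ingress rules are still present on a security group, so a manually deleted 443 rule pages you before customers tweet about it. It assumes AWS with boto3; the group ID and port list are hypothetical, not the configuration from this outage.

```python
"""Minimal drift check: verify that critical ingress ports are still open.

Assumes AWS + boto3; the group ID and expected ports below are hypothetical
examples, not the setup described in the talk.
"""
import boto3

# Ports we never expect to lose on the public load balancer's security group.
EXPECTED = {
    "sg-0123456789abcdef0": {(443, "tcp"), (80, "tcp")},  # hypothetical group ID
}

def missing_ingress_rules(ec2=None):
    ec2 = ec2 or boto3.client("ec2")
    resp = ec2.describe_security_groups(GroupIds=list(EXPECTED))
    problems = []
    for sg in resp["SecurityGroups"]:
        # Collect (port, protocol) pairs that are currently allowed inbound.
        present = {
            (perm.get("FromPort"), perm.get("IpProtocol"))
            for perm in sg["IpPermissions"]
        }
        for port, proto in EXPECTED[sg["GroupId"]] - present:
            problems.append(f"{sg['GroupId']} is missing {proto}/{port} ingress")
    return problems

if __name__ == "__main__":
    for line in missing_ingress_rules():
        print(line)  # wire this into your alerting instead of stdout
```

A check like this is only a detective control; the preventive control the talk argues for is refusing manual edits altogether and keeping the security groups in Terraform or Pulumi, where the change would have been recorded and reversible.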
This still looks pretty simple and easy, so I want to cover another outage, outage number two. This is from when we were dealing with data centers; there weren't really these cloud providers here. Not that it changes the problem definition, but it's an important addition to the scope of the problem. Around 25 hours before a new country launch that we're supposed to do, PagerDuty goes off. We check our servers and we see certain 500s being reported in Elasticsearch. Because we are in a tight data center environment, central log analysis isn't really a luxury we have, since you have to go over a firewall hop, et cetera. So one of us just decides to start copying the logs, which comes in really handy a little later. A few minutes later, the 500s just stop coming in. PagerDuty auto-resolves. Everything looks good. Still, to our curious minds, we are wondering why something stopped working and how it auto-resolved, so we keep debugging. Five minutes later, PagerDuty goes off again, and again the public API is unreachable. We do get our Pingdom alerts; every alert we had set up goes off. Something looks fishy, because this is right before a public launch.
So we start checking Rundeck, because what is the most important question we ask when something fails? The most important question I have asked when something stops working is: hey, who changed what? Because a system in its stable, stationary state doesn't really break that often. It's only when it is subjected to a change that it starts breaking. The change could be a deployment, something that was altered, or a new traffic source that was added, but it's one of these changes that actually breaks a system. We also ask ourselves another important question: hey, is my firewall down? Why is the firewall important here? Because it's a data center deployment, so one of the changes could have happened on the firewall and the inbound connectivity could have gone away. But that wasn't down either. Okay, standard checklist. Check Grafana: nothing wrong. Check Stackdriver, because we were still able to send some data there: nothing wrong.
Check all servers: nothing wrong. Check load: nothing wrong. Check Docker, has anything restarted? Nothing restarted. Check APM: everything looks okay. Now, because we had copied the logs, we start realizing that there were database errors on only some of the requests, not everything, which looked really suspicious. We had all the tools that we wanted: we had Elasticsearch available, Stackdriver available, Sentry available, Prometheus available, SREs available. One way to solve a problem is to throw more bodies at it, and we really had all the SREs we wanted, a team of ten people, but we couldn't find the problem. Twenty hours later, after a lot of toil, we found that the mount command hadn't run on one of the DB shards.
Now why did this happen? These were data center machines that we had provisioned ourselves, and one of the machines was coming back from a fix the previous night. We had an issue where the mount would land on a temporary file system and wouldn't persist across a reboot. We had applied a fix across ten machines, but the machine that was broken at that point did not have that Ansible fix; Ansible had not been run on it. Just before the country launch we decided, okay, let's insert this machine so that it is available in case load arrives. So when the machine went in, it did not have that one fix we had applied. Data goes onto that shard as well. The machine wasn't fixed properly, so it rebooted again, and when it rebooted, that slice of data was gone. So every time a request hit that slice of data, it would result in an error, the error rate would spike, the load balancer would cut the machine off and send a PagerDuty alert, traffic would break, then the health check would come in, and the health check obviously is not checking that data, so the machine would come back into circulation, and this cycle goes on and on.
Pretty interesting story, but what's the real root cause here? The real root cause, if you try to dissect this further, was not that we were not running any automation; we had state-of-the-art Ansible configuration. If I ask myself, could I have avoided this outage? Probably not. What matters is what the outage taught me so that it doesn't happen a second time. Most of the time you won't be able to avert an outage the first time around, because they are so unique in how they happen. It's going to happen. But how can we prevent this from happening again? And when I say this, I don't mean this particular error, I mean this class of errors. The only way to solve that, we realized, is that a simple script could have done the job: a startup script, an osquery check, that would just verify the configuration drift of the machine. That probably could have saved this. Now, one of the big questions we ask while something is failing is: how was it working so far? That is one of the most interesting questions. And the next question is: what else is breaking?
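Here is a minimal sketch of that "simple startup script" idea: at boot, compare what /etc/fstab says should be mounted with what /proc/mounts says actually is, and fail loudly on any gap. It is an illustrative stand-in for an osquery-style drift check, not the team's actual tooling, and the paths and exit behavior are assumptions.

```python
"""Boot-time drift check: is everything in /etc/fstab actually mounted?

A minimal sketch of the "simple script could have done the job" idea;
paths and the alerting hook are assumptions, not the team's real setup.
"""
import sys

def fstab_mount_points(path="/etc/fstab"):
    points = set()
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            # Skip blanks, comments and swap entries.
            if len(fields) < 2 or fields[0].startswith("#") or fields[1] in ("none", "swap"):
                continue
            points.add(fields[1])
    return points

def mounted_mount_points(path="/proc/mounts"):
    with open(path) as fh:
        return {line.split()[1] for line in fh if line.strip()}

if __name__ == "__main__":
    missing = fstab_mount_points() - mounted_mount_points()
    if missing:
        # In production this would page someone; printing keeps the sketch simple.
        print("unmounted filesystems:", ", ".join(sorted(missing)), file=sys.stderr)
        sys.exit(1)
```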
On the lines of these two outages, I want to cover another one, outage number three. This one is about a distributed lock that was going wrong in production. We used to have a sort of hosted, automated Terraform setup which would allow multiple jobs to run at the same time, and there was a lock in place. But despite the lock, multiple jobs would come in; the lock contention was failing, almost to the degree that there was a lock, but it was pretty useless.
So let me define the ideal behavior of the lock. We were using etcd for distributed lock maintenance, with the standard compare-and-swap algorithm. How does compare-and-swap work? If I quickly walk through it: I set a value of one. Using the previous value of one, I set a value of two; everything works well. If I now try to create a value of three, it says the key already exists. And if I try to set a three against a stale previous value, it says the compare operation failed. So it's a pretty standard compare-and-swap API, and it works.
But what was really happening here? We start with a default value of "stopped", and both processes try to set the key to "started" with a previous value of "stopped". The expected behavior is that only one should win: the previous value is "stopped", I set a new value of "started", and only one process should go through. What actually happens is that run A acquires the lock, with a TTL. The TTL is very important, because in a lock-based system without a time-to-live expiry, the process which locked or acquired the lock may die; so it's extremely important that the process which sets the lock also sets an expiry, so that if the process dies, the lock is freed up and available for others to use. That works well. The only difference is that when A tries to update the status, we get "key not found"; B tries to update the status, and again we get "key not found", which is very weird, because the process which set the lock should be able to unset it and free it, or at the very least both of them should run into the contention. But this is very unique here, because both processes say they have not found the key: A says key not found, B says key not found.
This is where we all become full Stack Overflow developers. We start hunting our errors, we start hunting the diagnosis, and we land on a very interesting GitHub page which mentions that etcd has a problem, that TTLs are expiring too soon, related to an etcd leader election. It looks like a fairly technical and sophisticated explanation, and we say, all right, that makes sense, this must be it. So a GitHub ticket on an open source system basically tells us that etcd is the problem. Our solution is, well, we have to replace etcd with Consul, for reasons of a better API, better state maintenance, et cetera. We almost convince ourselves that this is the right way to go. We make the change, it's a week-long sprint, we spend a lot of time and effort replacing it, and things go back to normal.
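To make that compare-and-swap lock concrete, here is a minimal sketch of a TTL'd lock against the etcd v2 keys HTTP API, the prevExist/prevValue style described above. The endpoint, key name, owner names and TTL value are illustrative assumptions, not the team's actual configuration.

```python
"""Minimal sketch of a TTL'd compare-and-swap lock on the etcd v2 keys API.

Endpoint, key name and TTL are illustrative; this is not the talk's real code.
"""
import requests

ETCD = "http://127.0.0.1:2379/v2/keys"
KEY = "deploy-lock"   # hypothetical lock key
TTL = 15              # seconds; if this is shorter than the clock skew between
                      # etcd members, the key can expire before it is released

def acquire(owner):
    # Create the key only if it does not exist yet (prevExist=false is CAS on absence).
    r = requests.put(f"{ETCD}/{KEY}",
                     data={"value": owner, "prevExist": "false", "ttl": TTL})
    return r.status_code in (200, 201)

def release(owner):
    # Delete only if we still own it (compare-and-delete on the value).
    r = requests.delete(f"{ETCD}/{KEY}", params={"prevValue": owner})
    if r.status_code == 404:
        # "Key not found": the TTL expired before we released it -- the exact
        # symptom described in this outage.
        return False
    return r.ok

if __name__ == "__main__":
    if acquire("runner-a"):
        try:
            pass  # do the mutually exclusive work here
        finally:
            if not release("runner-a"):
                print("lock disappeared before release; check TTL versus clock drift")
```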
Is that the real reason, though? One of us asks: look, if this was the reason, how was it working earlier? That's the most important question, right? How was it working earlier? We're not convinced that this could really be the explanation. So we dig deeper, and one of us really dug deep to find out that the only problem that existed on those servers was that the clock had drifted. One of the machines was running apart from the rest of the cluster in time. Why would this fail? Because the leader at that point was setting a TTL that was very short, shorter than the drift. By the time the rest of the cluster got that value, the time had already expired, so subsequent operations against the cluster would say the key is not found. Something as elementary as a basic checklist. This is something we all know: clocks are important, NTP is extremely important. But this is where we forget to have basic checklists in place. We almost ended up convincing ourselves that etcd was the problem and that we should replace it with an entirely new clustered solution based on a different consensus implementation. It worked, but it wasted so many hours of the entire team. It just baffles me today: a very elementary thing, the first thing we learn when we start debugging servers, that a time drift can cause a lot of issues, and this was the first time we really experienced it. It's one of those bugs that you hit every five years and then forget about.
But what was the real reason here? The real reason was not automation. The real reason was not any SRE tooling. It obviously wasn't that we weren't running on Kubernetes, though many times I find that offered as a solution. A simple checklist of our basic first principles could have done the job: a very simple one that asks, is an NTP check installed or not? That could have saved this, and that's the real root cause behind it.
This is what I want to highlight: most of the time, our SRE journey assumes that another monitoring system, another charting system, another alerting system could have saved us from any of the problems I just described. And these are failures with real business cost, because each one led either to a delay in our project or to a downtime which really affected the business. This isn't about SLOs or SLIs, because that's typically the material we find when we read about SRE on the Internet. This is about getting the first principles right. The first principle of: we are not going to make manual changes to a server, no matter what; and if the reason is that we are being too slow, if Terraform is too slow, we need to learn how to make it faster, but we are not going to violate our own policy against manual changes. It's not about anomaly detection or any fancy algorithms either. The second outage is probably a derivative of the fact that we did not have a simple configuration validator; a simple, basic tool like osquery that runs at system start and checks whether my configuration has drifted could have been the answer. The third one is the simplest of all, but the toughest. We spent our time checking timestamps in Raft and Paxos, but we didn't even have to go that far: the answer was that our own server's timestamp was off. It could have been caught by a simple check that says, hey, is NTP installed and active on this system, which could have been done with a very simple, elementary bash script as well. The fact is that we don't push ourselves to ask these questions, and we look for answers in these tools instead.
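And here is the kind of elementary check the talk is pointing at: measure the host's clock offset against an NTP server and alert if it exceeds a threshold. It uses the third-party ntplib package (pip install ntplib); the pool server and the 500 ms threshold are arbitrary illustrative choices, not values from the outage.

```python
"""Basic checklist item: is this host's clock actually in sync?

A minimal sketch using ntplib; server and threshold are illustrative.
"""
import sys
import ntplib

MAX_OFFSET_SECONDS = 0.5  # anything approaching your lock TTLs is already dangerous

def clock_offset(server="pool.ntp.org"):
    response = ntplib.NTPClient().request(server, version=3, timeout=5)
    return response.offset  # local clock minus NTP time, in seconds

if __name__ == "__main__":
    offset = clock_offset()
    if abs(offset) > MAX_OFFSET_SECONDS:
        print(f"clock is {offset:+.3f}s off NTP time", file=sys.stderr)
        sys.exit(1)  # page before a drifting clock pages you
```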
What we forget most of the time is that what we see as errors are actually a byproduct of things we have failed to do — and failed is a wrong, strong word — things we have ignored to take care of at the start of our scaffolding, of how we build this entire thing.
Which takes me to outage number four. This isn't a real outage; this is the one that you are going to face tomorrow. What will be the real root cause behind it? The real root cause behind that one is that we didn't learn anything from the previous one. A very profound line from Henry Ford: the only real mistake is the one from which we learn nothing. And that would be the reason for our next outage.
Why is this important? Using all these principles is what we do at Last9. What we have built is the Last9 graph. Our theory, our thesis, is that all these systems are actually connected, just like the World Wide Web is. If you try diagnosing a problem after a situation has happened and there is no way of understanding these relations, it's almost impossible to tell the impact of it. For example, my S3 bucket permissions are public right now, and this is something I want to fix. While I know the fix is very simple, I do not know what the cascading impact of that fix is going to be on the rest of my system. This is extremely hard to predict right now because we don't look at it as a graph, as a connected set of things. This is exactly what we're trying to solve at Last9, and that's what we have built the knowledge graph for: it goes through the system and builds the components and their relationships so that any cascading impact is understood very clearly.
To give an example of how we put it to use: we load the graph and try to find out, hey, what does the spread look like for my instances right now? If I run that query, it quickly tells me that the split is four-four-three. There is an uneven split across the availability zones, which might look like a problem if one of them were to fail; but in this particular case, because I wanted an odd number of machines, this is a perfect scenario. Was this the case a few days back, though? Clearly not. Back then it was an even number of machines, which is probably not the right way to split this application. Questions like these, if I can answer them quickly, are what allow us to make these systems reliable.
That's what we do at Last9. We haven't open sourced this yet, but we are more than happy to: if you want to give it a try, just drop us a line and we'll be extremely happy to set that up. And obviously there's a desire to put this out in the open source domain once it matures enough. That's all for me,
...

Piyush Verma

CTO @ last9.io



