Conf42 Site Reliability Engineering 2020 - Online

- premiere 5PM GMT

SREs love Chaos Engineering


What’s Chaos Engineering? Is it part of SRE? Is it breaking things randomly in production? In this talk, we’ll try to settle these questions once and for all, and to give you a life-like demo of what Chaos Engineering looks like in practice!


  • Mikolaj Pawlikowski talks about chaos engineering and how it overlaps with SRE. He also uses this opportunity to talk about his latest book, Chaos Engineering: Crash Test Your Applications. Two demos show how chaos engineering can help SREs in practice.
  • SRE, site reliability engineering, is a concept and a term that was coined at Google. If you go to an SRE pub, imagine what you're going to hear about. These are people who deeply care about making sure that things run smoothly.
  • Reliability is an encompassing term that is a little bit vague. It might be about performance, availability, or latency. Then you have all the monitoring, alerting, and on-call. SRE is software engineering used to remove the bad bits, to automate them away.
  • The real magic happens when you start using chaos engineering for your SRE purposes. You can apply chaos engineering techniques to pretty much any system. It can be as simple as a single process running on a single computer.
  • The first demo uses a tool called Pumba, which wraps the Linux tc (traffic control) utility in a really cool way, to see what happens if we introduce some slowness, some delay, between two containers at runtime. As it turns out, it's pretty simple.
  • The second demo shows how useful chaos engineering can be for SREs to verify their SLOs and to detect breaches. It uses a tool called PowerfulSeal, built for chaos engineering on Kubernetes specifically, and shows the different policies that you can implement with it.
  • PowerfulSeal allows you to continuously verify that an assumption you make, for example that all pods are running, is actually true. You can configure PowerfulSeal to either fail quickly if the check fails or, for ongoing verification, to produce metrics that you can later scrape.


This transcript was autogenerated. To make changes, submit a PR.
Hello, everybody. My name is Mikolaj Pawlikowski, and I would like to thank you for coming to this talk. I'm going to talk to you about something that really excites me. I'm going to talk about chaos engineering, and in particular how it overlaps with SRE and why SREs really should love chaos engineering to begin with. Here's the plan for the next half an hour or so. Like any good scientist, we're going to start with defining our terms. We're going to see what SRE is, we're going to see what chaos engineering is, and then we're going to focus on the actual overlap, where the two come together to create extra value. I'm also going to use this opportunity to talk about my latest book, Chaos Engineering: Crash Test Your Applications, and I'm going to finish with two demos to illustrate what I really mean and show you in practice where chaos engineering can really help SREs. Sounds like a plan? Well, hopefully. So let's start with SRE, site reliability engineering. I'm pretty sure that some of you, at least, are familiar with that. It's a concept, a term that was coined at Google. And in one sentence, it's basically what you end up with if you give the operations work to software engineers. Right? So the way that I like to think about it is that if you, on a Friday evening, go to an SRE pub, assuming that kind of place exists, imagine what you're going to hear about. You're probably going to hear things like reliability, like performance, like latency, like availability. You might hear someone in the corner ranting about on-call and alerting and monitoring. You're probably going to hear someone talking about business objectives and SLOs and SLAs and stuff like that. You might even hear someone mention latency, and that's all good. That basically gives you a good idea of what these people actually care about.
If this is what they talk about on a Friday evening, that means that they must deeply care about it, and this is kind of the kernel of what SRE really is. These are people who deeply care about making sure that things run smoothly, right? So it's one thing to write software and features, and it's a different problem to actually run that, and run it well and at scale and without hiccups, right? And this is where we get to the core of what SRE is, right? So all these things that I mentioned in the virtual pub, they all have to do with reliability. Reliability is kind of this encompassing term that is a little bit vague and can involve pretty much all of those words here, right? So it might be that your system is performance based. It might be that, for example, the value that your system provides depends on how many operations per second it can produce, in which case the reliability will be to make sure that that performance can be sustained, and that it can be sustained long term. Or it might be that it's much more important for your system to be highly available, because if users can't use it for even a small fraction of time, they're going to be much more upset than if the performance is degraded. So that might be your reality of reliability. It might be latency. It might be that people are browsing funny videos and they really care about latency, and if the video is too slow, they're going to go and spend their money elsewhere, or watch the funny videos elsewhere. So that latency, you can measure that, and this is your reality of reliability: to make sure that that latency is maintained. Right. Then you have all the monitoring, alerting, on-call. These are the tools that give us visibility into what's actually going on under the hood in the systems and alert us when our attention is needed.
The dreaded on-call and the paging, where people sometimes have to wake up to fix something: these are the people who do that. These are the people who set up the alerting and set up the monitoring so that they know that something's wrong. Right. The capacity planning, that is also part of your reliability. It's not the most glamorous. Hardly anyone likes to talk about a massive Excel file when they try to predict how many resources they need. But it's essential. Someone needs to do that. Someone needs to think, okay, if we're going to grow by this many users in the next quarter, we're going to need more disks and more CPUs and more RAM and more X and Y, right? This is part of the reliability. Or the SLOs, right? You've probably heard of SLAs. You might not have heard of SLIs and SLOs. SLI stands for service level indicator, and it can be any quantity that's measurable and that you care about in terms of verifying that your system runs well. So let's say that, for example, you run some kind of API. It's an API where people can send a request and they get the right meme. Right. Your meme API. And the SLI that you might care about is, for example, the speed of the response. Let's say that the 99th percentile of the response time is something that you care about, right? With that you can design an SLO, a service level objective, which is basically a range of a particular SLI that you care about, like the mentioned 99th percentile of response time. You might want it to stay between 100 and 200 milliseconds, for example. Right. That's an objective. And then an SLA is an agreement. It's basically a contract between two parties, where one of the parties provides some kind of service and promises that this SLO is going to be satisfied, or else. And typically the "or else" is some kind of financial compensation or something like that. It can be a cool t-shirt, it can be anything really.
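To make the SLI and SLO vocabulary above concrete, here is a minimal Python sketch. It assumes the SLI is the 99th percentile of response time computed with the nearest-rank method, and that the SLO is the 100 to 200 millisecond range mentioned above. The function names and the sample data are hypothetical and only serve the illustration.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of response times (pct is an integer 1..100)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct/100 * n, to avoid floating point surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def slo_met(samples_ms, pct=99, low_ms=100, high_ms=200):
    """Return (ok, sli): whether the SLI falls inside the SLO range, and the SLI itself."""
    sli = percentile(samples_ms, pct)
    return low_ms <= sli <= high_ms, sli


if __name__ == "__main__":
    # Hypothetical measurements: mostly fast responses with two slow outliers.
    samples = [120] * 98 + [900, 900]
    ok, sli = slo_met(samples)
    print(f"p99 = {sli} ms, SLO satisfied: {ok}")
```

An SLA would then be the contractual layer on top of this check: what the provider owes when `slo_met` keeps returning `False`.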
But the idea is that we're going to do whatever we can to make sure that that SLO is satisfied. And if it's not, we're going to make it up to you somehow. Right. So this really is kind of a sample of what SRE is. And some of you will be like, okay, well, that kind of sounds like operations. And yeah, bingo. That's basically what it is. It's operations, just with more software engineering to it. The software engineering is being used to remove the bad bits, to kind of automate the bad bits away. And the bad bits have a really funky name, a nice name: toil. Right? We typically speak about toil when we talk about SRE. So the deal for the SREs is that they're going to spend less of their time dealing with the actual operations, the on-call, the crisis management, and spend most of their time using their software engineering skills to roll out automation, to automate the toil away, out of the equation. Right. So they're going to write software to automate things so that there is less ops and on-call to do to begin with. Right. That said, there is a lot of hype about that, but at the core of it, this is what it is. A lot of the systems are pretty big and pretty amazing and pretty everything. But you don't have to be at Google to be an SRE. And there are plenty of systems that need the same kind of treatment to verify that they run smooth as silk. Right. Okay, so we've kind of defined SRE. What about chaos engineering? I'm guessing that some of you probably first came into contact with chaos engineering in the context of Chaos Monkey. And that's great, that's fantastic. But it also kind of gives chaos engineering a bad rep. Chaos Monkey was randomly taking down VMs for Netflix so that they could detect things that they didn't detect with other testing techniques. But if you google that, you're probably going to end up with something along the lines of "let's break things in production" and "breaking things on purpose". And that's great.
But the breaking of things is not really what we're after here. What we're after is experimenting to verify hypotheses about our system's behavior in the presence of failure. So yes, we want to inject the kind of failure, introduce the kind of failure, that we reasonably expect to happen. But we don't really try to break the system. We actually try to learn that either the system behaves the way that we expect, that's the hypothesis, or to disprove that and learn that it needs to be fixed. Okay, so at the core of it, it's much more scientific than just going and randomly smashing things in production. We actually want to very finely control the amount of failure and the type of failure that we inject, most of the time to verify that what we think is going to happen is actually happening, right? That we are right in thinking about the properties of the system. And this is really where the value comes from. And sure, there is the aspect that is kind of on the border between chaos engineering and fuzzing techniques, where you want to create these half-random, pseudorandom situations so that you can end up with combinations that you didn't think about testing yourself, but this is just a part of it, right? So next time you see "oh, let's randomly smash things in production", that's not really what it is about, or at least it's not really what it's about for everybody. If this is a good idea for you, if your production system is of a nature that allows you to do things like that, that's great. That's absolutely stunning. If you are at the maturity level where you can actually run this kind of thing in production, that's great. Because if you think about it, you can never really 100% test anything before it hits production, right? Because you can try very hard to reproduce the kind of failure that you expect. You can try very hard to reproduce the same environment in some kind of test stage, dev stage, pre-production stage.
But at the end of the day, there will be things that will be different. It might vary very slightly, it might just be a user pattern, but it's technically either impossible or prohibitively expensive to reproduce production exactly, right? So if you can, this is like the holy grail: you run these kinds of things in production and you can verify and uncover real problems on a real production system. But it also gives it a bad rep, because when people read about that, they stop taking it seriously. It's like, well, yeah, okay, cool, we would never do it here. So I just kind of want to remind you that this is the case. So now we have these two concepts. We have SRE on the left-hand side and we have chaos engineering on the right-hand side. And I would like to argue, and spend the rest of the time that we have together showing you, that where the real magic happens, where the love happens, is when you start using chaos engineering for your SRE purposes. Okay, I'm not just saying that. I deeply believe that. In fact, I believe in that so deeply that I wrote a book about it. It's called Chaos Engineering: Crash Test Your Applications. It's available right now in early access from Manning, if you go to manning.com and look for chaos engineering. And it's trying to show you that you don't need a massive distributed system. If you have one, that's great. But you can apply chaos engineering techniques to pretty much any system. It can be as simple as a single process running on a single computer. I have a chapter there that shows you how you can treat that single process and that single computer as a system and, for example, block some system calls and verify that this process, as a system, might not behave the way that you expect it to behave. It might not have the error handling or the retries that you expected it to have. It might actually work differently.
And then it kind of builds up from the small examples, looks into how to introduce failures between components, how to introduce slowness between components on the networking level. It talks about introducing failure through modifying code on the fly. If you happen to be running something that executes in a JVM, there is a chapter on that where you can learn how to inject bytecode into your classes without actually modifying the source code. So you can take someone else's code and inject the type of failure that you expect, and then verify that the system as a whole behaves in the manner that you expected. And then it builds up all the way to things like Docker, where you test things running in Docker, or you test Docker itself. And Kubernetes, if you have larger systems that are running distributed, and kind of anything in between, or even if you want to test the kind of failure that you expect in your front-end JavaScript. It's really a great tool and I really want to show people that this is something that you can use in many situations, and it's not just for Netflix and for Google. This is much more broad than that. Okay, so these are the things that we just discussed. And as you can see, apart from snarky t-shirts, which don't really need an improvement (typically SREs have them on point), all of these things can be helped with the use of chaos engineering. For all of these things you can design experiments, and they can be verified through these experiments, verified in terms of the hypotheses and assumptions that we have about them. Okay, so just to show you that the overlap is big: I kind of felt like the previous slide was showing the overlap as a little bit tiny, so I zoomed in here. And I would like to now show you a little bit more in practice what I actually mean when I say that chaos engineering can be leveraged for SRE. So let's jump to my first demo.
What you're seeing here is the VM that comes with my book. It's more or less vanilla Ubuntu with all the things that you need for chaos engineering preinstalled. Here I'm going to use one of the examples that I have. This is coming from a chapter on Docker, and it's a descriptor for Docker Swarm, or Docker stack, that describes two services. One of them is called ghost and it basically just runs the image for Ghost and provides some configuration for the database, and then there's the database itself, which is MySQL 5.7 with not a particularly safe password. So what it does is basically start these two containers with the configuration to point to each other so that we can run Ghost. And if you take a look, I actually already have it running. I already ran the Docker stack and you can see my Ghost container and you can see my MySQL container. If you're not familiar with Ghost, it's a blogging engine. It's a little bit like WordPress, but just a little bit more modern. Also note the names of the containers: one of them starts with meower_ghost and the other starts with meower_db, because we're going to use these names later on. So the first thing we should do is actually verify that this thing is working. We should be able to go to 127.0.0.1, port 8080, and we should be seeing the application. And boom, looks like it's actually working. Okay, so this is great. We get something, but what we actually care about is some kind of SLI, some kind of metric that we care about and that we want to make sure we satisfy, because that's what we do as SREs. Okay, so one of the most basic things that we can do is use something like Apache Bench to basically create a lot of requests and verify how quickly these requests return. This is just running ab for 10 seconds with a concurrency of 1 against 127.0.0.1:8080, so that we get an idea.
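As a rough illustration of what this kind of benchmark measures, here is a minimal Python sketch of a sequential load loop with a concurrency of one. It is not ApacheBench itself. The `send_request` callable and the injectable `clock` are hypothetical names introduced for the example, so that the loop can be exercised without a running server.

```python
import time
from typing import Callable, Tuple


def measure_time_per_request(
    send_request: Callable[[], bool],
    duration_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, int, float]:
    """Send requests one at a time for duration_s seconds.

    Returns (completed, failed, mean_time_per_request_ms).
    send_request must return True on success and False on failure.
    """
    start = clock()
    completed = failed = 0
    while clock() - start < duration_s:
        if send_request():
            completed += 1
        else:
            failed += 1
    elapsed = clock() - start
    total = completed + failed
    mean_ms = (elapsed / total) * 1000 if total else float("inf")
    return completed, failed, mean_ms
```

In a real run, `send_request` could wrap something like `urllib.request.urlopen("http://127.0.0.1:8080/")` and return whether the call succeeded. The mean time per request that comes out of the loop is the SLI used for the steady state in this demo.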
So you can see that during these 10 seconds we got 105 requests that were complete and we got no failed requests, which is also great. That translates into 95 milliseconds per request, which is, let's be honest here, running on localhost, not the record of the world, but it's not too bad either. Okay, so what we just did would, in the chaos engineering parlance, be called the steady state. This is the metric that we have, which is time per request. Our SLI, time per request, is roughly 95 milliseconds. If we run it again, we're probably going to get a slightly different number. But I would expect that it's not going to be very different, because we didn't change anything. Okay, so let's just finish the 10 seconds. Actually, now it's 50 milliseconds. I guess it was warming up a little bit, in which case I'm going to run it again so that we can actually verify that the steady state is okay. Ten seconds is not particularly long. So it looks like there was some cache warm-up. And now we have about 50 milliseconds time per request. Brilliant. All right, so it would be a shame now if someone went ahead and introduced some failure, right? So we have two components. We have the database and the blogging engine. And what we might want to do is just see what happens if we introduce some slowness, some delay, between the two at runtime. Right? And as it turns out, it's actually pretty simple. We can do it pretty easily. One of the things that make it easy is this tool called Pumba. It's an open source resilience tool that essentially, apart from doing things like killing containers, wraps the Linux tc (traffic control) utility in a really cool way. I'm not going to go too much into the detail, but I just wanted you to know that what you can do is ask Pumba to actually run tc from inside of a container that's attached to the container that you care about, to introduce slowness.
So what we want to do is introduce slowness on this container, and we're going to go ahead and actually introduce that to all the networking. And with Pumba that's pretty simple. Actually, I already have a command here that I use, and I use this netem subcommand. We're going to run the entire experiment for 120 seconds. We're going to specify a tc image, because Pumba allows you to either rely on tc being available in the container that you're targeting or, if you're using someone else's container or you just don't want to have tc there, you can start another container like I just described, like for example this one, that has tc built in, to connect to that other container and execute tc in there. And then we're going to add a delay, a time delay of 100 milliseconds. And we're going to ignore jitter and correlation for now, both set to zero, so that we can just see the results more easily. And then the nice feature is that you can specify either the container name or, if you prepend it with re2:, you can use a regular expression. So for example, my meower_db is going to match everything that starts with it or includes that. So in particular the name of the meower_db container that we were looking at before, the one here, is going to be matched. All right. So I'm going to go ahead and run it. And what it should do is start that other container, execute the stuff and end. So I'm going to start another tab so that we can see what's going on. And the interesting thing is that you see the actual container being created and exiting 19 seconds ago. That container executes tc. And when this command is done, it's actually going to go ahead and execute another container that's also going to be visible here. So I'm going to go ahead and rerun the same ab on port 8080 to verify our state right now, with the 100 milliseconds added.
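The command line described above can be sketched as follows. This Python helper only assembles the Pumba arguments and does not run anything. The tc image name `gaiadocker/iproute2` is an assumption taken from Pumba's documentation, since the image used in the demo is not named in the talk, and the helper name is hypothetical.

```python
from typing import List


def pumba_netem_delay_cmd(
    target: str,
    delay_ms: int = 100,
    duration: str = "120s",
    tc_image: str = "gaiadocker/iproute2",  # assumption: helper image that ships tc
    jitter_ms: int = 0,
    correlation: int = 0,
) -> List[str]:
    """Build the argument list for a `pumba netem ... delay` run.

    target is a container name, or a regular expression when prefixed with "re2:".
    """
    return [
        "pumba", "netem",
        "--duration", duration,
        "--tc-image", tc_image,
        "delay",
        "--time", str(delay_ms),
        "--jitter", str(jitter_ms),
        "--correlation", str(correlation),
        target,
    ]


if __name__ == "__main__":
    # Print the command instead of executing it; pass the list to
    # subprocess.run(cmd, check=True) to actually start the experiment.
    print(" ".join(pumba_netem_delay_cmd("re2:meower_db")))
```

Keeping the command as a list makes it easy to vary one parameter at a time, for example the delay, and rerun the same benchmark after each change.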
And boom, look, we went from roughly 50 milliseconds to 500 milliseconds. So from roughly 200 requests in 10 seconds to just 18. So what happened here, right? We can also rerun that just to make sure that we get consistent results. But what happened is that you might have expected the 100 milliseconds in between the database and the Ghost container to translate into an extra 100 milliseconds of delay to the user. What actually happened is that we got almost 500 milliseconds of delay. So we got a multiple of that. And the reason for that is that Ghost probably talks to the database more than once, even for the index page that we are querying. And that means that if that container's networking is slowed down by 100 milliseconds, what we're actually going to see is closer to a 500 millisecond delay. So just to confirm this, I'm going to run it again so that we can see that we're back to 50 milliseconds, because the Pumba setup is done. So about 60 milliseconds, which is good enough. And then if you look at docker ps -a, you can see that we have the other one: the first one was tc qdisc add, and then the next one was tc qdisc del, which exited 39 seconds ago. Okay. This is how Pumba was able to actually affect the networking of a container that was running an image that we didn't instrument in any way. So this is a really very short version of this demo. But my goal here was just to show you how easy it can be, with the right tooling and the right knowledge, to verify things like that. And now we know that if we can reasonably expect the database networking to have delays like 100 milliseconds, that will affect our overall delay for the Ghost setup by much more than 100 milliseconds, and in particular by something closer to 500 milliseconds.
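The reasoning about the multiplied delay can be written down as a small back-of-the-envelope model. This is a sketch under a simplifying assumption that is not verified in the talk: the page makes a fixed number of sequential database round trips, and each one pays the injected delay once. The function names are hypothetical.

```python
def estimate_round_trips(baseline_ms: float, degraded_ms: float, injected_delay_ms: float) -> float:
    """Estimate how many delayed round trips explain the observed slowdown."""
    if injected_delay_ms <= 0:
        raise ValueError("injected delay must be positive")
    return (degraded_ms - baseline_ms) / injected_delay_ms


def predict_response_ms(baseline_ms: float, round_trips: float, delay_ms: float) -> float:
    """Predict the time per request for a different injected delay."""
    return baseline_ms + round_trips * delay_ms


if __name__ == "__main__":
    # Rounded numbers from the demo: about 50 ms before, about 500 ms with 100 ms injected.
    trips = estimate_round_trips(50, 500, 100)
    print(f"estimated sequential round trips: {trips}")
    print(f"predicted time with a 1 s delay: {predict_response_ms(50, trips, 1000)} ms")
```

With the rounded figures from the demo, the model suggests between four and five database round trips per page, which is why a delay on the database link shows up multiplied in the user-facing response time.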
So if the delay rose to a second, we could probably expect to actually see something closer to five seconds rather than just one. Okay, and this is my second demo, where I would like to show you a little bit more on the Kubernetes side of things, for all the people who are using Kubernetes to deploy their software. So let's take a quick look at SLOs and Kubernetes. The purpose of the second demo is to show you how useful chaos engineering can be for SREs to verify their SLOs and to detect breaches in their SLOs. So I've got here a minikube setup, just a basic one with a single master, and I've got a bunch of pods running. Also, nothing out of the ordinary. This is just the stuff that minikube starts with. And what I'm going to do is use a tool called PowerfulSeal. It's something that I wrote a while back, and it's a tool for chaos engineering for Kubernetes specifically. If you've not used it before, I recommend going to GitHub and giving it a try. But basically what it does is that it allows you to write these YAML descriptions of scenarios. And then for each of these scenarios you can configure a bunch of things that you can do to verify that your assumptions are correct. And if they're not correct, it's going to error and you can alert on that. So if you want to get started with that, there's a quick little getting started tutorial. But the most important stuff is about writing policies, in this section here, where you can see examples of the different policies that you can implement using PowerfulSeal. And if you are wondering what the syntax looks like, there is up-to-date, automatically generated documentation that shows you what kind of things you can do. So if you go to scenarios and then steps, you can see the kind of things that are available to you. So probeHTTP, kubectl, podAction, nodeAction, wait and stuff like that. But this is for another day.
I just wanted to give you a quick insight into where to look for those kinds of things. But let's take a look at what that actually looks like in action. So back to our little minikube setup. I have seal already available here, which I preinstalled, and I also prepared two little examples of a policy. So I'm going to start with the hello world. And this is what it looks like. It's a simple YAML with a single scenario called "count pods not in the running state". And what it does is that it has a single step with a pod action. And inside of the pod action there are always these things that you can do: you match a certain initial set of pods, you can filter them by whatever property or whatever filters you feel like, and then you act on the result. So in our case, I'm going to match all the pods from all the namespaces, and then I'm going to pick the ones that have the state property that is, with the negative, not running. And then I'm going to count these and I'm going to verify that the count is always zero. So what is it going to do for me? We saw in the get pods that all of them were running. So this kind of verification, very simplistic here, shows how you can continuously verify that the assumption that you make, the assumption being that all pods are running, is actually true, and you can do that with 20 lines of YAML. So in order to run this, we're just going to do seal autonomous to invoke the autonomous mode, and then we need to specify the policy file, and this is simply done with the policy file flag. If we run that, PowerfulSeal is going to connect to our cluster. And what you can see here is that it matched the namespace star, so it matched all the namespaces. It matched eleven pods, corresponding to the pods that we found here. And it found an empty set after the running, negative true filter. So the filtered set length is zero and the scenario finished successfully. By default, it's also going to go ahead and sleep for some time and retry that later on, which is also configurable.
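The logic of this hello world scenario (match, filter, count, check) can be sketched in a few lines of Python. This is not PowerfulSeal's implementation or its policy syntax. The `Pod` class, the function names, and the sample pods are hypothetical stand-ins that only mirror the steps described above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Pod:
    name: str
    namespace: str
    state: str  # e.g. "Running", "Pending", "Failed"


def match_pods(pods: Iterable[Pod], namespace: str = "*") -> List[Pod]:
    """Initial match: "*" selects pods from all namespaces."""
    return [p for p in pods if namespace == "*" or p.namespace == namespace]


def filter_not_in_state(pods: Iterable[Pod], state: str = "Running") -> List[Pod]:
    """Negative filter: keep only the pods whose state differs from `state`."""
    return [p for p in pods if p.state != state]


def check_pod_count(pods: List[Pod], expected: int = 0) -> bool:
    """The scenario passes when the filtered set has the expected size."""
    return len(pods) == expected


if __name__ == "__main__":
    cluster = [
        Pod("coredns-abc", "kube-system", "Running"),
        Pod("etcd-minikube", "kube-system", "Running"),
    ]
    not_running = filter_not_in_state(match_pods(cluster))
    print("scenario passed:", check_pod_count(not_running))
```

Dropping the negative filter, as done later in the demo, corresponds to counting the running pods instead, which makes the same zero-count check fail on a healthy cluster.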
So if we just wanted to verify that this is actually working, what we could do is remove the negative and verify that it's failing. If we count the pods in the running state, then in the check pod count the count is not zero. So if we run it now, this should fail, complaining that we got eleven pods instead of zero, which is exactly what we saw here. And you can configure PowerfulSeal to either fail quickly if this happens or, if you want this kind of ongoing verification, it can produce metrics that you can later scrape. So with just a few lines of YAML we're able to verify an SLO, which is kind of silly, all pods running, but it gives you an example of what you can do with this kind of thing. So let's do another example, a little bit more complex than that. I prepared another one called policy one for you here, so let's take a look. This time we actually specify the run strategy, so we want to run it just once, and we've got an exit strategy, fail fast. And the scenario is a little bit more complex this time. What I'm trying to verify here is the deployment SLO: after a new deployment and a service are scheduled, it can be called within 30 seconds. So let's say that I designed my Kubernetes setup and I designed all of the bricks in a way that I am fairly confident that at any given time, when I schedule a new deployment and a corresponding service, within 30 seconds everything will be up and running and I'll be able to actually call it. The way that I implement that is through the kubectl action. The kubectl action lets you more or less specify the payload and the action, so apply or delete. It's equivalent to kubectl apply -f with the standard input. It also allows you to automatically delete at the end so that you can clean up and so that you don't leave some kind of artifact after you're done. So the payload here is a little bit more complex. It's actually deploying another application that I wrote that is very useful for Kubernetes.
It's called Goldpinger. And Goldpinger allows you to basically deploy an instance of Goldpinger per node, typically by using a DaemonSet. Or you can deploy it more or less wherever you like, whichever way you like. But the default use case is that you use a DaemonSet so that you run an instance of Goldpinger per node, and then they continuously create this full graph of connectivity between the nodes, which you can use to verify whether there are any issues connecting, or whether your networking is slower between certain nodes, and stuff like that. So this is like a drop-in that you can run on your cluster to verify these kinds of things. It also produces metrics and things like heat maps and stuff like that. But going back to our example: in order for that to work, it needs a service account so that it can list pods, so that every Goldpinger instance can actually see what other Goldpinger instances are there to send pings to. And then we've got the deployment, and the deployment is fairly standard. Right now I only have a single node, so I'm just going to deploy a single replica. It has a selector, it uses the service account that we just set up, and a bunch of variables here that are not particularly relevant to us right now. This is just to make sure that things are working. It also comes with a liveness probe and a readiness probe, so that we know that if we can ping it, that means that it was able to pass the probe initially. And finally we've got a Goldpinger service, a service that we're going to use to actually issue a request. And then after that, this is where our SLO kicks in. We verify after 30 seconds, so the magic number here, our magic range, is between zero and 30 seconds. And finally this is where the verification happens. We have an HTTP probe that calls the healthz endpoint of the Goldpinger service in the default namespace, which is the one that we defined just here.
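The shape of this scenario (apply a payload, wait for the SLO window, probe, and always clean up) can be sketched in Python as follows. This is not how PowerfulSeal implements it. The three callables are hypothetical hooks that a real script would back with kubectl calls and an HTTP request against the service's health endpoint.

```python
import time
from typing import Callable


def verify_deployment_slo(
    apply: Callable[[], None],
    probe: Callable[[], bool],
    delete: Callable[[], None],
    wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy, wait wait_s seconds, then check that the service answers.

    The cleanup in `delete` runs even if the wait or the probe raises,
    mirroring the automatic delete at the end of the scenario.
    """
    apply()
    try:
        sleep(wait_s)
        return probe()
    finally:
        delete()
```

Passing `sleep` as a parameter is a design choice made only so that the sketch can be exercised instantly in a test. Run in a loop, the boolean result becomes a continuous check of the 30 second deployment SLO.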
So all in all, what it's going to do is create the thing, wait 30 seconds and then issue the HTTP request to verify that it gets a response on the particular port. Okay, so with that we can go ahead, and this time instead of hello world we're just going to run policy one. But before we do that, just to show you, we're going to run a kubectl get pods watch in the background so that we see all the new pods that come up and all the pods that are being terminated, so that it's actually visible to you too. So again our seal, and we have the policy one YAML. I'm just going to go ahead and run it. So it starts, it reads the scenario. You can see that it created the deployment, and here our kubectl in the background is actually displaying the new pod, which is already running after four seconds, and now we've got about 25 seconds to wait. So if there was some kind of elevator music, that would be good. It's been running for 25 seconds, so we're not that far off. And then, making a call: PowerfulSeal tries to make the call. It got a response. You can see the response generated by Goldpinger. Scenario finished, cleanup started. As you can see, that's the thing that I was describing before, the auto delete. It deletes all the things, the pod gets terminated and PowerfulSeal carries on. So if we list our pods again, we can see that the Goldpinger one is actually terminated already. And if you run this continuously, you'll be able to verify whether your SLO of 30 seconds for a new pod coming up is actually being satisfied or not, depending on what's going on. So I don't want it to be too deep of a dive, but if you want to dive deeper, that's absolutely great.
I would recommend going to the PowerfulSeal documentation, back in the browser here, and at least going through the different examples. Here we have the new pod startup. We've got the pod reschedule, where we actually go ahead and kill a pod, then wait a certain amount of time, and then verify that the pod is running. PowerfulSeal can also integrate with cloud providers. So things like AWS, Azure, OpenStack, Google Cloud, there are drivers for that. So you can say things like a node action like this. You can say, for example, pick all the masters, pick the masters that are up, and take a random sample of size one to just take a single master that is up, and stop that thing. And then we can verify that things continue working the way that we want. And if you want, you can put them back up explicitly like that. Or in the stop action you can also do an auto restart, et cetera, et cetera. And that's all I had for you today. Once again, go grab my book. If you want to reach out, my contact details are available there. If you have any questions, I'm happy to chat. And hopefully I'm just going to leave you with this new tool that you can use. And if you are an SRE, you should be using it. If you're not an SRE and you would like to become one, this is something that's going to help you with that. Thank you very much and see you next time.

Mikolaj Pawlikowski

Software Engineer Project Lead @ Bloomberg LP