Conf42 Chaos Engineering 2020 - Online

- premiere 5PM GMT

Chaos Monkey for Spring Boot


Abstract

Everything you want to know about the useful and popular chaos engineering tool Chaos Monkey for Spring Boot (CMSB) from two of its maintainers.

Featuring:
- How to easily get started with your first Chaos Experiments.
- More exotic applications like dual attacks.
- How to integrate CMSB with automation tools like Chaos Toolkit and Chaos Mesh in order to run tests in your build chain.
- An overview of the history and the changes in the latest version.
- Who should get involved in the project, and how.
- A sneak peek into the next release.

The talk consists primarily of live coding.

Summary

  • Fletcher: We've got a software tool for you called Chaos Monkey for Spring Boot. He shows you how to add a dependency to your existing application. Using Postman, you can then see how a latency assault on a service slows down the endpoint. Fletcher: For real experiments, it is inconvenient to restart the application every time, so attacks can be configured at runtime.
  • You can attack controllers, REST controllers, services, repositories and components. At level two, about every second request gets attacked. Let's see how it affects our endpoint.
  • The experiment uses Chaos Toolkit, which Russ Miles just talked about. It's a kind of assertion, like in testing. First we enable Chaos Monkey, then we attack every request. No matter what's happening, our system should respond to that request with 200.
  • Even when the database is down and we can't get the movie that would be recommended based on the user's viewing preferences, we still want to recommend something. If any exception is thrown, we just return our fallback movie, Titanic. If you want to run the experiment more frequently, you can run it inside a script or build chain.
  • Find our demo case at this GitHub URL: Chaos Monkey for Spring Boot.
  • The project was started by a colleague of ours called Benjamin Wilms. With the help of this plugin, he was able to test his resilience patterns. He has now gone on to start a startup about chaos engineering. If you're interested in chaos engineering, one thing you could do is have a look at our project.
  • A new feature is cron expressions to schedule runtime assaults such as killing the app. The memory assault is a little bit flaky because different JVMs and different versions of Java behave differently. Be warned, it doesn't always work perfectly. Then, about the roadmap.
  • We want to attack those calls going out over the network and introduce latency or problems and things onto those calls. And the other thing is reactive. I've done a bit of work with reactive applications in spring boot. Some of the things we're doing there will work directly with reactive as well.
  • You don't need to install any tools to get started with chaos engineering. Even the dependency step is optional. You can even include it from the command line. If you're not using Java, or for other reasons, you're going to need some other tools.
  • Another, cooler tool is Pumba. Pumba mostly attacks Docker containers. My personal favorite is KubeInvaders. Recently Alibaba published ChaosBlade. You need knowledge of a lot of different tools, but in the end you're trying to create resilient systems.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
So we want to turn these surprises into non-surprises. How are we going to do that? We've got a software tool for you: Chaos Monkey for Spring Boot. I'm John Fletcher. I'm Manuel Wessner. We work for codecentric AG in Germany, and we are maintainers of this piece of software. So we want to kick straight off with a live demo and show you a few things about how you can use it.
This is just a really simple application we wrote. If you want to use a Spring Boot plugin, then you're going to need a Spring Boot application to test it with. So we've written a very simple application, which Manuel will explain. Yeah, it's just a very basic Spring Boot application. We've got one endpoint, which is just movies. The app is about recommending a movie for a user. Think of any streaming service: do you want to watch Inception or some other movie? So it just has a normal REST controller with one endpoint. This endpoint calls a service method, get movie. This in turn calls get recommended movie, which uses a plain list of predefined movies, very basic. For simplicity's sake we just take a random movie off that list, and the movies are provided by a database. Yeah. So in theory it's picking one based on user preferences or something. It's a recommender.
Okay, so now if you could just show us with Postman. Yeah, we want to start the application. Exactly. And we're using Postman; if you're not familiar with it, it just helps you make REST calls and see the JSON output that comes back. So we just make a request to this endpoint. Okay, so the application is functioning normally. Now let's see what we can do. Getting started with Chaos Monkey for Spring Boot: yeah, the simplest way is to add a dependency to your existing application. It doesn't matter whether you use Gradle or Maven, just add it. So we have one dependency, chaos-monkey-spring-boot. You add that.
So now you've got the library. It still needs some config. I prepared a little bit of config, which I'll explain to you. Right on. So those are our application properties. Just having the dependency in there is not going to change anything initially. First of all, you need to activate the Spring profile chaos-monkey. And then, on the second line, you need to enable Chaos Monkey; you have to enable it actively. Then we have a kind of watcher. This defines what you're going to attack; there are different annotations. In this case we attack service, so we attack the class which is annotated with service. We have an assault level. John can explain. Yeah, better than me. The level means how many of the incoming requests will be attacked. One means every single request. If you put five there, it'll be approximately one out of every five requests. Not exactly every fifth, but at random, roughly every fifth will be attacked. And the last line says we attack using latency, so we inject latency into our service. There are different kinds of assaults. So let's see how it goes.
We move to Postman again and do the same request, which took a few milliseconds before. Now it turns out to be 2 seconds. I do another one. Yeah. And you see the requests actually take slightly different times, because by default the latency is a random amount. It'll add between one and 3 seconds. Yeah, we can change that. We can introduce a fixed delay by defining a couple of additional properties. So basically, for range start and range end we just define the same amount of time. So we assume it's going to take 3 seconds to get a movie out of our endpoint. Our application has started, and the first request takes a little bit longer. Yeah, this one will take a bit more than 3 seconds. That's four, but the next one... So yeah, 3 seconds. All right.
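The level and latency range described above can be illustrated with a small, self-contained sketch. This is not the plugin's own code: the class `AssaultSampling` and both method names are hypothetical, and the sketch only assumes what the talk states, namely that level N attacks roughly one in N requests at random and that equal range start and end give a fixed delay.

```java
import java.util.Random;

public class AssaultSampling {

    // Hypothetical helper: attack roughly one in `level` requests, chosen at random.
    // Level 1 attacks every request; level 5 attacks about every fifth request.
    static boolean shouldAttack(int level, Random random) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be >= 1");
        }
        return random.nextInt(level) == 0;
    }

    // Hypothetical helper: pick a delay between startMs and endMs (inclusive).
    // When both bounds are equal, the delay is fixed.
    static long latencyMillis(long startMs, long endMs, Random random) {
        if (endMs < startMs) {
            throw new IllegalArgumentException("endMs must be >= startMs");
        }
        if (startMs == endMs) {
            return startMs;
        }
        return startMs + (long) (random.nextDouble() * (endMs - startMs + 1));
    }

    public static void main(String[] args) {
        Random random = new Random();
        int attacked = 0;
        for (int i = 0; i < 10_000; i++) {
            if (shouldAttack(5, random)) {
                attacked++;
            }
        }
        System.out.println("attacked " + attacked + " of 10000 requests at level 5");
        System.out.println("random delay: " + latencyMillis(1000, 3000, random) + " ms");
        System.out.println("fixed delay: " + latencyMillis(3000, 3000, random) + " ms");
    }
}
```

Running `main` a few times shows the attacked count hovering around 2000 rather than hitting it exactly, which matches the "not exactly every fifth" remark in the talk.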
Yeah, well, as you've obviously seen, we're restarting the application every time. But if we want to do real experiments, it is a bit inconvenient to have to restart your application to start the experiment and stop the experiment. It would be nice if we could just have the application running, and at runtime we could say, okay, now we want to have this assault running, and then we want to turn it off again. What we can do is enable Actuator, which provides some endpoints where you can enable and disable certain things, like Chaos Monkey for example, via a REST endpoint. And we can configure our attacks and our watchers using Postman, for example. Yeah, my REST call. Yeah. So Actuator is just a built-in part of Spring Boot. Lots of different functions of Spring Boot use Actuator, and Chaos Monkey for Spring Boot plugs into that. We just need to add the settings which expose this REST endpoint so it can be configured by any REST client. All right, so let's see. Now we have it enabled, so we can configure this stuff using Postman.
Yeah, so now we're going to configure an exception assault. We saw the latency assault just before, and another assault we can do is exception, which basically means the requests are going to throw an exception. When the request comes in, the application will throw an exception. Yeah. So we have this assaults endpoint, to which we pass a JSON config. Every second request gets attacked. We don't want latency, we just want exceptions to be thrown. So we activate that, and we want to have it on REST controllers. We have defined it here: we just want to have it on REST controllers, no other type of annotation. Maybe we could just take a second to look at what we're seeing there. Those are the different options for what you can configure. So you can attack controllers, REST controllers, services, repositories and components.
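The Postman calls described above can also be made from code. The following is a minimal sketch using the JDK's `java.net.http` API. The base URL `http://localhost:8080` is an assumption, and the endpoint paths and JSON field names reflect my understanding of the Chaos Monkey for Spring Boot actuator API, so check them against the documentation of the version you use. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ChaosMonkeyRequests {

    // Build a JSON POST request against an actuator path (path is an assumption, see lead-in).
    static HttpRequest postJson(String baseUrl, String path, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Enable Chaos Monkey at runtime; this call carries no request body.
    static HttpRequest enable(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/chaosmonkey/enable"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Exception assault: no latency, exceptions on roughly one in `level` requests.
    static String exceptionAssaultJson(int level) {
        return "{\"level\":" + level + ",\"latencyActive\":false,\"exceptionsActive\":true}";
    }

    static HttpRequest exceptionAssault(String baseUrl, int level) {
        return postJson(baseUrl, "/actuator/chaosmonkey/assaults", exceptionAssaultJson(level));
    }

    // Watch only classes annotated as REST controllers.
    static HttpRequest restControllerWatcher(String baseUrl) {
        String json = "{\"controller\":false,\"restController\":true,"
                + "\"service\":false,\"repository\":false,\"component\":false}";
        return postJson(baseUrl, "/actuator/chaosmonkey/watchers", json);
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:8080"; // assumed address of the demo application
        for (HttpRequest request : new HttpRequest[] {
                enable(baseUrl), exceptionAssault(baseUrl, 2), restControllerWatcher(baseUrl)}) {
            System.out.println(request.method() + " " + request.uri());
        }
    }
}
```

The sketch only builds and prints the requests. To actually send one, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` while the application is running with the actuator endpoint exposed.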
Technically speaking, every service, for example, is a component in Spring, but when you enable a particular watcher it will attack the things that actually have that annotation on them. So let's see how it affects our endpoint. As we see, on the first request we got an internal server error, a Chaos Monkey runtime exception. The second request goes well. We have level two, so about every second request gets attacked. Yeah. Cool.
All right, then let's say our application is talking to a database. A hypothetical scenario could be that we can't access the database, right? We don't get a connection to the database when we want to get the user's preferences. Now, because we can attack repository beans, we can run an attack for that. Let's do that. So we now configure only the repository, not the REST controller, because the repository is our access to our database. And let's look how it goes. We expect the same result. Yeah.
Okay, so now we want to put that into an experiment. It's all very fun, we've been clicking around here and it seems to be working, but in reality we don't want to be doing that much clicking. It would be nice if we had a kind of script which could just execute it for us: enable the assault, see how the application responds, then disable it and return to normal functionality. Yeah, we prepared a little experiment. It's using Chaos Toolkit, which Russ Miles just talked about. It's a kind of JSON description of how you want your experiment to run. It has a title. So we have "movie recommendation when database is down", which is quite important. A steady state means that if everything is all right, we expect a movie to be recommended.
So how can we check that? We request a movie on this endpoint; we should get a result in less than a second, and it should return HTTP status code 200. That's that. So we're basically saying that at any point in time, no matter what's happening, our system should be responding to that request with 200 and should be giving us some movie. So it's a kind of assertion, like in testing, and now we are actually testing. What we do is first enable Chaos Monkey. The good thing is that Chaos Toolkit has an integration for Chaos Monkey, so you don't need to actually make the POST request we just saw in Postman. You can just pass the Actuator URL and use the enable Chaos Monkey function. After that you configure the assaults. So what we do now: we attack every request. We want to have exceptions. And to make it a little bit more realistic, we have a connect exception, and the arguments are the parameters of the exception, which we pass here. We pass a string which says connection timed out, to make it more like the exception you would actually get if you can't connect to the database.
And then we have a regular POST request for the watchers, because that feature isn't implemented in the integration; it would probably be a great pull request. We need to be able to update the watchers from Chaos Toolkit. It's a more recent update: previously, after you started the application with Chaos Monkey for Spring Boot, you weren't able to change the watchers. You weren't able to say that now the service watcher is active, or now the REST controller watcher; you had to do it at startup. And that's why I think Chaos Toolkit doesn't support changing that in the experiment. But now, as long as you can change those values at some point through an API call, you should be able to change them during the execution if you want. Really? Okay. I don't know if the underlying API has that in there yet. Yeah, I think that's the point.
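The steady-state check described above (a response within one second with HTTP status 200) can be sketched in plain Java. This is only an illustration of the tolerance logic, not how Chaos Toolkit implements its probes. The class `SteadyStateProbe`, its method names and the URL `http://localhost:8080/movies` are assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class SteadyStateProbe {

    // The tolerance from the talk: status 200 and an answer faster than the timeout.
    static boolean withinTolerance(int statusCode, long elapsedMillis, long timeoutMillis) {
        return statusCode == 200 && elapsedMillis < timeoutMillis;
    }

    // Request the endpoint once and report whether the steady state is met.
    // Timeouts, refused connections and other I/O errors count as "not met".
    static boolean probe(String url, Duration timeout) {
        HttpClient client = HttpClient.newBuilder().connectTimeout(timeout).build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)
                .GET()
                .build();
        long start = System.nanoTime();
        try {
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            return withinTolerance(response.statusCode(), elapsedMillis, timeout.toMillis());
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed URL of the demo endpoint; pass a different one as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/movies";
        System.out.println("steady state met: " + probe(url, Duration.ofSeconds(1)));
    }
}
```

With an exception assault active on every request, the endpoint answers with an internal server error, so `withinTolerance` returns false and the hypothesis counts as deviated.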
I think it's still possible with Chaos Toolkit, but not with a function like change assaults configuration, because changing the watchers at runtime is quite a new feature, I guess. Before, it wasn't possible, and that's probably the reason. So here we pass the same arguments: we just want to have it on the repository, and it's good to go. All right. So I will just make a first request to see how it goes. Now I will run the experiment. Right. So at the moment the application is running with no chaos attacks or assaults configured. Is that right? I think it still has the old one, the old exceptions. Yeah. Okay, so maybe I need to reset it? Let's see. I think this is going to fail. Let's see. It's live coding, so we assume it can fail.
So how you read this: in the beginning it looks at the steady state with the movie-is-recommended probe. It actually tries it. It says it's not within the given tolerance, which means it took longer than 1 second to answer. Then in the end we roll back. I haven't shown you that there is a rollback section, which runs at the end of each run; there you disable Chaos Monkey. Yeah. So this one didn't even start the experiment, because the steady state wasn't valid. Exactly. Now it has disabled Chaos Monkey, so now it will run. Yeah. We disabled Chaos Monkey, so we don't have any attacks. So I suppose the steady state works now. And what do we have? Now it looks good. The steady state is met, so it can actually request a movie in less than a second. Then we enable Chaos Monkey and configure the assaults and the watchers. Then we check again. The steady state is not within the given tolerance, because we now get exceptions. The repository is throwing an exception, because you can't connect to the database. Exactly. So we roll back everything. We disable Chaos Monkey. In the result, Chaos Toolkit says the steady state has deviated, a weakness may have been discovered. And we do have a weakness in this case. Yes.
We cannot recommend movies. We can show that in Postman too. It's not crashing now, because you've disabled chaos. But it doesn't matter. It's not delivering movies. Correct. It's not delivering movies, but we want to deliver movies all the time. Even when the database is down and we can't get the movie which would be recommended to the user based on his viewing preferences, we want to recommend him some kind of movie that everybody likes. So your favorite movie, probably. I know what it would be. Titanic. Titanic.
So we prepared a little bit of fallback logic. It's on our master branch. In our movie service, we now have Titanic as our default fallback movie. And we just adjusted the get movie method. So how do you read that? It's actually using a library called Vavr. It's part of a resilience library, Resilience4j. Basically it's just a simpler way of writing a try-catch in one line. So what it does is call get recommended movie, and if any exception is thrown, we just return our fallback movie, Titanic.
All right, so I think we still have the attacks configured, right? We can show it in Postman first. Oh, let's see. You have to restart the app first. I did restart it, but it's still... You would need to reconfigure the attack. Yeah, true. So let's run the experiment right away. Let's run it again. So now we see the steady state is met. We enable Chaos Monkey. We configure the assaults again and the repository watcher, and it's still going. So we now return Titanic, and that's a movie recommendation of some kind, at least. And we disable Chaos Monkey again. And now we have an experiment completed. Cool.
So if you want to run that more frequently, inside a script or build chain or something, Chaos Toolkit won't give you a zero-or-one kind of response, like a return code. But you can grep for certain text. At least that's what I got to.
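The fallback logic described earlier in this part of the demo can be written without any library as an ordinary try-catch. The speakers use a one-line form from Vavr, presumably something like `Try.of(...).getOrElse(...)`; the sketch below is the plain-Java equivalent, with the hypothetical names `MovieFallback` and `getMovie` and a `Supplier` standing in for the repository-backed recommender.

```java
import java.util.function.Supplier;

public class MovieFallback {

    static final String FALLBACK_MOVIE = "Titanic";

    // Ask the recommender for a movie; if it throws (for example because the
    // database is unreachable), return the fallback movie instead.
    static String getMovie(Supplier<String> recommender) {
        try {
            return recommender.get();
        } catch (RuntimeException e) {
            return FALLBACK_MOVIE;
        }
    }

    public static void main(String[] args) {
        // Normal case: the recommender answers.
        System.out.println(getMovie(() -> "Inception")); // prints "Inception"

        // Failure case: the recommender throws, as it does under the exception assault.
        System.out.println(getMovie(() -> {
            throw new IllegalStateException("Connection timed out");
        })); // prints "Titanic"
    }
}
```

Catching every `RuntimeException` mirrors the "if any exception is thrown" behaviour from the talk; a production service would probably narrow this to the exceptions it expects from the data layer.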
You can grep for text, because it prints certain text if your experiment fails or passes, and you can grep for that. And grep will give you a response code to tell you whether that worked. Unless there's been an update from Russ's side. Not yet. So we talked about it before. Yeah. So that's one way you can do that. Yeah. So is that the end of what we wanted to demo? Yeah, I think so, as far as I know.
Okay, so we've got a few slides that we want to share with you as well, basically about where the Chaos Monkey for Spring Boot project stands. You can find our demo case at this GitHub URL. Is this thing plugged in? Yeah, it's supposed to work. No, not anymore. Well, we can use the... I can... Hold on. Yeah, you can use them. That's fine. I know why: there's this on/off switch on the side. You can put it to on. Okay, here we go.
So, a bit of history. This project was started by a colleague of ours called Benjamin Wilms. A few years ago, he was building fallbacks and things. He was building circuit breakers and other resilience patterns into his applications, but he wanted to test whether that stuff actually works. I've been, by the way, in various places where they're building this stuff, and nobody ever tested whether any of these things actually worked. So maybe you've experienced that too. Well, Benjamin wanted to know whether the stuff he was building actually worked, and he had no easy way to do it. He found out there are a few tools out there, which he had to install, and there were various complications in using them. So he said, well, why don't I just write something for Spring Boot? And that's why he started this project. With the help of this plugin, he was able to test his resilience patterns. He didn't need to install anything on the servers. That's one of the things which is great about using this plugin. And he didn't need any permissions from anyone.
He could just get up and running with that. So it was successful for him. He has actually now gone on to work on a startup. Our company, codecentric, or rather management, said to him, why don't you go and start a startup about chaos engineering? So that's what he did. That's kind of cool, when management says that to you and sends you off. And so he passed the project on to us. He said, can someone else maintain it? So there are a few of us from codecentric AG who are maintaining the project. We're pretty responsive to issues, and basically we're pretty active there. Yeah, that's our little advertisement. Please get involved if you want. If you're interested in chaos engineering and don't know where to get started, one thing you could do is have a look at our project and maybe commit something. As we said, there are a few of us involved and we're pretty active, so any help is appreciated. Yeah. All right.
Talking about recent changes: we made a Chaos Monkey for Spring Boot release, and we got some new features, so we want to introduce a couple. We have two different types of assaults. One is the request assaults we just saw: on every request we do something, like latency or an exception. And we also have runtime assaults, which means, for example, with this config you see here, you can kill the application every hour. So that's the config, and our new feature is cron expressions to schedule attacks. It's like the original Chaos Monkey from Netflix, which just took down your application at some random time. Yeah, it's like that, because it doesn't really make sense to trigger this kind of assault based on incoming requests, but rather based on timing. And for the next one as well, it wouldn't be good on each request: the memory assault, where we consume memory up to a certain amount you have configured.
So, for example, with this config, we fill 5% of memory every second until we reach 95% of memory, and hold it for about 40 seconds. You can test things like out-of-memory exceptions. How does your application behave with that? This assault is a little bit flaky, because different JVMs and different versions of Java act differently with garbage collection and so on. So we had some pain with it. Definitely play around with it, but be warned, it doesn't always work perfectly. It's quite hard.
About the roadmap. Yeah. So what's upcoming? Well, we showed you a few demos. To keep it simple, we kept it all inside one application. But obviously one of the main things you're interested in is applications talking to each other over the network. Often you'll be attacking a system which is a back end for a client, and you want to see how that client reacts. Well, in this case we thought about outgoing HTTP calls. In Spring you use RestTemplate or WebClient; they're the two classes that Spring gives you to make outgoing HTTP calls. So we want to attack those calls going out over the network and introduce latency or other problems into those calls. And the other thing is reactive. I've done a bit of work with reactive applications in Spring Boot, and I think some of the things we're doing there will work directly with reactive as well. But there are a couple of things where we need to rethink how we're doing them, and what makes sense if you're doing chaos engineering with reactive applications. So that's also coming up soon.
Yeah. So that was Chaos Monkey for Spring Boot. We think it's a great way to get started, because if you're using Spring and you want to get started with chaos engineering, you don't need to install any tools, and you don't need any special permissions.
You can just get up and running in a few minutes, like we just showed you. This is a fantastic way to get started with chaos engineering. Yeah, even the dependency step is optional. You can even include it on the command line. If you run java -jar, you can put it on the class loader. So you don't even need to have this Chaos Monkey dependency for your production environment in your POM. Yeah, so those are some of the advantages of Chaos Monkey for Spring Boot.
Obviously, if you're not using Java, or for other reasons, you're going to need some other tools. So we thought we'd just point out a few other things that are out there. Traffic control: if you want to do some low-level Linux kernel stuff, you can use traffic control. Look at that crazy thing. We showed you the nice way to introduce latency with our application, but this is at the infrastructure level. This crazy command here is going to introduce latency on the eth0 Ethernet interface, basically. So, yeah, getting one letter wrong here may cause some problems too.
All right, we have another tool, which is called stress. It's also a command-line tool, included in most Linux distributions. Here again, we're producing a high CPU load. It's quite a pain to get the hang of all these options. So here, for 10 seconds, I think, on two CPUs, we introduce a 128 megabyte load.
Another, cooler tool is Pumba. Pumba mostly attacks Docker containers. It has these commands available. So you can do the same, like emulate network delays; you can pause containers, you can stop and kill them, and even remove them. Yeah, for comparison, there's the latency thing again, but with Pumba, introducing latency onto connections for that particular Docker container. You can even install it on Kubernetes as a DaemonSet, so you have Pumba available on your whole cluster, if that's what you want. It's kind of dangerous sometimes. But it's kind of cool.
If you look out there for chaos engineering tools, there's some crazy stuff. A lot of people have just mucked around a bit and started some hobby project, and it sort of half works or doesn't work. There's better stuff and not-so-good stuff. There are not a lot of tools in general in chaos engineering, but my personal favorite is KubeInvaders. It's like playing Space Invaders, connected to your Kubernetes platform. The aliens represent pods, which are containers. So as you kill the aliens with your spaceship, your Kubernetes pods will get killed. So don't do it in production, I guess. Yeah.
And then, Netflix was the one that kicked this stuff off. And all the big cloud people, like Amazon and everyone, they're all doing chaos engineering. Well, there's another big cloud provider called Alibaba, and these guys are doing chaos engineering too. Recently they published ChaosBlade. We just found out about it recently and haven't investigated it, and I don't even know if you want to use it. It looks pretty good from what it says in the documentation. Yeah, it has different targets. It can even attack C++ applications, it can attack Java applications, it can attack Docker like Pumba does, and it can even attack some cloud stuff. Sadly, we're not so fluent in Chinese, to be honest. The documentation is mostly in English, so we just picked it up. I like this diagram. See these things here? Either it says it works with this, or it says it installs a Bitcoin miner, we're not sure which. Yeah, there we go. We've got someone that speaks Mandarin. Excellent. So there you go. Allegedly. We don't know whether it works at all. We never tried it, but at least from videos and screenshots, it looks quite good, to be honest. Yeah, we don't know. We need to try it soon.
Then you can do chaos as a service. Okay. Russ just mentioned his ChaosIQ stuff.
So we're talking about platforms so that you don't have to muck around with things like tc minus minus network minus root minus I don't know what else, getting one thing wrong and destroying your entire production cluster or whatever it is. Okay. We're talking about platforms which help you schedule experiments via a kind of web interface. Otherwise you need to have knowledge of a lot of different tools, how you call them, how you use them. So sometimes a platform is easier, because in the end you're not trying to become an expert on low-level tools. You're trying to actually create resilient systems. So you can talk to Russ about his platform.
There's Chaos Mesh, which is the startup we just mentioned from our former colleague, which he's just kicked off. I mean, literally, the website went up properly a few weeks ago, but he's already got a couple of people on the platform. There are a couple of screenshots on his website that show it: you can point and click a few things, say that's what I want to do, go see the history of what I've run before, and everything else. You can integrate it into your build chain; there's a hook so that your build chain will trigger stuff on the platform. So that's all out there, too. Pretty cool stuff. Yeah, it's cool. So you don't need to know how the stress CPU command works, or stress memory. You can just click it on the platform and it does it for you. That's it? That's it. Thanks for watching.

John Fletcher

Chaos Monkey Evangelist @ Codecentric AG


Manuel Wessner

Chaos Monkey Evangelist



