Conf42 Chaos Engineering 2021 - Online

Clear the ring for Chaos Engineering at DB Vertrieb! One year of sensations and attractions!


Abstract

Let’s talk about some experiences, learnings and failures. You're asking about our learning environment? Well, there you go: 300+ microservices, 100+ developers, 100+ game days, countless experiments with various outcomes :)

Have you ever been to the circus? As software developers, we have much more in common with circus trainers than we think. On the one hand, we have to tame and maintain a steadily growing zoo of technologies and on the other hand, the undesirable audience expects us to show more and more astonishing features within a short period of time. On top of that, we’re also shooting right into the circus ring with our big CI/CD cannon. That can lead to interesting effects, because we’re shooting while a show is running. In this session you will learn about our top 5 excuses not to do Chaos Engineering and how even the most hostile environment can be used to do Chaos Engineering. Whatever your role is, product owner or developer, we will share our experience in establishing a chaos engineering centered culture at DB Vertrieb for future ringmasters to withstand turbulent conditions in production and ultimately to satisfy your customer’s expectations.

Summary

  • This talk focuses on the implementation and operation of chaos engineering. Deutsche Bahn is the biggest train operating group in Germany. Chaos engineering makes a system capable of withstanding turbulent conditions in production. Why we are rolling out chaos engineering at DB Vertrieb is the last topic we cover.
  • A new feature which connects the new platform and the old, aged monolith went silently down. Customers started complaining online. Concerns regarding our new platform were expressed from all directions. Only knowing who is responsible will not prevent future outages.
  • We have a gap between Dev, Ops and architecture. We wanted to get faster, with fewer errors and a better UX, so we used chaos engineering to mitigate those problems. Without a plan, we felt like chaos engineering had won the worst initiative of the year award.
  • You just need to start doing chaos engineering. It helps you verify the non-functional requirements. It increases cross-team communication to reduce friction. And it increases the overall resilience of the system.
  • The top five excuses not to start with chaos engineering. Invest in a great application performance monitoring solution. Stress the value of chaos engineering for the product. Host some game days.
  • Get stakeholders together in game days to test the system and to learn more about it. Start small and aim for production. Share your findings with other teams. Even big organizations can do chaos engineering if they want to.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Oliver. I'd like to welcome you to our chaos engineering circus show. If you now think, "Whoa, whoa, wait. Circus? What circus? I was thinking this is about chaos engineering", sit back and relax. I assure you, you are in the right talk. It's all about chaos engineering today.

So why are we talking about chaos engineering today? Simple answer: we have been doing chaos engineering for over a year now, and we would like to share our experiences. And there's also a longer story behind it. Our IT systems have grown historically to a point where we had to renew them: support is discontinued, we had a lot of brain drain, some ancient technologies like COBOL, and the systems are not cloud ready. Now we have a brand new, maintainable platform and the ability to roll out stable features and changes within hours instead of weeks or months. But that comes at a cost. We have more teams, new technologies, new processes, and of course the old monolithic world still plays a big part for now. New challenges everywhere.

So in this talk, we focus on the implementation and operation of chaos engineering and how we integrate chaos engineering into our daily work. On our chaos engineering journey we took a lot of detours; we made errors and ran into burdens and hazards. So we also talk about how we got on track again, in the hope that you will not be caught by our mistakes.

But what's this circus thing now? Well, we see parallels between the organization and operation of a circus and how software development might be organized. Please let us know afterwards if you also see some analogies.

Who is Deutsche Bahn and what is DB Vertrieb? Let me give you some context. Deutsche Bahn is the biggest train operating group in Germany, and DB Vertrieb is the interface between the customer and their train ticket. Basically, we sell train tickets, and for that reason we develop software systems to support that. Customers can buy tickets over multiple channels: vending machines, mobile applications, online and of course in person, when there is no pandemic. We sell a couple of million tickets per day. Now you know what we are doing.

Let's look at today's topics. First, how the show evolved: this is about the roots of our historical systems and what kind of systems we have now. Next, we discuss whether chaos engineering is some sort of magic wand which turns every system into a super robust and resilient platform. And we talk about fire drills. Fire drills are practiced everywhere: on ships, by firefighters of course, and in the circus. So why not do fire drills in development? In the context of chaos engineering, we call a fire drill a game day; we explain the concepts behind game days and other chaos engineering practices later. And there's always a reason not to do things, so we talk about the most frequent and absurd excuses, how we react to them, and the most painful experiences. Why we are rolling out chaos engineering at DB Vertrieb is the last topic we cover in this talk.

So, chaos engineering, what is it? Basically, it states that chaos engineering makes my system capable of withstanding turbulent conditions in production. That sounds great. Why doesn't everybody do chaos engineering then? Well, that's a good question. What I know is the value that chaos engineering gives to us. It helps us discover unknown technical debt that we were not aware of. It helps us prove and verify our non-functional requirements.
And it lowers the time to recovery and increases the time between failures. To prove the success of chaos engineering, we advise measuring these metrics from the beginning. Finally, by adopting chaos engineering we gain more resilience and robustness, as well as a better understanding of the process in case of failure.

And now let's start. Welcome to the show. Let's have a look at the actors. Yes, these were the old times, the monolithic age. The show was well known, with no surprises for the audience. This show was our production. We only had one or two tents, painted blue, with a sign that said WebSphere. This was very sturdy and durable over the last decades. But the surrounding parts of the system were much older and programmed in languages which I either hadn't heard of or thought were the names of fairy tales. So with this setup, we could not fulfill the demands of the new audience, because it took us three months or longer to change a show or introduce a new act. The audience today requests 24/7 shows, or even multiple shows in parallel. But on the other side, operations loved the show: it was very stable and well known, no surprises. In contrast, business and developers want change, they want new features delivered to the show, of course at a high cadence. And business has the money.

So this is how our show looks today. Now we have more tents, 24/7 shows, even multiple of them in parallel. And best of all, shows are updated while they are running. To support that, we changed everything in our software delivery process. We moved to a cloud-ready technology stack. We scaled up the number of teams and used some sort of domain-driven design to separate our concerns. Continuous delivery and continuous deployment enabled us to change the show while it is running. Our quality assurance had a mind change, from manual testing procedures which took a lot of time towards fully automated testing. And new processes were set up to support these changes, for example new incident processes and operations procedures.

So now we have a powerful CI/CD cannon which allows us to shoot features directly into production systems. Imagine what could happen if you change the actors' juggling torches to chainsaws within a git commit. So the question is, what could possibly go wrong? Well, things did go wrong. The first feature we released on our new platform was KCI, also known as Komfort Check-in. This nice little feature allows you to check in at your seat in the train so you will not be bothered by the conductor asking for your ticket. Now you have quality time while traveling by train; have a nap. This was the first feature which connected the new platform and the old, aged monolith. Guess what happened on a Friday at 5:00 p.m. Yeah, the new feature went silently down. Customers started complaining online; the social media team's weekend was moving far away. So what happened here? No time for analysis: restart the services and cross your fingers. Yes, we are online again. Phew. The services started working, but the error had side effects and was not only affecting the new feature KCI. We encountered a cascading error, and customers could no longer download the tickets they had already bought. This was a serious problem and a bad thing, because on a Friday at 5:00 p.m. people usually want to get home by train. As a result: more complaints, more headlines about the issue on big newspaper websites. Well, at least this time we did not make it to the primetime news.
The only reason for that was that the incident managers did a great job coordinating the fixes for the issue. But uncomfortable questions were asked. Concerns regarding our new platform were expressed from all directions. And the most important question was asked; maybe you can guess what the most important question is. I bet you know it. Yeah: who is responsible for the outage? Who is the guilty guy? A big post-mortem revealed it very efficiently. Well, job done. But maybe we asked the wrong question, because only knowing who is responsible will not prevent future outages. It's more about asking why we failed instead of who is responsible. Yeah, this was a process we had to learn. As a quick fix, we added the adjective "blameless" in front of our postmortems. So, continuous improvement had already started. Remember, we had changed everything: technology, culture, responsibilities and of course the processes. And as a side effect, we have this monolithic system which is still running and which is operated completely differently. So maybe something had to change; that's what we all agreed on. But where should we start? What has to change? Maik will tell you about some of the ideas we came up with.

Thanks, Ollie. Yeah. Well, so we put our heads together, and if you're in a big company, you have many experts and they all know something. Architecture, for example, was asking for more governance, the developers were asking for more coverage, Ops was asking for better tracking, and quality assurance for better documentation. And, well, overall it was kind of chaotic. But at some point we agreed that we need more tests. These tests were called technical approval or technical acceptance tests, so we introduced them. And, well, you may be asking: do we need more of those tests? Well, we didn't know better, so we introduced them because we thought they would fit our goals, which were basically getting faster, with fewer errors and a better UX. So basically making the customers happier.

To explain a little bit more about those tests, here's a slide that outlines it pretty well. What you have to know is that in our organization we have a gap between Dev and Ops; in fact, they are different departments. I will come back to the two tests outlined on the slide in our first game day story later on. But here you can basically see how that worked: the developers were doing all the performance testing, and at some point they threw the artifact over to Ops, who were doing a rolling update and testing whether that works or not. And we thought, well, that will help. The idea was simple: right before going into production, these tests will make sure the deployed services work, and Ops will obviously take care of them, because we thought they have a high interest that stuff works in production. Well, let me say this much: that didn't really work out either.

So this time we thought, well, something has to change, and I mean this time for real. So we did what every good company does and asked ourselves: what would Netflix do? Right? That sounds mad, I know, but let's look at it. We do have fewer microservices, point taken. But we still have microservices, so the complexity is probably not too different; you have to solve similar problems. Why Netflix? Well, because Netflix noticed at some point that, compared to on-prem, different things are important with microservices in the cloud.
So Netflix developed practices and methods, which you all know as chaos engineering, to mitigate those problems. And that led to a higher stability of the whole system. I know we're not Netflix, but let's take this graph here, for example. This is a part of our system, and it only shows the teams, so there are many more microservices below that. As you can see, it's a bit older now, it's like one year old. It's not a Death Star, right? But the problems look the same: we have a complex technical system. And the funny thing here is, it's a complex technical system, but there is also a complex social system behind it, with many teams, different responsibilities, slightly different approaches, a slightly different culture in every team, and different levels of experience. And even though we decided to have guidelines for all the teams, like "build in robustness and resilience", how would you even test that?

So we thought, well, chaos monkey to the rescue, right? Standing on the shoulders of giants, let's use the chaos monkey for Kubernetes. And we even got operations to agree on that. So we put the monkeys to good use and fired. Kaboom! Without knowing much, we deployed the monkeys, and I tell you, they did a great job: they killed pods and containers in one of our gazillion environments. But to be honest, we didn't really know what we were doing. A fool with a tool is still a fool, and that's how we felt. Yeah, it ended in tears once again, because the microservices were not prepared and we started way too big. And guess what was the worst part? Yeah, the worst was that nobody really noticed. We didn't have the observability to even detect the errors. So something went wrong, and the teams were upset that something was not working, but we had no observability into what was going wrong. We were not prepared. We didn't do our homework, we didn't communicate enough, we didn't have the needed observability, and without much of a plan we did really dangerous things. We felt like chaos engineering had won the worst-initiative-of-the-year award. There was only one last chance: recover, do our homework and start over at the drawing board.

So let's have another look at what chaos engineering is really about. Here we didn't think, plan or experiment, we just deployed. We violated every principle of chaos engineering. So we started over again with chaos engineering. Well, that sounds great. So what do we really need to do that? All these monkeys and tools need new cages, and of course we need someone who can train and keep track of these tools, so we need specialized trainers. Get a bunch of consultants to develop Excel-based meal and training plans. Yes, get all departments at the round table and design a process that, printed out, hardly fits on the walls of a meeting room. This is really essential. And of course, what costs nothing is worth nothing, so open the treasure chest and bring a big... Stop, stop, Ollie. I don't think we will need all of that. I mean, let's focus on what we really need. I think you just need to start doing chaos engineering. And what do I mean by that? Basically, first of all, you pick an aspect of the system you want to experiment on, right? Then you plan your experiment and prepare your environment: traffic, monitoring, access rights, all that stuff. Then you measure the steady state, conduct the experiment while monitoring your system (essentially, break something) and document your findings (a minimal sketch of such a loop follows below).
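To make those steps a bit more tangible, here is a minimal sketch of such an experiment loop in Python. It is illustrative only: the health endpoint, the sample counts and the manual injection step are assumptions made for this example, not the services or tooling used at DB Vertrieb.

```python
"""Minimal game-day experiment loop (illustrative sketch only)."""
import time
import requests

SERVICE_URL = "http://localhost:8080/health"   # hypothetical endpoint under test
SAMPLES = 30                                   # probes per measurement

def error_rate(url: str, samples: int) -> float:
    """Probe the service and return the fraction of failed requests."""
    errors = 0
    for _ in range(samples):
        try:
            if requests.get(url, timeout=2).status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
        time.sleep(1)
    return errors / samples

# 1. Measure the steady state before touching anything.
baseline = error_rate(SERVICE_URL, SAMPLES)
print(f"steady state error rate: {baseline:.0%}")

# 2. Conduct the experiment: inject a failure here, e.g. kill a pod,
#    add latency, block a dependency (tool-specific, left out here).
input("inject the failure now, then press <Enter> to measure again...")

# 3. Observe: measure again while or after the failure is active.
observed = error_rate(SERVICE_URL, SAMPLES)
print(f"error rate during experiment: {observed:.0%}")

# 4. Document the finding so other teams can learn from it.
verdict = "hypothesis held" if observed <= baseline + 0.01 else "weakness found"
print(f"result: {verdict} (baseline {baseline:.0%}, observed {observed:.0%})")
```

The point is not the script itself but the order of the steps: know your steady state before you break anything, and write down what you observed.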
And as you can see, I didn't talk about budgets, new cages or trainers. It's basically about doing it. So what does it look like? Well, this is an actual picture from one of our earlier game days. As you can see, there are many different departments in one room, which in our opinion is already one of the biggest values of a game day: you have Ops there, architecture, you can see Ollie and me, there is business operations and developers. And the person who took the photo was, I think, also a developer.

So let's come back to our little first game day story. Our first experiment was: let's do a rolling update, but with load. The funny thing was, we had a development team there and Ops people, and several people there had a lot of experience, like 20 or more years in development or operations. So we all were pretty confident that nothing would go wrong. We asked them beforehand whether they thought we would find any problems, and they all agreed that we would not find anything. So we designed the experiment, a rolling update under load; we basically just combined the slides we showed you before, the performance testing and the rolling update. And guess what happened? Well, it ended almost in tears. The error rate was about 100% for 15 minutes. To be honest, it was quite depressing for all of us. We expected some errors, but not a complete failure with broken services. And even though we used an orchestrator, the big Kubernetes thing you probably all know, we didn't expect that the service would go down for 15 minutes. But that was also good; it was a wake-up call for everyone. On a side note, after that we never had any problems getting buy-in from management, because we just showed them that there is something wrong and that there are hidden problems. Well, what happened? You guessed it: because we have this gap between Dev and Ops, nobody ever tested updating the service under load. They just deployed it without load on a test environment and did a rolling update. In retrospect, well, we could have known, I guess, but we didn't. So chaos engineering revealed that, and we learned something, and this is a big value already. (A rough sketch of how such a rolling-update experiment can be driven appears below.)

So what's the value of chaos engineering? Well, it increases cross-team communication to reduce friction: we use a game day to gather many teams in order to validate functional and non-functional requirements, and we ask ourselves, what would the customer see if we break this or that? It helps you verify the non-functional requirements, basically, and even if they're not really fixed, you can still guess the most important ones. Next thing on the list: you will find the unknown technical debt, the technical debt you don't know about right now. The technical debt you know about is the technical debt you have in your Jira or your issue tracker; the real value is to find the technical debt that your customers will find, when they click on something and your whole system breaks (spoiler: usually on Saturdays). It reduces the time to recovery, because if you practice failure, you will know what to do and you're not stressed. And by that I mean that if you're able to mitigate issues, you will increase the time between failures, which for us increases the overall resilience of the system, which is good.
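To make that first game day story a bit more concrete, here is a rough sketch of how a rolling update under load could be driven and observed from a script. Everything specific in it is an assumption for illustration: the endpoint URL, the deployment name checkout and the namespace shop are made up, and a real game day would use a proper load testing tool and your APM instead of a handful of threads.

```python
"""Rough sketch of a "rolling update under load" experiment (assumed names)."""
import subprocess
import threading
import time
import requests

URL = "http://localhost:8080/api/ping"   # hypothetical endpoint under test
DEPLOYMENT = "deployment/checkout"       # hypothetical Kubernetes deployment
NAMESPACE = "shop"                       # hypothetical namespace
DURATION_S = 300                         # keep the load running for five minutes

results = {"ok": 0, "failed": 0}
lock = threading.Lock()

def load_worker(stop: threading.Event) -> None:
    """Send constant background load and count failed requests."""
    while not stop.is_set():
        try:
            ok = requests.get(URL, timeout=2).status_code < 500
        except requests.RequestException:
            ok = False
        with lock:
            results["ok" if ok else "failed"] += 1
        time.sleep(0.1)

stop = threading.Event()
workers = [threading.Thread(target=load_worker, args=(stop,)) for _ in range(5)]
for w in workers:
    w.start()

# Trigger the rolling update while the load is running.
subprocess.run(
    ["kubectl", "rollout", "restart", DEPLOYMENT, "-n", NAMESPACE], check=True
)

time.sleep(DURATION_S)
stop.set()
for w in workers:
    w.join()

total = max(results["ok"] + results["failed"], 1)  # avoid division by zero
print(f"error rate during rolling update: {results['failed'] / total:.1%}")
```

Driving the load from the same script that triggers the rollout keeps the experiment cheap to repeat at the next game day.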
Let's talk a bit about tooling. What tools could you use? Well, basically, you should invest in a great application performance monitoring solution. You could use Instana, but any other solution in that space is good too. In the beginning we mostly used Pumba and Chaos Monkey for Spring Boot. At some point it makes sense to look at more sophisticated tooling and automation; right now we are evaluating Steadybit, but you could also go for Gremlin or any other open source platform which will help you. But this is only a small part of the tooling you could use. There is a good list on GitHub, awesome-chaos-engineering; go there and check out the links. There's also information on how to start with chaos engineering and other stuff.

So, well, we have all the protective gear in place, everybody's informed and everybody's there, as you have seen. But we still had problems getting started. So let's look at the top five excuses not to start with chaos engineering.

Number five: "Go-live is within a few days, you are messing up the test plan. So yes, we are having crunch time, please do chaos engineering in the next release." To cope with that excuse, maybe you have to integrate chaos engineering into the test plan, or even better, shift left and integrate chaos engineering into your daily development process. In any case, keep talking and make QA a friend; they are on the same side, everybody wants a stable system. Convince them that chaos engineering experiments can become real tests in the future. This is a win-win situation; just keep on talking to QA.

Number four: "We don't have enough access rights." Often this actually means that there is a knowledge gap. People often do not know where to find the information, and this indicates that they are not aware of how to access the systems. Maybe you can ask questions like: have you tried accessing the development database? And often the developers say, oh, well, I can if you show me. And did you even try it? Of course you have to try it. Often these problems disappear when talking and discussing with Ops.

Number three: "We can't think of any good experiments that are useful to us." Whoa. Yeah, this is an alarm: your effort in implementing and operating chaos engineering was not enough. You have to find out the reasons. Maybe the team just doesn't know enough about chaos engineering, or is indifferent to it. So give them more coaching and explain the goals of adopting chaos engineering until it's also on their minds. Try to push the experiments to reflect use cases instead of thinking only of technical aspects like DNS or rolling updates. Give them other formats for game days; maybe red teaming is something they like more. And supply standard experiments so they can derive their own experiments from them.

Number two: "There's no time, because right now we are only doing features. We take care of technical things later." This is a nice one, because the use cases and the technical and non-functional requirements are inherently connected; dividing them is not a good idea. Sometimes the developers will be forced to make this distinction. So insist that everybody knows what will happen when you accumulate too much technical debt: it gets postponed and will eventually be ignored until it hits you in production. This approach is very, very risky. There are many reasons for that. What we often see is that product owners are more like backlog managers than a person who feels responsible for the whole product lifecycle.
Number one: "We would love to do chaos engineering, but the product owner is not prioritizing it." Remember the last argument about the job of a product owner? Well, the product owner might have opposing aims, and maybe he or she is measured by the cadence or count of delivered features. Sadly, product stability suffers from this approach, and often it is pushed by management, but that's another story. To mitigate his or her possible fear and uncertainty, stress the value of chaos engineering for the product. For example, show them the value by doing at least one game day with a small scope, and in addition recruit operations and/or security as a driving force.

Regardless of the excuses, we did manage to host some game days and do some chaos engineering: more than 80 game days done to date, and more importantly, about 120+ production-relevant weaknesses found, which will not hit us in production anymore. And we gained enough trust to host the first game day in our production environment. Yeah, this is great. What we are very proud of is the fact that multiple teams are hosting their own game days now; that's a great success. We also started other game day formats: red teaming, wheel of misfortune, playing through the incident process. What was very useful to us is having a place where all the documentation about our experiments is collected and accessible to everyone. This helps teams get ideas for new experiments and gives them the possibility to share their findings, because the problems are often the same: health checks, DNS, rolling updates, all this stuff. And build up a community: promote the finding of the month, have a place where teams can talk about chaos engineering and share their experiences, write some blog articles, record podcasts. All this will help to make chaos engineering more visible and successful.

Let's get to our top five learnings. First of all, people, processes and practices are key factors. You have to talk, talk, talk: basically, communicate with stakeholders and get them together in game days to test the system and to learn more about it. Second, don't just deploy the chaos monkey; tooling is the least of your problems. We started with a pretty simple game day and continued from there. More important is to share your findings with others so they can learn from them too. It took us half a year until we even thought about automating some of our experiments, and that was after around 50 game days. Remember, premature optimization is the root of all evil. Third, start small and aim for production. We started with the technical system and the social system came later, but it depends. Most important is to start where nobody gets hurt; usually that's in development or in a testing or staging environment. But don't forget you're aiming for production, otherwise chaos engineering doesn't really make sense at some point. Fourth, don't be afraid to bug people, and don't get discouraged. What we did: we organized the first game days, we moderated them, took care of the documentation and invited the people, basically to make it as easy as possible for the participants, and we recruited allies like operations and QA (quality assurance). What that means is, well, take the plunge if you have to. There are jobs that nobody wants to do, but if you want chaos engineering to be a part of daily life in your big organization, then you maybe have to take the plunge and stay positive about it.
And fifth, most important for us at least: don't get eaten by the process lion. At some point, other departments want a piece of the cake. As soon as chaos engineering became successful, quality assurance wanted to make specific experiments mandatory, so they asked us whether we could always run them in game days, like ticking off a checklist. And we thought this is bad, because game day experiments should be self-driven by the participating teams, because they know best what they need. So this is how we came to hate checklists; there are better alternative solutions, because one-size-fits-all often doesn't make sense. Well, one last thing: there are no excuses. Even big organizations can do chaos engineering if they want to. It's just a matter of time, especially if you're going into the cloud. So what do we want from you? Last but not least: invite your team to your first game day, share your findings with other teams, go out and do chaos engineering. Thank you.

Maik Figura

IT Consultant @ codecentric AG


Oliver Kracht

Implementation Lead @ Deutsche Bahn


