Conf42 Site Reliability Engineering 2020 - Online

- premiere 5PM GMT

Applied Security: Crafting Secure and Resilient Distributed Systems using Chaos Engineering

Abstract

Join Jamie Dicken and Aaron Rinehart to learn how they implemented Security Chaos Engineering as a practice at their organizations to proactively discover system weaknesses before they could be taken advantage of by malicious adversaries.

In this session Jamie and Aaron will introduce a new concept known as Security Chaos Engineering and share their experiences in applying Security Chaos Engineering to create highly secure, performant, and resilient distributed systems.

Summary

  • Aaron Rinehart is the CTO and founder of a company called Verica. Jamie Dicken is the manager of security engineering at Cardinal Health. They will unite SRE and security chaos engineering. All of the disciplines are really related, and one can't be successful without the others.
  • systems engineering is messy. Throughout a system's life, complexity has a way of sneaking in. As time goes on, those problems just compound as new change is introduced into the system. Through failure, we discover that we have far more dependencies and complexities than we ever could have imagined.
  • We used to pull up an infrastructure diagram of one of our systems. But maybe it doesn't even represent the system that was actually deployed to production. Even if you do have documentation, you need to wonder about how it gets updated. There are going to be plenty of problems that you don't even anticipate.
  • There's a vast difference between what we believe our systems are and what they are in reality. The only way to really understand that complex system is to interact with it. And that's really the heart of chaos engineering. It's really about a new approach to learning.
  • There is a difference between continuous fixing and continuous learning. What we're trying to do is continuously learn about our systems. With chaos engineering, we can learn about the system in a controlled way and not incur pain. It's also kind of about a change in mindset.
  • Security chaos experiments are really focused on accidents and mistakes that engineers make. What we're trying to do is proactively clean up the low hanging fruit so those attacks can't become successful. It is about continuous security verification.
  • With chaos engineering, we're not waiting for something to happen. We're proactively introducing a signal into the system. Every chaos engineering experiment has compliance value. We'll talk about incident response next.
  • Cardinal Health began its security chaos engineering journey in the summer of last year. The name comes from bringing continuous verification and validation to Cardinal Health. The process includes five steps to identify technical security gaps.
  • An O'Reilly report about the discipline is coming later this year. Security chaos engineering is about testing your own systems. You can practice this on a system that's either a giant monolith or radically distributed with tons of microservices. Once you understand security chaos engineering, it's really possible to start small.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, and welcome to Applied Security at the Conf42 Site Reliability Engineering conference. My name is Aaron Rinehart. I am the CTO and founder of a company called Verica. And I am joined here with my co-speaker, Jamie Dicken. Do you want to introduce yourself? Sure. I'm Jamie Dicken. I'm the manager of security engineering at Cardinal Health, which is a Fortune 20 healthcare company in the United States. A little bit more about me and my background: I am the former chief security architect at UnitedHealth Group. At the company, I led the DevOps transformation, as well as pioneered the application of chaos engineering to cybersecurity, and wrote the first tool in the space, called ChaoSlingr. Jamie and I will talk more about that and the work that Cardinal Health is also doing in the space. Jamie and I are also O'Reilly authors on the topic; our O'Reilly book on security chaos engineering comes out this fall. And I also have a background at NASA in safety and reliability engineering. That's my background. Oh, and you notice kind of an 80s theme to this presentation. So that was what I looked like in the 80s. That's awesome. I do not have nearly as awesome of a profile picture as that. But at Cardinal Health, I was brought into cybersecurity a little over a year ago, specifically to lead a team focused on security chaos engineering. And we called ourselves applied security. Prior to that, I had spent ten years in software development, both as an engineer myself and then as a manager leading multiple teams. And with that, what I like to say now is that I spent the first ten years of my career building new features that added value to healthcare, and now, in the next part of my career, I'm really focused on securing their legacy. So I've had the awesome opportunity to be a contributing author on the O'Reilly report that Aaron mentioned. And what I'm really passionate about is using my experience both in software development and information security to really unite multiple disciplines. So today you'll hear me talk about SRE and chaos engineering. Other days, you may hear me talk about championing application security among software development teams. But in the end, all of the disciplines are really related, and one can't be successful without the others. So I'm crazy excited to be here today. So next: Aaron and I have a really fun talk lined up today. We're going to first talk about why a new approach to security, chaos engineering, and continuous learning is really vital to us. We're going to discuss what the core of the security chaos engineering discipline is, and talk about the use cases that you can target. I'm going to discuss my real world experience leading a team at Cardinal Health as we began our security chaos engineering journey, and how you can do that, too. And we're going to tie a bow on everything at the end and unite both SRE and security chaos engineering. So, next slide, Aaron. Whatever brought everybody here to Conf42 today, whether that's to learn what SRE is from real world practitioners, or to gain exposure to world class thought leadership, or really to see some technical demos and expand your team's horizons, I think it's safe to say that we can all align on one basic truth, and, next, that is that systems engineering is messy. So, on the next slide, I love this picture because I think that it's one that we can all relate to.
It's one that, if we've ever been in some of our server rooms, we recognize: our systems, no matter how beautifully they were designed at first, can quickly sour. So, on the next slide, you see that we sometimes begin with these beautifully simplistic representations, either of what we want to build or what it is that we think that we did build. But as I said, it doesn't take long for complexity to sneak in. So on the next slide, you can see that it's not always our fault. Throughout a system's life, complexity has a way of sneaking in, whether that's new business requirements that come in and force tight deadlines that have us just throwing on and bolting on new microservices and turning things into spaghetti code. Or maybe there is this huge focus on new customer acquisition that forces us to really rapidly resize our environment faster than we were planning on, without all of the dedication and forethought that should go into it. Or maybe even there are new security requirements that come into play, and all of a sudden we find ourselves adding in layers of infrastructure. And then on the next slide, you see that as time goes on, those problems just compound as new change is introduced into the system and as our processes change. And so next, we see that the reality is that our systems are so much more complex than we remember them, and through the outages, and the experience of using the system, or not being able to use it in the case of a sev one, we start to realize they're not simple at all. It's actually through failure that we discover that we have far more dependencies and complexities than we ever could have imagined. So, on the next slide, we recognize that our approach in the past has failed us, and that's because it's very much old school. What we used to do when we wanted to build with security in mind is that we would pull up an infrastructure diagram of one of our systems, and we would start to threat model and point out single points of failure and identify opportunities for latency or for somebody to get in there. But we all know the problems with this approach. First, we think about how and when that documentation is actually created. Maybe that documentation is created at the beginning of the project, in the planning phase, before any lines of code are written, and before any infrastructure is deployed to production. If that's the case, maybe our documentation really matches the ideal system at that point in time, but maybe it doesn't even represent the system that was actually deployed to production. Or if that documentation was created after the fact, well, it's still dependent on the memory of the architect or engineer that created it. And you're banking that that architect or engineer isn't misremembering based on their experience in creating that system. And even if you do have documentation, you need to wonder about how it gets updated and when, if that even happens. If we have our documentation that is fully updated, do we really know? Do we have all of our integration points mapped? Do we know the downstream effects? Do we know if somebody's actually using our system in an automated manner, if they're just leveraging our APIs behind the scenes? We have no idea. Right? So before you start hating on me, let's be clear. I am not saying don't create documentation or don't update documentation.
I mean, for the love of anybody who works on your systems, if you have any respect for your coworkers or your customers at all, please create that documentation. But what I am saying is that if your process of evaluating a system for security vulnerabilities is dependent on an outdated or just flat out incorrect representation of the system, that misremembering that we talked about, then our evaluation is going to fail, and there are going to be plenty of problems that you don't even anticipate. And quite honestly, if I look at it, the number of industry-wide publicly disclosed breaches or DDoS attacks proves that we're really failing to keep up in our understanding of our systems. So, on the next slide, I really like this quote. As complexity scientist Dave Snowden... dang it, I messed up. I forgot that slide entirely. Oh, yeah, here it is. Here it is. We got it. That's my bad. Okay, no, sorry. It was on me. So I really like this quote. As complexity scientist Dave Snowden says, it's actually impossible for a human to model or document a complex system. And if you're like me, you hear, challenge accepted. But what he says really makes sense, and that is that the only way to really understand a complex system is to interact with it. So then our answer is not to rely on our recollection of our systems or our memory of the system, but rather to learn from them and use empirical data. And that's really the heart of chaos engineering: implementing experiments that clearly show us what our real system landscape looks like so we can tease out those false assumptions. And, Aaron, I'm going to hand it over to you. Right? Yeah. And it's really about a new approach to learning. That's really what we're talking about here. There's a vast difference between what we believe our systems are and what they are in reality. And Jamie makes a good point about the number of breaches that are growing. I think there's a quote that says we should be surprised that our systems work as well as they do, given how little information we know about them already. The same is true of security. We know very little about how our systems actually function in reality, and that's kind of what's being exposed in breaches and incidents. I mean, if that wasn't the case, then we would have fixed it, and they never would have become breaches or incidents. So, moving on here. I like to talk about continuous learning in these terms: it's important to understand that there's a difference between continuous fixing and continuous learning. What we're trying to do is continuously learn about how our systems actually work, so we can make better decisions and act more proactively to reduce customer pain, because that's really what we're trying to come after here. In the end, we're trying to build more performant, reliable, and safe systems, and the customers are the ones that typically feel the impact of our misgivings, like I said. So there is a difference between continuous fixing and continuous learning. A lot of what we're doing now is continuous fixing. We're fixing what we believe the problem to be, instead of continuously trying to learn deeper insights and develop a better understanding about how our systems actually function. So I love this. I'm going to share a brief story. My co-founder at Verica is Casey Rosenthal.
He's the creator of chaos engineering at Netflix, and I was at lunch with him one time and I witnessed this conversation with a potential customer at a large payment processing company. They were talking about how they have this legacy system with the core applications for the company, and it processed all of the organization's payments, so they trusted it. The engineers were competent. They rarely had an outage, and they wanted to move it all over to Kubernetes, and they needed help with that. And I started thinking to myself: was that legacy system always stable? Was there a point in time where it was incurring outages and it wasn't as widely understood? We kind of learn about how the system really functions over time through a series of unforeseen, unplanned, or surprise events. Chaos engineering is a methodology where we can learn about the system in a controlled way and not incur pain. Because when you learned about that legacy system and it became stable through those unforeseen events, there was pain as part of that process. Your service was not available. You might have made public headlines. We don't have to build things that way and run our systems that way. We can be proactive and use techniques like chaos engineering to develop a better understanding of how our systems really work. It's also kind of about a change in mindset. It is kind of a different way of thinking, in that a lot of people say that chaos is a very provocative term. Right? Casey and I like to sort of reframe it, and so does Jamie; she uses this terminology too. We try to reframe chaos engineering more in terms of continuous verification: as we build things, as we move to the cloud, as we build new applications, as we move legacy applications to Kubernetes, as part of the process of building and operating them, we're continuously verifying that the system still does what it's supposed to do. We do that in the form of a hypothesis, in asking the system a question. But it's a different way of thinking. And just like DevOps was a different way of thinking, and cloud was a different way of thinking, chaos engineering is right along with it, but it is a change of mindset. Also, an important caveat to chaos engineering is that when incidents, outages, and breaches occur, people operate differently in those conditions, but people also, in reverse, operate differently when they expect things to fail. So what do I mean by that? What I mean is that we don't do chaos engineering when people are freaking out during an active incident. During an active incident, people are worrying about the name, blame, shame game, right? Was that my system? Is this the breach? Did I cause this? And people think their jobs are on the line. People are freaking out. This is not a good way to learn. So we don't do chaos engineering here, okay? We do it when, essentially, there is nothing to worry about. Right. We do it when it's all rainbows and sunshine, right? We do it proactively when nobody's freaking out. It's a better learning environment. People's eyes are more wide open, and their cognitive loads are not bogged down with all that worrying about the breach and affecting the company and their jobs. Remember: we do chaos engineering here, not here. A little lag. There we go. All right. Chaos engineering is about establishing order. We do that by instrumenting the chaos inherent within the system.
And what we're trying to do is proactively introduce turbulent conditions into the system to try to determine the conditions by which it will fail, before it fails. So, in security terms, we're trying to introduce the conditions that we expect our security to operate upon, and we validate them. And most often, with almost every chaos experiment, security or availability, that I have either run myself or heard stories about from other people, I don't think I've ever heard anyone say it all worked the first time, because it rarely does. We're almost always wrong about how our systems really work. So we like to put chaos instrumentation into two loose domains: one is testing, the other is experimentation. Testing is a verification or validation of something we know to be true or false. It's a binary sort of thing: it is or it isn't, right? We know what we're looking for before we go looking for it. Whereas with experimentation, we're trying to derive new information that we previously did not know about the system beforehand: new insights, new information. An example of testing would be that we go in looking for attack patterns or signatures or things like CVEs, whereas with experimentation, we're asking the computer a specific question: when X occurs, I know I built Y to be the response. We actually do that, and we try to find out whether or not that is actually true. So chaos engineering in general is really about establishing order. It's a very practical technique, if you really think about it. We're just trying to validate what we already know to be true. You never do a chaos experiment you know is going to fail; you'll never learn anything from it. The point is to learn proactively. So, what is security chaos engineering? Newsflash: security chaos engineering is exactly the same thing, applied to security. It's just a little bit different in terms of the use cases. The use cases we're coming after for security, and I have some slides on that, are really focused around incident response and control validation. That's a big one. I believe Jamie's going to talk a bit about Cardinal Health and control validation. Another is increasing observability: you'd be surprised how poor your logs really are, and during an incident is not a good time to be evaluating the quality of log data. Right. But if we can inject these signals and incidents proactively, we can kind of find out, hey, did the technology in question actually give good log data to make sense of what the heck happened? Do we have the right log data we thought we were supposed to have? Instead of trying to scramble things together to figure out what happened while you're being attacked, which is not conducive to a good way of operating. So I love this quote. I've always said, in general, that engineers don't believe in two things: they don't believe in hope or luck. Right? It either works or it doesn't. And hoping your security works, hoping your systems work the way you think they do, is not an effective strategy, especially in engineering. It worked in Star Wars, but it's not going to work here. This also goes hand in hand with the change of mindset: security chaos experiments are really focused on accidents and mistakes that engineers make. The size, scale, speed, and complexity of modern software make it hard.
We are building the equivalent of complex adaptive systems in terms of software, and it's impossible for a human to mentally keep track of everything. It's also very easy to make mistakes when you have poor visibility and your mind can't mentally model that behavior. So do we sit around and blame people for making mistakes and accidents, or do we proactively try to discover them and fix them before an adversary has a chance to take advantage of them? And that's really all we're doing: we're going after the low hanging fruit by which most malware and most attacks are successful to begin with. If you look at most malicious code out there, a majority of it is junk. It's really junk code. It requires some kind of low hanging fruit: an open port, a permissive account, a deprecated version or dependency, that kind of flaw. What we're trying to do is proactively clean up the low hanging fruit so those attacks can't become successful. That's kind of the focus. Is this you? No, that's this one. What we're talking about here is that, as Jamie said earlier, we often do misremember what our systems really are. That's what I talked about earlier with size, scale, speed, and complexity: it's easy to misremember how our systems worked versus how they work now. We love to remember them in terms of that beautiful 3D Amazon diagram that was produced two years ago, but our system has evolved way beyond that state and is slowly drifting into a state of unknown. So it is about continuous security verification, and that's really the best way to frame up chaos engineering in general: it's about continuously verifying that things work the way they're supposed to. That's what we're trying to achieve with it. There are a lot of different ways you can use it, but predominantly what we're trying to achieve is combating the inherent uncertainty of breaches, outages, and incidents, and the state of our systems, by building confidence slowly through instrumentation. So here are some of the use cases. There are many more in the O'Reilly book; there are several accounts of people with different use cases, and we go in depth into all of that. But some of the main use cases here that I've used before, and I know Jamie's used it for a variety of these as well: incident response, which we'll talk about next; it's a great way to manage and measure incident response. Security control validation, which is where I began my journey and, I think, where Jamie's team began theirs. It's a great way, like I said earlier, to increase the understanding of how observable your events are within the system. And every chaos engineering experiment has compliance value. So, incident response. The problem with incident response is the fact that it's always a response. And no matter how much money you spend, how many fancy tools you have, you still don't know very much about an event when it happens. Right? You don't know when it's going to happen, you don't know where it's going to happen, you don't know why it's happening or who's doing it, or how they're going to get in. You don't really know until you're being tested, until somebody's actually trying to expose that. Well, that's not a good time to start understanding whether or not those things are effective. With chaos engineering, we're not waiting for something to happen. We're proactively introducing a signal into the system and saying, hey, do we have enough people on call in terms of incident response? Were the runbooks correct? Right.
Did the technologies give us the right contextual telemetry we needed to make a decision or to conduct a triage process? When you're constantly waiting for an event to happen, it's so subjective. It's hard to even compare two events on a similar system in the same context, and you're assuming that when you caught the event is when it began. With a chaos experiment, we know when we began that failure injection. It's really a great way to manage incident response. So what we're essentially doing, instead of making the post mortem a three hour exercise that I'm sure everyone really does after an incident, where people get together in a room and talk about what they think happened after already knowing what happened. There are several biases in the whole process of doing a post mortem after the fact, because you already know the outcome. So people start should-ing all over the place: they say, should have done this, should have done that, they should have known this. Whereas with chaos engineering, we're flipping the model, right? We're making the post mortem a preparation exercise, verifying how battle ready we are. So, ChaoSlingr. ChaoSlingr was a project about four years ago that I wrote at UnitedHealth Group, and there were a series of us that wrote it. The idea of it was that we wanted to proactively verify that the security we were building in AWS was actually as good as we thought it was. I'm going to give a quick example. The tool ChaoSlingr is actually deprecated now on GitHub; I'm no longer at UnitedHealth Group, and they have their own version internally that they use. But the framework for how to write experiments is still in there, and that's really all you need in order to understand how to write experiments. A real quick example was PortSlingr. With PortSlingr, we did an example so people could understand the open source tool and how it works, and a basic experiment. We picked a misconfigured or unauthorized port change. For some reason, it still happens all the time. It could be that somebody didn't understand the flow, they filled out the wrong information in a ticket, or they made the change wrong; lots of different reasons why. What we expected, the security expectation, was that we would immediately detect and block the change with our firewall, and it would be a non-issue. What we found out was that that was only true about 60% of the time. That was the first thing we learned. We learned that the firewall, this fancy solution we were paying for, wasn't as good as we thought it was. Right. But we learned this proactively. This was not learning through failure, through pain. The second thing we learned was that the cloud native configuration management tool caught it and blocked it every time. So something we weren't even paying for basically caught and blocked it every time. The third thing we expected was good log data to go to, like, a SIEM. We didn't use a SIEM, we had our own log solution, but we expected a correlated event to go to the SOC. Okay, that all happened. But when the SOC got the event, we were very new to AWS at the time, and they couldn't figure out which account structure the event came from, whether it was our commercial or non-commercial software environments. And you might say that's trivial, that it's easy to figure out because you can map an IP address back to what account it was. Yeah, but that could take 30 minutes. That could take 3 hours. Right.
During an active incident, you lose millions, millions of dollars. Right. We didn't have to do that. All we had to do was figure out that we needed to add metadata and a pointer to the event, and fix it. But we did all of this, we learned all these things about how our system was really working, and that we had a drift problem between commercial and non-commercial environments, not through an incident or an outage. We learned it by proactively discovering it, just asking the system: are you doing what we designed you to do? And it was an extremely valuable exercise. I'm sure Jamie's going to go more in depth on that in a little bit. Jamie? Absolutely. So there are several companies that have already implemented security chaos engineering today, and we're seeing that number increase, and that is awesome. At Cardinal Health, where I work, we began our security chaos engineering journey in the summer of last year. Of the use cases that Aaron was talking about, the one that we were most concerned about was security controls validation, and that's the one that we used to jumpstart the team. So at Cardinal Health, we have this awesome security architecture team, and they really have two jobs. The first is to really be technical visionaries across the company and anticipate technology trends and really help the enterprise adopt them. And their second responsibility is they consult on projects to make sure that they're delivering security requirements from the get go. But the challenge that that team has is that the ultimate effectiveness of their team relies on other teams making sure that they are actually implementing the security strategies or requirements or standards. And as most of you probably acknowledge, very rarely do projects execute 100% to plan. So whether that's unforeseen technical limitations, or project timelines getting cut close, or just even human error, any of those things could easily undermine the security standards that security architecture had set forth. And even if a project was implemented securely from the start, changes during a system's lifetime could silently increase the risk to the company, and none of us would even know. So what Cardinal Health decided was that we needed to move away from theoretical design-based architecture and move towards what we called applied security, which to us really meant bringing continuous verification and validation to Cardinal Health so we could validate our security posture. Just like Aaron said, we wanted to do that when things were good and not when we were scrambling to try to deal with any sort of negative event. And we got our name because chaos engineering was very new at the time, and what we knew was that when we partnered with other teams, we wanted a name that made sense and that was reflective of the fact that we were making sure that those security controls were actually applied. So we called ourselves applied security. So it was the summer of last year when I joined our information security team and was set to lead and form this brand new team focused on security chaos engineering. And after the first month, our mission became clear. We knew that we were supposed to go after the unknown technical security gaps at the company and really drive those to remediation before a bad guy found them and exploited them for us.
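To make the PortSlingr example above more concrete, here is a minimal, hypothetical sketch of that style of experiment in Python with boto3. It is not the actual ChaoSlingr code: the security group ID, the port, and the five-minute detection window are placeholder assumptions, and it only covers the inject-and-verify half of the experiment (checking that the SOC received a well-formed, correlated event would be a separate step).

import time
import boto3

# Placeholder assumptions: a sacrificial, non-production security group
# and a port the firewall policy should never allow.
EC2 = boto3.client("ec2")
TEST_GROUP_ID = "sg-0123456789abcdef0"
UNAUTHORIZED_PORT = 2222
DETECTION_WINDOW_SECONDS = 300  # assumed time we give the controls to react

def inject_unauthorized_port():
    """Open the unauthorized port to the world: the injected turbulent condition."""
    EC2.authorize_security_group_ingress(
        GroupId=TEST_GROUP_ID,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": UNAUTHORIZED_PORT,
            "ToPort": UNAUTHORIZED_PORT,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "security chaos experiment"}],
        }],
    )

def port_still_open():
    """Return True if the injected rule survived, i.e. nothing detected and reverted it."""
    group = EC2.describe_security_groups(GroupIds=[TEST_GROUP_ID])["SecurityGroups"][0]
    for perm in group.get("IpPermissions", []):
        if perm.get("FromPort") == UNAUTHORIZED_PORT and perm.get("ToPort") == UNAUTHORIZED_PORT:
            return True
    return False

def cleanup():
    """Remove the injected rule so the experiment leaves no residue behind."""
    if port_still_open():
        EC2.revoke_security_group_ingress(
            GroupId=TEST_GROUP_ID,
            IpPermissions=[{
                "IpProtocol": "tcp",
                "FromPort": UNAUTHORIZED_PORT,
                "ToPort": UNAUTHORIZED_PORT,
                "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
            }],
        )

if __name__ == "__main__":
    inject_unauthorized_port()
    time.sleep(DETECTION_WINDOW_SECONDS)
    if port_still_open():
        print("Hypothesis failed: the unauthorized port change was not detected and blocked")
    else:
        print("Hypothesis held: the unauthorized port change was reverted")
    cleanup()

Run something like this only against a sacrificial, non-production security group; the cleanup step removes the injected rule so the experiment leaves nothing behind.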
And we started off as a team of five, and among us we had expertise in software development, system design and architecture, privacy, risk, and networking. So we knew that collectively we were equipped to handle our mission. But as exciting as it would have been to just go nuts, get access to every single system, and start poking and looking for technical gaps, we knew that we needed a disciplined and repeatable process. And when we tried to define what that process was, we knew that we had to meet three key needs. The first is that whatever technical security gaps we identified needed to be indisputably, critically important ones. It did us no good if we just went and found a whole bunch of things that ultimately, given the risk, would be determined not worth fixing. So to give us some credibility, we knew that we needed to establish some benchmarks that the company all agreed to, that weren't arbitrary and weren't just theoretical best practices that everybody said were too lofty to achieve. The second was that we knew that we needed to have a big picture understanding of what the technical security gap was and just how widespread it was. We would have people in the past who would hypothesize that we have gaps, or who would know a little bit here or a little bit there, but ultimately, what they lacked to be able to get these things proactively fixed was that big picture understanding and the details that were necessary to get it documented. And then last, we knew that whatever we found, we wanted to make sure that it actually stayed closed. This couldn't just be a point-in-time audit where these gaps were unknowingly reopened in the future. So with those goals in mind, we created a process called continuous verification and validation, which is on the next slide. For short, we call it CVV. Really simply, at its core, we wanted to continuously verify the presence of those technical security controls and validate that they were actually implemented correctly. The CVV process includes five steps. Step one: obviously, you have to identify the security control that you want to validate. Step two: we needed to identify what those benchmarks were so that we could identify and detail what the technical security gaps were. To give us some authority, what we typically like to do is start with those security standards that are set forth by our security architecture team and approved by our CISO. And if those standards don't yet exist, what we do in that case is work with our security architecture team and our security operations teams to get their recommendations, and we socialize those with relevant teams first. What this does is it gives credibility to any of the gaps that we find, and it really helps us establish the requirement that either those fixes need to be prioritized and remediated, or the organization needs to accept the risk. The third step is where we start to actually implement those continuous checks. Usually what we'll do is write custom scripts to make those happen, and those scripts will either test the APIs of some of our tools or take a look at some other configurations as well. But before we write any code, we do try to make sure that we're making good decisions.
We will evaluate if there's anything open source that we could just port over and use, or if there are commercial options that will do the job better and faster for us, especially if we're talking about something where we have a lot of different vendors to keep in mind. Step four is that we create a dashboard to show our real-time technical compliance of all of those configurations with the benchmarks that we've identified. That dashboard really serves two purposes. One is it allows us to get that big picture of the technical gap so that we can communicate what our security posture is on demand, and it's a really good communication tool that we can use when we're talking to our leaders. And then finally, and this is key, if at any point we find that our adherence to those benchmarks decreases, that's when we can create an issue in our risk register. Cardinal Health is very, very lucky that we have a risk team who has done a ton of work to establish a process where any technical security gaps that are identified can be documented and driven to remediation. So this is awesome for us because it means that our team doesn't have to be the organizational babysitters who make sure that these things get done; we can actually continue to implement CVV in parallel, knowing that those remediations are, in fact, going to take place. So I told you at the beginning, I came from a background in software development, and if you're familiar with software development, what you'll start to realize is that this continuous verification and validation, or CVV, process is really the systems-level equivalent of a regularly scheduled automated test suite. It's just that instead of verifying that our code meets the functional requirements set forth by our business, this Cardinal Health CVV process makes sure that our systems are meeting the security requirements set forth by our security architecture team. So let me geek out just a little bit longer on the software development analogies, because they are fantastic. One of the things that you could do is start to write some security chaos tests and run them against a non-prod environment that's awaiting promotion to production. This is pretty much exactly the same as an automated regression test suite that you would run prior to deploying new features or new code to production. And to take it one step further, where we actually want to go next at Cardinal Health is to leverage security chaos engineering as a form of test-driven development, where we could partner with our security architecture team as they're defining our security requirements or setting forth their standards, write those security chaos tests right then and there, and then see them start to pass, and continuously pass, as that project continues. So the way that I see it, just as the world of software engineering first started to embrace the concept of testing methodologies and instrumenting for health metrics, the systems engineering world is really going to do the same. As I see it, security chaos engineering and just normal chaos engineering are going to become a part of our new normal. They're going to be baked into our project timelines. They're going to be part of our success criteria.
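To illustrate the regression-suite analogy above, here is a minimal, hypothetical sketch of what one CVV-style check might look like as a scheduled pytest test, again in Python with boto3. The benchmark it encodes (no security group may expose SSH to 0.0.0.0/0) is an assumed example standard, not necessarily one of Cardinal Health's, and AWS credentials are expected to be configured in the environment.

import boto3

SSH_PORT = 22  # assumed benchmark: SSH must never be open to the internet

def security_groups():
    """Yield every security group in the account/region, page by page."""
    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_security_groups")
    for page in paginator.paginate():
        yield from page["SecurityGroups"]

def allows_world_ssh(group):
    """Return True if any ingress rule exposes the SSH port to 0.0.0.0/0."""
    for perm in group.get("IpPermissions", []):
        from_port = perm.get("FromPort")
        to_port = perm.get("ToPort")
        # A missing FromPort (protocol "-1") means every port is covered.
        covers_ssh = from_port is None or from_port <= SSH_PORT <= (to_port if to_port is not None else from_port)
        open_to_world = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
        if covers_ssh and open_to_world:
            return True
    return False

def test_no_security_group_exposes_ssh_to_the_internet():
    """The 'verify' half of CVV: the control must hold right now, not just at design time."""
    violations = [g["GroupId"] for g in security_groups() if allows_world_ssh(g)]
    assert not violations, f"Benchmark violated by security groups: {violations}"

A scheduler or CI job running a suite like this on a timer is the "continuous" part; a failing assertion is the kind of signal that would feed the dashboard or the risk register described above.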
Just like it was critically important for us to make sure that our systems are functioning properly, that the health of the systems is at its peak, that response times are good, and that they're serving customer needs, just like we need a pulse on that constantly, we now know that we need to get a constant and consistent pulse on the security posture of our systems as well. So if I had my magic eight ball, I would be making the prediction that this is going to become a part of our new normal, and our engineering practices are really going to reflect that importance. So, on the next slide, I'm going to start telling some more of a story here. As I mentioned before, I don't come from a background in cybersecurity or even site reliability engineering. But I did spend ten years in software development, where I ushered new products and new features into production for two Fortune 20 healthcare companies. And I loved it. It was awesome. In an industry that's honestly so ripe for innovation and transformation, I really believe that my work was making a difference. But what always ate at me, and what always kind of nagged me in the back of my mind, is that no matter how valuable the product that I stewarded was, the idea that a data breach or a DDoS attack could happen persisted. And really, I knew that if that happened, it would take just one incident to threaten to undo all of the awesome work that I and my team were doing. And at the time, I only really had vague notions about how to go about addressing this. Those focused a lot on things like static code analysis or theoretical threat modeling, and we talked about why that can be problematic, or on building detections to identify when something bad was already underway. Like Aaron said, the problem with incident response is that it's always a response. It wasn't really until one day at Cardinal Health, at an internal company function, that I met the director of security architecture, and we hit it off. And he said, I know we just met, but I'm starting this new team and I'd like you to apply. And we spent the rest of the day talking about security chaos engineering. And that's kind of when I saw the stars align, if I want to use that cheesy expression. I saw the culmination of my desire to create something that was super valuable and impactful to healthcare, and the ability to proactively ensure that the company's systems stayed secure and continued delivering on the value that we had promised with the software systems that we had deployed. So, like I geeked out on you earlier, the parallels between software engineering and security chaos engineering, or applied security, were just too profound to ignore. And the idea of testing your own security was just so crazy logical, I can't even articulate it. Just like I wouldn't ship new features to production without accompanying them with tests that verify that the code actually does what it should do, I was on a journey at Cardinal Health to help a team write continuous security tests that we could run on a regular basis to not only ensure the proper functioning of our systems, but to proactively protect them. And I like to tell that story because there may be some of you in the audience who either have never heard of security chaos engineering or just don't know where to start. And the beauty of security chaos engineering, if you ask me, is not only is it such a simple philosophy, but it's also crazy accessible to everybody.
So you can practice this on a system that's either a giant monolith or one that's radically distributed with tons of microservices. So, on the next slide, if you are looking to get started and you need more of a foundation before you implement anything, the good news is, like Aaron said at the beginning, there is an O'Reilly report coming later this year about the discipline. The primary authors are Aaron, who's here, and Kelly Shortridge, who is the VP of product strategy at Capsule8. So I'm super excited for this. It's coming together very well. And in addition to explaining what the discipline is, it also has real world stories of security chaos engineering in practice, not only from me, but from other really smart people at Verica, Google, and others. And even Yuri, who has a session at Conf42 today, is one of the contributing authors as well. So please be on the lookout for this. I think it's one of those things that's going to add a lot of value, especially if you're trying to figure out where to get started. But what's also awesome is that once you have that foundation and you understand security chaos engineering, it's really possible to start just insanely small. You don't have to have this fully automated, super complex, system-wide experiment that requires VP approval to even get off the ground. So on the next slide, I want to remind you of Aaron's example with PortSlingr. Here you can actually look at this as a variety of different individual tests, and you can see that on the next slide. You could break this down and say that you just want to test the config management utility and know who has access to everything and things like that. Or you could decide you're just interested in understanding if the alert actually fired. You could start with a manual experiment: if observability is your main use case, build the detection, and experiment by just turning off that log source and validating that what you believe is true, which is that that alert will actually fire. You validate that it does. If you're still struggling to figure out where you can begin: first of all, do you know any high value, low effort targets that you could go after right now? If so, fantastic. Start there. You don't need something that's crazy complex. But if you're like me, maybe you have a whole bunch of discrete, high value testing possibilities and it's really hard to make sense of them all. In that case, what I like to say is that it's okay to be a little bit selfish and be what I call the right kind of lazy, if you're an engineer. And what I mean by that is: take a look at the things that are painful for you or your team. Are you logging into a system every day to make sure that it's up and running and healthy and operating the way that it should? And even worse, are you doing this on off hours or weekends? What you can do, if you want to be that right kind of lazy, is start to automate some of that boring work. Start to say, you know what, I'm going to continuously validate these things and I'll sound the alarm if something goes wrong or doesn't meet that expectation anymore. But the beauty of security chaos engineering is that, with the high stakes of a security operations environment, you don't need to add more stress to the plate.
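As a tiny example of the "right kind of lazy" idea above, here is a hypothetical sketch that automates one boring manual check, whether a system is up and serving a healthy TLS certificate, using only the Python standard library. The hostname and the expiry threshold are placeholder assumptions; the point is simply that the check runs on a schedule and only sounds the alarm when the expectation stops holding.

import socket
import ssl
import sys
from datetime import datetime, timezone

HOST = "example.internal.company.com"  # placeholder: the system you log into every morning to check
PORT = 443
MIN_DAYS_LEFT = 14  # placeholder: alarm if the certificate expires within two weeks

def days_until_cert_expiry(host, port):
    """Connect over TLS and return how many days remain on the served certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

if __name__ == "__main__":
    try:
        days_left = days_until_cert_expiry(HOST, PORT)
    except (OSError, ssl.SSLError) as exc:
        print(f"ALARM: {HOST}:{PORT} is not serving healthy TLS ({exc})")
        sys.exit(1)
    if days_left < MIN_DAYS_LEFT:
        print(f"ALARM: certificate for {HOST} expires in {days_left:.1f} days")
        sys.exit(1)
    print(f"OK: {HOST} is up and its certificate is valid for {days_left:.1f} more days")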
You can actually start to build your confidence that your security is working even when you aren't, and that it will continue to do so until that alarm fires. And so, Aaron, back to you. I love the way you stated that. It's a great way to bring it home. So I just have a couple more things to add. When I started down this path, I was really focused on the chaos engineering bits and the technique itself, but really, it's part of a larger domain of knowledge called resilience engineering. And there are two other domains related to it, which are cognitive systems engineering and safety engineering. I believe the path to righteousness and improvement as a craft lies within the knowledge we can obtain from those domains, and I encourage anyone in security to start exploring the different capabilities and knowledge in those domains. And I leave you with this: the case for security chaos engineering. After this report is released in the fall, Jamie and myself, Kelly, and a slew of other authors will begin writing, well, we've written a lot of it already, but we'll begin writing the official animal book on the topic. So for anyone who does this, we look forward to you reaching out to us. We'd love to hear your story and get you involved with the community. Absolutely. Thank you for everything, everyone. Have a great day. Thank you, everyone.
...

Aaron Rinehart

CTO @ Verica


Jamie Dicken

Manager of Security Engineering @ Cardinal Health



