Conf42 Chaos Engineering 2021 - Online

Cyber Chaos Engineering: How to Implement Security without a Blindfold

The complex ordeal of delivering secure and reliable software will continue to become exponentially more difficult unless we begin approaching the craft differently.

Enter Chaos Engineering, but now also for security. Instead of focusing on resilience against service disruptions, the focus is on identifying the truth behind our current state of security and determining what "normal" operations actually look like when they're put to the test.

The speed, scale, and complex operations within modern systems make it tremendously difficult for humans to mentally model their behavior. Security Chaos Engineering is an emerging practice that helps engineers and security professionals realign their mental models with the actual state of operational security, and build confidence that it works the way it was intended to.


  • Today we're going to talk about cyber chaos engineering and security chaos engineering. Aaron Rinehart, CTO and co-founder of Verica, and David Lavezzo, director of Cyber Chaos Engineering at Capital One, talk about how it applies to cybersecurity.
  • In cybersecurity, breaches and outages are becoming more and more frequent. The problem is that the complexity and the scale of these systems are very difficult to keep track of in our heads. The obvious answer to the complexity problem would be to simplify.
  • The normal function of your systems is to fail; it's humans that keep them from failure. We need to be more proactive in identifying when things aren't working the way we intend them to. Chaos engineering was a key component of Netflix's cloud transformation.
  • Chaos engineering is about testing versus experimentation. What we're trying to do is understand our systems and their security gaps before an adversary does. Security is no different: it's about continuously verifying that security works the way we think it does.
  • The problem with incident response is that response is actually the problem. Security incidents, by nature, are somewhat subjective. With security chaos engineering, we're introducing that X condition proactively, so we can now understand, measure, and manage things.
  • When I was at UnitedHealth Group, we started experimenting with Netflix's chaos engineering. We applied it to security use cases around incident response and security control validation. With all these promising vendors, it's even more difficult to make good decisions on what the correct action is.
  • As you tune your environment, you have a baseline to reference when detection or alerting failed. How are people and teams being rewarded? Uptime and visibility. Once you succeed at this, you're going to be really popular.
  • Always be running the attacks. You'll know what's working and what isn't, and the typical alerts become your known baseline. Testing removes guessing, and when you remove guessing, you're removing mistakes. It also helps train new people as they come on board.
  • Gap hunting leads to discovering gaps in understanding like what was missed. Can we go from detection only to prevention, too? Measurement is going to help. Audits become a lot easier when metrics are constantly being generated.
  • Here's the link to the Security Chaos Engineering book, a free copy compliments of Verica. And if you're also interested in chaos engineering, here's a link to both books.


This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to Conf42. I'm Aaron Rinehart, CTO and co-founder of Verica, and my co-presenter is David Lavezzo, director of Cyber Chaos Engineering at Capital One. Today we're going to talk about cyber chaos engineering and security chaos engineering. Some of the things we'll cover: what is chaos engineering (a very brief introduction), how it applies to cybersecurity, and how you can get started. David's also going to share his journey. A little background about me: like I said, I'm the CTO and co-founder of Verica. I'm also the former chief security architect of UnitedHealth Group; that's where I wrote the first open source tool that applied Netflix's chaos engineering to cybersecurity. I'm an O'Reilly author on the Chaos Engineering book, and I'm also the author of the O'Reilly Security Chaos Engineering report. I'm David, from Capital One. I'm the director of security chaos engineering, formerly of Zappos and Amazon, and a contributing O'Reilly author for Security Chaos Engineering. I've released multiple metal albums and surf albums, the picture you see right there is actually my best photograph, and most notably, I did indeed start an indoor snowball fight in Las Vegas. You're much more interesting than I am, David. So let's get started. What is the base of the problem we're trying to solve? Well, in cybersecurity, and these headlines are just examples, the problem is becoming more and more frequent in terms of breaches and outages, and they're getting more magnified. But why do they seem to be happening more often? That's part of the theme we're going to talk about: why we believe that's happening.
So to start, part of the problem is that our systems have fundamentally evolved beyond our human ability to mentally model their behavior. What you see in front of you is something called a Death Star diagram; each dot represents a microservice. This used to be a reflection of how FAANG (Facebook, Apple, Amazon, Netflix, and Google) built their systems. So imagine every dot is a microservice. That's very complex; that's a lot of things to keep track of in your head. Well, now it's not just Google, Facebook, and Netflix building like this. It's everyone. The problem is that the complexity and the scale of these systems are very difficult to keep track of in our heads. Here's an example from Netflix's actual edge streaming service. The point of this slide is to demonstrate that each large circle is a microservice, and each dot represents a request. It's very difficult at any given point in time to keep track of what the heck is even going on here, and that's part of the problem we're trying to articulate. So where does this complexity come from? Some of the areas that introduce complexity into how we build software today are things like continuous delivery, cloud computing, automated canaries, service meshes, microservices, distributed systems in general, and immutable infrastructure. These are the mechanisms that enable us to deliver value to customers faster than we've ever done before. The problem is they also add a layer of complexity that makes it difficult for us humans to keep track of what's going on. So where is the state of security? Well, in security, our solutions are still mostly monolithic. They're still expert in nature, meaning they require domain knowledge to operate; it's hard to just pick up a Palo Alto firewall. A lot of our security is very tools-rich, but it's very dependent upon commercial tool sets.
And you can't really just go to GitHub, download a tool, and get working with it. It's not the same as it is in the world of software engineering. We are getting better at it by leaps and bounds, but still, our solutions are mostly static in nature. So what is the answer to the complexity problem? The most logical answer would be to simplify. Well, it's important to recognize that software has now officially taken over. What you see on the right-hand side is the new OSI model: it's now software, software, software, software, software. Software has taken over the entire stack. But coming back to fundamentals, it's important to recognize that because of the ease and flexibility of change, software only ever increases in complexity. It never decreases. There's an old saying in software that there's no problem that another layer of abstraction can't solve. But when do those layers of abstraction start to become a problem? We're starting to see that at speed and scale, in terms of complexity. So what does all this have to do with the systems we build every day? We're getting to it. The question remains: given all these things we have to keep track of in our heads, and all the changes, how well do you really understand how your system works? I mean, how well do you really understand it? One thing, and this has been a really transformational realization for me, sounds really simple, but we love to forget that systems engineering is a very messy exercise. In the beginning, we love to look at our systems like this: we have a nice plan, we have time, we have the resources, we've got our pipeline figured out, the code, the base image for our Docker container, the configs. We know what services we need to deliver. We've got a nice, beautiful 3D diagram of the AWS environment we're going to build in.
It's clear what the system is, right? In reality, our system never looks like this. It's never this clear; it never really achieves anything that looks like this. What happens is, after a few weeks or a few months, our system slowly drifts into a state where we no longer recognize it. What that manifests as: after a week, there's an outage on the payments API, you have to hard-code a token, or there's a DNS error that you have to get on the war room bridge and fix. Marketing comes down and says you have to refactor the pricing module or service because they got it wrong, or Google hires away your elite software engineer. We only learn about what we didn't really know about our systems through a series of surprises, unforeseen events. These are also the mechanisms by which we learn how our system really functions. We learn through these surprises, but as a result of these surprises comes pain, and the problem just magnifies. Over time, our system slowly drifts, through these unforeseen events, into a state where we really don't recognize it anymore. But these unforeseen events don't have to be unforeseen. We can surface these problems proactively using chaos engineering, because it's proactive and managed; it's a managed way of doing that. We're not constantly reacting to outages or incidents that result from not understanding how our system functioned. We'll get more into how that works. The summation is that our systems become more complex and messier than we initially remember them being. So what does all this have to do with security? Well, we're building towards it. It's important to recognize that the normal function of your systems is to fail. It's to fail. It's humans that keep them from failure. And humans, we need failure. It's a fundamental component of growth.
We need it to learn and to grow and to learn new things. We learn how not to do something in order to learn how to do it right. The things we build are really no different; they require the same inputs. So how do we typically discover when our security measures fail? Usually it's not until some sort of incident is declared: a human reports something, or some tool throws an alert or tells us we're missing information. But by the time a security incident is declared, it's too late. The problem is probably already out of hand, and we need to be more proactive in identifying when things aren't working the way we intend them to, so we don't incur the pain of incidents and breaches, or outages for that matter. Like I said before, no system is inherently secure by default; it is humans that make them that way. Security, reliability, quality, resilience: these things are human constructs. It requires a human to manifest them, to create them. So pointing a finger at a human for causing a security problem is not productive, right? It required humans to even create this thing we call security in the first place. Cognitively, we also need to remember that people operate fundamentally differently when they expect things to fail. What do I mean by this? When there's an incident or an outage, especially one related to security, people freak out, right? They're expecting, hey, this is the one, and they're worrying about being named, blamed, and shamed. I shouldn't have made that change late at night; I should have been more careful. Shoulda, coulda, woulda. So what happens when there's an incident or outage is that within 15 minutes, some executive gets on the phone and says, hey, I don't care what you have to do, get that thing back up and running, the company is losing money, right?
So it becomes more about getting that thing back up and running, not about what really caused the problem to occur. And this environment is not a good environment to learn in. This is not how we should be learning about how our systems actually function. Chaos engineering, we do when there is no active war room. Nobody's freaking out, nobody's worrying about being named, blamed, or shamed for causing an incident. We're able to proactively surface the inherent failures that are already in the system, the inherent chaos. We're trying to make order of the inherent chaos that's already in the system, and we're able to do it without worrying about the company having an outage, losing money, or customers calling in. We're able to do this eyes wide open, understand the failure, and proactively address it so we don't incur customer pain. So this is where we enter chaos engineering. The definition that Netflix wrote is: the discipline of experimentation on distributed systems in order to build confidence in the system's ability to withstand turbulent conditions. It's about establishing and building confidence in how the systems actually function. Another way of understanding it: it's a technique of proactively introducing turbulent conditions into a system or service to try to determine the conditions by which that system or service will fail, before it actually fails. That's the key aspect. So chaos engineering is about establishing order from chaos. It's not about just breaking things. If you're just breaking things, you're not doing chaos engineering. Things are already breaking. What we're trying to do is proactively understand why that is and fix it before it manifests and causes the company problems. Now, no story about chaos engineering would be complete without the Chaos Monkey story. In late 2008, Netflix was changing over from shipping DVDs to building their streaming service in Amazon Web Services.
This is important to recognize, because a lot of people say, oh, we can't do chaos engineering, we can barely do DevOps, we're barely in the cloud. Well, Netflix had the need for chaos engineering during their cloud transformation. It was a key component of Netflix transforming to the cloud. If you're just transforming to the cloud, what it ends up being is a feedback mechanism to help you understand: are the things I'm building in this new world actually functional? Are they doing the things I built them to do, over time? You can almost think of it as a post-deployment regression test for the things you build. That's important to recognize: the idea that this is some crazy thing, that you have to be super advanced to do it, is a false interpretation of chaos engineering. As for what Chaos Monkey really represented: at the time, as they were building out their services in AWS, AMIs, Amazon Machine Images, were, poof, just disappearing. At the time, that was a "feature" of AWS, and service owners had a difficult time when it happened. What they needed was a way to test that if they built their service to be resilient to AMIs disappearing, it would actually be resilient to that problem. It put a well-defined problem in front of engineers. Basically, during business hours, Chaos Monkey would pseudo-randomly bring down an AMI on a random service. What it did was put the problem of an AMI outage or disappearance in front of the engineers, so they clearly knew what they had to build resilience against. And that's really what chaos engineering is about: building with better context and better understanding. Who's doing chaos engineering? I've lost track at this point; there are roughly 2,000 companies or so.
I mean, the CNCF has declared chaos engineering one of the top five practices being adopted in 2021. Just about every major company is now looking into it or adopting it at some level of maturity. This is no longer just a fad. There are also three books out there, as David said. David and I were involved in writing the Security Chaos Engineering report; if you stay tuned toward the end of the presentation, we'll give a link to download it for free, as well as a link to the O'Reilly Chaos Engineering book written by my co-founder, Casey Rosenthal, the creator of chaos engineering at Netflix. And there's also the original report that Netflix wrote. So, instrumenting chaos: loosely defined, where does chaos engineering fit in terms of instrumentation? Casey likes to frame it as testing versus experimentation. Testing is verification or validation of something you know to be true or false; you know the information you're looking for before you go looking for it. In our world, that's a CVE or an attack: we kind of know what we're looking for. Experimentation, by contrast, is trying to derive new understanding and new information we did not have before, and we do that through the scientific method. The hypotheses are in the form of: I believe that when X condition occurs on my system, Y should be the response. We actually introduce X into the system, observe whether or not Y was the response, and learn what really happens. And it's quite interesting, because I've never seen a chaos engineering experiment succeed the first time we run it. What does that mean? It means our understanding of our systems is almost always wrong, and it's because the speed, scale, and complexity of software are hard for humans to keep track of.
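The hypothesis loop described here ("I believe when X condition occurs on my system, Y should be the response") can be sketched as a tiny harness. Everything below, the function names, the toy state dictionary, the simulated detector, is hypothetical illustration of the experiment shape, not any real tool's code:

```python
# Minimal security chaos experiment harness (illustrative sketch only).
# Hypothesis: "When condition X is introduced, response Y should occur."
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    name: str
    inject: Callable[[], None]    # introduce the turbulent condition (X)
    observe: Callable[[], bool]   # did the expected response (Y) occur?
    rollback: Callable[[], None]  # always undo the injected condition

def run(exp: Experiment) -> bool:
    """Run one experiment and report whether the hypothesis held."""
    try:
        exp.inject()
        held = exp.observe()
    finally:
        exp.rollback()
    print(f"{exp.name}: hypothesis {'held' if held else 'FAILED'}")
    return held

# Toy environment standing in for a real system under test.
state = {"port_open": False, "alert_fired": False}

def open_unauthorized_port():
    state["port_open"] = True
    state["alert_fired"] = True  # pretend a detective control noticed

exp = Experiment(
    name="unauthorized-port-change",
    inject=open_unauthorized_port,
    observe=lambda: state["alert_fired"],
    rollback=lambda: state.update(port_open=False),
)
run(exp)  # → prints "unauthorized-port-change: hypothesis held"
```

The point of the shape is the `finally` rollback: the experiment is a managed, reversible signal, not a break-things free-for-all, and a failed `observe` is new information rather than an incident.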
In the world of applied sciences, we refer to these as complex adaptive systems. Casey and Nora Jones like to point out that it's not about breaking things on purpose, or breaking things in production; it's about fixing them proactively, on purpose. And Casey also likes to say: I'm pretty sure if I went around just breaking things, I wouldn't have a job very long. That's important to recognize. So, security chaos engineering. What is the application of this to security? What we're really trying to address is that hope is not an effective strategy. As an engineer, I've been a builder most of my career, and engineers don't believe in two things: we don't believe in hope, and we don't believe in luck. We believe in instrumentation and empirical data that tells us whether it's working or not, so we have context to build from and improve what we're building. Hope and luck worked in Star Wars as a strategy, but they don't work in engineering. What we're trying to do is understand our systems and their security gaps before an adversary does. Here's the problem we're dealing with today: our systems need to be open and available enough for a builder to have the flexibility to change what they need to, to make the system function or the product operational for a customer. There's this need to change, and change quickly, even if you're changing a small thing. But there's also the need for the security mechanisms to reflect that change. What's happening is that the speed and flexibility of our ability to change is coming at the cost of our ability to understand and rapidly meet that change from a security perspective. Magnify that problem with speed and scale, and things are slipping out of our ability to understand what happened.
So adversaries are able to take advantage of our lack of visibility into how effective our controls are or are not. We have too much assumption and too much hope in how we build. We need more instrumentation post-deployment to tell us: hey, when the misconfigured port happens, we have, like, ten controls to catch that; or when we accidentally turn off or downgrade encryption on an S3 bucket, we should be able to detect it. We have all these things in place, and what we're actually doing is introducing test conditions to ensure that they keep working. It's about continuously verifying that security works the way we think it does. We introduce the condition to ensure that when a misconfiguration happens, the control catches it, stops it, or blocks it, or reports log data, or the alerts fire. I think David's going to talk about how Capital One does a lot of that, but that's really what we're trying to achieve here. It's about reducing uncertainty by building confidence. You're seeing a theme here: the application of chaos engineering to security is no different. We're just looking at it from the security problem set, but we're trying to bring how we believe the system works and how it actually works in reality together, and we build confidence as we verify that our understanding is in line with reality. As for use cases, I think David's going to expand on some of these in a minute, but some rough use cases you can apply chaos engineering to are incident response, security control validation, and security observability. And every chaos experiment has compliance value, because it proves whether the technology worked the way you thought it did or not. Each of these use cases is expanded in the O'Reilly report by people who have applied them in these particular areas.
I'm going to talk a little bit about incident response in security chaos engineering. The problem with incident response is that response is actually the problem. Security incidents, by nature, are somewhat subjective. No matter how much money we have, how many people we hire, how many fancy controls we put in place, we still don't know a lot. We don't know where it's going to happen, why it's happening, who's trying to get in, or how they're going to get in. And so for all the things we put in place, we don't know if they're effective until the situation occurs, right? With security chaos engineering, we're actually introducing that X condition, the condition we spent all that time, effort, and money preparing to detect and block. Because we're no longer waiting for something to happen to find out whether it was effective, we're introducing it purposefully, a signal into the system, trying to understand: do we have enough people on call? Do they have the right skills? Did the security controls throw good log data? Were we able to understand that log data? Did it throw an alert? Did the SOC, the security operations center, process the alert? All these things are difficult to assess confidence in while they're happening. We're doing this proactively, and we can now understand, measure, and manage things we couldn't measure and manage before. It's about flipping the model: instead of the postmortem being an after-the-fact exercise, you can think of the postmortem as now being a preparation exercise, determining how effective we are at what we've been preparing for. So, enter ChaoSlingr. This is the tool I talked about earlier. When I was at UnitedHealth Group, we started experimenting with Netflix's chaos engineering, applied it to some security use cases around incident response and security control validation, and got some amazing results.
And the tool we wrote was called ChaoSlingr. There's an "r" at the end that's not showing up on the slide. It's an open source tool; it's out there on GitHub, you can check it out. ChaoSlingr really represents a framework for how to write experiments. There are three major functions: Generator, Slinger, and Tracker. Generator finds the target to introduce the failure condition into, Slinger actually makes the change, and Tracker tracks the change and reports it into Slack. So here's an example. ChaoSlingr had a primary example we open sourced, because we needed something to share with the wider community so they would know what we were trying to achieve. We picked an example called PortSlingr, because we've been solving for misconfigured ports and firewalls for, like, 30 years, right? Whether you're a software engineer, a network engineer, a systems engineer, or an executive, everybody kind of knows what a firewall does. What PortSlingr did was proactively introduce a misconfigured or unauthorized port change into an AWS EC2 security group. We started doing this throughout UnitedHealth Group at the time. UnitedHealth Group had a non-commercial and a commercial software division, so we started introducing the change into these security groups and finding things out about the firewalls. Our expectation was that our firewalls would immediately detect and block it. It would be a non-issue; this is something we've been solving for 30 years, remember. We should be able to detect it. That only happened about 60% of the time. The problem was a drift issue between our commercial and non-commercial environments. That was the first thing we learned, and remember, there was no active incident. There was no war room. We did this proactively to learn whether the system was working the way it was supposed to.
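The Generator/Slinger/Tracker shape described above can be sketched with an in-memory stand-in for the AWS environment. All of the names and the dictionary of "security groups" below are hypothetical illustration (the real ChaoSlingr works against AWS and reports into Slack); the two groups loosely model the commercial/non-commercial drift:

```python
# Sketch of the ChaoSlingr pipeline shape: Generator picks a target,
# Slinger injects the unauthorized change, Tracker records the outcome.
import random

# Toy stand-in for EC2 security groups; names are illustrative only.
security_groups = {
    "sg-commercial":     {"open_ports": {443}, "firewall_catches": True},
    "sg-non-commercial": {"open_ports": {443}, "firewall_catches": False},
}

def generator() -> str:
    """Pick a target security group to experiment on."""
    return random.choice(sorted(security_groups))

def slinger(sg_id: str, port: int = 8080) -> None:
    """Introduce the unauthorized port change."""
    security_groups[sg_id]["open_ports"].add(port)

def tracker(sg_id: str, port: int = 8080) -> dict:
    """Check whether the control caught the change, then report it."""
    sg = security_groups[sg_id]
    if sg["firewall_catches"]:
        sg["open_ports"].discard(port)  # control blocked/reverted it
    blocked = port not in sg["open_ports"]
    return {"target": sg_id, "port": port, "blocked": blocked}

target = generator()
slinger(target)
report = tracker(target)
print(report)  # e.g. {'target': 'sg-commercial', 'port': 8080, 'blocked': True}
```

Running this repeatedly across both groups surfaces exactly the kind of drift the talk describes: the same injected condition is blocked in one environment and silently allowed in the other.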
The second thing we learned was that the cloud-native configuration management tool caught it and blocked it almost every time. That was amazing: something we were barely paying for was working better than we expected. The third thing we expected was that both tools would send good log data to our security logging tool. We didn't really use a SIEM at the time; we used our own homegrown solution. I wasn't sure an alert would be thrown, because we were new to the cloud, very new to AWS at the time, so we were learning all these things. And what happened is it actually threw an alert. That was awesome: the third thing I kind of didn't expect to happen, happened. The fourth thing is that the alert went to the SOC, but the SOC didn't know what to do with it, because they couldn't tell what account or what instance it came from. Now, as an engineer, you might say, well, Aaron, you can map back the IP address. In truth, that could take 5, 15, 30 minutes to do, and if SNAT is in play, it could take two hours, because SNAT hides the real IP address. Remember, if this were an incident or an outage, two hours is very expensive when you're the largest healthcare company in the world. But there was no incident, there was no outage. We proactively figured out: oh crap, this would have been bad had it actually happened. So we were able to add metadata pointers to the alert and fix it. There was no problem, but it could have been very expensive had we waited for it to manifest on its own as a surprise. These are the kinds of things you can uncover with security chaos engineering; that's kind of a prime example. You can check out more examples, like I said, in the book David and I wrote, Security Chaos Engineering. As for why you should do it, I don't really see a choice. Looking back over Aaron's slides, the relationship maps kind of look like coronavirus.
A timely reference. Security is hard, and it's getting harder. With all these vendors promising the world will be fixed by using their products, it's even more difficult to make good decisions on what the correct course of action is. Are the tools you have in place working? Do you need new tooling? Is the problem something you need to invest time and effort in? If it is, how do you know? People who talk about needing to address the basics before doing the advanced things like security chaos engineering are kind of missing the point of SCE in general. If you build this foundation into your engineering teams and engineering functions, it becomes a basic capability. It becomes business as usual, and it removes the lift needed to get it implemented later on. The basic issues are the ones you need to worry about before tackling an advanced adversary. I'm not saying the advanced adversary isn't important, because depending on your threat model it can be, but you need to know what your capabilities are before someone else takes advantage of them. I know I'm generalizing, but most things we're seeing in security are kind of basic: password reuse, credentials in GitHub, cross-site scripting, firewalls, the stuff that's been around for 30 years. They're all basic things everyone knows, and these issues can pop up at any time. But it's not easy. Even though they're basic, they're definitely not easy. You can look at some of the recent hacks, like SolarWinds, and think, wow, how can they mess up something so easy? But if you're in a large environment, you kind of know the answer to that one. It's the complexity of things. Nothing is more permanent than a temporary quick fix. Nobody knows how all the systems interconnect, how one firewall rule can really affect another, or how service accounts can have undocumented services affected. Not until you do it.
And that's what needs to be observed and measured. You just do it, like in A Knight's Tale, that movie from, like, 100 years ago with Heath Ledger. It was a good movie, I think. Baselining is where you want to get all of this to. Trying to see what actually happens is how you establish that baseline. Do we really know that we can stop the malware, or that the vendors are doing what they promise to do? Does a proxy really stop a specific category? You don't really know unless you try. And when you try it, you document it and treat anything you find as that baseline, so you know where you've been, you know where you're going, and you've got something to compare against. Why you want to do that is to understand when and how things change, and to drive usable metrics that aren't just uptime but tell you how your tooling performs. When you want to prove the way things work, you want to prove that your tools function. You want to know that it's more than just a blinking light you're using to comply with a policy somewhere. You don't want to wait until the ghost is already in your ballroom before you check whether your security capabilities function the way you think they do. When you're asked, "are we vulnerable to this?", you'll have an answer. Instead of loosely waffling around with "well, we've got rules in place that will alert when it happens," you can prove you have an answer by having the logs, having the data that shows the results or the triggered alerts. You can remove the guesswork and replace it with analysis and data. And as you tune your environment, you have a baseline to reference when detection or alerting fails. You'll know if you've fine-tuned a rule too much, because maybe a simple variable change has now bypassed your tooling. But in practice, it's not going to be that simple. Products are difficult. They're difficult to implement, difficult to configure and tune.
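The baselining idea here can be sketched as a diff against the last known-good results: record what your controls actually did, then compare each new run so a silent regression (say, an over-tuned rule that stops alerting) shows up immediately. The check names and fields below are hypothetical illustration:

```python
# Baselining sketch: compare each run's results against the recorded
# baseline so detection or alerting regressions surface as data, not
# surprises. All check names are illustrative.
baseline = {
    "malware-download": {"blocked": True, "alerted": True},
    "proxy-category":   {"blocked": True, "alerted": False},
}

def compare(run_results: dict, baseline: dict) -> list[str]:
    """Return the checks that got worse than the baseline."""
    regressions = []
    for check, expected in baseline.items():
        actual = run_results.get(check, {"blocked": False, "alerted": False})
        for field in ("blocked", "alerted"):
            if expected[field] and not actual[field]:
                regressions.append(f"{check}: {field} regressed")
    return regressions

today = {
    "malware-download": {"blocked": True, "alerted": False},  # rule over-tuned?
    "proxy-category":   {"blocked": True, "alerted": False},
}
print(compare(today, baseline))  # → ['malware-download: alerted regressed']
```

Kept over time, the run results themselves become the metrics: not just uptime, but how the tooling actually performs from one tuning change to the next.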
And some of them may not even do what they're supposed to do outside of a few test cases the vendor prepared so they knew their tool would work. You can get poorly designed UX, which mandates vendor-specific training and vendor time, because you've got a deep investment in them and that's the only way to know how their tools work: get the deep training. When people move on, they take that training knowledge with them and you've got to reinvest in new people. As your defenses move through this testing, you continue to update your expectations of what a system provides. You've got a way to actually show improvement that isn't just "we saw this many attacks last month, we blocked this many this month." You can show incremental improvement as you go from one spot to another to another. And as for why: maybe you're bored and you think, you know what, I want to jump in the deepest rabbit hole I can find and see what I can find out. Because once you succeed at this, you're going to be really popular. You're going to be so popular, fetch might actually happen. I don't know, it hasn't happened for me yet. You know what they say about doing the same thing over and over again expecting a different result. But for real, people get really excited. And when they get excited, they get ideas, and sometimes they're dangerous ideas. Dangerous meaning increased asks, expectations, and wanting even more of the data you're generating. It's a good thing, but you don't want to be caught unprepared. So how do you get started with this? Look at your strong defenses, the alerts going off. How are people and teams being rewarded? Uptime and visibility. When you're doing this, you want to give them support. And when you're supporting people, they shouldn't feel like they're being attacked. When people feel attacked, they buckle down and resist the efforts you're making.
So you want to take the relationships you have, take what you know, and just sort of nurture them. Start with something that will show immediate impact, something quick, something easy. Your ultimate goal at this stage is to show the value of what you're doing, and do it in a way that shows support to the team tasked with owning and operating the platforms you're experimenting on. Let's say you have a SOC. Take some of the critical alerts they've got and test them. If they haven't gone off in a while, that's even better, because you don't really know whether alerts work until they go off. So you run it and measure it. If they work, awesome. We've got a success. Let's celebrate the success and tell everyone in a way that gets these teams rewarded, because too often people don't get rewarded for the things they're doing well, because nobody knows. You'll get that data-driven metric to show that your tools are working, that your teams are working. So what I did, starting out, is I used a combination of endpoint alerting and endpoint protection. I took an attack technique, documented the expected path the attacker would take, what defenses were expected, what alerts would go off, and just let loose. Within roughly 15 minutes I had my initial results and data to bring to my new best friends, and I brought the data with questions like: is this what you expected to happen? Are the logs what you expected? Did the alert go off when you expected it? If it did, awesome. Take that, repeat it every day, and help generate metrics with that data. So, I mentioned the gif. I really wasn't settled on this gif, but it worked for me. Getting started back then, it was still endpoints, and it was something I could test myself, something where I didn't need to get other people involved. I could say, all right, let me just do this.
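That "document the expected path first, then compare" step could look something like the sketch below. All the alert and defense names are invented for illustration; the point is that expectations are written down before the experiment runs, so the review with the owning team is about data, not opinions.

```python
# Hypothetical sketch of documenting an attack path as an experiment:
# expected defenses and alerts are recorded up front, then diffed against
# what was actually observed.
from dataclasses import dataclass, field

@dataclass
class Experiment:
    name: str
    expected_alerts: set = field(default_factory=set)
    expected_blocks: set = field(default_factory=set)

    def review(self, observed_alerts: set, observed_blocks: set) -> dict:
        """The questions to bring to the owning team: what fired, what didn't."""
        return {
            "missed_alerts": self.expected_alerts - observed_alerts,
            "surprise_alerts": observed_alerts - self.expected_alerts,
            "missed_blocks": self.expected_blocks - observed_blocks,
        }

exp = Experiment(
    name="endpoint malware drop",
    expected_alerts={"edr_process_alert", "siem_correlation"},
    expected_blocks={"av_quarantine"},
)
# Here the EDR alert fired and the AV quarantined, but the SIEM correlation
# never triggered: that missing alert is the conversation starter.
result = exp.review({"edr_process_alert"}, {"av_quarantine"})
```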
Look, because any time you have to leverage other people, it's taking away from their day job, from the work they need to keep doing. So I picked something we cared about, critical alerts, and then picked the ones we hadn't seen fire in a while. As for why: alerts go off when you're attacked. If you don't see alerts, is it because you're not being attacked? Maybe your alerts don't work. Maybe nothing logs. Maybe the tools don't actually stop what they're supposed to stop. As for how I started: I took something nondestructive, LOLBins (living-off-the-land binaries), PowerShell, things that wouldn't actually affect the state of the machine. I didn't want to do anything destructive, because if you leave the machine in a worse state than when you found it, that could be something Very Bad. You could put a trademark on that: Very Bad. So when I did it: did I see what I expected? Did it respond how I thought it would? If so, cool. Now figure out how to automate it and run it every day. It's a small thing, but you're starting to establish that baseline you can use to know your basic level of protection, and you have a spot to compare against as your environment gets tuned, gets updated, gets fixed. When you're getting started, you need to find what your target cares about. What their goal is changes the approach you should take; validating malware detection isn't necessarily the same approach you'd take for validating that alerts trigger on things like lateral movement or network-based attacks. And again, report on what you're doing. You want to be like Malcolm up there. You want to observe, report, show the changes, and show teamwork to make things better. So, getting back into the use cases Aaron was talking about: do critical alerts work? It's always going to be a rhetorical question. We all know alerts work until they don't.
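A daily, nondestructive probe run could be automated along these lines. This is a hedged sketch, not the speaker's actual tooling: the probe here is a harmless `echo`, and the detection check is a stand-in callable for whatever query you'd make against your real alert pipeline (it assumes a Unix-like host for the command).

```python
# Hypothetical sketch of a daily nondestructive probe runner. Each probe is a
# benign command (a LOLBin-style action that changes nothing on the host), and
# check_detection stands in for polling your SIEM/EDR for a matching event.
import datetime
import subprocess

def run_probe(name: str, argv: list, check_detection) -> dict:
    """Run a benign command, then ask the alert pipeline whether it was seen."""
    started = datetime.datetime.now(datetime.timezone.utc)
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {
        "probe": name,
        "ran_at": started.isoformat(),
        "exit_code": proc.returncode,
        "detected": check_detection(name),  # e.g. query SIEM for the event
    }

# Scheduled from cron (e.g. `0 8 * * *`) so runs land at a consistent time
# and timing outliers stand out. Detection is stubbed out here.
result = run_probe("echo_probe", ["echo", "probe"], lambda n: False)
```

The record from each run feeds straight into the baseline: if `detected` flips from True to False after a tuning change, you find out the next morning rather than during an incident.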
And sometimes you find out at the worst possible time that something stopped working. But there is a way to prevent this: you test them. Always be running the attacks. It's not as cool as ABC, always be closing, but it's close enough. Always run attacks, schedule them, run them at a consistent time so you can identify outliers. If you're running at 8:00 a.m. every single day and you suddenly see results at 8:15 or 8:30, okay, what's going on that day? Was there an issue with logging that we didn't know about? When you do that, you'll have historical trends of alert uptime, ensuring the alerts trigger as expected even though tuning processes in the future may inadvertently change how systems behave. You'll know what's working and what is not. Then, alongside the typical alerts, come the false negatives. If you're like any company ever, you hate false positives. Unlimited resources aren't a thing, and teams have to prioritize what they spend their time on. What do you do with false positives? In many cases, you just spend the time and tune out the excessive noise. But like anything, that can cause its own issues. Has false positive reduction made your alerts so brittle that you're now missing attacks you used to catch? Are alerts so hyper-focused on a single technique that they introduce false negatives? Have alerts been so finely tuned that a slight variable tweak no longer triggers any logging or alerts? If so, well, at least you know now, because you can feed that into your other use cases, like detection engineering, hunting, and security architecture. And when the alerts fire, how long does it take for the alert to fire? Does it fire when you think it's going to? Because sometimes, in reference to the tuning, when you tune something you can still affect time to alert, which is also going to affect time to respond. And do the results show what's expected?
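Spotting those 8:15-instead-of-8:00 runs is a simple statistics problem once you keep history. A minimal sketch, assuming you've logged time-to-alert in seconds for each scheduled run (the numbers below are invented):

```python
# Hypothetical sketch: flag scheduled runs whose time-to-alert drifts from the
# historical trend, using a simple standard-deviation cutoff.
from statistics import mean, stdev

def flag_outliers(seconds_to_alert: list, threshold: float = 1.5) -> list:
    """Return indexes of runs whose time-to-alert deviates by more than
    `threshold` standard deviations from the historical mean."""
    if len(seconds_to_alert) < 3:
        return []  # not enough history to call anything an outlier
    mu, sigma = mean(seconds_to_alert), stdev(seconds_to_alert)
    if sigma == 0:
        return []  # perfectly consistent, nothing to flag
    return [i for i, t in enumerate(seconds_to_alert)
            if abs(t - mu) / sigma > threshold]

# Four runs alerting in about a minute, then one taking 15 minutes:
print(flag_outliers([60, 62, 58, 61, 900]))  # → [4]
```

A z-score cutoff is deliberately crude; the point is that any run flagged here triggers the "was there a logging issue that day?" conversation before an attacker benefits from the slow alert.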
If you're expecting the results within ten minutes, do you get them within ten minutes? If so, cool. If not, that's when most of the work starts, because now you've got to start tracing back. Where is everything going? What does the ingestion path look like? What do we need to change? And finally, there's helping to train the new people coming through. Standard training exercises can be boring, so we try to spice it up. Since it's already running in production, there's not really a better way to get people to practice investigations and do real work. How are they actually responding to security events? Do they adhere to playbooks? Are the playbooks right? Are there critical steps that people are keeping in their heads that never actually made it to the electronic paper they're writing on? Testing removes guessing, and when you remove that guessing, you're removing mistakes. Everyone makes mistakes. It's part of everything we do. We make mistakes all the time. What matters is how we recover from them, how we respond to them. And when you're doing this, just make sure you have permission and approval, because otherwise you can also have a pretty bad day. So we get to another use case: gap hunting. Gap hunting relates to all of this because, as you're finding out whether your tools are running and what you're missing, you're in a really good position to identify what you don't see. You're able to know what you don't detect. There's a lot of overlap here with something like detection engineering, but there's still a difference. We're not really worried about what we're detecting yet, because the primary goal is to find out how systems respond, how they work, and to establish a baseline of behavior that we know can be detected, as well as to understand what our security stack sees and what it doesn't.
And the behavior detection leads to discovering gaps in understanding: what was missed? Did we expect to actually see it? Is there a configuration issue? Has something changed that we never thought about when we did the initial deployment? Does deviating from an expected attack technique result in reduced visibility? Does changing a delivery technique from one method to another, say PowerShell versus Python, produce false negatives? Is it something that is detected only? Is it something we can change in the future as we improve the environment? Can we go from detection-only to prevention too? If so, that's another huge win, and it can lead into the next use case: engineering. I apologize, I ran out of gifs. I think the child was maybe four years old when he drew that cat, and it was really one of the most frightening things I've ever seen. Imagine if that came to life. I don't know what to think about it. But all the other use cases can lead back into something for engineering, like the return on investment. Tooling and staffing aren't cheap. None of this is cheap. So it's great to know: are the things we're investing in, the tools and the people, paying off? Is everything working the way we want it to? Are the tools optimized? Do we have duplicative tooling? Maybe we've got five or six endpoint agents that all do the exact same thing. Do you need that many? Does one have extra capabilities the others don't? If not, do you need them all? Can you reduce complexity in the environment without reducing your security capabilities? How is your defense in depth? How many layers can fail before you're in trouble? Let's say you've got ten different layers of security and you know you can handle up to three of them failing. That's good to know, so you can determine: do we want 100% coverage in every single tool? Can we live without some?
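Gap hunting tends to produce exactly this kind of matrix: one technique, several delivery methods, and an observed outcome for each. A hypothetical sketch with invented technique names and outcomes:

```python
# Hypothetical gap-hunting matrix: (technique, delivery) -> observed outcome.
# "missed" pairs are visibility gaps; "detected" pairs are candidates to be
# promoted to prevention as the environment improves.
coverage = {
    ("credential_dump", "powershell"): "detected",
    ("credential_dump", "python"):     "missed",    # delivery change = false negative
    ("lateral_movement", "smb"):       "prevented",
    ("lateral_movement", "winrm"):     "detected",
}

def gaps(matrix: dict) -> list:
    """Technique/delivery pairs with no visibility at all."""
    return sorted(pair for pair, outcome in matrix.items() if outcome == "missed")

def promotion_candidates(matrix: dict) -> list:
    """Detected-only pairs that might move from detection to prevention."""
    return sorted(pair for pair, outcome in matrix.items() if outcome == "detected")

print(gaps(coverage))                  # the PowerShell-vs-Python false negative
print(promotion_candidates(coverage))  # the detection-to-prevention wins to chase
```

Keeping the matrix keyed by delivery method, not just technique, is what surfaces the "slight variable tweak bypasses the tooling" class of gap described above.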
Can we be 80% effective, like the bugs in Starship Troopers when Doogie Howser was showing combat effectiveness? Can you do that? And that also leads into measuring your environment. Without measurement, you can't really show progress. It's like the training montage from Footloose. The guy on the left goes from not being able to snap to getting a standing ovation from Kevin Bacon. I don't remember his character's name, but I guess just Kevin Bacon is going to be enough for me. Measurement is going to help when you stand up and say: here's where we were last quarter, here's where we are now, here's where we've improved, here's what I think should be targeted, and here's where we hope to get. You can start planning out your roadmap: where you were, where you are, where you're going. Audits become a lot easier when metrics are constantly being generated. Instead of saying "we need to show this one point in time when something worked," you can say: here's how you get the data, we've been running this for this long, we have consistent reporting all the way through, every week. Instead of a snapshot in time showing compliance that one time way back when, you are always showing compliance, because you are proving that you are secure, and all the data can be used to support the people you work with in improving the entire security program. It helps you make data-driven decisions on what to focus on and what needs help, and provides evidence that what used to work still works. There are some risks. Sometimes people don't care. It could be the way it's presented. But sometimes there are other problems. If you don't do it the right way, telling people about problems can make them feel bad, and then you get the pushback. So what I use is more of a framework for the issues. Here's what I see: maybe it's incomplete metrics, maybe it's alerts without context. Here's my interpretation. And then ask: what am I missing?
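The quarter-over-quarter story falls out of the daily runs almost for free. A minimal sketch, with invented quarters and results, of turning pass/fail history into the trend you'd show leadership or an auditor:

```python
# Hypothetical sketch: roll daily pass/fail experiment results up into a
# quarter-over-quarter detection rate for reporting.
def detection_rate(runs: list) -> float:
    """Fraction of runs where the expected alert fired."""
    return sum(runs) / len(runs) if runs else 0.0

# Each True/False is one scheduled run's outcome (values are illustrative).
history = {
    "2020-Q3": [True, False, True, True],
    "2020-Q4": [True, True, True, False, True],
}

trend = {quarter: round(detection_rate(r) * 100) for quarter, r in history.items()}
print(trend)  # → {'2020-Q3': 75, '2020-Q4': 80}
```

Because the same experiments run continuously, the audit answer changes from "here's the one time it worked" to "here's the rate, every quarter, with the raw runs behind it."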
Here's what I think that means. What do you think? It could be something people already know about, something that's a work in progress. It's important not to go in there hair on fire, pants on your head. Go in with a collaborative effort, a collaborative spirit, and then deliver the one message you want them to walk away with. Why should they believe this message? What confidence do you have? Provide all of that, and it makes it a lot easier to have a good working relationship and move the needle, as they say. And my final message is: don't worry, because if I can do it, you can do it. Just like that. So here's the link to the Security Chaos Engineering book. You can download it for free, compliments of Verica. And if you're also interested in chaos engineering, here's a link to both books. Great. Thank you very much.

Aaron Rinehart

CTO @ Verica


David Lavezzo

Director of Cyber Chaos Engineering @ Capital One

