Conf42 DevSecOps 2022 - Online

Red Teaming AWS: Practice What You Preach


Abstract

Culture is defined through action, not words. Learn how we built an enduring, security-focused culture at a Cloud Native consultancy through a surprise red team event. See how we created a company myth that is still the beating heart of the security practice, driving continuous improvement.

Summary

  • Josh: How do we make it so people learn things ahead of time, or as early as possible? Make the learning cheap, and keep asking how we can do better. For a consultancy, you shouldn't be telling clients to do what you're unwilling to do yourselves.
  • Every company has an incident response process that outlines what should happen in the instance of a breach or a potential breach. We made sure we planned out our red team approach. How do we measure performance in this? Have we got better, or have we stayed the same?
  • The last known unknown for me is what lenses people possess. Can they look at things through a security lens? Can they understand it that way? And when you think about unknown unknowns, chaos engineering is the classic example. It's about improving everyone together.
  • We disabled senior members' access to our cloud environments. This was an opportunity for us to channel learning through our more junior members of staff. You want to see that when something serious is going on, people will down tools and get involved.
  • KMS is one of the fundamental services within AWS, and similarly in GCP or Azure. It is such a key part of actually being able to protect data within an account, a subscription or a project. The red team was in the environment for about an hour and a half before they were finally contained. Then it was time to move from containment to remediation.
  • The difference between a novice and a master is like a chess game. Novice players look at the board in pieces; the master sees patterns. Would you rather have a novice next to you or a master? The choice is yours.

Transcript

This transcript was autogenerated.
Over the next half hour, I'm going to talk through a story from when I worked at a cloud consultancy firm: how we red teamed ourselves, how we hacked our own AWS environment to see what would happen and what we'd learn along the way. To see whether people knew our security policies and procedures as well as we thought they did. To see who would step up given the chance.

So I'm going to kick off today with a wonderful Japanese word, tsujigiri, which is the closest embodiment of the spirit with which we undertook that fateful day. For those who are unaware of the meaning, it means trying out a new samurai sword on a random passerby, a practice Japanese samurai used to have for testing out new swords. And the random passerby bit was the interesting bit here, because the red team we had for the day knew what was going to happen. The blue team didn't know they were a blue team until everything started. We'll have a look at exactly what happens when you do that as we go along.

Now, a little bit about me. I'm Josh. I'm a distinguished technologist at Contino. I'm the author of the Cloud Native Security Cookbook with O'Reilly, a HashiCorp ambassador and an AWS ambassador. I do lots of stuff in cloud, and I write and run my mouth a lot.

The overarching thing we were trying to do with this red teaming exercise was shift learning left. How do we make it so people are learning things ahead of time, or as early as possible? Make the learning cheap, and keep asking how we can do better. You really don't want all your employees figuring out the incident response process while a real incident is going on. That's far too late to be learning the ins and outs and to get that experience. When you look at the ubiquitous curve of time versus cost, it's very expensive to have people learning during an active security event. What you want is people who know what they're doing, who've been around before. Yes, of course you're going to have to do some exploration, there will always be new things, but a solid foundation of what the process is and what tools we have matters. Making sure our security approach is robust, rigorous and resilient, and gives everyone what they need, is really important. And for a consultancy, you shouldn't be telling clients to do what you're unwilling to do yourselves. You shouldn't be telling them to adopt principles and approaches that you won't adopt yourself. So this was really a chance for us to put our best foot forward and see what happened.

Now, to kick off, I'm just going to go through a few mental models, some of which you may have come across before. This is a fairly classic one about different kinds of knowledge: you have known knowns, known unknowns and unknown unknowns. Known knowns are the things you know you know, and these are principles like "security is everyone's responsibility". We all know that; at least we did at this consultancy. We all knew it was all of our responsibility. But what does it actually look like when the rubber meets the road? Pretty much every company on earth has an incident response process that outlines what should happen in the instance of a breach or a potential breach. For us, this was pre-COVID as well, and being a consultancy firm we were generally geographically dispersed across the city in which we operated. So for this, we actually brought everyone together.
We didn't want to make it too challenging for everyone, and we wanted to bring people together so the bandwidth of communication was really high and people could chat face to face. We weren't doing this all over Slack, and we weren't getting distracted while on client site. So we brought everyone together, everyone in the same room.

We also thought we had a good idea of the expected avenues: when we started the red team event, there were things we expected the blue team would do, things they would try. Naturally, us being the masochists that we were, we decided to disable some of these by default. One of those things was making it so the most senior people, who were going to be on the blue team side, had their access broken. We also made sure we planned out our red team approach. I don't think it technically classifies as OSINT, but you would expect a real threat actor to map out the environment and understand, potentially, the way the blue team is going to respond to things.

All right, on to known unknowns, the things we knew we didn't know. First up: how do we measure performance in this? This isn't so much measuring performance against anyone else, but more to give us a benchmark for when we do it again. Have we got better? Have we, as a team and as a company, improved over time, or have we stayed the same? Have we got worse? So we picked some measures; feel free to steal them if you try this kind of thing yourself. The first was time to identify: from when the red team event started, how long was it until the breach was identified, recognized, and called out that there was something weird going on in the environment? The second was time to contain: from once we were identified, how long was it until the blue team got control and managed to get the red team out of all systems? A very interesting one was percentage of intrusion detected: of all the things we managed to do as a red team, how many were found? And that was not just during the event, but also including some cleanup time afterwards. How much did people find? Of the things we did as the red team, were we able to leave things lying around in the cloud that people weren't aware of, that could have been the site for a follow-up breach?
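A minimal sketch of how you might track those three measures follows. The timestamps and counts below are made-up example values, not the numbers from our day:

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> float:
    """Minutes between two ISO-8601 timestamps, e.g. '2019-01-01T10:03:00'."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60

# Made-up timeline, for illustration only.
event_started = "2019-01-01T10:03:00"    # red team trips the wire
incident_called = "2019-01-01T10:15:00"  # blue team calls it an incident
red_team_out = "2019-01-01T11:31:00"     # red team locked out of everything

time_to_identify = minutes_between(event_started, incident_called)
time_to_contain = minutes_between(event_started, red_team_out)

# Log every red team action somewhere private during the event, then mark off
# what the blue team found, including anything found during cleanup afterwards.
red_team_actions, actions_found = 30, 20  # made-up counts
percentage_detected = 100 * actions_found / red_team_actions

print(f"Time to identify:   {time_to_identify:.0f} minutes")
print(f"Time to contain:    {time_to_contain:.0f} minutes")
print(f"Intrusion detected: {percentage_detected:.0f}%")
```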
Next: who will take responsibility? The red team was made up of myself and one of the more senior engineers in the company, and the entire leadership team, the three directors at the company, knew what we were doing. What actually happened on the day was that we all went somewhere else to run the red team event. The blue team was left in the office, but the leadership team and the two of us on the red team were elsewhere. And this was a really interesting thing to see: what would happen if we left a leadership vacuum? Who would stand up and take control of the situation? Who would take up the mantle of leadership on the day?

Another one, and this is a rule of thumb I subscribe to, is that processes generally break around 3x. When I joined this company it was about ten people, and by the time of this exercise we were about 30. We had a feeling that maybe our processes weren't fit for purpose anymore. And rather than try to come up with new processes just out of our heads, we figured: how about we actually try some things and build the processes around what we find, so they're based on evidence rather than just what we think?

And the last known unknown for me is a really interesting one that I think about a lot: what lenses do people possess? Developers generally have a good development lens; they might not have a good operational lens or a good security lens. They look at problems a particular way, and they build solutions of a particular style, because of the nature of their experience and what they do. And I always find it interesting when you can find the generalists, the T-shaped or M-shaped people or whatever letter we're using nowadays. Can they look at things through a security lens? Can they understand it that way? Can they empathize with security? This was an experiment on my behalf to find out what security skills we had in the business. Did we have people with that inclination? Did people have the ability to switch? Because we were made up mostly of developers; we worked as DevOps teams, you build it, you run it. That was our table stakes, what we did every day. So this was a really interesting chance to find out who we could cultivate into security champions, because they're inclined that way and have the ability to see the world through that lens.

Then there are the unknown unknowns. We knew going into this that we'd discover answers to questions we didn't even know we had. You just find all these things. And when you think about unknown unknowns, chaos engineering is the classic example nowadays: you break things on purpose to see what you find out. You don't necessarily know the question up front, but you decide to test things and see what happens. This was very much not the systems chaos engineering of Netflix fame, but a different kind of chaos engineering, and a really, really interesting day it was.

Naturally, we were set some rules of engagement; there were limits to what we were allowed to do. The CEO put a financial limit on what we, as the red team, were able to do. We couldn't just go and spin up some crypto mining, and bearing in mind this was 2019, crypto mining was quite valuable back then. And a big thing was making sure that we didn't overly demoralize, that we didn't make it too hard for the blue team to counteract. When you're doing these things, you win together or you lose together. There's no "blue team won" or "red team won". It's about improving everyone together, a more holistic, wholesome approach to moving forward.

All right, and with all that, we'll shift on to the day itself. It started at 10:00 a.m., because be kind: let people have their caffeine, let it kick in, make sure the coffees are nicely imbibed before you kick off and start channeling a huge amount of stress into people. A really important thing we set out with, and I alluded to it before, is that we disabled the senior members' access to our cloud environments. This was an opportunity for us to channel learning through our more junior members of staff. Rather than have all the senior engineers sit in a group of five while the 15 more junior people end up pushed to the side, we set it up so the only people with access to affect change on the environment were the more junior members of staff. So, very much in a pair programming style, the seniors had to channel their ideas and their expertise through the more junior people.
We were actually able to get this really nice passage of knowledge going through them, so everyone got to learn. It didn't end up with the most senior people having it handled while everyone else sat back. We really didn't want that to happen; we really wanted the more junior staff to feel involved.

So at 10:03 we tripped the wire, so to speak. Being a serverless-first consultancy, as they still are to this day, we had alerts set up so that if anyone booted up a virtual machine in our AWS environment, it would trigger an alert in Slack saying something weird is going on, because we shouldn't be doing that. So we booted up a virtual machine. At the same time, we sent out a phishing email that we'd created, just to see what would happen and give them a chance to figure out that something was going on.
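I don't have the original alert configuration to hand, but a minimal sketch of that kind of tripwire, assuming EventBridge and an SNS topic feeding Slack (the rule name, topic name and wiring here are made up), looks roughly like this:

```python
import json
import boto3

events = boto3.client("events")
sns = boto3.client("sns")

# SNS topic that, in a setup like this, would have a Slack webhook subscriber.
# You would also need a topic policy allowing events.amazonaws.com to publish.
topic_arn = sns.create_topic(Name="security-tripwire-alerts")["TopicArn"]

# In a serverless-first shop, any EC2 instance starting up is suspicious.
events.put_rule(
    Name="alert-on-ec2-launch",
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["pending", "running"]},
    }),
    State="ENABLED",
    Description="Tripwire: we should never be running virtual machines",
)

events.put_targets(
    Rule="alert-on-ec2-launch",
    Targets=[{"Id": "notify-slack-topic", "Arn": topic_arn}],
)
```

The general idea is that anything your team never legitimately does makes a cheap, high-signal tripwire.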
And six minutes later, six minutes later, someone did notice what was going on. Someone called out that, hey, something's looking a bit fishy. We could see it on Slack: they were messaging, "is anyone trying to do something? Because this looks a bit weird", and they started asking for help. A minute later, someone came to help them. Now, interestingly, we killed both those people's access as soon as they started trying to do anything.

You might be wondering at this point how we knew what was going on in the room. We actually had a fly on the wall: one person on the blue team side who was aware ahead of time that something was going to happen, and who acted as a communication conduit between ourselves and the room. Like I said before, we didn't want to overly stress people. We wanted to make sure this was something that would be remembered, at least mostly fondly, and that we weren't pushing too hard. At the same time, we weren't making it too easy for them either. We wanted them to be stretched. We wanted them to try. I wanted it to be a challenge.

Two minutes after the second pair of hands arrived, it was called out on Slack that this was not a drill, something was actually happening, and everyone needed to down tools and help, to mob around this problem. Just exactly what we wanted to see. You want to see that when something serious is going on, people will down tools, get involved and pitch in to help. Five minutes after that, the CEO was called. And I note this purely because "call the CEO" was number one in our incident response process. So from the initial moment something was noticed to actually getting to step one was eight minutes. And I had the lovely opportunity to see the CEO pick up his phone, look at it, put it back down on the table, and go back to drinking his coffee without a word.

This was something we had thought about a little bit, but we didn't really know how it was going to go: what structure was going to evolve organically out of the blue team? With everyone in the office, what structure were they going to form to combat what was going on? The initial version looked like this: you had Paul, who was the first person to notice anything was going on, and he went, okay, I'm going to take ownership of this situation. And he had a whole bunch of people beneath him, reporting into him. Naturally, what happened was that there was too much going on, too much communication, and he was trying to hold everything in his head. And bear in mind, it wasn't eight people talking to him, it was about 20. When you've got that many people all trying to report into one person, in a high-pressure environment with a lot going on, we all know that's not going to work. It's not possible.

So a little bit later on, they had to stop and regroup. I think Paul decided his head was on fire enough and went, okay, let's actually stop, think about this, and figure out what we're going to do. They ran an access poll to figure out who still had access to AWS and who could still do things, so they could understand how much they could parallelize and how best to set themselves up to approach the problem. They also ended up adopting a duality of leadership and communication roles. Instead of one leader with everyone reporting into him, there was Paul, the leader, who took ownership of everything, and then Zinab, who had been the second pair of hands in the first place, stepped into a communication facilitator and filter role. She would take all the information, filter it down, and pass it to Paul so he could make the decisions and calls he needed to make. That filtering of information actually allowed him to take more ownership and understand what was going on.

Four minutes after that, they realized that Pete, the fly on the wall, was not being very helpful and was just sitting there. Once they put him under a bit of interrogation, they realized that he knew what was going on and really wasn't there to help, and they kicked him out, which was more than fair enough.

Seven minutes after that, I actually managed to break out of AWS. I found one of our engineers' GitHub credentials sitting there in Parameter Store. So naturally I started using those credentials, and I created a lot of private repos with funny names under his GitHub account. And this came down to, for me, one of the core components of AWS that people don't talk about enough, although I feel like they talk about it more than they used to: KMS. What it turned out in the end, when we went back and I had a chat to the engineer whose GitHub credentials I'd got, was that he thought he'd done the right thing by putting them in Parameter Store, and he had encrypted them with KMS. He just hadn't set up the KMS key policy properly, so the key was open to any principal within the account. There are always these learnings. And for me, KMS is one of the fundamental services within AWS, and similarly in GCP or Azure, that you have to get really, really comfortable and good with, because it is such a key part of actually being able to protect things within an account or a subscription or a project, depending on your cloud of choice.
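To show the shape of that problem, here's an illustrative sketch rather than the actual parameter or policy from the day; the parameter name, account ID and role are made up. If the key policy delegates use of the key to the whole account, any principal whose IAM policies allow ssm:GetParameter and kms:Decrypt can read the secret:

```python
import boto3

ssm = boto3.client("ssm")

# With an over-broad key policy, any principal in the account whose IAM
# policies grant ssm:GetParameter and kms:Decrypt can read this SecureString.
secret = ssm.get_parameter(
    Name="/tooling/github/token",  # hypothetical parameter name
    WithDecryption=True,
)["Parameter"]["Value"]

# Roughly the shape of the over-broad key policy statement: delegating key
# use to the account root hands control back to IAM, so the key adds little
# protection on top of whatever IAM already allows.
too_broad = {
    "Sid": "Allow use of the key",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
    "Action": ["kms:Decrypt"],
    "Resource": "*",
}

# A tighter statement scopes decryption to the one role that needs the secret.
scoped = {
    "Sid": "Allow decrypt for the CI role only",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/ci-deploy"},
    "Action": ["kms:Decrypt"],
    "Resource": "*",
}
```

The two dictionaries are key policy statement fragments; in practice the scoped one would sit alongside the usual key administration statements.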
So at 10:50, 50 minutes after we started, it went all quiet on the western front. They had realized that, seeing as this was something happening internally, conversing on Slack, where we, the red team, could also see it, probably wasn't the best of ideas on their behalf. So instead they moved to a Google group, if memory serves, and started chatting there so they could cut us out of the loop and we wouldn't know what was going on. Naturally, we still wanted eyes and ears in the room so we could understand what was happening, so we sent in reinforcements: the COO and the CTO of the company. The blue team weren't going to kick them out of the room. And they also wanted to get in there and help Paul and Zinab, who had been taking on the majority of the workload and the stress, to take some of that stress off and, again, make sure we weren't pushing people too hard.

At 10:57 came the false containment. The blue team thought they had got us. The CTO and COO asked, do you have it under control? Yeah, we think we've got them out of the systems. No, they had not. About 18 minutes after that, we got more brazen in what we were doing, putting stuff right in their faces where we knew they were looking, just so they'd realize, no, we were still in. We still had accounts we were accessing, and all that kind of stuff. Five minutes after that, we did send them a photo of our faces on Slack, which, as you would imagine, copped quite a bit of written abuse, all in good taste and all in good fun. And eight minutes after that, they did actually manage to contain us. So it was about an hour and a half to get us fully contained, and we lost access to everything.

A little bit after that, and we took our sweet time, we were just next door in a coffee shop, we walked back to the office, and there were hand gestures made at us as we came in the door, which is quite a thrill when it's 25 people all at once. But we came back because it was time to move from containment to remediation. Now that they had locked us out, we wanted to help with the cleanup. We'd done things in lots of places, and we wanted to help the blue team find what we'd done, to a large degree, and make sure they were cleaning things up. Not necessarily telling them where everything was, but giving them leads and clues so they could figure it out.

So: kaizen, change for good, continuous improvement, all that good stuff from lean theory. That was what the second half of the day was based around: how do we understand and reflect on the experience of that morning, and work out where our problems were, where our gaps were, where we did really well, where we didn't do so well, and the opportunities for improvement?

Scores on the doors. Time to identify was twelve minutes, from us starting to do things to them actually calling that an incident was happening. They did get a sniff of it a few minutes earlier and realized something was going on, but it was twelve minutes to the call. Time to contain was an hour and 28 minutes, so about an hour and a half to get us locked out of all systems. Percentage of intrusion detected: they caught about two thirds. We ran up a tally as the red team, making notes of every single thing we did in a private Slack channel, just to make sure that when we stepped back through, we could find everything.

One of the interesting questions we got as we went through this process was, "oh, but that wasn't a realistic scenario", which was an interesting question to get when we'd spent a lot of time thinking about it and making sure it was as realistic as possible. The initial breach was one person's set of credentials, and it just went from there. I think people are slowly realizing that security problems and breaches are a matter of when, not if. But making it feel real for people for a little bit was, I think, important.
And then we could talk about it: no, this was perfectly realistic, this could well happen to us. It's like the old line that backups never fail, it's restores that do. If you've got a security process and approach that you think is valuable, test it. If you're not testing it with realistic scenarios, then you don't really have a process at all, the same as if you take backups but never restore them, you don't really have backups, do you?

The next thing is something I'd been reading about at the time, and it helped me reason about why, as the red team, we felt we were continually one step ahead of the blue team. Yes, they did catch us in an hour and a half, but to a fair degree we let them catch us; we didn't want to make it too hard, and we didn't want to spend all day with them chasing us. There are diminishing returns to these things. The loop comes from "Forty-Second" John Boyd, who was a fighter pilot and trainer, and who used this loop to describe how he was able to beat people in dogfights, where he was nigh unbeatable. The idea is a loop you go through: observe, orient, decide, and act. The OODA loop. First you observe, then you orient, then you decide, then you act. And the idea with this loop is that the faster you go through it, the more you can outmaneuver and outperform the person you're up against.

What we found, especially from a blue team observability perspective, was that the blue team just didn't have a good idea of what was going on. They couldn't see what was happening, they couldn't find what was happening, everything was very manual for them, and there was just a lack of tooling. They didn't have the right tools to be able to fight back, because as the red team we knew roughly where they were, we were running out ahead, we were dictating the pace of everything, and we could create things faster than they could find things. At that point it's just an exponential curve where you outrun them.

And the painful bit about this lack of tooling was that we did have tooling, sitting on random people's laptops from client engagements: visibility pieces and other bits and pieces that we needed. It had just never been put back into common repositories. It wasn't shared. It only came up after the fact: "oh, well, I've got something that does that." Why not share all those kinds of things? It's one of those situations where sometimes you need a catalyst to give you a bias to action and to find where the holes are.

One thing we definitely did do was capture every idea, every gap, everything that we came up with through this process, and naturally they became Jira tickets in the backlog, because how else do you capture your best intentions and best wishes? With our bench capacity going forward, it became like a rite of passage to pick things off this backlog and work through them, because the story became a myth, a legend. I haven't worked at this company in three years, yet if I walk in the door, people I've never met before, new employees, know who I am and know what I did on that day. It's an interesting legacy to have. And it became the beginning of a tradition. From there, we ran an internal CTF. We ran public CTFs.
We did more of these security game days, workshops and red team events. It became something of a tradition that is still really strongly cherished at that company. It was something that brought security up to being a first-class citizen, a thread in the weave of the culture of the company that's just never going to go away now.

And I'll leave you with a final thought, which is: why red team? Why do this? I'm going to dip into a very quick chess analogy, which is the difference between a novice and a master. Novice chess players look at the board in pieces. They see every piece individually, and they have to hold all of that in their head. The master looks at the board and sees patterns, things they've seen before. That's how they're able to move, understand the game, be as good as they are, and be 100,000 times better than a novice. A master chess player can play a novice a thousand times and not lose. And really, what it comes down to is this: if I were in the trenches with someone during a real security incident, would I rather have a novice next to me or a master? The choice is yours.

Josh Armitage

Distinguished Technologist @ Contino



