Transcript
Over the next half hour, I'm going to talk through a story from when I worked at a cloud consultancy firm: how we red teamed ourselves, how we hacked our own AWS environment to see what would happen and what we'd learn along the way. To see whether people knew our security policies and procedures as well as we thought they did. To see who would step up given the chance. So I'm going to kick off today with a wonderful Japanese word, tsujigiri, which is the closest embodiment of the spirit in which we undertook that fateful day.
And for those who are unaware of the meaning, it means trying out a new samurai sword on a random passerby, which is a practice that Japanese samurai used to have; it was how they'd test out new swords. And this random passerby bit was the interesting bit here, because the red team that we had for the day knew what was going to happen. The blue team didn't know they were a blue team until everything started. And we'll have a look at exactly what happens when you do that as we go along. Now, a little bit about
me. I'm Josh, a distinguished technologist at Contino. I'm the author of the Cloud Native Security Cookbook with O'Reilly, a HashiCorp ambassador and an AWS ambassador. I do lots of stuff in cloud, and I write and run my mouth a lot.
So the overarching thing we were trying to do with this red teaming exercise was shift-left learning. How do we make it so people are learning things ahead of time, or as early as possible? Make the learning cheap, and how can we do better? You really don't want all your employees working out what the incident response process is while a real incident is going on. That's very much too late to be learning the ins and outs and to get that experience.
When you look at this ubiquitous curve of time versus
cost during an active security
event, it's very expensive to have people learning. What you
want is people that know what they're doing who've been around before. Yeah, of course,
you're going to have to do some exploration; there's always going to be new things. But you want a solid foundation of what the process is and what tools you have. Making sure that our security approach is robust and rigorous and resilient and gives everyone what they need is really important. And for a consultancy, you shouldn't be telling clients to do what you're unwilling to do yourselves; you shouldn't be telling them to adopt principles and approaches that you won't adopt yourself. So this was really a chance for us to put our best foot forward and see what happened.
Now, to kick off, I'm just going to go through a few mental models, some of which you may be aware of already. This is a fairly classic one about different kinds of knowledge. You have known knowns, known unknowns and unknown unknowns. Now, known knowns are the things you know you know, right? And these are some principles, like security is everyone's responsibility. We all know that. At least we did at this consultancy. We all knew that it was all of our responsibility. But what does it actually look like when the rubber meets the road?
Every company on earth pretty much has an incident response process that outlines what should happen in the event of a breach or a potential breach. For us, this was pre-Covid as well. Being a consultancy firm, we were generally geographically dispersed across the city in which we operated. So for this, we actually brought everyone together. We didn't want to make it too challenging for everyone, right? We wanted to bring people together so the bandwidth of communication was really high and people could chat face to face. We weren't doing this all over Slack. We weren't getting distracted while on client site. So we brought everyone together for this. Everyone in the same room.
We also thought we had a good idea of the expected avenues of response: when we started the red team event, there were things we expected the blue team would do, things they would try. Naturally, us being the masochists that we were, we decided to disable some of these by default. One of those was breaking the access of the most senior people who were going to be on the blue team side. We made sure we planned out our red team approach. I don't think it technically classifies as OSINT, but you would expect that a real threat actor would map out the environment and understand, potentially, the way that the blue team is going to respond to things.
All right, on to known unknowns, the things we knew we didn't know. For this: how do we measure performance? This wasn't so much about measuring performance against anyone else, but more just to give us a benchmark for when we do it again. Have we got better? Have we, as a team and as a company, improved over time, or have we stayed the same? Have we got worse? So we picked some measures; feel free to steal them if you try this kind of thing yourself. The first was time to identify: from when the red team event started, how long was it until the breach was identified, recognized, and called, that there was something weird going on in the environment? The second was time to contain: from once we were identified, how long was it until the blue team got control and managed to get the red team out of all systems? Very interesting. The third was the percentage of intrusions detected: of all the things we managed to do as a red team, how many were found? And this was not just during the event, but also included some cleanup time afterwards. How much stuff did people find? Of the things we did as the red team, were we able to leave things lying around in the cloud that people weren't aware of, that could have been the seed for a follow-up breach?
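As a minimal illustration of how those three measures could be tracked (not the approach actually used on the day), you only need the timestamps of the key moments plus the red team's own tally of actions. All values below are placeholders.

```python
# Sketch: the three measures from timestamps and a red-team tally.
from datetime import datetime

event_start = datetime.fromisoformat("2019-01-01T10:03:00")      # red team begins
incident_called = datetime.fromisoformat("2019-01-01T10:15:00")  # breach called
red_team_locked_out = datetime.fromisoformat("2019-01-01T11:31:00")

time_to_identify = incident_called - event_start
time_to_contain = red_team_locked_out - incident_called  # or measure from event_start

red_team_actions = 30        # every action the red team logged
actions_found_by_blue = 20   # found during the event plus cleanup
percentage_detected = actions_found_by_blue / red_team_actions

print(time_to_identify, time_to_contain, f"{percentage_detected:.0%}")
```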
Who would take responsibility? So, with this, the red team was made up of myself and one of the more senior engineers in the company, and the entire leadership team, the three directors at this company, knew what we were doing. And what actually happened on the day was we all went somewhere else to do the red team event. So the blue team was left in the offices, but the leadership team and the two of us on the red team were elsewhere. And this was a really interesting thing to see: what would happen if we left a leadership vacuum? Who would stand up and take control of the situation? Who would take up the mantle of leadership, as it were, on the day?
Another thing, a kind of rule of thumb that I subscribe to, is that processes generally break at around 3x growth. Now, when I joined this company, it was about ten people, and by this point we were about 30. We had a feeling that maybe our processes weren't fit for purpose anymore. And rather than try to come up with new processes just out of our heads, we figured: how about we actually try some stuff and build the processes around what we find, basing them on evidence rather than just on what we think?
And the last known unknown for me is a really interesting one that I think about a lot: what lenses do people possess? Developers generally have a good development lens; they might not have a good operational lens or a good security lens. They look at problems a particular way, they build solutions of a particular style, because of the nature of their experience and what they do. And I always find it interesting when you can find the generalists, or the T-shaped or M-shaped people, or whatever letter we're using nowadays. Can they look at stuff through a security lens? Can they understand it that way? Can they empathize with security? Can they look at things in that way? And this was an experiment on my behalf to find out: what security skills do we have in the business? Do we have people with that inclination? Do people have that ability to switch? Because we were made up of mostly developers; we worked as DevOps teams, you build it, you run it. That was our table stakes, what we did every day. So this was a really interesting point: to find out which people we could cultivate into security champions, because they're just that way inclined and they have that ability to see the world the right way.
And lastly, there are the unknown unknowns. We knew going into this we would discover answers to questions we didn't even know we had. You just find all these things. And when you think about unknown unknowns, chaos engineering is the classic example we do nowadays: you break things on purpose to see what you find out. You don't necessarily know the question up front, but you decide to test things and see what happens. This is very much not the systems chaos engineering of Netflix fame, but it made for a really, really interesting day.
Naturally, we were set some rules of engagement. There were some limits to what we were allowed to do. The CEO naturally put a financial limit on what we, as the red team, were able to do. We couldn't just go spin up some crypto mining, and bear in mind, this was 2019, so crypto mining was quite valuable back then as well. And a big thing was making sure that we didn't overly demoralize the blue team, that we didn't make it too hard for them to counteract. When you're doing these things, you win together or you lose together. There's no blue team win or red team win. It's about improving everyone together, that kind of holistic, more wholesome approach to moving forward.
All right. And with all that, we'll actually shift on to the day itself. It started at 10:00 a.m., because: be kind. Let people have their caffeine, let it kick in. Make sure the coffees are nicely imbibed before you kick off and start channeling a huge amount of stress into people. A really important thing we set out with, and I kind of alluded to it before, was that we disabled senior members' access to our cloud environments. This was an opportunity for us to channel learning through our more junior members of staff.
So rather than have all the senior engineers sit in a group of five and the 15 more junior people end up pushed to the side, we set it up so the only people with access to effect change on the environment were the more junior members of staff. So, very much in a pair programming style, the seniors had to channel their ideas and their expertise through the more junior people. We were actually able to get this really nice passage of knowledge going through them, and everyone got to learn. It didn't end up as, well, the most senior guys have got it, we'll just sit back. We really didn't want that to happen. We really wanted the more junior staff to feel involved. Right.
So at 10:03, we tripped the wire, so to speak. Being a serverless-first consultancy, as they still are to this day, we booted up a virtual machine in our AWS environment, because we had alerts set up so that if anyone booted up a virtual machine, it would trigger an alert in Slack to say: something weird is going on, we shouldn't be doing this. At the same time, we sent out a phishing email that we'd created, just to see what would happen. Give them a chance to figure out that something's going on.
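The talk doesn't spell out how that alert was wired up, but as a rough sketch, one common way to build this kind of tripwire is an EventBridge rule that matches EC2 RunInstances calls recorded by CloudTrail and invokes a small Lambda that posts to a Slack incoming webhook. The SLACK_WEBHOOK_URL environment variable below is a hypothetical name, not something from the actual setup.

```python
# Sketch of a Lambda handler behind an EventBridge rule matching CloudTrail
# events such as:
#   {"source": ["aws.ec2"], "detail-type": ["AWS API Call via CloudTrail"],
#    "detail": {"eventName": ["RunInstances"]}}
import json
import os
import urllib.request


def handler(event, context):
    detail = event.get("detail", {})
    caller = detail.get("userIdentity", {}).get("arn", "unknown principal")
    region = detail.get("awsRegion", "unknown region")
    message = {
        "text": f"EC2 RunInstances called by {caller} in {region}. "
                "We're serverless-first; nobody should be booting VMs."
    }
    # Post the warning to a Slack incoming webhook.
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```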
And six minutes later, someone did notice what was going on. Someone called out that, hey, something's looking a bit fishy. We could see it on Slack that they were messaging: is anyone trying to do something? Because this looks a bit weird. And they started asking for some help, and a minute later, someone came to help them. Now, interestingly, both those people's access we killed as soon as they started trying to do anything.
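The talk doesn't go into how that access was cut, but as a hedged sketch, a red team holding admin rights could lock an IAM user out along these lines (the policy name is purely illustrative):

```python
# Rough sketch: attach an inline deny-all policy (an explicit deny overrides
# any allows) and deactivate long-lived access keys so CLI/SDK calls stop too.
import json
import boto3

iam = boto3.client("iam")

DENY_ALL = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Deny", "Action": "*", "Resource": "*"}],
}


def kill_access(user_name: str) -> None:
    iam.put_user_policy(
        UserName=user_name,
        PolicyName="red-team-lockout",   # illustrative name
        PolicyDocument=json.dumps(DENY_ALL),
    )
    for key in iam.list_access_keys(UserName=user_name)["AccessKeyMetadata"]:
        iam.update_access_key(
            UserName=user_name,
            AccessKeyId=key["AccessKeyId"],
            Status="Inactive",
        )
```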
You might be wondering at this point how we knew what was going on in the room: we actually had a fly on the wall. We had one person on the blue team side who was aware ahead of time that something was going to happen and acted as a communication conduit between ourselves and what was going on in the room. Like I said before, we didn't want to overly stress people. We wanted to make sure this was something that would be remembered, at least mostly fondly, and make sure that we weren't pushing too hard. At the same time, we weren't making it too easy for them either. We wanted them to be stretched. We wanted them to try. I wanted it to be a challenge. Right?
So two minutes after the second person came in with a pair of hands, it was called out on Slack that, okay, something's actually happening, this is not a drill. Everyone needs to down tools and help, we need to mob around this problem. Just exactly what we wanted to see, right? You want to see that when something serious is going on, people will down tools and get involved and pitch in to help. Five minutes after that,
the CEO was called. And I note this purely because "call the CEO" was step number one in our incident response process. So from the initial moment when something was noticed to actually getting to step one was eight minutes. And I had the lovely opportunity to see the CEO pick up his phone, look at it, put it back down on the table, and go back to drinking his coffee without a worry in the world.
So this was something that we had thought about a little bit. We didn't really know how it was going to go: organically, what structure was going to evolve out of the blue team with everyone in the office? What structure were they going to try to form to combat what was going on? And the initial version looked like this. You had Paul, who was the first person to notice anything was going on, and he went, okay, I'm going to take ownership of this situation. And he had a whole bunch of people beneath him, kind of reporting into him. Naturally, what happened there was that there was too much going on, too much communication. He was trying to hold everything in his head. And bear in mind, it wasn't eight people talking to him, it was about 20. So when you've got that many people all trying to report into one person in a high-pressure environment with a lot going on, we all know that's not going to work, right? That's not possible at all.
So a little bit later on, they had to stop and regroup. I think Paul decided his head was on fire enough and went, okay, let's actually stop and think about this and figure out what we're going to do. They ran an access poll to figure out who still had access to AWS, who could still do things, so they could understand how much they could actually action in parallel and how they could best set themselves up to approach the problem.
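The simplest version of that kind of access poll is just asking each person whether their current credentials still work. A minimal, purely illustrative sketch (not what the blue team actually ran) could look like this:

```python
# Minimal "do my AWS credentials still work, and who am I?" check.
import boto3
from botocore.exceptions import BotoCoreError, ClientError


def whoami() -> str:
    try:
        identity = boto3.client("sts").get_caller_identity()
        return f"OK: {identity['Arn']}"
    except (BotoCoreError, ClientError) as exc:
        return f"NO ACCESS: {exc}"


if __name__ == "__main__":
    print(whoami())
```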
They also ended up adopting a communication and leadership role duality. So instead of just one leader with everyone reporting into him, there was Paul, the leader, who took ownership of everything, and then Zinab, who had been the second pair of hands to help out in the first place. She ended up stepping into this communication facilitator and filter role: she would take all the information and filter it down for Paul, so he was able to make the decisions and calls that he needed to. That filtering of information actually allowed him to take more ownership and understand what was going on.
Four minutes after that, they realized that Pete, our fly on the wall, was not being very helpful and was kind of just sitting there. And once they put him under a bit of interrogation, they realized that he knew what was going on and really wasn't there to help, and they kicked him out. More than fair enough. Seven minutes after
that, I actually managed to break out of AWS. I found one of our engineers' GitHub credentials sitting there in Parameter Store. So naturally I started using those credentials, and I created a lot of private repos with funny names under his GitHub account as well. And this, for me, came down to an always interesting bit: one of the core components of AWS that people don't talk about enough, although I feel like people talk about it more than they used to, is KMS. What it turned out to be, at the end of the day when we went back and I had a chat to the engineer whose GitHub credentials I'd got, was that he thought he'd done the right thing: he'd put the credentials in Parameter Store and he had encrypted them with KMS. He just hadn't set up the KMS key properly, so it was open to any principal within the account. You know, there are always these learnings. And for me, KMS is one of the fundamental services within AWS, or similarly in GCP or Azure, that you have to get really comfortable and really, really good with, because it is such a key part of actually being able to protect things within an account or a subscription or a project, depending on your cloud of choice. Right?
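To make that concrete, here's a hedged sketch of the difference; the account ID, role name, key alias and parameter name are illustrative, not the actual ones from the engagement. A key policy statement that grants the account root principal kms:* delegates decryption to IAM, so any principal in the account with kms:Decrypt in its IAM policies can read a SecureString encrypted with that key. Scoping the decrypt permission to specific roles in the key policy is what closes that gap.

```python
import boto3

ACCOUNT_ID = "123456789012"                                 # placeholder
CI_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/ci-deploy"   # illustrative

# Too open: delegates key use to IAM, so any principal in the account that
# has kms:Decrypt in an IAM policy can decrypt with this key.
open_statement = {
    "Sid": "EnableIAMPolicies",
    "Effect": "Allow",
    "Principal": {"AWS": f"arn:aws:iam::{ACCOUNT_ID}:root"},
    "Action": "kms:*",
    "Resource": "*",
}

# Tighter: only the named role may decrypt with this key. (A real key policy
# also needs separate key-administration statements so you don't lock
# yourself out of managing the key.)
scoped_statement = {
    "Sid": "AllowDecryptForCIOnly",
    "Effect": "Allow",
    "Principal": {"AWS": CI_ROLE_ARN},
    "Action": ["kms:Decrypt"],
    "Resource": "*",
}

# Store the secret as a SecureString encrypted with a customer-managed key
# that carries the scoped policy above.
ssm = boto3.client("ssm")
ssm.put_parameter(
    Name="/ci/github-token",     # illustrative parameter name
    Value="<token goes here>",
    Type="SecureString",
    KeyId="alias/ci-secrets",    # illustrative alias of the scoped key
)
```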
So at 10:50, which was 50 minutes after we started, all quiet on the Western Front, they realized that, seeing as this was something happening internally, maybe conversing on Slack, where we, the red team, could also see it, probably wasn't the best of ideas on their behalf. So instead they moved to a Google group, if memory serves, and started chatting there so they could cut us out of the loop and we wouldn't know what was going on. Naturally, we still wanted eyes and ears in the room so we could understand what was going on, so we sent them reinforcements: the COO and the CTO of the company. They weren't going to kick them out of the room. And they also wanted to get in there and help Paul and Zinab, who had been taking on the majority of the workload and the stress, to take some of the stress off and just make sure that, again, we weren't pushing people too hard.
10:57, the false containment. The blue team thought they'd got us. The CTO and COO asked: do you have it under control? Yeah, we think we've got them out of the systems. No, they had not. About 18 minutes after that, we got more brazen in what we were doing, kind of putting stuff right in their faces where we knew they were looking, just so they'd realize, no, we were still in. We still had accounts we were accessing, and all that kind of stuff. Five minutes after
that, we did send them a photo of our faces on Slack, which, as you would imagine, copped quite a bit of written abuse, all in good taste and all in good fun. And eight minutes after that, they did actually manage to contain us. So it was about an hour and a half to get to full containment, and we lost access to everything. A little bit after that, we took our sweet time; we were just next door in a coffee shop. We took our sweet time going back to the office, and as we walked in, there were hand gestures made at us as we came back through the door, which is quite a thrill, having that done to you by 25 people all at once. But we came back because it was time to move from containment to remediation.
So now they had locked us out, and we wanted to actually help with the cleanup. We'd done things in lots of places. We wanted to help the blue team find what we'd done, to a large degree, and make sure they were cleaning things up, not necessarily telling them where everything was, but giving them leads and clues and all that kind of stuff so they could figure it out.
So, kaizen: change for good, continuous improvement, all that good stuff from lean theory. And this was what the second half of the day was based around: how do we understand and reflect on the experience of that morning? Where were our problems, where were our gaps, where did we do really well, where didn't we do so well, what were the opportunities for improvement, all that kind of stuff.
So, scores on the doors. Time to identify was twelve minutes: for them to go from us starting to do things to actually calling that an incident was happening. A few minutes earlier they did kind of get a sniff of it and realize something was going on, but it was twelve minutes. Time to contain was an hour and 28 minutes, so about an hour and a half to get us from in to locked out of all systems. Percentage of intrusions detected: they caught about two thirds. And we ran up a tally; as the red team, we were making notes of every single thing we did in a private Slack channel, just to make sure that when we stepped back through, we could find everything.
One of the interesting questions we got as we were going through this process was: oh, but that wasn't a realistic scenario. Which was an interesting question to get when we'd spent a lot of time thinking about it, making sure that we tried to make it as realistic as possible. The initial breach was one person's set of credentials, and it just kind of went from there. Sometimes I think maybe it's getting a little bit better; people realize that security problems and breaches are a matter of when, not if. But it felt real for people for a little bit, which I think was important, and then we could talk about it: no, this was perfectly realistic, this could well happen to us.
Right? It always comes back to: backups never fail, restores do. It's the same with having these processes and everything else. If you've got a security process and approach that you think is valuable, test it. If you're not testing it with realistic scenarios, then you don't have a process at all. Right? The same as if you take backups but never restore them: you don't really have backups, do you? This was
something I'd been reading about at the time we did this, and it helped me reason about why, as the red team, we felt we were continually one step ahead of the blue team. Yeah, they did catch us in an hour and a half, but to a fair degree, we let them catch us. Again, we didn't want to make it too hard; we didn't want to spend all day with them chasing us. There are diminishing returns to these things. And the loop comes from "Forty-Second" John Boyd, who was a fighter pilot and trainer, and who effectively had this loop he used to describe how he was able to beat people in dogfights, and he was nigh unbeatable. Right? And the idea is this loop you go through: observe, orient, decide, and act. The OODA loop. First you observe, then you orient, then you decide, then you act. And the idea with this loop is that the faster you go through it, the more you can outmaneuver and outperform the person you're up against.
And with our observability, especially from a blue team perspective, what we found was the blue team just didn't have a good idea of what was going on. They couldn't see what was going on. They couldn't find what was going on. Everything was very manual for them to find things, and there was just a lack of tooling. They didn't have the right tools to be able to fight back, because as the red team, we kind of knew where they were, we were running out ahead, and we were dictating the pace of everything, and we could create things faster than they could find them. And at that point, it's just an exponential curve where you outrun them. So this was a definite finding. And the painful bit was that with this lack of tooling, we had tooling on random people's laptops from client engagements, for visibility pieces and other bits and pieces that we needed, but it had just never been put back into common repositories. It wasn't shared. It was talked about after the fact: oh, well, I've got something that does that. Well, why not share all those kinds of things? Again, it's one of those things where sometimes you need this catalyst to give you a bias to action and to find where these holes are.
One of the things that we definitely did do was capture every idea, every gap, everything that we came up with through this process, and naturally, they became Jira tickets in the backlog, because how else do you capture your best intentions and best wishes? With our bench capacity going forward, it became like a rite of passage to pick stuff off this backlog and work against it, because the story became a myth, a legend. I haven't worked at this company in three years, yet if I walk in the door, people who I've never met before, new employees, know who I am and know what I did on that day. It's an interesting legacy to have. And, yeah, it became the beginning of a tradition. From there, we ran an internal CTF. We ran public CTFs. We did more of these security game days, workshops, red team events. It became something of a tradition that is still really strongly cherished at that company. It was something that really brought security up to a first-class citizen, made it part of a thread in the weave of the culture of the company that's just never going to go away now.
And I'll just leave you with a final thought, which is: why red team? Why do this? And I'm going to dip into a very quick chess analogy, which is the difference between a novice and a master. Novice chess players look at the board in pieces. They see every piece individually, and they have to hold all that in their head. The master looks at the board and sees patterns, things they've seen before. That's how they're able to move and understand how to play, and be as good as they are, and be so many times better than a novice. Right? A master chess player can play a novice a thousand times and not lose. And really, what it comes down to is: if I was in the trenches with someone during a real security incident, would I rather have a novice next to me or a master? The choice is yours.