Transcript
Hi everyone, thank you for joining me today for a session on how we created
a learning culture here at PerimeterX. When I say a learning culture, I mean
instilling a healthy culture of debriefs and positive discussions that
ultimately minimizes issues and failures internally. I'm excited to share our
journey with you here at this conference. My name is Amir Shaked. I lead the
research and engineering team here at PerimeterX, and I'm an experienced
breaker of production environments. We are a
software as a service company providing solutions to protect
modern web apps at scale. Behind the scenes, we have a cloud-based
microservices environment on a large scale: around 15,000 cores and 300
microservices. And like any other production environment, we too see failures
all the time. Today we're going to cover our journey through the process of
change towards creating a healthy and supportive learning culture by taking
those failures and building on them. This is the essence of chaos engineering, and while a
lot can be said on the technical aspects of randomly breaking things
to find gaps, reality can always surprise you. Every production environment
I've worked on has experienced issues, sometimes due to code changes, other
times due to third-party providers having their entire infra crash, leading
us to constantly seek ways to learn and improve how we do things, how we
protect the system, and, at the end of it, how we provide the best service to
our customers. When I joined
the company, I set a destination: I wished to see rapid deployments, being
able to provide the most adequate and up-to-date solution. In our case, in a
world of moving target defense, where the scope of features changes all the
time due to threat actors, being able to deploy changes quickly is a major
factor in our ability to provide a competitive product.
In fact, oftentimes a good DevOps culture can be the
differentiator and act as a competitive edge for your company. We wanted
to have zero downtime and have errors or mistakes happen only once, so we'd
have a chance to learn, but not twice, which would mean we didn't learn the
first time. However, the starting point wasn't that bright. We saw a few
repeating issues: minor things causing failures in production due to code
changes or configuration changes, and being too prone to incidents in the
underlying cloud environments we were using, affecting our stability. Those
two factors were very concerning when we looked at how we were going to grow
and scale, looking 10x and 100x ahead.
What may be a minor risk today will likely be catastrophic in the future, and
that future can be next week if you're in a fast-growing company. While those
were concerning, the last issue really prevented us from improving, and
that's the fear of judgment. Whenever we dove into trying to understand
issues we had, there was pushback: why do we ask so many questions? Don't we
trust our people? Why don't we just let them do their job? They know what
they're doing. And that's a problem. If you have team members who are afraid or
feeling they're being judged or generally insecure in their work
environment, they're going to underperform. And as a team,
you will not be able to learn and adapt. In essence,
this is the whole point of this exercise. So with that
starting point and the destination in mind, we set off to establish a new
process within the team of how we analyze every kind of
failure: what we do during the analysis, how to conduct a debrief, and the
follow-up. Why do we focus on the process?
Because a bad system will beat a good person every
time. And assuming you have the right foundation of engineers,
if you fix the process, good things will happen.
So let's start with an example, which I'm sure any of us who own production
environments have experienced, either the same or something similar, or can
relate to in some way. I'll start with the use case and how it relates to the
process. So you have
an incident. A customer is complaining about something misbehaving in
their environment, and they think it might be related to you, and they're calling support.
Support is trying to analyze and understand, and after a while realizing
they don't know what to do with it, they will page the engineering team,
as they should. The engineering team wakes up because it's the
middle of the night and they're in another time zone. They work
to analyze what's going on. They find a problem,
they fix it. They might resent the fact that they had to fix it in
the middle of the night, obviously, and they go back to sleep and move
on to other tasks the next day. If that's the end of it, you are
certain to experience similar issues again from similar root causes.
So you should ask yourselves: why is it happening? What can we do better?
What can we do to avoid seeing this issue in any similar case in the future?
You have to set aside time to analyze after the fact. This is the only way to
make sure root causes are found, process problems are improved, and lessons
are taken and learned from the case at hand. We actually had a case where
code was deployed into production by mistake. How could that happen? Well, we
had an engineer merge code into the main branch. The code failed some of the
tests, but it was late at night and they decided to leave it as is and keep
working the next day, knowing that code would not be deployed from main to
production. What they didn't know was that a process had been added by the
DevOps engineer earlier that week that automatically deployed that specific
code to production whenever there was a need to auto-scale that specific
microservice. And that night we had a usage increase, spinning up more
instances with the buggy code.
Now here lies the issue. We can focus a lot on why the buggy code was merged
into main, or why the auto-scaling was added. But if we focus too much on why
someone did or didn't do something, whether they understand what they're
doing, we're going to miss the entire issue: wait a minute, the process is
flawed. How could an engineer merge code into main without understanding that
it's going to be deployed to production? There is a meaning behind specific
repositories, the specific ways you manage the code, branches, naming
conventions and all of that, in our case. So if you fix the process, the
problem resolves itself. In this specific case, that meant aligning all
engineers to understand that the main branch equals deployment to production,
no matter which service. That way, the approach to merging branches into the
main branch changes drastically. Fixing this in the process, not over-judging
what a specific employee did or didn't do when they were just trying to do
their job, is what will prevent this from happening again.
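To make that rule concrete, here is a minimal sketch of the kind of guard
such a process implies: nothing built from main gets rolled out, including by
auto-scaling, unless its commit has a green test run. The function names and
the CI lookup are hypothetical placeholders, not our actual tooling.

```python
# Minimal sketch (hypothetical names): main is treated as "always deployable",
# so nothing built from main may reach production without passing tests.

class DeployBlocked(Exception):
    """Raised when an image must not be rolled out."""

def tests_passed(commit_sha: str) -> bool:
    """Placeholder: ask your CI system whether this commit's test run was green."""
    raise NotImplementedError

def assert_safe_to_roll_out(branch: str, commit_sha: str) -> None:
    # Auto-scaling pulls new instances from the same image, so this check
    # has to run for scale-ups as well, not only for explicit deploys.
    if branch == "main" and not tests_passed(commit_sha):
        raise DeployBlocked(
            f"commit {commit_sha} on main has failing tests; refusing rollout"
        )
```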
So how do we learn from such an incident? Well, there are four steps to the
process. We start with the incident, obviously. And the more mature you
become as an organization and a learning culture, the more the team will
create incidents out of supposedly minor things, just for the follow-up and
the learning from them, which is a really healthy stage to be at. You provide
an immediate resolution to the issue, and then 24 to 72 hours afterwards,
depending on how much time people had to sleep and on working hours, you do a
debrief. We're going to talk a bit more about that meeting and how to conduct
it in the next slide. Two weeks after the meeting, we do a checkpoint to
review the action items that came from the debrief and make sure things are
incorporated, especially the immediate tasks.
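As a trivial illustration of that cadence, here is a small sketch that
derives the debrief window and the checkpoint date from the incident's start
time; the structure and field names are made up for the example.

```python
# Sketch of the four-step cadence: incident -> immediate fix ->
# debrief 24-72 hours later -> checkpoint two weeks after the debrief.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentSchedule:
    started_at: datetime

    @property
    def debrief_window(self) -> tuple[datetime, datetime]:
        # Hold the debrief once people have slept, but while memory is fresh.
        return (self.started_at + timedelta(hours=24),
                self.started_at + timedelta(hours=72))

    def checkpoint_due(self, debrief_at: datetime) -> datetime:
        # Review the action items two weeks after the debrief meeting.
        return debrief_at + timedelta(weeks=2)
```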
So let's talk about conducting a debrief now. This isn't a standard
retrospective, as it usually follows an incident that may have been very
severe in impact. When do you debrief? Every time you think the process
and/or the system did not perform well enough. I ask a lot of questions. The
questions I ask are, first of all: what happened? Let's have
a detailed timeline of events from the moment it really started.
Not the moment somebody complained, not when the customer
or somebody else raised the alarm, but from the moment the issue really started to
roll into production, when the code was merged, when we changed
the query, when the third party provider we were using started
to crash and updated their own status page. What's the impact?
This is also a very important factor in creating a learning environment. It
helps to convey a message to your entire engineering team: understanding the
actual impact, be it cost, customers affected, or complaints. Get as full a
scope as you can. That is vital to help everyone understand
why we're delving into the problem and why it is so important.
It's not just that they wake up in the middle of the night
being paged and it's bothering them. You have to understand the full
picture of where everything is related and connected.
Now, after you have the story and the facts, you start to
analyze and brainstorm how to handle this better in the future.
First, two questions I use to lead into the discussion: did we identify the
issue in under a certain amount of time? Let's say 5 minutes. Why 5 minutes?
Well, it's not arbitrary. We want to have a specific goal for how fast we do
things. So did we identify the issue in under 5 minutes? Sometimes we did,
sometimes we didn't. Did we fix the problem in under an hour? Completely fix
it? Did we do it in under 10 minutes? Did we need to do anything at all? Was
it completely resolved automatically, so there was no point in us analyzing
anything? Once we have the answers, once you answer no to any of these, the
follow-up should be: okay, we understand the full picture. What do we need to
do? What do we need to change? What do we need to develop so that next time
we will be able to answer yes to those two questions?
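As a rough sketch of how those two questions can be checked against an
incident's timeline, using the five-minute and one-hour targets mentioned
above (the timestamps in the example are purely illustrative):

```python
# Sketch: compare the incident timeline against the two debrief questions:
# did we identify the issue in under 5 minutes, and fix it in under an hour?
from datetime import datetime, timedelta

DETECT_TARGET = timedelta(minutes=5)
RESOLVE_TARGET = timedelta(hours=1)

def debrief_questions(started: datetime, detected: datetime, resolved: datetime) -> dict:
    time_to_detect = detected - started    # from the real start, not the complaint
    time_to_resolve = resolved - started
    return {
        "identified_under_5_min": time_to_detect <= DETECT_TARGET,
        "fixed_under_1_hour": time_to_resolve <= RESOLVE_TARGET,
    }

# Illustrative example: a "no" to either answer is what drives the follow-up.
answers = debrief_questions(
    started=datetime(2021, 3, 2, 1, 10),
    detected=datetime(2021, 3, 2, 1, 25),
    resolved=datetime(2021, 3, 2, 1, 50),
)
```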
This part, while seemingly simple, led to a drastic culture change over time.
Creating the framework in that way helps convey to everyone that the focus is
on the process and the system. It's not about anyone specific. Whoever caused
the incident today is irrelevant; tomorrow it could be a different employee
entirely. Now, any culture change takes time. We had some things we had
to resolve along the way, and I've already mentioned a few of the solutions
and how we did them. The first was a lack of trust, especially if you have a
new manager coming in. Trying to instill a new process, trying to change the
culture, takes time. The lack of trust could be in the process, or it could
be in the questions people ask themselves: is there a need, or perhaps an
agenda, behind it? How will this not become the blame game we had before?
This can be completely resolved if you do it properly and consistently. What
often happens when you're trying to understand why a problem occurred is that
people might say he or she did something wrong, when the real issue is
something else.
Also important to notice: not following up on action items, something that's
really annoying. You do the process, you review everything, you set action
items, and then you have the same problem all over again a few weeks or a few
months later. How did that happen? You see that the action items that were
set weren't being followed up on. The resolution we had was very simple: we
established the checkpoints. You have the debrief, and you set checkpoints
every two or three weeks, whatever time frame is relevant for you, to make
sure that the immediate action items are handled. Personally, what I also do
is label each Jira ticket with a debrief label and do a monthly review of all
the debrief items to see what is left open, be it something that became
irrelevant or something that had to be moved upward.
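For that monthly review, a query along these lines is one way to pull the
still-open debrief items. This is a generic sketch against Jira's standard
search API, assuming a "debrief" label; the URL and credentials are
placeholders, not a description of our exact setup.

```python
# Sketch: list debrief-labelled Jira tickets that are still unresolved,
# for the monthly review. Assumes a "debrief" label and the standard Jira
# REST search endpoint; URL and credentials are placeholders.
import requests

JIRA_URL = "https://your-company.atlassian.net"   # placeholder
AUTH = ("user@example.com", "api-token")          # placeholder credentials

def open_debrief_items() -> list[dict]:
    resp = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={
            "jql": "labels = debrief AND resolution = Unresolved ORDER BY created ASC",
            "fields": "summary,assignee,created",
        },
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["issues"]
```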
Another critical move we've made to resolve future issues is implementing
proper communication at a wide scale. Make sure everybody knows there was a
debrief. We publish our debriefs very widely within the company, exposed to
all employees, with the details of what happened and what we're going to do
to make it better. This helps bridge the trust gap, if you have one, and
shows that everything is very transparent and visible. We saw that if you're
not asking the right questions, the focus might be the problem, and having a
wide audience can help give another view, with more people to identify gaps
that might have been missed in the bigger picture. Now there are four main
things I would want you to take
from this session on how to conduct a debrief, listed here. The first would
be: avoid blame. Avoid it altogether. And if you see blame starting to happen
within a debrief meeting, you need to intervene and stop it, politely. Always
be calm, but you need to stop it to make sure the meeting stays on track and
goes the way you want it to go. Because if there is a vision of how it should
go, then later on, once the change is instilled, it will happen on its own
without you needing to be involved in the process. Go easy on the why
questions. It's important to understand why somebody did something, but the
more you dive into it, asking somebody why they did something can create
resentment or self-doubt in employees. It can sound to them like you're being
critical and judging how they behaved and why they're doing certain things.
Be consistent, like I said before, and keep calm. It's always important,
especially when you're looking into things that failed, to stay calm and show
there is a path forward, which especially helps in creating a better
environment for change.
Now, some of our most notable learnings from all of these processes are
listed here, and I'll touch briefly on a few of them. You can also see the
format of the debrief meeting in the QR code here, which you can follow up
on. So first of all,
humans make mistakes. We need to fix the process, not try to fix people; that
will not work. They work hard, they're smart, but everybody makes mistakes.
Another thing I've often heard about is gradual rollout. It often appears to
be some holy grail, perhaps, of microservices and large-scale production
systems. It's a great tool, but it's not a silver bullet, and it will not
resolve everything. Saying it will will miss a lot of problems that you need
to resolve with a different process or tools. Establishing a crisis-mode
process is also a very important one. Feature flagging, especially connecting
it to the second point in terms of handling errors quickly, was really
important for us. It was one of the things we were able to do to disable
certain features instead of rolling back a lot of services, sometimes
thousands of containers, and that helps revert the change much faster and
understand whether that's the cause of the problem or not. Maybe something
else in the infrastructure is the issue and not even the code.
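A minimal sketch of what that looks like in code: the risky path is wrapped
in a flag read from a central store, so disabling the feature is a
configuration change rather than a rollback of thousands of containers. The
flag store and feature names here are stand-ins, not our actual services.

```python
# Sketch of a feature-flag kill switch: the flag is read from a central
# store at request time, so turning a feature off is a config change,
# not a rollback of thousands of containers. Names here are placeholders.

FLAGS = {"new_scoring_pipeline": True}   # stand-in for a central flag store

def flag_enabled(name: str) -> bool:
    # In production this would query (and cache) the shared flag service.
    return FLAGS.get(name, False)

def new_scoring(request: dict) -> float:
    return 1.0                           # the feature under suspicion

def legacy_scoring(request: dict) -> float:
    return 0.0                           # known-good fallback path

def score_request(request: dict) -> float:
    if flag_enabled("new_scoring_pipeline"):
        return new_scoring(request)
    return legacy_scoring(request)
```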
We always try to avoid in-place changes, replacing something in the system
that we need to change. When you have a lot of coupled microservices, there
is a lot of communication between them, and changing something in place can
cause a lot of harm. So we try to split it: adding the new behavior,
verifying it behaves as we expected, and then subtracting the old behavior,
essentially splitting the process into two.
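Here is a sketch of that two-step split, sometimes described as expand and
contract: the new behavior runs alongside the old one, its output is verified
against the old path, and only a later, separate change removes the old
behavior. The names are illustrative.

```python
# Sketch of the "add, verify, then remove" split: ship the new behavior
# alongside the old one first, compare results, and delete the old path
# only in a later change once the new one has proven itself.
import logging

log = logging.getLogger(__name__)

def handle(request: dict) -> dict:
    old_result = old_behavior(request)          # still the source of truth

    try:
        new_result = new_behavior(request)      # added first, not yet trusted
        if new_result != old_result:
            # Verification step: mismatches are logged, not served.
            log.warning("new_behavior mismatch for request %s", request.get("id"))
    except Exception:
        log.exception("new_behavior failed; old path still serving traffic")

    return old_result
    # Step two, in a separate change later: serve new_result and delete
    # old_behavior entirely.

def old_behavior(request: dict) -> dict:
    return {"status": "ok"}                     # placeholder

def new_behavior(request: dict) -> dict:
    return {"status": "ok"}                     # placeholder
```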
Now, we always try to look 10x ahead at the breaking points of the system,
wherever things happen and can break. And treating config as code was also a
very crucial element in how we do things. So that's it. Thank you for
listening.
I hope I gave you something new to use. Feel free to ping
me on any of these mediums and ask questions, and I'd love to discuss with
you more.