Transcript
            
            
              This transcript was autogenerated.
            
            
            
            
              Hi. This session is
            
            
            
              titled "Get Ready to Recover with Reliability Management." I'm Jeff
            
            
            
              Nickoloff. I'm currently working with Gremlin,
            
            
            
              but I have 20 years in industry. I've worked with some of
            
            
            
              the largest organizations on the planet,
            
            
            
              and I've been working with mission-critical, business-critical systems
            
            
            
              for a very, very long time. Some of those companies include Amazon,
            
            
            
              Venmo, and PayPal, and several others.
            
            
            
              In my time working on those critical systems,
            
            
            
              I've been involved with hundreds of incidents,
            
            
            
              both as a person on call,
            
            
            
              as well as someone who's responsible for communicating across
            
            
            
              an organization, running an incident,
            
            
            
              communicating with customers, digesting and translating
            
            
            
              between engineering information, technician information, and



              customer impact. And that's a
            
            
            
              long way of saying I've seen a lot of different kinds of incidents,
            
            
            
              but I think the types of incidents that bring us together, that



              really drive us all to be interested in a session
            
            
            
              like this, in a conference like this, are really those
            
            
            
              with the most pain, and I don't necessarily mean
            
            
            
              those with the greatest customer impact, although those are very important.
            
            
            
              I'm talking about situations that I like to call scrambling
            
            
            
              in the dark. Now, this might be literally
            
            
            
              in the dark, where it's the middle of the night and
            
            
            
              you're tired,
            
            
            
              but it doesn't have to be. It might also mean situations
            
            
            
              where you don't know, your customer doesn't know,
            
            
            
              your technicians don't know, no one's sure. You're
            
            
            
              having communication problems, miscommunication.
            
            
            
              You have to do just-in-time research.
            
            
            
              Scrambling in the dark might look like your technicians are
            
            
            
              trading dashboards like trading cards.
            
            
            
              There's an increased desperation to the situation.
            
            
            
              And those are
            
            
            
              the moments where the
            
            
            
              customer impact is one thing, but it's really those moments that create a
            
            
            
              bit of a crisis of conscience, or rather
            
            
            
              a crisis of confidence for yourself, for your
            
            
            
              team in itself, for your business
            
            
            
              partners in your engineering organization.
            
            
            
              People begin to ask the hard question of, can I rely
            
            
            
              on our systems?
            
            
            
              And many, many uncomfortable questions fall out
            
            
            
              of these times where we're scrambling in the dark.
            
            
            
              And so this is a problem we
            
            
            
              need to solve as much as, and with the
            
            
            
              same urgency as, the things we typically talk about, like
            
            
            
              time to recovery or
            
            
            
              time to response, those somewhat more concrete metrics.
            
            
            
              But in my experience, more often than not,
            
            
            
              these have similar solutions.
            
            
            
              It really comes down to preparation.
            
            
            
              You need to get ready for the incident before the incident.



              You're going to have incidents; that is a given.
            
            
            
              And especially for systems that undergo regular
            
            
            
              change, high velocity change, either in the system itself, the system
            
            
            
              design, or in the business context in which you're operating the
            
            
            
              system, either side of the coin can
            
            
            
              change a system in ways that are difficult to anticipate.
            
            
            
              But I don't want to be another one of those people that kind of stands
            
            
            
              up and says, well, you need to be better prepared, or to just
            
            
            
              be better prepared. Preparation is important,
            
            
            
              but it's important not to minimize the level of effort
            
            
            
              and the expense that goes into preparation.
            
            
            
              Preparing for incidents



              is nontrivial.
            
            
            
              As soon as you dive into it, you have to ask yourself,
            
            
            
              what does it mean to be prepared? What level of readiness
            
            
            
              do we need and what do we need to prepare for?
            
            
            
              And the level of effort increases with the complexity
            
            
            
              of your system. The number of individual components of the system,
            
            
            
              the number of people on your team, the number of teams you have,
            
            
            
              the different ways that your system might interact with other systems
            
            
            
              or your partner systems, your upstream dependencies,
            
            
            
              or with your customers. How many different ways does your
            
            
            
              system interact with your customers? How many different ways can
            
            
            
              it fail? And that's
            
            
            
              a hard story to tell when you're
            
            
            
              asking for funding to invest in programs
            
            
            
              so that you can be better prepared,
            
            
            
              so that your teams can feel better prepared, so that you can be more
            
            
            
              confident before, during and
            
            
            
              after an incident.
            
            
            
              Because you need to have that;
            
            
            
              this is table stakes for getting to those places where you
            
            
            
              can begin to talk about time to recovery.
            
            
            
              It all starts with the people and understanding the space.
            
            
            
              And it's critical not to minimize the
            
            
            
              level of effort it requires to be better prepared.
            
            
            
              So when we talk about preparation,
            
            
            
              your investment really falls into two categories.
            
            
            
              You have your detective controls,
            
            
            
              identifying when your system is failing, potentially identifying what
            
            
            
              parts of your system are failing, and potentially identifying
            
            
            
              likely resolution.
            
            
            
              That all falls into a world that
            
            
            
              we typically think of as automation.
            
            
            
              These look like programs or systems
            
            
            
              that provide continuous monitoring
            
            
            
              and automated recovery.
            
            
            
              Things like resilience platforms such as Kubernetes.
            
            
            
              This is specifically what that is intended to
            
            
            
              provide. Automated release control, automated rollback
            
            
            
              control, automated recovery. If a pod or



              process fails, it will automatically reconcile



              back to the desired state of having it running.
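
The reconcile-to-desired-state idea described here can be sketched in a few lines. This is a toy illustration only, not a Kubernetes API: the `Cluster` class, the `reconcile` function, and the worker names are all hypothetical, and a real platform runs this comparison continuously rather than on demand.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a platform's view of the world (hypothetical)."""
    desired_replicas: int
    running: set = field(default_factory=set)
    next_id: int = 0

    def start_one(self) -> str:
        # Start a new worker and record it as running.
        name = f"worker-{self.next_id}"
        self.next_id += 1
        self.running.add(name)
        return name


def reconcile(cluster: Cluster) -> list:
    """Compare observed state to desired state and start whatever is missing."""
    started = []
    while len(cluster.running) < cluster.desired_replicas:
        started.append(cluster.start_one())
    return started


cluster = Cluster(desired_replicas=3)
reconcile(cluster)                      # initial convergence: three workers
cluster.running.discard("worker-1")     # simulate a process failure
restarted = reconcile(cluster)          # the loop notices the gap and fills it
print(restarted, sorted(cluster.running))
```

The point of the sketch is that recovery needs no human decision: the only inputs are the declared desired state and the observed state.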
            
            
            
              Those types of investment
            
            
            
              in automation can be very powerful, but can also end up being
            
            
            
              very tool- or stack-dependent. And this



              makes them a little bit more fragile, and it takes a little bit more engineering
            
            
            
              effort to pursue robust solutions over
            
            
            
              the lifetime of your team and your product and your
            
            
            
              company. They need a little bit of love, and that's
            
            
            
              okay. These are very powerful tools, but they are
            
            
            
              expensive



              to implement, maintain, and continuously improve.
            
            
            
              They're not bad. They're critical. Right. The other side of
            
            
            
              this is getting your team on autopilot. And what
            
            
            
              I mean by that is



              bringing a high degree of consistency into your incident
            
            
            
              response, the skill set and
            
            
            
              context that your technicians bring,
            
            
            
              a consistent and logical way of following through
            
            
            
              and problem-solving the incident.
            
            
            
              A fairly deterministic and consistent set
            
            
            
              of remediation options.
            
            
            
              How do we recover? Getting that on autopilot.
            
            
            
              And when you're not on autopilot, what that looks like is
            
            
            
              incident responders and technicians, specifically where
            
            
            
              you have a high degree of variation in their readiness
            
            
            
              to handle the incident. Some people understand some
            
            
            
              systems more than others. Some people are
            
            
            
              familiar with recent changes to a system more than others will
            
            
            
              be. Sometimes people have
            
            
            
              different problem solving responses, different problem solving workflows.
            
            
            
              At other times,



              different people will be familiar with different sources



              of truth. This might look like dashboards.
            
            
            
              It might look like awareness of specific alarms



              or maybe non-alarming monitors that
            
            
            
              have also been set up, that are designed to, or that are
            
            
            
              in place to try to help triage and
            
            
            
              identify a path to recovery. But those
            
            
            
              things are
            
            
            
              a missed opportunity if your team can't
            
            
            
              use them with consistency. So when I say getting on autopilot,
            
            
            
              I mean bringing a high degree of readiness
            
            
            
              to the people on your team, making sure that
            
            
            
              they're aware, making sure that they understand the systems and how
            
            
            
              those systems fail, making sure that they understand the tooling and
            
            
            
              everything else that is available to them,
            
            
            
              and making sure that they get reps, making sure that they get practice,
            
            
            
              making sure that they've seen the various kinds of
            
            
            
              failure before they show up in
            
            
            
              an incident.
            
            
            
              This is an investment, and a regular investment into the
            
            
            
              human side of things. But either way,
            
            
            
              regardless of which, you're going to end
            
            
            
              up investing in both of these things, the question is what
            
            
            
              to spend on each side of these things,
            
            
            
              and then also identifying what
            
            
            
              things to prepare for, which kinds of incidents to
            
            
            
              prepare for. Most systems



              can become quite complicated quite quickly with the number of dependencies,
            
            
            
              the number of ways things can break,
            
            
            
              and under which kinds of conditions different parts of the
            
            
            
              system may break or may need different kinds of love in
            
            
            
              order to recover.
            
            
            
              In some cases, you might be asking yourself, like, what can we change about the
            
            
            
              system before an incident



              to either reduce the probability of an incident



              or to speed recovery?
            
            
            
              But again, coming back to the complexity
            
            
            
              and the level of effort that goes into just preparing,
            
            
            
              it's going to be very difficult to prepare for everything.
            
            
            
              That's a very long tail that we'll all have
            
            
            
              to be chasing. So the question really comes down
            
            
            
              to not just
            
            
            
              do we invest in automation versus autopilot,
            
            
            
              but which types of incidents,
            
            
            
              which types of failures should we invest in preparing
            
            
            
              for? And that's a
            
            
            
              nontrivial question. You can answer it trivially.
            
            
            
              You may have seen that some
            
            
            
              people might be more familiar with different types of failures than others, and so
            
            
            
              they'll probably lean towards that, naively lean towards the
            
            
            
              things that they are familiar with failing, or the
            
            
            
              last thing that they ended up being paged for.
            
            
            
              There's typically a strong bias for that. But if
            
            
            
              you're standing in a position where you have the opportunity to choose,
            
            
            
              there's a better way. I want to
            
            
            
              talk about the relationship between incident management
            
            
            
              and incident response and reliability,
            
            
            
              reliability of your systems and running a reliability program.
            
            
            
              These two things are definitely separate efforts,
            
            
            
              but there's an inherent relationship between the two.
            
            
            
              With your reliability program,



              we put these things in place so that, rather than



              retrospectively



              looking at what has been breaking, we can proactively identify
            
            
            
              and regularly assert what incidents
            
            
            
              we are at risk for, the probability of those risks,
            
            
            
              the severity, if these things break. And we
            
            
            
              use that information to inform what incidents we
            
            
            
              should prepare for, and we use the information that
            
            
            
              comes out of managing incidents,
            
            
            
              how prepared are we to handle these incidents?
            
            
            
              How long does it take to recover? What's been the financial impact
            
            
            
              of the last three times or however
            
            
            
              many times this type of failure
            
            
            
              has happened? We use that as an input back into the reliability program
            
            
            
              so that we can prioritize what to change and
            
            
            
              how to measure. So, reliability program.
            
            
            
              This is a very high level, abstract idea,
            
            
            
              but in general, what this looks like is



              being able to
            
            
            
              enumerate the components in your system,
            
            
            
              being aware of the ways that they might fail, being aware of your
            
            
            
              dependencies, being aware of the value,
            
            
            
              usually, like by rate, of how valuable certain systems
            
            
            
              are, and then regularly measuring and determining
            
            
            
              what types of operational conditions those
            
            
            
              components can survive and specifically
            
            
            
              where they tip.
            
            
            
              And then obviously there's a whole thing around identifying,
            
            
            
              funding and staffing for engineering
            
            
            
              improvements so that you can hit reliability goals.
            
            
            
              But I really want to zoom in on the mechanism,



              the high-level core mechanisms of a solid reliability program.



              There's a tool called failure mode and
            
            
            
              effects analysis. This is a pretty robust
            
            
            
              framework. It's rare that I've seen it
            
            
            
              implemented in the SaaS and software space in a
            
            
            
              deep way, but it's a really important system,
            
            
            
              even if you're taking only high level inspiration from
            
            
            
              it. A failure mode and effects analysis is
            
            
            
              a robust and opinionated framework for
            
            
            
              cataloging the components in your system,
            
            
            
              the failure modes for each of those components,
            
            
            
              the probability of those failures,
            
            
            
              and (this term starts to get a little bit fuzzy and



              really dependent on your business)



              the severity, or impact, of
            
            
            
              that type of failure.
            
            
            
              For many groups, this might look like financial
            
            
            
              impact, or it might look like downstream impact.
            
            
            
              If this fails, you can begin to talk about cascading
            
            
            
              failures, although failure mode and effects analysis is really not so much concerned about
            
            
            
              that. It's usually first order impacts. But if you can get to money, it helps
            
            
            
              you craft a better story later.
            
            
            
              And these analyses also typically discuss
            
            
            
              and present an opportunity for you to determine whether
            
            
            
              or not you can detect the type of failure.
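
The catalog described above (component, failure mode, probability, severity, and whether the failure is detectable) can be sketched as a small table in code. This is a minimal illustration, not the formal FMEA worksheet: the `FailureMode` fields, the component names, and every number below are invented assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str        # which part of the system
    mode: str             # how it fails
    probability: float    # estimated chance of occurring in a period, 0..1
    severity_usd: float   # estimated first-order cost per occurrence
    detectable: bool      # is there a detective control for it today?


# All rows are hypothetical and exist only to show the shape of the table.
catalog = [
    FailureMode("checkout-api", "dependency timeout", 0.30, 5_000, True),
    FailureMode("checkout-api", "bad deploy", 0.10, 20_000, True),
    FailureMode("ledger-db", "primary failover stalls", 0.02, 150_000, False),
    FailureMode("email-worker", "queue backlog", 0.50, 500, False),
]

# Failure modes with no detective control are candidate preparedness gaps.
undetected = [fm for fm in catalog if not fm.detectable]

print(f"{'component':<14}{'failure mode':<26}{'prob':>6}{'severity':>10}  detect")
for fm in catalog:
    flag = "yes" if fm.detectable else "NO"
    print(f"{fm.component:<14}{fm.mode:<26}{fm.probability:>6.2f}"
          f"{fm.severity_usd:>10,.0f}  {flag}")
```

A spreadsheet with the same columns serves the same purpose; what matters is that every component and failure mode gets a row.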
            
            
            
              But the big idea is you get this information,
            
            
            
              you build out a big table, it might be a spreadsheet, whatever it
            
            
            
              is. And this helps you identify all
            
            
            
              sorts of risks in your system to really identify where the
            
            
            
              risks are. I want to talk for
            
            
            
              a moment about failure modes and your detective
            
            
            
              controls, because this goes directly to your
            
            
            
              incident preparedness.



              As your reliability management program



              is enumerating the types of failures for each component



              and whether or not it can survive them, another big question for
            
            
            
              each of those types of failure modes is, can you detect it before
            
            
            
              your customers do, or how
            
            
            
              quickly can you detect it?
            
            
            
              And that's really because if you can't,
            
            
            
              these are clearly going to be gaps in your
            
            
            
              preparedness. If you can't detect
            
            
            
              whether or not a failure mode has happened,
            
            
            
              you're going to have poor response time. If you
            
            
            
              can't detect whether or not this failure has happened,
            
            
            
              your incident responders, when they do respond, are going to
            
            
            
              have a more difficult time identifying the nature of the failure.
            
            
            
              And so, naively, I could stand up here
            
            
            
              and say, make sure that you've got detective controls for everything. But this is
            
            
            
              one of those cases where you want to look at
            
            
            
              that breakdown to say,
            
            
            
              does this type of failure mode warrant investment
            
            
            
              into detective controls?
            
            
            
              And it's important to be able to test your detective controls



              regularly. I don't mean at one point in time,



              but to regularly create



              failure conditions in whatever environment to



              verify that your detective controls operate the way
            
            
            
              that they're intended.
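
One way to picture a repeatable detective-control test: create the failure condition on purpose, then check that the alarm fires. The sketch below is an assumption-laden toy; `ErrorRateMonitor`, `failing_call`, and `run_detection_check` are hypothetical names, and a real check would inject the fault into a running environment and query the actual monitoring system.

```python
from typing import Callable


class ErrorRateMonitor:
    """Hypothetical detective control: alarms when the error rate crosses a threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.alarmed = False

    def observe(self, results: list) -> None:
        # results is a list of booleans: True for success, False for failure.
        errors = results.count(False)
        if results and errors / len(results) > self.threshold:
            self.alarmed = True


def healthy_call() -> bool:
    return True


def failing_call() -> bool:
    return False  # stands in for an injected failure condition


def run_detection_check(call: Callable[[], bool],
                        monitor: ErrorRateMonitor,
                        n: int = 20) -> bool:
    """Drive n requests through `call` and report whether the monitor alarmed."""
    monitor.observe([call() for _ in range(n)])
    return monitor.alarmed


# Baseline: the alarm stays quiet while the system is healthy.
quiet_when_healthy = not run_detection_check(healthy_call, ErrorRateMonitor(0.05))
# Injected failure: the alarm must fire, or the control is not doing its job.
fires_on_failure = run_detection_check(failing_call, ErrorRateMonitor(0.05))
print(quiet_when_healthy, fires_on_failure)
```

Running a check like this on a schedule, rather than once, is what catches a control that silently stopped working after a system change.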
            
            
            
              The next step, and it's not really a step,



              but the other part that I'd already discussed a little bit,



              is this: when we bring it back to how we prioritize what



              to invest in, the real big question is,



              well, where's our biggest risk? And when I say risk here,



              I mean not just the probability



              of failure, but that probability multiplied by the severity
            
            
            
              of the failure. If you have something that is very



              expensive if the failure occurs, but is extremely



              unlikely, then this might be a lower priority



              to prepare for than a type of failure that happens



              three or four times a day, is



              likely to continue, and has a more



              mild cost associated with it.
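
The comparison above can be made concrete by treating risk as likelihood multiplied by severity. In this sketch, likelihood is expressed as expected occurrences per year; the `Risk` class, both example failures, and all figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    occurrences_per_year: float  # likelihood, as an expected frequency
    cost_per_occurrence: float   # severity, in whatever unit the business uses

    @property
    def expected_annual_cost(self) -> float:
        # Risk = likelihood x severity.
        return self.occurrences_per_year * self.cost_per_occurrence


risks = [
    # Very expensive, but extremely unlikely (hypothetical numbers).
    Risk("regional outage", occurrences_per_year=0.01, cost_per_occurrence=2_000_000),
    # Mild cost, but happens about three times a day (hypothetical numbers).
    Risk("flaky retry storm", occurrences_per_year=3 * 365, cost_per_occurrence=50),
]

# Highest expected cost first: this is the order in which to consider preparing.
ranked = sorted(risks, key=lambda r: r.expected_annual_cost, reverse=True)
for r in ranked:
    print(f"{r.name}: {r.expected_annual_cost:,.0f}")
```

With these assumed numbers, the frequent, mild failure outranks the rare, expensive one, which is the outcome the talk describes. Different inputs can reverse the order, which is why the estimates themselves need scrutiny.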
            
            
            
              But you have to do that reflection activity. You have to actually ask yourself,
            
            
            
              how likely is something to happen? And that
            
            
            
              typically requires some type of experimentation. You should test it.
            
            
            
              Can this happen? Under which conditions can it happen?
            
            
            
              And when it does, dive into the business. Look at



              your volumes. If it's
            
            
            
              a revenue type business, how much revenue is associated with these interactions?
            
            
            
              If this type of failure might result
            
            
            
              in some breach of contract,
            
            
            
              it's important to understand the penalty for those types of violations and bring
            
            
            
              that in. Let the business inform your
            
            
            
              engineering decisions.
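
As a rough sketch of letting business figures set the severity number, the function below adds lost revenue to any contractual penalty. The formula is a simplifying assumption (first-order impact only), and the function name, parameters, and sample values are all hypothetical.

```python
def estimated_severity(
    interactions_per_hour: float,
    revenue_per_interaction: float,
    expected_outage_hours: float,
    contract_penalty: float = 0.0,
) -> float:
    """First-order cost of one occurrence: lost revenue plus any contract penalty."""
    lost_revenue = (
        interactions_per_hour * revenue_per_interaction * expected_outage_hours
    )
    return lost_revenue + contract_penalty


# Hypothetical inputs: volume and revenue from the business, penalty from the contract.
severity = estimated_severity(
    interactions_per_hour=1_200,
    revenue_per_interaction=4.0,
    expected_outage_hours=0.5,
    contract_penalty=10_000,
)
print(severity)
```

The value of writing it down this way is that each input has an owner in the business who can correct it.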
            
            
            
              And so there are
            
            
            
              a lot of different ways to do this. It's easy to say probability and
            
            
            
              severity and talk about risk. I've seen it done a lot of
            
            
            
              different ways. And one
            
            
            
              of the concerns there is having inconsistency in
            
            
            
              your organization. If you have ten
            
            
            
              different groups in your organization and each of the ten groups
            
            
            
              is doing it slightly differently, it becomes very difficult to
            
            
            
              prioritize for your organization because you're often
            
            
            
              comparing apples to oranges.
            
            
            
              So regardless of what happens, regardless of
            
            
            
              how you move forward,
            
            
            
              consistency in measurement, consistency in those
            
            
            
              metrics that you're using to drive prioritization decisions
            
            
            
              that dictate how you're
            
            
            
              going to spend your money in improving your preparation.
            
            
            
              Consistency is key.
            
            
            
              And this is one of the problems that we're solving at Gremlin that I'm
            
            
            
              so passionate about: our new product, Reliability Management.
            
            
            
              Product scoring is really central to it. And at
            
            
            
              minimum, this is something that we've learned from
            
            
            
              the vast experience in building reliability programs
            
            
            
              with companies.
            
            
            
              More often than not, from my experience and other conversations
            
            
            
              I've had with people at Gremlin, more often than not,
            
            
            
              these are the dimensions that our customers find great success
            
            
            
              with. And like I said, the specific scoring mechanism
            
            
            
              you use is less important than having consistency in scoring.
            
            
            
              So what we've done is we've gone ahead and built a consistent scoring
            
            
            
              mechanism on their behalf.
            
            
            
              And so this is just an example.
            
            
            
              We do regular testing for redundancy,
            
            
            
              scalability and surviving dependency
            
            
            
              issues. We combine those into an
            
            
            
              easy-to-understand score,
            
            
            
              and we help present this to customers
            
            
            
              in a way that they can understand reliability
            
            
            
              issues between different services. Now, if you were to dive
            
            
            
              in, you can see specific conditions, and you can use
            
            
            
              those types of failures to inform the types
            
            
            
              of incidents that you should be prepared for. But the real
            
            
            
              power here is being able to know
            
            
            
              and identify what things can we survive? What things can
            
            
            
              we not survive? For those things that we can't survive,
            
            
            
              what's our impact? Right. So however you end
            
            
            
              up implementing it, this is
            
            
            
              what I believe to be a fantastic example of
            
            
            
              what you should end up with in the end.
            
            
            
              That's why I'm so excited about what we're building here at Gremlin.
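
The idea of rolling regular redundancy, scalability, and dependency tests into one comparable number can be sketched as follows. This is not Gremlin's actual scoring formula, which the talk does not describe. The function name `service_score`, the equal weighting of categories, and the rule that an untested category scores zero are all assumptions made for illustration. The point is only that every service is scored by the same rule.

```python
from typing import Dict, List

# The three test categories mentioned in the talk.
CATEGORIES = ("redundancy", "scalability", "dependencies")


def service_score(results: Dict[str, List[bool]]) -> float:
    """Return a 0-100 score for one service (hypothetical formula).

    `results` maps a category name to pass/fail outcomes of its tests.
    Each category contributes the fraction of tests passed, weighted
    equally. A category with no tests contributes 0 (an assumption),
    so missing coverage lowers the score instead of hiding risk.
    """
    total = 0.0
    for category in CATEGORIES:
        outcomes = results.get(category, [])
        if outcomes:
            total += sum(outcomes) / len(outcomes)
    return round(100 * total / len(CATEGORIES), 1)


# Made-up results for two hypothetical services.
checkout = {
    "redundancy": [True, True],
    "scalability": [True, False],
    "dependencies": [True, True, False, True],
}
search = {
    "redundancy": [True, False],
    "scalability": [True, True],
    # no dependency tests run yet
}

print("checkout:", service_score(checkout))
print("search:", service_score(search))
```

Because both services go through the same function, their scores can be compared directly, and the failing entries in each category point at the conditions the service cannot yet survive.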
            
            
            
              At Gremlin, this has been our focus since the beginning,
            
            
            
              but we're really making explicit now
            
            
            
              that our mission is to help teams standardize and automate reliability
            
            
            
              one service at a time and to help them understand at the service level
            
            
            
              in a consistent, repeatable way what they can tolerate
            
            
            
              and what they can't, so that you understand how to
            
            
            
              prioritize your improvements, either in your product engineering
            
            
            
              or in your incident response preparedness. Your incident
            
            
            
              preparedness. Thank you.
            
            
            
              That's all I have today, but if you have any other questions, I would love
            
            
            
              to see them in the chat. Thank you for everything.