Maintaining Reliable systems: How to minimize Incident's impact?

Abstract

Incidents are expensive to the business, especially if customers leave us because we are perceived as unreliable. But failures will happen; it’s not a question of if, but of when. So how can we reduce the impact on our users? In this talk, I will review the production incident cycle, the time when we are not reliable and our users are not happy, which includes the time to detect, time to repair, and time between failures. I’ll share a few methods to tackle each of those parts in order to minimize incident impact from both technical and people aspects, expanding on incident response and postmortems. To know what matters most to us, we want to be data driven in those decisions.

Summary

Real-time feedback into the behavior of your distributed systems, and observing errors in real time, allows you to not only experiment with confidence but also respond instantly to get things working again. Today I'll share a few things that can make the life of an SRE or an on-caller a bit easier.
What is an incident? Incidents are issues that are escalated because they are too big for you to handle on your own. To reduce the impact, we can reduce the time to detect and the time to repair, and increase the time between failures. How can we make failures hurt less?
Reducing the time to repair is mostly about the human side. Unprepared on-callers lead to longer repair times. Having clear incident management processes can reduce the stress of being on call. The third part is time between failures, which runs from the end of one outage to the beginning of the next.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence but also respond instantly to get things working again.
In this talk on maintaining reliable systems, today's focus will be on incident management and, more specifically, how customer impact can be minimized through incident management and postmortems. Before we start, I would like to introduce myself. I'm Ayelet Sachto. I'm a strategic cloud engineer, I'm also co-leading PSO efforts in EMEA, and currently I'm an SRE in GKE SRE London. Incident management is not new to me, as I have been living and breathing production at large scale for almost two decades now, most of them covering production on call. Today I'll share a few things that can make the life of an SRE or an on-caller a bit easier. I know they would have made mine much easier.
What is an incident? Incidents are issues that are escalated because they are too big for you to handle on your own and that require an immediate and organized response. And remember, not all pages become incidents. Incidents mean loss of revenue, customers, data, and more, which all comes down to impact on our customers and our business. We want to avoid having too many incidents, or ones that are too severe, because we want to keep our customers happy; otherwise they will leave us.
Before we drill down, let's recap our terminology. A service level indicator (SLI) tells you at any moment in time how well your service is doing: is it doing acceptably or not? A reliability target for an SLI is called a service level objective (SLO). It aggregates the SLI over time: it says, over this window of time, this is my target and this is how well I'm doing against it. Most of you are probably familiar with the SLA. A service level agreement defines what you are willing to do, for example provide a refund, if you fail to meet your objective. To achieve that, we need our SLOs, our targets, to be more restrictive than our SLAs.
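As a minimal sketch of the terminology above, assume a request-based availability SLI (good requests over total requests) measured over one window. The function names, the 99.9% SLO, and the 99.5% SLA are illustrative assumptions, not figures from the talk. The sketch shows why a stricter SLO gives a warning zone before the SLA is breached.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def classify(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI against the internal SLO and the external SLA."""
    if slo <= sla:
        raise ValueError("the SLO should be stricter (higher) than the SLA")
    if sli >= slo:
        return "within SLO"
    if sli >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Hypothetical targets for a 30-day window.
SLO = 0.999  # internal objective
SLA = 0.995  # contractual commitment, deliberately looser than the SLO

sli = availability_sli(good_requests=997_000, total_requests=1_000_000)
print(f"SLI = {sli:.3%}: {classify(sli, SLO, SLA)}")
```

The middle state is the useful one: the team has missed its own target and can react while the contractual commitment is still intact.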
Our scope for examining and measuring user happiness is the user journey. Your users use your service to achieve a set of goals, and the most important ones are called critical user journeys.
We know that failures will happen, but how can we make them hurt less? How can we reduce the impact? To answer this question, let's look at the production incident cycle, the time when we are not reliable and our users are not happy. That time is driven by the time to detect, the time to repair, and the time between failures. So to reduce the impact, we can tackle each of those parts: reducing the time to detect, reducing the time to resolve or mitigate the incident, and reducing the frequency of incidents, i.e. increasing the time between failures. How can we do that? With a combination of technology and human aspects such as processes and enablement. At Google we found that once a human is involved, the outage will last at least 20 to 30 minutes, so automation and self-healing systems are a good strategy in general: they help reduce both the time to detect and the time to resolve.
Let's zoom in on each of those parts separately. The time to detect, also called TTD, is the amount of time from when an outage occurs to when some human is notified or alerted that an issue is occurring. As part of drafting our SLOs, our reliability targets, we also want to do a risk analysis and identify what we should focus on in order to minimize the TTD. A few additional things we can do: align your SLIs, your indicators of customer happiness, as closely as possible with the expectations of your users, which can be real people or other services. In addition, our alerts need to be aligned with our SLOs, our targets. We also want to review them periodically to make sure that they still represent our customers' happiness. The second thing is having quality alerts, measured using different measurement strategies. It's important to choose what works best for getting the data.
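The three parts of the incident cycle described above (time to detect, time to repair, time between failures) can be combined into a rough estimate of yearly user impact. The following is a simplified sketch under stated assumptions: every incident looks the same, a fixed fraction of users is affected, and the class name, field names, and numbers are all hypothetical.

```python
from dataclasses import dataclass, replace

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY


@dataclass(frozen=True)
class IncidentProfile:
    ttd_minutes: float     # time to detect
    ttr_minutes: float     # time to repair (mitigate)
    tbf_days: float        # time between failures: end of one outage to start of the next
    users_affected: float  # fraction of users impacted, between 0 and 1

    def bad_minutes_per_year(self) -> float:
        """User-weighted minutes of unreliability expected in a year."""
        outage = self.ttd_minutes + self.ttr_minutes
        cycle = outage + self.tbf_days * MINUTES_PER_DAY
        incidents_per_year = MINUTES_PER_YEAR / cycle
        return outage * self.users_affected * incidents_per_year


baseline = IncidentProfile(ttd_minutes=10, ttr_minutes=50, tbf_days=30, users_affected=1.0)
scenarios = {
    "baseline": baseline,
    "faster detection": replace(baseline, ttd_minutes=5),
    "faster mitigation": replace(baseline, ttr_minutes=25),
    "fewer incidents": replace(baseline, tbf_days=60),
}
for name, profile in scenarios.items():
    print(f"{name:>18}: {profile.bad_minutes_per_year():7.1f} bad minutes/year")
```

Comparing scenarios this way is one possible input to the risk analysis: it shows which of the three levers buys the most for a given service.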
The data can come from streams, logs, or batch processing. In that regard, it's also important to find the right balance between alerting too quickly, which can cause noise and alert fatigue, and alerting too slowly, which may affect our customers. Note that noisy alerts are one of the most common complaints we hear from operations teams, both traditional DevOps and SRE. Another recurring issue is alerts that are not actionable. We want to avoid alert fatigue, and we want pages only for things that need immediate action. To achieve that, we want only the right responders, the specific team and owners, to get the alerts. One of the most common follow-up questions is: if we only page on things that require immediate action, what do we do with the rest of the issues? Remember, we have different tools and platforms for a reason. Maybe the right platform is a ticketing system or a dashboard. Maybe we need only the metric, for troubleshooting and debugging in pull mode.
The second part is TTR, time to repair. It runs from when someone is alerted to the issue to when it is mitigated. The key word here is mitigated. This doesn't mean the time it took you to submit code to fix the problem. It's the time it took the responder to mitigate the customer impact, for example by shifting traffic to another region. Reducing the time to repair is mostly about the human side that I mentioned. We want to train the responders, have clear procedures and playbooks, and of course reduce the stress around on call. So let's expand on each. Unprepared on-callers lead to longer repair times, so you want to have on-call training such as disaster recovery testing, pairing on call, shadowing, or running the Wheel of Misfortune exercise. Remember that on call can be stressful, so having clear incident management processes can reduce that stress, as they clear up any ambiguity and clarify the actions needed.
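Returning to the question of what to do with issues that do not need immediate action, one way to make the policy explicit is a small routing function. This is a sketch of one possible policy, not a rule from the talk: the `Alert` fields, the `Route` names, and the order of the checks are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page the owning team"
    TICKET = "open a ticket"
    DASHBOARD = "show on a dashboard"
    METRIC_ONLY = "keep as a metric for pull-mode debugging"


@dataclass(frozen=True)
class Alert:
    name: str
    owner_team: str
    actionable: bool              # is there something a responder can do?
    needs_immediate_action: bool  # must it be done right now?
    customer_impacting: bool      # does it reflect user-visible behavior?


def route(alert: Alert) -> Route:
    """Assumed policy: page only when action is both possible and urgent."""
    if alert.actionable and alert.needs_immediate_action:
        return Route.PAGE
    if alert.actionable:
        return Route.TICKET
    if alert.customer_impacting:
        return Route.DASHBOARD
    return Route.METRIC_ONLY


# Hypothetical alerts for a checkout service.
alerts = [
    Alert("checkout error rate above SLO", "payments", True, True, True),
    Alert("certificate expires in 20 days", "payments", True, False, False),
    Alert("p99 latency slightly elevated", "payments", False, False, True),
    Alert("one replica restarted", "payments", False, False, False),
]
for alert in alerts:
    print(f"{alert.name} -> {alert.owner_team}: {route(alert).value}")
```

Keeping the owning team on the alert itself reflects the point that only the right responders should be notified.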
Let's briefly introduce how we can manage an incident. The Incident Management at Google protocol (IMAG) is a flexible framework based on the Incident Command System (ICS) used by firefighters and medics. It's a structure with clear roles, tasks, and communication channels. It establishes a standard, consistent way to handle emergencies and organize an effective response. By using such a protocol, we reduce the ambiguity, make it clear that it's a team effort, and reduce the time to repair.
A few other things that you can do: prioritize and set aside time for documentation. Create playbooks and policies that capture procedures and escalation paths. Playbooks don't have to be robust at first. We want to start simple and iterate, which provides a clear starting point. A good rule of thumb that we advise our customers to follow, and that you might be familiar with, is "see it, fix it", along with letting new team joiners update the playbooks as part of their onboarding process. Remember, if the responders are exhausted, that will affect their ability to resolve the issue. We need to make sure that shifts are balanced and, if not, use data to understand why and reduce toil.
We also want to have as much quality data as possible. We especially want to measure things as closely to the customer experience as possible, as it will help us troubleshoot and debug the problem. For that, we need to collect application and business metrics and have dashboards and visualizations focused on customer experience and critical user journeys. That means dashboards aimed at a specific audience with a specific goal in mind. A manager's view of SLOs will be very different from a dashboard that needs to be used for troubleshooting during an incident.
The third part is time between failures, which runs from the end of one outage to the beginning of the next. Other than architectural refactoring and addressing the failure points that come out of the risk analysis and process improvement, what else can we do?
We want to avoid global changes and adopt advanced deployment strategies, considering progressive and canary rollouts over the course of hours, days, or weeks. This allows you to reduce the risk and identify an issue before all your users are affected. These can be integrated into a continuous integration and delivery pipeline with automated testing, gradual rollouts, and automatic rollbacks. CI/CD saves engineering time and reduces customer impact. It allows you to deploy with confidence. Another unsurprising point is having robust architectures: redundancy, no single points of failure, and graceful degradation. We should also adopt development practices that foster a culture of quality and create an integrated process of code review and robust testing. Remember, it's all about resilience. So in addition to training our responders and running disaster recovery exercises, we also want to practice chaos engineering, finding issues before they find us by introducing fault injection and automated disaster recovery testing.
Lastly, we want to learn from those failures and make tomorrow better than today. For that, our tool is postmortems. Postmortems are a written record of an incident, and they should capture the actions needed. At Google, we found that establishing a culture of blameless postmortems results in more reliable systems and is critical to creating and maintaining a successful SRE organization. For that, it's important to assume good intentions and focus on fixing the systems that allowed the incident to happen, not the people. Implementing postmortems starts with educating the team about blameless postmortems, running postmortem exercises, and crafting the policy, so that we learn from an incident and effectively plan work to prevent it from happening again in the future.
We touched on many things we can do to reduce the impact, from both the technical and human aspects.
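The progressive rollout with automatic rollback mentioned above can be sketched as a loop over traffic stages with a health check after each one. This is a minimal illustration, not a real deployment tool: `deploy`, `error_rate`, and `rollback` are hypothetical callbacks, the stages and threshold are made up, and a real pipeline would also wait (bake) between stages before checking health.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,
) -> bool:
    """Widen exposure stage by stage; roll back on the first unhealthy stage.

    Returns True if every stage passed, False if the change was rolled back.
    """
    for fraction in stages:
        deploy(fraction)
        # A real pipeline would let the stage bake before reading the signal.
        if error_rate() > max_error_rate:
            rollback()
            return False
    return True


# Simulated environment: the new version misbehaves once it serves 50% of traffic.
deployed: List[float] = []
rolled_back: List[float] = []


def fake_deploy(fraction: float) -> None:
    deployed.append(fraction)


def fake_error_rate() -> float:
    return 0.002 if deployed[-1] < 0.5 else 0.05


def fake_rollback() -> None:
    rolled_back.append(deployed[-1])


ok = progressive_rollout([0.01, 0.1, 0.5, 1.0], fake_deploy, fake_error_rate, fake_rollback)
print("rollout completed" if ok else f"rolled back at {rolled_back[-1]:.0%} of traffic")
```

In this simulation the problem is caught before the final stage, so the last step to all users never runs, which is the point of avoiding global changes.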
How do we know what to focus on? We want to be data driven in our decisions, and we want to prioritize what is most important to us. That data can be a result of the risk analysis process and the measurements I mentioned before, but we also want to rely on data collected from postmortems. Once we have a critical mass of postmortems, we can identify patterns. It's important to let the postmortems be our guide, as our investment in failures can lead us to success. With our customers, we encourage them to create a shared repository of postmortems and share them broadly across internal teams.
We have a lot of public resources that were developed by different teams at Google so you can learn more about incident management and reducing customer impact. We of course have the books. We have a Coursera course; The Art of SLOs, which was developed by the CRE team; and blog posts, talks, and webinars developed by the DevRel team. I've curated a few resources for you to get started, and in the final link you can find publicly available resources broken down by level and resource type, including the cheat sheet.
Finally, there is a wonderful gift you can give any presenter: the gift of feedback. As we are virtual, and because we are all about data, I will kindly ask you to go to the feedback link and provide your take. I was Ayelet Sachto, and you are welcome to connect with me on Twitter or LinkedIn. I will soon be sharing a new white paper on incident management. Thank you for listening, and enjoy the rest of Conf42.

See all 24 talks at this event!

Conf42 Chaos Engineering 2022 - Online

March 10 2022

Ayelet Sachto

Strategic Cloud Engineer @ Google
