Conf42 Site Reliability Engineering (SRE) 2024 - Online

Manage service reliability by managing risk

Abstract

Setting up Service Level Objectives (SLOs) is one of the foundational tasks of SRE practice. However, a pertinent question remains: are these SLOs realistic, and can they be met? Resiliency is best managed by understanding your risks and mitigating them.

Summary

  • Risks are vital for any SRE. Services can be made more reliable if you know the risks. Risk analysis helps you identify, prioritize and communicate them.
  • Most of our customers and builders use a risk catalog. A risk catalog is a structured way to capture and prioritize your risks. The catalog should span infrastructure and software. One prominent way to build it is to map the user journey. Automation helps reduce time to detect and repair.
  • Accepting risk is a vital part of your SRE work. Chaos engineering can play a vital role here: it can help identify blind spots. Use chaos and similar principles to refine and build the risk catalog.
  • With this, I come to an end. I hope you enjoyed my talk. If you have any questions or queries, I'm always available, so please feel free to reach out. Thank you.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to Conf42. My session today is on managing service reliability by managing risk. Risks are vital for any SRE, so I'll cover this broadly. First of all, let's quickly ask: are your SLOs realistic? Here is an application with multiple services, and sub-services within those. Are we really sure that the SLOs and SLIs we defined can be met? What are the risks associated with them? What would happen if one of the services were impacted? What would happen if the underlying cloud infrastructure became unavailable for a moment? Do we have an auto-healing mechanism in place? There are many risks that teams are unaware of, so there needs to be a rigorous mechanism for capturing risks for any scenario or service.
That is what we define as risk analysis. It helps you identify, prioritize and communicate risks. Services can obviously be made more reliable if you know the risks. In the previous diagram, with so many different applications and services, there could be an impact on a dependency, on capacity, on operations, or on release cycles. These are the underlying risks, and they help us understand what it would take to recover. So mean time to detect becomes one of the vital criteria, and it is part of the risk. When you create a risk analysis, you look at your different reliability metrics: MTTD, which is your mean time to detect; MTTR, your mean time to repair; what percentage of users get impacted; and the probability of occurrence. That's the value of risk analysis. So how do you define the risks? Most of the customers and builders I've seen use a risk catalog.
So what do we mean by a risk catalog? A risk catalog is a structured way to capture all your risks so that you can then prioritize them: brainstorming within the team about what could happen, looking at past data, and seeing what the various metrics are for those risks. Defining your risks first is an essential part of reliability management. Create a catalog that is brainstormed with multiple team members and multiple teams: developers as well as system engineers and cloud architects. The catalog should span infrastructure and software. It should also define your key SLIs, your MTTD and your MTTR.
One prominent way I've seen customers and our teams do this is to map the user journey: start from the points where a user interacts with the system and trace back along the different paths the user takes. Imagine a retail commerce site. Users might interact with, let's say, a catalog service where they can see all the different products. The user journey might then be to select a product, add it to a cart, possibly add multiple products, and then finally check out and arrange shipping. So you look at the user journey and at the different services the user touches, or where the user could be impacted. And one very key point is to look at past incidents. Past incidents give you a lot of data to define risks and include them in your risk catalog.
I'll quickly show you a sample risk catalog. This one is from Google SRE. Here you can see what a risk catalog looks like. For example, it starts with a configuration mishap which reduces capacity. And then we have MTTD, the mean time to detect.
So you could see the systems detecting this within 30 minutes, and then what the possible time to repair could be. Again, this comes from discussions, brainstorming, and collecting data from production incidents. You refine this data, these metrics, as you get more and more information. Then you look at the likely impact. This 20% number, again, comes from experience and from looking at your teams and your own release cycles; you can estimate this number yourself. And finally you look at your incidents per year and your bad minutes per year. Similarly, for a new release: when you are ready with the release, and roll back if required, what's the likelihood? Then we have the various reliability metrics, and so on and so forth.
You can see others, like unique breakdown or unnoticed growth. There could possibly be an outage within the cloud as well, which is a major dependency, and there could be other scenarios. These are all possible, like operators being slow. Basically, what it gives you is a fair number of elements, and this list can be endless. You can add many risks, and the good thing is that you can refine, prioritize, remove or include risks, and continuously keep enhancing or revising the catalog based on your own observations, your own experience, and your data points. So that's a very high-level view of a typical risk catalog.
What it ultimately helps with, once you create your risk catalog, is rating your risks. We saw the earlier measure, bad minutes per year. Looking at this user journey again, the user is happy when it is in blue, but it all becomes a challenge when it is in orange.
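The arithmetic behind a catalog row like the one described above can be sketched roughly as follows. This is a minimal sketch, assuming bad minutes per year are estimated as (time to detect + time to repair) × fraction of users impacted × incidents per year; the talk does not spell out the formula. The `Risk` class, the repair times, the incident frequencies and the second row are hypothetical values for illustration. Only the 30-minute detection time and the 20% impact come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    mttd_minutes: float        # estimated mean time to detect
    mttr_minutes: float        # estimated mean time to repair
    impact_fraction: float     # share of users affected, 0.0 to 1.0
    incidents_per_year: float  # expected frequency

    def bad_minutes_per_year(self) -> float:
        # Assumed formula: outage duration, scaled by impact and frequency.
        duration = self.mttd_minutes + self.mttr_minutes
        return duration * self.impact_fraction * self.incidents_per_year


catalog = [
    # 30 min MTTD and 20% impact are from the talk; the rest is hypothetical.
    Risk("Config mishap reduces capacity", 30, 120, 0.20, 4),
    Risk("Bad release needs rollback", 15, 60, 0.50, 6),  # hypothetical row
]

# Rank the catalog so the largest consumers of reliability come first.
ranked = sorted(catalog, key=Risk.bad_minutes_per_year, reverse=True)
for risk in ranked:
    print(f"{risk.name}: {risk.bad_minutes_per_year():.0f} bad minutes/year")
```

Keeping the estimates as plain fields makes it easy to revise a row when new incident data arrives and to re-rank the whole catalog.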
So there are various parameters: the time to detect, which is your MTTD and is very critical; the time to repair, which again is very critical; and the time between failures. As I said, you first define your risks, prioritize them based on this data, and collaborate between the teams. Start with estimates, then collect more data and enter it. Take time to detect: today, observability in particular, along with things like log ingestion and log management, really helps in reducing your time to detect. And then we have automation, particularly for the time to repair. As you rate your risks, you can also start planning how to mitigate them. Is this a risk that is okay for me to carry? Is this a risk that definitely needs a mechanism to address it? These are vital questions. Automation really helps. And then again, time between failures: how do you make your systems and your releases more stable? You can look at embedding different options so that you are increasing the time between failures.
So far I've broadly covered the risk catalog, risk analysis, why risk is critical, and how you rate your risks. Now I'll quickly cover accepting risk, which is a vital part of your SRE work: which risks should be prioritized, where we should focus and put in engineering effort, or whether we should bring in a larger team, or maybe the product management team, to check whether the extra engineering effort is required. Otherwise there could be an impact on the error budget. Here you can see certain entries, for example "operator accidentally deletes". This is one of the risks, and it creates 129. Is it acceptable? Possibly yes, because it falls within my error budget.
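To make the acceptance check concrete, here is a minimal sketch. It assumes the error budget is derived from an availability SLO over a 365-day year, and that a single risk is acceptable when its estimated bad minutes fit within a chosen share of that budget. The 99.9% SLO, the 25% share and the function names are assumptions for illustration. Only the figure 129 comes from the talk, and it is assumed here to mean bad minutes per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def error_budget_minutes(slo: float) -> float:
    """Minutes of allowed unreliability per year for an availability SLO."""
    return (1.0 - slo) * MINUTES_PER_YEAR


def is_acceptable(bad_minutes: float, slo: float, max_share: float = 0.25) -> bool:
    """Accept a risk if it uses at most max_share of the yearly budget.

    max_share is a hypothetical policy knob, not a value from the talk.
    """
    return bad_minutes <= max_share * error_budget_minutes(slo)


budget = error_budget_minutes(0.999)  # assumed SLO of 99.9%
print(f"Budget: {budget:.1f} min/year")
print("129 bad minutes acceptable:", is_acceptable(129, 0.999))
print("400 bad minutes acceptable:", is_acceptable(400, 0.999))
```

A per-risk threshold is only one possible policy; a team could equally sum the accepted risks and require the total to stay within the budget.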
I could use this. So the ones in blue are possibly the ones I could accept, so I've just marked them as yes. The ones in pink, or peach, are risks that cannot be accepted and need to be addressed. There is a possibility that the first three will not be accepted at all, because they would definitely impact the error budget in a major way. So I need to spend some time, or put in some engineering effort, to address or mitigate this kind of risk. The ones in yellow should not be accepted either. There could be a major issue, or this could be a major consumer of the error budget, so there has to be a mechanism. That is what the amber color defines: something that needs to be addressed, maybe as a second or third priority, but it does need to be addressed. And the ones in green are possibly ones I could accept; even if they occur, they may not significantly consume my error budget.
So risk mitigation and risk cataloging are vital elements. I've worked with many customers, helping them build such risk catalogs and then putting a risk analysis mechanism in place to prioritize these risks. One key element I have also observed is the role of chaos. Chaos engineering can play a vital role, because when you are estimating the risks for different scenarios or different areas, chaos helps you identify certain blind spots. You can attempt some failures, inject some faults, and then really understand the risk. Is your observability really there? What is your MTTD? What is your MTBF? And what is your MTTR?
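One way to turn fault-injection runs into catalog numbers is sketched below. It assumes each run records three timestamps: when the fault was injected, when it was detected, and when the service recovered. The `Experiment` class, the `summarize` function and the sample values are hypothetical. MTBF is left out on purpose, since scheduled injections say little about how often unplanned failures occur; that figure would need real incident history.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Experiment:
    injected_at: float   # minutes since the start of the test window
    detected_at: float   # when an alert fired
    recovered_at: float  # when the service was healthy again


def summarize(runs: list[Experiment]) -> dict[str, float]:
    """Estimate MTTD and MTTR (in minutes) from fault-injection runs."""
    return {
        "mttd": mean(r.detected_at - r.injected_at for r in runs),
        "mttr": mean(r.recovered_at - r.detected_at for r in runs),
    }


# Hypothetical measurements from three injected faults.
runs = [
    Experiment(0.0, 12.0, 50.0),
    Experiment(100.0, 108.0, 130.0),
    Experiment(200.0, 210.0, 245.0),
]

result = summarize(runs)
print(f"MTTD: {result['mttd']:.1f} min, MTTR: {result['mttr']:.1f} min")
```

Measured values like these can replace the initial estimates in a risk catalog entry, and a fault that is never detected points to an observability blind spot.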
One more vital point: chaos can also surface different risks, unnoticed risks or blind spots that you never realized could even happen. So chaos engineering definitely comes in as a vital part of this. What helps is identifying fault areas, injecting faults there, and using the results as part of your risk analysis. So there is a lot an SRE can do with chaos and similar principles to refine and build the risk catalog, analyze the risks, prioritize them, and then define your engineering efforts. With this, I come to an end. It's a fascinating discussion. I hope you enjoyed my talk. If you have any questions or queries, I'm always available, so please feel free to reach out. These are my coordinates. Thank you very much.

Jaiprakash Pherwani

Offerings Lead @ Tata Consultancy Services
