Reliability Excellence with SLO Management Products

Abstract

Reliability Excellence with SLO Management Products focuses on ensuring system reliability by setting, measuring, and optimizing Service Level Objectives (SLOs). These products help teams proactively monitor performance, reduce downtime, and align reliability with business goals.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello all. Good morning, good afternoon, good evening. I'm Siddharth, and first of all, thank you for joining this session. Today I will talk about how we can achieve reliability excellence with SLO management products. Let me introduce myself. I have 12-plus years of IT experience. Currently I'm working with Nagarro as a DevOps and platform lead, and I love to talk about DevSecOps, cloud engineering, SRE, and platform engineering. I'm happy to connect with you via LinkedIn and the email address I have put here. So let's start the session.

The agenda for today's session starts with a quick look at SRE principles and practices. Then we'll see what challenges we face with SLOs in complex environments and what our approach to solving them should be. Then we'll look for the solution by understanding SLO management. Then we have one use case, where we'll see how to define SLOs and how to measure the reliability of the system. Then we'll look at the SLO management products available in the market, and at some comparisons between those products.

In the SDLC, we know that developers want to go fast: they want to deliver features as quickly as possible. In parallel, operations teams want to slow things down, because they want to make the system reliable. This misalignment between the two teams often creates tension, and that's where SRE comes into the picture. SRE was introduced by Google in the early 2000s, and it is useful for measuring and solving operational challenges and issues. SRE is a job role, it's a mindset, and it's a set of metrics that we use to measure and manage overall system reliability. There are some key principles of SRE.
First is the service level indicator. Service level indicators (SLIs) are the metrics used to measure the quality of the service. There are a couple of different metrics; the ones we mostly use are availability, throughput, and latency.

Then we have the service level objective. Service level objectives (SLOs) are the target levels for those metrics, for example that our system should be available 99.99% of the time. That is the target level we set. An SLO is for internal benchmarking, so it is a goal rather than a promise. But it also helps customers: we define specific metrics, and they get some assurance that our system will be available up to that target.

Then we have the service level agreement (SLA). It's a contract between the end users and the service provider. If the promises are broken, the service provider faces consequences, such as paying penalties or extending subscriptions.

Then we have the error budget. But before moving to the error budget, we need to understand a common misconception in SRE: that we should always aim for 100% system reliability. This is not true. Aiming for 100% reliability means the system must be up and available all the time. In practice that can't be the case; sometimes your system will be down, and some kind of disruption can happen. It also affects the overall product lifecycle, because if we go for 100% availability, developers have to focus on the reliability of the system. They can't focus on feature development, and they can't focus on innovation.
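To make the gap between a 99.99% target and 100% concrete, here is a minimal Python sketch that converts an availability SLO into the downtime it permits. The function name `allowed_downtime` and the 30-day window are illustrative assumptions, not something shown in the talk.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return how long a service may be unavailable within `window`
    while still meeting an availability SLO of `slo_percent`."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for target in (99.0, 99.9, 99.99, 100.0):
        print(f"{target}% over 30 days allows {allowed_downtime(target, window)} of downtime")
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of downtime in 30 days, while a 100% target leaves none, which is why chasing it pushes feature work aside.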
That's why the development, operations, and product teams look to agree on a common target level for the overall system, and that's where the error budget comes into the picture. The error budget is the amount of unavailability of a particular system that the teams can tolerate: how much time the system can be unavailable while that is still fine for the overall lifecycle. When you have an adequate error budget, you can focus on product development. But when your error budget is threatened, you need to focus on the reliability of the system.

Then the question is how we can put these principles into practice. We have different kinds of practices.

First, metrics, monitoring, and alerts. If we care about system reliability, we need properly defined monitoring systems and the correct metrics for our services. We need to set alerts on the error budget, so that when the error budget is threatened we get those alerts and can focus on system reliability.

Then demand forecasting and capacity planning. This is a key practice: we make sure enough resources are available so that, if something happens, they can work on reliability as well as on improvements.

Then we have change management and emergency response. Each team, such as development and operations, should know the overall process defined for change management. Even if a change is small, they should focus on how to carry it out so that it improves the overall reliability of the application.
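The error budget and the alerting rule described in this part of the talk can be sketched in Python as follows. This is only an illustration under stated assumptions: the request-based budget, the `ErrorBudget` class, the `recommended_focus` helper, and the 25% alert threshold are all hypothetical choices, not the speaker's implementation.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that violated the SLI in that window

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates for the observed traffic."""
        return self.total_requests * (1 - self.slo_target)

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is overspent."""
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / self.allowed_failures


def recommended_focus(budget: ErrorBudget, alert_threshold: float = 0.25) -> str:
    """Return 'reliability' when the budget is threatened, else 'features'.

    The 25% threshold is an assumed value; teams pick their own.
    """
    if budget.remaining_fraction <= alert_threshold:
        return "reliability"
    return "features"


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=200)
    threatened = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=900)
    print(recommended_focus(healthy), recommended_focus(threatened))
```

A real alerting setup would usually evaluate this continuously against monitoring data; the point here is only the decision rule: plenty of budget left means feature work, a threatened budget means reliability work.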
Then emergency response. In practice, a system will sometimes be down, so you need to be ready for whatever emergency comes. You need to focus on how quickly you can get your system back up and running.

And then culture and toil management is also one of the key practices; it is about how you manage operational workloads. In SRE, you can't automate each and every thing. There are manual practices you need to follow, and you need to manage that part of your operational workload accordingly. That's a basic understanding of SRE.

Now let's move to the challenges you face with SLOs for complex systems. Here, a complex system could be microservices-based, or it could be a distributed system. So let's understand the problem.

Sometimes, when we define SLOs for a complex environment, the SLOs are not aligned with the business goals. The priorities are misaligned: we focus the SLOs on one set of metrics, but our business goals need a different kind of metrics, and the target levels differ too. This is one challenge we face.

We also mostly don't follow a data-driven approach. When we define SLOs, we need to check the previous data, and measure and define the SLOs for the system accordingly.

And sometimes we just set unrealistic SLOs, while the other KPIs are completely different. We need to understand the system first, and then set realistic targets for it.

Then, of course, there are distributed systems and interdependencies.
If we have a distributed system with multiple microservices and APIs, where one microservice depends on another service and they have further interdependencies, it is really difficult to pinpoint where an SLO is violated. This is also one of the challenges we face in this kind of environment.

We also don't look at the priority components. We need to define, for our system, what exactly the user is doing. So we need to look at the user journey, set the priorities of the system accordingly, and then define the SLOs.

Of course, there are tool limitations: whatever tools we have come with limited capabilities, so that is also a challenge.

And finally, of course, there are dynamic and evolving systems. We work in an agile environment; our systems scale up, get upgraded, and are continuously deployed, so new features keep coming in. Accordingly, we need to set the SLOs again as things change. So these are some of the challenges.

So what should our approach be to solve these issues? First, we need to understand the system correctly: what the particular system is, what the different services in it are, and what the user does in their day-to-day activity. So the critical user journeys also need to be understood. Then we can go to the design part: how to design SLOs for the critical services, and then set metrics for them. And at the end, of course, how we can use them effectively. That is the approach we can follow to try to solve these complex issues.

But the question is, what would our actual solution be? The solution is service level objective management.
SLO management covers a lifecycle that we need to define within our organization. It involves defining the SLOs by aligning the user journey with the business objectives, whatever business goals we have. So we need to define the user journey first. Then we set realistic targets for KPIs like uptime, latency, and error rate. That is one of the key practices we need to follow.

Then we have different kinds of SLOs, like composite SLOs. A composite SLO is a combination of the individual SLOs that we define. If we want to measure the reliability of a complete system, we need a holistic view of that system. That's where the composite SLO comes into the picture: we have individual SLOs for the different services, and those SLOs are aggregated to create a composite SLO.

And at the end, correct measurement and monitoring of the system is needed, where we define the correct alerts for the error budgets and monitor the complete system.

So what benefits do we get? Of course, overall system reliability and performance improve. We have less downtime, fewer incidents, and fewer priority issues. As we already discussed, data-driven decision making is one of the key factors for prioritizing improvements, because you look at the previous data to understand the system better. Of course, proactive monitoring of the system also reduces the mean time to recovery. It helps communication and collaboration within the team, and it helps build a shared understanding of the overall reliability process. Then, of course, there is customer satisfaction: by defining these practices and processes, we can also focus on innovation.

Now, let's understand a use case.
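As one possible way to apply the data-driven step mentioned above, the following Python sketch proposes a realistic availability target from past measurements. The heuristic (a nearest-rank low percentile of historical results), the function name `suggest_slo_target`, and the sample numbers are assumptions for illustration; the talk does not prescribe a specific method.

```python
def suggest_slo_target(history: list[float], percentile: int = 10) -> float:
    """Suggest an SLO target from historical SLI values (in percent).

    Uses the nearest-rank percentile: the returned value is one the service
    met or exceeded in roughly (100 - percentile)% of past periods.
    """
    if not history:
        raise ValueError("history must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range 1..100")
    ordered = sorted(history)
    # Integer ceiling of percentile/100 * len(ordered), at least rank 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical weekly availability figures for one service.
    weekly_availability = [99.95, 99.99, 99.90, 99.97, 99.99,
                           99.80, 99.98, 99.96, 99.99, 99.93]
    print(suggest_slo_target(weekly_availability, percentile=20))
```

A target chosen this way reflects what the system has actually delivered, which helps avoid the unrealistic SLOs described earlier; the business goals still decide whether that level is good enough.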
This is a simple product lifecycle of a user journey. The user goes into the shopping app, puts some products into the cart, and makes the payment. After that, the backend process takes over: the warehouse receives and dispatches the order, then invoicing happens, and the user gets the confirmation by email.

Let's break this journey into two parts: first the pre-purchase, and second the post-purchase. In the pre-purchase part, the user adds products and purchases them, so the overall pre-purchase system needs to be available, and of course so does the payment system. The second part is the post-purchase.

One thing to note about this system is that availability is more important than latency, and the services in the pre-purchase part are more important than those in the post-purchase part. In the pre-purchase part, website availability matters slightly more than payment availability, because if the website is not available, the user can't even go and purchase; after that, payment availability is needed. So from this user journey, we need to define which are the critical components and which are the critical services, and then define the metrics for those services accordingly.

And at the end, in the post-purchase part, the warehouse and invoicing are slightly more important than emailing. Your order should be dispatched and your invoice should be created; if you get the email some time later, that is fine.
So the question is how we can measure the reliability of this complete user journey. First, identify the services and their dependent components. From this user journey, we have the store website, payment, invoicing, emailing, and warehousing. These are the services we have identified. Then we set up individual SLOs for each service, with respect to different metrics such as availability and latency. Based on our understanding, we define the composite SLOs, such as the pre-purchase user experience and the post-purchase user experience. And at the end we look at the overall composite SLO. As I said, composite SLOs are aggregations: we have the pre-purchase and post-purchase user journeys, and aggregating them helps us understand the overall system reliability. This is a simple use case of the overall reliability process.

Then we have the SLO management products, which can help you define the SLO management lifecycle. They also give you good features, such as composite SLOs and user journeys. Harness and Nobl9 are, I would say, the leaders here, because they have features like composite SLOs and user journeys. Harness is a complete 360-degree platform. It also gives you a specific feature for service reliability management, in which we can analyze and easily find out why our system is failing, and it helps you set up governance with Open Policy Agent. Nobl9 helps you keep three years of old data, so you have previous data to understand your system and can make decisions accordingly. It has a good composite SLO feature, and it also lets you define the critical user journey.
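The aggregation step from the shopping use case can be sketched in Python as a weighted average of per-service availability. This is a minimal illustration: the weighted-average method, the function name `composite_availability`, the weights, and the availability figures are all assumptions chosen to mirror the priorities described in the talk (website over payment, warehouse and invoicing over emailing, pre-purchase over post-purchase). It does not represent how any particular SLO product computes composite SLOs.

```python
def composite_availability(slis: dict[str, float], weights: dict[str, float]) -> float:
    """Aggregate per-service availability (in percent) into one composite value
    using a weighted average. Every service in `slis` needs a weight."""
    total_weight = sum(weights[name] for name in slis)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(slis[name] * weights[name] for name in slis) / total_weight


if __name__ == "__main__":
    # Hypothetical measured availability per service, in percent.
    pre_purchase = {"store_website": 99.95, "payment": 99.90}
    post_purchase = {"warehouse": 99.80, "invoicing": 99.70, "emailing": 99.00}

    # Hypothetical weights reflecting the priorities from the user journey.
    pre = composite_availability(pre_purchase, {"store_website": 0.6, "payment": 0.4})
    post = composite_availability(
        post_purchase, {"warehouse": 0.4, "invoicing": 0.4, "emailing": 0.2}
    )

    # The overall composite aggregates the two journey-level composites.
    overall = composite_availability(
        {"pre_purchase": pre, "post_purchase": post},
        {"pre_purchase": 0.7, "post_purchase": 0.3},
    )
    print(f"pre-purchase: {pre:.3f}%  post-purchase: {post:.3f}%  overall: {overall:.3f}%")
```

Weights let a less critical service, such as emailing, dip without dragging the composite down as much as a dip in the store website would, which matches the prioritization in the use case.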
Then we have some other tools like FireHydrant, Squadcast, ServiceNow, and New Relic, which fit under monitoring and incident management but are not specifically SLO management products. These are some of the products available in the market.

I have done a comparison on the basis of different capabilities: real-time error budget tracking, mapping SLOs to business capabilities, RCA, customized dashboards and reports, automation, auditing, and customized views with respect to the business. You can see that some tools have some of these capabilities. Some are good at specific features, and some fit the complete category, mostly Harness and Nobl9. So you can choose according to your requirements and use the features of those products.

That's it for this session. Thank you very much for joining. Thank you, everyone.

Conf42 Site Reliability Engineering (SRE) 2025 - Online

April 17 2025 - premiere 5PM GMT

Siddharth Joshi

Senior Staff Engineer/Senior DevOps Tech Lead @ Nagarro
