Conf42 Site Reliability Engineering 2023 - Online

Product Management in SRE

Abstract

In this talk I’ll cover the influence of Product Management in SRE and the future of SRE. We’ve come a long way from Ops to DevOps to SRE. What will be the next step on this path?

Summary

  • Jayaganesh Kalyanasundaram is a software engineer at Google. In this talk he speaks about product management in SRE, going over the generic SRE principles and how his team benefited in each of them from having a dedicated product manager.
  • Service level objectives should not be confined to individual microservices. We want to keep the larger picture in mind. In our CI/CD platform, we have a simple action of doing a rollback. By writing down critical user journeys, we were also able to measure the product quality.
  • We want an error budget policy that requires us to work on the product more. This is yet another aspect where we benefited a lot from the product manager, because they helped us navigate these discussions with the stakeholders. Pausing the feature work, or the more shiny work, to focus on the reliability of the product makes for difficult discussions.
  • Next, let's look at the shared responsibility model. SRE ideally needs to be involved from business to development to operations. You want to ensure that the product is built to scale from the start. This requires big leadership buy-in.
  • A lot of these SRE practices can be put into four major themes: CI/CD, monitoring, capacity planning, and incident management. Automating these centrally helps us reduce the overall cost.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. My name is Jayaganesh Kalyanasundaram. I am a software engineer at Google, and in this talk I'll be speaking about product management in SRE. This talk is about the SRE principles and how product management fits into them. I'm speaking about it because our whole team benefited a lot from having a dedicated product manager to take care of all the product aspects in SRE. It is also to make sure people are aware that SREs are not there just to do the operations work; they are responsible for the overall reliability of the product. So how can we benefit from recognizing the product aspects within SRE? I'll be going over the generic SRE principles and how we benefited as a team in each of them by having a dedicated product manager. Let's start with the first one, service level objectives. Pretty straightforward, right? When the SLOs are met, you have happy users. When they're not met, you have sad users. But this gets trickier in this day and age, because we are moving from monoliths to microservices, and rightly so, because microservices bring a lot of other benefits. The difference is that earlier the monolith measured the overall service level objective, including the user paradigms. But now we have microservices, and platforms and products that provide monitoring for each microservice by default, and service level objectives should not be confined to these individual microservices. We want to keep the larger picture in mind. So let's take an example. In our CI/CD platform, we have a simple action of doing a rollback, and I'm pretty sure many of you use this on a day-to-day basis: whenever your system doesn't function as intended, you hit the rollback button.
Now, the engineers wanted to measure the SLO for the rollback as a journey that starts with the user hitting the rollback button and ends with a specific RPC being sent to the right system to initiate the rollback. Makes sense, right? The managers and tech leads said: you know what, we wouldn't stop at the point where the RPC is sent. We would also wait for the instantiation of the workflow that does the rollback. But the people who are focused on the product, which is management and our product managers, felt that rollback is a journey. It's not something we want to be accountable for just as a button, but as the whole journey: when I click the rollback button, I want everything to function smoothly and the rollback to actually happen. So right here we have a total of three different interpretations of how the rollback journey could be described and of which service level journey we are looking at, before we even get to the targets and the objectives. Because of this, our product manager made a beautiful framework for writing down the critical user journeys. By writing down, I mean having an explicit document that clearly states the end-to-end user journey for each of them. We benefited because everyone was on the same page when they spoke about any specific journey and what its SLO should be. It also helped us get a sense of the overall product quality. What I mean is, let's say the product offers 15 journeys, but we have SLOs governing only five of them. We are covering just one third of the journeys, and the other two thirds are not governed. That means they could be functioning poorly and we wouldn't be alerted. And that's not the right way to run a product, right? So we were also able to measure the product quality with this framework.
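The coverage idea above (15 journeys offered, SLOs on only five of them) can be written down as a small calculation. This is a minimal sketch, not the team's actual framework: the function name and the journey names are hypothetical, and only the 15 and 5 come from the talk.

```python
def slo_coverage(journeys, journeys_with_slo):
    """Return the fraction of critical user journeys governed by an SLO."""
    all_journeys = set(journeys)
    if not all_journeys:
        raise ValueError("at least one critical user journey is required")
    # Only count SLOs that belong to a journey we actually documented.
    covered = all_journeys & set(journeys_with_slo)
    return len(covered) / len(all_journeys)


# Hypothetical journey names; the talk only gives the counts (15 and 5).
all_journeys = [f"journey-{i}" for i in range(1, 16)]
covered_journeys = all_journeys[:5]

coverage = slo_coverage(all_journeys, covered_journeys)
print(f"{coverage:.0%} of critical user journeys have an SLO")
```

A low number here does not say the product is broken, only that most journeys could degrade without anyone being alerted.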
And right there, we were able to benefit a lot from product management in just this one aspect of SLOs. The next topic is making tomorrow better. This is especially important in the early phase of SRE in any team, because initially SRE is the one holding the pager. They're more like the pager monkey for the first while, until they recognize the patterns and the major pitfalls and try to improve on them. But we can only improve on them when we have the error budget for it. What I mean is, let's say your SLO states that your service can be down for five minutes in one month. You are just ten days into the month and you have already been down for four minutes, so you have only one minute left for the next 20 days. You will be forced to hold the pager and do nothing more, because with one more minute of downtime you hit your maximum error budget. So we want an error budget policy that requires us to work on the product more. This is yet another aspect where we benefited a lot from the product manager, because they helped us navigate these discussions with the stakeholders, where we were able to advocate: you know what, we will not be doing any more feature development. We will stop it and focus entirely on improving the reliability of the product. These kinds of initiatives are pretty difficult for the product itself, because you're compromising on feature velocity and putting more focus on reliability. Rightly so, but it also means you need to have these conversations with upper management and convince them. And the product manager, having that interest in the product, is one of the best people to do this.
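The error budget arithmetic in the example above (five minutes of allowed downtime per month, four already used after ten days) can be sketched as follows. The class, the 30-day window, and the 25% freeze threshold are assumptions made for illustration; the talk does not describe a concrete policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    budget_minutes: float  # allowed downtime in the SLO window
    window_days: int       # length of the SLO window (assumed 30 days)
    used_minutes: float    # downtime consumed so far
    elapsed_days: int      # days elapsed in the window

    @property
    def remaining_minutes(self) -> float:
        return max(self.budget_minutes - self.used_minutes, 0.0)

    @property
    def burn_rate(self) -> float:
        """Share of budget used divided by share of time elapsed (>1 is too fast)."""
        if self.elapsed_days <= 0:
            raise ValueError("elapsed_days must be positive")
        used_share = self.used_minutes / self.budget_minutes
        time_share = self.elapsed_days / self.window_days
        return used_share / time_share

    def should_freeze_features(self, threshold: float = 0.25) -> bool:
        """Hypothetical policy: pause feature work when little budget remains."""
        return self.remaining_minutes / self.budget_minutes <= threshold


budget = ErrorBudget(budget_minutes=5, window_days=30, used_minutes=4, elapsed_days=10)
print(f"remaining: {budget.remaining_minutes} min, burn rate: {budget.burn_rate:.1f}x")
print("freeze feature work:", budget.should_freeze_features())
```

With these numbers the budget is being spent about 2.4 times faster than the month is passing, which is the kind of figure that can support the conversation with stakeholders about pausing feature work.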
In terms of product work, as I mentioned before, there are a lot of things like automating repetitive work and looking into the postmortem action items, and saying: you know what, this whole quarter we'll focus just on postmortem action items because we haven't been doing so for the last two years. Again, that's a very bad state to be in. But these discussions, where you pause the feature work or the more shiny work to focus on the overall reliability of the product, are difficult discussions that are best had wearing the hat of the product manager. Next, let's look at the shared responsibility model. As I mentioned before, the general tendency is that all the work gets dumped on SRE. All the toil gets dumped on the SREs. They are more like the pager monkeys, whereas it's supposed to be a shared model where SREs are responsible for the overall system's reliability. A classical system development model goes from having a business idea, to the initial business modeling, to development, to operations, where you ship the features and actually make revenue from them. SRE ideally needs to be involved from business to development to operations. Traditionally, they have just been the operations people. That's why we have DevOps, which is a shared model between development and operations. But if SREs are involved in the business-to-development phase as early as possible, they get a voice and an opinion in the way the product is developed, looking at the scale and the overall reliability needs of the product. Scalability is an important aspect: you want to ensure that the product is built to scale from the start, rather than building it in some ad hoc fashion and then investing a lot more time in rebuilding it for scale.
Again, these are difficult things which require a lot of leadership buy-in, and hence a product manager's or product owner's influence makes a big impact here. So here's an example from my own team about leadership buy-in. As I mentioned before, we were looking at the rollback journey as a whole: from the user clicking rollback all the way to the rollback actually finishing successfully. And we wanted to measure the SLO for this. Just like any other SLO, we started with a 99.99% target. So four nines is the target for any rollback attempt to finish successfully. To our surprise, the initial success rate was 40%. It's nowhere close to four nines. It's less than half, right? The reason was that more than 50% of these errors occurred because the user didn't have the right access to initiate a rollback for the microservice. To give an example, an engineer who works on Google Search shouldn't be able to roll back software that runs YouTube, right? Because you wouldn't want someone to dismantle or change another team's product. In our case it wasn't a totally different product, but different microservices that the user didn't have the necessary access to. Again, this is not an issue with the product. The product is functioning perfectly well. It's an issue on the user's side. But we still wanted to measure the end-to-end journey, because unhappy users don't translate to a good user experience. So we wanted to better the user experience, and what we had in mind was to let users know that they don't have access, which is what we finally did. We now tell them in a big red bar that they don't have access to this microservice, so they can't perform any of the emergency actions on it, and because of that they no longer attempt rollbacks on these kinds of services.
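To see how user-caused failures pull down an end-to-end rollback SLO, here is a minimal sketch that reports the success rate both ways. The outcome labels and the attempt counts are made up for illustration; they are only chosen to match the figures from the talk (a 40% success rate, with more than half of the failures caused by missing access).

```python
from collections import Counter


def rollback_success_rates(outcomes):
    """Summarize rollback attempts; outcome labels are hypothetical."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rollback attempts recorded")
    succeeded = counts["ok"]
    user_caused = counts["permission_denied"]
    failures = total - succeeded
    return {
        # The end-to-end journey as the user experiences it.
        "end_to_end": succeeded / total,
        # The same journey with user-caused failures left out.
        "excluding_user_caused": (
            succeeded / (total - user_caused) if total > user_caused else 0.0
        ),
        # How much of the failure volume is user-caused.
        "user_caused_share_of_failures": user_caused / failures if failures else 0.0,
    }


# Illustrative data: 1000 attempts, 400 succeed, 350 of the 600 failures are access errors.
outcomes = ["ok"] * 400 + ["permission_denied"] * 350 + ["system_error"] * 250
rates = rollback_success_rates(outcomes)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

The team in the talk chose to keep measuring the end-to-end number, since a denied rollback is still an unhappy user; the second figure is only a way to show stakeholders how much of the gap is outside the system's control.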
But keeping that aside, a 40% success rate against a four-nines target makes you look really bad before upper management. So we wanted to loosen the SLO from four nines to literally 45% initially, until we made this recent change. This is a very drastic change to what your product SLO should be, and it requires big leadership buy-in. We had to convince upper management that we don't want a four-nines target; we instead want a 45% target, because most of these errors are user-caused errors. Convincing upper management and the stakeholders of the overall product requires a lot of leadership and stakeholder management, while also keeping the interest of the product in mind. This was yet another place where our team's product manager helped us greatly to navigate these conversations. As I mentioned before, this is about putting the reliability and consistency of the product up front and making them one of the major aspects of any feature launch. For example, ensuring that your feature launches are covered by SLOs and by integration tests, so that you invest as much as possible early on to ensure the product is reliable and working well, is one of the things we have developed in our team, and it builds a lot of resiliency into the product. We have recently developed feature launch requirements for internal SRE feature launches as well, to ensure that launches of any SRE product or SRE tool are also governed by practices like integration tests, SLOs, and having the proper emrs on who the EAP customers are going to be, what the market is for the strategies, and so on, just to ensure that the product launch is smooth and we have a consistent product going forward. The product doesn't get brittle after every launch.
Automation is yet another area where SRE teams can benefit a lot. I want to dwell on this slide, because a lot of SRE practices can be put into four major themes. CI/CD is the bread and butter of every operations and DevOps team, because that's how DevOps even started: they wanted to ensure that the CI/CD aspects are handled by an operations team. Closely related to CI/CD is monitoring. You want to ensure that your product and your services are monitored well. The next aspect is capacity planning, which matters when you have any inorganic launches. Let's say people want to watch cricket matches, and because of that there's a sudden spike in traffic for, let's say, YouTube and Google Ads. You want to be able to do the capacity planning for these. And the fourth, obviously, is incident management. When everything works well, everyone is happy, but when there is an incident with a very big impact, everyone rushes to see what is happening, how they can solve it, what the current status is, and so on. So we want a medium to communicate to all the required stakeholders and interested people what the state of the incident is, what approach is being taken to fix it, whether it is mitigated or not, and what impact it has right now. So CI/CD, monitoring, capacity planning, and incident management are the four major themes across all SRE teams. Automating these centrally helps us reduce the overall cost. And this is yet another place where we have been very lucky to be able to invest in horizontal products. For example, our team focuses on the CI/CD products.
This reduces the cost of maintaining CI/CD platforms and tools separately for every single team, because now they can focus on their vertical, team-specific strategies. And this concludes the section on having the ability to regulate the workload. We want to be able to prioritize the work, and we want to be able to push back when there are unreliable practices. As I mentioned before, we sometimes want to be able to say that we will focus on reliability and not on feature development anymore. And then, obviously, the fourth aspect: blamelessness. Blamelessness is something that is homegrown within SRE for the most part, right? Whenever something goes wrong, we want to capture it well and make sure we don't repeat the same thing again. Having postmortems for every incident, writing down action items, ensuring the postmortem action items are fixed on time and solve the problem at the root cause, and making sure we don't hit the same issue after that are some of the basic principles of SRE, because we have paid the price. Making it blameless ensures that people put in all their thoughts about everything that went wrong and how the system should become resilient to prevent those issues from happening in the future. Because human errors are really system problems in the end, right? So failure is an opportunity to improve and not to blame. Instead of pointing fingers, we want to ensure the reliability of the service, find out what went wrong, and make sure it gets better over time. We want to improve the mean time to detect and the mean time to repair of any failure. Because if something similar happens in the future, we want to find it out as soon as possible, and in fact hopefully never have to find it out, because we have made sure it was fixed through the postmortem action items.
But if we do run into them, we detect them and fix them as soon as possible. So to recap, these are the four principles of SRE and how we benefited from product management in each of them. To conclude my talk, I hope I've given you a sense of how our team benefited from a dedicated product manager in each of these four principles, and how these four principles can be seen through the lens of product. What I've seen in a lot of teams within Google as well is that the senior members of the team, the people in technical leadership and management, generally wear the hat of product management for SRE, and any training for them in product management has been really beneficial, because they understand the users and what the product can do for the users. I hope everyone has got a sense of product management in SRE, and I hope you liked it.

Jayaganesh Kalyanasundaram

Technical Lead @ Google



