Conf42 Site Reliability Engineering 2023 - Online

Bridging the Gap: Leveraging Incidents to Align Platform and Product Engineering


Abstract

Incidents can be a real eye-opener when it comes to the flaws in your infrastructure and product. In this talk, you’ll learn how to turn incidents into opportunities to boost collaboration between platform engineering and product management by implementing reliability via product-driven monitoring.

Summary

  • Hi, this is Bridging the Gap: Leveraging Incidents to Align Platform and Product Engineering. The talk explains how severe incidents happen, how to make the most out of an incident, and closes with more tips and conclusions.
  • This talk is about how to survive a 2:00 a.m. on-call apocalypse. Do you revert the code and pray that it fixes the issue? Or do you investigate further? There is definitely something else you could have done, and the goal of this talk is to share what that something else is.
  • Bacteria cause cheese, and bugs create incidents. Even with three very good layers, issues can get through. Incidents are great as a wake-up call to invest in modernizing your code base. But if you choose the wrong metric, you'll end up fixing the wrong thing.
  • Tests are not enough to catch all the issues you're looking for. DNS and networking issues are extremely common, and it's really hard to catch them if you rely on testing alone.
  • Implementing business value proxy metrics, BPMs for short. BPMs are objective measures of how well your system is performing business-wise. They allow us to build a lot of alignment between different parts of the business.
  • Make sure you're using some sort of incident management software to capture your learnings, and use those learnings to create BPM-backed alerts. The most successful way to introduce BPMs is to build alignment with the rest of your company.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Bridging the Gap: Leveraging Incidents to Align Platform and Product Engineering, and the talk will cover the following: how to survive a 2:00 a.m. on-call apocalypse, how severe incidents happen, how to make the most out of an incident, the gaps in your testing, a metric you should be tracking and probably aren't, how product management and platform engineering think in very similar ways, and even more tips and conclusions. Let's start. This is going to be a choose-your-own-adventure, and we'll start with a story about how to survive a 2:00 a.m. apocalypse. So it's 2:00 a.m. and your pager starts buzzing. Do you revert the code? You won't even check what it contains or how it broke your app; you're just going to revert and pray that it fixes the issue? Or do you investigate the issue further? Let's assume you chose A: you revert the code and hope it fixes your issues. Nah, it didn't. This was clearly not the culprit, so you go back, and now let's look deeper into the issue. You look deeper and, success, you find the issue and quickly deploy a fix. It was probably just a typo or something really simple that you caught quickly. So you're happy. Now, do you just go back to bed? Your shift is almost over; in 30 minutes it'll be someone else's problem, so no need to worry. Or do you try something else? Let's assume you were lazy and just went back to bed. Who cares? Someone else will deal with it if it comes up again. Unfortunately, this was terrible. Your site has gone down again, the next engineer went ahead and reverted your fix, which in turn caused an even bigger outage, and then you receive a call from an angry customer demanding an explanation. The end. And this is a terrible ending. This is definitely a failure for your company, your team, and even yourself. There is definitely something else you could have done. So how about we try something different? The goal of this talk is to share with you what that something else is. But before jumping into that, let's have another story. It's time for Swiss cheese and the anatomy of severe incidents. Severe incidents and cheese have a lot in common: both are the product of letting something simple go unchecked. Bacteria cause cheese, and bugs create incidents. You also can't predict where the holes will be; they appear at random, and because of that randomness, holes align in very unexpected ways, which in turn makes even bigger holes. This is so common that there is even a model named after it: the James Reason Swiss cheese model. And as you can see, even with three very good layers, issues can get through. If the holes align just right, it's almost as if there was nothing in between them. This is a very common phenomenon, so it's something you always need to keep an eye on. So what can you do about these? You can acknowledge that they will happen; distributed systems are hard, so incidents will happen. You can try to increase monitoring and alerting, because data helps spot these issues. You can invest in tools to detect anomalies; this is very common, and most platforms now have anomaly detection built in.
Or you can ask developers to instrument their apps: make it a developer's problem, you don't have to worry about it, and they will learn a lot if they do it well. Unfortunately, this is not enough. Yes, bugs happen, but we should strive to reduce them as much as we can. Increasing monitoring and alerting is great, but if you choose the wrong metric, you'll end up fixing the wrong thing, like we saw in that first example, where we assumed something small was the issue. Investing in tools to detect anomalies is great, but unfortunately in our business anomalies happen all the time for all kinds of reasons: the network is not reliable, solar flares happen, data centers can have power fluctuations. All those things will create anomalies in your systems, and if you're not careful, you can end up addressing them the wrong way. And finally, asking developers to instrument their apps without any guidance is going to end up being extremely expensive, and you're also contributing to the problem of people looking in the wrong place; sometimes they will blame issues on the wrong metric, and that's definitely not what we want. So you should never let a good crisis go to waste. Incidents are great as a wake-up call to invest in modernizing your code base, improving your infrastructure, and adopting better tools, because incidents expose our weak spots. Incidents show where issues are stemming from and identify the areas that need further investment. It's important, though, to beware of survivorship bias: adding armor to the least-hit areas of an aircraft is more effective than reinforcing the non-mission-critical areas. So when you're thinking about your issues, also look into the things that are not failing, because some of them can be the critical thing; those can be the load-bearing parts of your system. It's important that you look at this holistically. And yes, a severe incident will energize your team to spring into action, but unfortunately we cannot solve problems with the same mindset that created them in the first place. We need to do something else; we need to look at this through a different lens. So now I'm going to reveal the secret of the testing pyramid: tests are not enough. Imagine this scenario: you have 800 tests for the user model, 80 tests for the endpoint, 40 tests for the React component, and even eight end-to-end tests with Playwright and Cypress. Things can still go bad, and I've seen situations like this happen. Someone deletes an important S3 bucket, someone accidentally ships a button color that matches the background, there's a route misconfiguration. These simple issues happen. DNS and networking issues are extremely common, and it's really hard to catch them if you rely on testing alone, because as we saw previously, holes align very easily. With this level of testing, you won't catch all the issues you're looking for. So now, without further ado, I want to reveal what that something else is, the thing that can really help you feel more confident on call. That something else is called implementing business value proxy metrics, BPMs for short. I haven't found a better name for this metric, so I just came up with this one; it sounds fancy, but it's not complicated at all. Let me go through an example that will clarify everything.
So let's take a first example: you want to make sure that your login is working correctly. This is a metric you can generate, and it's extremely simple to generate. You just ask: of all the people who go to the login page, how many get to the dashboard? Getting to the dashboard means they made it through the sign-up or login process. To generate a number, you come up with a heuristic: say that in normal situations 70% of people are able to do that. That's your metric, that's your BPM, that's something you can track. And whenever that number goes way up or way down, there is something happening in your system, something you need to look into. On the positive side, if people are logging in more and this turns out to be 80 or 90%, maybe your effort to simplify the login process is paying off, or your auto-login is working. Versus the opposite: if this metric is dipping, did someone delete a button, did someone do something wrong? Or even worse, what if login is not working at all and people are landing on the dashboard page unauthenticated? All three scenarios are really easy to detect with a BPM. So yeah, alerts on BPMs are great. They allow you to move fast with confidence. This is something you can page on, and it will be a better proxy than looking at individual low-level metrics. But there's a catch: as the name says, you need to understand what business value is. Let's go to a definition. I really like this definition by Coda Hale: business value is anything which makes people more likely to give us money. It's simple: you build a product, people are using it, people are paying for it. That is business value. And it's really important to remember that the moment your code generates business value is when it runs, not when we're writing it, not when we're testing it, not when people in QA are checking it. Until it reaches our customers and our customers are successful with our product, no business value gets generated. I hope it's clear at this point that platform engineering should care about the business as much as product management does. We need to be BFFs, business-focused friends. Hurrah. Okay, I know I keep going with these acronyms, but they're simple to understand, and the way to understand AARRR is to just map it to BPMs. Retention is simply the number of logins per hour. Easy. Activation: the number of sign-ups per hour, which is great because it tells you how many people who landed on your page actually signed up for your product. Revenue: how many subscriptions you've had per day. And referral: how many marketing page views happen, or if you have landing pages, how many people ended up on those landing pages. BPMs are just objective measures of how well your system is performing business-wise. As you can see, it's a one-to-one relationship, and this is great: it definitely allows us to build a lot of alignment between different parts of our business. And not only that, product development is shifting. At the beginning, on the left, product development was focused a lot on the experience: I'm a user and I want to be able to do this, and hopefully we'll get some benefit out of it. That is good.
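To make the login example above a bit more concrete, here is a minimal sketch in Python of what a login BPM could look like. Everything in it is an assumption for illustration: the `count_events` helper, the event names, and the 70% baseline are hypothetical stand-ins for whatever analytics or metrics store and thresholds you actually use, not the speaker's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

def count_events(event: str, start: datetime, end: datetime) -> int:
    # Stand-in data so the sketch runs; replace with queries against
    # your own analytics or metrics store.
    return {"login_page_view": 1000, "dashboard_view": 680}.get(event, 0)

@dataclass
class BPM:
    name: str
    baseline: float   # expected ratio under normal conditions (e.g. 0.70)
    tolerance: float  # how far the ratio may drift before we worry

    def evaluate(self, numerator: int, denominator: int) -> str:
        if denominator == 0:
            return "no-data"
        ratio = numerator / denominator
        if abs(ratio - self.baseline) > self.tolerance:
            return f"ALERT: {self.name} at {ratio:.0%} (expected ~{self.baseline:.0%})"
        return f"ok: {self.name} at {ratio:.0%}"

# Login BPM: of everyone who hit the login page, how many reached the dashboard?
login_bpm = BPM(name="login-to-dashboard", baseline=0.70, tolerance=0.10)

now = datetime.utcnow()
hour_ago = now - timedelta(hours=1)
print(login_bpm.evaluate(
    numerator=count_events("dashboard_view", hour_ago, now),
    denominator=count_events("login_page_view", hour_ago, now),
))
```

The same shape works for the other AARRR mappings mentioned above: swap the two event counts for sign-ups per hour, subscriptions per day, or marketing page views and keep the baseline-plus-tolerance check.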
We're definitely always going to be building features, but thinking only that way is limiting, because sometimes you need more than one feature, more than one experience, to achieve an outcome. So people are slowly shifting into a capability model, where we say: we want users to be able to do this on our product, and we want this capability to give us this business outcome. And this is great, a much better level of sophistication, because now we can say that this business outcome is actually measurable. We know that it's working because we have the signal; before, we were just guessing whether a feature was working or not, and now we can actually see what's going on and get objective data on it. Nowadays product development is going a step further, which is measuring things: we say this is the signal we need to measure, with this data, and then we experiment on it. We run A/B tests, we go through this whole process to make sure that what we built is actually being used the way we intended and that the outcomes we expect actually happen. Thinking this way is a great way to achieve the results we're looking for, and it's very similar to the way we think on the platform side. A business value proxy metric is not just a signal that a feature is creating the outcomes we want. It is that, but it's also doing more. Good business value proxy metrics show what is mission critical to the business: the things that, if they're not working, you should be worried about. This is a screenshot of a dashboard we call mission control. This very simple board allows us to see when things are deviating from our expectations. Look at that large orange spike: it's clear there's something going on with login. We can ask, did someone change something recently? Did we activate a feature flag recently? We can correlate this data with those changes, and just keeping an eye on these numbers will always allow us to go faster with confidence. We know that as long as this is hovering within our standard range, we don't have to worry about whether something is breaking. But whenever something does break, this is a great way to create actionable items. So now let's go back to our story. You're back at the 2:00 a.m. apocalypse, and here's the something else: instead of going back to sleep or jumping right in, start by checking mission control. If everything looks okay, you can mute the alert and wait until business hours to deal with it. Can you go back to sleep? Yes, you know that things on the system are looking okay. But if for whatever reason one of your BPMs is looking off, you can quickly dig in and try to correlate it with other alert signals. And once you do get a fix out, as long as there's a BPM, you can rest and go back to sleep, because the BPMs will detect if anything else is wrong. Also, I really encourage you during an incident to ask: is there a BPM for this or not? Because if not, this is a great opportunity to create one. And here are a couple more tips on how to use incidents as a call to action. Make sure that you're using some sort of incident management software to capture your learnings. It's a great way to spread the knowledge, and during a crisis it's a great way to collaborate.
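As one possible way to wire the 2:00 a.m. decision above into code, here is a small hypothetical sketch. The hard-coded BPM readings, the names, and the "mute until business hours" policy are all assumptions made up for illustration, not the speaker's actual mission-control tooling.

```python
# Hypothetical current readings for a few BPMs, expressed as
# (current_value, expected_baseline, tolerance). In a real setup these
# would come from your metrics store / mission-control dashboard.
BPM_READINGS = {
    "login-to-dashboard": (0.71, 0.70, 0.10),
    "signups-per-hour":   (0.95, 1.00, 0.25),  # normalized against baseline
    "subscriptions-day":  (1.02, 1.00, 0.30),
}

def bpms_out_of_range() -> list[str]:
    """Return the names of BPMs drifting outside their expected band."""
    return [
        name for name, (current, baseline, tol) in BPM_READINGS.items()
        if abs(current - baseline) > tol
    ]

def handle_page() -> str:
    """Decide what to do with a 2 a.m. page based on BPM health."""
    broken = bpms_out_of_range()
    if not broken:
        # Nothing business-critical looks affected: mute and revisit tomorrow.
        return "mute alert, follow up during business hours"
    # Customers are likely feeling this: investigate now and correlate the
    # deviating BPMs with recent deploys and feature flags.
    return f"page: investigate now, correlate {broken} with recent changes"

print(handle_page())
```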
People are able to jump in and out of an incident and gain context just by looking at those records. Then use the learnings to create BPM-backed alerts: every time you have a serious incident that feels bad, look during your retrospective into which BPMs you can use to make sure it won't happen again. BPMs are very similar to key performance indicators, so if your company has some sort of process like OKRs, V2MOMs, or any other framework, BPMs can guide you and create those conversations: what are the metrics we want to change and improve? BPMs are almost one-to-one with KPIs. Don't forget to share your retrospectives with product and other departments. Again, the most successful way to introduce BPMs is to build alignment with the rest of your company and talk about what's important for everyone. And remember, the same tools we use to fix things on the infrastructure side, the SRE side, the platform side, whatever you want to call it, can be used for business metrics. We can use the OODA loop to address business metric issues. So it's great to introduce these and share them with the rest of the company. And finally, when doing a retrospective, don't forget that finding the root cause is not enough. You should also implement preventive measures: BPMs, tests, dashboards. They're just as important as bug fixes. So please introduce them, and your whole company will be happier because of it. All right, that's the talk. Thank you so much. If you have any questions, you can reach out to me on Mastodon, or I'm at Algons everywhere else, Twitter, whatever people are using these days; you can probably find me there. Thank you so much.

Gonzalo Maldonado

Staff Engineer @ FireHydrant



