Conf42 Site Reliability Engineering 2023 - Online

Bell the “Chaotic Cat” with SRE


Abstract

In chaos engineering, we introduce intentional chaos to find blind spots where the products may fail in a production environment. We then use the obtained knowledge about those blind spots to make the products more resilient should actual chaos hit production.

But what about the overall product development lifecycle (PDLC) where unintentional chaos takes a DevOps team’s bandwidth away from making the product what it could be? Such unintentional chaos seeps into teams as silently as a cat walks into the house and creates chaos (yes, you may recall those ‘Tom & Jerry’ episodes).

That’s what I refer to as ‘chaotic cat’.

Examples of unintentional chaos: obsession about speed (time to market); conflicting priorities within a DevOps team; manual repetitive tasks (toil); lack of monitoring & observability leading to high mean-time-to-resolve (MTTR); vendor products becoming a bottleneck in end-to-end observability; company acquisitions causing dependent application hierarchy; reactive fire fighting, etc.

But how does it all relate to Site Reliability Engineering (SRE)? Isn't SRE just about SLIs, SLOs, and Error Budgets? Well, think again; or join this power-packed session where I explain how SRE helps you bell such chaotic cats.

You may argue that such chaos in PDLC is ‘business as usual’ and we really can’t eliminate that, and you’d be right. That’s why it is about ‘belling the cat’; not killing it! It is about becoming aware of what can create havoc if left unattended and then taking proactive actions.

We are not going to talk much about SRE concepts. We are going to focus on HOW to implement specific SRE practices that help teams grow out of routine chaos in the PDLC, and, as a result, enable focus on improving the UX of the products on the business-critical factors - availability, performance, and overall reliability.

In a way, we are going to talk about shifting SRE to the left! We will look at the top 3 themes that induce unintentional chaos - speed, toil, and lack of monitoring.

Summary

  • How to use SRE to address unintentional chaos in your development lifecycles. Without a defined target number of deployments, teams are constantly chasing an undefined target. Is this really healthy? Is it really an efficient way of looking at continuous improvement?
  • The next theme of unintentional chaos is dependencies: dependencies within the organization across multiple teams, or dependencies on vendor products or vendor APIs. Teams spend a lot of time going back and forth to ensure that the right team is engaged.
  • We can measure almost everything. But the question is, should we? And if not, how do we know what not to measure? Dashboards and monitoring systems can get really complicated. Where do you draw the line? Which data do you capture and which do you discard?
  • One of the reasons is the differentiation between product and service. A service is really a running instance of a product. SRE doesn't have to be an operations thing. How do we integrate SRE practices right into the design and development lifecycle?
  • One of the core tenets of SRE is service level objectives (SLOs). An SLO provides a common objective, a common goal, to the development as well as the operations team. It bridges the gap and brings a common language between the product and service paradigms.
  • The time teams spend working out dependencies and waiting on other teams falls into a bigger bucket: toil. Toil cannot be eliminated, only reduced. As teams focus more and more on automation and engineering, toil tends to reduce.
  • In terms of resolving dependencies with data, you can create probes to the downstream dependencies and get data from the downstream applications. This is one of the approaches we can consider to resolve the dependency problem. On measuring and monitoring everything, we can fine-tune the monitoring and measuring strategies to be more focused on SLOs.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome, everyone. I'm going to talk about how to use SRE to address unintentional chaos in your development lifecycles. I refer to this unintentional chaos as the chaotic cat. So let's dive in.

Let's start with chaos engineering. As we all understand, chaos engineering is the practice of introducing intentional chaos into production environments to identify areas and opportunities to improve resiliency. While authoring my book on chaos engineering, I wondered about unintentional chaos: the issues that seep into a team's work day in and day out, the scenarios and workflows that cause issues and take the team's bandwidth away. What am I talking about? I'm basically talking about three things, or three themes in general. Let's go into those themes one by one.

The first one is really about speed: speed to market, or time to market, or the number of deployments per day, per week, per month. What is it that we're trying to achieve with that? What number are we trying to achieve? How many deployments are good enough? Where does the buck stop? Without a defined target number of deployments, teams are constantly chasing an undefined target. Really, it's as if no matter how many deployments you are doing per day or per week, it is not good enough, because we have not defined what good enough is. In the absence of that, teams are under constant pressure to elevate the game to the next level. If they have been deploying, let's say, ten times a day, maybe they want to take that number to the next level; maybe they want to achieve 15 deployments a day, or 20 deployments a day. And things will not stop there. They'll just continue to, as we call it, continuously improve, right? They'll continuously try to increase the number of deployments per day. But is it really healthy? Is it really an efficient way of looking at continuous improvement? Maybe not.

The next theme of unintentional chaos is dependencies. When I say dependencies, I refer to dependencies within the organization across multiple teams, or dependencies on vendor products or vendor APIs. Whether it's about launching a new feature, rolling out a new feature, or fixing a production incident, teams spend or invest a lot of time trying to identify the right team to get approvals from, or the right team to engage for the rollout, or trying to fix a production incident where the issue really seems to come from the vendor product, not from their own service. It's a lot of back and forth that teams go through to ensure that the right team is engaged and everyone is on the same page to fix the incident or to roll out the new feature. It really does take a lot of time, and it also creates unintentional chaos. It's like teams are trying to roll out a feature, but they have not received the approval, or, let's say, the upgraded API version from the vendor, and the feature is stuck even though their work on the feature is done. Because of the dependencies on the vendor product, the feature cannot be rolled out to the production environments. Issues like that are what I'm referring to.

And the next one is measuring everything. Given the technical capabilities that we have, with a lot of tools and platforms available to us, we can measure almost everything. But the question is, should we? And if not, how do we know what not to measure?
Dashboards and monitoring systems can get really complicated, really complex. But the question then is: all the panels and all the dashboards that you have put together, are they really helping, or are they just there because they need to be there? How do you know which panels are really helping, which dashboards are really helping, which data is really helping, and which is not? How do you know? Where do you draw the line? Which data do you capture and which data do you discard? That's essentially a big question. And situations where teams are capturing almost everything they can, just because their platforms and tools allow them to, can actually be counterproductive; they can create more chaos than they solve.

So let's talk about how we get into these kinds of situations. What are the precursors, the triggers, that land teams in these kinds of situations, and what can they do about it? One of the reasons, in my experience, is the differentiation between product and service, or the lack of it. We don't generally talk about product reliability or product reliability engineering; we talk about site reliability or service reliability. That is the difference. Going by the as-a-service paradigm, a service is really a running instance of a product. Traditionally, we used to buy computers. Now we use computers on the cloud, and we use them as a service, meaning that we don't really own those computers, we don't really own the infrastructure on the cloud. We just use it as long as we want it, we pay for as long as we use it, and when we don't need it, we stop using it and we stop paying for it. From that perspective, a service really becomes a running instance of a product. Teams at the cloud provider are just building that product, but when we use it as customers, we use it as a service. What that means is that when a product is actually running in the production environment, that's where the business-critical factors like reliability, availability, and performance come into play. Now, how do we focus on those aspects while we're building the product during the development lifecycle? Let's talk about that.

So where does SRE fit and how does it help? SRE doesn't really have to be an operations thing. The idea really is: how do we align the core fundamentals of SRE with the product development lifecycle? How do we integrate SRE practices right into the design and development lifecycle so that SRE doesn't have to be an afterthought? Talking of development lifecycles, let's talk about DevOps for a minute. One of the core tenets of SRE is service level objectives, SLOs. What this image is showing is where the SLO fits in the DevOps pipeline. If you look at the top, we have the development team working on enhancements, bug fixes, and new features, pushing all the updates through a continuous delivery pipeline to the production environment. On the other side, we have the operations team monitoring the production environment and the IT systems, and the platform teams working with the cloud infrastructure and whatnot. Now, the service level objective fits right in the middle of the pipeline. It provides a common objective, or a common goal, to the development as well as the operations team. So let's understand what the SLO is actually doing here, how it helps, and what it is all about in the first place.
So the SLO is really a balancing lever, or a common language, that connects the product and the service paradigms. I showed the product and the service paradigms a few slides ago. The product paradigm is all about innovation, all about speed: how quickly product teams can launch new features and how innovative they can be. And the service paradigm is all about stability: how reliable the service is, how available the service is. The SLO provides a common objective, a common language, for the product and the service paradigms together. The product team can be as innovative as they want to be, as long as they meet the service level objective. And from the service perspective, the service needs to be as reliable as required by the SLO. So it basically bridges the gap and brings a common language between the product and the service paradigms.

In other terms, without going into too much detail on SLIs and SLOs: an SLO translates user expectations, or customer journeys, or critical business transactions, into something that can be implemented technically in the monitoring systems. Going by the example on the slide, let's say a product team is building a login module, or the login module already exists and they are trying to add, let's say, multifactor authentication to the login. If they're trying to roll out that feature, the end user expectation really is that the login should complete successfully. And while rolling out that feature, the multifactor authentication feature, the product team needs to ensure that at least 99% of the login requests are successful. So they have some margin of failure; in other terms, we call that an error budget. The product team knows that when rolling out the feature, they need to ensure that the login requests continue to work: 99% of the login requests should be successful even with the new feature rollout. And from the service perspective, the teams have a target: they need to ensure that 99% of the login requests are successful. So they have a reliability target; they know how available and how reliable the service needs to be. It really brings a common language for the product as well as the service teams to meet on. The process to define and implement an SLO is really a journey: to translate the critical business transactions, or to identify the risks to the critical business transactions, and turn that into objective numbers that can be implemented in the monitoring systems. I will not be deep diving into the process itself; that's not in the scope of this talk.
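To make the error budget idea concrete, here is a minimal sketch (not from the talk; the numbers and function names are illustrative) of how a 99% "successful logins" SLO translates into an error budget over a rolling window, and how a team might check whether enough budget remains to keep shipping changes.

```python
# Minimal, illustrative sketch: turning a 99% "successful logins" SLO
# into an error budget and a simple "is it safe to keep deploying?" check.
# Numbers and names are hypothetical, not taken from the talk.

def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests that are allowed to fail in the window without breaching the SLO."""
    return (1.0 - slo_target) * total_requests

def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = breached)."""
    budget = error_budget(slo_target, total_requests)
    return 1.0 - (failed_requests / budget) if budget else 0.0

if __name__ == "__main__":
    SLO_TARGET = 0.99                    # 99% of login requests must succeed
    total, failed = 1_200_000, 7_800     # e.g. counts over a 30-day rolling window

    budget = error_budget(SLO_TARGET, total)            # 12,000 allowed failures
    remaining = budget_remaining(SLO_TARGET, total, failed)

    print(f"Allowed failures in window: {budget:,.0f}")
    print(f"Error budget remaining:     {remaining:.1%}")

    # A simple policy: keep shipping while a comfortable share of the budget is left.
    if remaining > 0.25:
        print("Enough budget left -> feature rollouts can continue.")
    else:
        print("Budget nearly spent -> slow down and prioritize reliability work.")
```

In practice the request counts would come from the monitoring system (the SLI), but the arithmetic is the whole point: the SLO gives both the product side and the service side a single number to negotiate against.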
So let's continue with the next theme of unintentional chaos. The time that teams spend working out dependencies and waiting on other teams basically falls into a bigger bucket. In SRE terms, we call that bucket toil. Let's talk about toil for a minute. Toil is work, but a kind of work that has certain characteristics: it tends to be manual, repetitive, and automatable; it's tactical; it doesn't carry any long-term enduring value; and it scales linearly with service growth. Any work that teams are doing that has most of these characteristics is what we call toil in SRE terms. And waiting on other teams, waiting on vendor updates, the inherent dependencies: in my experience, that kind of work falls under toil, because in most cases it can be automated and it doesn't carry any long-term enduring value. Some examples, like I mentioned on the slide: setting up environments to reproduce a production issue, upgrading API versions manually, or plain work about work, such as measuring the latency of an application, how fast users can log in, and creating quick one-pagers capturing some critical metrics. All that kind of work, in my experience, falls under toil.

So let's talk about how we can address it and release the bandwidth that teams are spending on this work. First things first: toil cannot be eliminated, it can only be reduced. And how do we reduce it? It's basically a cultural change over a period of time, as teams start focusing more and more on automation and engineering. Not just automation; more on engineering. Engineering involves all parts of the SDLC, the development lifecycle; automation may not necessarily involve all those steps, but engineering definitely does. So over a period of time, as we focus more and more on engineering efforts, toil tends to reduce. In the next few slides I'll show you an example where we can apply an engineering mindset to address the dependency issues I talked about at the beginning of this talk.

On measuring everything and monitoring: in my experience, measuring everything is as bad as measuring nothing. So you really need to be strategic about what it is that you want to measure. The characteristic of a good monitoring system is that it is strategic and focused on the things that really matter from the user experience perspective, or from the SLO perspective. Fine-tuning the monitoring and measurement strategies to be aligned a bit more with the SLOs is definitely a good step. The other thing is that we look at application monitoring and infrastructure monitoring, but I think there is an often ignored or missing aspect, which is dependency monitoring: understanding how an application or a service connects to other applications and services in the workflow, and identifying the downstream and upstream dependencies. That is definitely a good strategy to have in the monitoring space, to be able to use monitoring during production incidents and to ensure that the time taken to resolve them is minimized.

So let's connect all the dots. I talked about a lot of things, so let's try to connect those dots together and see if things really make sense. Going over the themes again: on speed, with the SLOs defined, we can now define what a good enough number of deployments per day is. As long as the deployments are not impacting the SLO, teams can continue to deploy. They now have a defined target; they know where the buck stops. With the SLOs we get a balancing lever, like I talked about, and with that we can put some numbers on how fast we want to be in terms of the number of deployments and time to market.

In terms of resolving dependencies with data, this is one example I often quote. If you were developing a gaming application and you wanted to integrate a Discord channel into that application, and if you were to define the SLO for the user journey of users being able to connect with their friends on the Discord channel, the service would have a dependency on a vendor: on the vendor's availability, their APIs, and their performance. You can look at Discord's status data; it's available publicly on the Discord status page (this is really just an example). You can create probes to the downstream dependencies, you can get data from the downstream applications, and when defining the dependencies, the availability, and the SLOs for your own services, you can actually go with that data. You don't have to depend on or wait on the downstream applications; you can create some sort of probe, collect some trending data from the downstream applications, and make your decisions accordingly. Again, this is only one of the ways, and it has its own pros and cons and certain aspects associated with it, but you get the idea. The idea really is that instead of just waiting without any data and depending on manual collaboration, we can go with data if it's available. This is one of the approaches we can consider to resolve the dependency problem.
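As a rough illustration of the probe idea, the sketch below polls a downstream dependency's public status endpoint and keeps a small trend of the results. The URL and the JSON shape are assumptions modeled on a typical public status page API (Discord's status page is used only because the talk mentions it); in practice you would point this at whatever health or status API your vendor actually exposes and feed the trend into your own SLO and monitoring decisions.

```python
# Hypothetical sketch of a downstream-dependency probe.
# The status URL and response fields are assumptions; adapt them to the
# vendor API you actually depend on.
import json
import time
import urllib.request
from collections import deque

STATUS_URL = "https://discordstatus.com/api/v2/status.json"  # assumed endpoint
PROBE_INTERVAL_SECONDS = 60
HISTORY_SIZE = 60 * 24  # keep roughly a day of one-minute samples

history = deque(maxlen=HISTORY_SIZE)  # (timestamp, is_healthy) samples

def probe_once(url: str = STATUS_URL) -> bool:
    """Return True if the downstream dependency currently reports itself healthy."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            payload = json.load(resp)
        # Assumed shape: {"status": {"indicator": "none" | "minor" | "major" | ...}}
        return payload.get("status", {}).get("indicator") == "none"
    except Exception:
        # Treat network errors or unexpected responses as "unhealthy" for trending.
        return False

def observed_availability() -> float:
    """Fraction of recent probes where the dependency looked healthy."""
    if not history:
        return 0.0
    return sum(1 for _, healthy in history if healthy) / len(history)

if __name__ == "__main__":
    # Collect a few samples; a real probe would run continuously and export
    # this trend to the monitoring system instead of printing it.
    for _ in range(3):
        history.append((time.time(), probe_once()))
        time.sleep(PROBE_INTERVAL_SECONDS)
    print(f"Observed dependency availability: {observed_availability():.2%}")
```

Trend data like this lets you set your own service's SLO with the vendor's observed availability factored in, instead of waiting on the vendor team every time an incident needs triaging.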
Finally, on measuring and monitoring everything: like I discussed, we can fine-tune the monitoring and measuring strategies to be more focused on the SLOs. In fact, even for new features, if you can define the SLOs upfront, decide how the reliability of the new feature will be measured, and bake that into the development, into the coding, while the feature is being built, it can really help us achieve a very high signal-to-noise ratio. Monitoring then becomes a very powerful and useful tool for ensuring that production incidents take the minimum time to resolve, and metrics like MTTR or MTBF really improve over a period of time with these kinds of strategies.

So that's that. By shifting SRE practices to the left, we can minimize the unintentional chaos in the product development lifecycle and ensure that SRE practices are baked into the design and into the development, so that SRE doesn't have to be an afterthought. And by doing that, we can ensure that the products we build are reliable by design when we deploy them into production. Thank you so much for listening to me, and feel free to connect; we can have follow-up discussions if you need to. Thank you so much.

Mandeep Ubhi

Founder & CTO @ DevSRE.org



