Conf42 Site Reliability Engineering 2022 - Online

SRE Anti-patterns

Abstract

Based on my experience, I see that many organizations have completely missed the mark.

Some are doing the same work they always did, and others are doing SRE activities in bits and pieces.

In this session, Niladri will talk about a few such anti-patterns and what needs to change, why, and how.

Summary

  • SRE delivers a benefit only if there is scaling in the organization. For that, SREs need to learn from failures, and a lot of psychological safety is required to implement SRE. How can we scale in a very fast-growing scenario?
  • The third one is measuring only until my edge. We should never target 100% because it is not possible. SREs are responsible for a better customer experience. Today, customer experience is not just delivering some components; it is the whole journey that customers are looking at.
  • Traditional infrastructure is not suitable in today's world because there are so many moving parts. Next is the configuration management trap. We need everything to be automated; we are moving into an immutable scenario, with automation and containers.
  • The next important aspect is incident response. We have to move from a project to a product approach: one single cross-functional, self-sufficient, self-organized team where all the capabilities are there. A lot of it has to be automated. How do we learn? Through blameless postmortems.
  • SRE is not only about automation. Value creation starts only when users are actually using the product or service that you have created. DevOps India Summit 2022 is coming up on 26th August 2022. Join us for more speakers on various topics.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to the Conf42 Site Reliability Engineering 2022 event. Today I will be talking about SRE anti-patterns. We will talk about ten different anti-patterns that I see every day while interacting with people who are practicing SRE. The first one we are going to talk about is renaming operations to SRE and continuing to do the same work that you did before. Please understand that SREs exist for a specific objective. Google created SRE because, at that point in time, there was a need for tremendous scaling and it was not possible to keep using the traditional methods of running operations. Hence what Benjamin Treynor Sloss had to do was bring in the SREs. SREs are not there to do the regular work; what operations does, operations will continue to do. The main focus of SRE, first of all, is reliability, and not only reliability as of today, but reliability when the system scales. SRE delivers a benefit only if there is scaling in the organization, so they look at things from a future point of view, and for that they need to learn from failures. SRE learns from failures, and there is a lot of psychological safety required to implement SRE. People should be able to break things and learn from it, because what Google believes is that if something breaks, it is the system's problem. For example, if I am not in a good mood and I come to the office and do something which breaks the system, it is not my fault; it is the system which needs to be changed, because if I could break the system in a particular manner, everybody else will be able to break it in that way. The learning has to happen so that nobody is able to break the system in the same manner again. So we have to learn from failure. And again, as I said, scalability is the main objective. How can we scale in a very fast-growing scenario? It can be scale in terms of the number of new users joining, like what happened to Zoom after the lockdown started and so many different people started using it. It can be in terms of instability in the system, with more and more incidents happening. It can be in terms of newer technology that you are moving into, or new features that are coming out very frequently. Whichever way the scaling happens, if there is scaling, there is a need for site reliability engineers, and they have to focus on that, not on running the day-to-day operations. So do not expect SREs to do your regular operations work in terms of monitoring or doing some automation for your routine tasks. They will do some of it, but, for example, they will automate to reduce toil, to free up time for people so that they can focus on making things better, on increasing the reliability of the system. And Google, in their latest book published in January 2020, Building Secure and Reliable Systems, believes that a system is not reliable if it is not secure. So SREs have to think from all of these perspectives. Don't just rename your operations team and continue to do what you used to do; that's not what SRE is. And that's why the skill set required of an SRE is also not the same. The second one that we see: users should not notice an issue before you do. A lot of times we see that problems come up in production and users get to know about them before we do. So the first thing that we need to do is identify proper, appropriate SLOs and then define alerts which are actionable.
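(Editor's illustration, not part of the talk.) One way to make "appropriate SLOs and actionable alerts" concrete is to page on error-budget burn rate rather than on raw metrics. The Python sketch below is a minimal example under assumed numbers: the 99.9% target is arbitrary, and the 14.4x multiwindow threshold follows the fast-burn guidance popularized in Google's SRE Workbook.

# Minimal sketch of an SLO-based, actionable alert check. Assumes you can
# query "good" and "total" request counts for a service over a time window;
# the target and thresholds here are illustrative only.

SLO_TARGET = 0.999          # 99.9% of requests should succeed

def error_budget_burn(good: int, total: int) -> float:
    """How fast the error budget is being consumed in this window.
    1.0 means errors arrive exactly at the budgeted rate."""
    if total == 0:
        return 0.0
    error_rate = 1 - good / total
    budget = 1 - SLO_TARGET
    return error_rate / budget

def should_page(good_1h: int, total_1h: int, good_5m: int, total_5m: int) -> bool:
    # Page only when both a long and a short window burn fast (multiwindow
    # check), so a brief blip does not wake anyone up at 3 a.m.
    return (error_budget_burn(good_1h, total_1h) > 14.4 and
            error_budget_burn(good_5m, total_5m) > 14.4)

With this shape, a short spike that does not threaten the budget never pages anyone, while a sustained burn does, which is exactly the "actionable" property being asked for.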
We lose ourselves in the noise of alerts; that happens. Not everything needs to be alerted. Everything needs to be logged, everything needs to be traced, but not alerted. We have all boarded an aircraft and, while standing there before take-off when the cockpit door is open, seen that there are hundreds of different meters inside. Do you think the pilot and the copilot look at each and every meter on their dashboard? No, they look at only a handful, four or five different meters they need to watch to ensure that everything is going fine. You can think of the DORA metrics, you can think of Google's golden signals, but we still need to capture everything that is happening. So what do we do? We look at these signals, and if anything is not working as expected, then we go to the other data relevant to it to diagnose the problem. So alerts have to be on the service, not on the metrics. Obviously, we need to increase observability; we have to be able to know each and every thing. And by observability here we mean MELT: metrics, events, logs and traces. We should be able to know everything and we should be able to detect faster, because for us the objective is MTTR: mean time to recover should be shorter, and for mean time to recover to be shorter, we need mean time to detect to be shorter. If you need mean time to detect to be shorter, you need to have a better understanding of the system, the end-to-end domain knowledge. You may not be a subject matter expert, but that knowledge is required. You need a better understanding of your IT system and of the entire journey of the different customer personas. It is not a one-size-fits-all approach. We need to know what each persona is doing and what is normal for each of them, and we need to bring that observability into it. And finally, we need better fault tolerance to achieve the SLOs that we have promised, and that is from the point of view of the customer experience. So we need to increase our monitoring, our telemetry, our application performance management, all of these, and move towards observability so that we can get every aspect when we need that information, but not drown in the noise of alerts. The third one is measuring only until my edge. This is a common traditional scenario where we are bothered about a 99.99% availability of our servers. We are looking at latency and so on, but all from within the four walls of our organization, the things that we control. That is not okay. The customer is happy only when their experience is better. I can have 99.99% availability on my servers, but the traffic is going through some last-mile network, the customer is using mobile data which gives 95% availability, and that is not giving the customer a 99.99% experience. That's why having the appropriate level of SLO is very important. We should never target 100%, because it is not possible, unless we are a manufacturer of pacemakers or something like aircraft that simply must work; most probably everything else doesn't need to be 100%. So we need to understand what is normal, what the customer is really looking at, what they really need, and SREs work together with the product owners and with the users to define the SLOs based on facts and figures about what is happening today. So again, observability is important to even have this discussion. SREs are responsible for a better customer experience.
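(Editor's illustration, not part of the talk.) The 99.99% versus 95% point above is just multiplication of serial dependencies; a two-line check makes it explicit:

# Back-of-the-envelope check: the availability the customer experiences is
# roughly the product of every serial hop they depend on.
server = 0.9999        # what we measure inside our four walls
last_mile = 0.95       # the mobile network the customer is actually on

end_to_end = server * last_mile
print(f"End-to-end availability ~ {end_to_end:.4f}")   # ~0.9499, i.e. about 95%
# The customer sees roughly 95%, no matter how many nines the server reports,
# which is why SLOs have to be defined from the customer journey outwards.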
And today the customer experience is not just delivering some components or IT elements. It is a journey that they are looking at; we are looking at what I call servitized products. Today we don't buy songs on a CD, we buy a musical experience, and that too on a pay-as-you-go model from a Spotify.com. The entire journey is important for the customer experience, not a specific song. We look at how easy it is to search for the different types of songs, the different genres we want to listen to. We look at how easy it is to use, whether I am in the house, in the office, in the car or at a camping site: am I able to do it? Can I use any type of device to listen? What kind of recommendations is the system giving based on my listening habits? Can I pay as I go, paying only when I'm listening and not all the time? All of these together make up the customer experience. If we look at Uber, Uber is not delivering a cab service; they are giving us a travel experience. And that's why the customer is looking at the entire journey, starting from the time they book the cab until the time they pay the money and get out of the cab. That entire journey has to be good. That is the kind of thought process that SREs have to start with, and then find out ways to make the journey better. That is the job of SRE, again, not running your day-to-day operations work. And obviously we have to look at end-user performance, not from within our edge, not just from what is within our control, but how it is happening at the customer's side. There are tools like Catchpoint which allow you to do end-user performance management. We need to look at web analytics, how fast or how slow the page is opening, keeping in mind how the customer is getting there. There was a situation when I was talking to the founder of Catchpoint, Mehdi Daoudi, and he mentioned a case where they had seen things slowing down for a client in California running on AWS. The client raised a complaint with AWS; AWS sent them the server logs and said everything is fine. The customer then sent the Catchpoint report, and it took five days for AWS to find out where the problem was. The traffic from California was supposed to go to a North American server, but it was not; some change had been made earlier because of which it was going to some server in Asia and back. Now, if you look at the North American server and give the statistics and details of that, it will not show anything wrong. But is your customer happy? Is your customer getting the promised service? No. So we have to move beyond our edge and go all the way to the customer. That is what SRE is going to look at. The next important point is that false positives are worse than no alerts. Today, with traditional monitoring, we find a lot of false positives which ultimately affect us in production, and customers end up finding those problems. So we cannot look at individual host alerts and think that everything is fine. For that matter, looking at HTTP requests from the server's point of view is one thing, but what happens to those HTTP requests which are failing at the load balancer level?
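(Editor's illustration, not part of the talk.) Measuring the request SLI at the load balancer instead of per host is one way to catch the failures just described. A minimal sketch, with a purely hypothetical log format:

# Compute a request-success SLI from load-balancer logs rather than per-host
# metrics, so requests rejected or dropped before they ever reach a backend
# are still counted against the service.

def request_success_sli(lb_log_entries: list[dict]) -> float:
    """Fraction of customer requests that got a successful response,
    measured at the edge of the service, not on individual hosts."""
    total = len(lb_log_entries)
    if total == 0:
        return 1.0
    good = sum(1 for e in lb_log_entries
               if e.get("status") is not None and e["status"] < 500)
    return good / total

# A host-level view would miss entries like {"status": 503, "backend": None},
# where the load balancer itself turned the request away.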
So we have to look at everything starting from the customer experience, and then go backwards to see what needs to be done to achieve what the customer has been promised in terms of SLOs. Alerts have to be very specific with respect to the services and not with respect to individual components. Yes, we will have to track different things on the component side, but the ultimate objective has to be the service that the customer is getting. Response fatigue and information overload from time-series data are not good. As I said, too many alerts are not good. If people have to keep moving from one thing to another because the pager is ringing very often, it is not a good thing; it actually reduces productivity. We cannot multitask; neuroscience has proven that our brains are not wired for multitasking. We just switch between things and believe we are multitasking, but that is not the case. So we need to look at only actionable alerts, not look at everything and get disturbed; every time we move from one thing to another, the brain takes time to shift. Alerts should carry great diagnostic information; that is again very important. Just saying that something is wrong is not good enough. So when we are setting up alerts, we have to make sure that at that point in time we collect as much health-related information about the system as possible and pass it on in the alert, so that whoever looks at the alert is able to diagnose faster. That means lower MTTD, which leads to lower MTTR. Because if MTTR is high, that means mean time to recovery is longer, which means your outage is longer, which means you are going to eat into your error budget. You will have less time to do better things like releasing new features, applying security patches or doing chaos engineering; you will not have time if you are already having outages from something else. The next one is the configuration management trap. Traditional infrastructure is not suitable in today's world, because there are so many moving parts, so many different things, so much distributed architecture, that the traditional way of doing infrastructure is not going to help. I remember, a long time back when I started working, the desktops at that time had motherboards with only two slots for memory. You could put in two memory modules, measured in MB, not GB: two 4 MB, two 8 MB, or two 16 MB modules, and that was the maximum you could do. I had one 4 MB module in a slot and wanted to make it 8 MB. I submitted the request and got all the approvals. The engineer comes to me, opens my desktop, sees there is one empty slot and one with a 4 MB RAM module. He then opens the desktop beside mine, because nobody was sitting there; the person who sat there had gone to a client location for a month or so. He opens it, takes out its 4 MB memory, puts that RAM chip into my machine, my machine becomes 8 MB, closes both machines and goes away. After one month, when the person whose machine was opened comes back, his desktop is not booting, and by that time we have all forgotten what happened. Imagine that happening in today's world with millions of different components; it is impossible to track manually. So we have to make sure that we get into infrastructure as code and configuration management as code.
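(Editor's illustration, not part of the talk.) The memory-chip story is exactly the kind of undocumented manual change that configuration as code is meant to surface. A minimal sketch of drift detection against a declared desired state, with illustrative names:

# The desired state lives in version control; anything observed in production
# that differs from it is flagged as drift instead of being quietly tolerated.
DESIRED = {
    "web-1": {"memory_mb": 8192, "image": "web:1.14.2"},
    "web-2": {"memory_mb": 8192, "image": "web:1.14.2"},
}

def find_drift(observed: dict) -> list[str]:
    issues = []
    for host, want in DESIRED.items():
        have = observed.get(host)
        if have is None:
            issues.append(f"{host}: missing from inventory")
            continue
        for key, value in want.items():
            if have.get(key) != value:
                issues.append(f"{host}: {key} is {have.get(key)!r}, expected {value!r}")
    return issues

# In an immutable setup the remediation is not to patch the drifted host by
# hand but to replace it with a fresh instance built from the declared state.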
We have to have very strong configuration management, not only for stability but also, as I said earlier, because reliability means security, so for security purposes too. We need everything to be automated. We need to move into that immutable infrastructure scenario, the pets versus cattle versus poultry analogy. Pets means the way we used to have servers, huge servers; we have seen servers with names like John and Thomas and Paul and so on. We were so emotionally attached to those servers that we wanted to keep nursing them as much as possible so that they kept running. But that is not cost-friendly, and it does not give us the kind of result we are looking for to satisfy the current, ever-changing needs of the customer. So we moved to cattle: less emotional attachment, more numbers, more work can be done, and if one of the cattle is sick, we just put it down and replace it. That is the VM. But in today's evolving world, that is also not good enough, so what we are moving towards is poultry: like chickens, huge numbers can be put in one place, a lot of work can be done, and it is less expensive. That is your containers. So we are moving into an immutable scenario, with automation and containers. Immutable means that you cannot change. In today's world we do not change; what we do is replace. We kill the old one and replace it with a new one. That way it is also much more secure, and it is much easier to detect and rectify problems. So SREs don't spend much time on changing things; rather, they automate in such a manner as to homogenize the ecosystem and make it much easier for people to take care of it, with a lot of it automated, moving towards a self-healing kind of situation. That is the work that needs to be done. The next important aspect is incident response. Yes, SREs also have to be part of incident response, for a couple of reasons. First, if they are coming in as the expert, the advisor, the consultant to the entire delivery lifecycle, bringing the wisdom of production from the right to the left, they need to know what is happening on the ground. They cannot be advisors without knowing what is happening, so they have to be hands-on. But here the incident response is different. First of all, SREs do not follow a tiered support model. If you want to implement SRE, you have to move away from level one, level two, level three, level four support. It has to be one single team responsible for the entire system, end to end. We have to move from a project to a product approach: one single cross-functional, self-sufficient, self-organized team where all the capabilities are there. So we are looking at a comb-shaped skill set for the whole team. And here, if a problem happens, if an incident happens, you need to swarm: everybody who is relevant comes together and solves the problem, because everybody looks at it from the same point of view and solves it together. It is not about handing it over from one unit to the next to the next. No tiered support; we have to get into swarming. Now, when incidents are bigger, we need a proper framework for incident command, and Google has defined an incident command framework with an incident commander and various other roles which you need to look at, and that facilitates the smooth flow of work.
A lot of it has to be automated. SREs look for opportunities to automate whatever can be automated, because on-call is not something we look forward to anyway. You can use chatbots, you can use IVRs, you can use many automated systems, and you can create a lot of runbooks to take care of it. What is also very important is the learning from it. We talked about learning from failure: if an incident has happened, it is an opportunity to learn. How do we learn? Through blameless postmortems. SREs are the ones who are going to facilitate and conduct the blameless postmortem with the people who were actually involved at the time, because they have the information about what happened at that point: what was the sequence of activities, what were their expectations, what were their assumptions, what were the things they did? That is the learning. This blameless postmortem has to be documented and circulated to everybody in the organization, not just within the team. We have seen cases where even organizations like banks have not only shared postmortems internally but also shared them with the outside world on social media, because, number one, there is still somebody who will benefit from it. Number two, in a specific case you can see described by Monzo, one of the contributing factors to that incident was an open source product in which a certain aspect had changed in a new version, and that created the problem. The people on social media included the people who create and maintain those open source products. They came back and said: great, we have this information, sorry to hear about the incident, we will take care of it in the next change. So everybody gains from it, and SREs are involved in all of this. As SREs, we have to start thinking beyond point fixing. Point fixing means we look at the problem only from an immediate point of view and solve it, but SREs don't stop there; SREs look at it in a much bigger context, over a much longer term. So minimize outages with automated alerts, solid paging mechanisms, quick workarounds, faster rollback, failover and fixing forward. When you are creating anything, you have to think through all of these: how you can do it, how you can automate it. If there is a new release, that release should not go out unless the rollback script is also ready, and all of this has to be tested. Analyze and eliminate whole classes of design errors. As SREs, we have to analyze what is happening and, based on that, automate things: short-term fixes followed by preventive, long-term fixes, leading to predictive methods. As I said, SREs are not only looking at what is happening today; SREs are looking at the situation in the future. They are looking to be ready for the unknown unknowns, not only the known unknowns. So they have to look at observability, look at everything that is happening, and also analyze and do the what-if analysis for the scaling that might happen in the future, and be ready for that. For example, something may work in your current situation, let's say despite a small latency delay, but that same small latency delay, when it is happening with millions of users, can crash the system.
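(Editor's illustration, not part of the talk.) The closing point, that a small latency delay which is harmless today can crash the system at scale, can be sanity-checked with Little's law (in-flight requests are roughly arrival rate times time in system); the numbers below are made up:

# Rough illustration of why a "small" latency delay matters at scale.
rate_per_sec = 20_000          # requests per second at peak (illustrative)
latency_s   = 0.100            # normal time per request
extra_delay = 0.050            # the "small" 50 ms delay that seemed harmless

normal_in_flight  = rate_per_sec * latency_s                 # about 2,000
delayed_in_flight = rate_per_sec * (latency_s + extra_delay) # about 3,000
print(f"{normal_in_flight:.0f} vs {delayed_in_flight:.0f} requests in flight")
# 50% more requests held in memory and connection pools at the same traffic;
# if pools and worker counts were sized for ~2,000, the system starts queueing
# and can collapse, even though nothing "failed" at low volume.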
In your environment there can be an automated system where you are transferring some files from one point to another and then working on those files, processing them. But if there is a delay when you are doing it over a distributed system, the whole process gets stuck. Those are the kinds of things that SREs are going to look at. So aim for auto-remediation and closed-loop remediation without human intervention. Whatever is repetitive and can be automated should be automated; what machines cannot do, like refactoring to reduce technical debt, rearchitecting, or bringing in new features, those are the things which human beings have to do. People will focus on that; the rest of it will be automated. That is what eliminating toil is about. The next anti-pattern is the production readiness gatekeeper. SREs are not the gatekeepers in DevOps and DevSecOps; we want things to move faster, and SRE complements DevOps, so SRE cannot be a gatekeeper that stops faster releases. Any process that increases the length of time between the creation of a change and its production release without adding definitive value is a gatekeeper that functions as a choke point or a speed bump. We have to make sure that the whole activity helps improve the flow rather than stopping the flow. So if you are looking at release and deployment, as an SRE you are going to put that release and deployment automation not only into production, but also into each and every other stage, every other environment, so that the release automation is tested throughout, and it helps the users and the developers. SREs will enable and enhance velocity: they will use the error budgets, build platforms, and provide dev teams with development frameworks and templatized configurations to speed up reviews. They will create the infrastructure as code and give it to the developers and testers, so that with the push of a button they can create those environments and test on them, using the same release and deployment automation that deploys to every environment. So SREs shift left to build in resilience by design in the development lifecycle. They are involved in bringing the wisdom of production to the entire delivery lifecycle, starting from ideation. They will help and guide the developers and the testers so that when they complete the work in a given stage, it is something which is deployable to production and is not going to affect production in a negative manner. As I said, everything is a systems problem, not a human error. So SREs will strive not to have the cause of an outage repeated; if it repeats, that means there was no learning. The desire to prevent such recurrent failures is a very powerful incentive to identify causes. SREs are responsible for reliability, which means the same failure should not happen a second time; that is the learning they are talking about. One of the challenges we see with root cause analysis is that the root cause is often just the place where we decide we know enough to stop analyzing and trying to learn more. SREs don't stop there. SREs try to see: okay, fine, now we have understood the problem in the current scenario, but what about the future? What about a different scenario? So we have to see what will happen in the future, even beyond our root cause, and continue to find ways to make the system more reliable.
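(Editor's illustration, not part of the talk.) A closed-loop remediation, as described above, typically has the shape sketched below. The restart action stands in for whatever the documented runbook step is, and the callables are placeholders for your real monitoring and paging systems:

# Minimal sketch of closed-loop auto-remediation: act, verify, escalate.
import time
from typing import Callable

def auto_remediate(service: str,
                   restart: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   page_oncall: Callable[[str, str], None],
                   max_attempts: int = 2) -> None:
    for attempt in range(1, max_attempts + 1):
        restart(service)                 # the documented runbook action
        time.sleep(30)                   # give the service time to come back
        if is_healthy(service):
            print(f"{service} auto-remediated on attempt {attempt}")
            return                       # closed loop: no human was involved
    # The automation knows its limits: anything it cannot fix becomes a page,
    # ideally with the diagnostic context already attached.
    page_oncall(service, f"auto-remediation failed after {max_attempts} attempts")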
So we have to move to thinking about contributing factors. If we know what happened and where things went wrong, let's explore the system as a whole and all the events and conditions leading to the outage. Again, this relates to point fixing: we are not only point-fixing the current problem, but looking at all the things that may lead to it and all the things it can cause in future. And as I said, it is always a problem of the system, not a human problem. SRE is not only about automation. This is another thing I see: many organizations are hiring more and more software developers because they have heard that SREs are developers, and the only thing those people are doing is automating a lot of things. The point is that we have to be very clear as to why we are automating, and that has to tie back to measurement. Do we really measure what we are automating? Is our project successful the moment we have implemented something? No, that is the starting point. Value creation starts only when the users are using the product or service that you have created, and in the same way, the value of automation starts only after it is implemented and in use. I've seen scenarios where people have automated something and then, when they start measuring, they find that the automation has created more problems and it is taking more time to do the work than it did earlier. We also have to understand that there are constraints of resources, funds and time, so we need to prioritize what is most important and automate that. SREs need to do that prioritization based on facts and data, not on the basis of what we feel. It is also important that we allow the people who are doing the work to decide on the automation and on the tools that are going to be used. And for scaling, along with the value stream team, along with the product team, SREs can take ownership of the platform: the entire CI/CD pipeline and the release and deployment up to production can be created as a platform, and SREs can provide that platform as a service to the product teams, to the value stream teams. That is the best combination for scaling. And why do we talk about SRE taking responsibility for the platform? Because today the developers are using the same tools that operations is maintaining. If developers are using Kubernetes, production is also on Kubernetes; if you are using Docker here, you are also using Docker there. The entire definition of done has changed; as per DevOps, the definition of done has extended to production. The work is not complete unless it is tested in production by actual users, so you use things like A/B testing, blue-green deployments and canary testing, where it is done in production with actual users. That is what SREs are facilitating, so SREs have to look at the system in its entirety. These are the kinds of anti-patterns that I have seen, and I hope this will help you chart your SRE journey in a much better way. Thank you. If you have any questions, you can always get back to me by email, on my Twitter handle or on LinkedIn. And we at DevOps India Summit 2022 have partnered with Conf42; that global event is coming up on 26th August 2022, from 08:00 a.m. to 08:00 p.m. India time. Registration is free. Join us for more speakers on various topics. Thank you.
...

Niladri Choudhuri

CEO @ Xellentro



