Conf42 Site Reliability Engineering 2022 - Online

Combat sports principles that apply to Site Reliability Engineering

Abstract

As a Senior Cloud Infrastructure Engineer, I find that my life is vastly different from professional fighters and athletes competing in combat sports for many obvious reasons. However, there are some principles from the combat sports world that have an interesting application to professional life in Site Reliability Engineering (SRE).

This talk demonstrates how these principles have helped me navigate difficult situations in SRE, and how they can help new engineers who are just starting out in SRE.

While there are not a lot of obvious overlaps on paper between being in combat sports and being in SRE, there is advice and guidance from those throwing punches that can help us knock out certain SRE challenges.

Summary

  • Paul Marsicovetere is a Senior Cloud Infrastructure Engineer at Formidable in Toronto. Today he'll talk about combat sports principles that apply to site reliability engineering. He'll demonstrate five of these principles, along with some examples of how they apply to SRE.
  • A cloud infrastructure engineer is never too far away from the next outage or difficult incident. As in combat sports, in SRE you need to roll with the punches and respond rather than react. A more measured and mature response during an outage has several advantages.
  • A fight is won or lost based on preparation. For SRE, this translates to preparing for as many possible outcomes as is feasible. Practice game day exercises for disaster recovery so that when an actual outage occurs, things do not feel so unfamiliar. Being prepared can also take shape by practicing deployments in lower environments before releasing directly to production.
  • The best laid plans for launching a service or product can quickly go awry. Failing tests, incorrectly sized scaling groups, even expired SSL certificates can all cause incidents. SRE teams need to stay on their toes and stay focused for when they too are metaphorically punched in the face.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Paul Marsicovetere and today I'm going to talk about combat sports principles that apply to site reliability engineering.
So, a little bit about myself. I'm a Senior Cloud Infrastructure Engineer at Formidable in Toronto and I've been here since October 2020. Formidable partners with many different companies to help build the modern web and design solutions to complex technical problems. Our key values are inclusion, autonomy and craft, and we have a sizable open source community thanks to products like Victory, urql and Spectacle. Previously I was working in SRE for Benevity in Calgary for three years, and while I'm originally from Melbourne, Australia, I've been living happily in Canada for over ten years now. You can get in touch with me on Twitter at @paulmarsicloud, on LinkedIn, and via email. I'm always open to chat with anyone about anything cloud computing, SRE, DevOps, and certainly about anything related to MMA and boxing. I also run a serverless blog called The Cloud on My Mind in my spare time.
Combat sports and SRE are not areas that would typically be associated with one another. And let's face it, combat sports like boxing and MMA are never typically associated with tech, software engineering or systems administration either. These two things are not quite the same: in SRE, we don't have to worry about protecting ourselves at all times or sustaining any bodily injuries, depending on which company you work for, of course. However, there are some principles from the combat sports world that have quite an interesting application to our professional lives in SRE. Today, I'll demonstrate five of these combat sports principles, along with some examples of how they specifically apply to SRE.
So let's get started with number one: sometimes you're the hammer, sometimes you're the nail. What exactly does this mean in combat sports terms?
This means that in some fights you will be the one winning and punching your way to victory, but in the next fight, it might just be your turn to be on the losing end. It's important to roll with the punches, as you might find yourself as the nail from time to time. For SRE, due to the complex nature of cloud infrastructure engineering, you are never too far away from your next outage or difficult incident at any given time. However, it's important to remember that when things are going very well or very poorly, the inverse is always right around the corner. We can all have those days where things are going incredibly well and the updates and changes that you are making are helping tremendously. These updates and changes could be adding an extra availability zone for redundancy, optimizing slow database queries, or putting together a backup and restore solution that you are confident in during disaster recovery. This would be us in SRE being the hammer. However, outages such as cloud vendor or cloud service unavailability can and will happen. Full regions can go offline or be intermittently unavailable, or your DNS could be completely unavailable. Those days are when we are not the hammer; we are definitely the nail. As in combat sports, in SRE you need to roll with the punches. Your next outage is never far away and, as best you try, you cannot prevent or architect around each and every one of them. There are going to be busy and complex days, situations and incidents where you're clobbered from many different angles. It's important to remember that the days where awesome improvements are implemented are always just around the corner, so never be too hard on yourself.
Number two: don't throw and hope, aim and fire. So here's a visual representation of throwing and hoping for everyone to understand, and this is a visual representation of aiming and firing. Homer didn't have quite enough power on the punch, but the accuracy was pretty on point.
In combat sports, this takes the shape of fighters sometimes wanting to throw haymakers and hope that a knockout shot lands to win the fight. However, many fighters are of the belief that precision beats power and timing beats speed. In SRE, similar situations certainly occur. For example, when faced with a slow performing website, you may just increase the infrastructure, memory or CPU availability to see if that resolves a particular issue. You could start terminating whatever services are running on a server to free up available memory or CPU resources to see if that helps as well. You might even perform the classic "turn it off and on again" tactic and hope that fixes an issue. All those actions could be viewed as throwing haymakers rather than properly investigating and debugging why a system or service is failing, and then making the necessary changes to fix the underlying issue. Granted, the haymakers thrown might solve an issue completely or buy you extra debugging time during a particular incident or outage, which is always welcome. However, always defaulting to these responses each time an incident occurs unfortunately is not a valid long-term tactical response. It's far better for you and your SRE team to aim for the root cause or causes as to why an issue might be occurring and then fire your remediation actions via monitoring, alerting or code fix deployments.
Number three: don't react, respond. No, this isn't another framework for JavaScript. We're actually talking about our actions and emotions during particular situations. In combat sports, this principle is there to remind fighters to keep their emotions in check when something is happening, as it can help control the outcome of what happens next. An uncontrolled reaction might cost a fighter victory or expose a flaw that could be used against them in the future. A response is always more calculated and tactical in the long term.
In SRE, this is a very important principle when it comes to dealing with incidents or outages. When faced with an unfortunate situation, it's far too easy to think that the whole world is crumbling around you or that the incident itself is completely unresolvable. Some folks might even start to blame their colleagues, vendors or the software application and logic. This is very easy to do and it's very easy to react in this way. However, the consequences of directing this blame are very hard to undo. A more measured and mature response during an incident or outage has several advantages. It allows your SRE team to focus clearly on the task at hand and not be distracted by the external noise. It also shows great leadership and calmness under pressure to those involved with the incident. No one wants to be on the receiving end of a negative reaction to an incident or outage, and a more measured response will always elicit a better outcome than a knee-jerk reaction. We in SRE call this incident response and not incident reaction for this specific reason. So remember to take a moment to breathe, think, plan and then respond accordingly. At the end of the day, all problems have solutions.
Number four: it only takes one punch to change a fight. This is very important when it comes to combat sports, as a perfectly placed punch that is thrown can cause you to win a fight, or a perfectly placed punch that is absorbed can cause you to lose a fight. This gives fighters hope that even if things are not going their way, one punch can change everything. In SRE, we can take this principle and apply it to many things, but in particular it applies to incident response. Sometimes it might feel like a particular problem or incident cannot be solved or fixed. However, it's important to keep in mind that sometimes all the team requires is one particular command or corrective action to fix an incident or problem entirely.
These can take shape in the form of a firewall rule, updating an application package, downgrading an application package, or migrating traffic from one availability zone to another. Sometimes that one particular action helps resolve many different issues. Conversely, it's equally important to keep good communication amongst the SRE group and teams and not perform any commands or actions without letting the team know and potentially having their authorization and permission to do so as well. Nobody enjoys cowboy coding and a team member deciding, "You know what? I'm going to update this database row," or "Screw it, I'm going to restart this server." That can make a bad incident even worse. I say these things not to stress anyone out or advise folks to perform no actions. Rather, you must be cognizant that the actions you take can have very positive or very negative effects. So think carefully and talk to your team before you hit the enter key, especially during an incident.
Finally, with number five, it's important to remember that a fight is won or lost based on preparation. What this means in combat sports terms is that the true outcome of a fight is determined by how well the fighters prepare. Preparation takes place in the form of training in the gym, and when situations are not prepared for, this is when fights can be lost. Fail to prepare means prepare to fail. For SRE, this translates to preparing for as many possible outcomes as is feasible. Sure, you cannot prepare for every outcome, but preparing for as many as you can will set you up for success should you encounter any roadblocks. For example, it's important to practice game day exercises for disaster recovery so that when an actual outage occurs, things do not feel so unfamiliar.
This can be hugely beneficial as it gives the SRE team a chance to refine their incident response coordination, be it through roles and/or processes, with practice exercises rather than during an actual outage where services are unavailable for your clients and for your customers. This refinement is essentially the preparation needed so that if similar incidents happen in the future, the SRE team has some comfort in being able to troubleshoot and resolve the incident faster. Further, by practicing these game day exercises, there is much less reliance on winging it or guessing what to do when troubleshooting an issue, whether or not it is actually related to an incident. Practice certainly makes perfect, and the more an SRE team seeks the uncomfortable, the more they will become comfortable. Similarly, being prepared can also take shape by practicing deployments in lower environments before releasing directly to production. This can potentially save you from outages or downtime simply by taking the time to preview how a service might look and be deployed in a pre-prod or staging environment rather than always deploying directly to production. Yes, there are trade-offs for deployment velocity for teams to consider here, but this can be a helpful approach in establishing confidence in production deployments, as you set an expectation of what you expect to see based on a similar looking environment. The same can definitely be said about automated testing in deployment pipelines as well.
Now, I said I would talk through five combat sports principles. There's actually a bonus one that applies to SRE and to life in general, and that is the infamous Mike Tyson quote: everyone has a plan until they are punched in the face. This is important to remember, as this principle still rings true for SRE. The best laid plans for launching a service or product can quickly go awry due to a whole host of reasons.
Failing tests, incorrectly sized scaling groups, even expired SSL certificates can all cause incidents or affect service availability very quickly. SRE teams need to stay on their toes and stay focused for when they too are metaphorically punched in the face. In tech, not just in SRE, plans often change quickly and drastically. While this becomes easier to adapt to over time and with experience, the threat of seemingly random errors, incidents and outages that alter the original plan will forever remain. So it's important to stay ready for what the machines decide to throw at us.
So, to recap the combat sports principles that apply to SRE. Number one: sometimes you're the hammer, sometimes the nail. You need to appreciate the good and the bad within our lives in SRE. Number two: don't throw and hope, aim and fire, especially when troubleshooting and resolving incidents. Number three: don't react, respond. This is why it's called incident response and not incident reaction. Number four: it only takes one punch to change a fight. Think of this whenever you are troubleshooting or are mid-incident: one command can alter the outcome for better or for worse. And number five: a fight is won or lost based on preparation. So always try to stay prepared in your incident response and troubleshooting so that you're not having to wing it as often. And of course, try to remember that all the best laid plans can go awry once things start to go badly. Keep your head up, stay in the fight, and as the referee says, protect yourself at all times.
With that, I want to thank you for tuning in and listening to this talk on combat sports principles that apply to site reliability engineering. A big thanks to Conf42 for providing this opportunity, and I look forward to hearing from everyone in the future. Thanks.

Paul Marsicovetere

Senior Cloud Infrastructure Engineer @ Formidable



