Conf42 Site Reliability Engineering 2022 - Online

A guide to join operational works in your new DevOps team

Abstract

This talk shares my experience of joining a new engineering team. By a DevOps team, I mean a team in which developers and IT operations work collaboratively throughout the product lifecycle.

Bottom-up approach

Imagine you don’t have much room to learn everything about your service at once. Therefore, let’s break continuous learning into chunks. The learning process is observe -> record -> act.

In the observe phase, let’s jump into alerts and exceptions even if you’re not familiar with them. Gradually, you will come to know the critical points for your service (e.g. website down).

In the recording phase, let’s write up missing documentation on behalf of teammates. Even if you cannot solve issues directly, this is helpful to your team. Writing runbooks gives you chances to understand your service and participate in operations. I would recommend you write the “Architecture” part to organize your understanding of your service’s technical design.
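As a rough illustration of what a first runbook draft could look like, here is a minimal sketch in Python. The section names other than “Architecture” and the dependencies question, the `RunbookDraft` class, and the example values are all hypothetical; the talk does not prescribe a template.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookDraft:
    """Hypothetical skeleton for a first, imperfect runbook draft."""

    title: str
    architecture: str = "TODO: describe how the service is built"
    dependencies: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the draft as plain text, marking sections that are still empty."""
        lines = [self.title, "", "Architecture", f"  {self.architecture}", ""]
        lines.append("Dependencies")
        lines += [f"  - {d}" for d in self.dependencies] or ["  (none recorded yet)"]
        lines.append("")
        lines.append("Steps")
        lines += [f"  {i}. {s}" for i, s in enumerate(self.steps, start=1)] or [
            "  (none recorded yet)"
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    # Example values are made up for illustration.
    draft = RunbookDraft(
        title="Queue worker has stopped",
        dependencies=["message queue", "primary database"],
    )
    print(draft.render())
```

Sections you cannot fill in yet are rendered as visibly empty, which makes it easy for teammates to see where their knowledge is needed.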

In the action phase, let’s try service operations (e.g. fixing broken data manually). Pairing on operations is a good way to get started with operational work.

Top-down approach

Check the big picture of your service’s reliability (e.g. are there any SLAs, SLOs, or SLIs?)
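To make the relationship between an SLI and an SLO concrete, here is a small sketch in Python. It assumes a request-based availability SLI and a hypothetical 99% SLO target; the function names and numbers are illustrative and do not come from the talk.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical numbers: 9,990 good requests out of 10,000, SLO target 99%.
    sli = availability_sli(9_990, 10_000)
    print(f"SLI: {sli:.4f}")
    print(f"Error budget remaining: {error_budget_remaining(sli, 0.99):.0%}")
```

Knowing how much error budget is left gives a newcomer a quick sense of how urgent a given reliability problem is for the team.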

Summary

  • Kazuki is a senior backend engineer at Autify. His role includes infrastructure development and operations such as incident handling. This talk will give you tips on quickly catching up on service-specific knowledge.
  • Troubleshooting is one of the critical activities for anyone who operates web services. Lack of service understanding can be a challenge within that process. The more you jump into problem reports, the more knowledge you gain about problem patterns. Continued participation in operations is essential.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. In this talk I'm going to talk about how to join the operational work of your new DevOps team: work such as manual recovery operations, triaging alerts, addressing incidents, and so on. Troubleshooting is essential, especially for SREs, but it's difficult for new members because you don't yet understand well enough how your systems work. This talk will give you some tips on quickly catching up on service-specific knowledge and contributing to your team from early on. I'm Kazuki, a senior backend engineer at Autify. The responsibilities of our backend engineer role include infrastructure development and operations such as incident handling, being in on-call rotations, etc. Autify is a startup company which provides an AI-based software test automation platform. We have two products, Autify for Web and Autify for Mobile. I'm involved in the development and operations of the Autify for Web service. The ideas in this talk are based on my experience at Autify. First, I would like to discuss what makes it difficult for us to join service operations. Troubleshooting is one of the critical activities for anyone who operates web services. It's often viewed as an innate skill that some people have and others don't. However, the book Site Reliability Engineering shows a general model of the troubleshooting process. It's useful for analyzing what makes it difficult to join service operations, so let me explain it briefly. First, troubleshooting starts with a problem report: for example, metrics alerts, application exceptions, customer inquiries, et cetera. After that, we can look at system telemetry and logs. Through this exercise we can understand the current state of the system and identify possible causes. When we find potential solutions, we can actively treat the system, that is, change the system in a controlled way and observe the results.
This is the general process of troubleshooting, but when you join a new team, there are some challenges regarding triage. Let's say you get an alert that CPU utilization is over 90%. Once you receive the problem report, you need to consider whether you have to take action immediately, after answering the following questions in your mind. Is this the first time your team has seen this alert? Which workflow is the server used for? Do users use the service, or is it only for internal use? Is it a known issue for the team? However, when you have just joined, you will not yet have enough knowledge to answer these questions. The book Practical Monitoring discusses this in the context of alerts. It says that, generally, there are two different kinds of alerts. The first is an alert meant to wake someone up. This kind of alert requires action to be taken immediately, or else the system will go down. For example, when all web servers are unavailable, we should take action immediately. The second is an alert meant as an FYI, for your information. It doesn't require any immediate action, but someone ought to be informed that it occurred. For example, when an overnight backup job fails, software engineers ought to be informed so that they recognize they may need to take action the next business day. Contextual judgment is one challenge in doing operational work. It depends heavily on knowing the system's failure patterns. After a few months of watching various alerts you may be able to judge the urgency, but it will take time. There are also problems in the examine, diagnose, and later phases. Even if you can figure out the severity of problem reports, you need knowledge of how the system is built, how it should operate, and its failure modes. Basically, the exercise depends upon two factors. The first is an understanding of how to troubleshoot generically. The second is solid knowledge of the system.
For example, let's say you get a problem report that queue processing has stopped and the number of waiting events has increased. To solve this case, you need to know things like the following. What events does the queuing system handle? Are there any known possible causes? Have we implemented a retry mechanism? So again, lack of service knowledge becomes an issue here. So far I have introduced the troubleshooting process and mentioned that lack of service understanding can be a challenge within that process. Next, I'm going to give three pieces of advice for starting to participate in service operation work. First, I would advise you to look at problem reports even if you are not sure about them. When you see problem reports that you're not sure about, click the link to them. It doesn't matter whether you can directly contribute to solving the problem. Just be yourself and keep looking at incoming reports in a casual manner. It's a good idea to set a time box, in other words a time limit, for example 30 minutes, as too much time may interfere with your main work. The more you jump into problem reports, the more knowledge you gain about problem patterns. The next piece of advice is to leave what you learn in documents. After going into detail about an alert, create a blank page in an internal documentation system. In the case of Autify, we use Notion as a knowledge sharing system and Datadog as an observability platform. So I create a blank page in Notion, or sometimes create a new investigation note in Datadog, and then leave what I have learned in the document. For example, when you have learned about the system architecture related to the problem, write it down briefly. It's also recommended to note any similar cases you find that have occurred in the past. This will help you visualize your learning. To keep participating in operations, trust is essential.
Making your learning visible lets your peers know how much you understand about how your system works and how to diagnose atypical system behaviors. In the book 97 Things Every SRE Should Know, Lorin Hochstein, who is an engineer at Netflix, recommends watching experts in action. He says that, in general, the best way to facilitate skill transfer is to watch experts in action. Ideally, you're working alongside them: watch them solve problems and document how they mitigate operational surprises. I love this idea. In my case, Autify's work style is remote, so if I don't know how my peer investigated a problem, I ask them on Slack after it's been resolved and write a new document. The third piece of advice is to write runbooks. This may be a well-known concept, but let me explain it briefly. A runbook is a detailed how-to guide for completing a commonly repeated task or procedure within a company's IT operations process. It guides an operator with step-by-step instructions. It's sometimes known as a playbook. The merit for your team is that runbooks are a shared base of knowledge and expertise that would otherwise be kept solely in the hands of subject matter experts. A subject matter expert is a person who is an authority in a particular area or topic. Once you have written a procedure down, you will be able to take over its operation. At this point, you can start participating in service operations. Writing runbooks will mitigate a common anti-pattern called "Only Brent Knows". The book Operations Anti-Patterns, DevOps Solutions defines this anti-pattern. The book says that, unless purposeful action is taken, information tends to coalesce around key individuals. It makes those individuals incredibly valued but also equally burdened. Your documentation work will reduce the burden on experts by sharing their knowledge and expertise. Besides, this is only possible because of you.
The book Software Engineering at Google puts it like this: the first time you learn something is the best time to see whether the existing documentation and training materials can be improved. You are the best person to write a new runbook if your team doesn't have a document about it. A good runbook answers these questions. Specifically, I would recommend you answer the third and fourth questions in your document. Regarding the third question, "What dependencies does it have?": modern systems are distributed systems, and in some cases external services are involved. In such cases, this information is beneficial for understanding the system architecture. The fourth question, "What does the infrastructure for it look like?", is also a good question for understanding the system architecture. Writing these parts is a great opportunity to organize your learning and form reusable knowledge for the team. You may feel uncomfortable writing runbooks because you don't feel you can write perfect, useful documentation for the team. However, the book Seeking SRE addresses this concern like this: write a shitty first draft, because an imperfect document is infinitely more useful than a perfect one that does not yet exist. You don't need to finalize the documentation perfectly. I would recommend that you be the first writer. Finally, let's recap the three tips. First, look at problem reports, even if you are not sure about them. Second, leave what you learn in documents. Third, write runbooks for your team. The books and blogs mentioned in this presentation are listed on this slide. I hope this presentation will help you. Thank you for listening to my presentation.

Kazuki Higashiguchi

Software Engineer @ Autify
