Avoiding the death of SRE documents that matter

Abstract

SRE teams can prevent the decay of processes by creating high-quality documentation, but the most important part is keeping it updated. In this talk I am going to describe the various types of documents SREs create during the life cycle of their services and how they could be kept updated automatically.

Summary

Yury Nino is a cloud infrastructure engineer at Google Cloud. The talk describes a non-exhaustive list of SRE documents. Site reliability engineering is a job function, a mindset, and a set of engineering approaches for making web products and services run reliably.
According to the Stack Overflow 2022 Developer Survey, the most important resource for people learning how to code is technical documentation. But the same participants rank documentation as the number two challenge facing developers. Missing, incomplete, or outdated documentation hurts development velocity and software quality.
SRE documents can be classified into these categories: documents for running a service, documents for new service onboarding, documents for reporting on services, documents for service decommissioning, and documents for running SRE teams.
SRE teams need to have a cohesive set of reliable, discoverable documentation to function effectively as a team. SRE teams invest in training materials and processes for new SREs. The talk closes with some recommendations from Google to keep documents alive.
Thank you so much for listening to this presentation. I really hope that it will be useful for you. Have a good day.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everybody, thanks for being here. It is an honor to have a new opportunity to share my latest work, Avoiding the death of SRE documents that matter. Nice to meet you. I am Yury Nino. I am Colombian. I work as a cloud infrastructure engineer at Google Cloud. I am also a chaos engineering and site reliability engineering advocate. You can find me on LinkedIn, Twitter, and Medium as jolinita.
Regarding this talk, I have to say that this material is the result of a lot of reading and self-learning while trying to build a framework for documenting my exercises in chaos engineering. Along this path, I found an excellent article titled Why SRE Documents Matter, written by Shylaja Nukala and Vivek Rau, who are SRE product managers at Google. I am going to start with this. After that, I am going to describe a non-exhaustive list of SRE documents, which was completed using the book Seeking SRE. With this in mind, I am going to share some recommendations that I learned while studying this exciting topic.
Since you probably already know about SRE, I will just share this: site reliability engineering is a job function, a mindset, and a set of engineering approaches for making web products and services run reliably. SRE operates at the intersection of software development and systems engineering to solve operational problems and engineer solutions to build and run large-scale distributed systems. SRE is focused on the life cycle of services, from inception and design, through development, operation, and refinement, to eventual decommissioning. In these phases, SRE provides principles and practices for monitoring and metrics, which is about measuring how the service is actually behaving and correcting discrepancies, and for emergency response, in other words, noticing and responding to service failures in order to preserve availability according to SLAs. Capacity planning is also very important here.
That means projecting future demand and ensuring that a service has enough computing resources in appropriate locations to satisfy that demand. Service turn-up and turn-down means deploying and removing computing resources for a service in a data center in a predictable fashion, often as a consequence of capacity planning. Change management is about altering the behavior of a service while preserving service reliability and performance. And design, development, and engineering relate to scalability, isolation, latency, throughput, and efficiency.
Regarding documentation, I would like to share that, according to the Stack Overflow 2022 Developer Survey, the most important resource for people learning how to code is technical documentation. However, the same participants rank documentation as the number two challenge facing developers. This is a problem because missing, incomplete, or outdated documentation hurts development velocity, software quality, and, something that is critical for SRE, service reliability.
So let me join this idea with the reason for documenting SRE processes. Teams often preserve important operational concepts and principles as nuggets of tribal knowledge that are passed on verbally to new team members. And it is a fact that if these concepts and principles are not codified and documented, they often need to be relearned painfully through trial and error. In this sense, documents are very important for many reasons. They help developers communicate with each other, and they allow future developers to understand and maintain the code. In general, good documents help you learn from your mistakes.
SREs often spend 35% of their time on operational work, which leaves only 65% for development. This last percentage includes time for documenting, of course. However, documentation takes time and effort, and there are special challenges for SREs. It can be a major cause of frustration and job unhappiness for developers.
Some developers consider that documenting might not be recognized or rewarded during performance reviews or the promotion process.
Having mentioned the importance and the challenges of documentation, let me focus on the types of SRE documents according to the article Why SRE Documents Matter. Remember, this presentation is based on this article. They can be classified into these categories: documents for running a service, documents for new service onboarding, documents for reporting on services, documents for production products, documents for service decommissioning, and documents for running SRE teams. So let me explore the documents in each category in the next slides.
Documents for new service onboarding: the documents in this category belong to a more general concept. I am talking about the PRR, the production readiness review. A PRR is conducted to make sure that a service meets accepted standards of operational readiness and that its owners have the necessary guidance for running it. So let me go through the specific documents for a production readiness review. The first one is architecture and dependencies, which provides answers to questions such as: what is your request flow from user to front end to back end? Are there different types of requests with different latency requirements? Next is capacity planning, a document with answers to these questions: how much traffic and rate of growth do you expect during and after the launch? Have you obtained all the compute resources needed to support your traffic? Another important document here is failure modes, which provides answers to: do you have any single points of failure in your design? How do you mitigate unavailability of your dependencies? For processes and automation, the questions are: are any manual processes required to keep the service running, and how are we automating these processes?
Finally, in this subcategory of documents for new service onboarding (remember, we were talking about production readiness review documents), the main objective of these documents is to answer these questions: what third-party code, data, services, or events does the service or the launch depend upon? Do any partners depend on your service? If so, do they need to be notified of your release?
In addition to PRRs, this first category includes documents for onboarding services. The SRE organization needs to create overview documents that explain the SRE role and responsibilities in general terms to product development teams. This serves to set expectations correctly. A primary goal of this document is to ensure that developer teams don't equate SREs with an operations team. When developers don't fully understand what SREs do, miscommunication and misunderstanding can result. Another important document here is the engagement model document, which explains how the SRE team will engage with developer teams during and after service takeover. Topics covered in this document include service takeover criteria and the PRR process, the SLO negotiation process and error budgets, new release criteria and release policies, the content and frequency of service status reports from the SRE team, SRE staffing requirements, and future roadmap planning processes.
If you remember, SREs need to be effective across two domains, development and operations, so they need documents to understand and run services in production. Examples of the core documents include service overviews, playbooks and procedures, postmortems, policies, and SLAs. So let me review each one. SREs need to know the system architecture, components, dependencies, and service contacts and owners. These overviews are often an output of the PRR process, and they should be updated as the service changes.
This applies, for example, when you include a new dependency in the service. A complete service overview provides a full description of the service and how it interacts with the world around it, as well as links to dashboards, metrics, and related information that SREs need to solve unexpected issues.
Also called runbooks, I love this type of document. I am talking about playbooks. On-call engineers respond to the alerts generated by service monitoring. Playbooks reduce the time to mitigate an incident, and they provide useful links to consoles and procedures. Playbooks can contain instructions for verification, troubleshooting, and escalation for each alert generated from network monitoring processes. They contain commands and steps that need to be reviewed for accuracy.
The next one is a classic, as you know: the software postmortem. It is an analysis conducted after a system failure. Each postmortem derived from this template describes a production outage: timeline, description of user impact, root cause, action items, and lessons learned.
In this slide I am going to talk about policies, and the first kind is technical policies. Technical policies can apply to areas such as production change logging, log retention, and internal service naming. Policies also apply to processes. Escalation policies help engineers classify production issues as emergencies or non-emergencies and provide recommendations on the appropriate action for each category. On-call expectation policies outline the structure of the team and the responsibilities of the team members.
An SLA is a formal agreement with a customer on the performance of a service. SRE teams document their services' SLAs for availability and latency, and monitor performance against them. Having SLAs allows SRE teams to innovate more quickly while preserving a good user experience. SREs running services with well-defined SLAs will detect outages faster and therefore resolve them faster.
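To make the availability part of an SLA concrete, here is a minimal sketch of the arithmetic behind a downtime allowance. The function name, the 99.9% target, and the 30-day window are assumptions chosen for illustration; the talk does not give specific targets.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, permitted by an availability target.

    availability_target is a fraction such as 0.999 (an assumed example value).
    window_days is the measurement window (30 days is an assumption).
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    # Whatever fraction of the window is not covered by the target may be downtime.
    return total_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of downtime.
    print(round(allowed_downtime_minutes(0.999), 1))
```

Documenting the target together with the resulting allowance gives SREs and developers the same number to monitor against.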
Good SLAs also result in less friction between SREs and software engineers, which is a classic problem.
In the next slides, I am going to describe documents that are related to the products and tools SREs develop. These documents are important because they enable users to find out whether a product is right for them to adopt, how to get started, and how to get support. They also provide a consistent user experience and facilitate product adoption. This helps SREs and product development engineers understand what the product or tool is, what it does, and whether they should use it.
Let me start with the concept guide or glossary, which defines all the terms unique to the product. The goal of a quick start guide is to get engineers up and running with a minimum of delay. It is helpful to new users who want to give a product a try. How-to guides: this type of document is for users who need to know how to accomplish a specific goal with the product. How-tos help users complete important specific tasks, and they are generally procedure based. Finally, engineers use developer guides, which are really important here, to find out how to program to a product's APIs. Engineers can also use codelabs or tutorials, which you can find at Google for example. Tutorials combine explanation, example code, and code exercises to get engineers up to speed with the product. Codelabs can also provide in-depth scenarios to walk engineers step by step through a series of key tasks. I really like this. The FAQ page answers common questions. The answers also point users to reference documents and other pages on the site for more information. The support page identifies how engineers can get help when they are stuck on something.
It also includes an escalation flow, troubleshooting info, group links, dashboards, and SLO and on-call information. An API reference is usually generated from code comments and sometimes written by tech writers. It provides descriptions of functions, classes, and methods with minimal verbosity or narrative.
This part describes the documents that SRE teams produce to communicate the state of the services they support. Information about the state of a service comes in two forms: a quarterly service review and a presentation about it. The goal of the quarterly report or presentation is to cover the state of the service, including details about performance, sustainability, risk, and overall production health. SREs are interested in high-quality reports because they provide visibility into the following: on-call tickets, postmortems, performance against SLAs, and risks. Quarterly reports also provide opportunities for the SRE team to highlight the benefits of SRE prioritization for solving problems and to request feedback about the SRE team.
Folks, SRE teams need to have a cohesive set of reliable, discoverable documentation to function effectively as a team. So it is a good practice to create a team site with a team charter, because it provides a focal point for information and documents about the SRE team and its projects. At Google, for example, many SRE teams use g3doc, an internal documentation platform where documentation and source code live together. Specifically, team charters explain the rationale for the team and document its current major engagements. A charter serves to establish the team identity, primary goals, and role relative to the rest of the organization. A charter generally includes the following elements: a high-level explanation of the space in which the team operates, a short description of the top two or three services managed by the team, the key principles and values for the team, and links to the team site.
SREs and documents: SRE teams invest in training materials and processes for new SREs because training results in faster onboarding to the production environment. SRE teams also benefit from having new members acquire the skills required to join the ranks of on-call as early as possible. In the absence of comprehensive training, the on-call SRE can flounder during a crisis, turning a potentially minor incident into a major outage. Many SRE teams use checklists for on-call training. An on-call checklist generally covers all the high-level areas that team members should understand well. Examples of these include production concepts, the front-end and back-end stack, automation and tools, and monitoring.
SREs also use role-play training. For example, at Google we use the Wheel of Misfortune, which is an educational tool for training team members. In this exercise, an outage scenario is presented to the team along with data, and a team member plays the role of the on-call engineer in order to mitigate the emergency. Role-play exercises test the ability of SREs to know where to find the documents and to use them for solving the hypothetical incident.
Let me finish this presentation with some recommendations from Google to keep documents alive. The first one is communicating the value of documents. That is very important. That is my first recommendation. If you want to convince fellow engineers and leadership to invest time and resources in documents, it is essential that you gather data that demonstrates the quality, effectiveness, and value of your documentation. Remember, when you talk about the impact of your documentation work, you are talking about the business value of your output. While structural quality is pretty easy to measure, functional data is more convincing. Remember that it falls into three buckets: measurable success, user behavior, and sentiment data.
The second one is to create a centralized repository.
SRE team information can be distributed across a number of sites, local team knowledge, and Google Drive folders, for example, which can make it difficult to find correct and relevant information. So it is really important to define a structure for all information and ensure that team members know where to store, find, and maintain information. That is the most important part. Here are some guidelines to create and manage a team document repository. Determine relevant stakeholders and conduct brief interviews to identify all needs. Locate as many documents as possible and do a gap analysis on content. Set up a basic structure for the site so that new documents can be created in the correct location. Create a monitoring and reporting structure to track the progress of migration and archiving of documentation; that is very important. Verify which common terms are used in the search feature, for example. And finally, use signals such as Google Analytics to track and analyze usage.
After you determine the functional requirements and quality indicators for each document, it is really valuable to create templates. Templates make it easy for authors to create new documentation by providing a clear structure that they can populate with relevant information. With a good template, creating a simple document can be as easy as filling out a form. Templates also make it easy for readers to quickly understand the topic of the document, the type of information, and how it is organized. The Site Reliability Engineering book contains several documentation templates. Here is an example of a playbook template that provides structure and guidance for engineers filling in the content.
Another important recommendation is defining success metrics. It is essential if you want to be able to communicate the value of your work to the rest of your organization.
For example, a service review has high functional quality if it provides the context needed to handle an outage, and if you can measure its impact on the organization.
It is important to have guidance from technical writers because they provide a lot of support that makes SRE documents effective and productive. Here is some guidance for working with technical writers. Technical writers should partner with SREs to provide operational documents for running services and product documents for SRE products and features. Writers should provide consulting to assess, analyze, and address documentation and information management needs. Writers should evaluate and improve documentation tools to provide the best solution for SREs. Those are my recommendations when you are working with technical writers.
Finally, require documents as part of code review. Documentation is like testing: nobody really wants to do it, but you can minimize this sentiment by taking advantage of code reviewers' power. Not all changes require document updates, of course, but here is a good rule of thumb for doing documents better. Best practice: if a developer, SRE, or user of your project needs to change their behavior after this change, the change should include a documentation change. That is the recommendation here: require this as part of code review.
And that is all. Thank you so much for being here, and thank you so much for listening to this presentation. I really hope that it will be useful for you. Thank you. Have a good day.
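As an illustration of the code-review rule of thumb from the talk (a change that alters behavior should include a documentation change), here is a minimal sketch of a check a review tool could run. The function names, the `docs` directory, the file suffixes, and the `behavior_change` flag are all hypothetical conventions assumed for this example; the talk does not describe a specific implementation.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Hypothetical conventions, assumed only for this sketch.
DOC_SUFFIXES = {".md", ".rst", ".txt"}
DOC_DIRS = {"docs"}


def is_doc_file(path: str) -> bool:
    """Return True if the path looks like documentation under the assumed conventions."""
    p = PurePosixPath(path)
    return p.suffix.lower() in DOC_SUFFIXES or bool(DOC_DIRS & set(p.parts))


def needs_doc_update(changed_files: Iterable[str], behavior_change: bool) -> bool:
    """Return True when a behavior-changing change touches no documentation file.

    behavior_change is assumed to be declared by the author or reviewer,
    for example through a checkbox in the change description.
    """
    if not behavior_change:
        return False
    return not any(is_doc_file(f) for f in changed_files)


if __name__ == "__main__":
    change = ["service/handler.py", "service/config.py"]
    if needs_doc_update(change, behavior_change=True):
        print("This change alters behavior but updates no documents.")
```

A reviewer could treat a `True` result as a prompt to ask for a document update rather than as a hard block, since not every change requires one.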

Slides

Download slides (PDF)

Conf42 Site Reliability Engineering 2023 - Online

May 04 2023

Yury Nino

Cloud Infrastructure Engineer @ Google
