Conf42 Chaos Engineering 2023 - Online

Why Chaos Engineering Documents Matter


Abstract

Software documentation is a key communication medium for the decisions on a project, and this includes Chaos Engineering experiments. Since CE is still a young discipline, there is no framework for how we should document it. In this talk I am going to present one.

Summary

  • Yury Nino: Why chaos engineering documents matter. Nice to meet you. Nino is a Cloud Infrastructure Engineer at Google Cloud. She says there is no framework that defines how to document chaos experiments. She will share the benefits of documentation at each software development lifecycle stage.
  • The story begins on a Friday at 13:50, ten minutes before the controlled chaos: a perfect recipe for chaos in the middle of a controlled chaos exercise. Luckily, all the characters and episodes in this story are fictional.
  • The action plan includes a meeting between the development and SRE teams and, no less important, documentation, which is the main topic of this talk. Documentation helps developers communicate with others, and documents help future developers understand and maintain the code. These documents are also important in onboarding processes.
  • In the general category I have included team charters, production readiness reviews and technical designs. After the chaos, our mission will be to document postmortems and reliability reports.
  • A policy is a guide to making decisions that achieve excellent outcomes. Playbooks contain instructions for verification, troubleshooting and escalation for each alert generated by service monitoring processes. The document for after the chaos is a classic: I could not allow myself to miss the postmortems.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, thanks for being here. It is an honor to have the opportunity to share my passion: documenting for chaos engineering. I have given this title to my talk: why chaos engineering documents matter. Nice to meet you, I am Yury Nino. I am Colombian; this is the flag of my country. I work as a Cloud Infrastructure Engineer at Google Cloud. I am also a chaos engineering and site reliability engineering advocate. You can find me on LinkedIn, Twitter and Medium at yurynino.

As the poster of my talk mentions, everybody here knows that software documentation is key to communicating the decisions of a project. But something we have missed is that this includes chaos engineering exercises, so we should have frameworks for that. However, chaos engineering is still a young discipline: there is no framework that defines how to document experiments. In this talk I am going to present one. But before that, I would like to share a chaos story that puts in evidence the importance of having documentation for chaos engineering. Having said that, I am going to share the benefits of documenting each software development lifecycle stage. It is important to mention that I am not talking about documents for building and operating systems; I am talking about documents for the chaos. At that point I will present a framework composed of a set of documents that should be kept at each stage of the chaos. I am going to talk about the general documents for the team, for example the team charters, and I will explore other documents for preparing, running and finishing the chaos.

So let me start with the story. Don't judge my skills for titling stories; I am really bad at that. That is the reason for giving this simple title to the story: a chaos story in the middle of chaos. This story began on a Friday at 13:50, ten minutes before the controlled chaos.
I am going to use my name for the main actress: Yury. She is a reliability engineer, and she is participating in a chaos engineering exercise for the first time. She watched an experiment cohost announce the routine for chaos in the chaos engineering channel: a classical experiment in which the engineering team unplugs some network outlets in production.

At 14:00 it is the moment of truth. One of the cohosts runs the prepared command to initiate the failure. A colleague of Yury, the note taker, records the time. After that it is time for all participants except the note taker to spring into action. It is the moment for collecting the evidence of the failure, the recovery and the impact on the systems: confirm or disconfirm all the details of your hypothesis and make note of the remediations. At 14:30 it is the moment to restore the service. It is the real moment of truth; all the preparation and the exercises in the development environment have led you to this moment.

You notice that the production environment doesn't return to the steady state: something is wrong. The chaos lead decides to stop there, but it is too late, because the automated recovery procedure has put the region isolated for the chaos engineering exercise back online. The failure is now noticeable to the customers. This feels truly terrifying. For a moment Yury has no idea what to do, but at least she is not alone. Yury proposes escalating the situation to the development team, and the team thinks it is a good idea. Yury pages the development team; unfortunately, she gets no response. The chaos team searches the intranet to find some document that could help. After precious minutes go by, Yury finds a document referring to a script that can apparently fix the problem, but this is the first time the team has heard of it. Since it is the only option, they run the script and cross their fingers. Fortunately, it works. Luckily, all the characters and episodes in this story are fictional.
Let me analyze some thoughts about the story: a perfect recipe for chaos in the middle of a controlled chaos exercise. This is the postmortem of our story, starting with the things that went well. Let me mention that the team's use of the communication channels and the chaos engineering installation were things that went well. The things that didn't go well: lack of knowledge about the automated recovery script; the development team, as you remember, was unavailable; and, of course, the lack of documentation. Finally, let me close with the action plan: a meeting between the development and SRE teams and, no less important, documentation, which is the main topic of this talk.

The moral of the story: documentation is very, very important. Let me explain why. It is common for organizations to depend on the performance of highly skilled individuals, and teams tend to preserve important operational concepts and principles as nuggets of tribal knowledge that are passed on verbally to new team members. It is a fact that if these concepts and principles are not codified or documented, they will often need to be relearned, painfully, through trial and error. As I mentioned, documentation helps developers communicate with others, and documents help future developers understand and maintain the code. These documents are important in onboarding processes, for example, and good documentation helps you learn from your mistakes. I would like to share this fragment of an excellent article related to SRE documents: "SRE teams can prevent this process decay by creating high-quality documentation that lays the foundation for teams to scale up and take a principled approach to managing new and unfamiliar services."

Now it is the most expected moment: presenting the documentation framework. In the framework I have classified the documents according to the stages that are part of a classical chaos engineering exercise.
In the general category I have included team charters, production readiness reviews and technical designs. Before the chaos, it is important to consider the chaos policies, service agreements and on-call policies. During the exercise I propose using chaos designs, playbooks and incident management documents. And finally, after the chaos, our mission will be to document postmortems and reliability reports.

Let me go into detail on the general documents. I am going to start with one of my favorite assets: the team charter. A charter generally includes the following elements: a vision statement, which should be an aspirational description of what the team would like to achieve in the long term; a short, high-level explanation of the space in which your team operates, including the types of services the team engages with, related systems and examples; and a short description of the top two or three services managed by the team, highlighting the key technologies used and the challenges of running them. It also covers the benefits of chaos engineering and what chaos engineering does, and finally the key principles and values of the team.

The next relevant document is the production readiness review, or PRR as it is known: a document with the criteria considered to make sure that a service meets the accepted standards of operational readiness and that service owners have the guidance they need. A service goes through the review process before its launch to production. It is important to mention that during this stage the service has no SRE support, so the product development team supports the service. Some common sections include architecture and dependencies: what is your request flow from user to frontend to backend? Capacity planning is very important; a key question here is: have you obtained all the compute resources needed to support your traffic? Then the failure modes.
For example, do you have any single points of failure in your design, and how do you mitigate unavailability of your dependencies? Processes and automation are very important too. And finally, external dependencies: it is critical to know whether any partners depend on your services.

Technical design documents, also known as TDDs, are similar to proposals, but they describe in detail how a specific solution will work. Design documents are often written when the implementation of a solution is not trivial or not well understood. The author will usually have the design decided, but is looking for approval from reviewers. Since a design document should describe the steady state of a system, it is a really valuable asset for chaos engineers. Some common sections include an overview of the system; the system architecture, infrastructure and services; and documentation standards, which include naming conventions, for example. Other sections for a TDD include programming standards, development tools, requirements traceability matrices, document controls, document sign-offs and document change reports.

Now I will move to the documents that you should review before the chaos experiments. A policy is a guide to making decisions that achieve excellent outcomes. Policies are generally adopted by a governance body within an organization, so my first recommendation is to include an asset that documents the chaos policies. Here are the sections: an overview; the policy goals; a description of the meaning of the steady state in your services; the specification of your SLOs and SLAs; and the description of the chaos policies, including outage policies, escalation policies and related documents.

Then there are the classical service level agreements, famous in SRE. They usually include: what is the target for measuring the user experience? What process will indicate success or failure for the average user? What systems and pieces of the infrastructure are used to conduct that process and meet the target?
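The target-and-threshold idea behind these service level questions can be sketched in a few lines of Python. This is an illustration, not from the talk; the event counts and the 99.9% target are hypothetical:

```python
# Sketch: evaluating an SLO target against a measured service level
# indicator (SLI). All names and numbers here are made up for illustration.
GOOD_EVENTS = 99_950     # requests that met the success threshold
TOTAL_EVENTS = 100_000   # all requests in the measurement window
SLO_TARGET = 0.999       # 99.9% of requests must succeed

measured_sli = GOOD_EVENTS / TOTAL_EVENTS
error_budget = 1.0 - SLO_TARGET                  # allowed failure fraction
budget_spent = (1.0 - measured_sli) / error_budget

print(f"SLI: {measured_sli:.4%}")                # 99.9500%
print(f"SLO met: {measured_sli >= SLO_TARGET}")  # True
print(f"Error budget spent: {budget_spent:.0%}") # 50%
```

The error-budget view is useful for chaos policies: an experiment that would spend the remaining budget is a reason to postpone the exercise.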
For each target or metric you should identify the source, the measurement aggregation and duration, the threshold by which we can determine success, and any explanation behind the target selection. Finally, for this stage, the on-call policies, which are very important too; they usually include an overview, readiness, training and scheduling, the shift details, the pager load, compensation, tools and processes, and communication standards.

This is the real moment of truth: the documents that could be useful during the chaos experiment. The chaos design is, I think, one of the most important assets in the framework. I know I have said that all the documents are very important, but I think this one stands out because it describes the experiment. Here is registered what will happen: the hypothesis, very important, that you will verify or refute; the domain in which the application is running; the duration and the load applied in the experiment; and the results. It is like an airplane's black box. It also records your observability strategy and the actions that you will take from the results.

The playbooks are also called runbooks. With them, the on-call engineers respond to the alerts generated by service monitoring. In the story, if Yury had had a playbook telling her what to do in case of failure, the incident could have been resolved in a matter of minutes. Playbooks reduce the time to mitigate an incident, and they provide useful links to consult and procedures. Playbooks contain instructions for verification, troubleshooting and escalation for each alert generated by the service monitoring processes. They contain commands and steps that have been reviewed for accuracy.

Incident management documents, sometimes called incident response plans or emergency management plans, are documents that help an organization return to normal as quickly as possible following an unplanned event. The sections here: overview and readiness, shifts and handoffs, escalation, and incident responsibilities, which are very, very important.
If you remember, Yury is the reliability engineer in the exercise, but she is not the lead of the team. Then prioritization, key tools, dashboards and monitoring, and finally the useful links.

And finally, the documents for after the chaos. The first is a classic; as you know, I could not allow myself to miss the postmortems. Remember, a software postmortem is an analysis conducted after a system failure. The goal is to understand why an incident or error happened and to learn from the experience. The mission of this asset is that future software becomes more robust. Remember that problems happen; by conducting a postmortem analysis you can ensure that they don't happen again.

Reliability reports are common assets used to communicate the results of KPIs to management. Remember that software reliability is the probability of failure-free software operation for a period of time in a given environment. Some sections that I think you can include are: indicator name, collection method, assessment formula, stated criteria, target and performance threshold, source of data, data frequency, data entries and, finally, expiration or revision date.

And that is all. Thank you so much for being here. Remember, you can find me on LinkedIn, Twitter and Medium at yurynino, and on my personal webpage.
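The definition of reliability used in the talk (the probability of failure-free operation for a period of time) is often made concrete, under the common assumption of a constant failure rate λ, by the exponential model R(t) = e^(−λt). A minimal sketch with a made-up failure rate, which could back the "assessment formula" row of a reliability report:

```python
import math

def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of failure-free operation
    for `hours`, assuming a constant failure rate (exponential model)."""
    return math.exp(-failure_rate_per_hour * hours)

# Hypothetical figure: one failure expected per 1000 hours of operation.
rate = 0.001
print(f"R(24h)  = {reliability(rate, 24):.4f}")   # 0.9763
print(f"R(720h) = {reliability(rate, 720):.4f}")  # 0.4868
```

The model is only as good as its constant-failure-rate assumption; real reports usually pair such a formula with the measured data sources listed above.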
...

Yury Nino

Cloud Infrastructure Engineer @ Google



