Conf42 Chaos Engineering 2020 - Online

- premiere 5PM GMT

Postmortem Culture: Learning from Failure

Abstract

Practicing Chaos Engineering and reproducing outages have taught us that the culture of postmortems must be open and blameless. That is difficult, in part, due to the social stigma associated with publicly acknowledging the contributions of persons to outages.

Summary

  • The title of my talk is "Postmortem Culture: Learning from Failure". According to a survey of 45 engineers in my country, software engineers don't practice postmortems. I will explain how we can use chaos engineering and chaos game days to promote this practice in companies.
  • The survey found that 55% of people don't know what a postmortem is. Lack of accountability is related to a blameful culture. How can we change this culture? By using chaos engineering and planning a chaos game day.
  • Gaweta is currently a web application that provides its functionality as a service. In the top bar there are four options to manage the event. One option lets us define and provision the infrastructure required to run the chaos game day, and a final option shows the postmortem and the actions provided by the team.
  • Sidney Dekker tells us to see the world with a different view. If something goes wrong and a human is involved, it is a symptom of a deeper trouble, not the cause of it. Thanks for listening.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, thanks for coming. I'm very excited to be here. The title of my talk is "Postmortem Culture: Learning from Failure". Nice to meet you. I am Yury Nino. I work as a DevOps engineer for Aval Digital Labs, a company in Colombia that provides technology and innovation services for our banking group. I am also an engineering advocate in my country. These are photos of my coworkers and my company.
Before starting, I would like to tell you about my hometown, Garagoa. Garagoa is a town located in the Boyacá department in Colombia, and its name means "behind the hill" in the Chibcha language. For as long as I can remember, each December 16 people in Garagoa celebrate the end of the year with a postmortem. This postmortem is called "the death of the sadness". In this ceremony, people evaluate their actions over the last year and make resolutions for the new year. This celebration made me wonder: if we are doing postmortems in our daily life, why don't software engineers practice postmortems after an incident?
According to a survey of 45 engineers in my country, software engineers don't practice postmortems. For example, only 40% of them have read a postmortem, and 60% of them don't practice postmortems in their jobs. So, knowing there is a problem, in the next 15 minutes I am going to talk about postmortems. I am going to try to explain why we don't write postmortems. I will explain how we can use chaos engineering and chaos game days to promote this practice in companies. And finally, I will share our journey in my company, trying to implement this practice in our daily work and to promote chaos engineering in our jobs.
So let me start with some definitions of what a postmortem is. According to Site Reliability Engineering, the Google book, if you remember it, a postmortem is a written record, produced after an incident, with the details of what happened. And according to PagerDuty, a postmortem is a record of what happened in the incident. However, there is a definition I like more, which is an answer to these two questions: what went wrong, and how do we learn from it?
You probably remember these two postmortems. The first one, written by GitLab, documented an incident in which an engineer dropped a database in production, which led to an outage of several hours. The second one details one of the most critical outages at AWS, in February 2017. This document explained how they overcame a failure in the S3 service in Virginia, which had a huge impact on many websites. So if companies such as AWS and GitLab are practicing postmortems, why don't we do it?
According to the same survey, the most common reasons include ignorance and culture. 55% of people said that they don't know what a postmortem is, and a second group thinks that writing postmortems is an activity for DevOps engineers and operations engineers: if they are software developers, why write postmortems? These answers were studied by Adrian Hosky and John Allspaw, who concluded that the lack of accountability is related to a blameful culture. John Allspaw analyzed the Five Whys method, one of the most famous techniques for writing postmortems. According to his work, this technique is not a proper way to promote a blameless culture. Asking "why" leads to answering another question: the question becomes "who?", not "why?", which in almost every case is irrelevant. It is common to find postmortems that conclude with root causes like this, in which a human is tagged and blamed for the incident.
Now we know that there is a culture of blaming. How can we change it? How can we move from these sentences to a culture where failure is promoted? Our proposal is chaos engineering.
Chaos engineering is the discipline of practicing or injecting failures in production in order to reveal the weaknesses in systems. This definition is available on the Principles of Chaos website. And what about a chaos game day? Chaos game days are an easy way to introduce engineering teams to this practice. In this exercise we have three roles: a master of disaster, a first on-call person, and the engineering team. The master of disaster decides, often in secret, what sort of failure the system should undergo. With the team gathered in one room, physical or virtual, the master of disaster declares the start of the incident and starts the attack. One member of the team, the first on-call person, tries to see, triage, and mitigate whatever failure the master of disaster has caused. After that, the idea is that the team analyzes and understands the failure and provides a solution for it. The event finishes when the team writes a postmortem. That is a chaos game day.
So what does it mean in practice? What are the activities involved? Planning a game day involves a lot of work. First, we have to create an agenda and define the events involved in it. After that, we have to define the users: who is the master of disaster, who is the first on-call person, and who is part of the team. In a third activity, the idea is to send communications in order to keep in contact with the people involved in the event. After that, we have to design the experiment, because chaos engineering follows a scientific method. Then we have to provision hardware, software, and chaos agents, for example Chaos Toolkit or Gremlin. And finally, we have to provide an observability tool, because we need to have a view of what is happening during the event.
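The three roles and the planning checklist described above can be sketched as a small Go data model. This is only an illustration: the type names, the `Validate` method, and the rule of exactly one master of disaster and one first on-call person are assumptions made for the example, not something taken from the talk or from the tool itself.

```go
package main

import (
	"errors"
	"fmt"
)

// Role is one of the three game day roles named in the talk.
type Role string

const (
	MasterOfDisaster Role = "master of disaster"
	FirstOnCall      Role = "first on-call person"
	TeamMember       Role = "team member"
)

// Participant and GameDay are hypothetical types for this sketch.
type Participant struct {
	Name string
	Role Role
}

type GameDay struct {
	Agenda       []string
	Participants []Participant
}

// Validate checks the role assignment before the event is scheduled.
// The exact cardinalities are an assumption of this example.
func (g GameDay) Validate() error {
	counts := map[Role]int{}
	for _, p := range g.Participants {
		counts[p.Role]++
	}
	if counts[MasterOfDisaster] != 1 {
		return errors.New("exactly one master of disaster is required")
	}
	if counts[FirstOnCall] != 1 {
		return errors.New("exactly one first on-call person is required")
	}
	if counts[TeamMember] == 0 {
		return errors.New("at least one team member is required")
	}
	return nil
}

func main() {
	day := GameDay{
		Agenda: []string{"kickoff", "attack", "postmortem"},
		Participants: []Participant{
			{Name: "Ana", Role: MasterOfDisaster},
			{Name: "Luis", Role: FirstOnCall},
			{Name: "Sofia", Role: TeamMember},
		},
	}
	if err := day.Validate(); err != nil {
		fmt.Println("invalid game day:", err)
		return
	}
	fmt.Println("game day is ready to schedule")
}
```

Checking the roles up front matches the order of activities in the talk, where users are defined before communications are sent.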
Although Gremlin and ChaosIQ have done great work producing contributions, tutorials, documents, and experience reports for running chaos game days, the reality is that planning a chaos game day is an activity that takes a long time. On average, a person spends 90 days planning a game day. So probably we don't even have time to have lunch. In order to reduce this time, at Digital Labs, my company, we designed a tool and we are working on its implementation.
This is a view of Gaweta. Gaweta means "toolbox" in English. This view includes four layers, with the activities separated among them. In the first layer we have the roles and the users involved in the event: the master of disaster, the first on-call person, and the team, who access Gaweta using a web browser or a mobile device. Why a mobile device? Because chaos game days are exercises in which people are relaxed, with food and music, so some activities are probably easier on a mobile. These people access the core of Gaweta. In the second layer, Gaweta is implemented in Go, and it uses Terraform to provision the infrastructure required for the event.
In the third layer, we have five suppliers for managing the chaos game day. If you remember, we have activities for planning these events. So first we have a planner, a tool to plan the event. The idea is to be able to use different tools here. In this case I am using Google Calendar, but if you want to use another tool, that is possible. In the second supplier we have tools to send communications. For example, you can use Slack or push notifications for mobile devices. In the third supplier we have Terraform, which is responsible for provisioning the infrastructure required for the event. In this case we probably need a cloud provider, for example AWS or GCP.
We need a chaos agent. In this case we are using Gremlin or Chaos Toolkit. And finally we need an observability tool. In this case we are using Datadog, but you can use New Relic or any other. Finally, we need two tools to write a postmortem and document the actions for the next steps, because the idea is to provide, in this case with Jira, tickets or user stories in order to avoid a similar incident in the future. And finally, in the last layer, we have the system.
You probably remember this kind of architecture. Gaweta is currently using this architecture to provide the features mentioned on the last slide. In this case we are using a hexagonal architecture. Hexagonal architectures use three concepts: the core domain in the center of the hexagon, and adapters and ports that connect the external world with the internal core. This is a view of Gaweta using this architecture. In the core we have handlers to manage the activities involved in the chaos game day. We use one handler to plan the event, and another handler to manage the communicator, which sends communications and reminders to the participants. There is a handler to manage Terraform in order to provision the infrastructure required. And finally, we use two handlers to document the postmortem and register the next actions in the backlog. The responsibility of Gaweta is to orchestrate these handlers. We use another concept, the adapters, to make it possible to implement these interfaces using different technologies. In this case, for example, we are using Google Suite for planning the event. But if you want to use, I don't know, Outlook or any other tool, that is possible using this concept of adapters and interfaces, with a core that has the responsibility to orchestrate them. So let me show a video with the mockups for Gaweta.
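The ports-and-adapters idea described above can be sketched in Go, the language the talk says the tool is written in. Everything named here is hypothetical: the `Planner` and `Communicator` interfaces, the in-memory adapters, and the `Core` type are assumptions made for illustration and are not the tool's actual code. Real adapters would call a calendar or chat service instead of appending to a slice.

```go
package main

import (
	"fmt"
	"strings"
)

// Ports: interfaces that the core depends on (hypothetical names).
type Planner interface {
	Schedule(agenda []string) error
}

type Communicator interface {
	Notify(recipient, message string) error
}

// Adapters: in-memory stand-ins that only record what they were asked to do.
type logPlanner struct{ log *[]string }

func (p logPlanner) Schedule(agenda []string) error {
	*p.log = append(*p.log, "planner: scheduled "+strings.Join(agenda, ", "))
	return nil
}

type logCommunicator struct{ log *[]string }

func (c logCommunicator) Notify(recipient, message string) error {
	*c.log = append(*c.log, "communicator: "+recipient+" <- "+message)
	return nil
}

// Core orchestrates the ports and knows nothing about concrete tools.
type Core struct {
	planner      Planner
	communicator Communicator
}

// PlanGameDay schedules the agenda, then reminds every participant.
func (c Core) PlanGameDay(agenda, participants []string) error {
	if err := c.planner.Schedule(agenda); err != nil {
		return fmt.Errorf("planning: %w", err)
	}
	for _, p := range participants {
		if err := c.communicator.Notify(p, "chaos game day scheduled"); err != nil {
			return fmt.Errorf("notifying %s: %w", p, err)
		}
	}
	return nil
}

func main() {
	var log []string
	core := Core{planner: logPlanner{&log}, communicator: logCommunicator{&log}}
	agenda := []string{"kickoff", "attack", "postmortem"}
	if err := core.PlanGameDay(agenda, []string{"ana", "luis"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, line := range log {
		fmt.Println(line)
	}
}
```

Swapping one calendar or messaging tool for another then means writing a new adapter that satisfies the same interface, while the core stays unchanged.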
I think it's possible to see the... yeah. In this case I am using the master of disaster credentials to access Gaweta. Gaweta is currently a web application that provides its functionality as a service. The first thing you probably want to do in Gaweta is define the adapters. In this case, for example, we are using Google Suite for planning the event and Slack for sending communications. But as I mentioned, it is possible to use another tool if you provide an implementation for the interface. We are using Terraform and the concept of infrastructure as code to provision the tools required to run the event. And finally, we are using Confluence and Jira for documenting and providing the next steps after the chaos game day finishes.
Now we have a view of the home page for a master of disaster who is planning the event. In the top bar of Gaweta we have four options to manage the event. The first option is to plan the event. What does that mean? We have to define the agenda. The text in this text field will probably be transformed by Gaweta into activities in Google Calendar, for example. In a second part we can define who is the master of disaster, who are the users, and who is the first on-call person. We have to define the users beforehand because we have to send communications and reminders to the participants before the chaos game day starts.
In the second option we can define and provision the infrastructure required to run the chaos game day. In this case we are uploading a Terraform file, because Terraform lets us define, using its concept of providers, which cloud providers we use. If we are using AWS, for example, we have to write a .tf file for Terraform with this definition.
In this case the file lets us declare, for example, that we are using AWS, Datadog, and Gremlin to attack the application. In the third option we can create or design the experiment. Remember that the intention of the chaos game day is to design the experiment. In this case I have registered the application name, the observability tool, and the hypothesis for the experiment. For example, I am trying to test whether the Hystrix configuration and the circuit breaker pattern work in this architecture. In the next field we can choose an attack, in this case lasting ten minutes. And finally we can define the blast radius. We can also add some notes. The idea is that, once the experiment is defined, we launch the attack during the event.
In the final option we can see the postmortem and the actions provided by your team during the event. So, yeah, that is so slow. Did you click on that? Probably, yeah. That is a view of a postmortem provided by Gaweta. It is a postmortem prefilled by Gaweta, and it is the responsibility of the team to complete it. But it is an example of what Gaweta could do for the team: provide a postmortem template, inspired in this case by the ChaosIQ template for postmortems.
So, to close, I would like to share the conclusions of Sidney Dekker, the author of this classic book, a really good book: The Field Guide to Understanding Human Error. Sidney Dekker tells us to see the world with a different view, in which humans are not the cause of the problems. If something goes wrong and a human is involved, it is a symptom of a deeper trouble, not the cause of it. So that is all. Thanks for coming. Thanks for listening. These are my social networks if you want to contact me. Thank you, and welcome to Colombia. Welcome to Garagoa. Thank you. Thank you very much.

Yury Nino

DevOps Engineer @ Aval Digital Labs



