Conf42 Cloud Native 2023 - Online

The Legend of the Arsonist Firefighter

Abstract

I will share my vast experience, from over a decade of fighting production incidents/fires, to provide several ways to prevent fires/production incidents. Yes, in this talk I will use “production incidents” and “fires” interchangeably.

Summary

  • Einat Mahat is an R&D group leader at Skai who has been fighting production incidents for over a decade. Monitoring is a crucial component of incident prevention. Alerts, which push issues to the team instead of waiting for someone to pull them, can be a powerful tool for incident prevention and response. Here are some ways developers can help prevent the next fire.
  • Have an incident communication plan. Keep all stakeholders informed about the status: technical people, management, customers, and others. Use clear and concise language. Don't be an arsonist firefighter, be a true hero.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
I love TV series about firefighters. The firefighters' devotion and leadership skills inspire me. I know it's totally different, but I see DevOps teams as firefighters, maybe because they are the first responders in production incidents, and it seems they are the heroes that save the day. Then come the developers to aid.
Hi, I'm Einat Mahat, an R&D group leader at Skai. I have been fighting production incidents for over a decade. Since I was a junior developer, I was drawn to production-intensive dev teams. Later on, while leading such teams, we had a lot of action and I felt like a firefighters' lieutenant, metaphorically speaking, of course. And then it hit me: yes, we are great, we are true heroes, but some of the fires we actually set ourselves. This reminded me of the arsonist firefighter. I saw it in some episode and yes, it's a real thing. In a nutshell, it's a firefighter who sets fires only to become a hero.
I'm not suggesting that any of you set fires on purpose or that you do it to become heroes, but I think that as developers, there are several things we can and should do to prevent the next fire rather than set it and put it out. Fighting fires for a living can lead to burnout and sometimes even cause mini trauma. In a former group of mine, where we had major incidents too often, one of the team members still gets anxious every time he hears his phone ring. Let's talk about a few ways that might help us prevent the next fire.
Monitoring is a crucial component of incident prevention. The purpose of monitoring is to provide visibility into the performance, availability, and general health of systems. Monitoring is an art: collecting and analyzing data from various sources and crafting the right dashboards that hold exactly the right data, enough to provide visibility but not so much that it overwhelms us. Investing in monitoring helps with early detection of issues before they become incidents, enabling us to proactively avoid issues and prevent incidents.
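As a rough illustration of the early-detection idea above, here is a minimal Python sketch that classifies a metric reading against a warning level and a critical level, so that an issue surfaces while it is still only a warning. The `Threshold` class, the `classify` function, and the disk-usage numbers are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical warning/critical levels for a metric where higher is worse."""
    warning: float
    critical: float


def classify(value: float, threshold: Threshold) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


# Assumed example: disk usage in percent, sampled three times.
disk_usage = Threshold(warning=75.0, critical=90.0)
readings = [60.0, 78.5, 93.0]
statuses = [classify(reading, disk_usage) for reading in readings]
print(statuses)
```

The warning level is the part that supports prevention: it gives the team a chance to act on the second reading, before the third one turns into an incident.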
Good monitoring can also help us during an incident: the teams can use the dashboards to quickly identify issues, resolve them, and minimize their impact.
Monitoring is great, but the issues should be brought to us via alerts: push, not pull. Alerts can be a powerful tool for incident prevention and response. When alerts are configured correctly to trigger when certain conditions are met, they provide teams with early warning of potential issues and enable them to take action before those issues become serious problems. Alerts typically generate notifications, which may be delivered via email, SMS, or a collaboration tool like Slack. Depending on alert severity, some alerts will be sent to the responsible teams, while others will be sent to the on-call engineer. To ensure that the person receiving the alert can take action even in the middle of the night, the alert should provide sufficient information. When it comes to alerts, same as in monitoring, the trick is to find the balance between too much and too little. We want the alerts to detect the issues but not cause alert fatigue. Furthermore, having too many false alerts can make it harder to take genuine alerts seriously.
A postmortem, also known as lessons learned or a debrief, is a common practice. We discuss what happened in an incident or event, good or bad, and learn from it. The goal of a postmortem is to identify the root cause of the incident, understand what went wrong, and determine what can be done to prevent similar incidents in the future. The culture around it should be one of facts and learnings rather than finger-pointing. The postmortem process typically includes three steps. The first is gathering data about the incident, including the timeline of events, the impact on users and business operations, and any other relevant data.
The second step is to identify the root cause of the incident: what went wrong, and why. The third is creating action items based on the findings and following up on them. By conducting postmortems on a regular basis and implementing the lessons learned from each incident, teams can reduce the frequency and severity of incidents. It's important to know that production incidents can still happen even if you're not on call, so it might be beneficial to read most debriefs and learn from the experience of others.
Most of us know what a postmortem is, but there is also a premortem. It fits well when performing critical changes in production, but not only then. Before we go to production, we gather the items at risk and talk about how we can prevent them from catching fire. The technique is to imagine that a project has already failed and then identify the reasons for the failure. This approach helps teams anticipate potential issues and develop mitigation strategies before they become actual problems, and so prevent incidents. The process typically involves three steps with the team and stakeholders. One, imagine failure. We are in production and there is a failure. What can it be? Brainstorm potential issues and consider worst-case scenarios. Two, identify risks. What are the major risks, their potential impact, and their priorities? Three, develop mitigation strategies. What could have been done to prevent this, or to mitigate the issues? By conducting a premortem, teams can be better prepared to address potential issues and reduce the likelihood of incidents or downtime. It can also help teams build confidence in their changes and improve the overall quality and health of their systems.
Rollout and rollback plans are critical components of the development and deployment process, designed to ensure that changes are implemented in a controlled manner. The rollout plan usually includes the order in which components will be updated, the timing, and the required resources.
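To make the contents of a rollout plan more concrete, here is a minimal Python sketch that records the three elements just listed (update order, timing, and required resources) and prints the steps in the order they would run. The `RolloutStep` class, the `in_update_order` helper, and the component names and time windows are hypothetical examples, not details from the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RolloutStep:
    """One hypothetical entry in a rollout plan."""
    order: int                 # position in the update sequence
    component: str             # what gets updated
    window: str                # timing, kept as free text in this sketch
    resources: List[str] = field(default_factory=list)  # people or teams needed


def in_update_order(steps: List[RolloutStep]) -> List[RolloutStep]:
    """Return the steps sorted by order, rejecting ambiguous plans."""
    orders = [step.order for step in steps]
    if len(orders) != len(set(orders)):
        raise ValueError("each step needs a unique order")
    return sorted(steps, key=lambda step: step.order)


# Assumed example plan; all names and times are made up.
plan = [
    RolloutStep(2, "api-service", "Tue 10:00-11:00", ["backend on-call"]),
    RolloutStep(1, "database-migration", "Tue 09:00-10:00", ["DBA", "backend on-call"]),
    RolloutStep(3, "web-frontend", "Tue 11:00-11:30"),
]

for step in in_update_order(plan):
    needed = ", ".join(step.resources) or "none listed"
    print(step.order, step.component, step.window, needed)
```

Writing the plan down as data, even this simply, makes gaps visible: the third step above has no resources listed, which is worth questioning before the change goes out.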
A good rollout plan will also identify potential risks and issues that may arise, and consider mitigations. There is a debate about whether or not everything should go through a gradual rollout, but rollback plans are often skipped. I think that for critical changes, we should plan our exit strategy. That's a rollback plan. A rollback plan outlines the steps that will be taken to undo a deployment and revert to a previous version of the system. It should also include clear criteria for when to initiate a rollback. The purpose of these plans is to help ensure that changes are implemented in a way that minimizes the risk of downtime. By having clear and well-defined rollout and rollback plans in place, teams can be more confident in their ability to make changes while also being better prepared to respond effectively when issues occur.
Runbooks are a set of procedures we write for common things that might break, describing how to troubleshoot and tackle them. Runbooks are designed to help teams perform tasks consistently and efficiently, and can be used by both developers and operations teams. In general, runbooks contain step-by-step instructions for carrying out common tasks and may also include visual aids to help clarify complex procedures. The main benefit of runbooks is that they help ensure that tasks are performed consistently and accurately, even in high-pressure situations. By creating and maintaining runbooks, teams can reduce the risk of errors and increase the speed and efficiency of their operations. Solid runbooks are helpful during an incident, but we can also use the process of writing them to think about potential errors and prevent incidents.
I believe in continuous improvement in life, at work, and specifically when talking about systems and incidents.
Continuous improvement in this context includes regularly reviewing and refining our processes, tools, and systems to ensure that we keep improving our ability to prevent incidents and respond to them efficiently. This involves conducting regular retrospectives to identify areas for improvement, implementing a process of regularly reviewing and updating runbooks, monitoring and analyzing incident trends to identify patterns and root causes, and seeking feedback from users and stakeholders to identify areas where we can improve. By continuously improving our processes and systems, we can help ensure that we are always learning from incidents and working to prevent them from happening again in the future.
We talked about seven ways to prevent fires, and there are many, many more. Unfortunately, we will not be able to prevent all fires, so let's talk about the most important thing during an incident: communication. During an incident, there is a lot going on on the scene. There are many people who need to work together. I use these three guidelines to ensure clear and effective communication during incidents.
Have an incident communication plan. Keep all stakeholders informed about the status: technical people, management, customers, and others. Decide who will be responsible for communicating and what information will be shared.
Assign clear roles and responsibilities to the individuals involved in resolving the incident. This helps to ensure that everyone knows what is expected of them. Provide regular updates. This helps to keep everyone informed about the status.
Use clear and concise language. Avoid technical terms when talking to nontechnical stakeholders. It is important to be transparent about the incident. You must communicate. You have to update on your progress and make sure everyone is aligned.
Well, we did not put out any actual fires. We did not even save a kitten stuck in a tree.
But from my experience fighting production incidents, called fires in this session, I can definitely say we all can prevent the next fire instead of setting it ourselves. Don't be an arsonist firefighter, be a true hero. Thank you.

Einat Mahat

R&D Group Lead @ Skai
