Conf42 Incident Management 2022 - Online

Using incidents to level-up your teams

Abstract

Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams together to solve unexpected and challenging problems.

The first part of the talk will walk through the different things you can learn from incidents, including:

  • Taking you to the edges of the systems your team owns, and beyond: incidents help broaden your understanding of the context in which you’re building
  • Showing you how systems fail, so you can learn to identify and build software with good observability and consideration of failure modes
  • Expanding your network inside your organisation, making connections with different people who you can learn from and collaborate with

We’ll then talk about how to get the best value from the incidents which you do have as an individual, thinking about when is an appropriate time to ask questions, and how to get your own learnings without ‘getting in the way’.

Finally, we’ll discuss how to make this part of the culture of an organisation: as part of the leadership team, what can you do to encourage this across your teams?

Summary

  • Lisa Karlin Curtis joined incident.io as employee number two last year. The company builds incident management tooling for your whole organization, so incidents and incident response are very close to her heart. Here she talks about how she accelerated her career by running towards the fire.
  • The big changes in her understanding and her ability to solve larger, more complex problems came as a result of incidents. Incidents teach you to build systems that fail gracefully: a degraded service is far better than a completely absent one. They also provide a great opportunity to meet people and forge strong relationships.
  • Use public Slack channels wherever possible. By writing everything down, you're enabling everybody else to learn from your experience. Watch out for anybody playing the hero. If you think you get a lot of recognition for resolving incidents, imagine how much you can get if you level up your entire team so that they can do the same.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Let's start with a story. One of my first coding jobs was at a company called GoCardless. I'd been there for a few months when we had a major incident. Our API had slowed to a crawl. I was pretty curious, so I jumped into the incident channel. One endpoint in particular was consistently timing out, so we disabled it to get the system back up and running again. And it worked. Step one complete.
Now we had to understand what had actually happened. There weren't any recent changes that looked suspicious, so our attention shifted to the database. It turned out the query plan for this particular query had changed from something that was expensive but manageable to something that was not at all manageable. We made a subtle change to the query, which got the database to revert to the good old query plan, and everything was back up and running. We'd fixed it. Well, I say we. I watched quietly from the sidelines, furiously taking notes. After the incident was over, I turned to a senior engineer in my team: "Daniel, what is a query plan?" We'll come back to that in a second.
First of all, hi, I'm Lisa Karlin Curtis. Last year I joined incident.io as employee number two. We build incident management tooling for your whole organization. And so of course, incidents and incident response are very close to my heart. And really this is a talk about how I've accelerated my career by running towards the fire.
When I joined GoCardless, I was pretty junior and I progressed quite fast. I made senior a lot faster than I'd expected. I started reflecting on how that happened. Of course, like anything, it was a number of factors, but one pattern really stood out to me. The big changes in my understanding and my ability to solve larger, more complex problems came as a result of incidents that I'd been involved in. I was introduced to new technologies, learned new skills, and met people who became some of my closest friends. And every time I'd come out as a better engineer.
And this is why I love incidents.
Incidents broaden your horizons. As engineers, we inhabit a world full of black boxes, whether that's a programming language, a framework, or a database. We learn how to use the interface to get it to do whatever it is we need, and we move on. If we tried to understand how everything worked down to the individual chips on each machine, we'd never get to ship, well, anything. Incidents force you to open the black boxes around you, peek inside and learn just enough to solve the problem. After the incident, I read up on query plans and this proved very useful. It was not our last query-plan-related incident; we did have an enormous Postgres instance, after all. It was also useful for building new things. I was suddenly able to write code that scaled well the first time, rather than relying on, frankly, trial and error in production. Incidents give you great signal about which of these black boxes are worth opening, and a real-world example that you can use as a starting point.
Incidents teach you to build systems that fail gracefully. One of the key follow-ups from the API incident was to add statement timeouts on all of our database calls. This meant that if we issued a bad query, Postgres would try for a few seconds and then give up. That might sound counterintuitive. Deliberately cancelling queries? Someone's going to be sad. But if my options are to do that or to lose the whole API, of course I'd choose to degrade just a handful of queries. This is an excellent example of resilient engineering. Your system can now handle unexpected failures. We don't need to know what will issue a bad query, just that it's likely that something will. It's possible to read about these ideas in a book. There are plenty of great ones, but in my experience, nothing compares to seeing it in action.
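The statement-timeout follow-up described above could be sketched roughly as follows. This is an illustration and not the code GoCardless actually shipped. It assumes Python with the psycopg2 driver. The helper names (`timeout_options`, `connect_with_timeout`, `fetch_or_degrade`) and the 3000 ms default are hypothetical. `statement_timeout` itself is a real Postgres setting, passed here through the libpq `options` connection parameter.

```python
"""Sketch: give every Postgres connection a statement timeout.

Helper names and the default timeout are illustrative assumptions.
"""


def timeout_options(timeout_ms: int) -> str:
    """Build a libpq `options` value that sets statement_timeout (in ms)."""
    if timeout_ms <= 0:
        # In Postgres, statement_timeout = 0 disables the timeout entirely.
        raise ValueError("timeout_ms must be positive")
    return f"-c statement_timeout={timeout_ms}"


def connect_with_timeout(dsn: str, timeout_ms: int = 3000):
    """Open a connection whose statements are cancelled after timeout_ms.

    psycopg2 is imported lazily so timeout_options() stays usable
    without a database driver installed.
    """
    import psycopg2

    return psycopg2.connect(dsn, options=timeout_options(timeout_ms))


def fetch_or_degrade(conn, sql: str, params=()):
    """Run a query; return None if Postgres cancels it on timeout.

    The caller decides what a degraded response looks like, for
    example an error for this one endpoint while the rest keep working.
    """
    import psycopg2

    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    except psycopg2.errors.QueryCanceled:
        # Clear the aborted transaction so the connection can be reused.
        conn.rollback()
        return None
```

Setting the timeout once at connection time means every query is covered without anyone having to predict which one will go wrong, which is the point made in the talk.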
During the incident, I learnt a whole set of tools that I could employ to reduce the blast radius of these kinds of failures: not just the statement timeouts which we implemented, but all the other options that the incident response team discussed and discarded.
Incidents are a chance for blue sky thinking. A doctor never wants to amputate somebody's arm. When they choose to, it's because the alternative is even worse. In an incident, nothing is off the table. When you're already in a bad place, sometimes you have to make one thing worse to mitigate the wider problem, and that's what we did during our API incident. We disabled an entire endpoint, which feels like a thing that you'd never do, but in context was absolutely the correct choice, and given it another time, I'd make the same choice again. Incidents give you a rare opportunity to think outside of your normal constraints. A degraded service is far better than a completely absent one.
Incidents teach you to make systems easier to debug. Observability isn't straightforward. If you needed proof, I've certainly shipped plenty of useless log lines and metrics in my time. To build genuinely observable systems, you need to have empathy for your future self or teammate who'll be debugging the issue, if they're unlucky, at 2:00 in the morning. And that empathy is, again, hard to learn in the abstract. The people I've met who do this really well are leaning on their experience of debugging issues. They're pattern matching on the things they've seen before, and that allows them to identify useful places for logs and metrics, and useful metadata to include. Incidents are a great shortcut to get this kind of experience and build a repository of patterns that you can recognize going forwards.
Incidents build your network. They provide a great opportunity to meet people outside your team and forge strong relationships along the way.
As psychologists have known for a while, there's something about going through a stressful situation with someone that forges a connection more quickly than normal. Kate was one of the account managers of a partner who was really badly affected by our API incident. She turned out to be a great person to know. She managed a number of our biggest partners, so she had unique insights into what they wanted and how we could serve them better. Before the incident, I'm embarrassed to say I didn't even know her name, and I was on the product team in charge of serving partners. Incidents are great for building relationships in the wider org. Most of the non-engineering folks I met at GoCardless, whether from finance, risk, support or marketing, I met during incidents, and those relationships proved really valuable. They gave me a mental map of the rest of the org and meant that I had a friendly face that I could talk to when I needed advice. As I became more senior, that network became even more important, as I was responsible for larger projects which had wider implications for the company.
Incidents are a chance to learn from the best. When things go wrong, when things go really, really wrong, people from all over the company get pulled in to help fix it. But they're not just any people. They're the people with the most context, the most experience, the most skill, that everybody trusts to fix the hardest problems. Getting to spend time with these folks is rare. They're likely to be some of the busiest people in the company. Incidents provide a unique opportunity to learn from them and see firsthand how they approach a challenging problem. For me, the API incident gave me opportunities to learn much faster than I otherwise would have. Who knows how long it might have been before I'd realized that I really did need to know what a query plan was? Probably until my own code broke in the same way. Incidents have unusually high information density compared with day-to-day work, and they enable you to piggyback on the experience of others.
At GoCardless, I was lucky. Their culture and processes meant that I could see incident channels and follow along, allowing me the opportunity to accelerate. But that's not always the case. Some teams run incidents in private channels by default, operating an invite-only policy. That means that junior members who want to observe rather than participate might not even be aware that they're happening. Sometimes people are excluded for other reasons. It's not culturally encouraged to get involved. There's an in-group who handle all the incidents, and everyone else should just get out of their way. Joining that in-group, even as a new senior, can become almost impossible. So let's look at what we can do to build a culture where everyone can learn from incidents by making them accessible.
First, declare lots of incidents. This is the single most impactful change you can make to your incident process. If you only declare incidents when things get really bad, you won't get a chance to practice your incident process. By lowering the bar for what counts as an incident, you make sure that when the really bad ones do come around, the response is a well-oiled machine. Everybody knows the tools, everybody knows the terminology, and everybody can act as best they are able to try and fix the severe issue. It also helps with learning. When problems are handled as incidents, it makes them more accessible to everyone around you. Now, maybe it goes without saying, but if you want to encourage that, the first step is to stop counting incidents. If you count your incidents and consider more incidents to be bad, that's a clear incentive against people declaring low-severity incidents.
Second, encourage everyone to participate. Incidents are great learning opportunities and they should be accessible to everybody.
Incident channels should be public by default and engagement encouraged for team members at all levels. Of course, there can be too much of a good thing. Having 20 people descend into a minor incident channel may not be the outcome that you're hoping for, but most incidents can comfortably accommodate two or three junior responders tagging along. This doesn't have to come at the cost of a good response. You can get this experience in a low-risk environment, either by asking questions of someone who's not actively responding to the incident, or by doing what I did and saving them up for after it's resolved. There are also lots of other ways to gather learnings. Reading debrief documents or attending post-incident reviews are both great ways of getting value from your team's incidents. You could even compile a list of the best incident debriefs to share with new joiners. They're a great way to get started in a new company.
Get into the habit of showing your working in an incident. You should put as much information as you can into the incident channel. What command did you run? What theory have you disproved? If you're debugging on your own, I admit this can feel a little bit strange. I've been sat at 10:00 p.m. in an incident channel having a frankly delightful conversation with myself. But it's worth it, I promise. It's useful for your response, as it means you don't have to rely on your memory to know exactly what you've already tried and when. And it makes handing over much easier if you need to go to a meeting and somebody else needs to take over. But it's also beneficial for the rest of the team. By writing everything down, you're enabling everybody else to learn from your experience and how you approach the problems. What are the things that you tried? Where did you look to find that bit of information? Just because it's obvious to you, it doesn't mean it's obvious to everybody.
That means we should be using public Slack channels wherever possible so that everyone can see, and having a central location where folks can go to find incidents that they might be interested in. I'm a bit biased here, but using an incident management platform such as incident.io really does help with this one.
And finally, watch out for anybody playing the hero. Often a single engineer takes on a lot of the incident response burden, fixing things before anybody even knows that they're broken. Maybe that used to be you, maybe it still is. This doesn't end well for the hero. They'll stop getting as much credit as they expect for fixing things as it becomes normalized, and they're at risk of burning out. But it also causes problems for the rest of the team. Without meaning to, the hero is taking away all of these learning opportunities from everyone else by fixing things quietly in the corner. And that means no one else is ever going to be able to do what they do as effectively, because no one's had any practice. While that's perhaps an effective job preservation tactic, it's not going to result in a high-performing team. If you think you get a lot of recognition for resolving incidents, imagine how much you can get if you can level up your entire team so that they can do the same.
So that's all we've got time for. Thanks so much for listening. If you're interested in incidents in general, we've got a Slack community at incident.io/community and I'd love to chat to you there or on Twitter. You can find me at Henge and I'll also be on the Conf42 Discord server. Enjoy the rest of the conference.

Lisa Karlin Curtis

Technical Lead @ incident.io
