What Can We Learn from Formula 1 Incident Management

Abstract

How can software and SRE teams learn about incident management from Formula 1? This talk will discuss the key takeaways from a real-life incident where Red Bull Racing performed a miraculous repair on Max Verstappen’s car in Hungary, turning a potential disaster into a podium finish.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello and welcome to Conf42 Site Reliability Engineering 2025. My name is Ricardo Castro, and today we're going to see what we can learn from Formula 1 incident management.

What are we going to cover today? I'll start by setting the stage. We'll see an actual F1 incident, the series of steps that a team, namely Red Bull, took to overcome it and end up with a car that could run and score points, and the learnings that we can take from that incident. At the end, we'll see why all of this is important.

So let's start by setting the stage. The year is 2020. After a series of years of Mercedes winning both the drivers' and the constructors' championships, 2020 was probably the first year when there was a feeling that some other team could challenge for the title. This is the third race. Red Bull is one of the contenders for both the drivers' and constructors' championships, and Max Verstappen is their star driver. During qualifying, Max didn't do a very good job, so he qualified seventh, and during the formation lap this happened.

So what happened here? The formation lap is when the drivers try to warm up the tires and get them to the optimal temperature. As you can see, the track is wet because it had been raining. In an effort to warm up the tires as much as he possibly could, Max braked too late and had a lockup. By locking up on a wet track, Max just slid off the track and crashed into the wall.

Let's see the first lesson that we can take from this today, starting with a small video. I don't know if you noticed, but there's a timer at the top left of the video. This means that at this point, Red Bull has under 23 minutes to do something so that the car can be driven by Max. So what's happening here? Red Bull needs to make a decision.
They either go into the pit lane or they try to fix the car on track. They need to make a fast decision, and they have very little information at this point: they have sensor information and the feedback that Max has been giving them. No mechanic has seen the car yet. So although they understand that they probably have a broken suspension, they don't really know what's in there. If they go into the pit lane, they have everything there: their garage, all their tools, all their parts. But when the race starts, Max starts last. If they do it on track, they lose four minutes and they have to take everything out there. So they need to make a fast decision with limited information.

They need to act decisively even if they don't have the full picture. Why is that? They want to minimize impact. They act fast because they want to stop small problems from becoming big outages. In our world, it's the same worry: we need to act fast before a small incident becomes a very big outage. Also, in our context, we need to protect our customers. We need to protect their assets and everything of theirs that we hold. And of course, from a company perspective, we want to preserve revenue. The quick actions that we take reduce downtime and reduce our financial losses. And of course, we want to maintain customer trust. Customers need to understand that we are in control and that we're doing our best to solve the incident.

As engineers, we need to become comfortable with this discomfort. And we also need to keep in mind that many of these decisions will oftentimes be suboptimal or even completely wrong. Using F1 as an example, no mechanic had even seen the car, and they decided to go to the track and fix it there.

Let's see what they did next. As we can see here, Red Bull is moving everything from their garage onto the track.
You hear everyone speaking very clearly and very calmly. People are asking for tools, people are asking for parts, people are asking who is doing what. It's all very clear, so there's no confusion.

Clear communication is all about sharing the right information with the right people at the right time. This leads to faster resolution of incidents: clear communication gets everyone on the same page. It reduces errors, because good communication prevents confusion and mistakes. It of course improves coordination, because everyone will be working together to fix the incident. And it increases transparency: open communication keeps everyone informed and builds trust. So the takeaways here: good communication helps everyone understand and resolve incidents faster. It keeps everyone in the loop and maintains trust during incidents, be it with stakeholders or clients.

Let's see what comes next. As you can see here, we have an engineer asking for a front wing, and immediately someone who's coordinating the incident raises the awareness that they have clear processes for everything. This means that they should follow a predetermined process to actually fix the car. Why is this important? Clear processes are all about following defined steps that help us be effective in incident resolution. By following these processes, we enable a faster response, with quick reactions to incidents. It reduces chaos: everyone knows what they should be doing and at what time, so this brings some order and prevents confusion during an incident. It of course enables consistent execution: these processes ensure that incidents are handled the same way every time. And it improves collaboration.
Everyone understands their part and what they should be doing, so it's clear that person A and person B should be cooperating. So we need to plan for incidents before they happen, right? We need to have these processes defined, and we need to test them out. We need to run simulations of how they are done. By doing that, these clear steps will make for a fast response.

Next we talk a little bit about teamwork. Let's see an example. As you can see here, this is a little bit of the process of taking stuff onto the track and communicating what should be done. If there were no good teamwork there, this wouldn't be possible. They are literally transferring a garage onto the track.

Collaboration and shared responsibility allow for a faster resolution. Diverse skill sets make for faster fixes: different people bring different strengths to incident response, and they need to cooperate and work together. Of course, teamwork makes the dream work: collaboration leads to smoother and faster resolution. Everyone plays a part, and they need to understand where in the process they should be leaning in. And of course, we all learn together: teams that learn from incidents together get better together. So teams that work together solve problems faster. There's no one person who fixes a whole incident; it's teamwork. And of course, teamwork builds a more resilient and more effective response to incidents.

Next we talk about calmness. Let's see an example. At this point, Red Bull has under seven minutes to get a car ready to be driven by Max. You hear communication, and you hear no one screaming. Everyone is keeping calm. You hear Jonathan Wheatley asking for updates, right? He's not screaming. He's very calm. He knows that his engineers are under enormous pressure.
They not only need to get a car fixed; they're delivering a car that's going to be driven at over 300 kilometers an hour for 50, 60, 70 laps. So it's not only about fixing the car and making it run: it needs to be driven at very high speeds under immense g-forces. They need to nail this, and Max will trust them.

Keeping calm is all about staying composed under pressure to be able to think clearly and make good decisions. A calm mind helps us make good decisions during the whole incident. Of course, calm teams work better: without the added pressure of someone screaming or asking for constant feedback, stress is reduced and everyone works better. We have to stay tough in these situations. We need to build our mental strength to be able to handle the stress of these incidents, so we should run simulations beforehand to build that toughness over time. And we don't need to rush, right? We need to take as much time as necessary to assess and plan before taking action. Like I said before, they not only need to assemble the car, but they need to keep in mind that someone will be driving that car at over 300 kilometers an hour for 60, 70, 80 laps. So a calm mind will make better decisions during incidents. We need to train our teams to handle these stressful situations by simulating them beforehand. In our world, for example, that means using game days or chaos engineering. That way we start to build that toughness and that control during incidents.

Next, we touch on technical proficiency. This is from earlier on, at the beginning of the incident. Red Bull is assessing whether they can do it or not. They are basically saying that they need to do it faster than they've ever done it before. This means very high technical proficiency.
They need strong skills to be able to efficiently diagnose and resolve the incident. In an incident, skills matter, right? Strong skills are essential for quick and effective incident resolution. In an incident, knowledge is power: the more a team knows about incidents and about their systems, the better equipped they are to prevent and handle those incidents. And like F1 drivers, our teams need to be sharp, hone their skills, and train them over time. So stronger technical skills mean a stronger incident response, and teams need the right tools to empower them to resolve incidents effectively.

Last but not least, we have postmortems. Let's see an example here. Formula 1 teams are famous for doing brutal postmortems. And when I say brutal, I mean that they go into great detail about what happened during the race. They're famous for doing them not only when things go wrong, but also when things go right. They really want to understand how things went. In this video, we saw an example from the Mercedes-AMG team. They are very public about their postmortems, to the extent of what they can divulge, of course, without compromising themselves and giving secrets to their competition.

Postmortems are all about learning from incidents to prevent future problems. After an incident is resolved, we want to find the root cause. We want to dig deep and understand why an incident happened, not just what happened. Also, and very importantly, we want to learn from mistakes. We want to use incidents as opportunities to improve processes and prevent that incident from happening again. It's very important that we have no-blame postmortems. We need to focus on learning and improvement, not on blaming someone. Postmortems help teams learn from incidents and get better over time.
They need to be focused on fixing problems, not finding fault.

So why is this important? At the end of the day, here's what we want. We want to prevent incidents from happening. When an incident happens, we want to mitigate and fix it as fast as possible. And of course, we don't want the same incidents to happen over and over again. Unfortunately, many organizations don't have many of these practices that we've seen in place. But as we could see here, some of the top engineering organizations in the world do this because it works. Regardless of whether you like F1 or not, we can all agree that F1 is one of the top engineering disciplines in the world, and these are the things that they do to actually fix an incident. And if I had to start somewhere, I would start with postmortems. Postmortems are all about learning, right? Learning from what happened. And I think that postmortems will help you develop all the others, because when you have an incident and do a postmortem, you learn, you understand, and over time you build the other capabilities because you are constantly learning.

So at the end, what happened? Let's take a moment here to understand it. Max crashed his car during the formation lap. In under 20 minutes, the mechanics transferred their garage onto the track and fixed the car, and that allowed Max to go from seventh to second. This is amazing, right? All of the skills and all of the items that you saw during the presentation allowed someone who had a broken car to have, in under 20 minutes, a car able to be driven, and to go from seventh to second. This is simply amazing.

Here are just a few links if you want to learn a little bit more about this incident in particular and see the whole video. The first one is the video. It's roughly eight minutes, but it's worth watching. The second link is about the Grand Prix that we just saw, which was in Hungary.
And the third one is the Mercedes-AMG channel, which has a lot of postmortems that you can watch to see how they handle them. And this is all from my part. I hope you enjoyed it, and I hope to see you again another day. Thank you very much, and follow me on my socials.

Conf42 Site Reliability Engineering (SRE) 2025 - Online

April 17 2025 - premiere 5PM GMT

Ricardo Castro

Principal Engineer, SRE @ FanDuel / Blip.pt