Conf42 Chaos Engineering 2021 - Online

Sleeping with one eye open: Experiences with production support

Abstract

I will talk about my experience as a software production support engineer. For over 7 years I have been supporting software to varying degrees and have developed some insights I feel would be helpful to others. I will cover the most vital aspects of software production support and include some of the more memorable stories, along with the lessons learned.

Put on the right hat: Assuming the role of a support engineer

Ask questions: Getting to know the problem

Have your tools ready: Being ready for analysis

Trust your team: Code reviews, quality assurance and best practice are your friends

Take downtime: Ensure you recover

Summary

  • Quintin Balsdon is an Android developer. Has been working in London since 2017. Has supported a number of different applications, from startups to long-running apps. Hopes to create an environment of understanding and communication.
  • Everybody should be involved with production support. If people aren't involved, they tend not to understand the nuances of the infrastructure. The more we get involved, the less we actually have to do it. It's very important that we see both the positive and negative sides of supporting a production application.
  • Learning from how other companies respond to problems can be really useful. When going into a production support call, it is so important that we lose our personal agendas. You want to ensure that perceptions of you and your reputation are always maintained.
  • There are three great reasons to ask questions: we need the answer, the act of asking itself is important, or someone taking you through how they came to their result can yield the best outcome. But don't use questions as a mechanism of intimidation.
  • Be careful with reports: know how to read between the lines. Having just one report is not good enough. For managing all of this, have one communication tool. Communication is your biggest asset, but it can also be your biggest detractor.
  • The temptation is to bypass code review and bypass testing in the name of an immediate fix. When you take a risk like that, you're taking it on behalf of the entire organization. There are times when you just need to leave the problem there.
  • We need to take downtime to look after ourselves. You want to create very distinct boundaries around what you're prepared to do. Those boundaries need to be effectively communicated, especially when it comes to taking time off.
  • One of the things I would really encourage is doing wash-ups. It allows you to communicate that you understood the problem. It's also important to check your merging and release strategy. Having release notes can be very telling.
  • Having the buddy system on your support team is the best decision you can make. Make sure people have access to the numbers they need to reach the right teams. Schedule management is not just about having a calendar in place, it's about connecting, collaborating and communicating.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi there, and welcome to Sleeping with one eye open: experiences in production support. My name is Quintin Balsdon. I'm an Android developer. I've been working in London since 2017, and I've been developing applications since about 2010. I've supported quite a lot of different applications, from startups to long-running apps. They've been mobile, desktop, and back end; for seven years I even supported an Excel macro that I wrote. I think it's really important that, as people involved with software, we understand that we are part of an entire ecosystem, and it's something that's not really explained to people when they join: everyone is a part of software delivery and the entire process that's contained within it. After a particularly intense three-year project where production support was a major aspect, my company asked me to just write down the things that I'd learned, and that's how this presentation actually came about. It's my hope to create an environment of understanding and of communication, so that we can all learn from each other, grow, and become better at doing this. So the first question we need to ask is: how does software support fit in? In terms of the software development lifecycle, it is that last element, maintenance, that we're really focusing in on. I think that everybody should be involved with production support, and that doesn't mean that you are doing overnight support and getting called out at 3 a.m. Not everybody is at a point in their lives where they are even capable of doing that. But I do think that everyone needs to be aware of what's going on. I have found that those who have invested the time in doing production support have a far more holistic view of the domain in which they operate. They understand the nuances of all the infrastructure and technologies that are communicating, and it makes them better developers and better people who seek to understand how things work. I think it's imperative that we consider it critical, because if people aren't involved, they tend not to understand the nuances of the infrastructure and architecture that they're working with, and they tend to be less capable of spotting potential problems early on. The more we get involved with production support, the less we actually have to do it, because we are learning the way that our particular infrastructure works and become capable of dealing with it. There were times when we had one particular component where, when it failed, we knew where to look. Once we started noticing that pattern, we were far more capable of saying: we need to do some work in that area, we need to go and figure out what's going on, why is that component failing, and how do we make it better so that we don't get called out all the time? Supporting a production application can sound quite scary. No one wants to get called out or feel massively responsible, and there are a lot of ghosts in the shell that we might not want to have to experience. It's very important that we see both the positive and negative sides of supporting a production application. One of the best things that I've found is that you build your team in such a phenomenal way when you get called out together. There is a big sense of camaraderie in walking off the battlefield tired and broken, knowing that you've done your best to support your customers. A few years ago, I wrote a personal app as a joke.
It really was not intended to be popular, but someone created a Reddit page for it, the popularity skyrocketed, and I ended up with 30,000 people using my app at one time. That was particularly scary for me because I had no mechanism for supporting a user base on that scale. I realized that no matter what I put into the wild, it could get used by a lot of people, and having that knowledge is really important. One thing we can do is look at the example of others in the news. I've done a lot of learning from just watching how other companies respond and react to problems, so I'd like to introduce a few use cases. The stories I'm going to mention here are all very recent; the oldest is about two years old. I'd like you to keep them in mind as we go through, because they were reported publicly and you may have experienced them personally, but learning from how other companies respond, whether good or bad, can be really, really useful. One of the biggest ones that stood out to me: from 20 April to 20 May 2018, TSB had a problem where 1.9 million customers couldn't access their accounts. They had no access to their bank accounts for a month as a result of a rollover to a new system. They were migrating their servers from one place to another, and as a result of not being willing to roll back, they denied their customers access. In December 2018, there was a third-party certificate renewal failure for O2, a cellular provider in the UK, and that resulted in 30 million customers having no access to the mobile network for a significant period of time, the better part of a working day. As if 2020 didn't have enough problems, in July we had Virgin Media, with 10,000 customer complaints recorded on Downdetector, their second outage in two weeks. In the same month, the Facebook SDK had a problem, and that caused Spotify, Pinterest, Tinder, and a lot of other apps to fail. In August, Spotify's transport layer security certificate wasn't up to date. Security certificates are a big problem, and they're one of the things you should keep your finger on the pulse of. And one of the biggest ones in 2020, which you might have experienced: for an hour on December 14, Google's single sign-on went down, and people had no access to YouTube, Gmail, and other Google-based services. That was really telling. While it was going on, it was really interesting to see how they responded, how they were trying to mitigate the problem, and what the public's access to this information was. One of the biggest learnings I got from that, and that we had as well, is that you don't necessarily want to blame a particular service. YouTube is down, Gmail is down; when you start seeing a whole bunch of services not working, maybe it's the sign-on or some kind of authentication layer. We'll get to that later. And then most recently, Signal, which is a messenger app, suddenly gained popularity because of WhatsApp's privacy policy changing, and because they were endorsed by Elon Musk. The influx of new users created a problem for them at scale that they struggled with for quite a few days, and they were really good at telling people what the problems were. Monzo has also been quite good at getting back to customers and saying: we're sorry, we're down, we're working on the problem, please have patience.
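As an aside on those certificate stories: none of these companies have published their internal tooling, but here is a minimal sketch of the kind of scheduled check that could feed an expiry reminder, using only standard JDK TLS classes. The host name and the 30-day threshold are hypothetical, purely for illustration.
```kotlin
import java.security.cert.X509Certificate
import java.util.Date
import java.util.concurrent.TimeUnit
import javax.net.ssl.SSLSocket
import javax.net.ssl.SSLSocketFactory

// Opens a TLS connection and reports how many days remain on the server's leaf certificate.
// A scheduled job could run this daily and alert when the number drops below a threshold.
fun daysUntilCertificateExpiry(host: String, port: Int = 443): Long {
    val socket = SSLSocketFactory.getDefault().createSocket(host, port) as SSLSocket
    return socket.use {
        it.startHandshake() // forces the certificate exchange
        val leaf = it.session.peerCertificates.first() as X509Certificate
        TimeUnit.MILLISECONDS.toDays(leaf.notAfter.time - Date().time)
    }
}

fun main() {
    val daysLeft = daysUntilCertificateExpiry("example.com") // hypothetical host
    if (daysLeft < 30) {
        println("WARNING: certificate expires in $daysLeft days, renew it now")
    } else {
        println("Certificate OK: $daysLeft days remaining")
    }
}
```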
So there's no doubt that even the biggest of giants are capable of falling and slipping up, and how we manage ourselves as the developers of software and as the representatives of these companies can make a huge difference. So you get called out. It's 3 a.m., you've just about opened your eyes, and what are you going to do? What is my advice to you? I would say the first thing to do is to ensure that you have the right goal in mind. I call that putting on the right hat. In my day-to-day job, I'm an Android developer. When I go in, I have certain tools in mind: I want to make the app better, I want to improve the infrastructure, I want to do certain things, and I have my own set of tools that I want to use. When going into a production support call, it is so important that we lose those agendas. The goal is to diagnose the problem without necessarily laying blame on any individual. We want to delegate, in the sense of making sure that we've communicated with the right groups of people, and we want to make sure that the decision we make is the best one we can, given where we are at that point in time. It's so relevant to point out that production support is mostly a communicative and collaborative effort. What happens during a call-out will affect others' perceptions of you, personally, professionally, and externally as a company, and behavior is so important to your reputation. You want to ensure that those elements, the perceptions and your reputation, are always maintained. When software fails, it doesn't matter who's to blame; the fact is, there is a problem. Blaming will only get you so far; understanding will get you so much further. So when it fails, we want to make sure that we discover what's broken and take the time to fix it properly. It was really telling during that O2 outage that they were so quick to point out that their third-party provider didn't renew a certificate. It's really unfortunate, because they laid the blame quite hard on Ericsson, and it's an understandable error: you set a certificate to expire in ten years, and you don't write it down or have a system to create a reminder. One of the things that happened to me: we had an issue where a hardware security module had failed and a new component had to be driven out to the data center. While this was going on, we had to put all our load onto one particular server, so our load balancers weren't really operational, and we had to keep a tight finger on the pulse of our system. We had to baby it and look after it. Thankfully, there were two of us on call at that time. What ended up happening was we had one person on monitoring and one person on communication, and deciding on those roles early was so important, because we were the people who were involved on the front end, and when apps fail, the front end gets blamed. Having someone who was monitoring the front end and ensuring things were still working, doing diagnosis, checking everything, and not having to worry about the communication element was really, really important. We decided early on who was doing what, and it was really effective, because when all those requests were coming in, we had so many requests: from WhatsApp, from email, from an internal communication tool, from Slack, from Teams.
It just seemed to be coming from everywhere and piling up on us. But one person focusing on communication meant that that was their job, and one person focusing on diagnosis and fixing meant that that was theirs. Defining roles early and quickly was one of the best things we could have done. In another incident, the sales team called me from an international client and I had to run out and source a Bluetooth Low Energy printer for a system we were developing. Right at that point in time, my goal was not only to get something out that could work and do the job, but to actually go and find a supplier; that was a really interesting one, having to go and source hardware. The real point is to define your role and know, in that situation, what you're going to do. Sometimes the best way to do that is to ask questions. I think questions are one of the most effective ways of guiding a conversation and taking control of a situation, when we know how to wield them properly. We can use questions so nicely when trying to understand, rather than trying to point at a particular system. And again, this goes back to agendas. Say I truly believe that one particular component needs to be rewritten. I get a call-out, I go in and I say: it's that component, it's doing it again. You just want to take your agenda and drive it home, whereas you might actually be wrong. And when you're wrong and you're making a hard statement like "this component is the problem", you are developing a reputation as someone who can't be relied on in a crisis. By asking questions instead, we can get to a point where we're learning and suggesting without developing a negative reputation around ourselves. One of the best questions I used to ask was: could it be this component? And then someone would come along and say, no, it can't be that one, because we see this issue over here. So not only do I learn that my stuff isn't working, but that someone else's stuff isn't working either, and now we know to look higher up in the chain, or at other components that might be related, and see how we tease out this web of problems by using effective questions. Which also leads to the point: don't cry wolf unless you're absolutely sure, because it's going to distract your team. So often QAs and testers will tell me: to reproduce a problem, you need to click here and then push back, and then you'll see the issue. I've found that I really need to ask a lot of questions around just that kind of action, clicking on a button and then pushing back. When I navigate to the next screen, do I wait for it to finish loading before I push back, or do I push back while it's loading? A lot of the time QAs, testers, or even developers will assume that you know what they're talking about, but I'm not necessarily seeing what they're seeing. Questions are such a great way to guide that conversation, because even in software there's such a big disparity between what people mean when they use certain terms; terms carry different meanings for different people. Some people say crash when they might mean an error, or they might say frozen when they mean a crash, or they don't understand what lag really is. So define these terms and ask questions around them: what do you mean? What are you actually seeing? Could you show me?
With these kinds of questions, an inquisitive nature, and an interest in the problem rather than in particular people or systems, we can start teasing out exactly what's going on. One of the best questions to ask is: where did this problem come from? Who's reporting it? Is it coming from one user who called, or one user who tweeted? Is it coming from our systems themselves, the diagnostic tools that we've put in place? I find that questions get asked for three great reasons. One is because the answer is important: we need the answer. Sometimes we ask questions because the asking is important, and people might, in explaining something, realize there's a component they need to elaborate on. And sometimes the process of answering the question is important: someone taking you through how they came to the result they've determined can yield the best outcome. I think we also need to be careful that we don't use questions as a mechanism of intimidation, and that we're careful in how we construct them and how we communicate, because in these times of stress it is so important to make sure that everyone's our friend. When we need to get information out of people, we want to make sure that we're getting the best possible results. A few questions that I've learned to ask: What actually caused the problem? How are we seeing it? What are users seeing, and how is a user particularly affected? How do we know that something is wrong? Is it our system? How will we know when this is fixed, and what is the mechanism by which we can rightfully say that the problem is actually fixed? Can we determine how many unique users are affected right now? Sometimes a problem exists but it's not affecting anybody: if someone can't do a particular action at 3 a.m., is it really worth getting six engineers and 20 managers up at that point in time? What is the best call, or the best reaction, to the problem? How long has our system been down, if it's down, and how do we know what to do to fix the issue? And then reflection: how would we do this differently afterwards? So what I would say is: when you get called out, having tools available to you, other than just the ability to ask questions, can be critical. Before we get called out, we want to know that the different mechanisms by which we analyze and look at our system are ready. When identifying our source, we want to look at: was it social media? Was it call center complaints? Some customer experiences might differ by device: is it just Android, is it just iOS? Quite often, when a user tells me there's a problem with Android, it doesn't work, my first question is: did you try it on iPhone? Because if they haven't tried it on a different system with a completely different code base, we can't be sure whether it's the back end or the front end that's failing. That's one of the easiest ways to distinguish that there's a problem on a particular platform. Did you try it on web? Do we use the same back end for web? That kind of thing. Sometimes we can look at our historic baselines, so when we compare this month to last month we can see: oh, this is just an anomaly, or: oh, this occurs every time payday hits. Christmas and New Year's quite often result in spikes, because people are bored or something like that.
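To make the baseline idea concrete, here is a minimal sketch, purely my own illustration with a hypothetical metric, numbers and threshold, of comparing a current figure against a historic baseline before deciding whether it is worth waking anyone up.
```kotlin
// Compares a current metric against a historic baseline and flags it only when the deviation
// is large enough to be worth a call-out. The metric, numbers and threshold are hypothetical.
data class BaselineCheck(val metric: String, val baseline: Double, val current: Double) {
    // Fractional change relative to the baseline, e.g. 0.5 means 50% above normal.
    fun deviation(): Double = (current - baseline) / baseline
}

fun worthWakingSomeoneUp(check: BaselineCheck, threshold: Double = 0.5): Boolean =
    check.deviation() > threshold

fun main() {
    val paydaySpike = BaselineCheck("login errors per hour", baseline = 120.0, current = 150.0)
    val realIncident = BaselineCheck("login errors per hour", baseline = 120.0, current = 900.0)

    println(worthWakingSomeoneUp(paydaySpike))  // false: within normal variation
    println(worthWakingSomeoneUp(realIncident)) // true: well outside the baseline
}
```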
So we want to be careful that our tools are capable, that our tools are correct, and that we look at the problem from a number of angles. Ensure that you have access, and that you know how to get access if you need it. We've had an issue in the past where one person was the only one who had the password to gain access to a production feature, so every time there was a problem that might have involved that feature, whether it actually did or not, we needed to call them out so that they could log in and check. We resolved that very quickly, because you can't rely on one person. We also need to make sure that access is maintained. You don't want a situation where the policy says that passwords or user accounts automatically expire after three months, because that's your security policy, and nobody notices. What you want is that, a few days before access is revoked, the person gets an email, their line manager gets an email, and the team gets an email, so that people are aware access is changing and know how to regain it. At one of my previous clients we had a particularly complex tool where the terminology around getting access was so confusing that we actually lost track of what we needed to know to get access to that component, and we ended up writing runbooks just for getting access. But runbooks are an essential part of this; you'll never escape runbooks if you want to do production support successfully. I cannot express how important runbooks are. We used to organize our runbooks by feature and ensure that every runbook listed the core team responsible for that delivery, so we knew not who to blame, but who we could ask: who we could put our effective questions to in order to gain a proper understanding of that particular feature. We also had emergency contact information, so when something fell over in that area there was a mechanism, maybe not a particular person, but a mechanism by which we could reach someone who could give us the information we needed. We also included a status report link, so from the runbook we could click through to a reporting tool that would tell us as much as possible about what that feature was doing. We included an architecture diagram, and architecture diagrams were really useful in identifying dependencies and how they relate within that system or feature, so that if multiple features were failing and they all had an element in common in the architecture, we were capable of communicating that to people. The Google incident is such a good example, because that's not the first time I've seen a single sign-on fail. There have been other cases where people couldn't access internal systems, and you keep thinking: what's wrong with YouTube and Gmail and Google Docs, why are all these systems failing? And then it turns out it's something in your security layer: your security keys aren't up to date, or that element is failing. We also included a repository link in our runbooks, because having access to the code can help. I don't recommend trying to learn a code base at 3 a.m.; it's not fun. But what you can do is go and look at what tests have been written: is that particular feature tested, and if not, why?
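To pull those runbook pieces together, here is a minimal sketch of the kind of per-feature runbook entry described above. The field names, URLs and values are my own illustration, not a prescribed format.
```kotlin
// One runbook entry per feature: who owns it, how to reach them in an emergency,
// and where its dashboards, architecture diagram and code live. All values are illustrative.
data class RunbookEntry(
    val feature: String,
    val owningTeam: String,             // who to put your effective questions to, not who to blame
    val emergencyContact: String,       // a mechanism (rota, channel), not a single named person
    val statusDashboardUrl: String,     // link to the reporting tool for this feature
    val architectureDiagramUrl: String, // helps spot shared dependencies across failing features
    val repositoryUrl: String,          // the code and, importantly, its tests
    val accessNotes: String             // how to get, and regain, access to the component
)

// A hypothetical entry; every name and URL here is made up for illustration.
val paymentsRunbook = RunbookEntry(
    feature = "Payments",
    owningTeam = "payments-core",
    emergencyContact = "#payments-oncall rota",
    statusDashboardUrl = "https://dashboards.example.internal/payments",
    architectureDiagramUrl = "https://wiki.example.internal/payments/architecture",
    repositoryUrl = "https://git.example.internal/payments-service",
    accessNotes = "Request the payments-prod-read role; access expires every 90 days"
)
```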
You could also make a suggestion in the wash-up, which I'll recommend later, that teams implement tests so that these problems don't arise. Having that kind of status involved in your project, beyond just the code, is something we found very effective. And while reports can be very useful, I have found that reports can also give you a very skewed perspective if you only measure certain elements. You want to be careful with reports and know how to read between the lines; when you look at a report, you can't always determine the exact situation. Our biggest request from management when reporting on production incidents was: how many unique users were affected by this problem? And if your report is just a blanket crash report, showing how many crashes happened between this time and that time, you cannot assume that that count is the number of unique users impacted. A lot of the time, if someone is doing a primary-feature activity, they might try several times, so one person might have five tries whereas another person could have tried ten; the number of unique users impacted cannot necessarily be measured by that. I would strongly recommend against any kind of one-dimensional reporting. Having just one report is just not good enough. You want to know how many sessions were alive during that time, and not in a way that could uniquely identify your customers, because that might not be possible given your environment. If you're in a financial institution, you want to be very careful that your reports cannot uniquely identify people and accounts; you want to keep that separate from your development team. Never let the development team have access to your production financial server; the problems there are just unending. You want to be able to identify a problem for the right reason. And for managing all of this, I would strongly recommend having one communication tool. Communication, like I've said before, is going to be your biggest asset, but it can also be your biggest detractor in a support incident. I already spoke about the time when a colleague and I decided to take different roles, where one was working on the system itself and one was just managing communication, because we had all these different mechanisms, and especially now, in the world we live in, there are so many things that can ping you. I think a lot of us are just so tired of things pinging at us: Skype, WhatsApp, Teams, emails, texts. Deciding on one tool that you're going to communicate with will really help you focus on the problem, and effectively communicating that need to management is a skill in and of itself. I remember one problem we had with internationalization in Android. I was just trying to fix an issue with the way a particular piece of internationalization worked, and I kept on getting pinged: hey, how's it going? What's going on? Are you nearly done? Every two seconds I got a ping, and I had to tell them: I'm busy working on a solution, I cannot be disturbed, but I also don't want to ignore you, so I'm not sure what to do; can I tell you when I'm done? Eventually I set my status to "working on this issue, please don't ping me" and put a little warning light on it, and that really helped.
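Coming back to the point about unique users versus raw crash counts, here is a minimal sketch of why the two numbers differ, assuming a hypothetical CrashEvent record that carries only an anonymised identifier rather than anything that could identify an account.
```kotlin
import java.time.Instant

// A hypothetical crash record from a reporting tool. The identifier is anonymised; in a
// financial institution you do not want reports that can identify customers or accounts.
data class CrashEvent(val anonymisedUserId: String, val feature: String, val at: Instant)

// Raw event count: one frustrated person retrying five times shows up as five crashes.
fun totalCrashes(events: List<CrashEvent>): Int = events.size

// Unique impact: how many distinct people were actually affected.
fun uniqueUsersAffected(events: List<CrashEvent>): Int =
    events.distinctBy { it.anonymisedUserId }.size

fun main() {
    val now = Instant.now()
    val events = listOf(
        CrashEvent("u-1", "payments", now),
        CrashEvent("u-1", "payments", now), // the same person retrying
        CrashEvent("u-1", "payments", now),
        CrashEvent("u-2", "payments", now)
    )
    println("Crash events: ${totalCrashes(events)}")        // 4
    println("Unique users: ${uniqueUsersAffected(events)}") // 2
}
```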
Another time, we had a really big issue where people were saying there was a problem with Android only, and it turned out to be an IPv6 issue: something to do with the way one of the networks, a very popular UK network, was handling IPv6 packets was causing massive packet loss. It was hard to pin down because we couldn't see it on the production apps we were running, and even some of our QAs and testers on different networks couldn't see it. Eventually someone who was a customer of that network realized that the problem was with the network, the service provider that our customers were using as well. Getting to the bottom of that kind of problem was a really great collaborative and communication effort, and as a result we managed to create a tool and a report that showed us which networks were being used, and that became another part of our ecosystem. That leads me to my next point: trusting the team. In our day-to-day development processes, the way that I work and the way that a lot of people work is this: we're given a feature to work on, we write tests, we write the code, it goes through a code review process, it gets merged into a branch, then hopefully it gets reviewed by a QA, then it merges into the main branch, then it gets regression tested, then it gets released. These processes are quite long-winded; this is not something you could do easily on a production support call-out. And there's a reason why we have these processes in place: we take it slow because we want to be careful, and the reason we want to be careful, again, boils down to reputation. Especially in a smaller team environment, or one where there's less control, the temptation is to bypass code review and bypass testing in the name of an immediate fix. And with this kind of reckless cowboy behavior, while you might get away with it once or twice, you always run a massive risk. The risk that you take on is not just yours personally; when you take a risk like that, you're taking it on behalf of the entire organization. And while it might be cool to try to be a superhero, you need to ensure that you end up being a superhero every time, and that is a compounding level of risk, because once you lose that battle, your reputation is gone, and you've also undermined the entire point of having those processes in the first place. I remember, years ago, in one of my first jobs, I was called out for a production support incident. I had to drive into the office and sit there with a number of people standing over me telling me to release the app. Eventually I had a fix in place, and I said: but wait, we need to do a code review. And they were like: no, just push it out. It's really difficult when you've got that pressure on you, even from your own team, to say no, and I'm so grateful that I did. I can't say it was anything other than pure luck; it wasn't that I'm fantastic, I'm definitely not. But I was really grateful that I called another developer, woke them up, and said: look, please, just do a code review, I'm getting a lot of pressure. Thankfully they were willing, and they spotted an error that could have caused a massive crash, and we managed to fix the problem without a massive incident. But it was because we did it as a team and trusted the process that was put in place.
In that particular incident, yes, we came out with a result where the error was fixed. But there are also times when you just need to leave the problem there, when a problem has to remain unresolved until the team can wake up. If it's not worth waking up 20 developers, six QAs, and five managers at that particular point in time, you might have to leave it. This happened to me a while back when someone wanted to turn a feature off, and I said: look, in order to do that, we have to take the whole app down, which means production goes down for everyone, whereas this particular feature was just a contact feature or something like that, not primary functionality, some esoteric part of the system. I said to them: if we do take the app down, which is a possibility, we affect everyone, and granted, this problem goes away, but then we're not going to make money. I had to put it in those terms. Unfortunately, that particular person decided to go over my head, which is perfectly reasonable; I was speaking from a logical perspective as a developer. They went to another manager, who unfortunately wasn't too receptive to being woken up at 2 a.m. and said they'd rather leave it there and deal with it later. But at least, as a team, I made a decision, I stuck to my guns, and thankfully I was corroborated. If they had said: no, we need to fix it, we need to wake everybody up, I would have been happy to do that all the same, but at least a discussion happened, and again, no one was blamed. People don't want errors. I remember a particular problem in one of the first apps I wrote: there was a spelling mistake, and in my mind the client was taking too long to fix it, so I went ahead and released a new version myself. I felt absolutely terrible, because I realized that I had taken that step of trying to be a hero, not trusting the team, not trusting management. Thankfully, my company was very gracious and very kind with me, and I'm glad I had that small experience as a brush with failure rather than an actual failure. But it's important to keep in mind that you don't want to be in that situation. That leads me to my next point, which is to take downtime. I think it's so important that we rest, especially as people who are willing to get up at all hours of the morning to satisfy clients and keep the company going, because it shows that we value what we do, not just to the point of writing beautiful code or producing something worthwhile, but supporting the people who use our application. And we need to be really sure that we take time to look after ourselves; no one's going to offer to look after you. It's so good to be able to take a step back and say, at the end of a call-out: I'm going to be coming in later because I need to rest, or: I'm going to take a day off now because of this, and to discuss this with people, create a discussion around it, and agree on what will happen. You don't want to be the only person that people can rely on. You want to create very distinct boundaries around what you're prepared to do and what you're not prepared to do. Without communicating, we're not going to get anywhere.
Those boundaries need to be effectively communicated, especially when it comes to taking time off, getting remuneration for extra work done, or being allowed to take time in lieu. These things are all part of communication, and I strongly recommend that that communication happens in writing before an incident occurs. Know the sacrifice you're going to make. Some people get an offer of extra money and think that's a fantastic outcome of doing production support, and it can be, but know that you are preparing yourself to be on edge while sleeping. I was sometimes so scared being on production support. There were times when a new feature had just come out and people were going to use it a lot, and I would lie in bed, literally with one eye open, not able to sleep. Nothing would happen, and then I couldn't take time off, because I was just on call; I was on call but never actually got called out. So understand that we're making a sacrifice of sleep and of time, and whether the money is worth it is a question we really have to ask ourselves. And so, some final thoughts. One of the things I would really encourage is doing wash-ups, and what I mean by a wash-up is taking a scientific view of what happened. After something has happened and you've taken your rest, you come in the next day, talk to your team, and explain what happened: who you were speaking to, what you think the cause was, who was affected, how you resolved it or how you came to your delegation, what decision ended up being made, and who made that decision, whether it was you or someone else. Quite often I would get a call-out just from our internal systems, and I would look at it and think: oh, this was the garbage collector going crazy, we know about this problem, it's an existing issue, it's a blip, it's 3:00 in the morning so not a lot of users are affected. I'd post that on our Slack and go back to bed, because the system had already corrected itself. But having those wash-ups is so important. It allows you to communicate that you understood the problem, it shows an interest in your system, and it's a way to teach other people to do production support themselves and show them what systems they have access to. This is what encourages learning and correction for the future. So yes, please, for the sake of your own sanity, do a wash-up afterwards. I think it's also really important to check your merging and release strategy. Some people just merge straight into their main branch without thinking, or they don't create proper release notes. This can be particularly dangerous: when code is just thrust into the main branch and your releases are branched off of it, you can quite often end up releasing features you weren't intending to, even if they're not turned on and users wouldn't see them. You want to be careful about what's going out there and how it might impact other systems. We've seen this in particular with cross-platform systems, where a feature gets released for one platform but not the other, and then there's some kind of mismatch between the two. And having release notes can be very telling.
If you've just released an app and now there's a problem, those release notes are gold at those really weird times when you have to work out why a particular system has failed, and knowing what branch was merged into that particular release can be vital in ascertaining what the problem was. The last element I'd like to quickly mention is schedule management. Having the buddy system on your support team, on your support roster, is, I think, the best decision you can make. We used to have a primary and a secondary that would alternate, and that was really useful, because we would know who was primary, who was taking the main role and telling people what to do, and who was secondary. If the primary doesn't get the call-out for whatever reason, maybe they're on the tube (not that that happens much lately), they're incapacitated, or they're just unavailable because the network's down in their area, there's a secondary who can come in and help. Or, if the primary is feeling overwhelmed, they can call the secondary and say: hey, I need my buddy, can you jump in? Make sure that people have access to the numbers they need to reach the right teams, and know what to do in the case where you need to escalate beyond the secondary. Having all those vital numbers and contacts in your calendar when you've been called out can be so useful, and putting a tool in place that is not only accessible to everyone but also manages who's doing what and when can be the best idea you come up with. You also want to make sure that you don't overwhelm any developer or support engineer. Make sure that everybody takes it in turns and that you don't end up with people being on call for three weeks in a row, or being primary all the time. Have those distinctions, and enable people to know who's primary and who's secondary ("I'm secondary today, who's the primary?"), because knowing those roles and who to call in that instance can be super useful. So schedule management is not just about having a calendar in place; it's about connecting, collaborating, and communicating. I'd like to thank you for attending. If you have any questions, please feel free to look me up on GitHub. Thank you so much, and please enjoy the rest of the conference.

Quintin Balsdon

Expert Software Engineer @ Zuhlke Group



