Navigating SRE/Incident Management

Abstract

Navigating job search in tech SRE/Incident management as a fresh graduate
Handling critical incident bridges
Managing difficult clients in stressful situations
Understanding how SRE roles vary across organizations and how to adapt
Bringing awareness to what SRE encompasses in various environments

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everyone. Today I am really excited to talk about a critical aspect of IT operations that ensures the stability and reliability of our systems: site reliability engineering, or SRE as you may know it, and incident management. These processes are the backbone of any organization's ability to deliver seamless services to its users while maintaining operational efficiency. In this session we'll explore three connected processes, incident management, problem management, and change management, and how they work together to minimize disruptions, resolve issues efficiently, and ensure long-term system stability.
So with this, let's begin by diving into incident management, which is the first line of defense in handling IT service disruptions. So let's talk about navigating incident management. Incident management is the process of identifying, managing, and resolving IT service disruptions to restore normal operations as quickly as possible. Think of it as the emergency response team for your IT infrastructure. But let's not confuse incidents with requests. Sometimes users or requesters will lose access to certain tools, be unable to perform certain operations, and create an incident ticket. Just keep in mind that those should be logged as request tickets and not incident tickets. That is a huge difference between incident tickets and our ITMs, or request tickets.
So, the key objectives of incident management. First, it aims to minimize the impact of incidents on business operations. Every minute of downtime can have significant consequences like lost revenue, frustrated customers, and damage to the brand's reputation, which is sensitive, right? Second, it ensures incidents are resolved within the agreed SLAs, or service level agreements. So what are SLAs? SLAs define the acceptable timeframes for resolving issues, ensuring accountability and consistency in service delivery. And why is this important?
A structured incident management process not only reduces downtime but also maintains service quality by ensuring issues are handled efficiently and effectively. Let me illustrate this with an example. Imagine a critical e-commerce platform that goes down during a major sales event like Black Friday. Without a robust incident management process in place, it'll be utter chaos. Teams will scramble without clear roles or priorities, prolonging the outage. But with a structured approach, the incident is logged, categorized, and escalated, and the issue is resolved swiftly, minimizing disruption.
With this, let's move on to another topic that is very much intertwined with incident management: problem management. While incident management focuses on immediate resolution, problem management takes a step back to address the root causes of these incidents. Its goal is to prevent recurring issues by identifying and resolving systemic problems. The key activities in problem management are root cause analysis, or RCA, which involves investigating underlying issues to understand why an incident occurred in the first place, and implementing permanent fixes or workarounds. Once the root cause is identified, teams work on permanent resolutions or interim fixes to prevent recurrence.
Now you might wonder, how does problem management relate to incident management? The two processes are closely intertwined. Problem management uses data from incidents, such as patterns, trends, and recurring issues, to identify systemic problems that need addressing. For example, imagine repeated server outages traced back to a misconfigured application. While incident management resolves each outage temporarily, problem management steps in to permanently fix the configuration issue. By addressing these root causes, problem management ensures long-term stability and reduces the likelihood of future disruptions. Now, isn't that great?
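
The idea of mining incident data for recurring issues can be sketched in a few lines of Python. This is only an illustration, not something shown in the talk: the `Incident` record, its fields, the sample ticket IDs, and the threshold of three occurrences are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """A resolved incident record (hypothetical shape, not from the talk)."""
    incident_id: str
    service: str
    category: str


def problem_candidates(incidents, threshold=3):
    """Return (service, category) pairs seen at least `threshold` times.

    A pair that keeps recurring hints at a systemic issue that may
    deserve a root cause analysis instead of another one-off fix.
    The default threshold is an arbitrary assumption.
    """
    counts = Counter((i.service, i.category) for i in incidents)
    return sorted(pair for pair, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    history = [
        Incident("INC-1", "checkout", "server-outage"),
        Incident("INC-2", "checkout", "server-outage"),
        Incident("INC-3", "search", "latency"),
        Incident("INC-4", "checkout", "server-outage"),
    ]
    print(problem_candidates(history))  # [('checkout', 'server-outage')]
```

In practice the grouping key and threshold would depend on how an organization categorizes its tickets; the point is only that repeated incidents become visible once they are counted together.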
So let's move forward to our third topic, which is change management. This third pillar we'll discuss today focuses on planning, approving, and implementing changes to IT systems while minimizing risks. Changes could include anything from software updates to infrastructure upgrades or new feature deployments. Changes often arise as solutions from problem management. For instance, if problem management identifies outdated software as the root cause of recurring issues, change management ensures that updates are deployed safely without introducing new problems.
The first key principle of change management is risk assessment: before implementing any change, it is crucial to evaluate potential risks and their impact on operations. A change should not take down an entire development or production environment. Just saying. Testing and phased deployment would be the second key principle: changes should be tested rigorously in controlled environments before being rolled out in phases to minimize disruptions. As an example, suppose your team identifies recurring performance issues with the database system during problem management. Change management would oversee a phased upgrade of that database system, testing it thoroughly before rolling it out across all environments to ensure minimal risk of downtime or new issues. By integrating change management with incident and problem management processes, organizations can implement fixes safely while maintaining system reliability.
So before we wrap this up, let's take a moment to reflect on how these three processes, incident management, problem management, and change management, work together seamlessly. As we know, incident management addresses immediate disruptions to restore services quickly. And problem management investigates the root causes to prevent these issues from recurring.
And change management ensures that fixes or improvements are implemented safely without introducing new risks. These interconnected processes form the foundation of IT stability and operational efficiency.
So let's discuss some quick tips for effective incident management. The first is to standardize incident logging, categorization, and prioritization processes so teams can respond consistently and effectively. For example, a P1, or priority one, incident to one team member should not be a P3, or priority three, incident to another. The second point is to use automation tools for communication and escalation to ensure service level agreements, or SLAs, are met without delays. Sending out comms manually during an incident bridge can interfere with communication timelines, so it's best to have comms automated as needed. The third point would be to regularly review SLAs and priority matrices to ensure alignment with evolving business needs. Also, keep track of and follow up on root cause analyses or SLAs that are assigned to the respective teams so that those deadlines are not missed either.
To conclude this section, the final thought I would like to share is that a well-integrated approach to incident management, problem management, and change management is essential for any organization aiming to enhance operational efficiency while reducing downtime. By focusing on these processes as part of site reliability engineering hacks, we can ensure business continuity and deliver exceptional service quality even in the face of challenges.
So while wrapping this up, let's discuss some tips and tricks for dealing with difficult incident bridges. And I promise you, you will definitely come across at least one in your lifetime if you start working as an SRE or incident manager, or if you're already working as one. Whenever we are on an incident bridge, it is very important to understand the impact of the incident.
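
One way to make sure the same incident gets the same priority from every responder is a shared lookup keyed on impact and urgency, paired with an SLA deadline check. The sketch below is an assumption-laden illustration rather than anything presented in the talk: the two-level impact and urgency scale, the P1 to P3 mapping, and the SLA durations are all hypothetical placeholder values.

```python
from datetime import datetime, timedelta

# Hypothetical impact x urgency mapping; real values come from your organization.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "low"): "P2",
    ("low", "high"): "P2",
    ("low", "low"): "P3",
}

# Hypothetical SLA resolution targets per priority.
SLA_TARGETS = {
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
}


def classify(impact, urgency):
    """Look up the priority in the shared table so answers stay consistent."""
    try:
        return PRIORITY_MATRIX[(impact, urgency)]
    except KeyError:
        raise ValueError(f"unknown impact/urgency: {impact!r}, {urgency!r}") from None


def sla_deadline(priority, opened_at):
    """Return the time by which the incident should be resolved."""
    return opened_at + SLA_TARGETS[priority]


def is_breached(priority, opened_at, now):
    """True when `now` is past the SLA deadline for this incident."""
    return now > sla_deadline(priority, opened_at)


if __name__ == "__main__":
    opened = datetime(2025, 4, 17, 17, 0)
    priority = classify("high", "high")
    print(priority, sla_deadline(priority, opened))
```

Because the priority comes from a table rather than individual judgment, a requester pushing for a P1 can be shown exactly which impact and urgency values the incident would need to qualify.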
Most times, users or requesters will be inclined to push their issue as a P1, or priority one. However, if the impact does not align with our priority matrix, it's best to explain to the user why it ranks as a lower-priority incident rather than only being aggressive about switching the priority. The second point I would like to bring up is that oftentimes, while dealing with bridges, we tend to lack empathy. Even when we are unable to assist the user because the incident is low priority or not in our domain, it's best to guide the user or requester with next steps, if we are aware of any, instead of just leaving them stranded. The third, and one of the most important points, is to maintain boundaries on bridges. Sometimes we'll get pushback on following our standardized priority matrix and escalation procedures. It is important to be assertive when needed to prevent unnecessary use of our resources and time, and to invest those where there is an actual need for them. So it's very important to keep that in mind.
Moving to my last slide, the final topic for discussion is something fun: interview hacks for SRE or IM. Or rather, it's a general hack that I have experienced and wanted to share with whoever is struggling or looking for some tips and tricks for interviewing. Oftentimes, I come across fresh graduates asking if it's okay to ask questions back to the recruiter when they're reached out to for a call or interview. I would recommend asking as many questions as you need until you have a clear idea of what you are signing up for. It will not only help you do better in the interview, but will also help you understand whether you really want to move forward with the opportunity, which is very important. The second point is to know your audience, which means know your interviewers. If possible, research how they interview and what their skill sets are in the job.
That might give you some understanding of what to expect during your interview. The third point I would like to discuss is enhancing your skill set the smart way for interviews. Focus on short, fast courses on the skills required for the job you are interested in, rather than going for a whole week's tutorial. That will help build your confidence for the interview. And most importantly, be confident. Do not hesitate to share that you do not know the answer to something; share how much or how little you do know.
So with this, I wrap up this presentation, and I hope it was helpful. Thank you for your time today. I look forward to hearing your thoughts or answering any questions you may have about these vital processes. Say hi to me on LinkedIn. Also, this is my email, shamali dot rc@gmail.com. Thanks everybody. It was great speaking to you guys.

Conf42 Site Reliability Engineering (SRE) 2025 - Online

April 17 2025 - premiere 5PM GMT

Shaalmali Raychaudhury

SRE @ PayPal
