Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
Today I am really excited to talk about a critical aspect of IT
operations that ensures the stability and reliability of our systems: site
reliability engineering, or SRE as you may know, and incident management.
These processes are the backbone of any organization's ability to deliver
seamless services to its users while maintaining operational efficiency.
In this session we'll explore three connected processes, incident management,
problem management, and change management, and how they work together to minimize
disruptions, resolve issues efficiently, and ensure long-term system stability.
So with this, let's begin by diving into incident management, which
is the first line of defense in handling IT service disruptions.
So let's talk about navigating incident management.
Incident management is the process of identifying, managing, and resolving IT
service disruptions to restore normal operations as quickly as possible.
Think of it as the emergency response team for your IT infrastructure, but
let's not confuse incidents with requests.
Sometimes users or requesters will lose access to certain tools and will be
unable to perform certain operations and create an incident ticket.
But just keep in mind that those should be logged as request
tickets and not incident tickets.
So that would be a huge difference between incident tickets and
RITMs, or request tickets.
So, the key objectives of incident management. First, it
aims to minimize the impact of incidents on business operations.
Every minute of downtime can have significant consequences like lost
revenue, frustrated customers, and damage to the brand reputation,
which is sensitive, right?
Second, it ensures incidents are resolved within the agreed
SLA or service level agreements.
So what is an SLA?
SLAs define the acceptable timeframes for resolving issues, ensuring accountability
and consistency in service delivery.
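As a rough illustration of how an SLA timeframe can be tracked, here is a minimal Python sketch. The priority names, the target durations, and the `sla_deadline` and `is_breached` helpers are all assumptions made up for this example; real targets come from your own agreements.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per priority; real values come from your SLA.
SLA_TARGETS = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=8),
    "P3": timedelta(days=3),
}


def sla_deadline(opened_at: datetime, priority: str) -> datetime:
    """Return the time by which the incident should be resolved."""
    return opened_at + SLA_TARGETS[priority]


def is_breached(opened_at: datetime, priority: str, now: datetime) -> bool:
    """True if the incident is still open past its SLA deadline."""
    return now > sla_deadline(opened_at, priority)


opened = datetime(2024, 11, 29, 9, 0)
print(sla_deadline(opened, "P1"))  # 2024-11-29 13:00:00
print(is_breached(opened, "P1", datetime(2024, 11, 29, 14, 0)))  # True
```

Keeping the targets in one table means every team measures against the same deadlines.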
And why is this important?
A structured incident management process not only reduces downtime
but also maintains service quality by ensuring issues are
handled efficiently and effectively.
Let me illustrate this with an example.
Imagine a critical e-commerce platform that goes down during a
major sales event like Black Friday.
Without a robust incident management process in place, it'll be utter chaos.
Teams will scramble without clear roles or priorities, prolonging the
outage. But with a structured approach,
the incident is logged, categorized, and escalated, and the issue is resolved
swiftly, minimizing disruption.
With this, let's move on to another topic, which is very much intertwined
with incident management, which is
problem management.
So while incident management focuses on immediate resolution, problem
management takes a step back to address the root causes of these incidents.
Its goal is to prevent recurring issues by identifying and
resolving systemic problems.
The key activities in problem management
are root cause analysis, or RCA, which involves investigating underlying issues
to understand why an incident occurred in the first place, and implementing
permanent fixes or workarounds. So once the root cause is identified,
teams work on permanent resolutions or interim fixes to prevent recurrence.
Now you might wonder, how does problem management relate to incident management?
The two processes are closely intertwined.
Problem management uses data from incidents (patterns, trends, and
recurring issues) to identify systemic problems that need addressing.
For example, imagine repeated server outages
traced back to a misconfigured application.
While incident management resolves each outage temporarily,
problem management steps in to permanently fix the configuration
issue. By addressing these root causes, problem management ensures long-term
stability and reduces the likelihood
of future disruptions.
Now, isn't that great?
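To make the idea of mining incident data for recurring issues concrete, here is a small Python sketch. The incident records, the field names, and the `problem_candidates` helper with its threshold of three are hypothetical and only meant to show the pattern, not any particular ticketing tool's data model.

```python
from collections import Counter

# Hypothetical incident records; field names are assumptions for illustration.
incidents = [
    {"id": "INC001", "component": "app-server", "cause": "misconfiguration"},
    {"id": "INC002", "component": "database", "cause": "disk full"},
    {"id": "INC003", "component": "app-server", "cause": "misconfiguration"},
    {"id": "INC004", "component": "app-server", "cause": "misconfiguration"},
]


def problem_candidates(records, threshold=3):
    """Return (component, cause) pairs seen at least `threshold` times."""
    counts = Counter((r["component"], r["cause"]) for r in records)
    return [pair for pair, n in counts.items() if n >= threshold]


print(problem_candidates(incidents))
# [('app-server', 'misconfiguration')]
```

Anything this flags would be a candidate for a problem record and an RCA, rather than another one-off incident fix.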
So let's move forward to discussing our third topic, which is change management.
So the third pillar we'll discuss today is change management, which
focuses on planning, approving, and implementing changes
to IT systems while minimizing risks. Changes could include anything from
software updates to infrastructure upgrades or new feature deployments.
Changes often arise as solutions from problem management.
For instance, if problem management identifies outdated software
as the root cause of recurring issues, change management ensures
that updates are deployed safely without introducing new problems.
The key principles of change management are, first, risk assessment:
before implementing any change, it is crucial to evaluate potential
risks and their impact on operations.
It should not take down an entire development or production environment.
Just saying. Testing and phased deployment would be the second point.
As a key principle, changes should be tested rigorously in controlled
environments before being rolled out in phases to minimize disruptions.
An example would be, suppose your team identifies recurring
performance issues with the database system during problem management.
Change management would oversee a phased upgrade of that database system,
testing it thoroughly before rolling it out across all environments to ensure
minimal risk of downtime or new issues.
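A phased rollout like this can be sketched in a few lines of Python. The `phased_rollout` function, the environment names, and the stand-in callbacks below are assumptions for illustration; a real rollout would call your own deployment and monitoring tooling.

```python
def phased_rollout(phases, apply_change, health_check):
    """Apply a change phase by phase; stop at the first unhealthy phase.

    Returns the list of phases that were completed successfully.
    """
    completed = []
    for phase in phases:
        apply_change(phase)
        if not health_check(phase):
            # Halt here so later environments are never touched.
            break
        completed.append(phase)
    return completed


# Hypothetical environments and stand-in callbacks for illustration.
phases = ["test", "staging", "production"]
applied = []
done = phased_rollout(
    phases,
    apply_change=applied.append,
    health_check=lambda env: env != "staging",  # pretend staging fails
)
print(done)     # ['test']
print(applied)  # ['test', 'staging']
```

Because the loop stops at the first failed health check, production is never touched when staging shows a problem.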
By integrating change management with incident and problem management processes,
organizations can implement fixes safely while maintaining system reliability.
So before we wrap this up, let's take a moment to reflect on how these
three processes, incident management, problem management, and change
management work together seamlessly.
As we know, incident management addresses immediate disruptions
to restore services quickly.
And problem management investigates the root causes to
prevent these recurring issues.
And change management ensures that fixes or improvements are implemented
safely without introducing new risks.
These interconnected processes form the foundation of IT stability
and operational efficiency.
So let's discuss some
quick tips for effective incident management. The first is to standardize incident
logging, categorization, and prioritization processes so teams can
respond consistently and effectively.
For example, a P1, or priority one, incident to one team member
should not be a P3, or priority three, incident to another.
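One way to keep priorities consistent across team members is to look them up from a shared table instead of judging case by case. The sketch below is a minimal Python example; the two-level impact and urgency scale, the mapping itself, and the `classify` helper are assumptions for illustration, not a standard.

```python
# Hypothetical impact/urgency matrix; the levels and mappings are assumptions.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "low"): "P2",
    ("low", "high"): "P2",
    ("low", "low"): "P3",
}


def classify(impact: str, urgency: str) -> str:
    """Look up the priority so every responder gets the same answer."""
    return PRIORITY_MATRIX[(impact.lower(), urgency.lower())]


print(classify("High", "high"))  # P1
print(classify("low", "Low"))    # P3
```

With a lookup like this, the same impact and urgency always produce the same priority, no matter who logs the ticket.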
The second point is to use automation tools for communication
and escalation to ensure
service level agreements, or SLAs, are met without delays.
Sometimes sending out manual comms can interfere with the timelines for those
comms during an incident bridge, so it's best to have comms automated as needed.
The third point would be to regularly review SLAs and priority matrices to ensure
alignment with evolving business needs.
Also, note to keep track of and follow up on RCAs or SLAs that are assigned
to respective teams so that those deadlines are not missed as well.
To conclude this section, the final thought that I would like
to share is that a well-integrated approach to incident management,
problem management, and change management is essential for any
organization aiming to enhance
operational efficiency while reducing downtime.
By focusing on these processes as part of site reliability engineering hacks,
we can ensure business continuity and deliver exceptional service quality
even in the face of challenges.
So while wrapping this up, let's discuss some tips and tricks for dealing
with difficult incident bridges.
And I promise you, you will definitely come across at least one in your lifetime
if you start working as an SRE or incident manager, or if you're already
working as an SRE or an incident manager.
So whenever we are on an incident bridge, it is very important to
understand the impact of the incident.
Most times, users or requesters will insist on driving their
issue as P1, or priority one.
However, if the impact does not align with our priority matrix, it's best to explain
to the user why it stands as a lower-priority incident rather than only being
aggressive about switching the priority.
The second point that I would like to bring up is oftentimes while dealing
with bridges, we tend to lack empathy.
Sometimes even when we are unable to assist the user due to low priority
incidents or incidents not being in our domain, it's best to guide the
user or requester with next steps if we are aware of any, instead
of just leaving them stranded.
And the third one,
and one of the most important points, is to maintain boundaries on bridges.
Sometimes we'll get pushback on following our standardized priority
matrix and escalation procedures.
It is important to be assertive when needed to prevent unnecessary use of
our resources and time and invest those where there is an actual need of them.
So it's very important to keep that in mind.
So, moving to my last slide, the final topic for discussion is something fun.
It's actually interview hacks for SRE or IM, or it's a general hack, I would
say, that I have experienced, and I wanted to share that with whoever is
struggling or looking for some tips and tricks for interviewing.
So oftentimes, I come across fresh graduates asking if it's
okay to ask questions back to the recruiter when they're reached
out for a call or interview.
I would recommend asking as many questions as you need until you have a clear
idea on what you are signing up for.
It will not only help you to do better in the interview, but also
will help you to understand if you really want to move forward with the
opportunity, which is very important.
The second point to discuss is to know your audience, which means
know your interviewers. If possible, research how they interview and
what their skill sets are in the job,
which might give you some understanding of what to expect during your interview.
The third point that I would like to discuss is enhancing your skillset
the smart way for interviews.
Focus on short and fast courses on the skillset requirements for the job that you
are interested in, rather than going for a whole week's tutorial. That will help to
build your confidence for the interview.
And most importantly, be confident.
Do not hesitate to share that you do not know the answer to something;
share how much or how little you know. So with this, I wrap up this presentation
and I hope this was helpful.
So thank you for your time today.
I look forward to hearing your thoughts or answering any questions that you
may have about these vital processes.
Say hi to me on LinkedIn.
Also, this is my email, shamali dot rc@gmail.com.
Thanks everybody.
It was great speaking to you guys.