Conf42 Incident Management 2023 - Online

Unlocking Insights through Incident Management Reports


Abstract

In this presentation, we will explore the significant impact of incident management reports. These reports serve as powerful tools, providing a wealth of information that can transform the way we operate and enhance the experience for both our team and our customers.

Here are the key points we will explore:

  1. The Essence of Incident Management Reports: We’ll start by delving into what makes incident management reports indispensable. These documents are not just summaries of past incidents; they hold the keys to future improvements.

  2. Mining Insights from Incident Data: Next, we’ll discuss how analyzing the results and volume of incidents can reveal patterns and trends. This data-driven approach allows us to identify the most affected functionalities and spot potential vulnerabilities proactively.

  3. Fueling Platform Improvement: The heart of our discussion will revolve around how incident reports act as catalysts for platform enhancements. By leveraging insights, we can prioritize development efforts, making the platform more reliable and resilient.

  4. Customer-Centric Perspective: We’ll emphasize how these improvements translate directly into a better customer experience. When we address pain points swiftly, customers benefit from a smoother and more satisfying interaction with our platform.

  5. Optimizing Resource Allocation: Incident management reports aren’t just about fixing issues. They guide us in allocating resources efficiently, reducing incident occurrence, and minimizing resolution time.

  6. Enhancing Incident Reaction: Lastly, we’ll explore how a data-driven approach can revolutionize our incident reaction strategies. Armed with insights, we can respond faster and more effectively when incidents do occur.

Summary

  • Andre works for Sportingtech as an incident team leader. His presentation focuses on unlocking insights through incident management reports: the essence of incident management reports, mining insights from incident data, and fueling platform improvements.
  • Incident management reports can have a role, a purpose, and key metrics, and these can differ from company to company. The key metrics include the severity of the case, MTTR (the time to resolve the case), SLOs, and frequency. Once the data is cleaned and structured, his favorite part is visualizing it.
  • Incident reports can also give you insights into other areas, such as resource allocation. With better resource allocation, people feel less stressed, work better, and are happier at work, which only brings benefits to the company.
  • By leveraging our past, we can prevent, or to some extent predict, the future. This leads to increased revenue that can be translated into more investment in the platform. Andre hopes to have awakened a little curiosity about reporting.

Transcript

This transcript was autogenerated.
Hello everyone, welcome to Conf42, and thank you for attending my presentation on unlocking insights through incident management reports. My name is Andre, I work for Sportingtech as an incident team leader, and reporting is part of my day-to-day: weekly reports, monthly reports, internal reports, reports for stakeholders. This has been a big part of my work, not only in my current job but in previous ones as well, and it is actually a personal passion. So thank you so much for attending this presentation with me.

We're going to have a light presentation covering basically six points: the essence of incident management reports, mining insights from incident data, fueling platform improvements, the customer-centric perspective, optimizing resource allocation, and enhancing incident reaction. As I said, it's a very light presentation. We are just going to talk about the overall topics and how incident management reports help us achieve a better platform and better customer satisfaction at the end of the day.

First of all, we need to understand the incident cycle. This is the first step. We can divide the incident cycle into five major steps. Step one is detection: we identify an issue and we need to tackle it. Step two is investigation: we understand the root cause and find a way to mitigate it or fix it permanently. Step three is fixing: we apply and analyze the fixes, deploy new code, and do the necessary communication with stakeholders and clients to let them know we are fixing the issue. Step four is recovery: the issue is fixed and we are recovering, so we monitor the different dashboards to understand the platform's health, take care of all the actions left over after the incident, check whether anything needs to be revised, and so on.
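Every step of the cycle produces data that later feeds the report, so it can help to record when each step starts. The sketch below is a minimal illustration that assumes a hypothetical `Incident` record with one start timestamp per stage. The stage names, the field names, and the choice to measure resolution time from detection until post-event activity begins are all assumptions made for this example. They are not Sportingtech's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Stage names follow the five-step cycle from the talk (naming is hypothetical).
STAGES = ["detection", "investigation", "fixing", "recovery", "post_event"]


@dataclass
class Incident:
    incident_id: str
    # Hypothetical schema: the time at which each stage of the cycle started.
    stage_starts: dict[str, datetime]

    def stage_durations(self) -> dict[str, timedelta]:
        """Time spent in each stage, measured until the next stage starts."""
        durations = {}
        for current, nxt in zip(STAGES, STAGES[1:]):
            if current in self.stage_starts and nxt in self.stage_starts:
                durations[current] = self.stage_starts[nxt] - self.stage_starts[current]
        return durations

    def time_to_resolve(self) -> timedelta:
        """Assumed definition: from detection until post-event activity begins."""
        return self.stage_starts["post_event"] - self.stage_starts["detection"]


# Example incident with made-up timestamps.
inc = Incident(
    "INC-001",
    {
        "detection": datetime(2023, 5, 1, 10, 0),
        "investigation": datetime(2023, 5, 1, 10, 15),
        "fixing": datetime(2023, 5, 1, 11, 0),
        "recovery": datetime(2023, 5, 1, 11, 30),
        "post_event": datetime(2023, 5, 1, 12, 30),
    },
)
print(inc.stage_durations())
print(inc.time_to_resolve())  # 2:30:00
```

Storing per-stage timestamps, instead of only a single "resolved" time, is what makes it possible to see later which step of the cycle is slowing the reaction down.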
The fifth step is post-event activity, where we collect the data gathered along all the previous steps; this data accumulates from detection to recovery. Step five is where we analyze it, and this is actually where our incident management report starts: we have the data to see how we can improve the whole incident cycle, our reaction time, and our SLA.

An incident management report can have a role, a purpose, and key metrics, and these can differ from company to company. I've listed here the ones I consider the most important. The role of the incident management report is to capture incidents: what's going on, how many there were, and the information about what happened during the month. It helps coordinate between teams for a better reaction in the future, whether to the same incidents or to new ones. It documents all the processes we are dealing with and how we are working. And it helps prevent further incidents like these, so we can learn and have a more reliable platform.

As for key metrics, again, these can differ from company to company. I'm listing the four that I think are the most important: the severity of the case; MTTR, the mean time to resolve a case; SLOs, tied to the agreements we have with our clients, internal or external; and frequency, meaning how many times these things happen and how many of them are repeated. These are what I consider the most important key metrics for incident management reports. Then we go step by step, month by month, collecting data. What data you collect, again, depends on the company: some companies focus more on the financial side, others on the reporting side.
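To make the four key metrics concrete, here is a minimal sketch of how they could be computed for one reporting period. It assumes a hypothetical `ResolvedIncident` record and a hypothetical severity scale (`SEV1` to `SEV3`). It also uses per-severity resolution targets as a stand-in for SLOs and treats incidents that share a `root_cause` label as repeated. All of these are illustrative choices, not the speaker's exact definitions.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class ResolvedIncident:
    service: str
    severity: str              # hypothetical scale: "SEV1" (worst) to "SEV3"
    minutes_to_resolve: float
    root_cause: str            # incidents sharing a root cause count as repeated


def key_metrics(incidents, slo_minutes):
    """Severity mix, MTTR, SLO compliance and frequency for one period.

    slo_minutes maps each severity to a hypothetical maximum resolution time.
    """
    within_slo = sum(
        1 for i in incidents if i.minutes_to_resolve <= slo_minutes[i.severity]
    )
    causes = Counter(i.root_cause for i in incidents)
    return {
        "count": len(incidents),
        "by_severity": dict(Counter(i.severity for i in incidents)),
        "mttr_minutes": mean(i.minutes_to_resolve for i in incidents),
        "slo_compliance": within_slo / len(incidents),
        "repeated_causes": {c: n for c, n in causes.items() if n > 1},
    }


# Made-up sample data for one month.
incidents = [
    ResolvedIncident("payments", "SEV1", 45.0, "db-failover"),
    ResolvedIncident("payments", "SEV2", 120.0, "db-failover"),
    ResolvedIncident("login", "SEV3", 30.0, "expired-cert"),
    ResolvedIncident("login", "SEV2", 60.0, "rate-limit"),
]
report = key_metrics(incidents, {"SEV1": 60, "SEV2": 90, "SEV3": 240})
print(report)
```

In this sample the MTTR is 63.75 minutes. Three of the four incidents meet their target, and the `db-failover` cause appears twice, which is exactly the kind of repetition the frequency metric is meant to surface.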
So basically, month by month, the idea is that you collect the affected services and the quantities: how many incidents happened, how the SLA looked, how long it took to resolve the different issues, the average time to mitigate an issue, and so on. Again, this will differ from company to company. But the idea is that month by month you collect information day by day, and, as mentioned previously, even within a single incident you can collect data at different steps.

Now we have collected data, and we need to understand what we are going to do with it. Data per se doesn't say anything. It's just random numbers and text, not even formulas yet, and we don't know what's going on there. To be able to work with the data, we first need to clean it. That means revising all of it, checking for typos, and making sure all the fields match the ones we want. We also make sure there are no empty values or, if there are, that they have a reason. Basically, we make sure the data is as trustworthy as possible. Second, we structure the data, which means finding a way to relate the information. I love using Excel, not only to clean but also to structure my information. Sometimes you have different tables and you need to find a way to relate them. Or, to use a pivot table, for example, you need to make sure the data in the same report has a consistent structure. And then my favorite part: visualize. This is where we create the graphs and charts, and where, to be honest, you start seeing things happen much more easily. For me, this is the third step. Now let's see an example of how this works. After we clean and structure the data, we visualize it, and with this we can identify patterns and trends.
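The clean, structure, and visualize steps can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the raw export format, the field names, the rejection rules, and the text bar chart are all hypothetical. The speaker does this work in Excel, using pivot tables and charts.

```python
from collections import defaultdict

# Hypothetical raw export: one dict per incident, as a ticketing tool might produce.
RAW = [
    {"month": "2023-01", "service": " Transactional ", "minutes": "95"},
    {"month": "2023-01", "service": "login", "minutes": "20"},
    {"month": "2023-02", "service": "transactional", "minutes": "130"},
    {"month": "2023-02", "service": "TRANSACTIONAL", "minutes": "60"},
    {"month": "2023-02", "service": "", "minutes": "40"},        # empty field
    {"month": "2023-02", "service": "Login", "minutes": "n/a"},  # not a number
    {"month": "2023-03", "service": "transactional", "minutes": "110"},
    {"month": "2023-03", "service": "transactional", "minutes": "75"},
    {"month": "2023-03", "service": "transactional", "minutes": "50"},
    {"month": "2023-03", "service": "login", "minutes": "15"},
]


def clean(rows):
    """Step 1: normalize text and set aside rows with empty or non-numeric fields."""
    cleaned, rejected = [], []
    for row in rows:
        service = row["service"].strip().lower()
        try:
            minutes = float(row["minutes"])
        except ValueError:
            rejected.append(row)
            continue
        if not service:
            rejected.append(row)
            continue
        cleaned.append({"month": row["month"], "service": service, "minutes": minutes})
    return cleaned, rejected


def pivot_counts(rows):
    """Step 2: structure the data as month x service incident counts, like a pivot table."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row["month"]][row["service"]] += 1
    return {month: dict(services) for month, services in table.items()}


def render(table, service):
    """Step 3: a minimal text bar chart of one service's monthly incident count."""
    lines = []
    for month in sorted(table):
        n = table[month].get(service, 0)
        lines.append(f"{month} {'#' * n} {n}")
    return "\n".join(lines)


rows, rejected = clean(RAW)
table = pivot_counts(rows)
print(render(table, "transactional"))
```

Keeping the rejected rows, and not silently dropping them, mirrors the advice that empty values should either be fixed or have a reason.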
Let's say, as an example, that transactional services were the most affected ones, and that the increase in incidents in the transactional services led to a 5% drop in revenue. We can also see that incidents increase quarter by quarter. So here we are starting to relate three different metrics. Login access teams are performing quickly and efficiently on their incidents, with low MTTR. These incidents happen in about the same numbers, quarter by quarter, and we can see that the MTTR is low. This gives us more input on how to cross-reference the information. DDoS attacks are another example, with low or no end-user impact. Although they have increased over the last two quarters, we can see that there's basically no impact.

Now we have different topics and different pieces of information that we've crossed. How can we prioritize the things we need to do? We do that using a prioritization matrix. Every time you don't know how to prioritize, just look at this. It's very easy, and I love this tool. Here we cross urgency with impact, so we understand which critical things we need to address the most. From my experience, looking at the previous graph, the transactional services would be the first to address, followed by login access and then the DDoS attacks, based on the cross-referenced information. Whenever you have this prioritization matrix, it helps you understand what you can do. After the improvements, once you've addressed these things, you can compare year by year how your efforts resulted in a better way of working and better revenue for the company. Using the insights from incident reports and understanding what's going on, you will be able to see fewer incidents, better MTTR and SLOs, and less impact on customers, which translates into higher revenue for the company.
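A prioritization matrix crosses urgency with impact. The sketch below assumes hypothetical scores from 1 to 5 and a threshold of 3 for "high". The scores given to the three examples from the talk are invented for illustration and chosen only to reproduce the order the speaker arrives at. Ranking by the product urgency × impact is one common convention, not necessarily the one used in the talk.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    urgency: int  # hypothetical 1 (low) to 5 (high) score
    impact: int   # hypothetical 1 (low) to 5 (high) score


def quadrant(f: Finding, threshold: int = 3) -> str:
    """Place a finding in one cell of a 2x2 urgency x impact matrix."""
    urgent = f.urgency >= threshold
    impactful = f.impact >= threshold
    if urgent and impactful:
        return "address now"
    if impactful:
        return "schedule"
    if urgent:
        return "delegate"
    return "monitor"


def prioritize(findings):
    """Rank findings by urgency x impact, highest first."""
    return sorted(findings, key=lambda f: f.urgency * f.impact, reverse=True)


# Illustrative scores only; the talk does not give numbers.
findings = [
    Finding("DDoS attacks", urgency=4, impact=1),             # rising, no user impact
    Finding("login access", urgency=2, impact=3),             # stable, low MTTR
    Finding("transactional services", urgency=5, impact=5),   # rising, revenue hit
]
for f in prioritize(findings):
    print(f"{f.name}: {quadrant(f)}")
```

The quadrant labels are one possible naming. What matters is that each finding is scored on both axes before anything is ranked.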
And this is not only internal. These improvements work like a cycle, a symbiosis if you get my point, because customers also experience the improvements and the service gains popularity among them. One friend tells another, and family members say this service is reliable and you need to try it. So by improving our platform, we improve customer satisfaction. That in turn improves the company and helps it expand and gain popularity in the market it operates in.

Moving on, let's look at how incident reports can give you more insights into other areas. Let's talk about resource allocation. Say there's a new project in the company, or a development, or something we need to fix that came from an incident, for example. We define what we need to do and plan the actions, and now it's time to identify the technical resources and the workforce needed. This is where incident management reports can give big insight. With the incident management report, we can see that we had x incidents and that they took y developers and z hours. By gathering more and more of this information over the months, we can understand whether we will need more or fewer people to tackle certain types of issues. Then we implement and, as the sixth step, follow up. That means documenting everything and seeing how it impacts things. It also means using the incident management report to understand how the development went and how the actions the company just took improved the incident side. The benefits are clear: better MTTR, better SLOs, an increase in GGR (gross gaming revenue), and my favorite one, employee satisfaction, with less workload.
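The resource-allocation idea is to use past incidents to estimate the people needed for future ones, which can be sketched as a simple workload calculation. This sketch assumes a hypothetical history of (incident type, developers involved, hours spent) tuples and an expected incident count per type for the next period. It also assumes a hypothetical number of hours each developer can dedicate to incident work in that period.

```python
import math
from collections import defaultdict

# Hypothetical history: (incident type, developers involved, hours spent).
HISTORY = [
    ("transactional", 3, 6.0),
    ("transactional", 2, 4.0),
    ("login", 1, 1.5),
    ("login", 1, 2.5),
]


def person_hours_per_incident(history):
    """Average person-hours (developers x hours) spent per incident of each type."""
    totals, counts = defaultdict(float), defaultdict(int)
    for kind, developers, hours in history:
        totals[kind] += developers * hours
        counts[kind] += 1
    return {kind: totals[kind] / counts[kind] for kind in totals}


def developers_needed(expected_incidents, per_incident, hours_per_developer):
    """Developers needed to absorb the expected incident workload in one period."""
    workload = sum(per_incident[kind] * n for kind, n in expected_incidents.items())
    return math.ceil(workload / hours_per_developer)


per_incident = person_hours_per_incident(HISTORY)
# Assumptions: next month's expected incidents, and 40 hours per developer
# reserved for incident work in that month.
print(developers_needed({"transactional": 10, "login": 20}, per_incident, 40))  # 5
```

The estimate is only as good as the history behind it. That is why the follow-up step matters: each month's report refines the person-hours per incident type.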
With better resource allocation, people feel less stressed, they work better, they are happier at work, and this only brings benefits to the company. Now, with all of this said and to conclude the presentation, let's see how incident management reports can revolutionize incident reaction. We've seen all these steps so far, and we can collect information from every single one of them. By leveraging our past, we can prevent, or to some extent predict, the future. This is just a representation of how we can start improvements, alerts, and preventive actions before things happen, because we have now learned that things might happen this way or that. So we can prevent them from happening and be more preventive than reactive. We can improve our platform and increase its stability to reduce the number of incidents we have. All of this will lead to increased revenue, which can be translated into more investment in the platform. Again, learning from our past can teach us many ways to tackle these situations and to act in a more preventive, less reactive way. And that was it for me, guys. I hope you enjoyed it, I hope I was able to awaken a little curiosity in you about reporting, and I hope to see you soon.
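Moving from reactive to preventive work usually means turning historical incident counts into an early warning. Here is a minimal sketch of a trend check. It assumes weekly incident counts per service and a hypothetical rule that flags a service when its recent average exceeds 1.5 times its earlier baseline. Real alerting would typically live in a monitoring stack and use better-tuned rules.

```python
from statistics import mean


def rising_trend(weekly_counts, window=4, factor=1.5):
    """Hypothetical early-warning rule: flag when the mean of the last `window`
    weeks exceeds `factor` times the mean of all earlier weeks."""
    if len(weekly_counts) < 2 * window:
        return False  # not enough history to compare against
    recent = mean(weekly_counts[-window:])
    baseline = mean(weekly_counts[:-window])
    return recent > factor * baseline


# Made-up weekly incident counts per service.
weekly = {
    "transactional": [2, 3, 2, 3, 4, 5, 6, 6],
    "login": [3, 2, 3, 3, 3, 2, 3, 3],
}
alerts = [service for service, counts in weekly.items() if rising_trend(counts)]
print(alerts)  # ['transactional']
```

A flagged service is a prompt to investigate and act before the trend turns into customer-facing incidents. It is not proof that something is broken.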

Andre Carvalho

Incident Management Team Lead @ Sportingtech
