Conf42 Cloud Native 2025 - Online

- premiere 5PM GMT

AI-Powered Root Cause Analysis: Transforming Problem-Solving with Data-Driven Insights

Abstract

Discover how AI-powered Root Cause Analysis slashes downtime costs by 50%, detects system faults with 99% accuracy, and predicts failures before they happen. Don’t miss this game-changing talk on AI’s role in next-gen automation and business resilience.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Machine learning and AI are used so often in industry today. We use them to generate code, write comments around the code, draft emails, et cetera. This has become normal since GPT came into the picture. When we take the same machine learning and apply it to anomaly detection on day-to-day issues like production outages, security issues, or on-call issues, with the single goal of diagnosing the problem before a developer or support engineer tries to debug it, this concept leads to AI-powered RCA, or root cause analysis.

If we look at traditional RCA, it is a pain: a support or on-call engineer goes through logs, reaches out to stakeholders, tries to verify product behavior, and creates detailed documentation. This is stressful and very manual, and a lot of things can go wrong in this manual process, right? In my experience, my first few on-call shifts were heavily dependent on senior engineers, and I barely learned anything. Over time, I was able to understand the product better, debug, and gain significant experience, so that I could predict what might be the root cause of an issue and start the investigation in that direction. Providing the same learning curve to machines and enhancing the concept of RCA will definitely help us detect hidden patterns and provide data points on what went wrong. This reduces analysis by at least 70 percent, and it is accurate while the system learns continuously from each investigation. That sounds cool to me, so I lean towards AI-powered RCAs.

Moving to the core concepts of AI-powered RCA. Machine learning algorithms are one of the core concepts: they process millions of data points and correlate them into patterns. These are self-improving systems that learn from each analysis, which in turn increases the accuracy. The second one is anomaly detection.
Once we establish the baseline pattern metrics, anomaly detection comes into the picture by instantly flagging deviations with precision. The last block is natural language processing, which is required to analyze unstructured data like technical logs, feedback, and support tickets to identify emerging issues.

In order to implement AI-powered RCA, there are some prerequisites. Data collection is not as easy as scraping some web pages. This is the fundamental part of the prerequisites, and here we need human power to categorize things like logs, metrics, and historical incident reports into structured versus unstructured data. This is the core step. I cannot emphasize enough this step and its importance to achieving AI-powered RCA. The second one is AI or ML model selection itself. Selecting and training a pre-trained model or a custom-built model for the domain depends on each industry and company. Then comes the integration with existing IT ecosystems. Another core step is collaboration with teams for human-in-the-loop validation. Then come governance and security measures, where we should be operating under all privacy laws and role-based access controls. The last one is managing incidents: at each incident level, we should manage a feedback loop so that continuous learning is supported.

When it comes to the implementation of AI-powered RCA, we can dive a little deeper into data collection and preprocessing and try to understand what is technically involved. Data collection and preprocessing consist of human intervention in aggregating the logs and metrics, establishing the baselines of metrics, collecting event data from, let's say, Kubernetes, microservices, and cloud providers, implementing the ETL pipelines themselves, and feature engineering, so that each feature can be scoped.
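To make the idea of establishing baselines and flagging deviations concrete, here is a minimal Python sketch. It is not the speaker's implementation: the metric names, the sample values, and the three-standard-deviation threshold are all assumptions chosen for illustration, and a production system would likely use something more robust than a plain z-score.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize historical samples per metric as (mean, standard deviation)."""
    return {name: (mean(values), stdev(values)) for name, values in history.items()}


def flag_deviations(baseline, sample, threshold=3.0):
    """Return metrics whose current value is more than `threshold`
    standard deviations away from the baseline mean (a simple z-score test)."""
    flagged = {}
    for name, value in sample.items():
        if name not in baseline:
            continue  # no baseline yet for this metric
        mu, sigma = baseline[name]
        if sigma == 0:
            continue  # a constant history gives no usable spread
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flagged[name] = round(z, 2)
    return flagged


# Hypothetical historical metrics (assumed values, for illustration only).
history = {
    "latency_ms": [100, 102, 98, 101, 99, 100, 103, 97],
    "error_rate": [0.010, 0.012, 0.009, 0.011, 0.010, 0.010, 0.011, 0.009],
}
baseline = build_baseline(history)

# A new observation: latency has jumped, error rate is within its normal range.
current = {"latency_ms": 180, "error_rate": 0.011}
anomalies = flag_deviations(baseline, current)
print(anomalies)
```

In this sketch only the latency metric is flagged, because its value sits far outside the spread seen in the history, while the error rate stays close to its baseline.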
Each issue can be scoped, that kind of stuff. Then, moving to the next step, which is AI or model development itself: using models like random forest, deep learning, and NLP for log analysis, and applying causal inference and SHAP for explainability in RCA. SHAP, or SHapley Additive exPlanations, is a technique that explains how machine learning models make predictions. Moving to the next step, which is real-time anomaly detection: deploying with Kafka, Prometheus, or ELK is the more technical side of things, along with using supervised learning and graph-based dependency mapping for RCA. The next step is automated remediation and decision support, where we integrate with various IT tools like ServiceNow and PagerDuty for auto-ticketing and leverage reinforcement from there. All these things will obviously lead towards the continuous learning of the AI, and towards security compliance, which is compliance with various privacy laws.

Let's take a deep dive into what actually happens in a model training flow. We start by collecting the data: sources, logs, metrics, and postmortems as well. We filter that in preprocessing and scope it for feature engineering, and then feeding it to the model is the core aspect. The model step is categorized into supervised learning, unsupervised learning, and graph networks, where the model can map various key aspects and understand them by itself. For validation, we store some data to validate the model and provide feedback as well. Then comes the real-time RCA deployment, where the anomaly detection can actually happen and prioritized root causes can be created. This leads to a human feedback loop, where we evaluate what is categorized as an RCA and whether the root cause is actually valid or not, and then update the data accordingly. That entire loop goes back again to train the model itself.
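The graph-based dependency mapping and prioritized root causes mentioned above can be illustrated with a small Python sketch. This is an assumed heuristic, not the method from the talk: it treats an anomalous service as a root-cause candidate when none of its own dependencies are anomalous, and ranks candidates by how many anomalous services depend on them. The service names and the dependency map are hypothetical.

```python
def reachable(depends_on, start):
    """All services that `start` depends on, directly or transitively."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def rank_root_causes(depends_on, anomalous):
    """Rank anomalous services as root-cause candidates.

    A candidate is an anomalous service with no anomalous direct dependency.
    Candidates are ordered by the number of anomalous services that depend
    on them (most impacted first), with the name as a tie-breaker.
    """
    candidates = [
        svc for svc in anomalous
        if not any(dep in anomalous for dep in depends_on.get(svc, []))
    ]
    impact = {
        cand: sum(1 for svc in anomalous if cand in reachable(depends_on, svc))
        for cand in candidates
    }
    return sorted(candidates, key=lambda cand: (-impact[cand], cand))


# Hypothetical service dependency map: each service lists what it calls.
depends_on = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "db"],
    "payments": ["db"],
    "search": ["cache"],
    "db": [],
    "cache": [],
}

# Services currently flagged by anomaly detection (assumed input).
anomalous = {"web", "checkout", "payments", "db", "cache"}

ranked = rank_root_causes(depends_on, anomalous)
print(ranked)
```

Here `db` is ranked first because three other anomalous services depend on it, while `cache` explains only one. In the human feedback loop described in the talk, an engineer would confirm or reject such a ranking, and that verdict would be fed back as training data.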
This in turn leads to remediation automation, suggested fixes, and also runbooks.

So what are the key challenges to implementation? There are many issues, and these are the top five. The first is high initial investment. This entire scale is not easy to implement: we need a lot of data storage, processing power, and AI expertise for picking a model, making sure it is trained well, and all this stuff. The second is data quality and availability issues. Incomplete, inconsistent, or noisy data always leads to model inaccuracies and requires an effective ETL process so that the real-time data pipelines don't cause problems. The third key challenge is false positives and negatives in RCA, which need continuous feedback loops and human validation. If not, false positives are very common. The fourth is integration complexity with existing systems: the solution must work with diverse IT environments and requires API standardization so that interactions can be seamless. The fifth is security and compliance risk itself.

Let's look at the benefits. The benefits of AI-powered RCA are faster incident resolution, because AI automates the root cause analysis and reduces the time to detect and resolve issues. It is proactive in nature and takes measures to prevent issues. The third one is improved accuracy and consistency, which leads to cost savings and efficiency. This is definitely a scalable and adaptable architecture as well. To quantify the benefits, if we look at the table, which shows performance indicators for legacy RCA systems versus AI-enhanced RCA: manufacturing system diagnostic accuracy increased from 78 to 95, incident resolution time decreased from eight hours to four hours, and annual system downtime costs decreased from 20 billion to 10 billion.
Data processing efficiency, on a scale of one to 10, is one for the legacy RCA system, which is more of a manual process, whereas for AI-enhanced RCA it's 10, but don't quote me on that. The system scalability rating definitely improved, from three to nine. So this entire thing reveals that AI-powered RCA delivers substantial improvements across all metrics.

The future of AI-powered root cause analysis is continuous innovation. This is not going to stop here. It is going to transform the entire industry, and AI-powered RCA will become standard across other industries like healthcare, manufacturing, and finance. Technology convergence is around the corner, because integration with IoT sensors, 5G networks, and edge computing will enable instantaneous analysis. Strategic imperative is part of the future for sure: organizations will leverage AI RCA as a cornerstone of digital transformation.

As we move deeper into this digital age, AI-powered root cause analysis will become fundamental to organizational success. Companies that harness these advanced capabilities will achieve unprecedented levels of operational efficiency, dramatically reduce downtime, gain agility, and solve complex technological issues. In this landscape, this evolution in analytical capability won't just be an operational advantage. It will become a critical differentiator in maintaining market leadership by driving innovation.

Thank you for your time. Have a nice day. See you.
Nikhil Ramashasthri

Staff Software Engineer @ Turo

Nikhil Ramashasthri's LinkedIn account


