Transcript
This transcript was autogenerated. To make changes, submit a PR.
Machine learning and AI are now widely used across industries.
We use them to generate code or comments around the code, or
to write up an email, et cetera.
This has become normal since GPT came into the picture.
When we take the same machine learning and apply it to anomaly detection
on day-to-day issues like production outages, security issues, or on-call
issues, we have one single goal: diagnosing the problem before a developer
or support engineer tries to debug it.
This concept leads to AI-powered RCA, or Root Cause Analysis.
If we look at traditional RCA, it is a pain: a support or on-call
engineer goes through logs, reaches out to stakeholders,
verifies product behavior, and creates detailed documentation.
This is stressful and takes a lot of manual effort.
And a lot of things can go wrong in this manual process, right?
In my experience, my first few on-call shifts were
heavily dependent on senior engineers, and I barely learned anything. Over time,
I was able to understand the product better, debug on my own, and gain
significant experience, so that I can now predict what might be the root
cause of an issue and start the investigation in that direction.
Providing the same learning curve to machines and enhancing the concept of
RCA will definitely help us detect
hidden patterns and provide data points on what went wrong.
This reduces the analysis work by at least 70 percent, and it stays accurate
because the system learns continuously from each investigation.
That sounds cool to me, so I lean towards AI-powered RCA.
Moving to the core concepts of AI-powered RCA.
Machine learning algorithms are one of the core concepts: they
process millions of data points and correlate them into patterns.
These are self-improving systems that learn from each analysis,
which in turn increases the accuracy.
The second one is anomaly detection.
Once we establish the baseline pattern metrics, anomaly detection
comes into the picture, instantly flagging deviations
with precision. The last block is natural language processing, which
is required to analyze unstructured data like technical logs, feedback, and support
tickets to identify emerging issues.
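To make the baseline-and-deviation idea concrete, here is a minimal sketch that assumes a simple z-score rule over a latency metric. The function name, the sample values, and the threshold of 3 standard deviations are hypothetical choices for illustration, not something taken from the talk.

```python
from statistics import mean, stdev

def flag_anomalies(baseline, observed, threshold=3.0):
    """Return (index, value) pairs in `observed` that deviate from the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    flagged = []
    for i, value in enumerate(observed):
        # z-score: how many standard deviations the value sits from the baseline mean
        z = (value - mu) / sigma if sigma > 0 else 0.0
        if abs(z) > threshold:
            flagged.append((i, value))
    return flagged

# Hypothetical baseline collected during normal operation, then live readings.
baseline_latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]
live_latency_ms = [101, 99, 250, 100]
print(flag_anomalies(baseline_latency_ms, live_latency_ms))
```

A production system would likely use a rolling or seasonal baseline rather than one fixed sample, but the flagging step would look much the same.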
In order to implement AI-powered RCA, there are some prerequisites.
The first is data collection, which is not as easy as scraping some web pages.
This is the fundamental part of the prerequisites.
Here we need human power to categorize things like logs, metrics,
and historical incident reports into structured versus unstructured data.
This is a core step.
I cannot emphasize this step enough, or its importance in achieving AI-powered RCA.
The second one is AI or ML model selection itself.
The process of selecting and training a pre-trained or custom-built model
for the domain depends on each industry and company.
Then comes the integration with existing IT ecosystems.
Another core step is collaboration with teams for human-in-the-loop validation.
Then come governance and security measures, where we should be
operating under all privacy laws and role-based access controls.
The last one is incident management: at each incident level, we
should manage a feedback loop so that continuous learning is supported.
When it comes to implementing AI-powered RCA, we can dive a little
bit deeper into the data collection and pre-processing part and try
to understand what is technically involved here.
Data collection and pre-processing consists of human intervention
in aggregating the logs and metrics, establishing the baselines of metrics,
and collecting event data from, let's say, Kubernetes, microservices,
and cloud providers, as well as implementing the ETL pipelines and feature
engineering, so that each feature and
each issue can be scoped.
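As a rough illustration of the pre-processing and feature engineering step, the sketch below parses raw log lines into structured records and derives one simple feature, the error count per service. The log format, the regular expression, and the function names are all assumptions made for this example; real pipelines would adapt them to their own log schema.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <LEVEL> <service> <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+) (?P<message>.*)$"
)

def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

def error_counts_by_service(lines):
    """Feature: number of ERROR lines per service, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "ERROR":
            counts[record["service"]] += 1
    return dict(counts)

sample_logs = [
    "2024-01-01T10:00:00Z INFO checkout request completed",
    "2024-01-01T10:00:01Z ERROR payments connection timed out",
    "2024-01-01T10:00:02Z ERROR payments connection timed out",
    "2024-01-01T10:00:03Z ERROR checkout upstream unavailable",
    "malformed line",
]
features = error_counts_by_service(sample_logs)
print(features)
```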
Then moving to the next step, which is AI or ML model development itself.
This means using models like random forest, deep learning, and NLP for log analysis,
and applying causal inference and SHAP for explainability in RCA.
SHAP, or SHapley Additive exPlanations, is a
technique that explains how machine learning models make predictions.
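To show the idea behind this kind of explanation, here is a small sketch that computes exact Shapley values by brute force for a toy scoring function. It does not use the SHAP library; the model `risk_score`, its weights, and the feature names are hypothetical, and the brute-force approach only works for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley value per feature, replacing absent features with baseline values."""
    names = list(instance)
    n = len(names)

    def value(present):
        # Features in `present` take the instance value, the rest the baseline value.
        x = {f: (instance[f] if f in present else baseline[f]) for f in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Standard Shapley weight for a coalition of this size.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi

def risk_score(x):
    # Hypothetical linear incident-risk model, used only for illustration.
    return 2.0 * x["error_rate"] + 0.5 * x["latency"] + 1.0 * x["restarts"]

instance = {"error_rate": 5, "latency": 10, "restarts": 1}
baseline = {"error_rate": 1, "latency": 8, "restarts": 0}
contributions = shapley_values(risk_score, instance, baseline)
print(contributions)
```

The contributions add up to the difference between the prediction for the incident and the prediction for the baseline, which is what lets an engineer read them as "how much each signal pushed the score".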
Moving to the next step, which is real-time anomaly detection.
Deploying with Kafka, Prometheus, or ELK is the more
technical side of things, along with using supervised learning and graph-based
dependency mapping for RCA. The next step is automated remediation and
decision support, where we integrate with various IT tools like ServiceNow and
PagerDuty for auto-ticketing and leverage reinforcement learning from there.
All these things lead towards the continuous learning
of the AI and towards security compliance, which is compliance with various privacy laws.
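One simple way to picture graph-based dependency mapping is sketched below: given which services depend on which, an alerting service is treated as a root cause candidate only if none of its own dependencies are alerting. The service names, the graph, and this heuristic are assumptions for illustration, not a description of any specific tool.

```python
def root_cause_candidates(dependencies, alerting):
    """Return alerting services whose direct dependencies are all healthy."""
    candidates = []
    for service in sorted(alerting):
        unhealthy_deps = [d for d in dependencies.get(service, []) if d in alerting]
        # If something this service depends on is also alerting, the problem
        # more likely originates further down the dependency chain.
        if not unhealthy_deps:
            candidates.append(service)
    return candidates

# Hypothetical service graph: each key depends on the services in its list.
dependencies = {
    "frontend": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "search": [],
    "database": [],
}
alerting = {"frontend", "checkout", "payments", "database"}
print(root_cause_candidates(dependencies, alerting))
```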
Let's take a deep dive into understanding what
happens in a model training flow.
We start by collecting the data from
sources such as logs, metrics,
and postmortems, then filter it
in the preprocessing and scoping stage
for feature engineering. Feeding it to the model is the core
aspect. The model stage is categorized into supervised learning, unsupervised
learning, and graph networks, where the model can map various key aspects and understand
them by itself. For validation, we store some data to validate the model
and provide feedback as well.
Then comes the real-time RCA deployment, where the anomaly
detection actually happens and prioritized root causes are created.
This leads to a human feedback loop,
where we evaluate what is categorized as an RCA and whether the
root cause is actually valid or not,
and then update the data accordingly.
That entire loop goes back to train the model again.
This in turn leads to remediation automation, suggested
fixes, and runbooks.
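The human feedback loop described here can be sketched very roughly as a store of engineer verdicts that re-ranks future root cause suggestions. The class name, the cause labels, and the smoothed confirmation rate used as a score are all hypothetical; a real system would feed the verdicts back into model training rather than a simple counter.

```python
class FeedbackStore:
    """Keeps human verdicts on suggested root causes and ranks causes by them."""

    def __init__(self):
        self.confirmed = {}
        self.rejected = {}

    def record(self, cause, is_valid):
        # Store one engineer verdict for a suggested root cause.
        bucket = self.confirmed if is_valid else self.rejected
        bucket[cause] = bucket.get(cause, 0) + 1

    def score(self, cause):
        # Smoothed confirmation rate; a cause with no feedback scores 0.5.
        c = self.confirmed.get(cause, 0)
        r = self.rejected.get(cause, 0)
        return (c + 1) / (c + r + 2)

    def prioritize(self, causes):
        # Highest confirmation rate first.
        return sorted(causes, key=self.score, reverse=True)

store = FeedbackStore()
store.record("db_connection_pool", True)
store.record("db_connection_pool", True)
store.record("dns_failure", False)
ranked = store.prioritize(["dns_failure", "cache_eviction", "db_connection_pool"])
print(ranked)
```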
So what are the key challenges to implement this, right?
There are many issues, and these are the top five.
The first key challenge is the high initial investment.
This entire scale is not easy to implement.
We need a lot of data storage, processing power, and AI expertise:
picking a model, making sure it is trained well, and so on.
The second one is data quality and availability issues.
Incomplete, inconsistent, or noisy data always leads to model inaccuracies
and requires an effective ETL process so that the real-time data pipelines
don't give any issues. The third key challenge,
which is false positives and negatives in RCA,
needs continuous feedback loops and human validation.
If not, false positives are very common.
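False positives and false negatives can be tracked with precision and recall once engineers have validated the system's calls. The sketch below assumes each record is a pair of (flagged by the system, confirmed by a human); the function name and the sample records are made up for illustration.

```python
def rca_quality(records):
    """Compute precision and recall from (flagged, confirmed) boolean pairs."""
    tp = sum(1 for flagged, confirmed in records if flagged and confirmed)
    fp = sum(1 for flagged, confirmed in records if flagged and not confirmed)
    fn = sum(1 for flagged, confirmed in records if not flagged and confirmed)
    # Precision drops as false positives rise; recall drops as false negatives rise.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"false_positives": fp, "false_negatives": fn,
            "precision": precision, "recall": recall}

# Hypothetical validation results from five incidents.
validated = [(True, True), (True, False), (False, True), (True, True), (False, False)]
quality = rca_quality(validated)
print(quality)
```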
The fourth one is integration complexity with existing systems.
The solution must work with diverse IT environments, and it requires API
standardization so that seamless interactions can happen.
And the fifth one is security and compliance risk itself.
Let's look at the benefits.
The first benefit of AI-powered RCA is faster incident resolution: AI automates
the root cause analysis and reduces the time to detect and resolve issues.
The second is that it is proactive in nature and takes measures
to prevent issues.
The third one is improved accuracy and consistency, which leads to
cost savings and efficiency.
This is definitely a scalable and adaptable
architecture as well.
To quantify the benefits,
let's look at the table, which shows the
performance indicators for legacy RCA systems versus AI-enhanced RCA.
Manufacturing system diagnostic accuracy increased from 78 to 95 in the
industries. Incident resolution time
decreased from eight hours to four hours. Annual system downtime costs
decreased from 20 billion to 10 billion.
Data processing efficiency, on a scale of one to 10, is one for the legacy RCA
system, which is more of a manual process,
whereas for AI-enhanced it's 10, but don't quote me on that.
System scalability rating definitely improved from three to nine.
So this entire thing reveals that AI-powered RCA delivers substantial
improvements across all metrics.
The future of AI-powered root cause analysis is continuous innovation.
This is not going to stop
here.
It is going to transform the entire industry, and AI-powered
RCA will become standard across other industries like
healthcare, manufacturing, and finance.
Technology convergence is around the corner, because integration with IoT
sensors, 5G networks, and edge computing will enable instantaneous analysis.
It will surely be a strategic imperative as well: organizations
will leverage AI RCA as a cornerstone of digital transformation.
As we move deeper into this digital age, AI-powered root cause analysis will become
fundamental to organizational success.
Companies that harness these advanced capabilities will achieve unprecedented
levels of operational efficiency, dramatically reduce downtime, gain
agility, and solve complex technological issues.
In the landscape itself, this evolution in analytical capability won't just be
an operational advantage. It will become a critical differentiator in maintaining
market leadership by driving innovation.
Thank you for your time.
Have a nice day. See you.