Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm a solution architect working at Callans.
Welcome to the Conf42 Site Reliability Engineering conference, taking place on April 17th.
Without much delay, let me get into today's topic and into the deck.
Today I'm speaking about how we can enhance SRE practices using AI, and how we can build more resilient cloud-native platforms.
In this session, we'll identify how AI capabilities are transforming site reliability engineering practices in modern organizations, and we'll also explore some practical implementations that drive improvements in areas such as reliability, efficiency, and cost management.
First, let's look at the convergence between AI and SRE.
We'll start with traditional SRE practices, where manual incident response workflows require human intervention at every step of reporting and handling incidents.
Monitoring is reactive, focused on current system status; we cannot predict anything about the future here.
Alerting is mostly static and rule-based, firing when a metric crosses some fixed threshold.
And analysis is always retrospective: we go back over historical data and do the necessary performance work after the fact.
That is how we have been dealing with the traditional practices. Now, what changes when it comes to AI-driven practices?
Here you get automation, along with self-healing capabilities for any problem that arises.
Monitoring becomes proactive, forecasting potential issues.
Rather than static alerts, you get intelligent alerting with dynamic priority adjustments.
And you get forward-looking insights through pattern recognition.
Comparing these two approaches is very important. With the traditional approach, you are always handling static rules and manual effort; with AI in the SRE space, you work in a more automated, predictive, and intelligent way, applying pattern recognition. Based on these two approaches, you can easily follow where I'm navigating in the following slides.
This way, mean time to detect (MTTD) is going to be reduced by 25%, because we can identify potential issues earlier.
These are some of the key performance improvements.
Service availability is going to improve by almost 30%.
Early problem detection improves by around 23%.
And false alerts are reduced by almost 27%, because of AI's involvement in site reliability engineering.
Now, in this pyramid, we'll go from bottom to top through AI-driven observability.
At the bottom is comprehensive telemetry: data collection that optimizes the signal-to-noise ratio while ensuring complete visibility across the stack.
This is the most robust layer, where you capture all the information and do the required processing.
Then we apply machine learning models to automate the baselines. How this helps is that you get dynamic thresholds that continuously adapt to seasonal variations and growth trends, and you build patterns on top of them.
Once you have comprehensive telemetry and automated baselines, you can find the patterns.
That is where pattern recognition plays its part: advanced algorithms that detect non-obvious relationships between seemingly unrelated systems and services.
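To make the automated-baseline idea concrete, here is a minimal sketch of a dynamic threshold that adapts to recent data. The window size, sigma multiplier, and class name are illustrative choices of mine, not anything specified in the talk.

```python
from collections import deque

class DynamicBaseline:
    """Rolling baseline whose thresholds adapt as new samples arrive.

    A minimal sketch: window=60 samples and a 3-sigma band are
    illustrative defaults, not recommendations from the talk.
    """

    def __init__(self, window=60, sigmas=3.0):
        self.samples = deque(maxlen=window)  # recent observations only
        self.sigmas = sigmas                 # width of the "normal" band

    def update(self, value):
        self.samples.append(value)

    def bounds(self):
        # Mean +/- k standard deviations over the recent window.
        n = len(self.samples)
        mean = sum(self.samples) / n
        var = sum((x - mean) ** 2 for x in self.samples) / n
        return mean - self.sigmas * var ** 0.5, mean + self.sigmas * var ** 0.5

    def is_anomalous(self, value):
        if len(self.samples) < 10:  # not enough history to judge yet
            return False
        lo, hi = self.bounds()
        return value < lo or value > hi
```

Because the window slides, the band drifts along with seasonal variation and growth trends instead of staying pinned to a static threshold.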
On top of pattern recognition we have intelligent visualization. From the AI point of view, this means reliable, real-time, context-based dashboards that automatically highlight the critical metrics and the potential issues that may arise.
Now let's go to anomaly detection using machine learning.
First, we start with baseline establishment, where sophisticated algorithms learn normal operating patterns and performance signatures across distributed systems.
Next is continuous monitoring: real-time telemetry streams are analyzed against the established baselines with microsecond precision.
Then comes deviation detection: is the metric moving in line with the plan, inside the threshold or not, and how is it behaving? Based on that, the system flags deviations.
Once this setup is in place, false positives are a natural behavior of these algorithms, so the loop continues into model refinement.
The machine learning models continuously evolve through automated feedback loops; in this spiral we can improve the accuracy over time.
That's where machine learning helps to find anomalies in the SRE space.
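As a toy illustration of the model-refinement step just described, here is a detector whose sensitivity band is adjusted by operator feedback. The widening and tightening factors are arbitrary values I chose for illustration, not anything from the talk.

```python
class RefinableDetector:
    """Threshold detector refined through a feedback loop.

    Illustrative sketch: each confirmed false positive widens the
    band, each confirmed miss tightens it. Factors are arbitrary.
    """

    def __init__(self, baseline=100.0, band=10.0):
        self.baseline = baseline
        self.band = band  # allowed deviation before flagging

    def check(self, value):
        # Flag anything that strays too far from the learned baseline.
        return abs(value - self.baseline) > self.band

    def feedback(self, was_false_positive):
        # Automated feedback loop: widen the band on false positives,
        # tighten it when a real incident slipped through.
        if was_false_positive:
            self.band *= 1.2
        else:
            self.band *= 0.9
```

Each pass through the loop nudges the band, which is the "spiral" that improves accuracy over time.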
This slide talks about how AI is powering chaos engineering.
First, intelligent test design: AI algorithms analyze system dependencies to identify critical failure points and design targeted experiments with maximum learning potential.
Second is automated execution: precisely controlled failure simulations, deployed during low-traffic windows with comprehensive safety mechanisms to prevent cascading production impacts. We need to identify the low-business-hour windows and execute the necessary scripts there, so that if any problem comes, the impact is reduced.
Then we have real-time analysis: sophisticated monitoring captures system degradation patterns and compares actual resilience metrics against machine-learning-predicted failure responses.
Finally, resilience implementation: identified vulnerabilities trigger automated remediation workflows that implement infrastructure changes, hardening the system against future disruptions.
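The low-traffic-window and safety-guardrail ideas above can be sketched roughly as follows; the window hours, error-rate limit, and function names are all hypothetical choices for this sketch.

```python
from datetime import datetime, timezone

# Hypothetical low-traffic window for this sketch: 02:00-05:00 UTC.
LOW_TRAFFIC_HOURS = range(2, 5)
MAX_ERROR_RATE = 0.01  # abort guardrail; illustrative value

def may_run_experiment(now, current_error_rate):
    """Gate a chaos experiment on the traffic window and system health."""
    in_window = now.hour in LOW_TRAFFIC_HOURS
    healthy = current_error_rate < MAX_ERROR_RATE
    return in_window and healthy

def run_with_guardrail(inject_failure, rollback, error_rate_fn):
    """Inject a failure, then roll back immediately if errors cascade."""
    inject_failure()
    if error_rate_fn() >= MAX_ERROR_RATE:
        rollback()   # safety mechanism: stop before the impact spreads
        return "aborted"
    rollback()       # experiments always end with a clean rollback
    return "completed"
```

The point of the guardrail is exactly the talk's: even a well-designed experiment must abort itself the moment the blast radius starts to grow.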
Next, natural language processing for post-incident analysis. Once we have those pieces in place, here is how NLP works for the analysis; we'll go through it one step at a time.
First, automated incident documentation: AI transcribes and organizes the communication during the incident.
Then pattern recognition: the system identifies similarities with previous incidents.
Then root cause analysis: NLP extracts causal relationships from the technical discussions and pulls out all the information needed for analysis-based enhancements.
Finally, it automatically updates the runbooks and documentation related to the entire postmortem analysis: what happened, why the problem occurred, and how we can improve our predictive algorithms in the future.
This step is very important for lessons-learned purposes.
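A very simplified stand-in for the "identify similar previous incidents" step might look like this; a real system would use proper NLP models, but even a Jaccard word-overlap heuristic shows the shape of the idea.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for NLP preprocessing."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(incident_a, incident_b):
    """Jaccard overlap between two incident write-ups (0.0 to 1.0)."""
    a, b = tokens(incident_a), tokens(incident_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(new_incident, history):
    """Return the past incident whose text best matches the new one."""
    return max(history, key=lambda past: similarity(new_incident, past))
```

Pointing responders at the closest past incident is what makes the runbook updates and lessons-learned loop compound over time.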
Let's take one case study before jumping into the next topic. We are focusing on a financial services platform here, so it is all about dollar numbers.
The challenge: take a critical payment processing system, where an outage during business hours means almost $150K of revenue loss per hour of downtime.
If you implement sophisticated machine learning algorithms, you can analyze the transaction flows and identify problems before they cascade into failures.
Mean time to resolution dropped dramatically, from 45 minutes to 12 minutes, improving customer experience; that automatically helps push uptime higher, which inherently helps customer satisfaction.
As for return on investment, it created almost $3.2 million in savings through enhanced system availability and prevented revenue-impacting outages.
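Using the figures just quoted ($150K per hour, MTTR cut from 45 to 12 minutes), here is a back-of-the-envelope calculation of the MTTR savings alone. The incidents-per-year count is my own assumption, and the $3.2M figure also includes prevented outages, so this sketch only covers part of it.

```python
# Figures taken from the case study above; incident count is assumed.
REVENUE_LOSS_PER_HOUR = 150_000   # dollars
OLD_MTTR_MIN = 45
NEW_MTTR_MIN = 12

def outage_cost(mttr_minutes, incidents):
    """Revenue lost across a number of outages at a given MTTR."""
    return REVENUE_LOSS_PER_HOUR * (mttr_minutes / 60) * incidents

def annual_savings(incidents_per_year):
    """Savings from cutting MTTR alone, for an assumed incident count."""
    return (outage_cost(OLD_MTTR_MIN, incidents_per_year)
            - outage_cost(NEW_MTTR_MIN, incidents_per_year))
```

At a hypothetical ten incidents a year, the MTTR reduction alone is worth $825K; the remainder of the $3.2M comes from outages that never happened.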
Next we go to AI-enhanced capacity planning.
For this section, I'll just highlight the top headings: historical analysis, demand prediction, scenario modeling, and automated provisioning. It works as a continuous loop.
First, we identify what data is available. Then we find out what the demand is. Then we find a model that works. Then we automate and orchestrate those steps: how we can practically scale the infrastructure, whether any problem is coming up, and how we can effectively eliminate the cost of over-provisioning.
Simply put: you assign only as much infrastructure as you need, and if a peak load is coming, the models are efficient enough to identify it and improve the elasticity.
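The demand-prediction and provisioning loop above can be sketched minimally, using a simple linear trend in place of the real models; the headroom factor and capacity numbers are assumptions of mine.

```python
import math

def linear_forecast(history, steps_ahead):
    """Fit y = a + b*t by least squares and extrapolate demand.

    A deliberately simple stand-in for the demand-prediction
    models the talk describes.
    """
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)

def instances_needed(predicted_load, capacity_per_instance, headroom=1.2):
    """Provision for predicted demand plus a safety margin."""
    return math.ceil(predicted_load * headroom / capacity_per_instance)
```

Historical analysis feeds `linear_forecast`, the forecast feeds `instances_needed`, and automated provisioning acts on the result: the same loop, in miniature.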
How can we implement this strategy?
First, assess the current maturity: evaluate the existing SRE practices and identify the gaps.
Start small and experiment around it.
Then build internal expertise, developing cross-functional SRE skills within the teams.
Then scale successfully across the entire organization.
In short: first try it and test it until you feel comfortable, then talk with all the teams and implement it within your own area, and then you can go for company-wide changes. That's how we scale up in this exercise.
What are the key takeaways?
Measure the impact: based on these practices, you can quantify the improvements in reliability metrics and cost efficiency.
Treat it as evolution, not revolution: start small, with targeted implementations that complement existing practices, then build up the skills.
For any successful adoption, you need to invest in both technical capabilities and organizational culture; that enables a smooth transition and a better future with advancing technologies.
And of course, look at the forward-looking areas, where SRE will increasingly rely on artificial intelligence to manage complex, distributed systems.
Now let's look at the role of enterprise application integration in modern cloud environments for SRE.
In today's fast-paced, distributed systems, enterprises are leveraging cloud infrastructure to host and manage a wide variety of applications. Integration of these applications is critical to having a unified system that functions smoothly regardless of the underlying infrastructure.
Why do SRE and integration matter together? Every time you use API protocols to connect your backend systems and frontend applications, integration plays a critical role; that is where I have built expertise in my career.
If you go with cloud-based integration, the cloud provides flexibility and scalability for integrating applications, data sources, and systems across on-premises, hybrid, or multi-cloud environments.
If you go with iPaaS, that is integration platform as a service: there are multiple iPaaS solutions, such as MuleSoft and Azure Logic Apps, that help integrate applications and data sources in the cloud.
Beyond iPaaS, you have PaaS offerings such as SAP Integration Suite, where you can develop your own integration models; that also helps in the integration area.
When it comes to API management, APIs are critical for cloud and AI in the SRE space; managing and monitoring these APIs is crucial to ensure uptime, scalability, and reliability.
That's the AI-and-SRE relation I was talking about: there are so many APIs talking back and forth using REST services and other lightweight protocols.
What are the key AI techniques applied to site reliability engineering? You have predictive analytics, anomaly detection, root cause analysis, and automated decision making.
In predictive analytics, AI-driven models can forecast system behavior and detect potential failures before they strike, so that SRE teams can proactively manage and address those risks. For example: predicting traffic spikes and resource bottlenecks, enabling auto-scaling in cloud environments.
In anomaly detection, AI models can analyze large volumes of data to detect anomalies or performance degradation in integrated systems, allowing faster identification of issues in the cloud environment.
For example, AI-powered systems can detect unusual patterns in API calls or database queries, triggering automated alerts to SRE teams.
Root cause analysis can assist in diagnosing the root cause of failures by analyzing logs, metrics, and system status across integrated applications; this accelerates troubleshooting and incident response. For example, these tools can correlate performance issues across microservices and identify the exact service causing latency, enabling faster decisions for the teams working in this area.
Finally, automated decision making.
AI can automate decision making in the cloud, for areas such as load balancing. Yes, we already have load balancers today, with F5 and others, but this helps boost them much further, as well as scaling operations, based on real-time analysis of application performance and infrastructure health. Machine learning models predict server load and automatically adjust resources to ensure optimal performance and uptime.
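The scale-up/scale-down decision can be sketched with a proportional rule, similar in spirit to Kubernetes' Horizontal Pod Autoscaler; the target utilization and replica bounds here are illustrative values, not recommendations.

```python
import math

def scaling_decision(predicted_util_pct, current_replicas,
                     target_pct=60, min_replicas=2, max_replicas=20):
    """Pick a replica count that brings predicted CPU near the target.

    Proportional rule: desired = current * predicted / target,
    clamped to sane bounds. All thresholds are illustrative.
    """
    desired = math.ceil(current_replicas * predicted_util_pct / target_pct)
    return max(min_replicas, min(desired, max_replicas))
```

Feeding this a *predicted* utilization rather than a measured one is exactly the shift the talk describes: scaling ahead of the spike instead of reacting to it.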
In conclusion, the future of cloud AI and SRE is a continuous evolution. As you know, cloud services and AI technologies evolve every day, and SRE teams will have even more powerful tools in the coming days to ensure the reliability and performance of integrated applications across the enterprise; along the way, you can keep improving the automation. The combination of cloud-based integration, AI, and SRE principles will lead to more automated systems, reducing manual intervention while improving system uptime at the same time.
In this fast-paced world, you have to have good collaboration. SRE teams will need to collaborate closely with DevOps teams, cloud architects, and data scientists. For what purpose? To build resilient, scalable, and intelligent systems.
That was the goal of today's SRE topic, including the financial system example and the integration examples: how we can enable our systems with future technologies, how we can utilize machine learning models, and how we can use natural language processing models. That helps build better and stronger SRE teams going forward, across the enterprise applications of the whole organization.
Thank you.