Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Sona. I'm here to discuss how AutoML is transforming enterprise site reliability engineering practices.
We will discuss in detail how organizations are leveraging this approach.
Site reliability engineering is undergoing a fundamental transformation through AutoML, with evolving approaches redefining how organizations monitor, predict, and respond to reliability challenges across industries. We will explore how AutoML is democratizing powerful reliability analysis and prediction, and the many services we can integrate with it.

In traditional operations, manual monitoring was quite challenging: it was a time-consuming process, it produced a lot of errors, and sometimes the results would go in different directions.
In early SRE adoption, software engineering principles were combined with basic automation, log monitoring, and anomaly detection. Modern SRE goes further, emphasizing service level objectives and incident management and how they can be leveraged in this process.
We can stop at modern SRE, but nowadays, with the cloud, we have AutoML-enhanced SRE. With integrated AutoML we get predictive analytics, alerting mechanisms, and mitigation of alerts, anomalies, and errors. All of this can be predicted in advance, so we benefit and projects run successfully. The journey from traditional operations to AutoML-enhanced SRE is a continuous, proactive process: as system complexity grows, we keep reassessing and incorporating these practices.
Why do enterprises actually need site reliability engineering? Nowadays scalability demands are high: application traffic is high, infrastructure footprints have grown, and architectural complexity has increased. Development has accelerated too, with continuous processes, faster release cycles, and rapidly evolving SDLC practices. User expectations have also risen; users will not tolerate even a fraction of a second of downtime. Businesses cannot afford the revenue loss, and outages eventually impact financial processes and demand extra engineering effort. Brand damage also occurs if we do not identify problems at an early stage and address them.
So how is AutoML transforming SRE? One use case is incident prediction and prevention: if we can predict a server failure, traffic spike, or error spike in advance, we can mitigate it before it happens. Another is root cause analysis, which is useful most of the time because AutoML correlates metrics, identifies patterns in the logs, and shows how events occur in sequence. That lets us narrow down the most likely root causes, and it reduces downtime. It helps SREs fix things faster: once we know the problem, we can identify the fix immediately and act preventively. That is why root cause analysis becomes easier and fixes happen faster.
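A minimal sketch of what the prediction side could look like in practice, assuming monitoring data is exported with hypothetical columns such as cpu_pct, error_rate, and latency_ms; an unsupervised detector flags windows that look unlike normal operation so they can be investigated before they become incidents.

```python
# Minimal sketch: flag anomalous metric windows that often precede incidents.
# The CSV path and column names are assumptions for illustration.
import pandas as pd
from sklearn.ensemble import IsolationForest

metrics = pd.read_csv("server_metrics.csv")  # hypothetical monitoring export
features = metrics[["cpu_pct", "error_rate", "latency_ms"]]

# Fit an unsupervised detector on recent behaviour; ~1% of windows assumed anomalous.
detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(features)

# predict() returns -1 for windows that look unlike normal operation.
metrics["anomaly"] = detector.predict(features)
print(metrics[metrics["anomaly"] == -1].head())
```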
When it comes to capacity planning, AutoML forecasts system load and potential outages. It can tell us how traffic surges and what types of overload can occur. If we predict them, we can increase instance sizes, infrastructure allocation, and bandwidth in advance, and we can save money on cloud services through auto-scaling: when demand increases, the services scale up, and when we are not using much capacity most of the time, we can reduce compute and save cost.
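As a rough illustration of that capacity decision, here is a sketch that projects request volume with a simple linear trend and converts the forecast into an instance count; the file name, column names, and per-instance capacity are all assumptions, and a real deployment would use whatever forecasting model the AutoML search selects.

```python
# Capacity-planning sketch: project near-term traffic and size the fleet.
# Column names, the CSV, and CAPACITY_PER_INSTANCE are illustrative assumptions.
import math
import pandas as pd
from sklearn.linear_model import LinearRegression

traffic = pd.read_csv("requests_per_minute.csv")  # hypothetical: columns t (minutes), rps
X, y = traffic[["t"]], traffic["rps"]

# Simple linear trend as a stand-in for whatever model the AutoML search picks.
model = LinearRegression().fit(X, y)
next_hour = pd.DataFrame({"t": [traffic["t"].max() + 60]})
forecast = model.predict(next_hour)[0]

CAPACITY_PER_INSTANCE = 500  # assumed sustainable requests/sec per instance
instances_needed = math.ceil(forecast / CAPACITY_PER_INSTANCE)
print(f"forecast ~{forecast:.0f} rps -> scale to {instances_needed} instances")
```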
The next use case is automated alert tuning. If too many alerts are coming in, the AutoML model identifies whether they are true or false. If an alert is false, it can be suppressed, because, much like spam email, inaccurate alerts tempt us to take them seriously and we end up spending time fixing things that have no real impact on the process. With that analysis handled by automated alert tuning, only meaningful alerts are sent to engineers, so they can concentrate on the ones that are really useful.
With self-healing systems, we can define some of the remediation ourselves. For example, if downtime is high during a particular period and only a restart is required, we can tune it so that when a metric reaches a threshold value, the rules we defined fire and the service restarts automatically. We do not need to give it attention; it fixes itself.
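A small sketch of such a rule, under the assumption that the remediation is a Kubernetes rollout restart and that a 5% error rate is the trigger; both the threshold and the deployment name are placeholders.

```python
# Self-healing sketch: restart a service when a health metric crosses a threshold.
# The threshold, deployment name, and kubectl-based remediation are assumptions.
import subprocess

ERROR_RATE_THRESHOLD = 0.05  # assumed: act when more than 5% of requests fail

def restart_service(deployment: str) -> None:
    # Illustrative remediation; any runbook action could be substituted here.
    subprocess.run(["kubectl", "rollout", "restart", f"deployment/{deployment}"], check=True)

def self_heal(current_error_rate: float, deployment: str = "checkout-api") -> None:
    if current_error_rate > ERROR_RATE_THRESHOLD:
        restart_service(deployment)
        print(f"restarted {deployment} (error rate {current_error_rate:.2%})")

self_heal(0.08)  # example invocation with an elevated error rate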
Eventually, we can get all of these capabilities implemented, so consider how traditional ML used to work. It was a massive, time-consuming process, and the effort was also high. Data acquisition is common to both: in each case, the predefined data has to be acquired, whether traditional or automated. But after that come data exploration, data preparation and feature engineering, then model selection, model training, hyperparameter and performance tuning, and model evaluation. In traditional ML, all of these steps are manual, whereas with AutoML they are handled by the automated machine learning system: we only insert the data and give some metrics, and if we are budget conscious we can give a time and cost budget, so that it considers all the parameters we have provided.
This part is essentially a black box. If you are really sure about the process, you can specify the features, algorithms, and other parameters yourself; if you are not, you can let the AutoML take its own approach. It then runs various train and test splits, produces the results, and gives us a ranking of candidate models with their accuracy. From there, it is your choice to review them, decide among them, and get the benefit. That is how the AutoML process stays very easy, and we can use it without much data science knowledge.
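To make that concrete, here is a minimal sketch using the open-source FLAML library, one of several AutoML options; the dataset file and the incident_next_hour label are assumptions, and cloud offerings such as Azure AutoML, Google Cloud, or AWS expose comparable hand-over-the-data workflows.

```python
# AutoML sketch: hand over labelled data plus a time budget and let the search
# choose the model and hyperparameters. File and column names are hypothetical.
import pandas as pd
from flaml import AutoML

data = pd.read_csv("incident_training_data.csv")  # hypothetical labelled dataset
X = data.drop(columns=["incident_next_hour"])
y = data["incident_next_hour"]

automl = AutoML()
automl.fit(X_train=X, y_train=y, task="classification",
           time_budget=300, metric="roc_auc")     # 5-minute search budget

print("best model:", automl.best_estimator)
print("best config:", automl.best_config)
```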
Coming to how AutoML transforms SRE workflows, we can do this in a structured way. First comes data collection, and feature engineering can be derived from the monitored resources. Then come model selection and training: the algorithms and models follow the reliability scenario. If it is a prediction scenario, it will choose prediction models; if it is continuous monitoring, it will choose anomaly detection models that recognize the behavior that already exists, identify deviations, and keep monitoring for them. Predictive responses are also automated into actions: whatever adjustments the predictive insights suggest, if you specify them as parameters, it will apply them automatically. This streamlined analytic workflow eliminates the manual effort of model tuning and also improves accuracy.
Coming to the practical implementation framework: whenever you want to implement this in your organization, start with assessment and planning. Evaluate your current SRE maturity and decide whether things are working fine or you want to go for automation. If your manual process is mature, that is fine, but in most cases manual intervention dominates, so identify the high-value use cases and define the success metrics. Document whatever the existing process is, so that once you add automation you can see the differences.
When it comes to data preparation, gather whatever monitoring data is available, establish quality baselines, and create labels, since labelled data is what supervised learning uses.
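A small sketch of that labelling step, assuming hourly metrics and an incident log as the two inputs and a one-hour prediction horizon; all file and column names are placeholders.

```python
# Labelling sketch: derive a supervised target ("incident within the next hour?")
# from historical incident timestamps. File names and the horizon are assumptions.
import pandas as pd

metrics = pd.read_csv("metrics_hourly.csv", parse_dates=["timestamp"])
incidents = pd.read_csv("incident_log.csv", parse_dates=["started_at"])

def incident_within_next_hour(ts: pd.Timestamp) -> int:
    window_end = ts + pd.Timedelta(hours=1)
    started = incidents["started_at"]
    return int(((started > ts) & (started <= window_end)).any())

metrics["incident_next_hour"] = metrics["timestamp"].apply(incident_within_next_hour)
metrics.to_csv("incident_training_data.csv", index=False)  # feeds the AutoML search
```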
Then comes platform selection. There are three platforms predominantly used by organizations: Azure AutoML, Google Cloud, which also has this feature, and AWS. It is up to the organization which services and which cloud they select, and in most cases every platform works in a similar way. For specific use cases, we can consider both cloud-native and vendor-neutral options.
Then comes pilot implementation: start with a well-scoped pilot project, assess it, and see how the transition to the automated process goes before implementing more widely. Then scale and optimize: expand to additional use cases once you are comfortable with the model performance, roll it out to larger groups, and keep improving continuously. That is how the project goes live, and you can compare the results and be in good shape.
Then compare the manual results versus the automated results. When it comes to the key operational metrics reported for these implementations, the statistics are notable: mean time to detection has decreased by around 68%, mean time to resolution is roughly 42 to 48% faster, alert accuracy is very high at around 91%, and monitoring coverage has roughly tripled. These are the key metrics organizations consider.
For next steps, if you want to go for it, conduct an audit, take two or three high-value use cases, and discuss them with cross-functional teams, because when all stakeholders, engineers, and data scientists are involved, we get a good understanding of the major problems, how we can address those issues, and how we can get out of the situation by reducing downtime, anomalies, and errors. From there, you can launch the pilot project on AutoML and later compare the manual results versus the AutoML results.
Yes, there are some implementation challenges here too. Take data quality issues: monitoring data can be inconsistent and incomplete. What I can suggest as a solution is data validation pipelines, so that inputs meet a standard process before they reach the model. If the data is accurate and the relationships within the data are correct, you will get correct output; scoring will suffer if you do not pass the data in properly. So check the data as you pass it to the model, and the quality issues will be resolved automatically.
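A minimal sketch of such a validation gate, where the required columns and allowed ranges are illustrative assumptions rather than a fixed schema.

```python
# Data-validation sketch: reject incomplete or out-of-range monitoring records
# before they reach the model. Required columns and ranges are assumptions.
import pandas as pd

REQUIRED_COLUMNS = ["timestamp", "cpu_pct", "error_rate", "latency_ms"]

def validate(batch: pd.DataFrame) -> pd.DataFrame:
    missing = [c for c in REQUIRED_COLUMNS if c not in batch.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    clean = batch.dropna(subset=REQUIRED_COLUMNS)      # drop incomplete rows
    clean = clean[clean["cpu_pct"].between(0, 100)]    # drop impossible CPU values
    clean = clean[clean["error_rate"].between(0, 1)]   # drop impossible error rates
    return clean
```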
Then there is organizational resistance. Yes, it is difficult for SRE teams to trust an automated process, because some critical decisions are being taken as well, of course. What I suggest is to start with non-critical systems, and once you have confidence, go for a side-by-side comparison of the manual results against the automated decisions. Eventually, once you see that evidence, you can go for these approaches.
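One lightweight way to run that side-by-side comparison is to keep the model in shadow mode and measure how often it agrees with the on-call engineer; the log format and decision values below are assumptions.

```python
# Shadow-mode comparison sketch: quantify agreement between manual on-call
# decisions and AutoML verdicts before trusting automation. Columns are hypothetical.
import pandas as pd

log = pd.read_csv("shadow_mode_log.csv")  # hypothetical: manual_decision, automl_decision
agreement = (log["manual_decision"] == log["automl_decision"]).mean()
false_pages = ((log["automl_decision"] == "page") &
               (log["manual_decision"] == "ignore")).mean()

print(f"agreement rate: {agreement:.1%}, false-page rate: {false_pages:.1%}")
```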
Next is integration complexity. Yes, connecting AutoML outputs to existing monitoring tools can be hard; for example, if you have platform-specific tools, it is difficult to integrate everything with the AutoML process in one stretch. For that, we can leverage standard APIs and architectures that create loosely coupled integration points. Nowadays, cloud AutoML offerings provide a lot of these integrations out of the box, so we can leverage those and go for it.
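As a sketch of what a loosely coupled integration point might look like, predictions can be pushed to existing tooling over a plain HTTP webhook; the endpoint URL and payload shape are placeholders for whatever your monitoring stack accepts.

```python
# Integration sketch: publish an AutoML prediction to existing tooling via a
# generic webhook, keeping the coupling loose. URL and payload are placeholders.
import json
import urllib.request

def publish_prediction(service: str, risk_score: float) -> None:
    payload = json.dumps({"service": service, "incident_risk": risk_score}).encode()
    req = urllib.request.Request(
        "https://monitoring.example.com/webhooks/automl",  # placeholder endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print("webhook status:", resp.status)

publish_prediction("checkout-api", 0.82)  # example call with a hypothetical score
```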
Successful implementation all the way through requires thoughtful discussion of both the technical and organizational challenges. Most domains are using this already, but how much are they benefiting?
I just want to point out some of the areas. For example, take financial services: the system automatically predicts how much load will occur during market hours or stock exchange timings. If it predicts that, then during those peak hours we can decide whether to scale out and increase resources, and we can also take preventive measures. It can detect anomalies in transactions or server-level logs, so any fraud or system failures can be detected and we can take good action on them.
If you go to e-commerce, it predicts traffic spikes around sales periods like Black Friday, New Year, or other holidays, when sales volume is high and we can expect peak loads. It predicts these automatically and surfaces them to us in advance, so we can decide how to scale out the compute engines. It also detects anomalies, such as revenue loss from failed payments; for example, if one payment flow is repeatedly not going through, we can put preventive measures in place to address it.
Then there is healthcare: it monitors IoT devices, patient data systems, and EMR medical records, and predictive maintenance can happen there. In the same way, in the telecom industry, we can detect outages in real time, identify failover routes, and forecast usage patterns to guide bandwidth allocation and infrastructure deployment.
Here I would like to conclude by saying that AutoML is not a replacement for SRE. It only enhances the capabilities of SRE practices as a tool, by automating routine analysis, decision making, anomaly detection, and fraud detection. These scenarios cover the SRE practices we already do on a normal, manual basis, so we can go for this approach, think in a strategic way, and make decisions from it.
This allows SREs to focus on strategic tasks. When the load is heavy, it is hard to make decisions while juggling specific tools and their compatibilities, whereas AutoML handles that very well.
That is what I wanted to conclude here. Thank you so much for giving me this opportunity.