Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I am Sunil Yadav.
I am a seasoned database administrator with years of experience in Oracle
E-Business Suite and Oracle database implementations, upgrades, tuning,
migrations, maintenance, patching, and cloning in different industries like
telecom, banking, retail, and various other places.
I have worked on many non-Oracle databases as well, like MS SQL
Server, MySQL, and MongoDB, plus a little bit of data analytics, GoldenGate, ODI,
and some other integration software technologies across different continents.
So here I am, and I'm going to talk about this topic of ML, which is
machine learning powering database resilience via predictive analytics
for 99.98 percent Oracle Cloud availability.
Now, as a fact, database downtime costs any large enterprise more than
9,000 US dollars per minute, and a typical outage goes beyond an hour when we talk
about production databases, which are big in size and complicated in architecture.
So if it takes about 60 minutes to fix a problem, that costs more than 700,000
dollars per incident if we calculate the math minute by minute.
Our machine learning solutions transform Oracle Cloud
reliability, based on implementations across hundreds of
enterprises throughout the globe.
So we are going to talk about it.
So, as an introduction, this is what it is.
Like I said, the cost of downtime is more than $9,000 per minute
for an average production database issue, and an outage lasts for
more than 60 minutes.
Even if it's a small problem, the identification, the mitigation
to solve the problem, and the final fix and implementation
take that much time, which costs more than $700,000
per incident as the total financial impact of running a database
at more than 99 percent uptime.
Now, what actually is machine learning based anomaly detection?
There are three pillars to this entire concept.
First is early detection of a problem.
They say that more than 70% of problems can be identified 30
minutes or more before they happen, like
space getting filled, or
CPU getting choked to more than 99%.
There are different examples like that.
So if we detect early, we can prevent early.
Second is preventive action: can that preventive action be
followed immediately to get the fix in place?
That's the action to be taken. And third, improved uptime is the
outcome of pillar one and pillar two.
So if we do one and two, we have improved uptime for our database.
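To make the early-detection pillar concrete, here is a minimal sketch, assuming we can sample disk usage from a monitoring agent; the capacity, the sample values, and the 30-minute warning window are illustrative assumptions.

```python
# A minimal sketch of "early detection": fit a trend to recent disk usage
# samples and alert if the filesystem is predicted full within 30 minutes.
# The capacity and sample data below are illustrative, not real telemetry.
import numpy as np

def minutes_until_full(samples_gb, capacity_gb, interval_min=1.0):
    """Fit a linear trend to recent usage samples and extrapolate to capacity."""
    t = np.arange(len(samples_gb)) * interval_min
    slope, intercept = np.polyfit(t, samples_gb, 1)    # GB-per-minute growth
    if slope <= 0:
        return float("inf")                            # not growing, no risk
    return (capacity_gb - samples_gb[-1]) / slope

usage = [880, 884, 889, 893, 898, 902]                 # last 6 one-minute samples
eta = minutes_until_full(usage, capacity_gb=1000)
if eta <= 30:                                          # the 30-minute early-warning window
    print(f"ALERT: filesystem predicted full in {eta:.0f} minutes")
```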
So for machine learning, what we need is to ensure that deep learning of the
performance metrics is done by the database and the entire solution.
Pattern recognition is the first step into it.
Like I mentioned a few slides before: if there is a
month end or a quarter end, or there's a
peak application workload going on,
then space, or the transaction logs or archive logs,
start getting filled up quickly.
So if there's a pattern to it, like something happening on a few
days of a month or a week,
recognize that pattern and take the action immediately.
And recognition should not happen in a delayed manner,
like every 30 minutes or one hour.
It should be quick, like an interval of five to ten seconds, so that
the pattern can be identified and action can be taken immediately.
There are invisible signals as well: you can leverage the neural
networks of the entire solution to detect correlated patterns from the
monitoring logs to fix the problem.
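As a rough illustration of recognition at a five-second interval, here is a hedged sketch using a rolling z-score; the metric name and the simulated collector are assumptions for demonstration, not a real product API.

```python
# A hedged sketch of fast anomaly recognition on a metric sampled every
# five seconds; sample_metric() is a simulated stand-in for a real collector.
import random
import time
from collections import deque
from statistics import mean, stdev

def sample_metric(name):
    """Stand-in collector; a real one would query the monitoring agent."""
    return random.gauss(40, 2)     # simulated redo generation rate, MB/s

WINDOW = 120                       # ~10 minutes of history at 5-second samples
history = deque(maxlen=WINDOW)

def is_anomalous(value, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations from the baseline."""
    if len(history) < 30:          # wait for enough samples to trust the baseline
        return False
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) / sigma > z_threshold

for _ in range(WINDOW):
    value = sample_metric("redo_generation_mb_per_s")
    if is_anomalous(value):
        print(f"anomaly: redo rate {value:.1f} MB/s deviates from baseline")
    history.append(value)
    time.sleep(5)                  # the 5-to-10-second detection interval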
Now, when we talk about the entire database, the solution is machine
learning for uptime of 99.98.
We have different technologies like Oracle RAC, Real Application
Clusters, which provide both availability and scalability for the entire
database platform, so that it stays available through the tough times of month
end and quarter close, or if there is a
disaster somewhere, one server crashes, or something happens to the network,
the other cluster nodes are able to cater.
And with that, reinforcement learning for the RAC will
optimize the entire solution.
So basically there are four parts to it:
monitor, learn, optimize, and validate.
To begin with, monitoring should be continuous performance
tracking of each and every log:
when and what happens,
and when the logs are generated, what information is gathered into them.
Then learn from that information: analyze the logs which are generated
and find the patterns and anomalies in them,
like what happens in the log when something occurs
that can be related to a pattern.
Third is to optimize: when you identify that this happened,
and this was the analysis done because of it,
what could be the remediation to fix the problem? And then,
once that is delivered, you validate the result for the database:
okay, because of this we did that, and did that solve the problem?
So it's like feedback on the solution, and that's how we
reinforce the learning into the entire database solution.
Now, moving on to one of the points we discussed:
neural networks for Data Guard tuning.
Data Guard, as most of you who are into database technologies know,
is the terminology used by Oracle databases primarily
for standby databases.
It basically keeps a synchronous or asynchronous copy of the primary
database at a standby site, in terms of replication, so that if anything
happens to the primary side, the standby can be activated
in case of failures, disasters, or crashes. And neural networks
for Data Guard tuning are basically about synchronized replication:
there should be minimum delay between primary and secondary, like
sub-millisecond replication lag between the primary and standby databases.
Only this will ensure net zero data loss during failover events.
The solution is only suitable if there's no data loss, right?
We don't want any transaction to be lost if there is a
database failover between primary and secondary.
Otherwise, the solution holds no value if there is a lag which
results in transaction loss.
So the most important thing is that there should be only milliseconds of lag
between primary and secondary.
As soon as any transaction is committed on the primary, or
rolled back, or whatever happens on the primary, an image of it should
be on the secondary side within milliseconds of gap.
And the second important factor is that if you want to achieve this Data
Guard tuning with a millisecond gap, it does not mean the primary should
take a performance hit because of this kind of millisecond
replication between primary and secondary.
So with minimum performance impact on the primary, the secondary should
be in sync without any data loss.
That is the most important thing, again, after minimum data loss.
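As an illustration, the lag between primary and standby can be watched through Oracle's v$dataguard_stats view; in this sketch, built on the python-oracledb driver, the connection details and the alert condition are assumptions.

```python
# A hedged sketch of watching replication lag on a standby via the
# v$dataguard_stats view; the DSN, credentials, and alert rule are placeholders.
import oracledb

def check_lag(dsn, user, password):
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT name, value FROM v$dataguard_stats "
                "WHERE name IN ('transport lag', 'apply lag')"
            )
            for name, value in cur:           # value is an interval string,
                print(f"{name}: {value}")     # e.g. '+00 00:00:00'
                if value and value != "+00 00:00:00":
                    print(f"ALERT: non-zero {name} on standby")

check_lag("standby-host/STBYPDB", "monitor_user", "***")  # placeholder DSN
```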
Then, of course,
the third key point is ultra-fast recovery.
In Oracle there is something called fast-start MTTR,
which means that whenever a failover happens, it
should be quick enough that the secondary is available
as the failover occurs.
To keep it simple, we don't want
the application to be unavailable while the failover happens.
Now, a few seconds or a few minutes
of downtime is probably agreeable to most enterprise applications,
because there are very few enterprise applications which demand a hundred
percent uptime; there is always a few minutes of downtime.
So 99.99 percent uptime is agreeable.
Like I said, ultra-fast recovery on the standby side is still going to
give you a few minutes of downtime, but it should happen quickly enough
that the secondary, the failover database, is available within a
few minutes, if not less than a minute.
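One hedged way to see the downtime an application would actually experience is to time how long the failover target takes to accept connections again; the DSN, credentials, and timeout below are placeholders, not a prescribed method.

```python
# A sketch of measuring failover downtime: poll the failover target until it
# accepts connections and report how long the application waited.
import time
import oracledb

def time_until_available(dsn, user, password, timeout_s=300):
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            with oracledb.connect(user=user, password=password, dsn=dsn):
                return time.monotonic() - start     # seconds of downtime
        except oracledb.Error:
            time.sleep(2)                           # retry until the standby opens
    return None

downtime = time_until_available("standby-host/STBYPDB", "app_user", "***")
print(f"database available after {downtime:.0f}s" if downtime else "timed out")
```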
So yeah, that is about Data Guard tuning.
Now, moving on to another important point: ROI, return on investment, of
machine learning based resilience.
What is the return on investment for any enterprise company
on all this machine learning and modeling? There
has to be a value to everything, right?
With machine learning based resilience of databases in Oracle Cloud,
they claim more than
280% return on investment, meaning it delivers exceptional
financial returns across enterprise implementations over a three-year period.
So whatever investment is done today is going to give more than 280%
return within three years of time.
What does it mean?
It means the value
any enterprise or organization derives from the investment they make in this
Oracle Cloud infrastructure, database as a service, disaster recovery, or any solution.
So there is a pretty nice return on investment.
The second point is an 8.4-month payback.
What does that mean? It means accelerated investment
recovery, with measurable cost savings from prevented outages.
Like I said, 99.98 percent availability
means there are outages which are prevented
by using machine learning, by doing pattern analysis, by digging through the
logs, by analyzing outages to see what can be prevented and
what action can be taken 30 minutes before, when a pattern is recognized.
So it takes about eight months for this machine
learning based Oracle Cloud solution to start paying for itself.
Again, 99.98 percent availability ensures all the critical
systems are operational for 8,758 hours annually, which means only a
few minutes of downtime are allowed
across the industry. And there are agreed parameters through
SOC and other compliance standards for how many minutes of downtime are allowed
to achieve that standard of availability for
databases in Oracle or any other technology.
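The availability figures follow from simple arithmetic over a standard (non-leap) year, which a few lines can sanity-check:

```python
# A quick sanity check of the availability math above.
HOURS_PER_YEAR = 365 * 24                    # 8,760 hours

uptime = 0.9998
operational_hours = HOURS_PER_YEAR * uptime  # ~8,758.25 hours, matching the slide
downtime_minutes = HOURS_PER_YEAR * 60 * (1 - uptime)  # ~105 minutes per year

print(f"operational hours/year: {operational_hours:,.2f}")
print(f"allowed downtime/year:  {downtime_minutes:,.1f} minutes")
```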
So it basically minimizes the cost of disruptions.
That's the plus point.
Moving on to the next important point: how do we actually do machine
learning in Oracle Cloud computing?
Overall, it's about predictive autoscaling benefits.
What are predictive autoscaling benefits?
Before going into the technical slides, I'll just share one example.
When a month end happens, we know that there will be a lot of
transactions and a lot of reports running, so there'll be a big
hit on the CPU of a database server.
Autoscaling means that as the month end approaches and the
system recognizes that, okay, the CPU has started hitting its peak,
autoscaling can add a
few more CPUs as the demand grows.
So that's predictive autoscaling for a database.
Similarly for memory, and similarly for storage: if
database storage growth is happening at a very high rate,
the system should autoscale to add more space, or at
least alert the system engineer to provision more space.
So that gives some prediction to it,
and that actually helps in avoiding many outages.
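A minimal sketch of that predictive scaling decision, assuming a hypothetical scale_ocpus() hook in place of a real cloud API, could look like this:

```python
# A hedged sketch of predictive scaling: forecast CPU a few minutes ahead
# from the recent trend and scale up before the peak arrives. scale_ocpus()
# stands in for a real cloud API call (e.g., resizing a VM shape).
import numpy as np

def forecast(samples, steps_ahead):
    """Extrapolate a linear trend over equally spaced samples."""
    t = np.arange(len(samples))
    slope, intercept = np.polyfit(t, samples, 1)
    return slope * (len(samples) - 1 + steps_ahead) + intercept

def scale_ocpus(count):                       # hypothetical scaling hook
    print(f"requesting scale-up by {count} OCPUs")

cpu_pct = [55, 60, 64, 70, 76, 81]            # last 6 one-minute CPU samples
predicted = forecast(cpu_pct, steps_ahead=10) # where we'll be in ~10 minutes
if predicted > 90:                            # act before the CPU chokes
    scale_ocpus(2)
```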
Coming to what we have on the slide for autoscaling benefits:
one is resource optimization, the second is cost reduction.
Resource optimization means machine learning models reduce
over-provisioning by 47% while ensuring
capacity for unexpected workloads.
Like I said, we don't want to scale up
24 by 7, 365 days, right?
We want to scale up only when the peak load comes, which can be predicted.
So by doing this with machine learning models, we can reduce
over-provisioning by almost 50%, if not more, while ensuring
unexpected workloads can be catered for.
So don't always provision higher CPU, memory, and compute for the server;
whenever the demand is there, it should be there,
and otherwise it should scale down.
Second is cost reduction.
Everything has a cost, right?
When we do autoscaling, it's
per CPU and per unit of memory; any cloud provider will charge accordingly
if predictive scaling up and down is not done.
So by using machine learning, we can reduce the cost depending
on the predicted workload.
And intelligent storage tiering based on access patterns decreases cost by 60%.
If there is no load on the servers, no workload, and no
predicted demand, downscale the infrastructure.
That's what it means.
It helps in reducing the overall cost.
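A minimal sketch of tiering by access pattern, with illustrative age thresholds and tier names (not a specific cloud's storage classes):

```python
# A minimal sketch of access-pattern-based storage tiering: cold data moves
# to cheaper tiers. Thresholds and tier names are illustrative assumptions.
from datetime import datetime, timedelta

def pick_tier(last_accessed, now=None):
    """Choose a storage tier from how recently an object was read."""
    now = now or datetime.utcnow()
    age = now - last_accessed
    if age < timedelta(days=30):
        return "hot"        # frequently read: keep on fast (expensive) storage
    if age < timedelta(days=180):
        return "infrequent" # occasionally read: mid-cost tier
    return "archive"        # rarely read: cheapest tier

print(pick_tier(datetime.utcnow() - timedelta(days=365)))  # -> archive
```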
Moving to the next slide,
here we are discussing case studies of 99.98 percent availability
for enterprise solutions: financial services, healthcare, retail.
These are, you know, a few of the top industries where Oracle
Cloud compute is utilized, and
there are benefits to it, like more than 99.98 percent availability, which helps all these
organizations in finance, healthcare, and retail grow their business on a day-to-day
basis. Because there are no outages, they can cater to their customers around the
clock, 24 by 7, across geographies, because Oracle provides multi-region
cloud-based solutions where there is no outage and things are available in
a matter of a few seconds or minutes.
Moving to the next slide: scalable AI framework, which is
what we are talking about, right?
First, cross-architecture compatibility.
It's not only Oracle Cloud that is the point of discussion here.
There are multiple clouds, like Azure, AWS (Amazon's cloud),
GCP (Google Cloud), and IBM Cloud.
Machine learning cuts through all of these in a scalable AI framework
to get the information required for
the enterprise solution to work efficiently and effectively.
Second is automated anomaly detection.
It's not that these things have to be analyzed manually by a
DBA or an engineer sitting there.
There are automated jobs which can be scheduled
to capture the logs and information, with analysis done later,
to be able to publish the reports and boost operational efficiency.
Overall, it's all about efficiency, right?
The next point is engineering productivity.
It frees DBA teams from manual processes.
Like I said, it's all automated jobs.
Once the jobs are there, you can customize the jobs as per your needs,
and you can enable the focus on high-priority tasks and innovation,
right?
Rather than the mundane manual jobs.
Now, coming to the last topic of the discussion: transforming
database resilience.
There are four steps to it.
You discover the problem and identify the pattern, you implement the solution
by deploying machine-learning-driven monitoring and optimization
tools, you optimize based on what you have learned from there, and then
you achieve your 99.98% availability while reducing
the overall running cost of the entire enterprise architecture solution.
Yeah.
So, team, that is the presentation.
I thank you all for your time and for listening to me.
Have a good day.
Thanks.
Bye.