Conf42 Machine Learning 2025 - Online

- premiere 5PM GMT

ML-Powered Database Resilience: Predictive Analytics for 99.98% Oracle Cloud Availability

Abstract

Discover how ML predicts 83% of database failures before they happen! Learn how 750+ enterprises achieve 99.98% availability while cutting costs by 47%. Transform database resilience with predictive analytics that deliver 300% ROI.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I am Sunil Yadav. I am a seasoned database administrator with years of experience on Oracle E-Business Suite and Oracle core databases: implementations, upgrades, tuning, migrations, maintenance, patching and cloning, in industries like telecom, banking, retail and various other places. I have also worked on many non-Oracle databases such as MS SQL Server, MySQL and MongoDB, a little bit of data analytics, GoldenGate, ODI and other integration technologies, across different continents. So here I am going to talk about this topic of ML: machine learning powering database resilience via predictive analytics for 99.98 percent Oracle Cloud availability.

Now, as a fact, database downtime costs any large enterprise more than 9,000 US dollars per minute, and a typical outage goes beyond an hour when we talk about production databases that are big in size and complicated in architecture. So if it takes about 60 minutes to fix a problem, that costs more than 700,000 dollars per incident if we do the math per minute. Our machine learning solutions transform Oracle Cloud reliability, based on implementations across hundreds of enterprises throughout the globe. So that is what we are going to talk about. As an introduction, like I said, the cost of downtime is more than $9,000 per minute for an average production database issue, and an incident lasts more than 60 minutes: even if it is a small problem, the mitigation, the final fix and the implementation take that much time, which adds up to more than $700,000 per incident as the total financial impact of running a database at more than 99 percent uptime.

Now, what is machine learning based anomaly detection? There are three pillars to this entire concept. First is early detection of a problem: more than 70% of problems can be identified up to 30 minutes before they happen, like space filling up or CPU getting choked to more than 99%. There are different examples like that. If we detect it early, we can prevent it early. Second is preventive action: that preventive action can be followed immediately so the fix is available. Third, improved uptime is the outcome of pillars one and two: if we do one and two, we have improved uptime for our database.

For machine learning, what we need is deep learning on the performance metrics gathered from the database, so the entire solution can deliver. Pattern recognition is the first step. Like I mentioned a few slides before, if there is a month end or quarter end, or a peak application workload going on, the transaction logs or archive logs start filling up quickly. If there is a pattern to it, like something happening on a few days of the month or week, recognize that pattern and take the action immediately. And the recognition should not be delayed by 30 minutes or an hour; it should be quick, at an interval of five to ten seconds, so the pattern can be identified and action can be taken immediately.
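As a minimal sketch of this early-detection idea (not the speaker's actual implementation), the snippet below samples a database metric every few seconds and flags values that drift far from a learned baseline using a rolling z-score. The metric source, window size and threshold are illustrative assumptions; in practice the samples would come from AWR snapshots, v$ views or a monitoring agent.

```python
# Minimal sketch: rolling z-score anomaly detection on a database metric,
# sampled every few seconds as described above. The metric source is simulated.
import random
import statistics
from collections import deque

WINDOW = 120          # roughly 10-20 minutes of samples at a 5-10 s interval
Z_THRESHOLD = 3.0     # how far outside the learned baseline counts as an anomaly

def fetch_cpu_utilization() -> float:
    """Hypothetical metric source; replace with a real monitoring query."""
    return random.gauss(45.0, 5.0)

def is_anomalous(value: float, history: deque) -> bool:
    if len(history) < 30:                 # need enough history to learn a baseline
        return False
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1e-9
    return abs(value - mean) / stdev > Z_THRESHOLD

history: deque = deque(maxlen=WINDOW)
for _ in range(300):                      # one iteration per polling interval
    sample = fetch_cpu_utilization()
    if is_anomalous(sample, history):
        print(f"early warning: CPU sample {sample:.1f}% deviates from baseline")
    history.append(sample)
```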
There are also invisible signals: you can leverage neural networks across the entire solution to detect correlated patterns in the monitoring logs and fix the problem. Now, when we talk about the entire database solution, it is machine learning for 99.98 percent uptime. We have technologies like Oracle RAC, Real Application Clusters, which provide both availability and scalability, so the database platform stays available through the tough times of month-end close, or if there is a disaster somewhere, or one server crashes, or something happens to the network, the other cluster nodes are able to cater to the load. On top of that, reinforcement learning for RAC will optimize the entire solution.

Basically there are four parts to it: monitor, learn, optimize, and validate. To begin with, monitoring should be continuous performance tracking of each and every log: when and what happens, and what information is gathered when the logs are generated. Then learn from that information and analyze the patterns and anomalies in it, like what appears in the log whenever something happens that can be related to a pattern. Third is optimize: once you identify that this happened, and this was the analysis, what could be the remediation to fix the problem? And then, once that is delivered, you validate against the rest of the database: because of this we did that, and did it solve the problem? It is a feedback loop on the solution, and that is how we reinforce the learning into the entire database solution.

Moving on to one of the points we discussed: neural networks for Data Guard tuning. Most of you who are into database technologies know Data Guard; it is the Oracle terminology for standby databases. It keeps a synchronous or asynchronous copy of the primary database at a standby site, in replication terms, so that if anything happens to the primary, the standby can be activated in case of failures, disasters or crashes. Neural networks for Data Guard tuning are basically about synchronized replication: there should be a very minimal delay between primary and standby, sub-millisecond replication lag, because only that ensures net-zero data loss during failover events. The solution is only suitable if there is no data loss, right? We do not want any transaction to be lost if there is a database failover between primary and secondary; otherwise the solution holds no value, because a lag results in transaction loss. So the most important thing is that there should be at most milliseconds of lag between primary and secondary: as soon as any transaction is committed or rolled back on the primary, its image should be on the secondary side within milliseconds. The second important factor is that achieving this Data Guard tuning with a millisecond gap does not mean the primary should take a performance hit because of that millisecond replication between primary and secondary.
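To make the lag requirement concrete, here is a small monitoring sketch, assuming the python-oracledb driver and the standard v$dataguard_stats view queried on the standby. The connection details, the one-second alert threshold and the idea of polling from Python are assumptions for illustration, not the tuning solution described in the talk.

```python
# Minimal sketch: read Data Guard transport/apply lag from the standby and
# warn when it exceeds a threshold.
import re
import oracledb  # python-oracledb thin driver

LAG_THRESHOLD_SECONDS = 1.0   # goal in the talk is near-zero lag; 1 s is an example

def interval_to_seconds(value: str) -> float:
    """Convert an interval string like '+00 00:00:03' to seconds."""
    m = re.match(r"\+?(\d+) (\d+):(\d+):(\d+)", value or "")
    if not m:
        return 0.0
    days, hours, minutes, seconds = (int(g) for g in m.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds

def check_lag(dsn: str, user: str, password: str) -> None:
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT name, value FROM v$dataguard_stats "
                "WHERE name IN ('transport lag', 'apply lag')"
            )
            for name, value in cur:
                lag = interval_to_seconds(value)
                if lag > LAG_THRESHOLD_SECONDS:
                    print(f"warning: {name} is {lag:.0f}s on the standby")

# Example call (hypothetical connection details):
# check_lag("standby-host/ORCLPDB1", "monitoring_user", "secret")
```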
So with minimum performance impact on the primary, the secondary should stay in sync without any data loss; that is the most important thing after minimal data loss. Then of course the third key point is ultra-fast recovery. In Oracle there is something called fast MTTR: whenever a failover happens, it should be quick enough that the secondary is available as soon as the failover occurs. To keep it simple, we do not want the application to be unavailable while the failover happens. Now, a few seconds or a few minutes of downtime is probably agreeable to most enterprise applications, because very few enterprise applications demand a hundred percent uptime; there is always a few minutes of downtime, so something like 99.99 percent uptime is agreeable. Like I said, ultra-fast recovery on the standby side will still give you a few minutes of downtime, but it should happen quickly enough that the failover database is available within a few minutes, if not less than a minute. So that is Data Guard tuning.

Now, moving on to another important point: ROI, the return on investment of machine learning based resilience. What is the return on investment for any enterprise on all of this machine learning and modeling? There has to be a value to everything, right? With machine learning based resilience of databases in Oracle Cloud, the claim is more than 280 percent return on investment, meaning it delivers exceptional financial returns across enterprise implementations over a three-year period: whatever investment is made today gives back more than 280 percent within three years. That is the value any enterprise or organization derives from the investment it makes in Oracle Cloud infrastructure, database as a service, disaster recovery or any such solution. So there is a pretty nice return on investment.

The second point is an 8.4-month payback. It means accelerated investment recovery, with measurable cost savings from prevented outages. Like I said, 99.98 percent availability means there are outages that are prevented by using machine learning, by doing pattern analysis, by digging through the logs, and by analyzing what can be prevented and what action can be taken 30 minutes ahead once a pattern is recognized. So it takes about eight months for this machine learning based Oracle Cloud solution to start paying for itself.

Again, 99.98 percent availability ensures that all the critical systems are operational for 8,758 hours annually, which means only a couple of hours of downtime a year is allowed. Across the industry there are agreed parameters through SOC and other compliance frameworks for how many minutes of downtime are allowed to achieve that standard of availability for databases in Oracle or any other technology. So it basically minimizes the cost of disruptions; that is the plus point. Moving on to the next important point: how do we actually do machine learning in Oracle cloud computing? Overall, it is about predictive autoscaling and its benefits. Before going into the technical slides, I will share one example, right after a quick sketch of the availability and payback arithmetic below.
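A rough back-of-the-envelope version of the availability and payback arithmetic cited above. The investment and monthly-savings figures are made-up inputs, chosen only to show how a payback period in the region of 8.4 months falls out; the availability math itself follows directly from the 99.98 percent target.

```python
# Downtime budget at a given availability target, plus a simple payback estimate.
AVAILABILITY_TARGET = 0.9998          # 99.98% uptime
HOURS_PER_YEAR = 24 * 365             # 8,760 hours

uptime_hours = AVAILABILITY_TARGET * HOURS_PER_YEAR
downtime_minutes = (HOURS_PER_YEAR - uptime_hours) * 60
print(f"uptime: {uptime_hours:,.0f} h/year, downtime budget: {downtime_minutes:.0f} min/year")
# -> roughly 8,758 hours of uptime and about 105 minutes of downtime a year

# Payback: months until cumulative savings cover the up-front investment.
investment = 1_000_000                # assumed implementation cost (USD)
monthly_savings = 120_000             # assumed savings from prevented outages (USD)
print(f"payback: {investment / monthly_savings:.1f} months")
# With these example inputs the payback lands near the 8.4 months cited above.
```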
When a month end happens, we know there will be a lot of transactions and a lot of reports running, so there will be a heavy hit on the CPU of the database server. Autoscaling means that as month end approaches and the system recognizes that CPU is starting to hit its peak, it can add a few more CPUs as the demand grows. That is predictive autoscaling for a database (a small sketch of this idea appears at the end of this section). Similarly for memory, and similarly for storage: if database storage is growing at a very high rate, the system should autoscale to add more space, or at least alert the system engineer to provision more space. That prediction actually helps avoid many outages.

Coming to what we have on the slide for autoscaling benefits: one is resource optimization, the second is cost reduction. Resource optimization means machine learning models reduce over-provisioning by 47% while ensuring capacity for unexpected workloads. Like I said, we do not want to scale up 24 by 7, 365 days a year; we want to scale up only when the peak load comes, which can be predicted. By using machine learning models we can reduce over-provisioning by almost 50%, if not more, while still ensuring unexpected workloads can be catered for. So do not always provision higher CPU, memory and compute on the server; it should be there when the prediction says it is needed, and otherwise it should scale down. Second is cost reduction. Everything has a price, right? With autoscaling, a cloud provider charges per CPU and per unit of memory, so if predictive scaling up and down is not done, you pay for it. By using machine learning we can reduce the cost based on the predicted workload, and intelligent storage tiering based on access patterns decreases cost by 60%. If there is no load on the servers, no workload and no predicted peak, downscale the infrastructure; that is what it means, and it helps reduce the overall cost.

Moving to the next slide, we discuss case studies of 99.98% availability for enterprise solutions in financial services, healthcare and retail. These are a few of the top industries where Oracle cloud compute is utilized, and the benefit, more than 99.98% availability, helps all these organizations grow their business day to day: because there are no outages, they can cater to their customers around the clock, 24 by 7, across geographies, because Oracle provides multi-region cloud solutions where there is no outage and things are available in a matter of seconds or minutes.

Moving to the next slide: the scalable AI framework. This is what we are talking about, cross-architecture compatibility. It is not only Oracle Cloud that is the point of discussion here; there are multiple clouds like Azure, AWS, GCP and IBM Cloud, and the machine learning framework cuts across all of them to get whatever information is required for the enterprise solution to work efficiently and effectively. Second is automated anomaly detection: it is not that these logs have to be analyzed manually by a DBA or an engineer sitting there; there are automated jobs that can be scheduled to capture the logs and information, and the analysis is done afterwards.
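Returning to the month-end example above, this is a minimal sketch of what predictive scale-up and scale-down could look like: a naive trend forecast over recent CPU samples and a decision to add or remove OCPUs before the peak arrives. The forecast method, the thresholds and the scale_to() placeholder are assumptions for illustration, not Oracle's autoscaling implementation.

```python
# Minimal sketch of predictive scaling: forecast the next interval's CPU from a
# simple linear trend and adjust OCPUs ahead of the predicted peak.
from statistics import fmean

def forecast_next(samples: list[float]) -> float:
    """Naive trend forecast: last value plus the average recent slope."""
    if len(samples) < 2:
        return samples[-1]
    slopes = [b - a for a, b in zip(samples, samples[1:])]
    return samples[-1] + fmean(slopes)

def plan_ocpus(cpu_history_pct: list[float], current_ocpus: int) -> int:
    predicted = forecast_next(cpu_history_pct)
    if predicted > 80:                    # peak expected (e.g. month-end close)
        return current_ocpus + 2          # scale up before the load arrives
    if predicted < 30 and current_ocpus > 2:
        return current_ocpus - 1          # scale back down to cut cost
    return current_ocpus

def scale_to(ocpus: int) -> None:
    """Placeholder: a real version would call the cloud provider's scaling API."""
    print(f"requesting {ocpus} OCPUs")

# Example: CPU climbing as month-end batch jobs ramp up.
history = [42, 48, 55, 63, 71, 78]
scale_to(plan_ocpus(history, current_ocpus=4))
```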
That analysis is then used to publish reports and boost operational efficiency; overall, it is all about efficiency, right? The next point is engineering productivity. It frees database teams from manual processes. Like I said, it is all automated jobs: once the jobs are there, you can customize them to your needs, and the teams can focus on high-priority tasks and innovation rather than mundane manual work. Now coming to the last topic of the discussion, transforming database resilience. There are four steps to it: you discover the problem and identify the pattern, you implement the solution by deploying machine learning driven monitoring and optimization tools, you optimize based on what you have learned from there, and then you achieve your 99.98% availability while reducing the overall running cost of the entire enterprise architecture. So, team, that is the presentation. I thank you for your time and for listening to me. Have a good day. Thanks. Bye.

Sunil Yadav

@ University of Pune


