Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Reliability at Scale: How AutoML is Transforming Enterprise SRE Practices


Abstract

Discover how AutoML transforms SRE by converting system data into predictive reliability insights—no data science expertise needed. Learn how enterprises slash MTTR, eliminate false positives, and prevent outages with practical AutoML strategies you can implement immediately.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Swapna, and I'm here to discuss how AutoML is transforming enterprise site reliability engineering practices. We will look in detail at how organizations are getting leverage out of this approach. Site reliability engineering is going through a fundamental transformation through AutoML: these evolving approaches are redefining how organizations monitor, predict, and respond to reliability challenges, and that applies across industries. We will explore how AutoML is democratizing powerful reliability analysis and prediction, and which services we can integrate with it.

Earlier, in traditional operations, manual monitoring was quite challenging. It was time-consuming, it produced a lot of errors, and sometimes the results would go in completely different directions. Early SRE adoption brought software engineering principles with basic automation, log monitoring, and simple anomaly checks. Modern SRE goes further and emphasizes service level objectives and structured incident management. And now, with the cloud, we have AutoML-enhanced SRE: with AutoML integrated, we get predictive analytics, smarter alerting, and mitigation of alerts, anomalies, and errors that we can predict in advance, so we get the benefit and our projects run successfully. This journey from traditional operations to SRE is a continuous, proactive process; we need to keep revisiting it as system complexity grows and keep incorporating these practices.

Why do enterprises actually need site reliability engineering? Because scalability demands are high, application traffic is high, the capabilities expected of the infrastructure have grown, and architectural complexity has increased. On the development side it is a continuous process: release cycles and the whole SDLC are speeding up rapidly. User expectations have changed too; users will not tolerate even a fraction of a second of downtime. There is a financial impact as well: outages cause revenue loss that organizations cannot afford and consume extra engineering effort, and there is brand damage if we do not identify problems at an early stage and address them.

So how is AutoML transforming SRE? One set of use cases is incident prediction and prevention: if we can predict a server failure, a traffic surge, or an error spike in advance, we can mitigate it before it happens. Another is root cause analysis, which is useful most of the time because AutoML correlates metrics and identifies patterns in how the logs behave and how events keep occurring, so we can narrow down the likely root causes and reduce downtime. It helps SREs fix things faster, because once we know the problem we can identify the fix immediately and stay preventive.
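As a rough sketch of the kind of anomaly detection on infrastructure telemetry described here, the snippet below trains an Isolation Forest on synthetic CPU, latency, and error-rate samples and flags a spike; it assumes scikit-learn and NumPy, and the metric names and thresholds are illustrative assumptions rather than anything prescribed in the talk.

```python
# Minimal sketch: flag anomalous server metric samples with an Isolation Forest.
# Assumes scikit-learn and NumPy; metrics and thresholds are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic "normal" telemetry: [cpu_percent, p99_latency_ms, error_rate]
normal = np.column_stack([
    rng.normal(45, 10, 1000),    # CPU around 45%
    rng.normal(120, 25, 1000),   # p99 latency around 120 ms
    rng.normal(0.5, 0.2, 1000),  # error rate around 0.5%
])

# Train on historical telemetry; assume roughly 1% of points are anomalous.
model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal)

# Score fresh samples: one healthy, one showing a latency/error spike.
fresh = np.array([
    [48, 130, 0.6],    # looks normal
    [97, 900, 12.0],   # saturated CPU, huge latency, error spike
])
labels = model.predict(fresh)  # +1 = normal, -1 = anomaly
for sample, label in zip(fresh, labels):
    status = "ANOMALY - page the on-call" if label == -1 else "ok"
    print(sample, status)
```

In a real pipeline these features would come from the monitoring system, and an AutoML service would typically try several detectors and tune the contamination threshold rather than relying on a single hand-picked model.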
Then there is capacity planning. Once AutoML has forecast a system overload, it can tell us how traffic is going to surge and what type of overload can occur, and if we predict that, we can increase the infrastructure allocation and bandwidth in advance. In the cloud this ties into auto-scaling, so it saves money: when usage grows, the services scale up, and when we are not using much capacity most of the time, we can reduce the compute and save on cost.

The next one is automated alert tuning. If too many alerts are coming in, AutoML can identify whether they are true or false positives, and the false ones can be suppressed. Inaccurate alerts are a bit like spam mail: we tend to take them seriously and end up spending time fixing things that have no impact on any process. With automated alert tuning, only the meaningful alerts are sent to the engineers, so they can concentrate on the ones that are really useful.

Then there are self-healing systems. We can define rules for certain situations; for example, if a service degrades during a particular window and only a restart is required, we can set a threshold value, define the rule, and have the service restart automatically once that threshold is reached, so it fixes itself without needing our attention.

Now compare this with how traditional ML used to work: it was a heavyweight process that took a lot of time and effort. Data acquisition is needed in both the traditional and the automated approach, but after that come data exploration, data preparation, feature engineering, model selection, model training, hyperparameter and performance tuning, and model evaluation, and in traditional ML all of those steps are manual. With AutoML, they are handled by the automated machine learning system: we only feed in the data, specify the metrics, and, if we are budget conscious, give a time and cost limit so that it respects those constraints. It works like a black box. If you are really sure about the process, you can specify the features, algorithms, and parameters yourself; if not, you can take the suggestions from AutoML. It then runs various train and test splits, returns the candidate models with a ranking and their accuracy, and it is your choice to review them, decide among them, and take the benefit. That is why the AutoML process is so approachable, and we can go for it even without much knowledge of data science.
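To make that black box a little more concrete, here is a deliberately simplified sketch of what an AutoML run does underneath: try several candidate models, cross-validate each one, and return a leaderboard ranked by the chosen metric. It assumes scikit-learn and uses a synthetic, imbalanced incident dataset; real AutoML platforms layer automated feature engineering, hyperparameter search, and time or cost budgets on top of this loop.

```python
# Simplified sketch of AutoML-style model selection: try candidates,
# cross-validate each, and rank by the metric you care about.
# Assumes scikit-learn; X is a feature matrix of system telemetry and
# y labels whether an incident followed (both synthetic here).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=12, weights=[0.9, 0.1],
                           random_state=0)  # imbalanced, like real incidents

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score every candidate; a real AutoML run would also stop at a time budget.
leaderboard = []
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    leaderboard.append((score, name))

for score, name in sorted(leaderboard, reverse=True):
    print(f"{name}: mean F1 = {score:.3f}")
```

The top entry of the leaderboard is roughly what a platform would hand back as its recommended model.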
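Going back to the self-healing rules mentioned a little earlier, a rule such as "restart the service when the error rate crosses a threshold" can be expressed in a few lines. Everything here is a hypothetical placeholder: the metrics endpoint, the threshold, and the systemctl restart command; in practice an orchestrator such as Kubernetes would usually implement this with health probes.

```python
# Sketch of a tiny self-healing rule: if the error rate crosses a threshold, restart.
# The URL, threshold, service name, and restart mechanism are hypothetical placeholders.
import json
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/metrics"   # hypothetical metrics endpoint
ERROR_RATE_THRESHOLD = 0.05                    # restart if >5% of requests fail
CHECK_INTERVAL_SECONDS = 60

def current_error_rate() -> float:
    """Read a JSON metrics endpoint shaped like {"requests": N, "errors": M}."""
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
        stats = json.load(resp)
    return stats["errors"] / max(stats["requests"], 1)

def restart_service() -> None:
    """Restart the (hypothetical) service; swap in your own restart mechanism."""
    subprocess.run(["systemctl", "restart", "checkout-service"], check=True)

if __name__ == "__main__":
    while True:
        try:
            if current_error_rate() > ERROR_RATE_THRESHOLD:
                restart_service()
        except Exception as exc:   # a failed health check should not crash the loop
            print(f"health check failed: {exc}")
        time.sleep(CHECK_INTERVAL_SECONDS)
```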
Coming to how AutoML transforms SRE workflows, we can do this in a better way. First comes data collection and feature engineering, which we derive from the monitoring sources. Then model selection and training: the algorithms are matched to the reliability scenario, so for a prediction problem it will choose prediction models, and for continuous monitoring and recognizing existing behaviors it will go for anomaly detection, identify those patterns, and keep monitoring for them. Then come predictive responses, where the actions are automated: whatever predictive insight it gives, if you specify that in the parameters, the adjustment is applied automatically. This streamlined analytics workflow eliminates the manual effort in model tuning and improves accuracy.

Coming to a practical implementation framework, whenever you want to implement this in your organization, start with assessment and planning. Evaluate your current SRE maturity: is it working fine, or do you want to go for automation? If your manual process is mature, that is fine, but in most cases the manual intervention is high. Then identify the high-value use cases and the success metrics, and document the existing process so that once you add the automation you can see the differences.

Next is data preparation: take whatever monitoring data is available, establish quality baselines, and create labels, which are used for supervised learning.

Then comes platform selection. Three platforms are used predominantly by organizations: Azure AutoML, Google's equivalent offering, and AWS's. It is up to the organization which cloud and which services to select, and in most cases every platform works in a similar way, so for specific use cases we can consider both cloud-native and vendor-neutral options.

After that come the pilot implementations: please start with a modest pilot project, assess it, and see how the transition to the automated process goes before you move to full implementation. Then scale and optimize: expand to additional use cases once you are comfortable with the model performance, roll it out to a larger group, and keep improving continuously. That is how the project goes live, and you can compare the manual results versus the automated results and be in a good position.

When it comes to the key operational metrics, as per the statistics, mean time to detection has decreased by around 68 percent, mean time to resolution has improved by roughly 42 to 48 percent, alert accuracy is very high at around 91 percent, and coverage expansion has tripled. These are the key metrics organizations are considering.

As next steps, if you want to go for it, conduct an audit, pick two or three high-value use cases, and discuss them with cross-functional teams, because if all the stakeholders, engineers, and data scientists are involved, we get a good understanding of the major problems, how we can address those issues, and how we can come out of the situation by reducing downtime, anomalies, and error-prone behavior.
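As a small sketch of that data preparation and labeling step, the snippet below turns raw per-minute monitoring samples into rolling-window features plus a supervised label meaning "an incident was declared within the next ten minutes". It assumes pandas and NumPy; the metric names, the 15-minute windows, and the 10-minute horizon are illustrative assumptions, and the data here is synthetic.

```python
# Sketch: turn raw monitoring data into features and labels for supervised learning,
# i.e. the "data preparation and feature engineering" step before AutoML takes over.
# Assumes pandas/NumPy; names, windows, and the incident horizon are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1440  # one day of per-minute samples from a single service

df = pd.DataFrame({
    "ts": pd.date_range("2025-01-01", periods=n, freq="min"),
    "cpu_percent": rng.normal(50, 12, n).clip(0, 100),
    "p99_latency_ms": rng.normal(150, 40, n).clip(1, None),
    "error_count": rng.poisson(2, n),
    "incident": 0,
}).set_index("ts")
df.loc[df.index[rng.choice(n, 5, replace=False)], "incident"] = 1  # declared incidents

# Rolling-window features: recent level, spikes, and short-term trend.
features = pd.DataFrame({
    "cpu_mean_15m": df["cpu_percent"].rolling("15min").mean(),
    "latency_max_15m": df["p99_latency_ms"].rolling("15min").max(),
    "errors_sum_15m": df["error_count"].rolling("15min").sum(),
    "cpu_trend_15m": df["cpu_percent"].diff(15),
})

# Label: 1 if an incident is declared within the next `horizon` minutes.
horizon = 10
incidents = df["incident"].to_numpy()
features["incident_next_10m"] = [
    int(incidents[i + 1 : i + 1 + horizon].max()) if i + 1 < n else 0
    for i in range(n)
]

training_set = features.dropna()
print(training_set.head())
```

A table like this, features on one side and a label column on the other, is the typical shape of input a tabular AutoML job expects.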
Once that understanding is there, you can launch the pilot project with AutoML and later compare the manual results against the AutoML results.

Yes, there are some implementation challenges here too. The first is data quality: monitoring data is often inconsistent and incomplete. The solution I suggest is data validation pipelines, so that inputs meet a standard before they reach the model. If the data is accurate and the relationships within it are correct, you will get correct output, so check the data at the point where you pass it to the model; then the quality issues get resolved, because the scoring will suffer if the data is not prepared properly.

The second is organizational resistance. It can be difficult for SRE teams to trust an automated process, because critical decisions are involved, of course. What I suggest is to start with non-critical systems, and once you have some confidence, go for a side-by-side comparison of the manual results and the automated decisions. Once you see that comparison, you can adopt these approaches more widely.

The third is integration complexity: connecting AutoML outputs to your existing monitoring tools. If you have platform-specific tools, it is hard to integrate everything in one stretch with the AutoML process. For that, we can leverage standard APIs and architectures that create loosely coupled integration points, and nowadays the cloud AutoML services provide plenty of integrations we can take advantage of. So a successful implementation always requires thoughtful attention to both the technical and the organizational challenges.

Most domains are using this already, but I want to point out how much leverage some areas are getting. Take financial services: AutoML predicts how much load to expect during market hours or stock exchange timings, so during those peak hours we can decide whether to scale out and increase resources, and we can take preventive measures. It can also detect anomalies in transactions or in server-level logs, so fraud or system failures can be detected and acted on quickly.

In e-commerce, it predicts traffic spikes around sales periods such as Black Friday, New Year, or other holidays, when sales traffic is high and peak loads are expected. It predicts this automatically and surfaces it to us in advance, so we can decide how to scale the compute. It also detects anomalies and revenue loss from failed payments; for example, if one payment flow is repeatedly not going through, we can put preventive measures in place.

In healthcare, it monitors IoT devices, patient data systems, and EMR medical records, and predictive maintenance can happen there. In the same way, telecom providers can detect outages in real time, identify failover routes, and forecast usage patterns for bandwidth allocation and infrastructure deployment.
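As a small sketch of the traffic forecasting thread running through these examples, the snippet below fits a simple trend-plus-time-of-day model to a synthetic hourly request series, forecasts the next day, and decides whether to scale out ahead of the peak. It assumes scikit-learn and NumPy, and the traffic pattern and capacity ceiling are made-up assumptions; a production setup would use a proper time-series or AutoML forecasting service.

```python
# Sketch: forecast near-term request volume and flag when to scale out ahead of a peak.
# Assumes scikit-learn/NumPy; the traffic pattern and capacity ceiling are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Two weeks of hourly request counts with a daily cycle and a rising trend.
hours = np.arange(24 * 14)
requests = (
    5000
    + 40 * hours                                # steady growth
    + 3000 * np.sin(2 * np.pi * hours / 24)     # daily peak/trough
    + rng.normal(0, 300, hours.size)            # noise
)

# Fit a simple model on trend + time-of-day features.
X = np.column_stack([
    hours,
    np.sin(2 * np.pi * hours / 24),
    np.cos(2 * np.pi * hours / 24),
])
model = LinearRegression().fit(X, requests)

# Forecast the next 24 hours and compare against current capacity.
future = np.arange(hours[-1] + 1, hours[-1] + 25)
X_future = np.column_stack([
    future,
    np.sin(2 * np.pi * future / 24),
    np.cos(2 * np.pi * future / 24),
])
forecast = model.predict(X_future)

capacity_per_hour = 20000   # hypothetical ceiling for the current fleet
if forecast.max() > 0.8 * capacity_per_hour:
    print("Forecast peak %.0f req/h: scale out before the next peak." % forecast.max())
else:
    print("Forecast peak %.0f req/h: current capacity is fine." % forecast.max())
```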
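And going back to the data quality challenge above, a minimal validation gate in front of the model can be as small as the sketch below, which rejects a monitoring batch when the schema, value ranges, or ordering look wrong. It assumes pandas; the expected columns and bounds are illustrative assumptions.

```python
# Sketch: a minimal validation gate in front of the model, so inconsistent or
# incomplete monitoring data is rejected before it can skew training or scoring.
# Assumes pandas; the expected columns and bounds are illustrative only.
import pandas as pd

EXPECTED_COLUMNS = {"ts", "cpu_percent", "p99_latency_ms", "error_count"}

def validate_monitoring_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the batch can be used."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # no point checking further without the schema

    if df[list(EXPECTED_COLUMNS - {"ts"})].isna().mean().max() > 0.05:
        problems.append("more than 5% null values in a metric column")
    if not df["cpu_percent"].between(0, 100).all():
        problems.append("cpu_percent outside 0-100")
    if (df["p99_latency_ms"] < 0).any() or (df["error_count"] < 0).any():
        problems.append("negative latency or error counts")
    if not df["ts"].is_monotonic_increasing:
        problems.append("timestamps are not ordered")
    return problems

# Usage: only hand the batch to the AutoML job when validation passes.
batch = pd.DataFrame({
    "ts": pd.date_range("2025-01-01", periods=3, freq="min"),
    "cpu_percent": [42.0, 55.0, 61.0],
    "p99_latency_ms": [120.0, 140.0, 133.0],
    "error_count": [1, 0, 2],
})
issues = validate_monitoring_batch(batch)
print("OK to train" if not issues else f"rejected: {issues}")
```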
Here I would like to conclude by saying that AutoML is not a replacement for SRE; it only enhances it. The tooling augments the capabilities of SRE practice: it automates routine analysis, decision making, anomaly detection, and fraud detection, the kinds of scenarios SREs otherwise cover on a manual basis. So we can go for this approach, think about it strategically, and take our decisions from there. It allows SREs to focus on the strategic tasks; when the load is heavy, it is hard to take those decisions with individual tools and all their compatibility constraints, whereas AutoML handles that very well. That is what I wanted to conclude here. Thank you so much for giving me this opportunity.
...

Swapna Anugu

Cloud Architect, Lead Data Engineer, Data Scientist @ Manasi Information Technologies

Swapna Anugu's LinkedIn account


