Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

SRE and Healthcare Data Compliance: Balancing Security and Reliability


Abstract

Site Reliability Engineering (SRE) is transforming how healthcare organizations maintain uptime, security, and performance in critical systems. In this talk,

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, hi everyone. I'm Sidhartha Velishala, and I'm super excited to be here at Conf42. Today I'm talking about something I'm really passionate about: how AI is changing the game in SRE. We are moving from just reacting to issues to predicting and preventing them. It's going from putting fires out to building systems that do not catch fire in the first place.
Let's start with a quick introduction. AI-driven SRE basically means using machine learning and smart automation in how we keep systems reliable. Instead of just waiting for alerts to go off and scrambling to fix things, AI helps us spot the signs earlier and act before anything breaks. It's a big shift in how we work, and it's already happening in a lot of teams.
Let's deep dive into AI-driven SRE. So what exactly is AI-driven SRE? It's a mix of the usual SRE work we have been doing over the years, like monitoring, alerts, and reliability, in combination with machine learning, anomaly detection, and forecasting. The goal is to make systems that are not just reactive, but smarter and proactive. Your system says, "Hey, I've seen this pattern before, and it usually means something is about to go wrong." That means it is giving you a heads-up before users even notice.
How do we enable proactive IT operations? One of the biggest changes with AI in SRE is how we shift from reacting to preventing. Instead of waiting for an outage to hit and then jumping in, AI lets us act early. It spots signals and trends that would take a human hours or days to even notice. This means fewer surprises and way fewer fire drills, trust me.
How does this help with system reliability? When you are catching issues earlier, your systems stay up more, right? That means less downtime, fewer angry customers, and a happier team. AI helps us get more control over our systems instead of just reacting to chaos.
And how do you optimize your infrastructure maintenance?
Here's where it gets really cool. AI can also tell you when parts of your infrastructure are starting to wear out. Isn't that cool? Maybe a server is acting weird or a disk is about to fail, anything like that. It varies from organization to organization and from system to system. AI looks at patterns in the data and says, "Hey, something is off here," so you can fix things before they even break and schedule downtime when you want, not when the system forces you to.
Let's talk about the struggle with reactive ops first, because we have all been there. Stuff breaks out of nowhere. Alerts do not make sense. Root cause analysis takes forever. And we are always playing catch-up. This way of working is really stressful and tiring, and honestly it doesn't scale. That's why AI-driven SRE is such a breath of fresh air.
So let's talk about an important factor, which is AI-powered anomaly detection. One of the biggest things about AI is anomaly detection. It's like giving your system superpowers. It can watch thousands of metrics and logs at once and instantly say, "Hey, something is off here." Instead of hundreds of alerts, you just get one clear signal, and you can act fast, often before anything actually breaks.
Let's talk about predictive maintenance. Predictive maintenance is another big win. Think of it as your system knowing when something is going to fail based on history or behavior. So instead of reacting after a server crashes, AI tells you before it happens, you schedule a fix, and everything keeps running smoothly.
So what are the predictive maintenance techniques? There are a bunch of techniques for this: anomaly detection, as I just spoke about; remaining useful life estimation, which is basically how much longer something will last; and condition-based monitoring.
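To make two of the techniques named above more concrete, here is a minimal Python sketch. It is not the tooling used in the talk, which does not describe its models. It assumes the simplest possible stand-ins: a rolling z-score for anomaly detection and a least-squares line for remaining useful life. The function names, the window size, the threshold, and the sample readings are all hypothetical.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window.

    Assumption: a rolling z-score stands in for whatever model a real
    AI-driven SRE tool would use.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def remaining_useful_life(readings, failure_level):
    """Estimate how many more steps until a rising reading hits failure_level.

    Fits a least-squares line to the readings (x = 0, 1, 2, ...) and
    extrapolates it. Returns None if there is too little data or no
    upward trend.
    """
    n = len(readings)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = mean(readings)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(readings))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    steps_at_failure = (failure_level - intercept) / slope
    # Subtract the index of the latest reading to get steps remaining.
    return max(0.0, steps_at_failure - (n - 1))


# Hypothetical sample data: request latency in ms, and disk usage in percent.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
disk_used_pct = [50, 52, 54, 56, 58]

print(zscore_anomalies(latency_ms))
print(remaining_useful_life(disk_used_pct, failure_level=90))
```

With daily disk readings, the second function's result reads as "days until the disk reaches the failure level", which is the kind of number that lets a team schedule a fix instead of reacting to a crash. Real systems would usually need something more robust than a straight line, but the shape of the idea is the same.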
These are all ways to understand the health of your system proactively. One other important one is forecasting your infrastructure demand. This part is super helpful too. With forecasting, AI looks at past usage, your traffic patterns, even seasonality like spikes during holidays, and helps you plan ahead so you do not under-provision and crash, or over-provision and waste resources. That way you'll be able to optimize your uptime and reduce downtime.
So when you put all this together, anomaly detection, predictive maintenance, and forecasting, you get more uptime, fewer outages, and smoother operations. AI helps you run things better, smarter, and with less stress. And your customers don't even know that there was almost a problem.
So let's get into real-world case studies. This isn't just theory, it's already working. I've seen companies use these tools to catch weird behavior early, fix stuff faster, and seriously cut down on downtime. AI helps teams collaborate better and stay ahead of issues.
Today I would like to walk you through a case study from one of the leading tech companies I worked with. Quick story: we rolled out AI-based SRE tools. We trained models to spot problems early, set up dashboards, and gave the team way more visibility, and that helped the team a lot. And guess what? Downtime dropped by over 60% in eight months. The team had fewer late-night incidents, and reliability shot up.
So that is the case study I wanted to discuss. Let's move on to the future of AI and SRE. What's next for AI and SRE? We are seeing faster anomaly detection, as we spoke about, and better predictions because of it. You have self-healing infrastructure mechanisms out there, and AI that helps us meet SLAs automatically. It's moving fast, and it is super exciting to be part of it.
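The demand forecasting idea from earlier in this part of the talk (look at past usage and seasonality, then provision ahead of time) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the approach used in the case study: it uses a plain seasonal average instead of a trained model, and the function names, the 20% headroom, the per-instance capacity, and the traffic numbers are all hypothetical.

```python
import math


def seasonal_forecast(history, season_length, horizon):
    """Forecast `horizon` future points by averaging the same slot of past seasons.

    Assumes history[0] is the first slot of a season (for example, Monday
    when season_length is 7 and the data is daily).
    """
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    forecast = []
    for step in range(horizon):
        slot = (len(history) + step) % season_length
        same_slot = history[slot::season_length]
        forecast.append(sum(same_slot) / len(same_slot))
    return forecast


def instances_needed(requests_per_second, capacity_per_instance, headroom=0.2):
    """Instances required to serve the load with some spare headroom."""
    return math.ceil(requests_per_second * (1 + headroom) / capacity_per_instance)


# Hypothetical daily peak traffic (requests per second) for two weeks,
# Monday through Sunday, with a weekend spike.
peak_rps = [
    100, 110, 120, 115, 130, 200, 210,
    110, 120, 130, 125, 140, 220, 230,
]

next_week = seasonal_forecast(peak_rps, season_length=7, horizon=7)
plan = [instances_needed(rps, capacity_per_instance=50) for rps in next_week]
print(next_week)
print(plan)
```

The headroom parameter is where the under-provision versus over-provision trade-off from the talk shows up: too little and a spike causes an outage, too much and resources are wasted.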
So finally, to wrap up: AI-driven SRE is here, and it's changing how we do reliability. It helps us move from reacting to predicting, from chaos to control. So if you are an SRE, an engineer, or just someone who wants fewer 2 a.m. pages, this is the direction we are heading. Thanks again for watching, and a huge thanks to Conf42 for the platform. If you ever want to chat more about this, feel free to reach out to me. Thank you.
...

Sidhartha Velishala

Technical SME @ Humana

Sidhartha Velishala's LinkedIn account


