Building Intelligent Platform Infrastructure: AI-Driven Observability and Self-Healing Systems at Scale

Abstract

Turn your infrastructure into a self-healing machine! Learn how AI cut our downtime from 4.38 hours to 5 minutes yearly, saved $3.2M in ops costs, and made deployments 3x faster. Real code, real metrics, zero hype - just battle-tested ML that actually works in production.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello everyone. I am Ajay. We are here to talk about how to build an intelligent platform. Over the years at IBM and at AT&T, I've seen what it takes to run some of the most demanding high-scale platforms in the world. At IBM, we worked with multiple enterprise clients building hybrid cloud platforms. At AT&T, I was part of projects operating at telecom scale, where millions of customers rely on services 24/7. That experience taught me an important lesson: infrastructure doesn't just need to scale in size, it also needs to scale in intelligence. That's what we'll talk about today: how we can use AI-driven observability and self-healing to make platforms resilient and future-ready.

Picture this: it's 3:00 AM and you get a call for an incident. You log into the dashboards. CPU looks fine, memory looks stable, so you dive into the logs: millions of lines, and traces missing half of the context. By the time you have correlated the pieces, the customer impact is already done, and there might be an outage. So that's the world: lots of data, little insight. Our goal here is to flip that equation: fewer alarms and faster root cause detection. Ideally, no calls at 3:00 AM.

Let's look at the agenda. To get there, we have to build intelligent platforms and use AI where it is essential to scale and to identify problems beforehand. We'll cover how to architect and implement AI-driven observability in our systems, then advanced anomaly detection and what goes beyond simple thresholds. Then we will look at self-healing infrastructure, which closes the loop, and finally an implementation roadmap with real examples from my work. By the end, you'll have a sense not just of the why, but also of the how. Okay, let's move ahead.

The evolution of platforms. Let's rewind a bit. At IBM, I worked on enterprise platforms where infrastructure meant racks of servers, VMware clusters, or private clouds.
It was all about uptime, patching, and careful provisioning. Fast forward: the platform was supporting millions of API calls every second for telecom services. We had to manage real-time traffic, scaling microservices across hybrid environments while meeting strict SLAs. This shows the journey of platform engineering: from managing servers, to managing the cloud, to now enabling intelligent platforms. It's no longer enough to provision the infrastructure. The platform itself has to detect, diagnose, and in some cases heal itself. That's the future.

Observability is the foundation of resilience, but traditional observability has major flaws at scale. Metrics, logs, and traces work great when you are running tens of services. But when you are running thousands of services, metrics spike constantly. Logs are noisy; 90% of them don't matter. Traces are fragmented. I've seen teams at AT&T drowning in alerts. Each microservice produces alarms, and most of them are false positives. Instead of helping, observability becomes a distraction. So the challenge is this: how do you extract signal from noise? How do you go from data overload to real insight?

That's where AI-driven observability comes in. AI is how we bridge the gap, and here is how it changes the pillars. So these are the four pillars. For metrics, AI looks for subtle drifts, seasonality, and predictive spikes. For logs, NLP models sift through millions of lines to highlight unusual patterns in the error messages. For traces, graph analysis connects transactions across services, showing causality and not just correlation. Instead of a sea of alerts, AI gives you context. It doesn't just tell you something is wrong. It suggests what's wrong, where it started, and why it's spreading.
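
To make the metrics pillar concrete, here is a minimal sketch of seasonality-aware anomaly detection compared with a static threshold. It is not the model used in the talk: the function name `seasonal_anomalies`, the robust z-score approach (median and MAD per seasonal slot), and the sample data are all assumptions chosen for illustration.

```python
from statistics import median


def seasonal_anomalies(values, period, z_threshold=3.5):
    """Return indices whose value is unusual for its slot in the seasonal cycle.

    Each point is compared only with points at the same phase (for example,
    the same hour of day), using a robust z-score based on median and MAD.
    This is an illustrative stand-in for a trained model.
    """
    anomalies = []
    for phase in range(period):
        idx = list(range(phase, len(values), period))
        bucket = [values[i] for i in idx]
        if len(bucket) < 3:
            continue  # not enough history for this slot
        med = median(bucket)
        mad = median(abs(v - med) for v in bucket)
        # 1.4826 scales MAD to be comparable with a standard deviation;
        # the small floor avoids division by zero on perfectly flat data.
        scale = max(1.4826 * mad, 1e-9)
        for i in idx:
            if abs(values[i] - med) / scale > z_threshold:
                anomalies.append(i)
    return sorted(anomalies)


# Hypothetical latency series: 6 "days" of 4 slots each, with small jitter.
base = [10, 50, 80, 30]
values = [base[p] + (d % 3) - 1 for d in range(6) for p in range(4)]
values[16] = 25  # quiet-hour slot on day 4 drifts from ~10 to 25

static_hits = [i for i, v in enumerate(values) if v > 90]  # static threshold
seasonal_hits = seasonal_anomalies(values, period=4)
print("static threshold:", static_hits)
print("seasonal check:", seasonal_hits)
```

In this sample the value 25 is far below any sensible static threshold, since the busy slot routinely sits near 80, yet it is clearly abnormal for the quiet slot it occurs in. That is the kind of subtle drift the speaker says thresholds miss.
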
Think of it like moving from raw CCTV footage to a smart assistant that points out suspicious activity before it escalates.

So how do we build this? At AT&T, our architecture looked something like this. First, data ingestion: streams of APIs, network logs, and metrics pipelines. Then there's a processing layer: tools like Kafka, Spark, and Flink handling millions of events or requests per second. Then there are AI/ML models trained to detect anomalies, forecast demand, and correlate incidents before they occur. Finally, integration: the results are pushed back into existing systems, dashboards, incident management, or even CI/CD pipelines. For example, we integrated anomaly detection into our CI/CD pipelines. Instead of deploying broken builds and discovering the problems later, we flagged regressions early. That meant fewer customer-visible issues and better reliability.

Coming to advanced anomaly detection: thresholds are blunt instruments. AI gives us more refined tools: time series forecasting to predict future load and degradation, graph-based correlation, and NLP on logs. For example, in telecom signaling services, a service might show a tiny latency increase. Humans would miss it. An AI model trained on historical outages can flag it as a precursor to a major issue. That's the difference between firefighting after the outage and preventing it altogether.

Then comes root cause analysis, which is where AI shines. Traditionally, root cause analysis means hours on bridge calls, with networking teams, database teams, and application teams all sitting together for hours to work out what caused the issue. AI changes this by mapping dependencies. It can highlight the one failing node that triggered a thousand downstream alarms. At AT&T, using AI-enhanced dependency mapping
cut our RCA time dramatically, from hours to mere minutes. That's the difference between a four-hour outage and a 20-minute hiccup.

Let's talk about self-healing infrastructure. Detection is great, but the next step is healing. Think of infrastructure as an immune system: it fights back. The cycle looks like this: detect anomalies, diagnose the root cause, decide on remediation, and then learn by feeding the results back to the model. Examples include automatically restarting a failed container, scaling APIs during peak promotional events like an Apple release launch, and isolating a dependency before it takes down the system. Of course, we can add guardrails like thresholds, approvals, and audit logs. Automation must be powerful, but also safe.

How do you adopt AI-driven observability? The roadmap usually looks like this. Assess your observability gaps. Run a pilot: start small, maybe on one critical service. Then scale it across multiple services and environments. For long-term maintenance, continuously refine the models and learn from incidents. The key is incremental progress. You don't need a big-bang transformation. Start where the pain is greatest.

So where are we heading? I see three trends. First, reinforcement learning applied to operations: systems that optimize themselves by trial and feedback. Second, safety-aware automation, balancing autonomy with guardrails. Third, industry-wide adoption of self-healing by default, where resilience isn't an add-on but a core design principle. This is the future of platform engineering.

Let me leave you with this thought. The future of platform engineering is not about fighting outages. It's about building fireproof systems. AI-driven observability and self-healing infrastructure help us move from reactive firefighting to proactive reliability.
It's about giving engineers more time to innovate and less time chasing through the logs. Thank you.
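
As a supplementary sketch of the self-healing cycle described in the talk (decide on a remediation, act, and record the outcome), the code below shows one way the three guardrails the speaker names (thresholds, approvals, audit logs) could wrap automated actions. Every name here (`Guardrails`, `SelfHealer`, `restart_container`, `isolate_dependency`, the outcome strings) is hypothetical and not taken from the speaker's systems.

```python
from dataclasses import dataclass, field


@dataclass
class Guardrails:
    # Threshold: at most this many automated actions before escalating.
    max_actions_per_window: int = 3
    # Approvals: riskier actions that need a human sign-off first.
    needs_approval: frozenset = frozenset({"isolate_dependency"})


@dataclass
class SelfHealer:
    playbook: dict  # diagnosis -> action name
    actions: dict   # action name -> callable(target)
    guardrails: Guardrails = field(default_factory=Guardrails)
    audit_log: list = field(default_factory=list)
    actions_taken: int = 0

    def handle(self, target, diagnosis, approved=False):
        """Decide on a remediation for a diagnosed problem and apply it if safe."""
        action = self.playbook.get(diagnosis)
        if action is None:
            outcome = "escalated: no known remediation"
        elif self.actions_taken >= self.guardrails.max_actions_per_window:
            outcome = "escalated: action budget exhausted"
        elif action in self.guardrails.needs_approval and not approved:
            outcome = "pending approval"
        else:
            self.actions[action](target)
            self.actions_taken += 1
            outcome = "executed"
        # Audit log: every decision is recorded, including refusals, so the
        # history can later be reviewed or fed back into a model.
        self.audit_log.append((target, diagnosis, action, outcome))
        return outcome


restarted = []
healer = SelfHealer(
    playbook={"container_crash": "restart_container",
              "bad_dependency": "isolate_dependency"},
    actions={"restart_container": restarted.append,
             "isolate_dependency": lambda target: None},
)
print(healer.handle("checkout-api", "container_crash"))
print(healer.handle("billing-db", "bad_dependency"))
print(healer.handle("billing-db", "bad_dependency", approved=True))
```

The design choice worth noting is that the guardrail checks run before any action and that refused actions are logged as well, which keeps the automation auditable even when it declines to act.
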

Conf42 Platform Engineering 2025 - Online

September 04 2025 - premiere 5PM GMT

Ajay Averineni

Lead Application Developer @ IBM
