Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Beyond Manual Firefighting: Building Self-Healing Data Pipelines with AI for Platform Engineering


Abstract

End 3 a.m. pipeline firefighting forever: AI-powered self-healing systems autonomously fix production failures in seconds. See real Fortune 500 deployments achieving 99.9% uptime and 40% team productivity gains. Live demos and battle-tested code included.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Welcome to Conf42 Platform Engineering 2025. This is Lakshmi. I bring over 20 years of experience in building enterprise data lakes and enterprise data analytics platforms across both on-prem and cloud environments. Today's talk focuses on "Beyond Manual Firefighting: Building Self-Healing Data Pipelines with AI for Platform Engineering." Platform engineers today face unprecedented operational challenges because of the exponential growth of data, and of data pipelines, across enterprises. Every transaction or user interaction generates data that has to be ingested, transformed, validated, and made available to the analytics platform. Even with the latest advancements in orchestration frameworks and cloud-native platforms, pipeline failures remain one of the most persistent pain points across production environments. On the next slide, we talk about the current challenges: engineers spend the majority of their time debugging failures, fixing broken data pipelines manually, and restoring service continuity. This talk explores the transformative potential of AI-driven self-healing data pipelines that leverage anomaly detection, root cause analysis, and autonomous remediation, so that enterprises can move from reactive firefighting to proactive, and eventually autonomous, operation. We will focus on three use cases across the retail, healthcare, and financial domains, and we will also cover practical implementation strategies, the factors to consider for governance, and the measurable return on investment of adopting AI in platform engineering. By way of introduction: modern enterprises are defined by data. Every strategic decision, customer experience, and compliance report depends on seamless data flow from source systems to analytical platforms and dashboards.
To handle this demand, platform engineers have to design complex data pipelines that operate with high reliability. Due to the complexity of these pipelines, a small schema mismatch, a network outage, or a configuration or transformation error can break the entire downstream workflow, which means lost revenue opportunities, compliance violations, and delayed insights for decision making. On this slide, we talk about how traditional monitoring works. A traditional monitoring system notifies users when something fails, and remediation is almost entirely manual. As of today, roughly 60% of a platform engineer's operational bandwidth goes into troubleshooting and recovering jobs, and given the number of interconnected jobs and data pipelines, the probability of at least one failure per day is near certain. Against this backdrop, AI-powered self-healing data pipelines represent a paradigm shift. Instead of engineers reacting to failures, pipelines can autonomously detect anomalies, diagnose the root cause, and take corrective action. The role of engineers then shifts from firefighter to supervisor, focusing on governance, auditing, and improving the intelligence of these autonomous systems. The burden of manual troubleshooting for any enterprise is its associated cost: downtime of a critical pipeline in a retail organization delays inventory updates across thousands of stores, while in financial services a broken fraud detection pipeline can cost the institution millions of dollars. On this slide, we talk about the key challenges of manual pipeline management. The manual process suffers from a high failure frequency.
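As a concrete illustration of the detection step in this shift, here is a minimal sketch of how a pipeline metric could be screened for abnormal behavior. This is a simple statistical stand-in, not the transformer-based detectors the talk refers to, and all class and function names are illustrative assumptions, not a real API.

```python
from collections import deque
import statistics

class MetricAnomalyDetector:
    """Flags pipeline metrics that deviate sharply from recent history.

    Hypothetical sketch: keeps a sliding window of recent samples and
    flags any new observation more than `threshold` standard deviations
    from the window mean.
    """

    def __init__(self, window: int = 50, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Record a metric sample; return True if it looks anomalous."""
        is_anomaly = False
        if len(self.history) >= 10:  # wait for a minimal baseline first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9  # avoid div-by-zero
            is_anomaly = abs(value - mean) / stdev > self.threshold
        self.history.append(value)
        return is_anomaly

# Example: steady ingest latency, then a sudden spike.
detector = MetricAnomalyDetector()
for latency_ms in [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]:
    detector.observe(latency_ms)
print(detector.observe(101))   # in line with the baseline -> False
print(detector.observe(950))   # sharp spike -> True
```

In a production setting the same interface could wrap a learned model; the control loop that consumes these flags would then hand anomalous events to the diagnosis stage.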
With hundreds of interconnected jobs, the failure of one job can impact downstream jobs, the root cause is difficult to trace, and logs are often fragmented across systems, requiring engineers to stitch them together and close incidents manually. This situation is not sustainable for growing data ecosystems. The promise of AI in self-healing data pipelines rests on the following four capabilities. The first is anomaly detection: algorithms continuously analyze metrics, logs, and data quality checks to detect abnormal behavior, and transformer-based models have demonstrated accuracy exceeding 90% in identifying anomalies. The second capability is root cause analysis: using large language models such as GPT-4, we can analyze logs, trace dependency graphs, and identify the most probable root cause with minimal human input. The third capability is autonomous remediation: once the cause is identified, reinforcement learning agents or rule-based automation triggers can take corrective action. The fourth is continuous learning: the system learns from previous failures, successful remediations, and human feedback. Here is a case study from the retail sector. A global retail company managing two terabytes of data ingested from source to target was facing frequent pipeline failures because of schema changes and high-volume traffic spikes, and the manual recovery process often took hours, leading to delayed product availability on the e-commerce platform. With AI, 99.9% uptime was achieved across critical data pipelines, anomaly detection with 94% accuracy identified schema mismatches, automated recovery brought workflow restoration from two hours down to five minutes, and operational cost savings from reduced downtime were estimated at a million dollars. In another case study, from the healthcare sector, the IT team uses GPT-4 to automatically parse logs and identify failures such as authentication errors or misconfigured policies. With autonomous remediation, failing nodes are isolated and data is rerouted, while high-risk scenarios are handed over to engineers, who review the AI's actions. The results were profound: recovery times dropped by 70%, and compliance audits confirmed that all AI-driven actions were logged and auditable. Another case study was conducted at a FinTech company. In the FinTech sector, real-time fraud detection demands sub-second responsiveness, and a leading platform faced the challenge that even short outages could expose it to fraudulent transactions. With AI-driven self-healing data pipelines, reinforcement learning agents perform dynamic resource allocation: whenever volume grows, more compute resources are immediately allocated to the fraud detection models so that fraud is still identified with sub-second responsiveness. Automated failover routes failed workloads to backup pipelines without human intervention, and recovery times dropped to sub-second levels, ensuring uninterrupted fraud detection. The adoption of self-healing pipelines thus helps not only in reducing fraud risk but also in reassuring regulators and customers of the platform's reliability. The technical foundation of self-healing pipelines rests on detect, analyze, optimize: transformer-based anomaly detection identifies anomalies in time-series metrics and logs, natural language models perform root cause analysis, and reinforcement learning agents handle resource optimization, continuously learning optimal policies for scaling, routing, and resource allocation. How can enterprises implement these self-healing data pipelines? There are three architecture patterns: a centralized AI layer, decentralized embedded agents, and a hybrid model.
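The fintech pattern described above, scaling up on volume spikes and failing over when the primary route is unhealthy, can be sketched as a simple rule-based policy. This is an illustrative stand-in for the reinforcement-learning agent mentioned in the talk; every name here (`PipelineState`, `plan_actions`, the capacity numbers) is a hypothetical assumption, not a real system.

```python
import dataclasses
import math

@dataclasses.dataclass
class PipelineState:
    events_per_sec: float   # incoming transaction volume
    primary_healthy: bool   # health-check result for the primary route

def plan_actions(state: PipelineState,
                 baseline_workers: int = 4,
                 capacity_per_worker: float = 500.0) -> dict:
    """Decide scaling and routing for one control-loop tick.

    Rule-based sketch: size the worker pool to the observed volume and
    fail over to a backup pipeline when the primary is unhealthy.
    """
    needed = math.ceil(state.events_per_sec / capacity_per_worker)
    return {
        "workers": max(baseline_workers, needed),
        "route": "primary" if state.primary_healthy else "backup",
    }

# Normal load on a healthy primary: keep the baseline pool.
print(plan_actions(PipelineState(1200.0, True)))
# Volume spike while the primary is down: scale up and fail over.
print(plan_actions(PipelineState(6000.0, False)))
```

An RL agent would replace the fixed thresholds with a learned policy, but the interface, observe state, emit scaling and routing actions each tick, stays the same.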
Governance and trust play a critical role whenever an organization adopts AI. Self-healing data pipelines must support auditability, meaning every remediation is logged with an explanation, and industry-specific regulations must be followed within the compliance framework. Human-AI collaboration means engineers remain in supervisory roles, reviewing high-impact remediations, and these autonomous systems operate within guardrails to earn trust. Building that trust in self-healing systems requires transparency, continuous monitoring, and cultural change within the organization. On the business impact: when organizations implement self-healing data pipelines, what we have seen is that a downtime reduction of roughly 99% is achievable, and a large amount of money is saved by handling volume spikes and rerouting failed workloads to other nodes, keeping the platform available at all times. Teams saw a productivity gain of around 70% because data pipelines are automatically rerun on any failure. The benefits of self-healing pipelines extend beyond technical resilience to customer satisfaction through faster, more reliable services. The return on investment is often calculated by comparing avoided downtime costs, reduced fraud exposure, and engineering hours saved. As for the future of self-healing pipelines: as data and enterprise data pipelines keep growing, AI-powered self-healing pipelines are a battle-tested solution that is already transforming retail, healthcare, and finance. By adopting anomaly detection, natural-language-driven root cause analysis, reinforcement learning, and continuous learning, organizations can achieve unprecedented reliability, compliance, and efficiency. That's all for today's talk. Thank you.
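The auditability guardrail the talk describes, every AI-driven action logged with an explanation and high-risk actions held for human review, can be sketched as follows. This is a minimal illustrative sketch; the function names, risk labels, and log shape are assumptions, not the speaker's implementation.

```python
import json
import time

AUDIT_LOG = []  # in practice this would be durable, append-only storage

def remediate(action: str, explanation: str, risk: str) -> str:
    """Apply (or defer) a remediation and record an auditable entry.

    Governance guardrail sketch: every AI-driven action is logged with
    its explanation, and high-risk actions are held for human review
    instead of being applied autonomously.
    """
    status = "pending_human_review" if risk == "high" else "applied"
    AUDIT_LOG.append({
        "ts": time.time(),
        "action": action,
        "explanation": explanation,
        "risk": risk,
        "status": status,
    })
    return status

print(remediate("restart_ingest_job", "retryable network timeout", "low"))
print(remediate("drop_partition", "schema mismatch in raw zone", "high"))
print(json.dumps([{k: e[k] for k in ("action", "status")} for e in AUDIT_LOG]))
```

Compliance audits like the ones mentioned in the healthcare case study would then replay this log to confirm that each autonomous action had an explanation and the right approval path.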
...

Lakshmi Srinivasarao Kothamasu

Principal Data Engineer @ Fidelity Investments



