Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
Welcome to Conf42 Platform Engineering 2025.
This is LaMi.
I bring over 20 years of experience in building enterprise data lakes and enterprise data analytics platforms across both on-premises and cloud environments.
Today's talk focuses on moving beyond manual firefighting: building self-healing data pipelines with AI for platform engineering.
Platform engineers today face unprecedented operational challenges because of the exponential growth of data and of data pipelines across enterprises.
Every transaction or user interaction generates data that has to be ingested, transformed, validated, and made available to data analytics platforms.
Even with advancements in orchestration frameworks and cloud-native platforms, pipeline failures remain one of the most persistent pain points across production environments.
In the next slide, we talk about the current challenges.
Engineers spend much of their time debugging failures, fixing broken data pipelines manually, and restoring service continuity.
This talk explores the transformative potential of AI-driven self-healing data pipelines that leverage anomaly detection, root cause analysis, and autonomous remediation, so that enterprises can move from reactive firefighting to proactive and eventually autonomous operations.
In this talk, we'll focus on three use cases across the retail, healthcare, and financial domains. We'll also talk about practical implementation strategies, the factors to be considered for governance, and the measurable return on investment of adopting AI in platform engineering.
In the introduction, we talk about how modern enterprises are defined by data: every strategic decision, customer experience, and compliance report depends upon seamless data flow from source systems to analytical platforms and dashboards.
To handle this demand, platform engineers have to design complex data pipelines that operate with high reliability. Due to the complexity of these pipelines, even a small schema mismatch, network outage, or configuration change can break the entire downstream workflow, which means lost revenue opportunities, compliance violations, and delayed insights for decision making.
In this slide, we are going to talk about how traditional monitoring works today.
Traditional monitoring systems notify users when something fails, and remediation is almost entirely manual.
As of today, around 60% of platform engineers' operational bandwidth goes into troubleshooting and recovering jobs, and with the sheer number of data pipelines and interconnected jobs, the probability of at least one failure daily is near certain.
Given all these failures, using AI-powered self-healing data pipelines makes for a paradigm shift.
Instead of engineers reacting to failures, pipelines can autonomously detect anomalies, diagnose the root cause, and take corrective action.
By doing that, the role of engineers shifts from firefighter to supervisor, focusing on governance, auditing, and improving the intelligence of these autonomous systems.
So what is the burden of this manual troubleshooting for an enterprise? It is cost: any downtime of a critical pipeline in a retail organization delays inventory updates across thousands of stores, whereas in financial services a broken fraud detection pipeline can cost the institution millions of dollars.
In this slide, we are going to talk about the key challenges of manual pipeline management.
The manual process suffers from high failure frequency: with hundreds of interconnected jobs, the failure of one job can impact downstream jobs, the root cause is difficult to trace, and logs are often fragmented across systems, requiring engineers to stitch them together and close out incidents manually.
This situation is not sustainable for growing data ecosystems.
So the promise of AI in self-healing data pipelines rests on the following four capabilities.
The first is anomaly detection: algorithms continuously analyze metrics, logs, and data quality checks to detect any abnormal behavior, using transformer-based models that have demonstrated accuracy exceeding 90% in identifying anomalies.
The second capability is root cause analysis: with natural language models such as GPT-4, we can analyze the logs, trace the dependency graphs, and identify the most probable root cause with minimal human input.
The third capability is autonomous remediation: once the cause is identified, reinforcement learning agents or rule-based automation triggers can take corrective actions.
The fourth is continuous learning: learning from previous failures, successful remediations, and human feedback.
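To make the anomaly detection step a little more concrete, here is a minimal sketch; the metric names, window size, and threshold are illustrative assumptions, and a transformer-based detector of the kind mentioned in the talk would replace the simple scoring rule.

```python
# Minimal sketch: flag anomalous pipeline metrics with a rolling z-score.
# Metric names, window size, and threshold are illustrative assumptions;
# a transformer-based detector would replace the score() function.
from collections import deque
from statistics import mean, stdev

class MetricAnomalyDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.window = deque(maxlen=window)   # recent metric values
        self.threshold = threshold           # z-score cutoff

    def score(self, value: float) -> float:
        if len(self.window) < 10:            # not enough history yet
            return 0.0
        mu, sigma = mean(self.window), stdev(self.window)
        return abs(value - mu) / sigma if sigma > 0 else 0.0

    def observe(self, value: float) -> bool:
        """Return True if the new observation looks anomalous."""
        is_anomaly = self.score(value) > self.threshold
        self.window.append(value)
        return is_anomaly

# Example: the ingestion rate suddenly collapses, which should be flagged.
detector = MetricAnomalyDetector()
for rows_per_minute in [1000, 1020, 990, 1005, 1010, 995, 1000, 1015, 1008, 992, 30]:
    if detector.observe(rows_per_minute):
        print("Anomaly detected in ingestion rate:", rows_per_minute)
```

An anomaly flagged this way would then be handed to the root cause analysis and remediation steps described above.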
This is one of the case studies, from the retail sector, where a global retail company managing two terabytes of data ingested from source to target was facing frequent data pipeline failures because of schema changes or high-volume traffic spikes.
The manual recovery process often took hours, leading to delayed product availability on the e-commerce platform.
With AI, 99% uptime across critical data pipelines was achieved, anomaly detection identified schema mismatches with 94% accuracy, and automated recovery brought workflow restoration down from two hours to five minutes, with operational cost savings estimated at a million dollars from the reduced downtime.
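As a rough illustration of the kind of schema-mismatch check such a system might run before ingestion, here is a small sketch; the expected schema and column names are assumptions for illustration, not details from the case study.

```python
# Sketch: detect schema drift in an incoming batch before it breaks
# downstream jobs. The expected schema and column names are assumptions.
EXPECTED_SCHEMA = {"order_id": "int", "sku": "str", "quantity": "int", "price": "float"}

def check_schema(batch_columns: dict) -> list:
    """Return a list of human-readable schema mismatches."""
    issues = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in batch_columns:
            issues.append(f"missing column: {col}")
        elif batch_columns[col] != dtype:
            issues.append(f"type drift on {col}: expected {dtype}, got {batch_columns[col]}")
    for col in batch_columns.keys() - EXPECTED_SCHEMA.keys():
        issues.append(f"unexpected new column: {col}")
    return issues

# Example: a source system changed a type and added a column overnight.
incoming = {"order_id": "int", "sku": "str", "quantity": "str", "discount": "float"}
for issue in check_schema(incoming):
    print("Schema anomaly:", issue)   # hand these to the remediation step
```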
Another case study is in the healthcare sector, where the IT team uses GPT-4 to automatically parse the logs and identify failures such as authentication errors or misconfigured policies.
With autonomous remediation, the system can isolate the failing nodes and reroute the data, and it hands off to the engineers to review the AI's actions for high-risk scenarios.
The results were profound: recovery times dropped by 70%, and compliance audits confirmed that all AI-driven actions were logged and auditable.
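The log-parsing step could look roughly like the sketch below; the call_llm function stands in for whatever GPT-4 client the team actually uses, and the prompt wording and failure categories are my assumptions rather than details from the case study.

```python
# Sketch: ask an LLM to classify a failed task's logs into a probable root
# cause. call_llm() is a placeholder for an actual GPT-4 client call; the
# categories and prompt wording are illustrative assumptions.
FAILURE_CATEGORIES = ["authentication_error", "policy_misconfiguration",
                      "schema_mismatch", "resource_exhaustion", "unknown"]

def diagnose(task_name: str, log_excerpt: str, call_llm) -> str:
    prompt = (
        "You are a data platform SRE assistant.\n"
        f"Task: {task_name}\n"
        f"Log excerpt:\n{log_excerpt}\n\n"
        f"Classify the most probable root cause as one of {FAILURE_CATEGORIES} "
        "and explain it in one sentence."
    )
    return call_llm(prompt)   # thin wrapper around the team's GPT-4 client

# Usage (hypothetical): diagnose("claims_ingest", tail_of_log, call_llm=my_gpt4_client)
# The diagnosis is attached to the incident and, for high-risk scenarios,
# surfaced to an engineer for review before any remediation runs.
```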
Another case study was conducted at a FinTech company.
In the FinTech sector, real-time fraud detection demands subsecond responsiveness, and this leading platform faced the challenge that even short outages could expose it to fraudulent transactions.
With AI-driven self-healing data pipelines, reinforcement learning agents perform dynamic resource allocation: as and when the volume grows, they immediately allocate more compute resources to the fraud detection models so that fraud is identified with subsecond responsiveness.
Automated failover routes traffic to backup pipelines without human intervention, and recovery times dropped to subsecond levels, ensuring uninterrupted fraud detection.
So the adoption of self-healing pipelines helps not only in reducing fraud risk but also in reassuring regulators and customers of the platform's reliability.
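A very simplified version of that scaling-and-failover logic might look like the sketch below; the thresholds, replica counts, and the scaler and router interfaces are assumptions for illustration, and in the talk's framing a reinforcement learning agent would choose the scaling action instead of these fixed rules.

```python
# Sketch: scale fraud-detection workers on volume spikes and fail over to a
# backup pipeline on health-check failure. Thresholds, replica counts, and
# the scaler/router interfaces are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PipelineStatus:
    events_per_second: float
    healthy: bool

def reconcile(status: PipelineStatus, scaler, router,
              spike_threshold: float = 50_000, max_replicas: int = 20) -> None:
    # Scale out when transaction volume spikes; an RL agent could pick the
    # target replica count instead of this fixed rule.
    if status.events_per_second > spike_threshold:
        scaler.set_replicas("fraud-detector", max_replicas)
    else:
        scaler.set_replicas("fraud-detector", 5)

    # Fail over to the standby pipeline without waiting for a human.
    if not status.healthy:
        router.switch_traffic(to="fraud-pipeline-standby")
```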
So the whole technical foundation of self-healing pipelines relies on detect, analyze, optimize.
Transformer-based anomaly detection identifies anomalies in time-series metrics and logs, natural language models perform the root cause analysis, and reinforcement learning agents handle resource optimization, continuously learning optimal policies for scaling, routing, and resource allocation.
How can enterprises implement these self-healing data pipelines? There are three architecture patterns: one is a centralized AI layer, another is decentralized embedded agents, and the third is a hybrid model.
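To give a feel for the decentralized embedded-agent pattern, here is a minimal sketch of the detect-analyze-remediate loop wrapped around a single pipeline task; the detector, diagnoser, remediation actions, and retry policy are all assumptions for illustration, not a prescribed design.

```python
# Sketch of the detect -> analyze -> remediate loop embedded around one task.
# The task, detector, diagnoser, and remediation actions are placeholders
# wired up per pipeline; names and the retry policy are assumptions.
import time

def run_with_self_healing(task, detector, diagnoser, remediations, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            result = task.run()
            if not detector.is_anomalous(result.metrics):   # detect
                return result
            cause = diagnoser.diagnose(result.logs)          # analyze
        except Exception as exc:
            cause = diagnoser.diagnose(str(exc))             # analyze the failure
        action = remediations.get(cause)                     # pick a corrective action
        if action is None:
            raise RuntimeError(f"No remediation available for cause: {cause}")
        action()                                             # e.g. refresh credentials, reroute
        time.sleep(2 ** attempt)                             # back off before retrying
    raise RuntimeError("Task still failing after automated remediation")
```

In the centralized pattern, the same loop would run in a shared control plane watching many pipelines; the hybrid model mixes both.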
So governance and trust play a critical role whenever any organization or enterprise adopts AI.
With self-healing data pipelines, make sure the system supports auditability, which means every remediation is logged with an explanation and industry-specific regulations are followed within the compliance framework.
For human-AI collaboration, engineers should remain in supervisory roles, reviewing high-impact remediations, and these autonomous systems should operate within guardrails to gain trust.
Building this trust in self-healing systems requires transparency, continuous monitoring, and cultural change within the organization.
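One small sketch of what that auditability requirement could translate to in practice, with assumed field names and an assumed append-only store: every automated action is recorded as an explainable event before it runs.

```python
# Sketch: record every automated remediation as an auditable event before it
# executes. Field names and the append-only JSON-lines store are assumptions.
import json, time, uuid

def audit_remediation(action: str, pipeline: str, cause: str, explanation: str,
                      approved_by: str = "auto", path: str = "remediation_audit.jsonl"):
    event = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "pipeline": pipeline,
        "detected_cause": cause,
        "action": action,
        "explanation": explanation,     # why the system chose this action
        "approved_by": approved_by,     # "auto" or the reviewing engineer
    }
    with open(path, "a") as f:          # append-only log for compliance review
        f.write(json.dumps(event) + "\n")
    return event["id"]

# Usage: log before acting, so high-impact actions can be held for review.
audit_remediation(action="reroute_to_standby", pipeline="claims_ingest",
                  cause="authentication_error",
                  explanation="Token expired; rotating credentials and rerouting.")
```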
So, the business impact: when any organization implements these self-healing data pipelines, what we saw is that downtime reductions of around 99% are achievable, and a huge amount of money is saved by handling volume spikes, routing failed data to different nodes, and making sure the platform is available all the time.
And the team saw a productivity gain of 70% because the data pipelines are automatically rerun on any failure.
So the benefits of self-healing pipelines extend beyond technical resilience to customer satisfaction, with faster, more reliable services.
The return on investment is often calculated by comparing the avoided downtime cost, reduced fraud exposure, and engineering hours saved.
As for the future of self-healing pipelines: as data grows, enterprise data pipelines grow with it, and AI-powered self-healing pipelines are a battle-tested solution that is already transforming retail, healthcare, and finance.
So by adopting anomaly detection, natural-language-driven root cause analysis, reinforcement learning, and continuous learning, organizations can achieve unprecedented reliability, compliance, and efficiency.
That's all for today's talk.
Thank you.