Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Anna Patricia Revu.
I'm an alumnus of NIT.
I graduated in the year 2003, and I have a total of 22 years of software experience in the IT industry.
Overall, I have been working on different technologies throughout my career, starting from Java and J2EE, to middleware technologies, and now to cloud applications.
And yeah, thanks to Conf42.com for giving me this opportunity to talk about self-healing data pipelines today, right?
So what is the problem statement which we are trying to solve?
So what is the growing challenge we have?
Back in the day, we used to integrate applications, right? There was data transformation and data integration happening between different applications. But of late, in this AI world, we see that there is a sudden influx of data.
Data is the key now, right?
All stakeholders are looking for data.
They want to build visualizations.
They want to do some sort of analytics.
There are some machine learning models which you want to train on those particular data sets. So there's a sudden influx of data from all corners, and data engineers build data pipelines day in and day out, right? At a very high magnitude.
The problem here is, if something gets broken in the pipeline, then how do we make sure that the pipeline is back and working properly? Because there is a downside to it: it affects our stakeholders' ability to take key business decisions.
So to address this, traditionally, what we have done is we have always used rule-based systems, correct? We say there is a rule saying that if the CPU percentage is higher than some threshold, then add a new node, or stop the server, or restart the server. If the system is hung, then restart the server. If there is a network failure, then do something else, right? And throw an alert or send an email alert to the users, right? So this is what a traditional rule-based system does, and it is basically static in nature.
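Just to make that concrete, here is a minimal sketch of what such a static, rule-based check might look like; the thresholds and action names are purely illustrative, not from any specific tool.

```python
# Minimal sketch of a static, rule-based remediation check.
# Thresholds and action names are illustrative only.

RULES = [
    # (condition, action)
    (lambda m: m["cpu_percent"] > 90,    "add_node"),
    (lambda m: m["is_hung"],             "restart_server"),
    (lambda m: m["network_errors"] > 0,  "send_alert_email"),
]

def apply_rules(metrics: dict) -> list[str]:
    """Return the fixed actions whose conditions match; the rules never adapt."""
    return [action for condition, action in RULES if condition(metrics)]

print(apply_rules({"cpu_percent": 95, "is_hung": False, "network_errors": 2}))
# -> ['add_node', 'send_alert_email']
```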
The main critical pain points we have seen over the period of time are, first, schema drift, right? In a typical data pipeline, just to oversimplify it, you have a source system and a target system. You want to get the data from the source to the target system. Now, you built your data pipelines on a particular schema, and then suddenly, over a period of time, the schema has changed for some reason, right? Obviously, your pipeline, your ETL transformations, whatever you've written, might break because of the schema drift. Suddenly the data pipeline is broken, you get an alert, and then you go and try to fix it, and rewrite that whole transformation with the new schema. This is called schema drift, right? So these are the pain points. Basically, it brings down your system, and then there is a loss of business hours.
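As a small, hypothetical illustration of what detecting schema drift can look like in code (the column names and types here are made up):

```python
# Hypothetical schema drift check: compare the schema the pipeline was built on
# with the schema the source is sending today.

EXPECTED_SCHEMA = {"order_id": "int", "amount": "float", "created_at": "timestamp"}

def detect_schema_drift(incoming_schema: dict) -> dict:
    """Return added, removed, and retyped columns relative to the expected schema."""
    added   = {c: t for c, t in incoming_schema.items() if c not in EXPECTED_SCHEMA}
    removed = {c: t for c, t in EXPECTED_SCHEMA.items() if c not in incoming_schema}
    retyped = {c: (EXPECTED_SCHEMA[c], t) for c, t in incoming_schema.items()
               if c in EXPECTED_SCHEMA and EXPECTED_SCHEMA[c] != t}
    return {"added": added, "removed": removed, "retyped": retyped}

# The 'amount' column changed type and a new 'channel' column appeared:
print(detect_schema_drift({"order_id": "int", "amount": "string", "channel": "string"}))
```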
Then another problem is resource contention. In the entire pipeline, you might have your Airflow DAGs working in the background, Glue jobs working in the background, or different virtual machines working in the background to process the data. And due to a sudden influx of data, the resources are not capable of handling that data, right? And the system goes down. So that can be one of the pain points, and that happens quite frequently, right?
Sometimes even your data warehouse tools may not be able to handle the queries: the ETL transformations, the inserts of data into the tables, or the data visualization queries coming in and hitting the database server. There can be so many issues there.
And then there are system anomalies, obviously, right? There are unseen errors, there are network failures happening. So these are the typical critical pain points of a typical data pipeline setup as of today.
The limitations we have already discussed, right? The limitation of the current approaches is that they are very rigid. You define a rule; you say if the CPU percentage is higher than some threshold, then you do something to fix it, correct? And if there is some deviation for which there is no rule, then you are not able to solve that particular problem.
Another problem is being unable to adapt to new failure patterns, right? There will constantly be new failures happening, and you might not have written rules to take care of them, right? Obviously, high maintenance, because there is manual intervention required, there is downtime required to do the rule updates, and obviously there is poor performance in some complex scenarios, right? Enterprise-scale data operations demand more intelligent and adaptive solutions, right?
And that's why there's a new paradigm shift, which is
like autonomous self-healing.
So what if the pipelines could self-heal themselves, right? That's a thought which we should think on. The paradigm shift is moving from reactive to proactive, right? Basically a policy-driven framework, and then, through intelligent prompt engineering, we can actually implement this autonomous self-healing, right?
So there are two or three concepts which you may use here. One is definitely called reinforcement learning. You can use machine learning for checking anomalies, you can have reinforcement learning, and you have these agents; think of them like your twins who can actually go and fix the problems automatically for you. So that is what this RL agent is all about. And then you can do prompt engineering. You can get this data, vectorize it, and then you can use agents, agentic AI, right? And then develop some flows and query the database in the backend, and obviously it'll give you the information about what actually went wrong in the systems, right?
So how do these RL agents learn this resilience, right? Basically, these RL agents, or reinforcement learning agents, observe the pipeline telemetry data. So what we do is gather the data of all the systems in that entire pipeline. The pipeline can be your orchestration tools; let's say I'm just taking the example of Apache Airflow, right? Or a Glue job in AWS, or a Lambda function in AWS, or any virtual machine, and how it is performing, right? So you can collect that telemetry data, and then you have your data lakes and your data warehouses. So for whatever systems are there in your entire pipeline, you can actually collect the metrics data, the telemetry data, right? What is the resource utilization, what are the error rates, and what is the system health? You can capture all this information, and repeatedly, through trial and error, try to discover the best possible way to handle that error scenario or that failure scenario. This is what a typical RL agent would do, right?
The agent basically learns to dynamically adjust the resource allocation. It can reconfigure the ETL workflows and apply corrective actions. If it has to, it can do schema remapping in case of schema drift. It can do some intelligent retries, like exponential backoff, something like that, and then apply some adaptive back pressure before failures cascade. So this is how this typical RL agent would work.
For the framework architecture, the definitely important thing is that you need to collect the telemetry data of your system, right? So historically, whatever telemetry data is there, you can collect. In AWS, for example, you will have your CloudWatch metrics and all that stuff, and then you can continuously stream these metrics. And then, using simulated fault data, you can develop comprehensive learning on top of this particular data, right? And then, once you get this particular data set, you can do your reinforcement learning agent training, right? Continuously, these agents will be looking at the metrics data. They'll identify the patterns, they'll identify the failure scenarios, and they'll try to fix them through autonomous response, right? Then, in real time, these agents can take corrective actions, including resource allocation, right? Spin up another node or drop one node, do some sort of workflow adjustments; they can do all that stuff. So this is a typical framework architecture.
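As a rough sketch of the telemetry collection piece, assuming an AWS setup with boto3 and the standard CloudWatch EC2 metrics (the instance ID and time window below are placeholders):

```python
# Sketch: pull CPU utilization telemetry from CloudWatch to feed the learning agent.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

def cpu_utilization(instance_id: str, hours: int = 24) -> list[dict]:
    """Fetch average CPU utilization datapoints for the last `hours` hours."""
    now = datetime.datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - datetime.timedelta(hours=hours),
        EndTime=now,
        Period=300,                 # one datapoint every 5 minutes
        Statistics=["Average"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

# datapoints = cpu_utilization("i-0123456789abcdef0")   # placeholder instance ID
```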
We already talked just now about prompt engineering; intelligent prompt engineering for pipelines, right? First things first, you can actually go and put a prompt onto the system, an anomaly detection prompt, right? Using a natural language query, you can say, hey, tell me, for the past 24 hours, what is the CPU utilization of my system? Or how many errors has this Airflow DAG been throwing for the past couple of hours? And then it can give you an anomaly based upon that, right? So you don't need to go to a dashboard and click around. You can just do some intelligent prompt engineering, and then you can get that information at your fingertips. After you get that information, you can also ask: hey, okay, these many machines have failed, can you tell me what is the root cause? And that prompt can actually trigger some agent which would go and read the logs, the traces, and the metrics, pinpoint exactly what happened with respect to the root cause, and give you contextual explanations based upon your anomaly detection prompts, right?
And then you can also instruct it, because now you know what the problem is. And then, being human, you have that intelligence power. You can just say, okay, go and spin up another instance, an EC2 instance, for example, or spin up another node, or spin up another worker node in Redshift or something like that, right? And the agent can go back and actually do that task for you. So definitely this prompt engineering is an added advantage. Now, in today's AI world, we can make use of these marvelous, amazing LLMs. We can vectorize the data in the database, build your quick agent flows on top of it, and you can actually automate this entire process.
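As an illustration of what such a prompt might look like, here is a small sketch; ask_llm is a stand-in for whichever LLM client or agent framework you actually use, and the DAG name and metrics text are hypothetical.

```python
# Illustrative anomaly-detection prompt built from recent pipeline telemetry.

def build_anomaly_prompt(dag_id: str, metrics_summary: str, hours: int = 24) -> str:
    return (
        f"You are a data pipeline operator. In the last {hours} hours, the Airflow "
        f"DAG '{dag_id}' produced the following metrics and errors:\n\n"
        f"{metrics_summary}\n\n"
        "Identify any anomalies, the most likely root cause, and a recommended "
        "corrective action (for example: retry task, scale workers, remap schema)."
    )

def ask_llm(prompt: str) -> str:
    """Placeholder: plug in your own LLM client or agent framework here."""
    raise NotImplementedError

# prompt = build_anomaly_prompt("orders_etl", "task load_orders failed 7 times: timeout")
# print(ask_llm(prompt))
```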
A typical process flow of a data pipeline would look like this, right? I'm just taking the example of an Oracle database; think of this as an external source, right? This is not part of your pipeline. And then you have DMS, the Database Migration Service in AWS, right? It constantly pulls the data from the database using CDC technology, and it has the schema information. Then the Airflow DAG kicks in, and it basically does the ETL: the pre-processing and post-processing of that particular data set. You can have Glue jobs as well, and then you put a monitoring layer on top of all these nodes in a typical data pipeline. After this, the data will be going to Redshift, and data lakes or data warehouses; I'm just using Redshift as an example, it can be Snowflake, anything, right? And then typically there will be a monitoring layer on top of all these applications, wherein it'll be collecting the data. And then there will be an anomaly detection layer, where you can use a typical machine learning algorithm, like the isolation forest algorithm, to do anomaly detection continuously on the monitoring data. And if you find any problem, then automatically the RL agents will kick in and do the self-healing actions, like restarting the server.
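For that anomaly detection layer, a minimal sketch with scikit-learn's Isolation Forest could look like this; the feature columns and numbers are made up purely for illustration.

```python
# Sketch: Isolation Forest anomaly detection over monitoring metrics.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [cpu_percent, memory_percent, task_latency_seconds] from the monitoring layer.
history = np.array([
    [35, 48, 12], [40, 52, 14], [38, 50, 11], [42, 55, 13],
    [37, 49, 12], [41, 53, 15], [39, 51, 12], [36, 47, 13],
])

model = IsolationForest(contamination=0.1, random_state=42).fit(history)

latest = np.array([[97, 91, 240]])        # sudden spike in CPU, memory, and latency
if model.predict(latest)[0] == -1:        # -1 means the point is flagged as an anomaly
    print("Anomaly detected: hand off to the RL agent for a self-healing action")
```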
And then giving the feedback. Once that RL agent does its work, it can give feedback that, oh, the system has been restarted and it has been working fine, or the system is still not working fine and is still having some trouble. So this is a continuous loop which it'll go into, right? This is called a feedback loop, right? So this is a typical flow of RL agent based self-healing data pipelines.
So what are the layers which will be used to implement this? One is the monitoring layer: as I said, continuously track the pipeline health, resource usage, and data quality across DMS and the entire pipeline. I'm just taking an example for you to understand the different components of a data pipeline in an AWS setup; for Azure, for GCP, or for your proprietary data pipelines, it can be different.
And then you have the analysis layer. You can use machine learning algorithms; as I said, the isolation forest algorithm is really great. It can actually figure out what exactly the anomaly is, what the resource bottlenecks are, what the data inconsistencies are, and it can trigger an alarm.
When the alarm is triggered, there is another layer, which will be the RL agent layer, which will evaluate the pipeline state and select optimal recovery actions. Examples are rescheduling the jobs, scaling the Glue resources, remapping the schemas, or retrying the DMS loads. And then you have your execution layer, wherein automation engines implement recovery actions across AWS services. So it'll be executing those decisions which your RL agent has been taking, right? Minimizing the downtime and the manual intervention required.
And then there's a feedback layer. Basically, after the action is executed, it also checks whether that action has been properly implemented, whether it is giving the desired results or not, and whether it has actually fixed the issue. And this feeds back to the RL agent for its continuous improvement, right?
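Putting those layers together, a deliberately simplified control loop might look like the sketch below; every function body is a placeholder, just to show how monitoring, analysis, the RL agent, execution, and feedback hand off to each other.

```python
# Simplified sketch of the monitoring -> analysis -> RL agent -> execution -> feedback loop.
from typing import Optional

def monitor() -> dict:
    """Monitoring layer: collect pipeline health, resource usage, data quality."""
    return {"cpu_percent": 95, "failed_tasks": 3}

def analyze(metrics: dict) -> Optional[str]:
    """Analysis layer (e.g. Isolation Forest): return an anomaly label or None."""
    return "resource_bottleneck" if metrics["cpu_percent"] > 90 else None

def select_action(anomaly: str) -> str:
    """RL agent layer: pick the recovery action with the best learned value."""
    return {"resource_bottleneck": "scale_glue_workers"}.get(anomaly, "retry_task")

def execute(action: str) -> bool:
    """Execution layer: apply the action via cloud APIs; report success."""
    print(f"executing: {action}")
    return True

def run_once() -> None:
    anomaly = analyze(monitor())
    if anomaly is None:
        return
    succeeded = execute(select_action(anomaly))
    # Feedback layer: reward the agent if the pipeline recovered, penalize it otherwise.
    reward = 1 if succeeded else -1
    print(f"feedback reward: {reward}")

run_once()   # in practice this loop runs continuously
```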
So we talked about this whole reinforcement learning, machine learning, and all. I just wanted to take an opportunity for the people who don't understand reinforcement learning, but probably do understand machine learning and AI and all that stuff. It's a kind of machine learning, but a little bit different, in the sense that in machine learning you have independent variables, your X variables, like all the attributes, and then you have a dependent variable, which is Y, the target variable. And your job is to predict what the target variable will be, right? But in reinforcement learning, that X which we're talking about, the independent variables, think of it as a state. That entire row, you can think of it as a state.
And that state consists of these metrics. For example, it'll say schema mismatch, right? What action can be taken to solve that particular problem? Or what are the metrics there? Is there network latency? Is it a mismatch of the attributes? Those kinds of data points will be there; the columns can be the error type, what stage it is in, what the latency is, right? These can be its state attributes. And then there will be one more thing: state, action, reward, and next state, which you can think of as a tuple, right? Like in Python.
So the action is what you take on this particular state, right? The state is, say, there's a timeout in the transformation, right? So the system is down. So what action would we like to take? Probably we want to retry, right? So that is the action which you'll be taking, for example, restarting that task, right? And then, once you saw the state and took an action, there is a reward system. The reward is basically how good a job you did, right? Think of it as your professor giving you extra marks for actually doing a good job and punishing you for not doing a good job. So this is very critical, and I'll explain in a bit why the reward is very important.
And then there is a next state. So once you saw a state, took an action, and got a reward, obviously it'll transition to a next state. So this whole state, action, reward, next state quadruple is the key of reinforcement learning. This is the data structure which it'll be using, right?
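In code, that data structure is simply a (state, action, reward, next state) tuple; here is a tiny illustrative example using a timeout-and-retry scenario like the one discussed below.

```python
# The core RL data structure: a (state, action, reward, next_state) transition.
from collections import namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

t = Transition(
    state={"error_type": "timeout", "stage": "transformation", "latency_s": 2},
    action="retry_task",
    reward=0,                                   # it timed out again, so no reward
    next_state={"error_type": "timeout", "stage": "transformation", "latency_s": 2},
)
print(t.action, t.reward)
```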
I just gave an example of the data, right? The state can have error type, action taken, stage, and latency, right? So these are the attributes which will be going into the reinforcement learning model, and then you specify the rewards. You say that if the next state is healthy and you did this, you give it a plus one. If the next state is not healthy, you probably punish it; you basically say that it has not done a good job, and you give it a minus one. In the same way, say there's a timeout where the stage is the transformation, the latency is two seconds, and the action taken is a retry. The reward is zero, right? Because of the timeout; it timed out again. So you are basically punishing the action you have taken, right?
These values are nothing but what are called Q values, and this is what drives everything in reinforcement learning. It basically looks at those Q values; it maintains a Q table. And for each and every action, it actually looks at that particular Q table and applies the rewards, right? And it's a continuous process. The goal of any reinforcement learning program, or agent, is: given a state, what action should I take to maximize my long-term reward? So the more the rewards, the more the reinforcement learning agent understands that, if I do it this way, I'll be rewarded, and this is the right way to solve that particular problem.
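As a minimal sketch of what maintaining and updating such a Q table can look like, here is a tiny tabular Q-learning example; the actions, states, and hyperparameters are illustrative, not the exact setup described in the talk.

```python
# Tiny tabular Q-learning sketch: the agent keeps a Q table of (state, action) values
# and nudges them toward the reward it receives plus the discounted future value.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9                      # learning rate, discount factor
ACTIONS = ["retry_task", "restart_server", "scale_workers", "remap_schema"]
q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def best_action(state) -> str:
    """Given a state, pick the action with the highest learned long-term value."""
    return max(q_table[state], key=q_table[state].get)

def update(state, action: str, reward: float, next_state) -> None:
    """Standard Q-learning update rule."""
    future = max(q_table[next_state].values())
    q_table[state][action] += ALPHA * (reward + GAMMA * future - q_table[state][action])

s, s_next = ("schema_mismatch", "load"), ("healthy", "load")
update(s, "remap_schema", reward=+1, next_state=s_next)
print(best_action(s))   # 'remap_schema' starts to dominate as rewards accumulate
```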
Right, now, what are the key capabilities? Real-time monitoring: continuous pipeline performance will be tracked. Dynamic resource management can be done, right? It can intelligently reallocate computational resources based on current demand. ETL workflow optimizations can be done, right? If the ETL DAG is taking a lot of time, or it is not working up to the mark, then probably your ETL processes can be adjusted.
It can even do schema remapping. How does schema remapping work? It can crawl over the source schema and figure out what the schema is, then apply those schemas back to your ETL using some LLMs in the background. It automatically remaps the schema, and then through CI/CD it automatically deploys that particular solution to the server. How cool is that, right?
So these are the things which the agent will be doing once it figures out that this is the problem statement. There are also advanced response mechanisms, like targeted retries with exponential backoff, right? So what is exponential backoff? You want to retry, but if you are continuously retrying after every 30 seconds, you're not doing any good, because you're choking the whole pipeline. Obviously the target system is not up, and you're still making calls and bombarding it with requests; you're choking it with your requests, right? So the ideal thing is to have exponential backoff, right? You tried after 30 seconds; now maybe after two minutes you try, maybe after 10 minutes you try. This way you are not choking that entire network with your requests to the server, right?
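A simple sketch of a retry with exponential backoff; the base delay, attempt count, and the load_into_warehouse task in the usage comment are hypothetical.

```python
# Sketch: retry with exponential backoff so repeated retries don't choke the pipeline.
import time

def retry_with_backoff(task, max_attempts: int = 5, base_delay_s: float = 30.0):
    """Run `task`, doubling the wait between attempts instead of hammering the target."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay_s * (2 ** attempt)   # 30s, 60s, 120s, 240s, ...
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# retry_with_backoff(lambda: load_into_warehouse(batch))   # hypothetical task
```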
Dynamic flow control can also be done to prevent system overload, which is like throttling your data ingestion; this is called adaptive back pressure. You can do resource reallocation automatically, like automatic scaling, and then you can do some workflow adjustments, right? For deployment, you can use Kubernetes-based deployment or ECS; it's your wish. You can containerize this particular application and then deploy it on any cloud platform. Not a big problem there.
Evaluation: how do we evaluate this methodology, right? For testing conditions, the best thing is, first of all, you need to figure out how to gather the data and simulate the whole range of error scenarios, good scenarios, and bad scenarios. Probably you should create another environment, and with simulated data you can create these situations and then evaluate whether your RL agents are working appropriately or not, right?
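One way to sketch such a simulation environment is a small fault-injection wrapper around a pipeline step; the fault names and the fault rate here are illustrative.

```python
# Sketch: inject simulated faults into a pipeline step so the RL agent can be
# evaluated against error scenarios as well as good ones.
import random

FAULTS = ["schema_drift", "timeout", "resource_exhaustion", "network_failure"]

def with_injected_faults(step, fault_rate: float = 0.3):
    """Wrap `step` so it raises a simulated fault with probability `fault_rate`."""
    def wrapped(*args, **kwargs):
        if random.random() < fault_rate:
            raise RuntimeError(f"simulated fault: {random.choice(FAULTS)}")
        return step(*args, **kwargs)
    return wrapped

# noisy_load = with_injected_faults(load_into_warehouse)   # hypothetical pipeline step
```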
These are the key metrics we have seen over the period of time. By doing this, response times improved around three x, service level objectives are around 94 percent, and mean time to recovery is drastically reduced. As for system benefits, obviously, we are moving from reactive to proactive data pipeline management with learning-optimized strategies.
Autonomous self-management, less human intervention, and enterprise-scale capability, right? It can scale to any magnitude of data pipelines you have set up. And there is continuous policy evolution. At the end of the day, RL agents come up with a policy, right? And it never stops. If a new pattern comes up, again it reads it and tries to solve that particular problem. This never stops. So it has this continuous policy evolution strategy.
If you want to build a self-healing pipeline, what do you need to do, right? Basically, deploy comprehensive observability across your pipeline stack: metrics, logs, and traces, which should be used to feed that RL agent. You can create a simulation environment, right? Build a framework that mirrors production, as I told you. And then define recovery actions: you need to know that if this is the problem, this can happen, so you need to have your actions, what actions should be taken, because again, this has to be put into that reward system, remember? That's why, right? And deploy and iterate slowly; don't go big bang, right? Probably you just try to solve one problem at a time, and then add one more problem, and then one more, rather than going with a big bang approach. Probably you're just looking at schema drift for the time being, or probably you're looking at resource contention, and then you can add network latency and stuff like that.
So what is the future? Intelligent and adaptive integration. Self-healing pipelines represent a fundamental shift in how we architect data infrastructure. By combining RL with prompt-driven intelligence, we create systems that don't just fail gracefully; they learn, adapt, and evolve. So basically, we are shifting from reactive troubleshooting to proactive, policy-driven resolution.
And yeah, thank you so much. Thanks for listening to the presentation, and you can reach out to me on LinkedIn if you have any questions or anything about it. Thank you so much.