Transcript
This transcript was autogenerated. To make changes, submit a PR.
For many of us data engineers, ETL has been the backbone of our work for decades.
You extract the data, transform it, and load it.
We have built pipelines, batch jobs, and data warehouses, but what I've seen is that data engineering isn't just about moving data anymore.
We are being asked to deliver data faster, more reliably, at bigger scale, and smarter too.
It's not enough to simply pipeline data into a warehouse; it has to be quality data.
What's happening now is that AI and machine learning are no longer just downstream consumers of data.
They're moving upstream into our engineering process itself.
AI is becoming embedded inside our pipelines, detecting anomalies, generating transformation logic, auto-scaling infrastructure, and automating data quality checks.
Suddenly, we weren't spending days combing through bad data.
We were getting real-time alerts with anomaly explanations, which freed us up to focus on deeper improvements rather than constantly firefighting data errors.
Before we go deeper into this topic, let me quickly introduce myself and
why this journey into AI powered data engineering has been so meaningful for me.
I am Jonathan, and I've been working as a lead data engineer and solutions architect for over a decade, with hands-on experience building and modernizing data platforms at organizations like Capital One, FINRA, and FMCSA, which is an entity of the DOT, the Department of Transportation.
At Capital One, I was part of a major modernization effort, moving away from legacy mainframe systems to a microservice-based cloud platform supporting credit card applications like acquisitions, approvals, and risk decisions.
This gave me firsthand exposure to transforming complex nested Avro data into clean, analytics-ready data models while having to deal with schema drift, real-time data ingestion, and data quality at scale.
And at FINRA, I worked with one of the largest data pipelines I have ever seen, ingesting over 250 billion daily trade events across the US stock market.
We weren't just processing huge volumes of data. We had strict regulatory timelines and needed zero data loss, high accuracy, and fast availability.
That's where I saw AI and automation step in, not just as an add-on for analysis, but as a core part of the engineering process, helping us monitor data quality, detect anomalies in ingestion, and auto-tune processing pipelines.
And most recently, at DOT FMCSA, I led a data initiative focused on safety and compliance reporting for commercial motor carriers.
When you are supporting public safety decisions, you can't afford manual errors or delays in the data pipeline.
Here we had to deliver inspection and violation reports for all 50 states, based on millions of records per month from various inspection systems.
We implemented tools like PyDeequ for automated data quality checks, and even explored using ML models to flag suspicious inspection records in real time.
What I've learned across these roles is that data engineering is no longer just about pipelines and warehouses.
It's about reliability, scalability, and, increasingly, about intelligence.
Let's talk about the reality of data engineering today and why AI has become such a critical piece of the conversation.
When I first saw these numbers, the reality hit home: 68% of data engineers' time is spent on maintenance rather than innovation, four and a half hours a day are spent debugging and fixing pipeline issues, and 23% of data quality problems aren't discovered until they are already in production.
Honestly, that feels about right, because I've been in those trenches.
A lot of that time wasn't spent building new features or optimizing for business outcomes.
It was spent answering questions like: why did this pipeline fail last night? Is this a schema issue or a data quality issue? Did something change upstream that no one told us about?
And when a pipeline breaks, the pressure builds fast.
For decades, ETL has followed a very traditional, very rule-based model.
You extract the data from source systems, you map it into a target schema, you transform it with predefined rules, and then you load it into a data warehouse.
Sounds simple enough, until you are dealing with hundreds of source systems, constantly evolving schemas, dirty data, and complex transformation logic.
Here is what the traditional ETL world looks like.
Time-consuming manual mapping between complex data schemas.
Very rigid, rule-based transformations that struggle to adapt to change, transformations that are hardcoded or predefined logic: if X, then Y, else Z. But what happens when upstream data changes? New columns, different formats. These rigid rules break and require manual interventions to patch them (I'll show a small sketch of this in a moment).
Reactive error handling: in the traditional world, you don't find out about errors until they have already broken your pipeline or corrupted your downstream data. Either way, you are reacting after the damage is done.
And static performance tuning: let's not forget the hours spent manually tuning Spark configurations or query parameters to try and optimize runtime for a specific dataset or workload.
Once tuned, it's static.
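To make that brittleness concrete, here is a tiny, hypothetical sketch of a hardcoded transformation; the column names and rules are invented for illustration, and the point is simply that everything is pinned to today's source schema.

```python
# A rigid, hardcoded transformation: every field name and rule assumes
# the current source schema. A renamed column, a new segment code, or a
# missing field means manual intervention to patch the logic.
def transform_row(row: dict) -> dict:
    if row["cust_type"] == "P":
        segment = "personal"
    elif row["cust_type"] == "B":
        segment = "business"
    else:
        segment = "unknown"
    return {
        "customer_id": row["cust_id"],
        "segment": segment,
        "balance_usd": float(row["bal"]),  # breaks if "bal" is renamed or arrives blank
    }
```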
Now compare that with the AI-powered ETL approach.
Automatic schema discovery and intelligent, hands-free mapping: instead of developers manually mapping every field, AI algorithms can scan source data, infer data types, and recommend or auto-map fields to the target schema, even for complex nested JSON or evolving source schemas.
Adaptive transformation, powered by machine learning rather than static rules: AI models can learn transformation patterns from the data itself.
Proactive error detection and resolution:
AI doesn't wait until the pipeline fails at runtime.
It proactively analyzes incoming data before execution, predicts
potential errors and can either auto resolve or flag issues.
It's a shift from reactive to predictive pipeline management.
And self-optimizing, real-time execution, perhaps the most exciting of all: AI can monitor job execution in real time, automatically tuning configurations like partitioning, parallelism, and memory allocation based on current workload and resource conditions, so you are always operating near peak efficiency without manual tuning.
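As one concrete, already-available flavor of this idea, here is a minimal sketch that leans on Spark's adaptive query execution, which re-optimizes partitioning and skewed joins at runtime; the application name is a placeholder, and this is an illustration rather than the exact setup from any of the projects I mentioned.

```python
from pyspark.sql import SparkSession

# Adaptive query execution (AQE) lets Spark re-plan at runtime:
# it coalesces small shuffle partitions and splits skewed ones based
# on statistics gathered while the job is actually running.
spark = (
    SparkSession.builder
    .appName("self-tuning-etl-sketch")  # placeholder name
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```

AQE only covers query-level tuning; the broader AI-driven tuning I'm describing extends the same feedback-loop idea to cluster sizing and memory allocation.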
So beyond just the technical improvements, let's talk about the strategic impact AI brings to ETL and data engineering overall.
First, self-optimizing, real-time execution means our pipelines adjust on the fly, whether it's handling sudden data volume spikes or adapting to infrastructure bottlenecks.
No more endless manual tuning or constant babysitting.
But more importantly, it redirects engineering effort.
Instead of spending our days fixing failed jobs or remapping schemas, engineers can focus on innovation, new features, and high-value work. And the numbers speak for themselves.
An 85% reduction in manual schema mapping, which speeds up project timelines dramatically.
92% of data anomalies are intercepted proactively before they ever reach production reports.
And a 43% reduction in pipeline maintenance cost, freeing up both budget and bandwidth.
When I think about my experience at Capital One, where even small data errors could ripple into complex risks, having AI proactively catch issues before they surface isn't just a time saver. It's a risk reducer.
This isn't just about automation.
It's about elevating the entire role of data engineering into a
more strategic, proactive function.
Let's look at a few of the tools that make this possible. First, AWS Glue, a serverless ETL platform that uses machine learning to automatically detect data schemas and generate mappings.
This drastically reduces the need for manual coding, accelerating the time it takes to integrate and prepare data.
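To give a flavor of what that looks like in practice, here is a minimal Glue job sketch; the database, table, and field names are placeholders I made up, and the mapping shown is the kind of thing Glue can propose from the schemas its crawlers infer.

```python
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read using the schema a Glue crawler inferred into the Data Catalog;
# "inspections_db" and "raw_inspections" are placeholder names.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="inspections_db", table_name="raw_inspections"
)

# Map the crawler-discovered fields onto the target schema.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("inspection_id", "string", "inspection_id", "string"),
        ("insp_date", "string", "inspection_date", "date"),
        ("viol_cnt", "long", "violation_count", "int"),
    ],
)
```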
Next, Databricks Auto Loader, which brings intelligent data ingestion with built-in schema evolution handling.
This tool ensures that changing data structures don't break your pipelines, by dynamically adjusting to new fields or formats as they arrive.
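A minimal sketch of what that looks like on Databricks, assuming the built-in spark session and made-up storage paths; the cloudFiles options are Auto Loader's documented settings for schema inference and evolution.

```python
# Auto Loader infers the schema on the first run, tracks it at
# schemaLocation, and with "addNewColumns" it evolves the schema
# instead of failing when new fields appear in the source files.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/inspections/_schema")  # placeholder path
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/mnt/raw/inspections/")  # placeholder path
)

(
    stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/inspections")
    .option("mergeSchema", "true")  # let the Delta sink pick up evolved columns
    .trigger(availableNow=True)
    .toTable("bronze.inspections")  # placeholder table
)
```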
And finally, anomaly detection systems.
These leverage machine learning to continuously scan data
flows for unusual patterns.
They proactively identify anomalies and potential issues before they impact
business decisions or downstream systems.
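As a toy illustration of the underlying idea rather than any specific product, you could train an isolation forest on simple per-run pipeline metrics; the features and numbers below are invented.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# One row per pipeline run: [row_count, null_rate, avg_latency_ms].
# These features and values are illustrative only.
history = np.array([
    [1_000_000, 0.01, 420],
    [1_020_000, 0.01, 410],
    [  990_000, 0.02, 450],
    [1_010_000, 0.01, 430],
])

model = IsolationForest(contamination=0.1, random_state=42).fit(history)

# A run with a collapsed row count and a spiking null rate should be
# flagged (-1 = anomaly, 1 = normal) before it reaches downstream reports.
todays_run = np.array([[250_000, 0.35, 900]])
print(model.predict(todays_run))
```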
If you look at the chart, you'll see dramatic improvements across four key areas.
Query creation time was cut by more than half.
Debugging time dropped from nearly 90 minutes to about 30 minutes.
Error rates also dropped significantly, and performance tuning time was slashed from two hours to under an hour.
Overall, firms are seeing around a 64% productivity increase.
This demonstrates how AI not only speeds up routine tasks but also improves accuracy and performance in SQL development workflows, and in transformation development workflows more broadly.
This slide shows how AI is transforming pipeline maintenance with self-healing data pipelines.
It starts by automatically detecting anomalies: machine learning models flag unusual data patterns before they snowball into bigger issues.
Then the system moves to diagnose the issue, using root cause analysis to classify what went wrong.
Once the cause is identified, it moves into the apply-fix stage, pulling an automated correction from a prebuilt library of solutions.
No manual intervention required.
Finally, it enters a learn-and-improve stage, where the solution is recorded, so the system keeps getting smarter and faster at fixing similar issues in the future.
This closed-loop approach reduces downtime, boosts reliability, and shifts engineers away from firefighting towards strategic work.
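Here is a deliberately simplified, hypothetical sketch of that detect, diagnose, fix, and learn loop; the root-cause heuristic, the fix library, and every name in it are invented purely to show the shape of the closed loop.

```python
from typing import Callable, Dict, List

# Hypothetical prebuilt library mapping a diagnosed root cause to an
# automated correction (here the "fix" just records an action).
FIX_LIBRARY: Dict[str, Callable[[dict], None]] = {
    "schema_drift": lambda run: run["actions"].append("re-map new columns"),
    "null_spike": lambda run: run["actions"].append("quarantine bad partition"),
}

KNOWLEDGE_BASE: List[dict] = []  # grows over time: the "learn and improve" step


def self_heal(flagged_run: dict) -> dict:
    # 1. Detect: assume an anomaly detector has already flagged this run.
    # 2. Diagnose: classify the root cause (a stand-in heuristic here;
    #    a real system would use a trained classifier or rule engine).
    cause = "null_spike" if flagged_run["null_rate"] > 0.2 else "schema_drift"

    # 3. Apply fix: pull the automated correction from the prebuilt library.
    flagged_run.setdefault("actions", [])
    FIX_LIBRARY[cause](flagged_run)

    # 4. Learn and improve: record what happened so similar issues are
    #    recognized and resolved faster next time.
    KNOWLEDGE_BASE.append({"cause": cause, "actions": list(flagged_run["actions"])})
    return flagged_run


print(self_heal({"run_id": "2024-05-01", "null_rate": 0.35}))
```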
We are looking at quality validation frameworks, which are critical for maintaining trust in the data pipelines, especially as we scale with AI.
It starts with defining expectations: setting rules either based on domain expertise or even by automatically discovering patterns from the data itself.
Then we test continuously, not just at the end, but validating data at every stage, like ingestion, transformation, and pre-publish.
If something fails those checks, we don't have to jump in manually. Instead, an auto-remediate layer kicks in, applying machine-learning-recommended fixes for common issues.
And finally, we monitor metrics over time to track both data quality scores and how effective our fixes really are.
This process shifts us from reactive firefighting to proactive, continuous data quality assurance, something that's incredibly valuable in high-stakes environments like finance or government.
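To make the "define expectations" step concrete, here is a minimal sketch using PyDeequ, roughly the kind of check the tools I mentioned earlier automate; assume a Spark session and a dataframe called inspections_df already exist, and treat the column names and the truncated state list as placeholders.

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Define expectations: rules based on domain knowledge.
# Column names and the (shortened) state list are illustrative.
check = (
    Check(spark, CheckLevel.Error, "inspection record checks")
    .isComplete("inspection_id")       # no missing IDs
    .isUnique("inspection_id")         # no duplicate inspections
    .isNonNegative("violation_count")  # counts can't be negative
    .isContainedIn("state", ["AL", "AK", "AZ"])  # truncated for the example
)

# Test continuously: run the suite against each batch before publishing it.
result = (
    VerificationSuite(spark)
    .onData(inspections_df)
    .addCheck(check)
    .run()
)

# Monitor metrics: these results can feed a data quality dashboard over time.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```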
Let's talk about where we are and where we are heading.
This slide lays out an industry adoption timeline for AI-powered data engineering.
In 2023 to 2024, we have seen AI SQL assistants and automated schema inference becoming mainstream tools in many organizations.
Moving into 2024 and 2025, we're on track for self-healing pipelines with automated anomaly detection and correction, reaching about 60% enterprise adoption.
By 2025 and 2026, the industry shifts towards self-optimizing systems, where infrastructure dynamically scales and allocates resources based on business priorities.
And looking ahead into 2026 and 2027, the big milestone is fully autonomous data platforms, systems that monitor, optimize, and remediate with minimal human oversight.
The key takeaway: AI in data engineering isn't theoretical anymore.
It's a fast moving wave that's fundamentally changing how we design,
operate, and scale data platforms.
Now that we have seen the potential of AI in data engineering,
let's talk about how to actually implement it in a structured way.
This framework has four key steps.
First, assess opportunities.
Look for the repetitive, high-effort tasks in your workflows.
This could be manual schema mapping, tedious data validation, or repetitive debugging.
Next, start small.
Don't try to automate everything at once.
Pick one critical pipeline or process and test automation there.
Third, measure impact.
Track time saved, improvements in data quality, and even team satisfaction to prove the value.
And finally, once you have shown success, scale gradually: expand what works to more pipelines and other parts of your data platform, applying the lessons learned from your existing pipelines.
This approach helps reduce risk while ensuring adoption is sustainable and measurable.
The key takeaways of this presentation.
The first one is that AI is transforming ETL from manual to autonomous: we are seeing up to an 85% reduction in schema mapping effort and a 43% cut in maintenance cost.
Second, self-healing pipelines proactively catch 92% of anomalies before they cause downstream issues.
That's a huge boost in data quality and trust.
Third, start small, but start today.
Pick one high-effort, repetitive process to automate for quick wins and momentum.
And finally, free your team for strategic work.
Every hour saved on maintenance is an hour unlocked for innovation and business value.
Thank you so much.
Excited to see how we all move forward in this AI-driven future of data engineering.
Thank you so much.