Conf42 Machine Learning 2025 - Online


Beyond ETL: How AI Is Revolutionizing Data Engineering Workflows


Abstract

Discover how AI transforms data engineering from manual drudgery to strategic advantage. Learn how leading companies use intelligent automation to slash pipeline costs, prevent errors, and free teams for high-value work. Get practical frameworks to implement these game-changers today.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
For many data engineers, ETL has been the backbone of the job for decades. You extract the data, transform it, and load it. We have built pipelines, batch jobs, and data warehouses, but what I've seen is that data engineering isn't just about moving data anymore. We are being asked to deliver data faster, more reliably, at bigger scale, and smarter too. It's not enough to simply pipeline data into a warehouse; it has to be quality data too. What's happening now is that AI and machine learning are no longer just downstream consumers of data. They are moving upstream into our engineering process itself. AI is becoming embedded inside our pipelines: detecting anomalies, generating transformation logic, auto-scaling infrastructure, and automating data quality checks. Suddenly we aren't spending days combing through bad data. We are getting real-time alerts with anomaly explanations that free us up to focus on deeper improvements rather than constantly firefighting data errors.

Before we go deeper into this topic, let me quickly introduce myself and why this journey into AI-powered data engineering has been so meaningful for me. I am Janardhan, and I've been working as a lead data engineer and solutions architect for over a decade, with hands-on experience building and modernizing data platforms at organizations like Capital One, FINRA, and FMCSA, which is an entity of the DOT, the Department of Transportation.

At Capital One, I was part of a major modernization effort moving away from legacy mainframe systems to a microservice-based cloud platform supporting credit card applications like acquisitions, approvals, and risk decisions. This gave me firsthand exposure to transforming complex nested Avro data into clean, analytics-ready data marts while having to deal with schema drift, real-time data ingestion, and data quality at scale.

At FINRA, I worked with one of the largest data pipelines I have ever seen, ingesting over 250 billion daily trade events across the US stock market. We weren't just processing huge volumes of data; we had strict regulatory timelines and needed zero data loss, high accuracy, and fast availability. That's where I saw AI and automation step in, not just as an add-on for analysis, but as a core part of the engineering process, helping us monitor data quality, detect anomalies in ingestion, and auto-tune processing pipelines.

And most recently at DOT FMCSA, I led a data initiative focused on safety and compliance reporting for commercial motor carriers. When you are supporting public safety decisions, you can't afford manual errors or delays in the data pipeline. Here we had to deliver inspection and violation reports for all 50 states based on millions of records per month from various inspection systems. We implemented tools like PyDeequ for automated data quality checks, and even explored using ML models to flag suspicious inspection records in real time.

What I've learned across these roles is that data engineering is no longer just about pipelines and warehouses. It's about reliability, scalability, and increasingly about intelligence.

Let's talk about the reality of data engineering today and why AI has become such a critical piece of the conversation. When I first saw these numbers, the reality hit home: 68% of engineers' time is spent on maintenance rather than innovation.
And four and a half hours a day are spent on debugging and fixing pipeline issues. And 23% of data quality problems aren't discovered until they are already in production. Honestly, that feels about right, because I've been in those trenches. A lot of time wasn't spent building new features or optimizing for business outcomes. It was spent answering questions like: why did this pipeline fail last night? Is this a schema issue or a data quality issue? Did something change upstream that no one told us about? And when a pipeline breaks, the pressure builds fast.

For decades, ETL has followed a very traditional, rule-based model. You extract the data from source systems, you map it into a target schema, you transform it with predefined rules, and then you load it into a data warehouse. Sounds simple enough, until you are dealing with hundreds of source systems, constantly evolving schemas, dirty data, and complex transformation logic.

Here is what the traditional ETL world looks like. Time-consuming manual mapping between complex data schemas. Rigid, rule-based transformations that struggle to adapt to change, transformations that are hardcoded, predefined logic: if X, then Y, else Z. But what happens when upstream data changes? New columns, different formats. These rigid rules break and require manual intervention to patch them. Reactive error handling: in the traditional world, you don't find out about errors until they have already broken your pipeline or corrupted your downstream data. Either way, you are reacting after the damage is done. And static performance tuning: let's not forget the hours spent manually tuning Spark configurations or query parameters to try and optimize runtime for a specific data set or workload. Once tuned, it stays static.

Now compare that with the AI-powered ETL approach. Automatic schema discovery and intelligent, hands-free mapping: instead of developers manually mapping every field, AI algorithms can scan source data, infer data types, and recommend or auto-map fields to the target schema, even for complex nested JSON or evolving source schemas. Adaptive transformations powered by machine learning: rather than static rules, AI models can learn transformation patterns from the data. Proactive error detection and resolution: AI doesn't wait until the pipeline fails at runtime. It proactively analyzes incoming data before execution, predicts potential errors, and can either auto-resolve or flag issues. It's a shift from reactive to predictive pipeline management. And self-optimizing, real-time execution: perhaps most exciting, AI can monitor job execution in real time, automatically tuning configurations like partitioning, parallelism, and memory allocation based on current workload and resource conditions, so you are always operating near peak efficiency without manual tuning.

So beyond just the technical improvements, let's talk about the strategic impact AI brings to ETL and data engineering overall. First, self-optimizing, real-time execution means our pipelines adjust on the fly, whether that's handling sudden data volume spikes or adapting to infrastructure bottlenecks. No more endless manual tuning or constant babysitting. But more importantly, it redirects engineering effort.
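To make the "hands-free mapping" idea above a bit more concrete, here is a minimal PySpark sketch of automated schema discovery plus name-based field mapping. It is illustrative only: the S3 path, the target column names, and the 0.8 similarity cutoff are assumptions, not details from the talk, and a real system would use much richer signals than column-name similarity.

```python
# Minimal sketch: infer a source schema and propose mappings to a target schema.
# The path, target columns, and the 0.8 similarity cutoff are illustrative assumptions.
from difflib import get_close_matches

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-auto-map").getOrCreate()

# 1. Automatic schema discovery: Spark infers field names and types from the raw JSON.
source_df = spark.read.json("s3://example-bucket/raw/orders/")

# 2. Intelligent mapping: match each inferred source field to the target schema by name.
target_columns = ["order_id", "customer_id", "order_ts", "total_amount"]
mapping = {}
for col in source_df.columns:
    match = get_close_matches(col.lower(), target_columns, n=1, cutoff=0.8)
    if match:
        mapping[col] = match[0]

# 3. Apply the proposed mapping; anything unmatched is surfaced for human review.
mapped_df = source_df.select([source_df[src].alias(dst) for src, dst in mapping.items()])
unmapped = sorted(set(source_df.columns) - set(mapping))
print(f"Auto-mapped {len(mapping)} fields; needs review: {unmapped}")
```

Even a simple heuristic like this shifts the engineer's job from typing out every mapping by hand to reviewing the handful of fields the tool could not match.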
Instead of spending our days fixing failed jobs or remapping schemas, engineers can focus on innovation, new features, and higher-value work. And the numbers speak for themselves: an 85% reduction in manual schema mapping, which speeds up project timelines dramatically; 92% of data anomalies intercepted proactively before they ever reach production reports; and a 43% reduction in pipeline maintenance cost, freeing up both budget and bandwidth. When I think about my experience at Capital One, where even small data errors could ripple into complex risks, having AI proactively catch issues before they surface isn't just a time saver. It's a risk reducer. This isn't just about automation. It's about elevating the entire role of data engineering into a more strategic, proactive function.

First among the tools making this possible: AWS Glue, a serverless ETL platform that uses machine learning to automatically detect data schemas and generate mappings. This drastically reduces the need for manual coding, accelerating the time it takes to integrate and prepare data. Next, Databricks Auto Loader, which brings intelligent data ingestion with built-in schema evolution handling. This tool ensures that changing data structures don't break your pipelines, dynamically adjusting to new fields or formats as they arrive. And finally, anomaly detection systems. These leverage machine learning to continuously scan data flows for unusual patterns. They proactively identify anomalies and potential issues before they impact business decisions or downstream systems.

If you look at the chart, you'll see dramatic improvements across four key areas. Query creation time was cut by more than half. Debugging time dropped from nearly 90 minutes to about 30 minutes. Error rates also fell significantly, and performance tuning time was slashed from two hours to under an hour. Overall, firms are seeing roughly a 64% productivity increase. This demonstrates how AI not only speeds up routine tasks but also improves accuracy and performance in SQL development workflows, and transformation development workflows more broadly.

This slide shows how AI is transforming pipeline maintenance with self-healing data pipelines. It starts by automatically detecting anomalies: machine learning models flag unusual data patterns before they snowball into bigger issues. Then the system moves to diagnose the issue, using root cause analysis to classify what went wrong. Once the cause is identified, it moves into the apply-fix stage, pulling an automated correction from a prebuilt library of solutions, with no manual intervention required. Finally, it enters a learn-and-improve phase, where the solution is recorded so the system keeps getting smarter and faster at fixing similar issues in the future. This closed-loop approach reduces downtime, boosts reliability, and shifts engineers away from firefighting toward strategic work.

Next, we are looking at quality validation frameworks, which are critical for maintaining trust in our data pipelines, especially as we scale with AI. It starts with defining expectations: setting rules based on domain expertise, or even automatically discovering patterns from the data itself. Then we test continuously, not just at the end, but validating data at every stage: ingestion, transformation, and pre-publish. If something fails those checks, we don't have to jump in manually.
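As a rough, framework-free sketch of what "defining expectations and testing at every stage" could look like, here is a small PySpark example. The column names, the expected row-count band, and the stage labels are assumptions made purely for illustration; dedicated libraries such as Deequ or Great Expectations provide the same idea with far more depth.

```python
# Minimal sketch of stage-level data quality expectations.
# Column names, thresholds, and stage labels are illustrative assumptions.
from pyspark.sql import DataFrame, functions as F


def run_expectations(df: DataFrame, stage: str) -> list:
    """Return a list of human-readable failures for one pipeline stage."""
    failures = []
    total = df.count()

    # Expectation 1: the key column must never be null.
    null_keys = df.filter(F.col("order_id").isNull()).count()
    if null_keys > 0:
        failures.append(f"[{stage}] order_id has {null_keys} null values")

    # Expectation 2: amounts must be non-negative.
    bad_amounts = df.filter(F.col("total_amount") < 0).count()
    if bad_amounts > 0:
        failures.append(f"[{stage}] {bad_amounts} rows with negative total_amount")

    # Expectation 3: row volume should stay within an expected daily band.
    if not (10_000 <= total <= 5_000_000):
        failures.append(f"[{stage}] unexpected row count: {total}")

    return failures


# Run the same checks at every stage instead of only at the end of the pipeline, e.g.:
# issues = run_expectations(ingested_df, "ingestion") + run_expectations(transformed_df, "transformation")
```

The failures a check like this surfaces at each stage are exactly the kind of input the auto-remediation layer described next can act on.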
When a check fails, instead of a manual fix, an auto-remediation layer kicks in, applying machine-learning-recommended fixes for common issues. And finally, we monitor metrics over time to track both data quality scores and how effective our fixes really are. This process shifts us from reactive firefighting to continuous data quality assurance, something that's incredibly valuable in high-stakes environments like finance or government.

Let's talk about where we are and where we are heading. This slide lays out an industry adoption timeline for AI-powered data engineering. In 2023 to 2024, we have seen AI SQL assistants and automated schema inference becoming mainstream tools in many organizations. Moving into 2024 and 2025, we're on track for self-healing pipelines, with automated anomaly detection and correction reaching about 60% enterprise adoption. By 2025 and 2026, the industry shifts toward self-optimizing systems, where infrastructure dynamically scales and allocates resources based on business priorities. And looking ahead into 2026 and 2027, the big milestone is fully autonomous data platforms: systems that monitor, optimize, and remediate themselves with minimal human oversight. The key takeaway: AI in data engineering isn't theoretical anymore. It's a fast-moving wave that's fundamentally changing how we design, operate, and scale data platforms.

Now that we have seen the potential of AI in data engineering, let's talk about how to actually implement it in a structured way. This framework has four key steps. First, assess opportunities: look for the repetitive, high-effort tasks in your workflows. This could be manual schema mapping, tedious data validation, or repetitive debugging. Next, start small: don't try to automate everything at once. Pick one critical pipeline or process and test automation there. Third, measure impact: track time saved, improvements in data quality, and even team satisfaction to prove the value. And finally, once you have shown success, scale gradually: expand what worked to more pipelines and more parts of your data platform, drawing on what you learned from the existing pipelines. This approach helps reduce risk while ensuring adoption is sustainable and measurable.

The key takeaways of this presentation: first, AI is transforming ETL from manual to autonomous. We are seeing up to an 85% reduction in schema mapping effort and a 43% cut in maintenance cost. Second, self-healing pipelines proactively catch 92% of anomalies before they cause downstream issues. That's a huge boost in data quality and trust. Third, start small, but start today. Pick one high-effort, repetitive process to automate for quick wins and momentum. And finally, free your team for strategic work. Every hour saved on maintenance is an hour unlocked for innovation and business value. Thank you so much. I'm excited to see how we all move forward in this AI-driven future of data engineering. Thank you so much.
...

Janardhan Reddy Kasireddy

@ Reveal Global Consulting


