Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Ra Herra, and today we are going end to end on Kube-native ETL at scale, specifically how to make PySpark plus Airflow fast, reliable, and cost efficient in a cloud native environment.
So I'll start with the architecture that keeps us stable.
Then the Spark tuning that actually moves the needle.
Then the orchestration patterns that reduce operational pain.
And I'll finish with a phased rollout you can start next week.
My goal is zero fluff and maximum practicality.
Let's align on the route we'll take, so each section builds on the last.
Here's the agenda and we'll move through five building blocks.
So the first one is the cloud native ETL architecture, where we'll discuss containers, Kubernetes, and IaC, so we can stop the firefighting.
Second, PySpark performance optimization, where we'll talk about memory, join strategies, and how to tame skew.
Third, Airflow orchestration techniques, where we have dynamic DAGs, resilient patterns, and monitoring to stay ahead of failures.
Fourth, real world scale: what a multi-terabyte pipeline looks like when this is done well.
Fifth, an implementation roadmap with phased steps to adopt safely without risking production.
Before the how, let's be honest and discuss the why: what breaks at scale.
These are the ETL scaling challenges that we face.
And when the data volume and the velocity climb, the same four issues resurface.
One, resource misallocation. Clusters are either over provisioned and expensive or under provisioned and unreliable, because sizing is guesswork and provisioning is slow.
Two, limited visibility and complex dependencies.
When a job fails at 2:00 AM you can't see lineage or timing
clearly, so recovery is manual.
Three, brittle monoliths: small changes, like a new column, can ripple through a monster script and break everything.
Four, exploding data volumes.
More data increases shuffle cost, storage cost, and blast radius of bad input.
Together these cause delays, unpredictable costs, and reactive firefighting.
The better way is an agile, automated, and scalable architecture: resources flex with load, dependencies are explicit, and reliability is a feature, not luck.
Let's pour a stable foundation first and then we'll talk about speed.
We rely on three mutually reinforcing pillars.
The first pillar is containers, where code and dependencies are packaged together, so jobs are reproducible across dev, test, and prod. That alone eliminates the "works on my machine" surprises and lets each pipeline declare exactly the CPU, memory, and JVM it needs.
Pillar number two is Kubernetes as the compute fabric. Namespaces isolate teams, autoscaling expands worker pods under load, and quotas keep noisy neighbors in check. For multi-tenant analytics, this is essential.
Number three is infrastructure as code. We've used Terraform and Helm to deploy the platform and Airflow declaratively, so we can review changes via pull requests and promote via GitOps. That turns infrastructure into versioned, testable artifacts.
Here's a quick example: if a team needs a Python upgrade or a new connector, you ship a new container and roll it out safely through Helm values. No snowflake servers, no drift.
With the runway built, the plane is Spark. Let's tune it to fly.
Spark performance reduces to three layers: first memory, then joins, then skew.
When it comes to memory, start by right sizing the executors; approximately four to eight GB per executor is a pragmatic range. Cache sparingly, with persist only where reuse pays off. Set memory fractions around 0.6 for execution and 0.2 for storage as a baseline, and adjust with real metrics, not hunches.
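To make that concrete, here's a minimal sketch of a SparkSession built along those lines. The app name and exact values are placeholders; note that in Spark, spark.memory.fraction governs the unified execution-plus-storage pool and spark.memory.storageFraction reserves a share of it for caching, so treat these as a starting baseline to adjust from metrics.

```python
from pyspark.sql import SparkSession

# Illustrative right-sizing in the 4-8 GB per executor range; tune from real metrics.
spark = (
    SparkSession.builder
    .appName("etl-right-sized")                      # placeholder name
    .config("spark.executor.memory", "6g")           # per-executor heap
    .config("spark.executor.cores", "3")             # a few cores per JVM
    .config("spark.memory.fraction", "0.6")          # unified execution + storage pool
    .config("spark.memory.storageFraction", "0.2")   # share of that pool kept for caching
    .getOrCreate()
)

# Cache sparingly: persist only where a DataFrame is genuinely reused downstream.
# reused_df.persist()
```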
And then coming to joins: use broadcast hash joins for small dimensions, roughly under a hundred MB, to sidestep shuffles. For large fact-to-fact joins, lean on sort-merge and reduce the number of wide stages. If you see ballooning shuffle read times, your join strategy is usually the culprit.
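As a sketch of that join advice, here's what a broadcast hash join looks like in PySpark. The table paths, the dim_id key, and the 100 MB threshold are illustrative assumptions, not values from the talk's pipelines.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-strategies").getOrCreate()

# Hypothetical inputs: a large fact table and a small (< 100 MB) dimension table.
facts = spark.read.parquet("s3://bucket/facts/")
dims = spark.read.parquet("s3://bucket/dims/")

# Broadcast the small dimension to every executor and skip shuffling the fact table.
enriched = facts.join(broadcast(dims), on="dim_id", how="left")

# Alternatively, let Spark choose broadcast automatically below a size threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
```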
Then coming to skew: a handful of hot keys can stall an entire job. Use salting or pre-aggregation on hot keys so the partitions finish at similar times.
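Here's a minimal salting sketch, assuming a hypothetical events table with a hot customer_id joined to a modest profiles table; the column names, paths, and bucket count are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()
NUM_BUCKETS = 16  # illustrative; match to the severity of the skew

events = spark.read.parquet("s3://bucket/events/")      # skewed side (hot customer_id)
profiles = spark.read.parquet("s3://bucket/profiles/")  # smaller, even side

# Scatter the hot key across buckets by tagging each event with a random salt.
salted_events = events.withColumn("salt", (F.rand() * NUM_BUCKETS).cast("int"))

# Replicate the other side once per bucket so every (customer_id, salt) pair can match.
buckets = spark.range(NUM_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_profiles = profiles.crossJoin(buckets)

joined = (
    salted_events
    .join(salted_profiles, on=["customer_id", "salt"], how="inner")
    .drop("salt")
)
```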
Now a few reliability touches: add checkpointing to cut lineage depth, enable predicate pushdown and dynamic partition discovery to trim I/O, and make all writes idempotent so K8s-aware retries don't double-write.
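A hedged sketch of those reliability touches, with hypothetical paths and an assumed run_date partition column; dynamic partition overwrite is what keeps a retried task from double-writing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reliability-touches").getOrCreate()

# Checkpointing truncates deep lineage so a retry doesn't replay the whole plan.
spark.sparkContext.setCheckpointDir("s3://bucket/checkpoints/")  # hypothetical location
events = spark.read.parquet("s3://bucket/events/")               # hypothetical input
stable = events.filter("event_type = 'purchase'").checkpoint()   # stands in for many steps

# Dynamic partition overwrite makes the write idempotent: a retried task
# replaces only the partitions it produces instead of appending duplicates.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(
    stable.write
    .mode("overwrite")
    .partitionBy("run_date")            # assumes a run_date column exists
    .parquet("s3://bucket/output/")
)
```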
So here's a mini story.
We had a dimension table at 60 MB; moving to a broadcast join took a stage from 19 minutes to three. Another pipeline had a single customer representing 40% of the rows; a simple salt with 16 buckets leveled the partitions and removed the straggler. But don't trust vibes. Prove it.
So always take a baseline: driver and executor counts, memory, shuffle read and write, stage times, and spill metrics. Change one variable at a time. You'll typically see wins from fewer shuffles, better executor sizing, and smarter joins.
When you confirm a win, bake it into the project templates, so the next pipeline
inherits the improvement by default.
Now let's orchestrate at scale without turning it into a midnight job.
So here's Airflow on Kubernetes. Running Airflow on Kubernetes aligns the control plane with the data plane. You get scheduler scalability, 2,000-plus DAG runs per day; pod-level isolation, so each task declares its own CPU and memory; workload identity and managed secrets for built-in security; and cloud native logging and monitoring out of the box. In practice, teams see around 99.9% scheduler uptime with drastically less operational overhead.
So here's a practical tip: define a small set of pod templates, say small I/O, heavy joins, and machine learning, so tasks pick the right shape by convention.
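One way to express those shapes, assuming the KubernetesExecutor, is a small dictionary of pod overrides that tasks opt into; the shape names and resource numbers here are placeholders, not the speaker's actual templates.

```python
from kubernetes.client import models as k8s

# Hypothetical shared shapes; tasks pick one by convention.
POD_SHAPES = {
    "small_io": k8s.V1Pod(spec=k8s.V1PodSpec(containers=[
        k8s.V1Container(
            name="base",  # the Airflow worker container is named "base"
            resources=k8s.V1ResourceRequirements(
                requests={"cpu": "500m", "memory": "1Gi"},
                limits={"cpu": "1", "memory": "2Gi"},
            ),
        )
    ])),
    "heavy_join": k8s.V1Pod(spec=k8s.V1PodSpec(containers=[
        k8s.V1Container(
            name="base",
            resources=k8s.V1ResourceRequirements(
                requests={"cpu": "2", "memory": "8Gi"},
                limits={"cpu": "4", "memory": "16Gi"},
            ),
        )
    ])),
}

# A task then declares its shape via executor_config, e.g.:
# PythonOperator(task_id="enrich", python_callable=enrich,
#                executor_config={"pod_override": POD_SHAPES["heavy_join"]})
```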
Orchestration is more than just scheduling; it's the patterns that let it scale with people. So here are a few advanced Airflow patterns for resilient ETL. Adopt these four patterns to raise reliability without raising headcount.
First is dynamic DAG generation: keep the configs in S3 or GCS and programmatically generate the DAGs, so you get hundreds of similar pipelines with minimal copy-paste. There's a sketch of this right after the list.
Second is intelligent branching: use a BranchPythonOperator to route around known bad data, or to choose the light path versus the heavy path based on input size.
Third, backfill design: drive jobs by logical run dates, make each step idempotent, and separate compute from storage so old runs can be reprocessed safely.
Fourth, SLA monitoring: wire custom callbacks into Datadog or Prometheus, and alert on lateness and on leading indicators like rising retries, not just failures.
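Here's the dynamic DAG generation sketch promised above. It assumes the configs have already been synced locally from S3 or GCS and uses Airflow 2-style imports; the pipeline names, paths, and schedules are made up.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical per-pipeline configs, normally loaded from a bucket at deploy time.
PIPELINE_CONFIGS = [
    {"name": "orders", "source": "s3://bucket/orders/", "schedule": "@daily"},
    {"name": "clicks", "source": "s3://bucket/clicks/", "schedule": "@hourly"},
]

for cfg in PIPELINE_CONFIGS:
    with DAG(
        dag_id=f"etl_{cfg['name']}",
        start_date=datetime(2024, 1, 1),
        schedule=cfg["schedule"],   # "schedule_interval" on older Airflow versions
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="spark_submit",
            bash_command=f"spark-submit etl_job.py --source {cfg['source']}",
        )
    # Expose each generated DAG at module level so the scheduler discovers it.
    globals()[dag.dag_id] = dag
```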
And here's a micro example: a media pipeline checks volume at a fan-in step. If today's volume is greater than 1.5 times the seven-day median, it auto-branches to a wider executor profile and adds a midstream checkpoint.
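A minimal sketch of that volume check with a BranchPythonOperator; the two helper functions and the downstream task IDs are assumptions standing in for whatever your pipeline actually exposes.

```python
from airflow.operators.python import BranchPythonOperator

def choose_profile(**context):
    todays_volume = get_todays_row_count()      # assumption: defined elsewhere
    median_volume = get_seven_day_median()      # assumption: defined elsewhere
    if todays_volume > 1.5 * median_volume:
        return "run_wide_profile"               # heavy path: wider executors + checkpoint
    return "run_standard_profile"               # light path

check_volume = BranchPythonOperator(
    task_id="check_volume",
    python_callable=choose_profile,
)
# check_volume >> [run_wide_profile, run_standard_profile]
```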
Of course, all of this still lives and dies by executor sizing. So here is a chart of Spark executor sizing; it's basically the art of resource allocation.
Think of sizing as a curve, not a point. Large executors mean fewer JVMs and strong throughput, but risk long GC pauses and amplify skew. Medium, balanced executors are a good default for mixed workloads. Tiny executors start fast and pack the cluster with parallelism, but management overhead and shuffle fan-out can dominate. Your safety net is dynamic allocation with sensible minimum and maximum bounds, so the cluster scales under load but never thrashes.
Target the sweet spot where utilization is high, spills are rare, and shuffle stages don't stall. Here's a field trick: if median task time is low but the P95 is awful, suspect skew. If both median and P95 are high, you're under-resourced or over-shuffling.
You can't tune what you can't see.
So let's wire observability in from the start.
So here's monitoring and observability, which is really the reliability foundation.
It comes down to metrics: use Prometheus to capture cluster health and Spark app metrics, and watch executor active time, shuffle read bytes, spills, and task queues. Then logs and traces: emit structured JSON logs with correlation IDs, and add OpenTelemetry tracing to stitch Airflow tasks and Spark stages into a single story. For data quality, automate profiling with Great Expectations and gate critical steps on expectations. Alert on KPIs the business cares about, not just technical counts. And for cost, track per-app and per-DAG spend so you can right size continuously.
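For the structured-logs point, here's a small, self-contained sketch using only the standard library; the field names and the idea of one correlation ID per run are assumptions about how you might tie Airflow and Spark logs together.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so logs stay queryable downstream."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("etl")
log.addHandler(handler)
log.setLevel(logging.INFO)

# One correlation ID per pipeline run ties Airflow task logs and Spark driver
# logs into the single story mentioned above.
run_correlation_id = str(uuid.uuid4())
log.info("stage complete", extra={"correlation_id": run_correlation_id})
```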
Now we've got the what and the how. But how do you adopt this without breaking production? Here's the modernization journey: following phase one through phase five, you can take a legacy batch stack to a cloud native platform. Use the five-phase path so risk stays low and momentum stays high.
Phase one is assessment: inventory jobs, profile performance, map dependencies, and choose candidates by business impact. Phase two is containerization: split monoliths into modules and package them with multi-stage Docker builds for smaller, faster images. Phase three is orchestration: clean up DAG dependencies and manage environment-specific configuration via Airflow variables. Phase four is optimization: right size resources, tune joins and memory, and rework retry and recovery. Phase five is operationalization: CI/CD, monitoring, SLAs, documentation, and knowledge transfer.
Take the small-bets approach: migrate one medium-impact pipeline per sprint, compare the baseline versus the new version, publish the win, and repeat.
Now, as you roll out, dodge the classic traps. These are the common pitfalls, and it's better to avoid them.
The first one is resource misallocation: over provisioning wastes money and under provisioning fails jobs. Enforce dynamic allocation bounds and alert on idle time.
Second, excessive shuffling, which is often skew or unnecessary repartitioning. Use the Spark UI and stage metrics to pinpoint hot keys, then salt or pre-aggregate.
Third, wrong formats: switch large datasets from CSV or JSON to Parquet or ORC for roughly three to five times the gains through columnar storage and compression; there's a sketch of that switch after this list.
Fourth, orphaned resources: failed jobs can leave pods or persistent volumes behind, so add cleanup hooks and quotas as guardrails.
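Here's the format-switch sketch referenced in the list above: a one-time migration job with hypothetical paths and an assumed event_date column.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV once, write columnar Parquet, and point downstream jobs at the copy.
raw = spark.read.option("header", "true").csv("s3://bucket/raw/events_csv/")
(
    raw.write
    .mode("overwrite")
    .partitionBy("event_date")                       # assumes an event_date column exists
    .parquet("s3://bucket/curated/events_parquet/")
)
```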
So let's land this with a concrete startup plan and three messages to remember.
Here's the implementation roadmap if you're planning on getting started; this is how your first month would look. Week one: pick one safe pipeline and record a baseline of runtime, shuffle, and cost.
Week two: stand up a small K8s node pool, deploy Airflow with Helm, and set up workload identity and secrets.
Week three: containerize the pipeline, add dynamic allocation, switch joins appropriately, and add checkpoints.
Week four: wire up metrics, logs, and quality checks, run the new pipeline and the legacy one in parallel, compare and publish results, and capture lessons in a Spark and Airflow playbook.
And if you only take three ideas with you, here are the key takeaways: architectural discipline, performance optimization, resilient orchestration, and operational excellence.
Coming to architecture first: it depends on containers, Kubernetes, and IaC, which give you both stability and speed of change. The Spark tuning piece, focusing on memory, joins, and skew, routinely yields approximately 30 to 50% efficiency improvements. And then observability: trust in metrics, logs, traces, quality, and cost visibility makes pipelines dependable and defensible.
Thanks for spending this time, and if you would like the sizing templates, dynamic DAG patterns, or migration checklist, I'm happy to share.
Thank you so much.