Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Ra Herra, and today we are going end to end on Kube-native ETL at scale, specifically how to make PySpark plus Airflow fast, reliable, and cost efficient in a cloud native environment.
So I'll start with the architecture that keeps us stable.
Then the Spark tuning that actually moves the needle.
Then the orchestration patterns that reduce operational pain.
And I'll finish with a phased rollout you can start next week.
My goal is zero fluff and maximum practicality.
Let's align on the route we'll take, so each section builds on the last.
Here's the agenda and we'll move through five building blocks.
So the first one is the cloud native ETL architecture, where we'll discuss containers, Kubernetes, and IaC, so we can stop the firefighting.
Second, PySpark performance optimization, where we'll talk about memory, join strategies, and how to tame skew.
Third, Airflow orchestration techniques, where we have dynamic DAGs, resilient patterns, and monitoring to stay ahead of failures.
Fourth, real world scale: what a multi-terabyte pipeline looks like when this is done well.
Fifth, an implementation roadmap with phased steps to adopt safely without risking production.
Before the how, let's be honest and discuss the why: what breaks at scale.
These are the ETL scaling challenges that we face.
And when the data volume and the velocity climb, the same four issues resurface.
One, resource misallocation. Clusters are either over provisioned and expensive or under provisioned and unreliable, because sizing is guesswork and provisioning is slow.
Two, limited visibility and complex dependencies.
When a job fails at 2:00 AM you can't see lineage or timing
clearly, so recovery is manual.
Three, brittle monoliths: small changes, like a new column, can ripple through a monster script and break everything.
Four, exploding data volumes.
More data increases shuffle cost, storage cost, and blast radius of bad input.
Together these cause delays, unpredictable costs, and reactive firefighting.
The better way is an agile, automated, and scalable architecture: resources flex with load, dependencies are explicit, and reliability is a feature, not luck.
Let's pour a stable foundation first and then we'll talk about speed.
We rely on three mutually reinforcing pillars.
The first pillar is containers, where code and dependencies are packaged together, so jobs are reproducible across dev, test, and prod. That alone eliminates the "works on my machine" surprises and lets each pipeline declare exactly the CPU, memory, and JVM it needs.
Pillar number two is Kubernetes as the compute fabric. Namespaces isolate teams, autoscaling expands worker pods under load, and quotas keep noisy neighbors in check. For multi-tenant analytics, this is essential.
Number three is infrastructure as code. We've used Terraform and Helm to deploy the platform and Airflow declaratively, so we can review changes via pull requests and promote via GitOps. That turns infrastructure into versioned, testable artifacts.
Here's a quick example: if a team needs a Python upgrade or a new connector, you ship a new container and roll it out safely through Helm values. No snowflake servers, no drift.
With the runway built, the plane is Spark. Let's tune it to fly.
Spark performance reduces to three layers: first memory, then joins, then skew.
When it comes to memory, start by right sizing the executors; approximately four to eight GB per executor is a pragmatic range. Cache sparingly, with persist only where reuse pays off. Set memory fractions around 0.6 for execution and 0.2 for storage as a baseline, and adjust with real metrics, not hunches.
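To make that concrete, here's a minimal sketch of a SparkSession built along those lines. The app name and exact values are placeholders; note that in Spark, spark.memory.fraction governs the unified execution-plus-storage pool and spark.memory.storageFraction reserves a share of it for caching, so treat these as a starting baseline to adjust from metrics.

```python
from pyspark.sql import SparkSession

# Illustrative right-sizing in the 4-8 GB per executor range; tune from real metrics.
spark = (
    SparkSession.builder
    .appName("etl-right-sized")                      # placeholder name
    .config("spark.executor.memory", "6g")           # per-executor heap
    .config("spark.executor.cores", "3")             # a few cores per JVM
    .config("spark.memory.fraction", "0.6")          # unified execution + storage pool
    .config("spark.memory.storageFraction", "0.2")   # share of that pool kept for caching
    .getOrCreate()
)

# Cache sparingly: persist only where a DataFrame is genuinely reused downstream.
# reused_df.persist()
```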
And then coming to joins: use broadcast hash joins for small dimensions, roughly under a hundred MB, to sidestep shuffles. For large fact-to-fact joins, lean on sort-merge and reduce the number of wide stages. If you see ballooning shuffle read times, your join strategy is usually the culprit.
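As a sketch of that join advice, here's what a broadcast hash join looks like in PySpark. The table paths, the dim_id key, and the 100 MB threshold are illustrative assumptions, not values from the talk's pipelines.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-strategies").getOrCreate()

# Hypothetical inputs: a large fact table and a small (< 100 MB) dimension table.
facts = spark.read.parquet("s3://bucket/facts/")
dims = spark.read.parquet("s3://bucket/dims/")

# Broadcast the small dimension to every executor and skip shuffling the fact table.
enriched = facts.join(broadcast(dims), on="dim_id", how="left")

# Alternatively, let Spark choose broadcast automatically below a size threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
```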
Then coming to skew: a handful of hot keys can stall an entire job. Use salting or pre-aggregation on hot keys so the partitions finish at similar times.
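Here's a minimal salting sketch, assuming a hypothetical events table with a hot customer_id joined to a modest profiles table; the column names, paths, and bucket count are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()
NUM_BUCKETS = 16  # illustrative; match to the severity of the skew

events = spark.read.parquet("s3://bucket/events/")      # skewed side (hot customer_id)
profiles = spark.read.parquet("s3://bucket/profiles/")  # smaller, even side

# Scatter the hot key across buckets by tagging each event with a random salt.
salted_events = events.withColumn("salt", (F.rand() * NUM_BUCKETS).cast("int"))

# Replicate the other side once per bucket so every (customer_id, salt) pair can match.
buckets = spark.range(NUM_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_profiles = profiles.crossJoin(buckets)

joined = (
    salted_events
    .join(salted_profiles, on=["customer_id", "salt"], how="inner")
    .drop("salt")
)
```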
Now a few reliability touches: add checkpointing to cut lineage depth, enable predicate pushdown and dynamic partition discovery to trim I/O, and make all writes idempotent so K8s-aware retries don't double-write.
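A hedged sketch of those reliability touches, with hypothetical paths and an assumed run_date partition column; dynamic partition overwrite is what keeps a retried task from double-writing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reliability-touches").getOrCreate()

# Checkpointing truncates deep lineage so a retry doesn't replay the whole plan.
spark.sparkContext.setCheckpointDir("s3://bucket/checkpoints/")  # hypothetical location
events = spark.read.parquet("s3://bucket/events/")               # hypothetical input
stable = events.filter("event_type = 'purchase'").checkpoint()   # stands in for many steps

# Dynamic partition overwrite makes the write idempotent: a retried task
# replaces only the partitions it produces instead of appending duplicates.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(
    stable.write
    .mode("overwrite")
    .partitionBy("run_date")            # assumes a run_date column exists
    .parquet("s3://bucket/output/")
)
```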
So here's a mini story.
We had a dimension table at 60 MB; moving to a broadcast join took a stage from 19 minutes to three. Another pipeline had a single customer representing 40% of the rows; a simple salt with 16 buckets leveled the partitions and removed the straggler. But don't trust vibes. Prove it.
So always take a baseline: driver and executor counts, memory, shuffle read and write, stage times, and spill metrics. Change one variable at a time. You'll typically see wins from fewer shuffles, better executor sizing, and smarter joins.
When you confirm a win, bake it into the project templates, so the next pipeline
inherits the improvement by default.
Now let's orchestrate at scale without turning it into a midnight job.
So here's Airflow on Kubernetes. Running Airflow on Kubernetes aligns the control plane with the data plane. You get scheduler scalability, 2,000-plus DAG runs per day; pod-level isolation, so each task declares its own CPU and memory; workload identity and managed secrets for built-in security; and cloud native logging and monitoring out of the box. In practice, teams see around 99.9% scheduler uptime with drastically less operational overhead.
So here's a practical tip: define a small set of pod templates, say small I/O, heavy joins, and machine learning, so tasks pick the right shape by convention.
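One way to express those shapes, assuming the KubernetesExecutor, is a small dictionary of pod overrides that tasks opt into; the shape names and resource numbers here are placeholders, not the speaker's actual templates.

```python
from kubernetes.client import models as k8s

# Hypothetical shared shapes; tasks pick one by convention.
POD_SHAPES = {
    "small_io": k8s.V1Pod(spec=k8s.V1PodSpec(containers=[
        k8s.V1Container(
            name="base",  # the Airflow worker container is named "base"
            resources=k8s.V1ResourceRequirements(
                requests={"cpu": "500m", "memory": "1Gi"},
                limits={"cpu": "1", "memory": "2Gi"},
            ),
        )
    ])),
    "heavy_join": k8s.V1Pod(spec=k8s.V1PodSpec(containers=[
        k8s.V1Container(
            name="base",
            resources=k8s.V1ResourceRequirements(
                requests={"cpu": "2", "memory": "8Gi"},
                limits={"cpu": "4", "memory": "16Gi"},
            ),
        )
    ])),
}

# A task then declares its shape via executor_config, e.g.:
# PythonOperator(task_id="enrich", python_callable=enrich,
#                executor_config={"pod_override": POD_SHAPES["heavy_join"]})
```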
Orchestration is more than just scheduling; it's the patterns that let it scale with people. So here are a few advanced Airflow patterns for resilient ETL. Adopt these four patterns to raise reliability without raising headcount.
First is dynamic DAG generation: keep the configs in S3 or GCS and programmatically generate the DAGs, so you get hundreds of similar pipelines with minimal copy-paste. There's a sketch of this right after the list.
Second is intelligent branching: use a BranchPythonOperator to route around known bad data, or to choose the light path versus the heavy path based on input size.
Third, backfill design: drive jobs by logical run dates, make each step idempotent, and separate compute from storage so old runs can be reprocessed safely.
Fourth, SLA monitoring: wire custom callbacks into Datadog or Prometheus, and alert on lateness and on leading indicators like rising retries, not just failures.
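Here's the dynamic DAG generation sketch promised above. It assumes the configs have already been synced locally from S3 or GCS and uses Airflow 2-style imports; the pipeline names, paths, and schedules are made up.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical per-pipeline configs, normally loaded from a bucket at deploy time.
PIPELINE_CONFIGS = [
    {"name": "orders", "source": "s3://bucket/orders/", "schedule": "@daily"},
    {"name": "clicks", "source": "s3://bucket/clicks/", "schedule": "@hourly"},
]

for cfg in PIPELINE_CONFIGS:
    with DAG(
        dag_id=f"etl_{cfg['name']}",
        start_date=datetime(2024, 1, 1),
        schedule=cfg["schedule"],   # "schedule_interval" on older Airflow versions
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="spark_submit",
            bash_command=f"spark-submit etl_job.py --source {cfg['source']}",
        )
    # Expose each generated DAG at module level so the scheduler discovers it.
    globals()[dag.dag_id] = dag
```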
And here's a micro example: a media pipeline checks volume at a fan-in step. If today's volume is greater than 1.5 times the seven-day median, it auto-branches to a wider executor profile and adds a midstream checkpoint.
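A minimal sketch of that volume check with a BranchPythonOperator; the two helper functions and the downstream task IDs are assumptions standing in for whatever your pipeline actually exposes.

```python
from airflow.operators.python import BranchPythonOperator

def choose_profile(**context):
    todays_volume = get_todays_row_count()      # assumption: defined elsewhere
    median_volume = get_seven_day_median()      # assumption: defined elsewhere
    if todays_volume > 1.5 * median_volume:
        return "run_wide_profile"               # heavy path: wider executors + checkpoint
    return "run_standard_profile"               # light path

check_volume = BranchPythonOperator(
    task_id="check_volume",
    python_callable=choose_profile,
)
# check_volume >> [run_wide_profile, run_standard_profile]
```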
Of course, all of this still lives and dies by executor sizing. So here is a chart of Spark executor sizing; it's basically the art of resource allocation.
Think of sizing as a curve, not a point. Large executors mean fewer JVMs and strong throughput, but risk long GC pauses and amplify skew. Medium, balanced executors are a good default for mixed workloads. Tiny executors start fast and pack the cluster with parallelism, but management overhead and shuffle fan-out can dominate. Your safety net is dynamic allocation with sensible minimum and maximum bounds, so the cluster scales under load but never thrashes.
Target the sweet spot where utilization is high, spills are rare, and shuffle stages don't stall. Here's a field trick: if median task time is low but the P95 is awful, suspect skew. If both median and P95 are high, you're under-resourced or over-shuffling.
You can't tune what you can't see.
So let's wire observability in from the start.
So here's monitoring and observability, which is really the reliability foundation.
It comes down to metrics: use Prometheus to capture cluster health and Spark app metrics, and watch executor active time, shuffle read bytes, spills, and task queues. Then logs and traces: emit structured JSON logs with correlation IDs, and add OpenTelemetry tracing to stitch Airflow tasks and Spark stages into a single story. For data quality, automate profiling with Great Expectations and gate critical steps on expectations. Alert on KPIs the business cares about, not just technical counts. And for cost, track per-app and per-DAG spend so you can right size continuously.
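For the structured-logs point, here's a small, self-contained sketch using only the standard library; the field names and the idea of one correlation ID per run are assumptions about how you might tie Airflow and Spark logs together.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so logs stay queryable downstream."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("etl")
log.addHandler(handler)
log.setLevel(logging.INFO)

# One correlation ID per pipeline run ties Airflow task logs and Spark driver
# logs into the single story mentioned above.
run_correlation_id = str(uuid.uuid4())
log.info("stage complete", extra={"correlation_id": run_correlation_id})
```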
Now we've got the what and the how. But how do you adopt this without breaking production? Here's the modernization journey: following phase one through phase five, you can take a legacy batch stack to a cloud native platform. Use the five-phase path so risk stays low and momentum stays high.
Phase one is assessment: inventory jobs, profile performance, map dependencies, and choose candidates by business impact. Phase two is containerization: split monoliths into modules and package them with multi-stage Docker builds for smaller, faster images. Phase three is orchestration: clean up DAG dependencies and manage environment-specific configuration via Airflow variables. Phase four is optimization: right size resources, tune joins and memory, and rework retry and recovery. Phase five is operationalization: CI/CD, monitoring, SLAs, documentation, and knowledge transfer.
Take the small-bets approach: migrate one medium-impact pipeline per sprint, compare the baseline versus the new version, publish the win, and repeat.
Now, as you roll out, dodge the classic traps. These are the common pitfalls, and it's better to avoid them.
The first one is resource misallocation: over provisioning wastes money and under provisioning fails jobs. Enforce dynamic allocation bounds and alert on idle time.
Second, excessive shuffling, which is often skew or unnecessary repartitioning. Use the Spark UI and stage metrics to pinpoint hot keys, then salt or pre-aggregate.
Third, wrong formats: switch large datasets from CSV or JSON to Parquet or ORC for roughly three to five times the gains through columnar storage and compression; there's a sketch of that switch after this list.
Fourth, orphaned resources: failed jobs can leave pods or persistent volumes behind, so add cleanup hooks and quotas as guardrails.
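Here's the format-switch sketch referenced in the list above: a one-time migration job with hypothetical paths and an assumed event_date column.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV once, write columnar Parquet, and point downstream jobs at the copy.
raw = spark.read.option("header", "true").csv("s3://bucket/raw/events_csv/")
(
    raw.write
    .mode("overwrite")
    .partitionBy("event_date")                       # assumes an event_date column exists
    .parquet("s3://bucket/curated/events_parquet/")
)
```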
So let's land this with a concrete startup plan and three messages to remember.
Here's the implementation roadmap if you're planning on getting started; this is how your first month would look. Week one: pick one safe pipeline and record a baseline of runtime, shuffle, and cost.
Week two: stand up a small K8s node pool, deploy Airflow with Helm, and set up workload identity and secrets.
Week three: containerize the pipeline, add dynamic allocation, switch joins appropriately, and add checkpoints.
Week four: wire up metrics, logs, and quality checks, run the new pipeline and the legacy one in parallel, compare and publish results, and capture lessons in a Spark and Airflow playbook.
And if you only take three ideas with you, here are the key takeaways: architectural discipline, performance optimization, resilient orchestration, and operational excellence.
Coming to architecture first: it depends on containers, Kubernetes, and IaC, which give you both stability and speed of change. The Spark tuning piece, focusing on memory, joins, and skew, routinely yields approximately 30 to 50% efficiency improvements. And then observability: trust in metrics, logs, traces, quality, and cost visibility makes pipelines dependable and defensible.
Thanks for spending this time, and if you would like the sizing templates, dynamic DAG patterns, or migration checklist, I'm happy to share.
Thank you so much.