Conf42 Incident Management 2025 - Online

- premiere 5PM GMT

Resilient Cross-Cloud Analytics Pipelines with Azure Databricks: A 10TB+ Scale Reliability Engineering Case Study


Abstract

Discover how we engineered a resilient 10TB+/day analytics pipeline across Azure and AWS with 99.95% uptime, robust failure handling, and incident-aware orchestration, ideal for reliability engineers managing critical data platforms at scale.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I'm Sruthi Erra Hareram, and today I'm sharing a real-world case study on building resilient cross-cloud analytics pipelines with Azure Databricks at more than 10 TB of daily scale. You'll hear exactly how we combined SRE discipline with Delta Lake optimizations to keep strict SLAs while making performance predictable and keeping the costs sane. For context, think billions of events per day, multi-region ingestion, and user-facing dashboards that must stay fresh within minutes.

Here's the flow for the next 12 to 15 minutes. First, I'll discuss the architecture and where the cross-cloud seams are. Second, the reliability engineering we enforced. Third, the performance work, especially Delta Lake and Z-ordering. Fourth, our incident playbook and how we shrank MTTR. And finally, a practical blueprint that you can adopt. I'll anchor each section with concrete thresholds and examples so you can relate to them.

Our challenge was ad campaign telemetry in double-digit terabytes per day. With many interdependent ETLs, a single upstream blip could cascade. The business expectations were blunt: real-time dashboards, historical trends in minutes, cross-cloud consistency without wasteful duplication, and predictable performance even when the load is spiky, without blowing the budget. We set a working target of P95 freshness under 15 minutes and P99 under 30 minutes, with back-pressure rules for when those numbers drift.

We deliberately went cross-cloud not to use everything, but to pick the best pieces and define hard contracts at the seams. Three principles guided us. One, data locality and replication across regions and providers for HA/DR and latency; we targeted an RPO of single-digit minutes and an RTO under an hour. Two, standardized APIs and OSS: Spark and Delta are the lingua franca, and provider-specific perks are hidden behind them. Three, unified observability: one place to see metrics, logs, and traces across clouds, so detection and response are fast. Every alert ties to an explicit SLI and a runbook.

Concretely, our stack looked like this. Azure Databricks for Spark compute, paired with Delta Lake for ACID semantics, schema evolution, and time travel. Azure Data Factory orchestrated dependencies, retries, and monitoring with jittered backoffs. Amazon S3 provided durable, versioned storage where we needed it for landing and curated zones. Around those we added a reliability layer: circuit breakers, bulkheads, and boundary isolation, so a cloud-specific issue stays contained. On top of that, a performance layer: Z-ordering and partitioning aligned to access patterns, small-file compaction, and dynamic allocation. And finally, an operational layer: runbooks, automated remediation, and blameless reviews. Every handoff writes checkpoints, so a stage can resume without reprocessing.

And we ran this like a product, using SRE. We tracked SLIs on three things: freshness from ingest to gold availability, P95 query latency for key workloads, and end-to-end success rate. We set error budgets that triggered graceful degradation when reliability slipped, for example pausing non-critical enrichments if the seven-day rolling P95 freshness exceeded budget. Post-incident, blameless reviews drove systematic fixes, and we attacked toil first, automating the top recurring failure classes so humans aren't babysitting pipelines. Our goal was to keep auto-resolution above 70%.
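To make the freshness SLI and error-budget gating described above concrete, here is a minimal sketch in Python. The thresholds come from the talk (P95 under 15 minutes, P99 under 30 minutes); the function and variable names are hypothetical and not part of the original system.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Freshness budgets mentioned in the talk: P95 < 15 minutes, P99 < 30 minutes.
P95_FRESHNESS_BUDGET = timedelta(minutes=15)
P99_FRESHNESS_BUDGET = timedelta(minutes=30)


def freshness_slis(gold_event_times: List[datetime]) -> Dict[str, timedelta]:
    """Compute ingest-to-gold freshness percentiles for a non-empty batch of rows
    that just landed in the gold layer. Freshness here is now minus event time."""
    now = datetime.now(timezone.utc)
    lags = sorted(now - ts for ts in gold_event_times)

    def pct(p: float) -> timedelta:
        return lags[min(len(lags) - 1, int(p * len(lags)))]

    return {"p95": pct(0.95), "p99": pct(0.99)}


def should_degrade(slis: Dict[str, timedelta]) -> bool:
    """Graceful-degradation trigger: pause non-critical enrichments
    once either freshness percentile exceeds its budget."""
    return (slis["p95"] > P95_FRESHNESS_BUDGET
            or slis["p99"] > P99_FRESHNESS_BUDGET)
```

In practice a check like this would run against the unified monitoring fabric rather than raw row timestamps, but the gating logic stays the same: the SLI drives the decision, not CPU or cluster metrics.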
To make failure safe, we standardized on four pipeline patterns. First, checkpointing, so every stage can resume from known offsets; offsets are committed only after idempotent writes succeed. Second, circuit breakers at integration points, so we stop cascades when freshness drifts or a dependency fails repeatedly; for instance, trip if the error rate exceeds 10% for five minutes, or if freshness breaches the SLO for three consecutive checks. Third, exponential backoff with jitter, so retries don't create stampedes; the random jitter was up to three seconds. Fourth, idempotent writes: replays never corrupt state, all merges are keyed, and we avoid side effects on retry. In short, retries only work when your operations are idempotent. A sketch of the breaker and retry logic follows below.

Performance was the other half. Before tuning we had full scans, big shuffles, and spiky latencies, so we aligned storage to the actual access patterns: partitioning that prunes, for example by event date, and Z-ordering on high-selectivity keys used together; think campaign ID, region, device, and time. We kept metadata lean with OPTIMIZE and small-file compaction, so Spark wasn't drowning in tiny files. The result: drastic I/O reduction, complex queries dropping to minutes or even seconds with stable P95s, improved cache hit rates, and meaningfully less shuffle spill.

Practically, how did we choose Z-order keys? We mined query logs to find the most frequent multi-column filters with high cardinality and low overlap. We validated with A/B runs on representative workloads and revisited the keys quarterly, because access patterns shift with new features and dashboards. We also capped Z-order frequency to avoid constantly re-clustering: typically weekly for hot tables, monthly for warm ones.

We also tackled cost without sacrificing reliability. We abandoned fixed clusters and moved to workload-aware auto-scaling. Inputs included priority tier, incoming volume, queue depth, SLA proximity, utilization, and hard cost guards like max nodes per tier. That combination kept clusters right-sized, cut idle burn, and preserved predictability. The side effect: far fewer capacity pages, so engineers focused on data quality and features. When nearing an SLA breach, we temporarily allowed burst scaling, then decayed back to steady state.

Incidents still happen, and our response lifecycle has five steps. Detection is SLA- and SLI-driven: freshness drift and success rates, not just CPU. Classification tags scope and business criticality and selects the runbook. Containment trips breakers, diverts traffic to the last known good slices, and sheds non-critical loads. Remediation kicks off self-healing: re-queuing idempotent stages with jittered retries, compacting small files if that's the cause, and restarting targeted tasks instead of whole jobs. Recovery resumes from checkpoints and passes a consistency validation gate, like schema, row counts, and spot queries, before we lift any stale-data banners.

A concrete example: an AWS S3 cross-region replication lag. Freshness monitors flagged a timestamp drift beyond budget, so we paused downstream joins, served the last consistent slice with a clear stale-data banner, and diverted non-critical workloads. When replication recovered, we validated consistency on boundary tables, burst-scaled to clear the backlog, and returned to steady state. Because the playbook was predefined, on-call only approved steps; everything else was automated.
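As an illustration of those breaker and retry patterns, here is a minimal Python sketch using the thresholds quoted in the talk (10% error rate, three consecutive freshness breaches, jitter up to three seconds). The class and function names are hypothetical, not the team's actual code.

```python
import random
import time

MAX_RETRIES = 5
MAX_JITTER_SECONDS = 3.0  # the talk mentions random jitter of up to three seconds


def retry_with_jittered_backoff(operation, base_delay=1.0):
    """Retry an idempotent operation with exponential backoff plus jitter,
    so simultaneous retries do not stampede the downstream dependency."""
    for attempt in range(MAX_RETRIES):
        try:
            return operation()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, MAX_JITTER_SECONDS)
            time.sleep(delay)


class CircuitBreaker:
    """Trips when the recent error rate exceeds 10% or freshness breaches its
    SLO on three consecutive checks (thresholds taken from the talk)."""

    def __init__(self, error_rate_threshold=0.10, freshness_breach_limit=3):
        self.error_rate_threshold = error_rate_threshold
        self.freshness_breach_limit = freshness_breach_limit
        self.consecutive_freshness_breaches = 0
        self.open = False

    def record(self, error_rate: float, freshness_breached: bool) -> None:
        self.consecutive_freshness_breaches = (
            self.consecutive_freshness_breaches + 1 if freshness_breached else 0
        )
        if (error_rate > self.error_rate_threshold
                or self.consecutive_freshness_breaches >= self.freshness_breach_limit):
            self.open = True  # stop calling the dependency; serve last known good data

    def allow(self) -> bool:
        return not self.open
```

In a real pipeline, record() would be fed from the same monitoring fabric that drives the SLIs, and an open breaker would route readers to the last known good slice while remediation runs.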
The metrics we cared about most were MTTR, first-time success rate, P95 query latency, auto-resolution rate, freshness, data quality error rate, and overall system uptime versus a 99.9% target. Watching these together is what lets you act proactively. For example, a rising data quality error rate paired with stable compute suggests schema drift, not capacity; a widening P95 with flat volume suggests partition skew or small-file bloat.

If you're adopting this, here's a practical blueprint. Foundation: define SLIs and error budgets, set up cross-cloud authentication, build a unified monitoring fabric, and be explicit about your consistency model, that is, when slightly stale data is acceptable and for how long. Reliability: put circuit breakers at every integration seam, configure auto-scaling policies with burst windows, add checkpoints at all handoffs, and gate promotions to gold through validation on schemas, row counts, sampling, and freshness. Performance: tune shuffles and partitions, implement Z-ordering from real query patterns, enable dynamic allocation and targeted caching only where hit rates justify it, and schedule small-file compaction with thresholds, for example compacting when the average file size is less than 128 MB (see the sketch after this section). Operational: codify the runbooks, automate the top three remediations, hold blameless reviews with action items, and drill cross-team on two failure modes per quarter.

My key takeaways are: one, at this scale SRE is not optional, it's the backbone. Two, automate recovery to slash MTTR and on-call load; target more than half of incidents to be self-healed. Three, optimize for your access patterns; Z-ordering and storage layout choices deliver outsized wins, and you should verify with query logs, not intuition. And four, cross-cloud only works with intentional contracts at the seams; that is how you prevent cascades and keep costs predictable.

Thank you for listening. In Q&A I'm happy to unpack how we picked Z-order keys from query logs, the exact freshness SLI thresholds we used and why, or the conditions that trip our circuit breakers, including the grace periods and hysteresis that prevent flapping. Thank you.
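For the compaction threshold in the blueprint, here is a minimal sketch of what threshold-driven OPTIMIZE scheduling could look like on a Delta table. It assumes a Databricks-style environment with a SparkSession available; the table and column names are purely illustrative, not from the talk.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; this line is only needed
# when running the sketch as a standalone job.
spark = SparkSession.builder.getOrCreate()

# Blueprint threshold from the talk: compact when the average file size < 128 MB.
TARGET_AVG_FILE_BYTES = 128 * 1024 * 1024


def compact_if_fragmented(table_name: str, zorder_cols: list) -> None:
    """Run OPTIMIZE ... ZORDER BY only when small-file fragmentation pushes
    the table's average file size under the 128 MB threshold."""
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    avg_file_bytes = detail["sizeInBytes"] / max(detail["numFiles"], 1)
    if avg_file_bytes < TARGET_AVG_FILE_BYTES:
        spark.sql(f"OPTIMIZE {table_name} ZORDER BY ({', '.join(zorder_cols)})")


# Hypothetical usage: a hot gold table Z-ordered on its most frequent filters.
# compact_if_fragmented("gold.campaign_events", ["campaign_id", "region"])
```

A scheduled job like this keeps hot tables weekly and warm tables monthly, matching the cadence described in the talk, while skipping tables that are already well compacted.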
...

Sruthi Erra Hareram

Data Engineer



