Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm Kris, a technical architect and lead data engineer
with over 16 years of experience.
My paper is titled De-Mystifying Modern Data Pipeline Architecture:
From Traditional ETL to Cloud-Native Streaming.
Think back to a time when you had to wait all night for a business
report to process. That was the era of ETL. But today the digital factory never
sleeps, and our data can't either.
Throughout my career, I've been on the front line of this dramatic evolution, and today
I'll share how we are moving from a batch-oriented mindset to the world
of cloud-native streaming, and what that means for how we turn data into value.
Here is a look at what we'll cover today.
We'll start with the evolution of data pipeline architectures,
discussing the limitations of legacy systems.
Then we'll dive into the modern architectural patterns that have emerged.
We'll look at the tool evolution landscape, examine critical design considerations
for building resilient pipelines, and finally look at emerging trends and
practical migration strategies for organizations.
The key takeaway of the modern data engineering revolution is the shift
from batch to real-time processing.
The legacy approach relied on scheduled batch processing
using centralized systems.
The modern reality involves distributed, real-time, and cloud-native architectures.
Two main factors drive this change.
The business driver is an intense need for immediate insight and operational intelligence.
The technical driver is the demand for greater scalability, better
cost efficiency, and overall flexibility in our data systems.
Traditional ETL limitations: why change was inevitable.
Traditional ETL frameworks faced significant constraints. First, batch
processing windows, typically run overnight to avoid disrupting operational
systems, severely limited data availability once businesses started
demanding real-time monitoring.
Second, the design meant single points of failure with limited recovery
options, often requiring completely reprocessing the data.
Third, rigid infrastructure meant performance improvements depended
on expensive hardware upgrades, involving high upfront costs and
complex capacity planning.
Fourth, systems were built primarily for structured data and struggled to handle the
explosion of semi-structured logs and streaming data from new sources like
devices and web applications.
Finally, they also suffered from vendor lock-in
due to proprietary query extensions.
The cloud storage revolution: decoupling storage from compute.
The introduction of cloud storage, specifically object storage like Azure
Blob Storage or Google Cloud Storage,
was fundamental to enabling modern data architecture.
The most profound change was the decoupling
of storage and compute.
Before the cloud revolution, storage and compute were tightly coupled
in expensive fixed-capacity warehouses.
If you needed more processing power, you usually had to buy
more storage, and vice versa.
After the shift, we gained access to practically unlimited,
extremely cost-effective object storage, which can scale independently
of our processing clusters.
This architectural freedom delivers several key benefits.
First, the pay-as-you-go pricing model shifts cost from large capital
expenditures to variable operational costs, making data at scale
accessible to every organization.
Second, it enables schema-on-read flexibility.
We can now store raw data in its native format and apply
transformations later, preserving data integrity rather than forcing
it into rigid, predefined tabular formats up front.
Finally, cloud storage offers native redundancy and durability, and it
inherently supports all formats, including semi-structured and unstructured data,
which was impossible with relational databases.
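To make schema-on-read concrete, here is a minimal sketch in PySpark; the path, column names, and schema are hypothetical and stand in for raw files landed in object storage.

```python
# Minimal schema-on-read sketch: raw JSON stays in its native format, and the
# structure we care about is declared only at read time. Path and fields are
# hypothetical; in practice the path would point at cloud object storage.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("value", DoubleType()),
])

# The same raw files can later be read with a different schema for a
# different purpose, without being reloaded or transformed up front.
events = spark.read.schema(schema).json("/data/raw/clickstream/")
events.groupBy("event_type").count().show()
```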
Modern architectural patterns: an overview. As data systems matured to
handle modern challenges, several key architectural patterns emerged.
These patterns represent different philosophies for how we store
and process data at scale.
Let's explore the five most influential.
First, the medallion architecture.
This pattern defines transparent quality layers, bronze, silver,
and gold, for data refinement and curation, with governance and quality
built into the pipeline.
Second, the Lambda architecture.
This is a dual-path approach using a batch layer for complete historical accuracy
and a speed layer for real-time insight.
Third, the Kappa architecture, a simplification of Lambda.
This pattern adopts a stream-first model, using a single processing engine and
an event log as the source of truth for both real-time and historical views.
Fourth, the lakehouse paradigm.
This unified approach brings the transactional features and performance
of a data warehouse directly onto the low-cost storage of a data lake.
Finally, data mesh.
This is less a technical architecture and more an organizational structure,
decentralizing data ownership and treating data as products owned by domain teams.
Now I would like to discuss the medallion architecture.
The medallion architecture, modeled on Olympic medals, organizes data
refinement into discrete quality levels.
The bronze layer is where raw data lands without modification, ensuring
complete source fidelity; its validation usually focuses on completeness.
The silver layer is where the foundation is built: raw data is refined
by standardizing formats, fixing inconsistencies, and
applying quality and governance rules.
The gold layer is data shaped for business consumption.
It includes purpose-built structures like star schemas
and pre-aggregated metrics.
The main benefit is that it creates clear quality boundaries and
ensures predictable processing.
This makes it ideal for organizations with strong governance requirements.
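As a rough illustration of how these layers translate into code, here is a minimal PySpark sketch; the paths and column names (order_id, order_ts, amount) are hypothetical.

```python
# Minimal medallion sketch: bronze preserves the source, silver cleans and
# standardizes, gold shapes data for business consumption.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: land raw events as-is, ensuring complete source fidelity.
bronze = spark.read.json("/data/raw/orders/")            # hypothetical source path
bronze.write.mode("append").parquet("/data/bronze/orders/")

# Silver: standardize formats, fix inconsistencies, apply basic quality rules.
silver = (
    spark.read.parquet("/data/bronze/orders/")
    .withColumn("order_ts", F.to_timestamp("order_ts"))   # normalize timestamps
    .dropDuplicates(["order_id"])                          # remove duplicate records
    .filter(F.col("amount") >= 0)                          # simple quality rule
)
silver.write.mode("overwrite").parquet("/data/silver/orders/")

# Gold: purpose-built, pre-aggregated structures for consumption.
gold = (
    silver.groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"),
         F.count("*").alias("order_count"))
)
gold.write.mode("overwrite").parquet("/data/gold/daily_revenue/")
```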
Now I would like to discuss the comparison between the
Lambda and the Kappa architectures.
Lambda and Kappa represent different philosophies for handling both
historical and real-time data.
The Lambda architecture uses separate batch and stream processing paths.
The batch path handles comprehensive historical analysis, while the
speed path provides immediate, if preliminary, insights.
Its main downsides are higher complexity and
the need to maintain dual codebases for the transformation logic.
The Kappa architecture simplifies this by adopting a stream-processing-first approach.
It uses a single append-only event log as the definitive system of record;
batch processing is reconceptualized as just streaming with an unlimited time window.
This results in simpler maintenance and a unified processing model.
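Here is a minimal sketch of the stream-first idea using Spark Structured Streaming, assuming the Kafka connector package is available and a hypothetical "orders" topic acts as the event log; replaying from the earliest offset is what stands in for a batch run.

```python
# Minimal Kappa-style sketch: one transformation codebase reads the
# append-only event log; "batch" is just a replay with an unbounded window.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kappa-demo").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # hypothetical broker
    .option("subscribe", "orders")                          # hypothetical topic
    .option("startingOffsets", "earliest")  # replay the full log if needed
    .load()
)

# The same transformation serves both real-time and historical views.
orders = events.select(F.col("value").cast("string").alias("payload"))

query = (
    orders.writeStream.format("parquet")
    .option("path", "/data/orders_view/")
    .option("checkpointLocation", "/chk/orders_view/")
    .start()
)
query.awaitTermination()
```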
Now I would like to discuss the lakehouse.
The lakehouse pattern represents a powerful convergence.
It addresses the historical fragmentation of having separate data lakes for
raw storage and warehouses for analytics.
It augments the lake's flexible, low-cost storage with data warehouse performance.
Its features include implementing warehouse capabilities, such as ACID
transactions and schema enforcement, directly on cloud storage objects.
It supports multiple workloads, including BI, ML, and streaming,
and provides unified governance.
The business impact is the elimination of data duplication, fragmented
governance, and integration headaches.
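As one concrete sketch of warehouse capabilities on object storage, here is a minimal example with Delta Lake, one of several open table formats, assuming a Spark session already configured with the Delta packages; the path and columns are hypothetical.

```python
# Minimal lakehouse sketch: an ACID upsert (MERGE) performed directly on a
# table stored as objects in low-cost storage. Requires delta-spark configured.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()
path = "/data/lakehouse/customers"   # hypothetical location (object storage in practice)

# Initial load: a warehouse-style table written straight to the lake.
spark.createDataFrame(
    [(1, "alice@example.com"), (2, "bob@example.com")], ["id", "email"]
).write.format("delta").mode("overwrite").save(path)

# Transactional upsert with schema enforcement on the same storage objects.
updates = spark.createDataFrame(
    [(2, "bob@new.example.com"), (3, "carol@example.com")], ["id", "email"]
)
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```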
Now I would like to discuss the data mesh, a domain-oriented approach.
The data mesh pattern is less about technology and more about an organizational shift.
Its core principle is to treat data as a product owned by domain teams.
It is designed to solve the scaling bottlenecks that come with centralized teams.
Its key principles are domain-oriented ownership, a self-serve data platform, federated
governance, and data products with clear interfaces and quality assurances.
The primary benefit is organizational scalability and better alignment
with specific business domains.
The primary challenge, however, is that it requires significant organizational change;
it is a socio-technical transformation.
Now let's look at the tool evolution timeline.
We can segment this journey into four eras. In the nineties and two-thousands,
the focus was on traditional ETL tools like IBM DataStage and Informatica,
characterized by visual interfaces and monolithic, batch-oriented designs.
From 2010 to 2015, distributed processing frameworks like Hadoop enabled
horizontal scaling and a code-first approach. Between 2015 and 2020, we saw
the rise of open-source orchestration tools like Airflow and Prefect
alongside early cloud services; these enabled cloud-based processing and serverless
execution. From 2020 onwards, the landscape has been stream-first, with unified
processing and ML integration.
We now see real-time capabilities, declarative interfaces, and specialized
systems for machine learning.
Modern tool categories.
The current landscape has four main categories of tools. Orchestration
frameworks like Apache Airflow and Prefect focus on scheduling, dependency
management, and monitoring, decoupled from the actual data processing.
Cloud-native services such as AWS Glue and Azure Data Factory offer serverless
execution and consumption-based pricing, fitting well with variable workloads.
Streaming platforms like Apache Kafka and Spark Streaming are essential
for low-latency, real-time insight.
And finally, processing engines like Apache Spark and Flink provide the power for complex
distributed computations. Our selection criteria when choosing a tool must
extend beyond features to include team skills, operational requirements,
cost model, and integration needs.
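To illustrate the orchestration category, here is a minimal sketch of an Apache Airflow DAG, assuming Airflow 2.4 or later; the task bodies are stubs, since the orchestrator's job is scheduling, dependencies, and monitoring rather than the heavy processing itself.

```python
# Minimal orchestration sketch with Airflow's TaskFlow API: three dependent
# steps on a daily schedule. The callables are stubs standing in for real work.
from datetime import datetime
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_orders_pipeline():
    @task
    def extract() -> str:
        # In practice this would pull from a source system.
        return "/tmp/orders_raw.json"

    @task
    def transform(raw_path: str) -> str:
        # In practice this would hand off to an engine such as Spark or a warehouse.
        return raw_path.replace("raw", "clean")

    @task
    def load(clean_path: str) -> None:
        print(f"loading {clean_path}")

    # Dependencies are expressed by passing outputs to downstream tasks.
    load(transform(extract()))


daily_orders_pipeline()
```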
Critical design considerations.
Moving to modern distributed systems introduces a new set of
challenges and operational necessities.
We must address five critical design areas: data governance and lineage,
tracking data provenance across complex environments; quality
validation, shifting from periodic checks to continuous assurance;
performance optimization, leveraging distributed techniques like
partitioning; security and compliance,
implementing data-centric protection and access control; and integration
challenges, managing the coexistence of real-time and batch workloads.
Data governance in distributed systems. In distributed, hybrid,
and multi-cloud environments,
maintaining visibility is a significant challenge. Governance has evolved from
a checkbox exercise to an operational activity.
Our solutions must include automated lineage tracking at multiple levels,
from dataset to column and even record level, plus provenance capture
for environments with incomplete instrumentation. Emerging techniques include
inferred lineage, which uses statistical methods to infer transformation relationships.
The business value is clear: rapid impact analysis during changes, simplified
troubleshooting of inconsistencies,
and the necessary audit trail for compliance.
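As a framework-agnostic sketch of what dataset- and column-level lineage capture can look like at a transformation boundary, here is a small example; the event shape is illustrative rather than a specific standard (tools such as OpenLineage define their own formats).

```python
# Minimal lineage-capture sketch: each pipeline step emits a record describing
# its inputs, outputs, and column-level provenance.
import json
from datetime import datetime, timezone


def emit_lineage(job: str, inputs: list[str], outputs: list[str],
                 column_map: dict[str, list[str]]) -> str:
    """Build a dataset- and column-level lineage event for one pipeline step."""
    event = {
        "job": job,
        "event_time": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,              # upstream datasets
        "outputs": outputs,            # downstream datasets
        "column_lineage": column_map,  # output column -> source columns
    }
    return json.dumps(event)


print(emit_lineage(
    job="silver.orders_clean",
    inputs=["bronze.orders"],
    outputs=["silver.orders"],
    column_map={"order_total": ["bronze.orders.amount", "bronze.orders.tax"]},
))
```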
Quality validation frameworks.
Data quality assurance must now be continuous and automated,
not just periodic and manual.
Modern frameworks implement multidimensional validation:
syntactic checks for structure and correctness,
and semantic checks validating contextual appropriateness.
The implementation architecture has evolved toward distributed components that
execute checks at each transformation boundary.
In streaming, this is achieved through specialized operators that implement
validation logic directly within the execution model, validating
continuously within the data flow.
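Here is a minimal sketch of a multidimensional check that could run at a transformation boundary, combining syntactic and semantic rules; the field names and thresholds are hypothetical.

```python
# Minimal validation sketch: syntactic checks (structure, types) plus a
# semantic check (contextual plausibility) applied to each record.
def validate_order(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    violations = []

    # Syntactic: required fields are present and have the expected types.
    for field in ("order_id", "amount", "order_ts"):
        if field not in record:
            violations.append(f"missing field: {field}")
    if "amount" in record and not isinstance(record["amount"], (int, float)):
        violations.append("amount must be numeric")

    # Semantic: contextual plausibility, e.g. a non-negative, bounded amount.
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and not (0 <= amount <= 1_000_000):
        violations.append("amount outside plausible range")

    return violations


print(validate_order({"order_id": "A1", "amount": 42.0,
                      "order_ts": "2024-05-01T10:00:00"}))   # passes: []
print(validate_order({"order_id": "A2", "amount": -5}))      # two violations
```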
Emerging trend: serverless data processing.
Looking forward,
serverless data processing represents a profound paradigm shift.
Key characteristics include the elimination of explicit infrastructure provisioning,
dynamic resource allocation, and a shift to consumption-based pricing.
This has a significant design impact, encouraging architectures made
of smaller, focused processing units with clear boundaries rather than
monolithic transformation jobs.
The benefits are powerful: automatic scaling, significant cost optimization,
and operational simplicity.
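As a sketch of a small, focused processing unit in a serverless style, here is a handler in the shape of an AWS Lambda function triggered by object-storage notifications; the processing body is a stub.

```python
# Minimal serverless sketch: a small handler processes one batch of storage
# events; provisioning and scale-out are handled by the platform.
import json


def handler(event, context):
    """Process object-storage notification events and report what was handled."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # A real unit would read the object, transform it, and write it onward.
        processed.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps({"processed": processed})}
```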
AI and ML integration: data pipelines meet machine learning.
The integration of AI and ML has driven specialized architectural innovation.
Feature stores have emerged as a critical component.
These are specialized systems that manage the lifecycle of machine learning
features, providing versioning and access control and ensuring consistency between
training and serving. Modern serving pipelines integrate real-time inference
directly into the core data pipelines.
A key requirement is point-in-time feature accuracy for valid model
training, along with lineage tracking for models.
The business impact is faster model deployment, consistent feature engineering,
and unified infrastructure supporting both traditional BI and advanced ML workloads.
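To show what point-in-time feature accuracy means in practice, here is a minimal pandas sketch: each training label is joined only to the latest feature value known at or before its timestamp, so nothing leaks from the future. Column names are illustrative.

```python
# Minimal point-in-time join sketch: merge_asof picks, per label row, the most
# recent feature row that is not later than the label's timestamp.
import pandas as pd

features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "ts": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
    "avg_order_value": [20.0, 35.0, 50.0],
}).sort_values("ts")

labels = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "ts": pd.to_datetime(["2024-01-20", "2024-01-20"]),
    "churned": [0, 1],
}).sort_values("ts")

# u1's label gets the 2024-01-01 feature value, not the future 2024-02-01 one.
training_set = pd.merge_asof(labels, features, on="ts", by="user_id",
                             direction="backward")
print(training_set)
```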
Data contracts and schema management:
formal agreements for data exchange in data ecosystems.
Formalized data contracts and schema management are
critical for maintaining stability.
The purpose of these contracts is to establish explicit agreements
between data producers and consumers.
Their components specify data structure, quality characteristics, and delivery patterns.
The main benefit is enhanced reliability in distributed systems,
achieved by setting clear expectations.
They are implemented via versioned schema registries, with sophisticated
compatibility checking that validates proposed changes against historical usage
to prevent breaking downstream consumers.
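Here is a minimal sketch of the compatibility-checking idea behind a data contract, independent of any particular schema registry; the schema representation and rules are illustrative.

```python
# Minimal contract check: a proposed schema change is rejected if it removes
# or retypes fields that downstream consumers rely on.
def is_backward_compatible(current: dict, proposed: dict) -> tuple[bool, list[str]]:
    """Compare a proposed schema against the current contract."""
    problems = []
    for field, field_type in current.items():
        if field not in proposed:
            problems.append(f"field removed: {field}")
        elif proposed[field] != field_type:
            problems.append(f"type changed for {field}: "
                            f"{field_type} -> {proposed[field]}")
    return (not problems, problems)


current_contract = {"order_id": "string", "amount": "double", "currency": "string"}
proposed_schema = {"order_id": "string", "amount": "long"}   # drops currency, retypes amount

ok, issues = is_backward_compatible(current_contract, proposed_schema)
print(ok, issues)   # False, with two reported problems
```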
Migration strategies: practical approaches to modernization.
Each approach differs in suitability, risk, timeline, and key success factors.
Finally, migrating from a legacy architecture is a significant
undertaking, and a wholesale replacement is rarely feasible.
We must adopt incremental strategies, so let me outline a few proven approaches.
Pattern-based modernization identifies the most common processing patterns
in the system and applies a suitable modernization approach to each. Hybrid execution
uses gateway components to maintain a stable interface while
the underlying implementation transitions gradually.
Specialized connectors allow modern processing frameworks to
integrate efficiently with existing data sources through adapter
patterns and format bridges.
Domain-by-domain migration starts with non-critical business domains and
expands progressively. These low-risk, phased approaches
minimize business disruption and allow organizations to
demonstrate value incrementally.
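As a small illustration of the hybrid-execution idea, here is a sketch of a gateway that keeps a stable interface while individual domains are routed to a modern backend as they are migrated; all names are hypothetical.

```python
# Minimal hybrid-execution sketch: consumers always call the gateway, and
# routing per domain changes as the migration progresses.
from typing import Callable


def legacy_query(domain: str) -> list[dict]:
    return [{"source": "legacy_warehouse", "domain": domain}]


def modern_query(domain: str) -> list[dict]:
    return [{"source": "lakehouse", "domain": domain}]


class DataGateway:
    """Stable entry point; backends are swapped domain by domain."""

    def __init__(self) -> None:
        self._routes: dict[str, Callable[[str], list[dict]]] = {}

    def register(self, domain: str, backend: Callable[[str], list[dict]]) -> None:
        self._routes[domain] = backend

    def query(self, domain: str) -> list[dict]:
        # Domains not yet migrated fall back to the legacy path.
        return self._routes.get(domain, legacy_query)(domain)


gateway = DataGateway()
gateway.register("orders", modern_query)   # a migrated domain
print(gateway.query("orders"))             # served by the modern path
print(gateway.query("billing"))            # still served by legacy
```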
Key insights for data leaders.
To summarize our journey,
here are the essential insights for data leaders.
There is no single best architecture; the future lies in thoughtfully
selecting the system that aligns with your specific business context. Adopt incremental
migration to modernize gradually and minimize risk.
Governance is critical: lineage tracking and quality management
are operational necessities
in distributed environments. Real time is becoming the standard,
and streaming is becoming a basic business expectation.
Finally, remember that organizational change is as essential as the technology.
Recommendations: action items for organizations.
For those looking to embark on this modernization journey, I offer
these action items. Assess your current state: inventory your existing architecture
and identify the key pain points.
Define your target state: use architectural patterns like medallion or lakehouse that
align with your strategic business needs.
Start small:
begin your migration with non-critical domains or workloads.
Invest in governance: implement lineage tracking and quality
frameworks early in the process.
Build skills:
develop the necessary cloud-native and stream processing
capabilities within your teams.
And with this, I conclude.
This transformation is ongoing, and by embracing cloud-native principles
and thoughtful design, organizations can build the resilient, adaptable
data pipelines necessary for the future of operational intelligence.
On that note, thank you for giving me this opportunity.
Thank you.