Transcript
Hi everyone.
I'm Rich Solinki.
I'm a manager of Data Management at Ally Bank.
Thank you for joining me today for this presentation.
I will be talking about how we can transform enterprise data management
using lakehouse architecture, a model that combines the best of both
data warehouses and data lakes.
Over the next few minutes, I will walk you through why this approach
matters, what core principles make it work, and how it is delivering
real business impact at scale.
Data explosion: a growing challenge.
As we all know, data is growing at an incredible pace.
By 2025, global data volumes are expected to reach 175 zettabytes, and almost
one third of that will require real-time processing. For enterprises, this
means we are no longer just dealing with relational data in clean rows and
columns; we now handle structured, semi-structured, and unstructured data
coming from countless systems.
These fragmented environments create serious challenges in
governance, security, and operational efficiency, making it harder to
deliver timely and trusted insights.
For years, organizations have relied on two main approaches, data
warehouses and data lakes, and both have their strengths and weaknesses.
On one hand, data warehouses offer reliability,
consistency, and strong governance.
But they are expensive and rigid.
They struggle with scalability and flexibility.
Data lakes, on the other hand, are cheap and highly scalable, but they
often turn into data swamps with little governance, unclear ownership,
and no transactional consistency.
So enterprises have been forced to choose between structure and flexibility,
and that's where the Lakehouse concept really changes the game.
Lakehouse architecture bridges the gap.
It combines the reliability and governance of data warehouses with the scalability
and cost efficiency of data lakes.
With a lakehouse, we can handle diverse workloads, from batch analytics to
real-time streaming and even machine learning, on a single unified platform.
This approach eliminates the trade-off that used to exist.
It allows us to build once, govern centrally, and serve many
different use cases efficiently.
So what are the core capabilities of lakehouse architecture?
There are four major pillars that define the lakehouse model.
First, ACID transactions ensure that we maintain full data integrity even
as we scale to billions of records.
Second, schema enforcement gives us flexibility.
We can support both schema-on-read and schema-on-write models,
adapting to evolving business data.
Third, real-time analytics allows teams to query batch and
streaming data instantly without building complex ETL pipelines.
And finally, unified governance provides a single place for metadata, lineage
tracking, and compliance, something that is critical for enterprise adoption.
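As a rough illustration of the first three pillars, here is a minimal PySpark sketch, assuming a Spark session with Delta Lake configured; the paths and column names are illustrative rather than a specific production setup.

```python
# Minimal sketch: ACID writes, schema enforcement, and unified batch/stream
# access with Delta Lake on Spark. Paths and columns are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-pillars").getOrCreate()

# Semi-structured input landing from an upstream system.
events = spark.read.json("s3://landing/events/")

# ACID transaction: the append either commits fully or is not visible at all.
events.write.format("delta").mode("append").save("s3://lakehouse/raw/events")

# Schema enforcement: writes with unexpected columns fail fast by default;
# schema evolution must be opted into explicitly.
events.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("s3://lakehouse/raw/events")

# Batch and streaming on the same table: a streaming read over the Delta table.
stream = spark.readStream.format("delta").load("s3://lakehouse/raw/events")
```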
Multi-zone structure model.
To manage data systematically, we structure the lakehouse into three
zones, raw, refined, and curated.
The raw zone stores data exactly as it arrives, preserving the original
format for auditability purposes.
The refined zone is where cleansing, validation, and
standardization of data happen.
Business rules are applied here, and the curated zone contains high-quality,
analytics-ready data sets for reporting and machine learning.
So basically, we preserve the data as it comes in within the raw zone.
Then any massaging, transformation, or business rule application is
taken care of as part of the refined zone, and users can use the
curated zone for business analytics.
This layered design ensures traceability and promotes data
trust across the enterprise.
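As a hedged sketch of how the three zones might be wired up with PySpark and Delta Lake; the zone paths, columns, and rules here are assumptions for illustration only.

```python
# Minimal sketch of the raw -> refined -> curated flow described above.
# Paths, column names, and business rules are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Raw zone: land data exactly as received, no transformation.
raw = spark.read.json("s3://landing/payments/")
raw.write.format("delta").mode("append").save("s3://lakehouse/raw/payments")

# Refined zone: cleansing, validation, standardization, business rules.
refined = (
    spark.read.format("delta").load("s3://lakehouse/raw/payments")
    .dropDuplicates(["payment_id"])
    .filter(F.col("amount").isNotNull())
    .withColumn("amount_usd", F.col("amount").cast("decimal(18,2)"))
)
refined.write.format("delta").mode("overwrite").save("s3://lakehouse/refined/payments")

# Curated zone: analytics-ready aggregates for reporting and ML.
curated = refined.groupBy("customer_id").agg(F.sum("amount_usd").alias("total_spend"))
curated.write.format("delta").mode("overwrite").save("s3://lakehouse/curated/customer_spend")
```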
So within the lakehouse, we often use data vault modeling because it
provides both flexibility and historical traceability, both of which are
very important in the modern world.
It is designed for change, so we can evolve the schema without breaking
downstream systems. In this model,
hubs capture the core business entities and their unique identifiers,
links represent the relationships between those entities, and satellites
store descriptive attributes and full historical context.
So this structure allows us to track how data changes over time, which is
key for governance and audit compliance.
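For illustration, a minimal sketch of hub, link, and satellite structures as Delta tables; the table and column names are hypothetical, not the actual model described here.

```python
# Minimal sketch of data vault structures (hub, link, satellite) as Delta tables.
# Table and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hub: core business entity and its unique business key.
spark.sql("""
  CREATE TABLE IF NOT EXISTS hub_customer (
    customer_hk STRING,      -- hash key
    customer_id STRING,      -- business key
    load_ts TIMESTAMP,
    record_source STRING
  ) USING DELTA
""")

# Link: relationship between two hubs (for example, customer and account).
spark.sql("""
  CREATE TABLE IF NOT EXISTS link_customer_account (
    link_hk STRING,
    customer_hk STRING,
    account_hk STRING,
    load_ts TIMESTAMP,
    record_source STRING
  ) USING DELTA
""")

# Satellite: descriptive attributes with full history (a new row per change).
spark.sql("""
  CREATE TABLE IF NOT EXISTS sat_customer_details (
    customer_hk STRING,
    load_ts TIMESTAMP,
    name STRING,
    address STRING,
    record_source STRING
  ) USING DELTA
""")
```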
Scalability means nothing if performance does not keep up.
So optimization is crucial.
We use columnar file formats like Parquet and ORC, which compress efficiently
and read only the necessary columns during a query.
We implement intelligent caching for frequently accessed datasets,
cutting down response time for repeated queries.
And through dynamic partitioning, we organize data by time, geography,
or other dimensions, allowing faster pruning and parallel processing.
Together, these techniques keep performance high while maintaining cost efficiency.
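A small PySpark sketch of those three optimizations, with illustrative paths and partition columns assumed for the example.

```python
# Minimal sketch: columnar storage, partitioning for pruning, and caching.
# Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("s3://lakehouse/refined/transactions")

# Columnar format plus partitioning by time and geography for partition pruning.
(df.withColumn("txn_date", F.to_date("txn_ts"))
   .write.partitionBy("txn_date", "region")
   .parquet("s3://lakehouse/curated/transactions_parquet"))

# Queries that filter on the partition columns read only the matching files.
recent = spark.read.parquet("s3://lakehouse/curated/transactions_parquet") \
    .filter("txn_date >= '2024-01-01' AND region = 'US'")

# Cache a frequently accessed result to cut response time for repeated queries.
recent.cache()
recent.count()   # materializes the cache
```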
Traditional architectures depend on multiple ETL pipelines.
Data is extracted, transformed, and loaded across several systems.
This creates latency, complexity, and multiple points of failure.
In a lakehouse, we adopt an ELT approach instead: extract and load first,
then transform directly in place.
This means fewer copies, less maintenance, and faster time to insight.
So by reducing redundant data movement, teams can focus more on analysis
and less on fixing the pipelines.
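A minimal ELT sketch along these lines, assuming Spark with Delta Lake and hypothetical database and table names.

```python
# Minimal sketch of ELT: load raw data first, then transform in place with SQL,
# instead of moving data through separate ETL systems. Names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Extract + Load: land the source data as-is into a raw table.
spark.read.json("s3://landing/orders/") \
    .write.format("delta").mode("append").saveAsTable("raw.orders")

# Transform: run the transformation directly where the data lives.
spark.sql("""
  CREATE OR REPLACE TABLE refined.orders USING DELTA AS
  SELECT order_id,
         CAST(order_ts AS TIMESTAMP)   AS order_ts,
         UPPER(status)                 AS status,
         CAST(amount AS DECIMAL(18,2)) AS amount
  FROM raw.orders
  WHERE order_id IS NOT NULL
""")
```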
Let's talk about automated governance practices. As we scale,
governance can't be manual.
It has to be automated.
We use data quality automation to validate incoming data and flag
anomalies before they affect analytics.
We maintain complete lineage and audit trails to track how data moves
and transforms across the system,
which simplifies compliance and troubleshooting.
And with role-based access control and encryption, we ensure that sensitive
data remains protected while still being accessible to authorized users.
This proactive governance makes the system both secure and self-healing.
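As a rough sketch of automated data quality validation, the specific rules and thresholds below are illustrative assumptions rather than an actual rule set.

```python
# Minimal sketch: validate incoming data and flag anomalies before they
# reach analytics. Rules and table paths are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
incoming = spark.read.format("delta").load("s3://lakehouse/raw/payments")

checks = {
    "null_payment_id": incoming.filter(F.col("payment_id").isNull()).count(),
    "negative_amount": incoming.filter(F.col("amount") < 0).count(),
    "duplicate_ids": incoming.count() - incoming.dropDuplicates(["payment_id"]).count(),
}

failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    # Quarantine or alert instead of letting bad records reach analytics.
    raise ValueError(f"Data quality checks failed: {failed}")
```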
When organizations implement the lakehouse model, the results are significant.
Maintenance effort drops by 60% because fewer pipelines and tools are needed.
ETL workflows run around 45% faster thanks to in-place transformation,
and overall cost of ownership goes down by roughly 40%.
Beyond the numbers, the biggest gain is agility.
Teams can deliver insights faster and respond to business
needs in near real time.
So how does this all come together in a real world environment?
The architecture usually has four key layers.
First, a data ingestion layer brings in both batch and streaming
data from different sources.
Second, storage and compute are decoupled, allowing each to scale
independently based on workload.
Third, a governance and catalog layer maintains metadata, access policies,
and data lineage.
And finally, the analytics and consumption layer provides flexibility
for users, from BI tools to data science notebooks.
This end-to-end design gives both control and freedom to data teams, and
we can use data as a strategic asset.
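Purely as an illustration of how those responsibilities separate, the four layers could be sketched as a simple configuration map; every name and value below is a hypothetical placeholder.

```python
# Illustrative only: the four layers of the reference architecture as a plain
# Python configuration map. All names and values are hypothetical placeholders.
lakehouse_layers = {
    "ingestion": {
        "batch_sources": ["core_banking_extracts", "erp_exports"],
        "streaming_sources": ["payments_events_topic"],
    },
    "storage_and_compute": {
        "storage": "s3://lakehouse/",            # scales independently of compute
        "compute": "autoscaling_spark_cluster",  # scales with workload
    },
    "governance_and_catalog": {
        "catalog": "central_metastore",
        "policies": ["role_based_access", "encryption", "lineage_capture"],
    },
    "analytics_and_consumption": {
        "consumers": ["bi_dashboards", "sql_endpoints", "data_science_notebooks"],
    },
}
```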
So in this slide, we'll talk about implementation considerations.
For those who are starting their lakehouse journey,
here are a few practical lessons.
First, choose platforms that support open table formats like Delta Lake,
Apache Iceberg, or Hudi.
This prevents vendor lock-in and future-proofs the design.
Second, adopt a phased migration strategy: begin with non-critical workloads
to build expertise and confidence.
And third, invest in team training.
Success requires engineers and analysts to understand both
warehouse and lake principles.
It is as much about people and process as it is about technology.
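As a small sketch of the open-table-format point, here is the same dataset written to Delta Lake and Apache Iceberg, assuming a Spark session with both integrations configured; the catalog, database, and table names are hypothetical.

```python
# Illustrative sketch: writing the same dataset to two open table formats.
# Assumes Delta Lake and an Iceberg catalog are configured on the session;
# all paths and names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = spark.read.parquet("s3://landing/customers/")

# Delta Lake table at a managed path.
customers.write.format("delta").mode("overwrite").save("s3://lakehouse/refined/customers")

# Apache Iceberg table, via the DataFrameWriterV2 API.
customers.writeTo("iceberg_catalog.db.customers").using("iceberg").createOrReplace()
```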
So to wrap up, the lakehouse architecture eliminates the longstanding trade-off
between data warehouses and data lakes.
It provides the governance and reliability enterprises need, while offering the
flexibility and scalability required for modern analytics.
By simplifying pipelines through multi-zone storage and ELT workflows,
we achieve faster insights and reduced operational costs.
So ultimately, it is about delivering measurable business value:
better performance, lower cost, and trusted data at scale.
Thank you for listening to this presentation.
Please reach out to me if you have any questions.
Thank you very much.