Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Ishi.
I am going to talk about how Rust is powering Modern Data Lakes.
I'll start with a quick introduction.
I work in the data architecture and data platform space.
I've been in the industry for around 20 years, working with large-scale financial services companies
to build modern data platforms
and near-real-time analytical and data-driven applications.
With that, let's start the conversation.
So first we are going to look at the landscape
of enterprise data architecture, which has undergone
a fundamental transformation over the last 15 years or so.
I have laid out three different phases here.
Phase one is the on-premise Hadoop era.
In the early 2000s, when Hadoop was launched, it had two primary components,
HDFS and MapReduce, which dominated large-scale data processing
with JVM-based technologies built primarily to handle massive datasets.
It was a big hit, but it had limitations in scaling
and in meeting performance needs.
Some of those limitations were related to garbage collection overheads
and memory management issues.
Later, in the mid-2010s, cloud data warehouses
were launched. They primarily targeted the limitations
of Hadoop: meeting performance criteria, solving usability problems,
and supporting analytical and SQL-friendly applications.
Snowflake and Redshift are some of the examples there.
And then, more recently, the lakehouse has been trying to converge data
lakes and cloud data warehouses to give you the best of both.
That is achieved using open formats,
unified processing, and separation of storage and compute.
Rust has emerged as a critical technology for modern lakehouses.
We are going to deep dive into these three phases in the
rest of the conversation.
So why Rust for data infrastructure? Rust has a unique value proposition.
Memory safety without garbage collection: no GC pauses interrupt
critical large-scale data processing, especially jobs that run for a long time.
The type system catches data races at compile time.
It also enables parallelism for effective utilization
of multicore compute infrastructure.
And it gives you predictable resource usage with
deterministic memory management.
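To make the data-race point concrete, here is a minimal sketch of my own (not from the talk, names are illustrative) of how the compiler rejects unsynchronized shared mutation, and what the compiling alternative looks like:

```rust
// Sketch: "data races caught at compile time" in practice.
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // This would NOT compile: two threads mutating `total` directly is
    // rejected by the borrow checker before the program ever runs.
    //
    // let mut total = 0u64;
    // thread::spawn(|| total += 1); // error: closure may outlive the
    //                               // function and mutably borrows `total`

    // The compiling version makes shared ownership and locking explicit,
    // so the race is impossible by construction.
    let total = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for _ in 0..4 {
        let total = Arc::clone(&total);
        handles.push(thread::spawn(move || {
            *total.lock().unwrap() += 1;
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("total = {}", total.lock().unwrap());
}
```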
So let's see what the phase one limitations were.
First is JVM-based tasks.
Hadoop primarily ran map tasks and reduce tasks, and
Hadoop was basically JVM-based execution.
Each task needed its own heap space, and object headers added
12 to 16 bytes of overhead per object, for instance.
As you scaled, it created many map and reduce tasks.
They had to exchange data, which required serializing and deserializing it
before and after the exchange. That added serialization and deserialization
cost to the data movement between JVMs, which was very
significant in terms of CPU and memory needs.
And the third is garbage collection pauses. Because it's JVM-based, it had
unpredictable GC pauses, sometimes a few seconds long, causing a lot of inconsistency
in processing times and resource utilization.
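As a small illustration of the per-object overhead point (my own sketch, with a made-up record type): a Rust struct occupies exactly its fields, with no object header or GC metadata.

```rust
// Hypothetical record type; a Rust struct carries no per-object header.
#[allow(dead_code)]
struct ClickEvent {
    user_id: u32,
    item_id: u32,
    timestamp: u64,
}

fn main() {
    // Prints "16 bytes": just the fields (4 + 4 + 8), whereas a JVM object
    // would add a 12-16 byte header on top of its fields.
    println!("{} bytes", std::mem::size_of::<ClickEvent>());
}
```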
So how has Rust created an alternative to MapReduce-style frameworks?
This chart shows the performance comparison for word count,
which is one of the popular benchmarks in the large-scale
data processing world.
As you can see, DataFusion, which is based on Rust, finished the
word count on the large dataset in just eight seconds.
It worked through all the records and was super fast compared to
Hadoop or Spark.
So Rust-based implementations of these frameworks basically
eliminate GC pauses while providing more predictable resource utilization.
We are going to see more on this in the following slides.
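The benchmark code itself isn't shown on the slide, but a minimal word count in plain Rust with rayon (my own sketch; the file path is hypothetical) looks roughly like this:

```rust
// Parallel word count sketch: count words per line in parallel with rayon,
// then merge the partial maps. No GC is involved at any point.
use rayon::prelude::*;
use std::collections::HashMap;
use std::fs;

fn main() {
    let text = fs::read_to_string("data/words.txt").expect("read input");

    let counts: HashMap<String, u64> = text
        .par_lines()
        .map(|line| {
            let mut local = HashMap::new();
            for word in line.split_whitespace() {
                *local.entry(word.to_lowercase()).or_insert(0u64) += 1;
            }
            local
        })
        .reduce(HashMap::new, |mut a, b| {
            // Merge the per-line maps into one result.
            for (word, n) in b {
                *a.entry(word).or_insert(0) += n;
            }
            a
        });

    println!("distinct words: {}", counts.len());
}
```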
Phase two was the cloud data warehouse.
We talked about it; some of the examples are Snowflake
and Redshift.
The primary limitation there is that, historically, in order to use these cloud
data warehouses you needed to load your data into them.
They store data in proprietary formats: Snowflake stores data in PAX,
its hybrid columnar storage format,
and Redshift likewise stores data in its own proprietary format,
which created vendor lock-in for enterprises.
If you load data into these platforms, it creates a lock-in for you,
because there is an egress cost to take the data out later,
and you pay for compute to load the data and compute to query it.
So that was a big limitation there.
And the pricing model was pay-per-compute, which incentivized optimization,
but it limited your control over execution if you wanted to scale
just storage or just compute:
storage and compute used to be tightly coupled.
Additionally, there was very limited ML support.
These platforms were built primarily for OLAP use
cases, not machine learning use cases.
If you wanted to use the data for ML, you had to take it out and use it
in some other platform, which created silos and duplicate data storage.
So with that, let's look at the frameworks and components
that have recently emerged, which are powered by Rust.
First, Apache Arrow, which you may have come across, is a columnar in-memory standard.
Its Rust-based implementation provides zero-copy access to columnar
data with a predictable memory layout.
It enables efficient processing across different languages
with no serialization costs.
DataFusion is another project, powered by Rust, that
enables performant SQL capabilities.
It is Rust-native and provides both SQL and DataFrame APIs,
along with physical query optimization; a quick sketch of its SQL API follows below.
And the third one is efficient storage:
Lance and parquet-rs.
These are also Rust-native implementations of columnar storage
formats, delivering performance improvements of three to ten x.
Based on some recent benchmarks, and in general as you can see here,
memory usage for Rust-based query engines is typically 30 to 70%
less than their JVM counterparts while processing the same dataset.
That's pretty significant, powerful, and game-changing
in terms of efficient infrastructure and
data processing applications.
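To make the DataFusion part concrete, here is a minimal sketch of my own, with a hypothetical Parquet file and column names, using its SQL API:

```rust
// Minimal DataFusion sketch: register a Parquet file and query it with SQL.
// Requires the `datafusion` and `tokio` crates; path and columns are made up.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("events", "data/events.parquet", ParquetReadOptions::default())
        .await?;

    let df = ctx
        .sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id ORDER BY cnt DESC LIMIT 10")
        .await?;

    // Results stay in Arrow's columnar format end to end; no row-by-row
    // serialization between stages.
    df.show().await?;
    Ok(())
}
```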
So let's look at phase three, which is where we are today:
the modern lakehouse architecture, the lakehouse paradigm.
As I said, it gives you the best of both
the data warehouse and the data lake, and it addresses the limitations of
both the Hadoop era and the cloud data warehouse era.
It does that by using open storage formats.
Some of those are Delta Lake, Iceberg, and Hudi, which provide
table formats that are ACID compliant.
You can do ACID transactions, you can change the schema,
and the schema can evolve over time.
You can also do time travel on scalable object storage.
It basically decouples compute and storage.
It gives you the ability to have a single processing layer
that serves both SQL users and machine learning users, which
eliminates the need for copies of the data.
The architecture is inherently scalable because you are decoupling your
storage and compute layers;
separating them allows independent scaling of resources as demand changes.
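As a small illustration of table formats and time travel from Rust (my own sketch, assuming the delta-rs `deltalake` crate and a made-up table path; the exact API surface shifts between releases):

```rust
// Sketch: open a Delta table at its latest version and at an older version
// (time travel), with no Spark or JVM involved. Path is hypothetical.
#[tokio::main]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    let uri = "./data/tables/orders";

    // Latest snapshot of the table.
    let latest = deltalake::open_table(uri).await?;
    println!("latest version: {:?}", latest.version());

    // Time travel: the same table as of an earlier committed version.
    let v3 = deltalake::open_table_with_version(uri, 3).await?;
    println!("time-traveled version: {:?}", v3.version());

    Ok(())
}
```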
So what is Rust's role in lakehouse architecture,
or lakehouse infrastructure?
The key components that are powered by Rust today are, first, table formats.
Libraries such as delta-rs and iceberg-rust let you interact with
Delta tables or Iceberg tables
without going through Spark.
So you can interact with these table formats without going through Spark
compute, which is JVM-based, and they give you native ACID transaction
support with minimal overhead.
Right?
Second, there are columnar storage format implementations that
are Rust-based, which we saw earlier:
Arrow, parquet-rs, and Lance.
These allow predictable memory usage and zero-copy data access.
In terms of query processing, there is DataFusion, which has emerged as
a very promising project recently, and it allows parallel execution
with a minimal resource footprint.
And last are the DataFrame libraries, Polars and Arrow,
which have shown 10 to 100x speedups over pandas based on some of the
recent benchmarks, with less memory.
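Here is a small Polars sketch of my own (the data and column names are invented, and method names shift a little between Polars releases) showing the lazy DataFrame API behind those numbers:

```rust
// Sketch of the Polars lazy DataFrame API (assumes the `polars` crate with
// its "lazy" feature enabled). Data and column names are illustrative only.
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = df![
        "user"   => &["a", "b", "a", "c", "b"],
        "amount" => &[10.0, 20.0, 5.0, 7.5, 2.5],
    ]?;

    // Lazy plan: group, aggregate, then collect. Polars optimizes the plan
    // and executes it in parallel across cores, with no GC pauses.
    let totals = df
        .lazy()
        .group_by([col("user")])
        .agg([col("amount").sum().alias("total")])
        .collect()?;

    println!("{totals}");
    Ok(())
}
```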
So these are some of the powerful results that Rust has shown in
benchmarking, which is going to radically change how modern lakehouse
architecture and infrastructure are built.
And as you can see, some of the performance benchmarks for
Rust versus traditional technologies: 10x better query performance
using DataFusion compared to Apache Spark on the same dataset
and hardware configuration.
85% memory reduction for Polars DataFrames compared to pandas
while processing a hundred-GB dataset.
30x faster data loading:
if you are loading Parquet data into an application,
arrow-rs loads it 30x faster than PyArrow for terabyte-scale datasets,
which really matters for large-scale and enterprise-scale analytical processing.
And 99.9% reliability, because you are not going to get out-of-memory errors
and there is no garbage collection needed
for long-running data processing jobs, which is one of the limitations
of JVM-based execution.
So with that, I'm going to move to the next slide.
What are some of the practical implementation patterns?
When should I use Rust, how should I optimize if I'm already using it,
and where should I start if I'm not using it yet?
There are three integration approaches.
One is microservices components.
You can identify critical data processing applications
in your overall architecture, do a Rust-based
implementation as services, and build the interfaces around them, so
you have highly performant, efficient applications as services
that can be integrated with other components in the architecture.
Second, extension libraries: you can create native
extensions for Python or the JVM
as the core processing layer; a small sketch of this follows below.
And if you want, you can even replace your entire system or architecture for data
pipelines and data processing with a Rust alternative, for the maximum benefit.
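For the extension-library approach, here is a tiny sketch of my own of a native Python extension in Rust using PyO3 (the module and function names are invented, and PyO3's module signature changes between versions):

```rust
// Sketch of a native Python extension written in Rust with PyO3.
// Typically built with maturin and imported from Python as `import fast_ops`.
use pyo3::prelude::*;

/// Hypothetical hot-path numeric kernel that would otherwise be pure Python.
#[pyfunction]
fn sum_of_squares(values: Vec<f64>) -> f64 {
    values.iter().map(|v| v * v).sum()
}

#[pymodule]
fn fast_ops(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_of_squares, m)?)?;
    Ok(())
}
```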
Some of the common use cases where Rust really shines are high-throughput
data ingestion pipelines, processing millions of
records or events per second;
large-scale data processing for feature engineering for ML training,
to train your models and identify features, which needs a lot of data processing;
and, of course, interactive SQL query engines requiring sub-second response times,
another use case where a Rust-based application would really shine.
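To make the ingestion use case a little more concrete, here is a bare-bones sketch of my own (the event type and stages are invented) of a two-stage pipeline on standard-library threads and channels:

```rust
// Sketch of a two-stage ingestion pipeline: a producer thread feeds events
// through a channel to a consumer that aggregates them.
use std::sync::mpsc;
use std::thread;

struct Event {
    bytes: u64, // a real event would carry many more fields
}

fn main() {
    let (tx, rx) = mpsc::channel::<Event>();

    // Stage 1: ingest. In a real pipeline this would read from Kafka,
    // sockets, files, and so on.
    let producer = thread::spawn(move || {
        for _ in 0..1_000_000u32 {
            tx.send(Event { bytes: 512 }).unwrap();
        }
        // Dropping `tx` closes the channel and ends the consumer loop.
    });

    // Stage 2: aggregate. No GC pauses interrupt this loop; memory use is
    // bounded by what we explicitly hold on to.
    let consumer = thread::spawn(move || {
        let mut total_bytes = 0u64;
        for event in rx {
            total_bytes += event.bytes;
        }
        total_bytes
    });

    producer.join().unwrap();
    println!("ingested {} bytes", consumer.join().unwrap());
}
```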
Most organizations begin with targeted replacement
of components that have already been identified as bottlenecks,
and then start rewriting or replacing those components or
services with Rust-based applications,
rather than doing a complete rewrite. At some point you
might want to replace the entire application; it depends on what you are
doing in your data processing and how many bottlenecks you have.
So with that, let's talk about some of the key
takeaways and the next steps.
Rust addresses critical performance limitations.
Just to recap: it does not need garbage collection,
and its predictable performance and efficient resource utilization make Rust
ideal for data infrastructure.
The Rust ecosystem is maturing rapidly, with Arrow, DataFusion,
Delta Lake, and Polars
libraries and many other Rust-based implementations.
The whole landscape is production-ready and evolving
rapidly, enabling high-performing data platforms for organizations.
I recommend you begin with an isolated component analysis and see which
interfaces can be replaced, enhanced, or optimized using Rust.
Focus on the bottlenecks first and then scale out to the other components as well,
so you can really use Rust's performance characteristics for building
modern lakehouse data platforms and future-proofing your architecture.
So with that, I recommend you evaluate your current data platform
architecture and identify the components that would benefit most from Rust's
performance and reliability advantages.
And that's it from me.
I hope this was helpful, and that this information is useful for
you and your data platform modernization initiatives.
Thank you for giving me this opportunity, and thank you for
listening to my presentation.
Bye.