Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Ishi.
I am going to talk about how Rust is powering Modern Data Lakes.
I'll start with a quick introduction.
I work in the data architecture and data platform space.
I've been in the industry for around 20 years, working with large-scale financial services companies
to build modern data platforms
and near-real-time analytical and data-driven applications.
With that, let's start the conversation.
So first we are going to look at the landscape
of enterprise data architecture, which has undergone
a fundamental transformation over the last 15 years or so.
I have laid out three different phases here.
Phase one is the on-premise Hadoop era.
In the early 2000s, when Hadoop was launched, it had two primary components,
HDFS and MapReduce, which dominated large-scale data processing
with JVM-based technologies built primarily to handle massive datasets.
It was a big hit, but it had limitations in scaling
and in meeting performance needs.
Some of those limitations were related to garbage collection overheads
and memory management issues.
Later, in the mid-2010s, cloud data warehouses
were launched. They primarily targeted the limitations
of Hadoop: meeting performance criteria, solving usability problems,
and supporting analytical and SQL-friendly applications.
Snowflake and Redshift are some of the examples there.
And then, more recently, the lakehouse has been trying to converge data
lakes and cloud data warehouses to give you the best of both.
That is achieved using open formats,
unified processing, and separation of storage and compute.
Rust has emerged as a critical technology for modern lakehouses.
We are going to deep dive into these three phases in the
rest of the conversation.
So why Rust for data infrastructure? Rust has a unique value proposition.
Memory safety without garbage collection: no GC pauses interrupt
critical large-scale data processing, especially jobs that run for a long time.
The type system catches data races at compile time.
It also enables parallelism for effective utilization
of multicore compute infrastructure.
And it gives you predictable resource usage with
deterministic memory management.
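To make the data-race point concrete, here is a minimal sketch of my own (not from the talk, names are illustrative) of how the compiler rejects unsynchronized shared mutation, and what the compiling alternative looks like:

```rust
// Sketch: "data races caught at compile time" in practice.
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // This would NOT compile: two threads mutating `total` directly is
    // rejected by the borrow checker before the program ever runs.
    //
    // let mut total = 0u64;
    // thread::spawn(|| total += 1); // error: closure may outlive the
    //                               // function and mutably borrows `total`

    // The compiling version makes shared ownership and locking explicit,
    // so the race is impossible by construction.
    let total = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for _ in 0..4 {
        let total = Arc::clone(&total);
        handles.push(thread::spawn(move || {
            *total.lock().unwrap() += 1;
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("total = {}", total.lock().unwrap());
}
```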
So let's see what the phase one limitations were.
First is JVM-based tasks.
Hadoop primarily ran map tasks and reduce tasks, and
Hadoop was basically JVM-based execution.
Each task needed its own heap space, and object headers added
12 to 16 bytes of overhead per object, for instance.
As you scaled, it created many map and reduce tasks.
They had to exchange data, which required serializing and deserializing it
before and after the exchange. That added serialization and deserialization
cost to the data movement between JVMs, which was very
significant in terms of CPU and memory needs.
And the third is garbage collection pauses. Because it's JVM-based, it had
unpredictable GC pauses, sometimes a few seconds long, causing a lot of inconsistency
in processing times and resource utilization.
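As a small illustration of the per-object overhead point (my own sketch, with a made-up record type): a Rust struct occupies exactly its fields, with no object header or GC metadata.

```rust
// Hypothetical record type; a Rust struct carries no per-object header.
#[allow(dead_code)]
struct ClickEvent {
    user_id: u32,
    item_id: u32,
    timestamp: u64,
}

fn main() {
    // Prints "16 bytes": just the fields (4 + 4 + 8), whereas a JVM object
    // would add a 12-16 byte header on top of its fields.
    println!("{} bytes", std::mem::size_of::<ClickEvent>());
}
```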
So how has Rust created an alternative to MapReduce-style frameworks?
This chart shows the performance comparison for word count,
which is one of the popular benchmarks in the large-scale
data processing world.
As you can see, DataFusion, which is based on Rust, finished the
word count on the large dataset in just eight seconds.
It worked through all the records and was super fast compared to
Hadoop or Spark.
So Rust-based implementations of these frameworks basically
eliminate GC pauses while providing more predictable resource utilization.
We are going to see more on this in the following slides.
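The benchmark code itself isn't shown on the slide, but a minimal word count in plain Rust with rayon (my own sketch; the file path is hypothetical) looks roughly like this:

```rust
// Parallel word count sketch: count words per line in parallel with rayon,
// then merge the partial maps. No GC is involved at any point.
use rayon::prelude::*;
use std::collections::HashMap;
use std::fs;

fn main() {
    let text = fs::read_to_string("data/words.txt").expect("read input");

    let counts: HashMap<String, u64> = text
        .par_lines()
        .map(|line| {
            let mut local = HashMap::new();
            for word in line.split_whitespace() {
                *local.entry(word.to_lowercase()).or_insert(0u64) += 1;
            }
            local
        })
        .reduce(HashMap::new, |mut a, b| {
            // Merge the per-line maps into one result.
            for (word, n) in b {
                *a.entry(word).or_insert(0) += n;
            }
            a
        });

    println!("distinct words: {}", counts.len());
}
```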
Phase two was the cloud data warehouse.
We talked about it; some of the examples are Snowflake
and Redshift.
The primary limitation there is that, historically, in order to use these cloud
data warehouses you needed to load your data into them.
They store data in proprietary formats: Snowflake stores data in PAX,
its hybrid columnar storage format,
and Redshift likewise stores data in its own proprietary format,
which created vendor lock-in for enterprises.
If you load data into these platforms, it creates a lock-in for you,
because there is an egress cost to take the data out later,
and you pay for compute to load the data and compute to query it.
So that was a big limitation there.
And the pricing model was pay-per-compute, which incentivized optimization,
but it limited your control over execution if you wanted to scale
just storage or just compute:
storage and compute used to be tightly coupled.
Additionally, there was very limited ML support.
These platforms were built primarily for OLAP use
cases, not machine learning use cases.
If you wanted to use the data for ML, you had to take it out and use it
in some other platform, which created silos and duplicate data storage.
So with that, let's look at the frameworks and components
that have recently emerged, which are powered by Rust.
First, Apache Arrow, which you may have come across, is a columnar in-memory standard.
Its Rust-based implementation provides zero-copy access to columnar
data with a predictable memory layout.
It enables efficient processing across different languages
with no serialization costs.
DataFusion is another project, powered by Rust, that
enables performant SQL capabilities.
It is Rust-native and provides both SQL and DataFrame APIs,
along with physical query optimization; a quick sketch of its SQL API follows below.
And the third one is efficient storage:
Lance and parquet-rs.
These are also Rust-native implementations of columnar storage
formats, delivering performance improvements of three to ten x.
Based on some recent benchmarks, and in general as you can see here,
memory usage for Rust-based query engines is typically 30 to 70%
less than their JVM counterparts while processing the same dataset.
That's pretty significant, powerful, and game-changing
in terms of efficient infrastructure and
data processing applications.
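To make the DataFusion part concrete, here is a minimal sketch of my own, with a hypothetical Parquet file and column names, using its SQL API:

```rust
// Minimal DataFusion sketch: register a Parquet file and query it with SQL.
// Requires the `datafusion` and `tokio` crates; path and columns are made up.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("events", "data/events.parquet", ParquetReadOptions::default())
        .await?;

    let df = ctx
        .sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id ORDER BY cnt DESC LIMIT 10")
        .await?;

    // Results stay in Arrow's columnar format end to end; no row-by-row
    // serialization between stages.
    df.show().await?;
    Ok(())
}
```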
So let's look at phase three, which is where we are today:
the modern lakehouse architecture, the lakehouse paradigm.
As I said, it gives you the best of both
the data warehouse and the data lake, and it addresses the limitations of
both the Hadoop era and the cloud data warehouse era.
It does that by using open storage formats.
Some of those are Delta Lake, Iceberg, and Hudi, which provide
table formats that are ACID compliant.
You can do ACID transactions, you can change the schema,
and the schema can evolve over time.
You can also do time travel on scalable object storage.
It basically decouples compute and storage.
It gives you the ability to have a single processing layer
that serves both SQL users and machine learning users, which
eliminates the need for copies of the data.
The architecture is inherently scalable because you are decoupling your
storage and compute layers;
separating them allows independent scaling of resources as demand changes.
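As a small illustration of table formats and time travel from Rust (my own sketch, assuming the delta-rs `deltalake` crate and a made-up table path; the exact API surface shifts between releases):

```rust
// Sketch: open a Delta table at its latest version and at an older version
// (time travel), with no Spark or JVM involved. Path is hypothetical.
#[tokio::main]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    let uri = "./data/tables/orders";

    // Latest snapshot of the table.
    let latest = deltalake::open_table(uri).await?;
    println!("latest version: {:?}", latest.version());

    // Time travel: the same table as of an earlier committed version.
    let v3 = deltalake::open_table_with_version(uri, 3).await?;
    println!("time-traveled version: {:?}", v3.version());

    Ok(())
}
```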
So what is Rust's role in lakehouse architecture,
or lakehouse infrastructure?
The key components that are powered by Rust today are, first, table formats.
Libraries such as delta-rs and iceberg-rust let you interact with
Delta tables or Iceberg tables
without going through Spark.
So you can interact with these table formats without going through Spark
compute, which is JVM-based, and they give you native ACID transaction
support with minimal overhead.
Right?
Second, there are columnar storage format implementations that
are Rust-based, which we saw earlier:
Arrow, parquet-rs, and Lance.
These allow predictable memory usage and zero-copy data access.
In terms of query processing, there is DataFusion, which has emerged as
a very promising project recently, and it allows parallel execution
with a minimal resource footprint.
And last are the DataFrame libraries, Polars and Arrow,
which have shown 10 to 100x speedups over pandas based on some of the
recent benchmarks, with less memory.
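Here is a small Polars sketch of my own (the data and column names are invented, and method names shift a little between Polars releases) showing the lazy DataFrame API behind those numbers:

```rust
// Sketch of the Polars lazy DataFrame API (assumes the `polars` crate with
// its "lazy" feature enabled). Data and column names are illustrative only.
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = df![
        "user"   => &["a", "b", "a", "c", "b"],
        "amount" => &[10.0, 20.0, 5.0, 7.5, 2.5],
    ]?;

    // Lazy plan: group, aggregate, then collect. Polars optimizes the plan
    // and executes it in parallel across cores, with no GC pauses.
    let totals = df
        .lazy()
        .group_by([col("user")])
        .agg([col("amount").sum().alias("total")])
        .collect()?;

    println!("{totals}");
    Ok(())
}
```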
So these are some of the powerful results that Rust has shown in
benchmarking, which is going to radically change how modern lakehouse
architecture and infrastructure are built.
And as you can see, some of the performance benchmarks for
Rust versus traditional technologies: 10x better query performance
using DataFusion compared to Apache Spark on the same dataset
and hardware configuration.
85% memory reduction for Polars DataFrames compared to pandas
while processing a hundred-GB dataset.
30x faster data loading:
if you are loading Parquet data into an application,
arrow-rs loads it 30x faster than PyArrow for terabyte-scale datasets,
which really matters for large-scale and enterprise-scale analytical processing.
And 99.9% reliability, because you are not going to get out-of-memory errors
and there is no garbage collection needed
for long-running data processing jobs, which is one of the limitations
of JVM-based execution.
So with that, I'm going to move to the next slide.
What are some of the practical implementation patterns?
When should I use Rust, how should I optimize if I'm already using it,
and where should I start if I'm not using it yet?
There are three integration approaches.
One is microservices components.
You can identify critical data processing applications
in your overall architecture, do a Rust-based
implementation as services, and build the interfaces around them, so
you have highly performant, efficient applications as services
that can be integrated with other components in the architecture.
Second, extension libraries: you can create native
extensions for Python or the JVM
as the core processing layer; a small sketch of this follows below.
And if you want, you can even replace your entire system or architecture for data
pipelines and data processing with a Rust alternative, for the maximum benefit.
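For the extension-library approach, here is a tiny sketch of my own of a native Python extension in Rust using PyO3 (the module and function names are invented, and PyO3's module signature changes between versions):

```rust
// Sketch of a native Python extension written in Rust with PyO3.
// Typically built with maturin and imported from Python as `import fast_ops`.
use pyo3::prelude::*;

/// Hypothetical hot-path numeric kernel that would otherwise be pure Python.
#[pyfunction]
fn sum_of_squares(values: Vec<f64>) -> f64 {
    values.iter().map(|v| v * v).sum()
}

#[pymodule]
fn fast_ops(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_of_squares, m)?)?;
    Ok(())
}
```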
Some of the common use cases where Rust really shines are high-throughput
data ingestion pipelines, processing millions of
records or events per second;
large-scale data processing for feature engineering for ML training,
to train your models and identify features, which needs a lot of data processing;
and, of course, interactive SQL query engines requiring sub-second response times,
another use case where a Rust-based application would really shine.
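To make the ingestion use case a little more concrete, here is a bare-bones sketch of my own (the event type and stages are invented) of a two-stage pipeline on standard-library threads and channels:

```rust
// Sketch of a two-stage ingestion pipeline: a producer thread feeds events
// through a channel to a consumer that aggregates them.
use std::sync::mpsc;
use std::thread;

struct Event {
    bytes: u64, // a real event would carry many more fields
}

fn main() {
    let (tx, rx) = mpsc::channel::<Event>();

    // Stage 1: ingest. In a real pipeline this would read from Kafka,
    // sockets, files, and so on.
    let producer = thread::spawn(move || {
        for _ in 0..1_000_000u32 {
            tx.send(Event { bytes: 512 }).unwrap();
        }
        // Dropping `tx` closes the channel and ends the consumer loop.
    });

    // Stage 2: aggregate. No GC pauses interrupt this loop; memory use is
    // bounded by what we explicitly hold on to.
    let consumer = thread::spawn(move || {
        let mut total_bytes = 0u64;
        for event in rx {
            total_bytes += event.bytes;
        }
        total_bytes
    });

    producer.join().unwrap();
    println!("ingested {} bytes", consumer.join().unwrap());
}
```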
Most organizations begin with targeted replacement
of components that have already been identified as bottlenecks,
and then start rewriting or replacing those components or
services with Rust-based applications,
rather than doing a complete rewrite. At some point you
might want to replace the entire application; it depends on what you are
doing in your data processing and how many bottlenecks you have.
So with that, let's talk about some of the key
takeaways and the next steps.
Rust addresses critical performance limitations.
Just to recap: it does not need garbage collection,
and its predictable performance and efficient resource utilization make Rust
ideal for data infrastructure.
The Rust ecosystem is maturing rapidly, with Arrow, DataFusion,
Delta Lake, and Polars
libraries and many other Rust-based implementations.
The whole landscape is production-ready and evolving
rapidly, enabling high-performing data platforms for organizations.
I recommend you begin with an isolated component analysis and see which
interfaces can be replaced, enhanced, or optimized using Rust.
Focus on the bottlenecks first and then scale out to the other components as well,
so you can really use Rust's performance characteristics for building
modern lakehouse data platforms and future-proofing your architecture.
So with that, I recommend you evaluate your current data platform
architecture and identify the components that would benefit most from Rust's
performance and reliability advantages.
And that's it from me.
I hope this was helpful, and that this information is useful for
you and your data platform modernization initiatives.
Thank you for giving me this opportunity, and thank you for
listening to my presentation.
Bye.