Conf42 Cloud Native 2023 - Online

Build cloud native open format data lakehouse

Abstract

As data lakes grow and get notoriously messy, companies struggle to change data in order to comply with GDPR and CCPA regulations governing how customer data can be used. This session covers the journey to a data lakehouse architecture using open source table formats and cloud native services.

Summary

  • Businesses are struggling to capture, store and analyze all the data generated by today's modern digital business. Traditional on-premises data analytics approaches do not scale well and are too expensive to handle these volumes of data. Data grows exponentially and needs to be securely accessed and analyzed by any number of applications and people.
  • A data lake makes it easier to derive insights from all your data. Customers need a highly scalable, highly available, secure and flexible data store. What you are seeing is a high-level architecture of a data lake in the cloud with low-cost, durable and scalable storage.
  • There are some components that you will most likely need to create and maintain for your data lake. Data needs to be collected into scalable storage. Most importantly, you need a framework for managing analytics and data; without governance, finding a good solution is impossible.
  • Data lakes have become the default repository for all kinds of data. Is your data lake getting unmanageable? Do you want to build a highly scalable, cost-effective data lake with transactional capabilities? There are two options for creating a lakehouse on AWS, covered in the next few slides.
  • A typical organization will need both a data warehouse and a data lake. Data lakes store both structured and unstructured data from various data sources. The second option is a do-it-yourself approach to creating a data lakehouse.
  • Apache Hudi follows a timeline-based transaction model. Apache Iceberg follows a snapshot-based transaction model. Delta Lake employs optimistic concurrency control. You can also time travel by snapshot ID or timestamp. These table formats are integrated with AWS analytics services.
  • Apache Hudi is considered suitable for streaming use cases, whether it's IoT data or change data capture from a database. The third option is Delta Lake, which is suitable if your data platform is built around the Spark framework. The talk closes with final thoughts on choosing a data lake table format for building a lakehouse architecture on AWS.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Satish Mane. I will talk about data lake table formats and their integration with AWS analytics services to build a cloud native data lakehouse on AWS. Hi, my name is Rajeev Jaiiswal. In this session I'll take you on a journey to the data lake. Thanks for joining in.

Before we dive further, let's first understand a couple of trends we see in businesses. The first one is customer expectations. In the digital era, customers expect the kind of experience they get from Airbnb, Uber and similar technology companies. On top of that, the experience is personalized, demonstrating a true understanding of the customer and their context. People's expectations vary from industry to industry, so we have to serve them contextually. The second trend is data volume. Data is growing at an unprecedented rate, exploding from terabytes to petabytes and sometimes exabytes. Traditional on-premises data analytics approaches do not scale well and are too expensive to handle these volumes of data. We often hear from businesses that they are trying to extract more value from their data but are struggling to capture, store and analyze all the data generated by today's modern digital business. Data grows exponentially, comes from new sources, is becoming more diverse, and needs to be securely accessed and analyzed by any number of applications and people.

All this brings us to the subject of technology. Before diving into the broader analytics architecture, let's first understand how a legacy or traditional on-premises analytics stack looks. There are typically operational systems and databases for storing customer records and transactions, followed by a reporting database for data mart and data warehouse type use cases. There are four main problems with this type of architecture. First, the analytics implementation cycle is too long, as moving data sets and building dashboards can take weeks or even months. The second issue is scalability and higher cost, because you always have to plan ahead to buy more hardware and pay for more licenses. Third, this architecture is not suitable for modern analytics use cases such as machine learning and ad hoc queries for data science. Finally, organizations struggle to keep up with the pace of changing business needs.

Now, how can you solve all these problems? The answer is a data lake. A data lake makes it easier to derive insights from all your data by providing a single place to access structured, semi-structured and unstructured data. Customers need a highly scalable, highly available, secure and flexible data store that can handle very large data sets at a reasonable cost. Therefore, three points are key for data lakes: keep data in its original form and format, no matter how much, what kind, and how fast it is generated; define structure and processing rules only when necessary, also known as schema on read; and, since the data is used by a large community, democratize access to it.

What you are seeing is a high-level architecture of a data lake in the cloud. As low-cost, durable and scalable storage, Amazon S3 provides a storage layer that is completely decoupled from data processing and the various big data tools, with zero operational overhead. Customers can choose a data lake file format such as Apache Parquet. Spark-powered AWS services such as AWS Glue, Amazon EMR and Athena enable access and compute at scale. The metadata layer stores metadata about tables, columns and partitions in the AWS Glue Data Catalog.
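As a minimal sketch of the storage and catalog layers just described, the snippet below converts raw CSV landed in S3 into partitioned Parquet and registers it as a table. It assumes the job runs on Amazon EMR or AWS Glue, where Spark's catalog is backed by the AWS Glue Data Catalog; the bucket, database and table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

raw = (spark.read
       .option("header", "true")
       .csv("s3://example-raw-bucket/orders/"))          # hypothetical raw zone

(raw.write
    .mode("overwrite")
    .partitionBy("order_date")                           # enables partition pruning
    .parquet("s3://example-curated-bucket/orders/"))     # columnar, compressed copy

# Registering the location as a table makes it visible to Athena, EMR and Glue
# through the shared Glue Data Catalog.
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.orders
  USING PARQUET
  LOCATION 's3://example-curated-bucket/orders/'
""")
spark.sql("MSCK REPAIR TABLE analytics.orders")          # discover the partitions
```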
To keep the data in its original form and format, you need the ability to handle various file formats such as JSON, CSV, Parquet, Avro and more. Each format is suitable for different use cases. For example, CSV is popular for its low volume and human-readable format; CSV is what we call a row-oriented storage format. Parquet organizes data into columns, and column-store files are more optimized because you can perform better compression on each column. Parquet is well suited for bulk processing of complex data.

Now that you understand the data lake components, if you're creating your own data lake, there are some components that you will most likely need to create and maintain. Data needs to be collected into scalable storage, with ETL transformations applied. All data must be cataloged, because without a catalog you cannot manage data, find data or organize access control. All kinds of analytics are needed, including batch analytics, stream analytics and advanced analytics like machine learning, so end-to-end data ingestion and analysis processes need to be coordinated. Data should be available to all kinds of people, users and roles. Most importantly, you need a framework for managing analytics and data; without governance, finding a good solution is impossible. Data analysts can query data lakes directly using fast compute engines such as Redshift and their preferred language, SQL. Data scientists then have all the data they need to build robust models. Data engineers can more easily build and simplify data pipelines instead of focusing on infrastructure.

So let's understand the benefits of a serverless data lake. Serverless is a native architecture of the cloud, allowing you to offload more operational responsibilities to AWS. It increases agility and innovation by allowing you to focus on writing the business logic that serves your customers. Serverless technology offers automatic scaling, built-in high availability, and a consumption-based billing model for cost optimization. Serverless allows you to build and run applications and services without worrying about infrastructure, eliminating infrastructure management tasks such as server or cluster provisioning, patching, operating system maintenance and capacity provisioning. AWS offers many other serverless services which I won't cover here, such as DynamoDB, Redshift Serverless and so on.
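As a minimal sketch of this serverless query path, the snippet below submits a SQL query to Amazon Athena with boto3 and polls for completion; the region, database, table and result bucket names are hypothetical placeholders.

```python
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Submit a query against the Glue-cataloged table; no cluster to provision.
qid = athena.start_query_execution(
    QueryString="SELECT order_date, count(*) AS orders FROM analytics.orders GROUP BY order_date",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes; Athena bills per data scanned.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows[:5])
```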
Now I'm going to hand over to Satish, who is going to dive deep into the lakehouse architecture. Thank you, Rajeev. Now that you understand a regular data lake, let me explain the building blocks of lakehouse architecture. Data lakes have become the default repository for all kinds of data. A data lake serves as a single source of truth for a large number of users querying from a variety of analytics and machine learning tools. Is your data lake getting unmanageable? Do you want to build a highly scalable, cost-effective data lake with transactional capabilities? Are you struggling to comply with data regulations as to how customer data in data lakes can be used? If you are facing these challenges, then this session talks about how lakehouse architecture solves them.

What challenges does a typical data lake face? Regular data lakes provide scalable and cost-effective storage, but continuously ingesting data while querying it transactionally from many analytics tools is not possible with a regular data lake. At the same time, under CCPA and GDPR regulations, businesses must change or delete all of a customer's data upon request, to comply with the customer's right to be forgotten or a change of consent to the use of their data. It is difficult to make these kinds of record-level changes in regular data lakes. Some customers find change data capture pipelines difficult to handle; this is especially true for recent or erroneous data that needs to be rewritten. A typical data lake would have to reprocess missing or corrupted data after job failures, which can be a big problem. Regular data lakes do not enforce schema when writing, so you cannot avoid ingesting low-quality data. Also, one has to know the partition or table structure to avoid full table scans and listing files from all partitions.

So let's see how an open table format can be used to address the challenges mentioned on the previous slide. One of the key characteristics expected of lakehouse architecture is transactional, or ACID, properties. You do not have to write any code for this in a transactional data lake table format, which I will cover in the next few slides; transactions are automatically written to a log, presenting a single source of truth. Advanced features such as time travel, data transformation with DML, and concurrent reads and writes are also expected in a data lake to handle use cases such as change data capture and late-arriving streaming data. You can also expect a data lake to have features such as schema evolution and schema enforcement; these allow you to update your schema over time and ensure data quality during ingestion. Engine neutrality is also expected in a future-proof data architecture: today you use one compute engine to process data, but tomorrow you can use a different engine for new needs. For time travel, data lake table formats version the big data that you store in the data lake. You can access any historical version of the data, simplifying data management with easy-to-audit rollback in case of accidental bad writes or deletes, and with reproducible experiments and reports. Time travel enables reproducible queries by allowing two different versions to be queried at the same time. Open table formats work at scale by automatically checkpointing and summarizing large amounts of data, many files and their metadata.

So what are your options for creating a data lakehouse architecture to solve those regular data lake challenges? Customers often face a dilemma when it comes to choosing the right data architecture for building a data lakehouse. Some customers use a data warehouse to eliminate the need for a data lake and the complexity that comes with it. However, a new pattern that is emerging as a popular way of implementing a data lakehouse on AWS is to combine both data lake and data warehousing capabilities. This pattern is known as the lakehouse architectural pattern. There are two options for creating a lakehouse on AWS, which I will talk about in the next few slides.
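As a minimal sketch of the record-level changes and time travel described above, the snippet below deletes one customer's records and then reads an earlier version of the table, using Delta Lake as one example of the table formats compared later in the session. It assumes a Spark cluster with the Delta Lake libraries available; the table path and customer id are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("gdpr-delete")
         # Delta Lake SQL support (assumed to be on the cluster's classpath)
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "s3://example-curated-bucket/delta/customers/"    # hypothetical Delta table

# Record-level change: remove one customer's data to honour a GDPR/CCPA request.
# The delete is committed to the transaction log as an atomic operation.
spark.sql(f"DELETE FROM delta.`{path}` WHERE customer_id = 'c-42'")

# Time travel: read the table as of an earlier version or point in time,
# for example to audit what was stored before the deletion.
before = (spark.read.format("delta")
          .option("versionAsOf", 0)          # or .option("timestampAsOf", "2023-03-01")
          .load(path))
before.show()
```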
So before diving into each data lake table format and the lakehouse architecture options on AWS, let me quickly compare the building blocks I discussed on the previous slide. Depending on your needs, a typical organization will need both a data warehouse and a data lake, as they serve different needs and use cases. Data lakes store both structured and unstructured data from various data sources such as mobile apps, IoT devices and social media. The structure of the data, or schema, is not defined at the time of data collection. This means you can store all your data without having to plan carefully or know what questions you will need to answer in the future. A data warehouse is a database optimized for analyzing relational data from transactional systems. Data structures and schemas are predefined to optimize fast SQL queries, the results of which are typically used for operational reporting and analysis. Data is cleaned, enriched and transformed so that it can serve as a single source of truth that users can rely on. However, once organizations with a data warehouse recognize the benefits of a data lakehouse, which provides the functionality of both a data lake and a data warehouse, they can evolve their data warehouse to include a data lake and enable various query capabilities.

The first lakehouse architecture option is a ready-to-use platform on AWS. This approach pairs a separate data warehouse with transactional capabilities, such as Amazon Redshift, with a cost-effective, scalable data lake on Amazon S3. Technologies such as Amazon Redshift Spectrum can then be used to integrate strategically distributed data across both the data warehouse and the data lake. This approach definitely simplifies the engineering effort, freeing developers to focus on feature development and leave the infrastructure to the cloud, harnessing the power of serverless technology from storage to processing to the presentation layer. In this pattern, data from various data sources is aggregated into Amazon S3 before transformation or loading into the data warehouse. This pattern is useful if you want to keep the raw data in the data lake and process data in the data warehouse to avoid scaling cost. With this option, you can take advantage of Amazon Redshift's transactional capabilities and also run low-latency analytical queries.

The second option is a do-it-yourself option for creating a data lakehouse. Why do it yourself? This pattern is growing in popularity because of three table formats, Apache Hudi, Delta Lake and Apache Iceberg, that have emerged over the past few years to power data lakehouses that support ACID transactions, time travel and granular access control, and deliver very good performance compared to a regular data lake. These open data lake table formats combine the scalability and cost effectiveness of a data lake on Amazon S3 with the transactional capabilities, reliability and performance of a data warehouse. Table formats, or data lake table formats, are instrumental for getting the scalability benefits of the data lake and the underlying Amazon S3 object store while at the same time getting the data quality and governance associated with data warehouses. These table format frameworks also add additional governance compared to a regular data lake. Optionally, you can connect Amazon Redshift for low-latency OLAP access to business-ready data.

Now I will quickly walk through the three popular table formats. The first one is Apache Hudi. Apache Hudi follows a timeline-based transaction model. A timeline contains all actions performed on the table at different instants of time. The timeline provides instantaneous views of the table and supports retrieving data in the order of arrival. Apache Hudi offers both multi-version concurrency control and optimistic concurrency control. Using multi-version concurrency control, Hudi provides snapshot isolation between an ingestion writer and multiple concurrent readers. It also applies optimistic concurrency control between writers. Hudi supports file-level optimistic concurrency control, that is, for any two commits or writers to the same table, if their writes do not touch overlapping files, both writers are allowed to succeed. The next feature is time travel: you can time travel according to Hudi commit time. Hudi supports schema evolution to add, delete, modify and move columns, but it does not support partition evolution; you cannot change the partition column. When it comes to storage optimization, automatic file sizing and automatic compaction are great for avoiding small files in Apache Hudi. And the last feature is indexing: by default, Hudi uses an index that stores the mapping between a record key and the file group id it belongs to. When modeling, use a record key that is monotonically increasing, for example with a timestamp prefix, for the best index performance through range pruning to filter out files.
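As a minimal sketch of the ingestion pattern just described, the snippet below upserts a small batch into a Hudi table with PySpark, assuming a Spark cluster (for example Amazon EMR or AWS Glue) with the Hudi libraries available; the table name, record key and S3 path are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

updates = spark.createDataFrame(
    [("o-1001", "2023-04-01", 99.0), ("o-1002", "2023-04-01", 15.5)],
    ["order_id", "order_date", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # key used by the index
    "hoodie.datasource.write.precombine.field": "order_date",  # picks the latest record on upsert
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",              # insert new rows, update existing ones
}

(updates.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-curated-bucket/hudi/orders/"))
```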
The second table format is Apache Iceberg. Apache Iceberg follows a snapshot-based transaction model. A snapshot is a complete list of files in the table, and the table state is maintained in metadata files. Every change to the table state creates a new metadata file that replaces the old metadata file with an atomic swap. Iceberg follows optimistic concurrency control: writers create table metadata files optimistically, assuming that the current version will not be changed before they commit. Once a writer has created an update, it commits by swapping the table's metadata file pointer from the base version to the new version. If the snapshot on which the update is based is no longer current, the writer must retry the update based on the new current version. For time travel, users can query the table according to a snapshot id or a timestamp. When it comes to storage optimization, you can clean up unused older snapshots by marking them as expired based on a certain retention period and then manually running a Spark job to delete them; to compact small files into larger files, you also need to run a Spark job manually in the background. And the last feature is indexing: Apache Iceberg uses value ranges for columns to skip data files and partition fields to skip manifest files when executing a query.
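The snippet below is a minimal sketch of the Iceberg time travel and maintenance jobs just described, assuming a Spark session with an Iceberg catalog (here named glue_catalog) backed by the AWS Glue Data Catalog, as on EMR or AWS Glue with Iceberg enabled; the table name, snapshot id and timestamps are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("iceberg-time-travel")
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.glue_catalog.catalog-impl",
                 "org.apache.iceberg.aws.glue.GlueCatalog")
         .config("spark.sql.catalog.glue_catalog.warehouse",
                 "s3://example-curated-bucket/iceberg/")
         .getOrCreate())

# Time travel: read the table as of an older snapshot id or a point in time.
old_by_snapshot = (spark.read
                   .option("snapshot-id", 5937117119577207000)   # hypothetical snapshot id
                   .table("glue_catalog.analytics.orders"))

old_by_time = (spark.read
               .option("as-of-timestamp", "1680300000000")       # epoch milliseconds
               .table("glue_catalog.analytics.orders"))

# Maintenance runs as explicit Spark jobs: compact small files and expire
# snapshots older than a retention window.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'analytics.orders')")
spark.sql("""CALL glue_catalog.system.expire_snapshots(
    table => 'analytics.orders',
    older_than => TIMESTAMP '2023-03-01 00:00:00')""")
```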
The third format is Delta Lake. Delta Lake has a transaction model based on a transaction log: it logs file operations in JSON files and commits them to the table using atomic operations. Delta Lake automatically generates a checkpoint file in Parquet format every ten commits. Delta Lake employs optimistic concurrency control, a method of dealing with concurrent transactions that assumes transactions, or changes made to a table by different users, can complete without conflicting with one another. Users can also run time travel queries according to a timestamp or version number. Delta Lake lets you update the schema of a table by adding new columns or reordering existing columns. When it comes to storage optimization, Delta Lake does not have automatic compaction, as it follows copy-on-write; file sizing is manual, and you need to run the vacuum and optimize commands to convert small files into larger files. Delta Lake collects column statistics for data skipping, and it takes advantage of this information, the minimum and maximum values of each column, at query time to provide faster queries. It uses the Z-order technique to colocate the data-skipping information in the same files for the columns included in the Z-order.

So this is a quick snapshot of how these table formats are integrated with AWS analytics services. Amazon Athena has better integration with Apache Iceberg in terms of read and write operations, whereas it supports only read operations on Apache Hudi and Delta Lake. Amazon Redshift Spectrum supports both Apache Hudi and Delta Lake for reading data. Amazon EMR and AWS Glue support both reads and writes against all three table formats. You can also manage permissions in Amazon Athena using AWS Lake Formation for the Apache Hudi and Apache Iceberg table formats. Similarly, you can manage permissions in Amazon Redshift Spectrum using AWS Lake Formation for Delta Lake and Apache Hudi.

To conclude, here are some final thoughts on choosing a data lake table format for building a lakehouse architecture on AWS, based on your use case as well as integration with AWS analytics services. Apache Hudi is considered suitable for streaming use cases, whether it's IoT data or change data capture from a database. Hudi provides three highly flexible types of indexes for optimizing query performance and also optimizing data storage. Because of its auto file sizing and clustering optimization features backed by index lookups, it is great for streaming use cases, and unlike the other two table formats it comes with a managed data ingestion tool called DeltaStreamer. The second option is Apache Iceberg. If you're looking for easy management of schema and partition evolution, then Apache Iceberg is a suitable table format. One of the advantages of Apache Iceberg is how it handles partitions: it derives the partition value from the data fields used in a SQL WHERE condition, so one does not need to specify the exact partition key in the SQL query. Unlike Hudi and Delta Lake, Iceberg allows you to easily change the partition column on a table; it simply starts writing to the new partition. The third option is Delta Lake. This table format is suitable if your data platform is built around the Spark framework, with deep integration of Spark features. Delta Lake stores all metadata and state information in the transaction log and checkpoint files instead of a metastore, so you can use a separate Spark cluster to build the table state independently without relying on a central metastore, and this really helps to scale and meet performance requirements depending on your Spark cluster size. Thank you for listening. My colleague and I hope you all enjoyed this session.
...

Satish Mane

Senior Solutions Architect @ AWS

Satish Mane's LinkedIn account

Rajeev Jaiiswal

Senior Solutions Architect @ AWS

Rajeev Jaiiswal's LinkedIn account


