Conf42 Cloud Native 2024 - Online

Building a Modern Enterprise Data Ecosystem on Cloud: Challenges and Opportunities


Abstract

Explore building a modern enterprise data ecosystem on the cloud with us. Discover data domain design, seamless transfer, user experience enhancements, crucial roles, robust security, and the power of machine learning. Don’t miss unlocking the full potential of cloud-based data management.

Summary

  • Welcome to today's conference, where we delve into modern data architecture. In an era where data has become the cornerstone of innovation and competitive advantage, organizations worldwide are racing to develop modern data strategies. We will explore the latest technologies and frameworks that empower organizations to transform data into actionable insights.
  • The data lake is, in its simplest form, storage where structured and unstructured data can be stored and consumed from. However, this simple storage comes with a number of challenges, and addressing them is essential for optimizing the functionality and efficiency of the data lake.
  • Data writers are assigned specific areas within the data lake designated for their respective suborganizations. Data within each suborganization is structured into distinct datasets. Within each dataset's designated area, data is organized into delta files and aggregates, and it is partitioned for efficient storage and retrieval.
  • Various personas will consume this data: data analysts, data scientists, applications, or consumers we do not yet know of. All of them need the data to be stored in a fit-for-consumption platform. The challenges are consistency of data across the various sinks.
  • Multiple data publishers write data in various formats. We need a component that can move the published data onto any of the data platforms consumers want to consume from. Among the challenges this component has to solve is ensuring consistency in the data.
  • The data pipeline's role is to preprocess the data before it is written to the storage platforms. The basic steps within the pipeline are implementing data governance, performing light transformations, and routing data to the appropriate sinks. A data pipeline offers a plug-and-play approach for adding and updating data governance.
  • How can the platform we have defined so far support machine learning? There are a few stages. Raw data is available in the data lake. A transformation job within the data transformation platform extracts the features and stores them in a feature platform. Finally, a UI binds all of this together: the data portal.
  • There could be potential challenges with data security, and with data integration: integrating on-premise data with new data created in the cloud. Cloud capacity is also a concern; although it is a good abstraction, we will need to plan for capacity.

Transcript

Welcome to today's conference, where we delve into modern data architecture. In an era where data has become the cornerstone of innovation and competitive advantage, organizations worldwide are racing to adopt modern data strategies that can unlock its full potential. But why do we need modern data architecture? Today we will explore this question and more: how the landscape of data has evolved, why traditional approaches are no longer sufficient, the explosion of data sources, the demand for near real-time insights, the challenges organizations face in harnessing their data, and the solutions that lie in modernizing our approach to data architecture. Our discussion will not only cover the theoretical aspects of a modern data strategy, but will also delve into practical insights on how to build modern data architectures that are agile, scalable and resilient. We will explore the latest technologies, methodologies and frameworks that empower organizations to transform data into actionable insights, driving innovation and fueling growth. Throughout this talk, I will share best practices gleaned from industry leaders and experts, offering key takeaways that you can implement within your own organization. So whether you are a seasoned data professional or just beginning your journey into the world of modern data architecture, I encourage you to engage, collaborate and immerse yourself in the knowledge and insight this talk has to offer. Welcome to the journey of building modern data architectures.
Before going into the details of how we architect a modern data organization, let's take a step-by-step approach. In every organization there is a need to write data somewhere and then have consumers read that data. In its simplest form, the data lake is just storage where structured and unstructured data can be stored and consumed from. In short, an organization needs a place to write data that is consistent and durable, and a way for consumers to read that data consistently. This could be as simple as S3 in AWS, which can store data that is structured, semi-structured or unstructured. However, this simple storage comes with a number of challenges. If every publisher writes data in its native format, consumers will not have a consistent way to read it. The data must be organized so that consumers know where to read from. And data arrives at some interval, which creates the problem of presenting a comprehensive view.
Going into a little more detail: in its rudimentary form, the data lake serves as a reservoir for both structured and unstructured data, providing a centralized repository from which information can be stored and retrieved. Writers deposit data into the data lake, while readers access and extract data from it. However, this seemingly straightforward approach presents the challenges we just discussed, and addressing them is essential for optimizing the functionality and efficiency of the data lake, thereby enhancing its utility and value within the organization's ecosystem. Next, I will talk a little more about how to organize the data storage. The way we are building this up is step by step: we started with the simplest form, and now we will go a little deeper into how to store the data.
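As a minimal sketch of this basic write/read pattern, assuming an S3-backed lake with a hypothetical bucket and key layout (not taken from the talk), a publisher could deposit a Parquet file and a consumer could read it back with boto3 and pandas:

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "enterprise-data-lake"  # hypothetical bucket name
KEY = "lob=payments/dataset=transactions/2024/03/01/part-0001.parquet"  # hypothetical layout

# Writer side: serialize a structured dataset to Parquet and deposit it in the lake.
df = pd.DataFrame({"txn_id": [1, 2], "amount": [19.99, 5.00]})
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
s3.put_object(Bucket=BUCKET, Key=KEY, Body=buffer.getvalue())

# Reader side: fetch the same object and load it back into a DataFrame.
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
consumed = pd.read_parquet(io.BytesIO(obj["Body"].read()))
print(consumed.head())
```

Even this tiny sketch makes the challenges above visible: without an agreed key layout and file format, every publisher and consumer would invent its own.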
We'll follow this pattern through the entire talk. Coming back to organizing data storage: every data publisher needs to write data to the storage, but we do not want each of them to invent a new way to do it. Hence there is a need for a standard way to write data, which can be fulfilled by a component called the data writer. Data writers are assigned specific areas within the data lake designated for their respective suborganizations. They must authenticate their identity and specify the dataset they are writing to. Meanwhile, data readers allow consumers to access data based on their identity and the datasets they are authorized to access. Data within each suborganization is structured into distinct datasets. Within each dataset's designated area, data is organized into delta files and aggregates, and it is partitioned for efficient storage and retrieval. Standard formats such as Parquet are employed for storage, while table formats like Iceberg and Hudi may be used to enable transactional capabilities. In short, I am focusing here on the write path: how a publisher leverages the data writer component to store data, how that data is organized within an organization, and how to manage the deltas that arrive continuously or at some predefined frequency. We need to make sure that as deltas arrive they are merged into the master files.
Now let's transition from storage to utilization, the consumption side. The high-level questions are: who will consume the data, where will they consume it from, and how will they consume it? Various personas will consume this data: data analysts, data scientists, applications, or consumers that we do not yet know of. I call out that last bucket because there could be an innovative scenario in which the consumer does not fit into the analyst, scientist or application bucket, so I keep a separate bucket for consumers we do not know of today. All of these require the data to be stored in a fit-for-consumption platform. This means that data for a batch application will be stored in a data lake, while data for real-time access will most likely be served from a database or other low-latency storage. Additionally, for consumption patterns that we do not know of, the platform will have to enable the creation of custom sinks to fulfill specific needs. Further, consumers will need tools appropriate to their consumption: APIs for real-time access, or notebooks for data scientists. Even with this upgrade, the challenges remain: consistency of data across the various sinks, duplication of effort in writing the data, and controls to prevent unauthorized access. As we move forward, we will see how to address these challenges with a more holistic view.
Now let's discuss how those challenges can be addressed. Multiple data publishers write data in various formats. For example, one publisher may write CSV data to the data lake, while another may want to write Avro data to streams. So we need a component that enables publishers to publish in a uniform way.
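As an illustrative sketch of the data writer's write path, assuming a PySpark job and hypothetical suborganization, dataset and column names (none of these identifiers come from the talk), the writer could land an incoming delta as partitioned Parquet in the publisher's designated area:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-writer").getOrCreate()

# Hypothetical designated area for this suborganization and dataset.
DATASET_PATH = "s3a://enterprise-data-lake/lob=payments/dataset=transactions"

# Read the incoming delta produced by a publisher (landing path is illustrative).
delta = spark.read.option("header", "true").csv(
    "s3a://landing-zone/payments/transactions/2024-03-01/"
)

# Light standardization before landing: enforce types and add a partition column.
standardized = (
    delta
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("event_date", F.to_date("event_ts"))
)

# Land the delta as partitioned Parquet. A separate compaction job would fold
# deltas into the aggregate (master) files, or a table format such as Iceberg
# or Hudi could provide that transactional merge instead.
(standardized.write
    .mode("append")
    .partitionBy("event_date")
    .parquet(f"{DATASET_PATH}/delta/"))
```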
We need a component that can move the published data onto any of the data platforms consumers want to consume from. This component will also write data in a standard format, for example Parquet. We call this the data pipeline. Some of the challenges this component has to solve: ensure there is consistency in the data written to the various platforms; ensure the data is written in a standard format so there is predictability from the consumer's perspective; and implement governance so that there is trust in the data and the burden of implementing governance does not fall on each platform. Hence it is important to establish a first-class data storage platform: if there is an inconsistency in the data, this is the source we will use to rebuild the other platforms. For consumers to consume data, we need discoverability, which is achieved by establishing a data catalog. The data catalog tells consumers what data is available for consumption, but we still have to solve who can access what data, and that is where data access control comes into the picture.
Before doing a deep dive on the data pipeline, let's spend a few minutes on the role of the data writer. The data writer component plays a pivotal role in managing both real-time and batch data within the data ecosystem. Real-time streams are critical for instantaneous decision making and monitoring. Real-time data can originate from APIs or change data capture mechanisms; regardless of the source, it is channeled into a continuous flow, forming a dynamic data stream. Batch processing is essential for handling large volumes of data at scheduled intervals. Batch data may originate from file-based storage, logs or databases, and batch datasets are directed towards a centralized data lake for storage and subsequent analysis.
Now let's discuss the data pipeline. The basic idea is to preprocess the data before it is written to the storage platforms. The basic steps within the pipeline are: implement data governance, perform light transformations, and route data to the appropriate sinks. Additionally, the pipeline should be flexible enough that more custom steps can be added down the line. All of these steps are organized using an orchestrator; this could be something like Flink or Airflow. Let's talk about each of these steps. Data governance implements policies and procedures for managing data access, security and compliance. It performs a schema check to validate the schema and ensure conformity with predefined standards and structures. It performs quality checks to identify and address data integrity issues or anomalies. It also scans for sensitive data in order to identify and protect it, ensuring compliance with privacy regulations. The pipeline then routes data to its destinations, directing the stream to diverse sinks based on predefined routing rules and criteria, and it ensures data is eventually retained by implementing mechanisms for persistence and reliability even under challenging conditions. Finally, it performs light transformations, applying lightweight changes that optimize the data for downstream processing, such as changing data types or formatting.
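As a sketch of how these pipeline steps could be wired together in an orchestrator, assuming Airflow as the orchestrator and placeholder task bodies (the DAG id, task names and schedule are hypothetical, not from the talk):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def schema_check(**_):
    """Validate the incoming batch against the registered schema."""
    ...

def quality_check(**_):
    """Run data-quality rules (nulls, ranges, duplicates) on the batch."""
    ...

def sensitive_data_scan(**_):
    """Scan for sensitive fields and apply masking/encryption policies."""
    ...

def light_transform(**_):
    """Apply lightweight transformations such as type casts and formatting."""
    ...

def route_to_sinks(**_):
    """Route the governed batch to the data lake, streams and other sinks."""
    ...


with DAG(
    dag_id="transactions_pipeline",  # hypothetical dataset
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    steps = [
        PythonOperator(task_id=name, python_callable=fn)
        for name, fn in [
            ("schema_check", schema_check),
            ("quality_check", quality_check),
            ("sensitive_data_scan", sensitive_data_scan),
            ("light_transform", light_transform),
            ("route_to_sinks", route_to_sinks),
        ]
    ]
    # Chain the governance, transformation and routing steps sequentially.
    for upstream, downstream in zip(steps, steps[1:]):
        upstream >> downstream
```

Keeping each step as its own task is what makes the "plug and play" extensibility described above possible: a new governance check is just another task inserted into the chain.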
Achieving all of the above can be streamlined with a data pipeline offering a plug-and-play approach for adding and updating data governance. It routes batch data to both streaming and batch platforms; for real-time data, it batches the stream before routing it to the lake while sending it to the other downstream storage sinks simultaneously.
Before sharing the responsibilities of a data transformation platform, let's revisit the challenges associated with it. First, if each team must develop its own computing platform, optimal capacity utilization becomes a challenge. Second, each team will expend effort maintaining infrastructure individually. And third, there is no standardized approach to implementing DataOps, leading to inconsistency across teams. To tackle these challenges, a data transformation platform can be used to: accept transformation scripts in various formats via a CI/CD pipeline; initiate and execute jobs on on-demand computing infrastructure; dynamically grant access to specific data based on job context; and transfer the transformed data to a data writer for storage on the data platforms.
As you can see in the diagram, the left-hand side shows how data is organized across different LOBs and the role of individual publishers within each LOB, while the right-hand side shows what the enterprise catalog looks like and the role of each sub-LOB within the overall framework. In general, every organization is composed of sub-LOBs. Each sub-LOB has multiple publishers, and each publisher may publish various datasets. These datasets may in turn be sent to various data platforms such as the lake or a database; we call this data distribution. Further, there can be multiple consumers in each sub-LOB who may want to access data from any dataset across any sub-LOB, so it is imperative that this data is discoverable. Hence the publisher is responsible for registering the data in the catalog. In many organizations, each sub-LOB may have invested in its own catalog; in such cases, these registrations will need to roll up into a central catalog so that the data remains discoverable.
Now let's take a look at what is needed to enable publishers and consumers. We need a UI for the data users; we call this the data portal, though many organizations call it by different names. As we have already discussed, we need a place for admins to define an organization and its suborganizations. We also need to let publishers define their data in the catalog: datasets, data distributions and so on. After defining the catalog entries, publishers may be asked to define data quality, lineage and other attributes so that consumers have confidence in the data they are consuming. The basic idea is to reduce friction for data to be published, hence the tiers: tier one, data is just published but not yet available for consumption; tier two, data is published with basic governance information such as the schema; and tier three, data carries additional information such as data quality. Besides supporting publishers and consumers, the data portal needs to support other use cases such as reporting, searching and access requests. Let's now focus on machine learning: how can the platform we have defined so far support it?
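As a sketch of what registering a dataset's distribution in a central catalog might look like, assuming the AWS Glue Data Catalog as the backing store (the talk does not name a specific catalog product, and the database, table, schema and tier tag below are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical sub-LOB database and dataset registration.
glue.create_table(
    DatabaseName="payments",
    TableInput={
        "Name": "transactions",
        "TableType": "EXTERNAL_TABLE",
        # Tag the entry with its governance tier so consumers can judge readiness.
        "Parameters": {"classification": "parquet", "data_tier": "tier-2"},
        "PartitionKeys": [{"Name": "event_date", "Type": "date"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "txn_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://enterprise-data-lake/lob=payments/dataset=transactions/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```

In a setup with per-sub-LOB catalogs, a roll-up job would copy or federate entries like this one into the central enterprise catalog so consumers can discover data across sub-LOBs.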
As we can see in the diagram, at a rudimentary level there are a few stages. Raw data is available in the data lake. A transformation job within the data transformation platform extracts the features and stores them in a feature platform. This feature platform may be built from the following: feature registration in the data portal, feature storage in the data lake and perhaps a low-latency cache, and feature serving through an API. A job within the transformation platform trains the model, and the trained model then produces offline predictions that are stored in the data lake and in low-latency storage. One thing to note is that we are not covering real-time prediction here.
Now, putting everything together, we have these components. The data writer: it can read and write data from various sources and in various formats, scale up and down according to load, and cater to both batch and real-time data. The storage platforms: the data lakes, streams and other storage sinks. A data pipeline to implement the various preprocessing steps, binding those steps together through orchestration. Potentially a data transformation layer or platform, which helps us perform the transformations needed to send data to the various sinks. The consumption tools for consuming data from the various platforms in a way that is convenient for the consumer. And finally a UI that binds all of this together: the data portal.
Now I will spend some time on some of the challenges related to the cloud. A common challenge is how each LOB creates or shares its AWS accounts; the options are to create one central account to store the data, or to have each LOB create its own account. There is a potential challenge with data security and the added concern of a breach of sensitive data. The solution is to scan for sensitive data and perform appropriate remediation, so that even if the data is compromised it will be of very little use; some options are encryption, masking or deletion. Data integration is another challenge: integrating on-premise data with new data. The options are for historical data to remain on premise, or to migrate the historical data to the cloud and disconnect from the on-prem data. There can also be a challenge with workflow migration, which can be solved either by migrating on-premise jobs to the cloud so that new data is created in the cloud, or by letting the jobs remain on premise, creating new data on premise, and creating another job to move that data. From an ops point of view, each service and storage is tagged with the sub-LOB's identity so that this information can be used for billing. There can be a challenge with data governance; there I would recommend CDMC as a standard, along with custom data governance components. Cloud capacity is also a concern: the cloud is someone else's data center, and although it is a good abstraction, we will need to plan for capacity, as each account has soft and hard limits on the capacity available.
Well, thank you once again for your time and attention. It has been a pleasure to share my talk with you all today. I have shared my email and would be happy to take any questions you may have. Thank you.
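As an illustrative sketch of the offline machine-learning stages described above, assuming pandas and scikit-learn with hypothetical feature names, labels and lake paths (the talk does not prescribe specific tools or datasets):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stage 1: a transformation job reads raw data from the lake and extracts features.
raw = pd.read_parquet("s3://enterprise-data-lake/lob=payments/dataset=transactions/")  # hypothetical path
features = (
    raw.groupby("customer_id")
       .agg(txn_count=("txn_id", "count"), avg_amount=("amount", "mean"))
       .reset_index()
)

# Stage 2: persist the features to the feature platform's offline store.
features.to_parquet("s3://feature-platform/offline/customer_features.parquet", index=False)

# Stage 3: a training job fits a model on labeled features (labels are illustrative).
labeled = features.merge(
    pd.read_parquet("s3://enterprise-data-lake/labels/churn_labels.parquet"),  # hypothetical labels
    on="customer_id",
)
model = LogisticRegression().fit(labeled[["txn_count", "avg_amount"]], labeled["churned"])

# Stage 4: produce offline predictions and store them back in the lake
# (in practice they would also be pushed to a low-latency store for serving).
features["churn_score"] = model.predict_proba(features[["txn_count", "avg_amount"]])[:, 1]
features.to_parquet("s3://enterprise-data-lake/predictions/churn_scores.parquet", index=False)
```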

Raja Chattopadhyay

Engineering Manager @ Capital One



