Conf42 Cloud Native 2023 - Online

What is a Data Mesh, for God's sake!?

Abstract

Data Mesh is the new black in the analytics domain. But hold on, what is a Data Mesh? A lot has been written about the theory, but not so much in terms of implementation. During this talk we will see, with concrete cloud services, how to implement such a mesh and share our feedback from the field.

Summary

  • In today's talk with Charles, we want to introduce you to the famous notion of data mesh, coined by Zhamak Dehghani in 2019. We will first show the problem that the data mesh is trying to solve. Finally, we will share a possible implementation.
  • The cycle begins with end users, who create what we call operational data, serialized inside relational databases. The next operation consists in bridging this operational data into the analytics world. It is from the data that we fetch the insights needed to create new features.
  • The problem is not really technical; it is mainly organizational. It would be hard to maintain efficient communication between those teams. In the software design world, we can find some insights in the domain-driven design approach.
  • Data as a product, self-serve platform and federated computational governance. Stop thinking of your business domain as a monolith. A data mesh is a path toward better collaboration: a sociotechnical concept which solves an organizational scalability issue.
  • There is no data-mesh-ready solution. The catalog is the place where every domain can publish its own data products. Subscribers can also build their own products on top of those products; this is definitely doable and needs to be captured in the product characteristics.
  • Charles: The orchestrator will be in charge of managing the data from the moment we pull it from the data sources. The orchestrator can also provide an administration panel, which is helpful when you want to debug and see what is happening in the pipeline. Google is aiming to provide all those services under the hood.
  • GCP Dataplex is a product enabling the self-serve platform and federated computational governance pillars. It offers a logical layer that federates the different services on GCP. Each asset benefits by design from technical metadata. The real problem is not technical but organizational.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Ismael. I am a cloud native developer at Wescale. In today's talk, Charles and I want to introduce you to the famous notion of data mesh, coined by Zhamak Dehghani in 2019. Since her famous article, we believe this word has become more and more of a buzzword: most people talk about it, but few really master it. As software developers, we saw that the data mesh notion is deeply rooted in software design considerations, and we want to share this understanding with you, so that data mesh is no longer a mysterious notion and you are able to implement it efficiently. But first of all, let's take a quote that is not from us, but that will guide us through this presentation: there is no sense in talking about the solution before we agree on the problem, and no sense talking about the implementation steps before we agree on the solution. This will serve as the outline of this presentation. We will first show the problem that the data mesh is trying to solve. Then we will see what the data mesh is and why it seems to be the solution to the problem we just introduced. Finally, we will share with you a possible implementation of the data mesh, and I insist on the word possible.

So, prologue: when we talk about data, we talk in fact about a wide reality. There are many jobs and many notions involved, but we gather them all under the name of data. This is very important for an enterprise, because it is from the data that we fetch the insights needed to create new features. The cycle begins with end users, who create what we call operational data, serialized inside a relational database for instance, and which represents the business entities being manipulated. From this operational world, we then want to gain a broad understanding of our business in order to maybe fix it, or more likely evolve and enhance it, to answer new kinds of needs from the end users. This operation consists in bridging this very operational data into an analytics world, where we mix the facts from the operational data with new dimensions coming from third-party providers. We join all those data and make them explicit through charts, for instance, that business owners and analysts share with product owners, who are then in a position to create new features for the end users. So we have a cycle, and there is no secret: when we call data the new oil of the 21st century, it is true, because it is from the data that you get the valuable insights needed to make your applications evolve. And let's not forget about the data people who are the key to this cycle: data engineers, database administrators, data scientists and so on. Thanks to those people, we create a flow that makes the dialogue between the operational and the analytics worlds possible.

We keep talking about operational and analytics: what are the fundamental differences? In the operational world, we focus on the business entities and their relationships. Moreover, we require consistency over availability, maybe in real time, and we usually handle volumes in the order of gigabytes. Conversely, in the analytics world, we want a broad understanding of the business. It is not about business entities, it is about the business.
We manipulate facts rather than business entities, and we mix them with dimensions in order to create this broad understanding, and maybe reach a new, better understanding of the business so we can create new features. So what is the problem? Because the operational world and the analytics world have different needs, we want to bridge these two approaches in order to extract from the operational world the information needed to do analytics. Usually we go through a dedicated pipeline called an ETL pipeline, for extract, transform, load, which makes this transition between the operational and analytics worlds: we first extract the data from the databases, a MySQL database for instance, we transform it, and then we load it into a dedicated analysis database that we usually call a data warehouse.

So let's take a closer look at this operational-to-analytics bridge. We have the operational world, represented for instance by a MySQL database, the analytics world, represented by the data warehouse, and in between the transformation pipeline, with its extract, transform, load (ETL) logic. Like I said, there are two problems with this approach. The first one is that we try to put an entire business domain inside the very same data warehouse, so we have to think about a consistent way to put all these operational facts inside the analytics database, which is not a trivial issue, as we need to keep the understanding that gives us good insights into our business domain. The second problem is the coupling between the operational and the analytics world. What happens if we decide to change the schema of a table? We break the pipeline, because at some point we use that schema as a contract between the operational side and the ETL pipeline. From a database administration point of view, we would then say: wait a minute, I cannot change this schema, because I know there are hundreds of pipelines sourcing from this very same table. I don't think that is a valid reason. From an operational perspective, we don't want to care about the analytics world, and the transformation has to be somewhat agnostic of the operational schema. This is why we introduce data lake technology: we first extract and load the raw data inside a data lake, to get ownership back over the schema when we transform the data and load it into the data warehouse. Here we are no longer worried that some schema may change because of database administration operations; we have ownership back, as we extracted and loaded the raw data into the data lake. But we still have the first problem of putting all the data indistinctly inside the data lake, which then becomes a data swamp, from which it is hard to make sense of anything.
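To make the extract-load-then-transform approach described above more concrete, here is a minimal PySpark sketch of an ELT step: the raw operational extract is landed in the lake as-is, and the analytics-friendly transformation happens later, decoupled from the source schema. Bucket paths, table and column names are hypothetical.

```python
# Minimal ELT sketch with PySpark (illustrative paths and column names):
# 1) land the raw operational extract in the lake as-is,
# 2) transform it later, independently of the source database schema,
# 3) publish a curated table for the analytics side.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-elt").getOrCreate()

# Extract + Load: copy the raw extract into the data lake without reshaping it.
raw_orders = spark.read.json("s3://example-raw-zone/operational/orders/2023-03-22/")
raw_orders.write.mode("overwrite").parquet("s3://example-lake/raw/orders/dt=2023-03-22/")

# Transform: on our own schedule, turn raw facts into an analytics-friendly table.
orders = spark.read.parquet("s3://example-lake/raw/orders/dt=2023-03-22/")
daily_revenue = (
    orders
    .withColumn("order_day", F.to_date("created_at"))
    .groupBy("order_day", "country")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("order_count"))
)

# Publish: write the curated result where the warehouse (or subscribers) can read it.
daily_revenue.write.mode("overwrite").parquet("s3://example-lake/curated/daily_revenue/")
```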
To sum up, it is not just about the two problems I mentioned, it is also an organizational issue, because the classical approach, as Conway's law states, is to split our projects or products by technical teams. We have the data engineering team, the DBA team, and the data analyst and data science team. All of them communicate through Jira tickets. For instance, the data science team asks the data engineering team for new dimensions, and the data engineering team has no clue what this means in terms of business: everything is translated only into technical needs. Another problem lies between the data science team and the data engineering team: usually we get a bottleneck, because this central team becomes the central point through which all the other teams have to go when they have an issue or a need. And what usually happens is: OK, you don't give me the feature in time, I will do it myself. Then we see shadow IT appearing, with different sources of truth, which obviously harms the broad understanding of our business. So I would say that the problem is not really technical. All these technologies scale. The problem is mainly organizational: it is hard to maintain that many pipelines, and it is hard to maintain efficient communication between all those teams, from the operational teams to the data engineering team, but also from the data engineering team to the data analysts and data scientists. We have an issue to solve.

So what is the solution? In her article, Zhamak Dehghani points us to solutions coming from the software design world. What I didn't mention is that at the time, Zhamak was an employee of Thoughtworks, a software consultancy firm specialized in software design, and I think it is no coincidence that one of their employees came up with this notion of data mesh, because, like Zhamak said, we can find insights in the domain-driven design approach. This approach is built on discussion, with a strategic phase and a tactical phase. In the strategic phase, we work to understand the business domain, dividing it into subdomains and gathering them into bounded contexts, which act as boundaries between concepts that should communicate with one another, but at the same time remain autonomous in their evolution. This discussion should happen inside a multidisciplinary team made of business analysts, product owners, product managers and also developers, so that we start to build a ubiquitous language that we then use to write the different user stories. At the point where we have the different nouns, verbs and relationships between our business entities, when we have this ubiquitous language, we can define the bounded contexts and begin to implement them through the tactical phase. This implementation comes with technical patterns such as hexagonal architecture, CQRS, event sourcing and so on, in order to get an application that is testable, maintainable, evolvable, and so on. We talked just before about the data swamp, a data lake that is really hard to understand; we can make a parallel with software design, where we have the same kind of notion, called the big ball of mud, and domain-driven design is a good approach to avoid it at all costs. Domain-driven design is also a kind of cycle: it does not end when we have our ubiquitous language, or even the code that implements the different user stories. We will need new features, so we enrich our ubiquitous language with new verbs and new nouns, and maybe create another language along with a new bounded context. The idea of Zhamak was to apply this way of thinking to the data world, in particular the data analytics world. To sum up, the main goal of DDD, domain-driven design, is to make the different ubiquitous languages emerge, protected by bounded contexts, then turned into a domain model decomposed into subdomains, and finally implemented using software design patterns. You may know some of them: MVC (model-view-controller), hexagonal architecture, CQRS, event sourcing and so on.
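As a rough illustration of the tactical patterns just listed, here is a tiny hexagonal-architecture sketch in Python for a hypothetical billing bounded context: the domain logic depends only on a port (an abstract repository), and the persistence detail is a swappable adapter, which keeps the domain model testable and easy to evolve.

```python
# Tiny hexagonal-architecture sketch for a hypothetical "billing" bounded context:
# the domain depends on a port (abstract interface), infrastructure adapters
# implement it, so the domain model stays testable and easy to evolve.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Invoice:  # domain entity, expressed in the ubiquitous language
    invoice_id: str
    amount_cents: int
    paid: bool = False


class InvoiceRepository(ABC):  # port: exactly what the domain needs, nothing more
    @abstractmethod
    def get(self, invoice_id: str) -> Invoice: ...

    @abstractmethod
    def save(self, invoice: Invoice) -> None: ...


class PayInvoice:  # application service: pure domain logic, no infrastructure
    def __init__(self, invoices: InvoiceRepository):
        self.invoices = invoices

    def __call__(self, invoice_id: str) -> Invoice:
        invoice = self.invoices.get(invoice_id)
        invoice.paid = True
        self.invoices.save(invoice)
        return invoice


class InMemoryInvoiceRepository(InvoiceRepository):  # adapter (could be SQL, REST, ...)
    def __init__(self):
        self._store: dict[str, Invoice] = {}

    def get(self, invoice_id: str) -> Invoice:
        return self._store[invoice_id]

    def save(self, invoice: Invoice) -> None:
        self._store[invoice.invoice_id] = invoice


# Usage: the same use case runs against any adapter that honours the port.
repo = InMemoryInvoiceRepository()
repo.save(Invoice("INV-1", 4200))
assert PayInvoice(repo)("INV-1").paid
```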
If we take a business domain like the marine industry, we may have several different subdomains, and the idea of these slides is to show you that the discussion between the product owners, the business analysts and the developers may end up in different splits. We may have different bounded contexts depending on the kind of discussion we have, and especially on the kind of issue we want to tackle. So at some point we may have four different bounded contexts, or three; it depends on your business needs, obviously. Because all of those bounded contexts are part of the same business domain, they have to communicate, they cannot live alone. So communication also has to be consistent in terms of the models we use between all those bounded contexts, and we have patterns to enforce this consistency. For instance, an anti-corruption layer gives the consumer, represented here by context two, the guarantee that what it consumes from the provider, context one, will match its needs in terms of types and fields. Conversely, the provider can apply a logical layer of its own that allows it to create its own published language, so as not to pollute its inner ubiquitous language, and to keep the autonomy we want for each bounded context. In terms of organization, DDD requires you to have a unique team per bounded context. It is very important to have this unique team so that ownership is clearly stated: this specific team will be accountable for the quality of its bounded context, to provide SLAs and SLOs for instance, but also accountable for the global consistency, the global rules we have inside this business domain split into bounded contexts. We don't want all those bounded contexts to communicate each in its own way; we want them to communicate as parts of the same business domain. Here, for instance, team one is accountable for bounded contexts one and two. That is possible, it is a one-to-many relationship, but a given bounded context cannot have two accountable teams, and that is why the second team here is not in charge of the first context: team one already is.

So here we are: data mesh. What is, for God's sake, the relationship between DDD and data mesh? Well, it is kind of obvious, because instead of thinking operational world, analytics world, and then the ETL pipeline bridge, we apply the same mechanism we just saw in DDD. A multidisciplinary team discusses a business domain and starts subdividing it into bounded contexts. So in a way, the data domains of the data mesh could just as well be called data bounded contexts. And this mesh is made of nodes, represented by those bounded contexts or data domains, and edges, represented by the different communications between those domains. Let's not forget that each domain owns a given ubiquitous language but is not alone: it has to consume data coming from other domains in order to produce the analytics it needs. What happens inside each domain is up to you; in fact, inside a domain we can go back to the legacy way of thinking, with the operational world bridged to the analytics one through ETL, and this is what we usually observe. So data mesh is not saying that this approach is wrong; it is saying that we have to take a step back and think the way DDD tells us, in order to scale organizationally.
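To make the anti-corruption layer mentioned above concrete, here is a small, hypothetical Python sketch: the consuming context (context two) translates the provider's published language (context one) into its own internal model at the boundary, so a change in the provider's format only touches this translation layer. Field names and types are illustrative.

```python
# Hypothetical anti-corruption layer: the consumer (context two) translates the
# published language of the provider (context one) into its own internal model,
# so a change in the provider's format only touches this translation layer.
from dataclasses import dataclass

# What the provider publishes, e.g. as JSON over its API (field names are theirs).
provider_payload = {"cust_ref": "C-42", "full_name": "Ada Lovelace", "country_iso": "GB"}


@dataclass
class Customer:  # internal model of the consuming context (our ubiquitous language)
    customer_id: str
    display_name: str
    country: str


class CustomerAntiCorruptionLayer:
    """Translates the provider's published language into our own terms."""

    FIELD_MAP = {
        "cust_ref": "customer_id",
        "full_name": "display_name",
        "country_iso": "country",
    }

    def translate(self, payload: dict) -> Customer:
        missing = [theirs for theirs in self.FIELD_MAP if theirs not in payload]
        if missing:  # the provider broke the contract: fail loudly at the boundary
            raise ValueError(f"provider contract violated, missing fields: {missing}")
        return Customer(**{ours: payload[theirs] for theirs, ours in self.FIELD_MAP.items()})


customer = CustomerAntiCorruptionLayer().translate(provider_payload)
print(customer)  # Customer(customer_id='C-42', display_name='Ada Lovelace', country='GB')
```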
What we have to understand is that the data mesh is a sociotechnical concept which brings, above all, organizational scaling rather than technical scaling. The technical scaling is already solved, in my opinion: we have all the databases and data warehouses we need, and the pipelines scale with, for instance, Apache Beam, Spark and so on. So the problem is not there; it is about tackling an organizational issue. With the data mesh, as in domain-driven design, we have teams owning well-designed data domains, and we have to apply some pillars, where domain ownership is the main one, backed up by three others: data as a product, self-serve platform and federated computational governance. Domain ownership is what we described with DDD: stop thinking of your business domain as a monolith. You have to split it into bounded contexts, so that a team is in charge of each one, treats its data as a product, and provides SLAs, SLOs and a quality of service, through a self-serve platform which provides the technical assets you need, assets that will scale, especially if we consider managed services on the cloud. At the same time we still need federated computational governance, so that, for instance, we don't exceed API quotas, and we stay in line with the naming policy, with the communication rules between all those data domains, and so on. What we have to keep in mind, though, for the DevOps engineers who are listening, is that unlike the service mesh, the data mesh is not a purely technical concept. It is not: OK, I am on my cloud platform and I will install a data mesh. It does not work like that. You first have to think about your business domain and have a discussion between all the different roles you have, developers, data scientists, business analysts and so on, in order to make the different data domains emerge; activities such as event storming can help you do so. So like I said, it is a sociotechnical concept which solves an organizational scalability issue. So be cautious about solutions that sell themselves as data-mesh-ready. What does exist are enablers; but a solution telling you that you just have to put a coin in the machine and there it is, you have your data mesh, does not exist. A data mesh is a path toward better collaboration. It is not an end in itself; it is a means to reach better collaboration between your teams and a better understanding of your data. Where it shines is where your business domain is complex, where you have different subdomains and rich communication between entities: there the data mesh will shine. But if your business domain is simple enough, there is nothing wrong with the legacy approach, considering only the operational world, the analytics world, and the pipeline bridge in between; it is perfectly fine to act this way. But once you begin to have organizational issues, once you no longer understand what your business is and no longer have the right insights to make it evolve, maybe the data mesh is a solution. So far we have been talking about the data mesh from a theoretical point of view. Let's see what kind of implementation we can imagine, and I insist on imagine: like I said, there is no data-mesh-ready solution. So Charles, it's up to you.
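As an aside before the implementation part, here is a small, hypothetical sketch of what the federated computational governance pillar mentioned above can look like in practice: a "computational" policy check that a platform team could run automatically, for instance in CI, whenever a domain publishes a product to the catalog. The descriptor fields and the naming convention are purely illustrative.

```python
# Hypothetical "computational" governance check: validate that a data product
# descriptor follows the shared naming convention and declares an owner and a
# freshness SLO before it is accepted into the catalog (e.g. run in CI).
import re

NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*\.[a-z][a-z0-9_]*$")  # "<domain>.<product>"


def validate_product(descriptor: dict) -> list[str]:
    """Return the list of governance violations for a product descriptor."""
    errors = []
    if not NAME_PATTERN.match(descriptor.get("name", "")):
        errors.append("product name must follow the '<domain>.<product>' convention")
    if not descriptor.get("owner"):
        errors.append("every product must declare an owning team")
    if descriptor.get("freshness_slo_hours") is None:
        errors.append("every product must declare a freshness SLO")
    return errors


# Example descriptor, as a domain team might submit it (fields are illustrative).
violations = validate_product(
    {"name": "sales.daily_revenue", "owner": "sales-data-team", "freshness_slo_hours": 24}
)
assert violations == []
```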
So let's dig into the catalog. The catalog is the place where every domain can publish its own data products. By that we mean the data that each domain collects, stores and wants to make available for other domains. See it like a marketplace: a catalog where every producer of data, that is, a data domain, can push and make available a product, which is an aggregated, formatted set of data that subscribers can consume. So in the catalog we find a place where the data domains, let's call them producers, can push their own products, and where subscribers, people outside the domain, can subscribe to those products and start pulling them. Each data owner is in charge of describing its product and defining a few parameters. Those products have a set of characteristics which basically define what a product is: you will of course find the schema, but also information related to the API from which you pull the data, the refresh frequency, and any other information the producer finds relevant. This helps people from outside, the subscribers, pull the data correctly and automate the pulling phases. We can also consider that subscribers build their own products based on existing ones. How does it work? You open a contract, subscribe to a product, and from there build your own set of data based on it, enrich it, and build your own product on top of it. This means that you create a product from another product by aggregating and transforming its data. This is definitely doable and needs to be captured in the product characteristics, stating that this product is based on that one and that we apply transformations to the first product. How it works, we will dig into in the third chapter.

Now let's talk about the orchestrator. The orchestrator is in charge of managing the data from the moment we pull it from the data sources until it becomes available for pulling by the subscribers. The orchestrator also monitors all stages of the data pipeline: how the data is ingested, whether all the data has been ingested correctly, whether the data is transformed correctly based on the transformation description in the catalog, and whether the data is correctly loaded into the data stores. From that moment, the orchestrator works with the catalog to make sure that the state of the product is correctly updated, saying for example that the last refresh time was on 22 March 2023. The orchestrator can also provide an administration panel. This is helpful when you want to debug and see what is happening in the pipeline. For example, you just noticed that a product has not been refreshed as it should, and you want to see what is happening: the administration panel shows you which job is currently running and maybe why it is taking so much time. We can, for example, notice that one of our data sources is taking much longer to serve the data. If the usage of the data mesh grows within the organization, it can become hard to debug and to see the state of all the jobs at any given time. So having an administration panel that lets you see graphically, through a graphical interface, what is happening within the data mesh infrastructure can save you a lot of time when debugging.
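As a minimal sketch of the orchestrator role just described, assuming a hypothetical catalog client with get and set_state methods, the loop below runs the pipeline stages for one product, stops at the first failing stage, and reports the refresh state back to the catalog so subscribers can rely on it. The stage functions are placeholders.

```python
# Minimal orchestrator sketch (hypothetical catalog client with get/set_state):
# run the pipeline stages for one product, stop at the first failure, and report
# the refresh state back to the catalog so subscribers can rely on it.
from datetime import datetime, timezone


def ingest(product): ...     # pull the data from the declared sources into the cache
def transform(product): ...  # apply the transformations described in the catalog
def load(product): ...       # load the result into the target data store


class Orchestrator:
    def __init__(self, catalog):
        self.catalog = catalog  # hypothetical catalog client
        self.stages = [ingest, transform, load]

    def refresh(self, product_name: str) -> bool:
        product = self.catalog.get(product_name)
        for stage in self.stages:
            try:
                stage(product)  # each stage is monitored individually...
            except Exception as exc:  # ...and a failure is surfaced to the catalog
                self.catalog.set_state(
                    product_name, status="failed", stage=stage.__name__, error=str(exc)
                )
                return False
        # All stages succeeded: record the last refresh time for the subscribers.
        self.catalog.set_state(
            product_name,
            status="fresh",
            refreshed_at=datetime.now(timezone.utc).isoformat(),
        )
        return True
```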
Now let's dig into the kind of architecture that can be built to host those services. Here we are: this is a zoom on a data domain. On the right-hand side you see the data sources. Those data sources are basically sets of data: databases, another application, CSV files, whatever you want, and they are used as sources for our products. Let's start with the catalog. The catalog is in charge of defining the products and, based on those characteristics and parameters, the transformations to apply, the orchestrator is in charge of making each product available. So we start by ingesting the data from the data sources; it can come from one to N sources. Once downloaded, the data is pushed to a cache. Storing the data in this cache avoids re-downloading everything from the sources if there is any issue with later operations such as the transformation. For the transformation, here we use Spark with EMR on AWS to run all those transformations, basically the ones defined in the catalog. Once all those transformations are done, the data is pushed to S3, Redshift or Aurora PostgreSQL. Why propose these three data stores? Because of the differences we can have in terms of data complexity or amount of data. For example, S3 is very useful if we have a large amount of data, but Redshift also allows us to use SQL queries; so Redshift can be very useful with a large amount of data when the exploration application uses, for example, a JDBC driver and wants to run SQL queries against the data store. In the same way, Aurora PostgreSQL can be very useful if the concurrency is very high. We all know that Redshift is a very powerful tool, but concurrency is hard to deal with on that kind of data store. Aurora PostgreSQL allows us to be very efficient in terms of queries on a large amount of data, gives us a very high amount of IOPS, and can handle a very high number of concurrent queries thanks to two main features: read auto scaling, of course, and the fact that we can use very big instances. Last but not least, the exploration application. This application is in charge of retrieving the data efficiently from our data stores and making all that data, the product, available to all the subscribers. It needs to control the way it retrieves the data so as not to put too much pressure on the data stores and not impact the other subscribers, and it needs to be intelligent enough to load balance, shard or optimize consumer queries. This is what a data domain can put in place on AWS, for example, to build its part of the data mesh. Here we can use containers within ECS, for example; the use of containers is recommended, as a few of those operations may need data locally within the container or may run for a long time. So I would recommend using containers here rather than Lambda. Lambda can be used, for instance, to run the orchestrator or the catalog, based on DynamoDB, Lambda and API Gateway, basically a serverless stack. So these are the building blocks we can assemble to make our products available within our data domain.
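As one possible way to wire the pieces above together, here is a hedged boto3 sketch showing how a domain's orchestrator might submit the catalog-defined transformation as a Spark step on an existing EMR cluster and then poll its status for the administration panel. The cluster id, bucket names and script path are placeholders.

```python
# Hedged sketch: submit the catalog-defined transformation as a Spark step on an
# existing EMR cluster, then poll its status, as a domain orchestrator might do.
# Cluster id, bucket names and script path are placeholders.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",  # the domain's EMR cluster (placeholder id)
    Steps=[
        {
            "Name": "transform-daily-revenue",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # lets EMR run an arbitrary command
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "s3://example-domain-artifacts/jobs/transform_daily_revenue.py",
                    "--source", "s3://example-domain-cache/orders/",
                    "--target", "s3://example-domain-products/daily_revenue/",
                ],
            },
        }
    ],
)
step_id = response["StepIds"][0]

# The orchestrator can poll the step state to feed its administration panel.
state = emr.describe_step(ClusterId="j-EXAMPLECLUSTER", StepId=step_id)["Step"]["Status"]["State"]
print(step_id, state)
```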
Now I will hand back to Ismael, who will introduce a Google product that aims to provide all those services and manage everything under the hood on the Google side. Thank you, Charles. Before concluding, I want to give you a quick overview of a product called GCP Dataplex. As we saw, when we talk about data-mesh-ready products we have to be very cautious, because the real problem is not technical but organizational. It requires you to think about your business domain more than to buy yet another product. But we also saw that there are enablers which help you implement the pillars of domain ownership, data as a product, self-serve platform and federated computational governance. We think that GCP Dataplex is a good example of such a product, because it offers a logical layer that federates the different existing services on GCP, such as BigQuery, Dataflow, Cloud Storage and so on, in order to give you a sense of what a data mesh should be. If we look at this logical layer, we have a lake, which in fact represents the data domain and which relies on a certain number of services, such as the Data Catalog, which stores metadata related to the different data you store and compute on, and of course Google Cloud IAM, which gives you the federated computational governance over the different GCP assets, such as BigQuery or Cloud Storage. Another point is that the lake is divided into zones, which represent a logical separation of your data. We can see them as packages, if we reason in programming-language terms, and each zone is attached to different kinds of assets, depending on what the team associated with the zone wants to do. Each asset benefits by design from technical metadata, such as the schema of BigQuery tables for instance, which is automatically reported to the lake. It is interesting to observe that the different assets we link to the lake are not necessarily part of the same GCP project. In fact, we can see a Dataplex lake as the same kind of abstraction as a GCP project, but for data: where the GCP project abstracts billing and API quotas, the lake abstracts the data domain through this federated computational governance, which applies per lake rather than per project. Let's not forget either that a given lake, representing one data domain, is not enough: we also have other lakes representing other data domains, and as we saw, in the end they will be able to communicate according to the permissions we set in the federated computational governance layer, and of course according to the communication needs between those domains.
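As a hedged sketch of this lake, zone and asset hierarchy, assuming the google-cloud-dataplex Python client library, the snippet below creates a lake for one data domain, a raw zone inside it, and attaches an existing Cloud Storage bucket as an asset. Project, location and resource names are placeholders, and the exact field names should be checked against the client library reference.

```python
# Hedged sketch, assuming the google-cloud-dataplex client library: create a lake
# (one data domain), a raw zone inside it, and attach an existing Cloud Storage
# bucket as an asset. Names are placeholders; check exact fields against the docs.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
parent = "projects/example-project/locations/europe-west1"

# The lake represents one data domain.
lake = client.create_lake(
    parent=parent,
    lake_id="sales-domain",
    lake=dataplex_v1.Lake(display_name="Sales domain"),
).result()

# Zones are the logical separation inside the domain (e.g. raw vs curated data).
zone = client.create_zone(
    parent=lake.name,
    zone_id="raw",
    zone=dataplex_v1.Zone(
        type_=dataplex_v1.Zone.Type.RAW,
        resource_spec=dataplex_v1.Zone.ResourceSpec(
            location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
        ),
    ),
).result()

# Assets are pointers to existing GCP resources (Cloud Storage bucket, BigQuery dataset).
asset = client.create_asset(
    parent=zone.name,
    asset_id="sales-raw-bucket",
    asset=dataplex_v1.Asset(
        resource_spec=dataplex_v1.Asset.ResourceSpec(
            name="projects/example-project/buckets/example-sales-raw",
            type_=dataplex_v1.Asset.ResourceSpec.Type.STORAGE_BUCKET,
        ),
    ),
).result()
print(asset.name)
```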
Let's see a short illustration. Here I am on the GCP console; obviously, when we consider Dataplex, we have to be familiar with the Google environment. In the Manage section we find the different lakes that represent our data domains. We can create a new one if we have the right permissions, and inside each lake we can perform a certain number of actions. For instance, we can federate the permissions on the different zones of this lake and grant access to a specific zone. Here we have three different zones, in one of which I created two different assets. If I go inside the zone, I can see my different assets; I can also create and delete assets, and of course set permissions on the zone itself as well as on each asset. Here my two assets are a BigQuery dataset and a Cloud Storage bucket. In Dataplex, those are the main asset types. That does not mean you cannot reach other kinds of data, but it has to go through those pillar assets, since BigQuery, for instance, lets you query external data such as blob objects or even external databases (see BigQuery Omni). From the Manage section I cannot access the data itself; that goes through the catalog, the data cataloging feature of Dataplex, which is a kind of search engine relying on the metadata: the technical metadata, of course, the names of my schemas, the column names and types, but also the business metadata, and we will see how to provide those to Dataplex. Here I can see that I can indeed access my assets, and I have a number of filters that allow me to add more criteria to my search. If I open an asset, I get different kinds of technical metadata about it. I can access the schema, and we can see that I can associate columns with business terms. Those are the business metadata I was talking about, which are fed by a glossary that explains the data domain we are working on. This is very important because it gives new users context about the kind of business we are working on. We can see that here I created a people domain that I documented, and finally I created a new term with a definition, on which I can create relationships: here, a commercial is related to a customer by this definition. I added the link, and I can also add a steward, a kind of owner of this definition, so that anyone with a question about this notion is able to contact the right person. If we go back to the assets which consume those terms, I am able to search the catalog for customer, for instance, and see that the assets associated with this notion are brought back by the search engine. This is very interesting in terms of data exploration and of data and business understanding.

Of course, the goal of my data lake is not just to expose my data so the business domain can be understood, but also to apply transformations to it and get quality insights on it. The Process section allows you to create tasks on the different data that you consume and store inside the assets we just saw. Those processes are backed by Google services that you do not have to build yourself and that are offered through templates for the most common tasks, but keep in mind that you can also provide your own business logic, through Dataflow pipelines for instance. You also have the ability to define specific processing to get more insight into the quality of your data. This feature is still in preview, but it relies on a dedicated data quality project from Google which lets you express the different rules you want to apply to your data through a YAML file, and it is quite interesting in terms of the possibilities it offers. And of course we also have the Secure section, which gives you a broad overview of the accesses you grant on the lake, but also on the zones and the assets associated with them. So, as we saw, this product is more of an abstraction layer than a service like BigQuery: it gives you pointers to different assets, BigQuery datasets and Cloud Storage buckets, to federate the different processes you apply to your data, but also the different permissions you apply to it. Last but not least, it also enables the exploration of your entire data estate, with elements that are technical, the type of data you are looking for, but also related to the business, thanks to the glossary we just saw. That will conclude this presentation. Thank you for your attention, and if you have any questions, feel free to join us in the chat. See you.
...

Ismael Hommani

Cloud Native Developer & GDE & Manager & Podcast Host @ Wescale

Ismael Hommani's LinkedIn account Ismael Hommani's twitter account

Charles-Alban Benezech

Cloud Solutions Architect @ Wescale

Charles-Alban Benezech's LinkedIn account


