Conf42 Cloud Native 2023 - Online

Automatically shard and scale-out your traditional databases on Kubernetes for true digital transformation

Abstract

How do you migrate your databases to the cloud or Kubernetes? How do you build a distributed, sharded database system on top of your existing monolithic databases? Find a solution to go cloud native and tackle the pain points and difficulties.

Summary

  • Trista Pan explains how to upgrade your existing traditional databases into a distributed system, so you can leverage the distributed database system for higher performance. She also posts articles about open source on Twitter.
  • The talk covers today's data scenarios, then dives into distributed database systems: how to make your favorite PostgreSQL, MySQL, SQL Server, or Oracle become a distributed one. Time permitting, it ends with a demo.
  • For SphereEx, for big data and databases, the new scenarios bring a new definition of big data characteristics: the five V's of big data, namely value, veracity, volume, velocity, and variety.
  • No single database or product can solve every issue, so generally speaking the scenarios are split into two. One is OLTP: databases and products that handle large volumes of transactional data. Why do people want to adopt new solutions at this point?
  • Use Apache ShardingSphere plus your existing favorite MySQL, PostgreSQL, or Oracle. The solution leverages your existing database cluster and helps you plan how to move your databases to the cloud.
  • ShardingSphere-JDBC is a lightweight JDBC driver for your Java application; ShardingSphere-Proxy is deployed between your application and your databases. ShardingSphere can handle all of the encryption and decryption work, and thanks to the compute-storage splitting architecture it can be deployed anywhere.
  • A sharding table is not a single, standard table in a PostgreSQL database; it is made up of many shards, subtables, or physical tables. The demo walks through the following steps, which you can reproduce yourself.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello guys, this is Trista. I'm very happy to give this talk about how to upgrade your existing traditional databases, for example MySQL or PostgreSQL, to become distributed ones, so that you can leverage a distributed database system to get higher performance, make your queries faster, and get an out-of-the-box deployment solution. So let's get started. A little about myself: I'm Trista, co-founder and CTO at SphereEx. My area is distributed systems around databases and automatic database management platforms. I spend my time on technology and also in some open source foundations, for example the Apache Software Foundation, where I'm an Apache member and incubator mentor, helping give tips to open source projects and their communities. Sometimes I post articles about open source, open source monetization, databases, and cloud databases on my Twitter and LinkedIn, so if you're interested in such topics, you're welcome to have a look there.

Today we will cover the following topics. First, I will describe the current scenarios: we have new technologies and new needs, so we need to make some changes to our database infrastructure. Second, we will dive into distributed database systems and look at what is happening there right now. Then I will give some new ideas and solutions, as I said at the beginning, for how to make your favorite PostgreSQL, MySQL, SQL Server, or Oracle become a distributed system. And then, if we have time, I will introduce my demo so you can learn it and deploy it step by step.

First, the new scenarios. In this new era we have new technologies, and all of them are good features that provide value for the business, for your commercial side, for your customers. For SphereEx, for big data and databases, all the new scenarios give us a new definition of big data characteristics: the five V's of big data, which are value, veracity, volume, velocity, and variety. I will give a brief introduction to each of them. The first one is value: we use AI or big data architectures and solutions to do statistics, so we can learn more about our business and give the product team useful feedback. The second one is veracity: we need to guarantee that the data we collect is correct. If we use that data to build models for BI work, it has to be trustworthy; if we feed wrong data to AI, we will get wrong results and guide our products or companies in the wrong direction. The next one is volume: on the Internet, people create data every day, every minute, every second, so we need to manage a tremendous data size. Then there is velocity: users and other departments want query results as quickly as possible, so they want real-time or near-real-time processing of data streams. The last part is variety.
Variety means your databases or data stores need to help you manage both structured and unstructured data. Those are the main big data characteristics, and based on them we need to solve the issues with new database architectures, solutions, or products. But before introducing them, I want to give a short introduction to OLTP and OLAP, because we cannot use one single database or product to solve all of the issues, or to meet all of the requirements from the previous slide. Generally speaking, we split the scenarios into two. The first one focuses on OLAP: processing large amounts of data quickly to do analytics. The second one is OLTP: these databases and products handle large volumes of transactional data involving multiple users. When we speak of MySQL, PostgreSQL, or Oracle, they all belong to this type of database. Today I will pick the OLTP area and give my solution and idea; if you are interested in OLAP, maybe later I can give another talk around that topic.

Now, I know people love MySQL or PostgreSQL, and some companies buy Oracle databases. But for all of these monolithic databases, it is too hard to manage structured data at TB or PB scale; the data size is just tremendous. So people consider distributed databases or NewSQL databases, for example CockroachDB, MongoDB, Aurora, or similar database solutions. But we need to ask why people want to adopt the new solutions at this point. As I said before, monolithic databases struggle to manage all that structured data at PB or TB scale. And there is another parallel: at the beginning, services were single applications deployed on one monolithic machine, but now they are split into microservices or serverless functions, and people use Kubernetes to manage all of these services. The services have become distributed: split, sharded. Your servers evolved the same way: at the beginning, maybe one server did all the work, but later you found that with the cloud, with pods, containers, and Docker, it is easy to manage everything and to scale in or scale out; it makes the system, the servers, the whole infrastructure distributed and elastically scalable. That's the power of distribution. And how about databases, especially big databases? People now apply the same idea: distributing, sharding, splitting the architecture, turning one single database into a distributed cluster, so the cluster can manage tremendous, enormous amounts of data and scale in and out elastically based on your needs, so you can choose a pay-as-you-go model. So at this point, you may need a distributed database for certain necessary cases.
But I know that five or ten years ago most of you were strong supporters of PostgreSQL or MySQL, and now you have to consider upgrading all of this. Of course, you could simply choose a distributed database from the start, but I guess most of us are still in the middle of that evolution, so how do we make our databases become distributed? Today I want to share a different solution that lets you own a distributed system based on your existing PostgreSQL or MySQL cluster, and gives that distributed PostgreSQL cluster elastic scale-in/scale-out and traffic governance features such as read/write splitting. How do we do that? When we speak of distributed databases, most of them adopt a compute-storage splitting architecture: the distributed system is made up of two important parts, the computing nodes and the storage nodes. Based on this idea, if we want to make our monolithic PostgreSQL become distributed, we can regard PostgreSQL or MySQL as the storage nodes: each PostgreSQL instance is one storage node, and the storage nodes can do some local computing work and persist your data. So to create this distributed system, we still need the global computing nodes. Who can play that role? Later I will introduce Apache ShardingSphere; it can work as the computing nodes. So with Apache ShardingSphere plus your existing favorite MySQL, PostgreSQL, or Oracle, you can have a distributed system.

Before explaining what ShardingSphere is, let me list some benefits of this solution. First, it leverages your existing database cluster, so you don't need to overturn your whole infrastructure. Second, you can keep using the mature PostgreSQL or MySQL ecosystem for deployment and maintenance; you have a lot of mature tools, and your DBAs can handle all of that work. Third, it adds features such as SQL audit, traffic governance, and elastic scaling to the cluster. Next, it helps you plan how to move your databases to the cloud: because of the storage and computing splitting architecture, your PostgreSQL instances working as storage nodes can stay on premise, while Apache ShardingSphere works as the computing nodes and can be deployed on Kubernetes or in the cloud. That means you can use this solution to ease the headache of moving databases to the cloud. And finally, it provides out-of-the-box deployment; I guess most projects provide such a feature, so it's not the highlighted one.
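To make the storage-node idea concrete, here is a minimal DistSQL sketch, run against a ShardingSphere proxy, that registers two existing PostgreSQL instances as the storage nodes of the distributed system. The host names, database names, and credentials are placeholders, and the exact keywords vary by ShardingSphere version (older releases use ADD RESOURCE instead of REGISTER STORAGE UNIT):

    -- Placeholder hosts and credentials; register two PostgreSQL instances
    -- as storage units (the storage nodes of the distributed cluster).
    REGISTER STORAGE UNIT ds_0 (
        URL="jdbc:postgresql://pg-node-0:5432/demo_ds_0",
        USER="postgres",
        PASSWORD="example"
    );

    REGISTER STORAGE UNIT ds_1 (
        URL="jdbc:postgresql://pg-node-1:5432/demo_ds_1",
        USER="postgres",
        PASSWORD="example"
    );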
So let's move to the next page, about ShardingSphere itself. Before introducing the features of this project, I just want to say that it has a big open source community. It is an Apache Top-Level Project, it has been open source for more than six years, people love it, there are many contributors and committers active there, and it has shipped more than fifty releases. So you don't need to worry that it's a brand new project. People often worry that with a new project they cannot get help from the community when they have questions, or they worry about quality and stability, but those issues don't exist here, because it's a big community: many people and companies run this project in their production environments, and it provides friendly documentation for you to learn more about it.

Now for the features. Generally speaking, ShardingSphere provides two clients for you to choose from. The first one, ShardingSphere-JDBC, is a lightweight JDBC driver for your Java application. It's a framework, so you can use the native Java interface, and it provides the following functions. The first is data sharding: sharding your data across different database instances, turning them into a distributed database. The second is distributed transactions, and also read/write splitting: if your databases have many replicas and you want to improve the QPS or TPS of the system, you have to consider how to leverage all of those replicas, for example by sending some of the select queries to them. At the beginning you might consider a load balancer, but if you use ShardingSphere, no matter whether it's the JDBC client or the proxy, it does the load balancing automatically: it sends the update or delete queries to the primary instance and the select queries to the replicas, and it can use, for example, a random or round-robin strategy to balance the load across your replica instances. It can also do data masking and data encryption, which I think are very valuable features, because you want the data in your databases to be encrypted, but how do you do that? With ShardingSphere, as you can see here, it sits between your application and your databases, so all of your queries first reach ShardingSphere, which automatically does the data masking, encryption, and decryption and sends encrypted data to your databases. You don't have to worry that your data sits in plain text in the databases, because ShardingSphere handles all of the encryption and decryption work.

The other product is ShardingSphere-Proxy; as the name hints, it's a database proxy, and it provides all the features I mentioned earlier. So that's ShardingSphere-JDBC and ShardingSphere-Proxy and the features they give you. If you use ShardingSphere-JDBC, because it implements the native JDBC interface, it works no matter whether your database is PostgreSQL, MySQL, SQL Server, or Oracle. The last part: if you use ShardingSphere-Proxy, your deployment architecture becomes the right-hand column on the slide. At the beginning, your application had to manage its database connections and data sources by itself; now ShardingSphere-Proxy manages all of the database instances and connections for you, does the data sharding, and does the data masking and encryption, while your application just focuses on itself and uses this distributed database system.
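As an illustration of the read/write splitting feature described above, here is a minimal DistSQL sketch. The rule and storage-unit names are hypothetical, and the syntax follows recent ShardingSphere 5.x releases, so check the documentation for your version:

    -- Writes go to the primary storage unit; select queries are balanced
    -- randomly across the two replica storage units.
    CREATE READWRITE_SPLITTING RULE rw_group_0 (
        WRITE_STORAGE_UNIT=primary_ds,
        READ_STORAGE_UNITS(replica_ds_0, replica_ds_1),
        TYPE(NAME="random")
    );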
And another question: what if my application lives on Kubernetes? In that case, you can just use our ShardingSphere-on-Cloud project with its Helm charts: one command deploys a ShardingSphere-Proxy cluster in your Kubernetes cluster, then you let your application living in Kubernetes connect to the ShardingSphere-Proxy, and the proxy manages all of the databases and does the data sharding. No matter whether your databases are on the cloud, on premise, inside the Kubernetes cluster, or on RDS, as long as ShardingSphere can connect to them, you can solve all of these issues. That is thanks to the computing and storage splitting architecture: your databases, your storage nodes, can be located anywhere.

In the next part, I will use the ShardingSphere-on-Cloud project for the demo: I will use Helm to deploy the ShardingSphere cluster and a PostgreSQL chart to deploy PostgreSQL on Kubernetes. If you already have your RDS, you can skip that part and just use the ShardingSphere operator chart to deploy ShardingSphere in Kubernetes and let it connect to your RDS; it will then manage the connections and do the data sharding. After the deployment work, the second step is to create your databases and your sharding table. A sharding table is not a single, standard table in one PostgreSQL database: it is made up of many shards, many subtables, many physical tables, and these subtables are located in different PostgreSQL instances. But your application only visits the ShardingSphere-Proxy and thinks there is a single logical table named t_user, while this t_user actually consists of four subtables spread across two PostgreSQL instances. Here, each PostgreSQL cluster has a primary and a replica, so ShardingSphere will send queries to the primary or the replica within that cluster. And if it finds that the requested rows live in cluster two, I mean group two, it will send the query to your PostgreSQL cluster two after finishing the SQL parsing and routing work. To define a sharding table, we use DistSQL, ShardingSphere's own distributed SQL dialect, similar to your MySQL or PostgreSQL dialect. Before, you could just use CREATE TABLE to create a single table; here, you use a similar sharding-aware statement to create the sharding table. In this example the sharding column is order_id, and a hash-mod strategy splits the single table into four subtables, four shards. Once this is done, your application just talks to one of the ShardingSphere-Proxy instances, and ShardingSphere does all of the internal work: it uses the sharding algorithm and strategy to locate the exact table where your expected record lives, and if it sees a select query, it also leverages the replicas and sends the query to one of them. Therefore you will find that the QPS of this new database system becomes higher, because it automatically does read/write splitting and similar optimizations.
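Here is a minimal DistSQL sketch of such a sharding rule, reusing the hypothetical storage units registered earlier; the table and column names are illustrative, and the exact syntax depends on your ShardingSphere version:

    -- Shard the logical table t_user by order_id into four shards (hash_mod),
    -- spread across the two registered storage units.
    CREATE SHARDING TABLE RULE t_user (
        STORAGE_UNITS(ds_0, ds_1),
        SHARDING_COLUMN=order_id,
        TYPE(NAME="hash_mod", PROPERTIES("sharding-count"="4"))
    );

    -- The application then queries the logical table as if it were a normal one;
    -- the proxy routes each statement to the right shard and replica.
    INSERT INTO t_user (user_id, order_id, status) VALUES (1, 100, 'OK');
    SELECT * FROM t_user WHERE order_id = 100;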
So the demo includes the following steps, and I have done it myself, but I don't have time to go through everything here. If you have these slides, you can follow my guide and do it by yourself: first, deploy the cluster using the ShardingSphere-Proxy charts; then use DistSQL to create the sharding database and sharding table, so that ShardingSphere is aware of your databases; and then run some queries to test whether everything works. All right, that's all for my talk. If you have any questions, we can talk later, or you can just send me a message on LinkedIn, GitHub, or Twitter. See you later.
...

Trista Pan

Co-Founder & CTO @ SphereEx



