Conf42 Cloud Native 2024 - Online

Architectural Best Practices for Large-Scale Data Systems


Abstract

In the ever-evolving landscape of large-scale data systems, ensuring robust security alongside efficient data manipulation, retrieval, and system resilience is paramount. This abstract encapsulates a comprehensive exploration of critical aspects: Security, Pagination and Filtering, Caching, Error Handling, and Resilience.

Summary

  • Today I'll be talking about some of the architectural best practices for large-scale data systems. What is storage architecture? Ultimately, the data has to be stored in the form of a data structure, and which storage to pick depends entirely on that data structure and its backend implementation.
  • B-trees are hierarchically organized, which makes them efficient for lookups, i.e. reads: a query descends only the relevant subtrees and prunes the irrelevant branches. Why disk? Because disk-based storage is always cheaper than SSDs.
  • Some databases use quadtrees to represent spatial data. The top node represents the entire map, like a two-dimensional map. Each node has a value, and the higher the value, the more important the place. That's why you should consider quadtrees for any proximity-style service.
  • LSM trees are the data structure used in storage systems for write-heavy applications. Incoming writes are first buffered in a small in-memory structure, a memtable; once it is full, the data is asynchronously flushed to disk.
  • An inverted index is also used to speed up search in a database. You create a map from each keyword to references, say document IDs, and with an ID you fetch the content; it exists for faster search in the backend or the database.
  • Data in large-scale, data-intensive systems is always distributed. Each partition carries something called a secondary index, and querying across multiple partitions is known as scatter and gather. For this particular use case, what we need is not a local but a global secondary index.
  • Multi-leader replication is needed in distributed systems for several reasons, first and foremost high availability. Another big one is write scalability: it enables the system to handle large volumes of write operations by parallelizing writes across multiple nodes.
  • Replication involves copying data from one data center to another asynchronously so that they all stay consistent. How can such conflicts be resolved during multi-leader data replication? We need to use specific data types called conflict-free replicated data types.
  • When considering the end-to-end design of a system, there are two approaches: inside-out and outside-in. An inside-out example is a monolith-to-microservices decomposition; an outside-in example is product-driven or user-driven design.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, viewers. I'm Santosh, a software engineering lead at ByteDance in the San Francisco Bay Area. Today I'll be talking about some of the architectural best practices for large-scale data systems. So what are these large-scale data systems? They're things we use on a daily basis: Gmail, Yelp, Eventbrite, booking.com, digital wallets, e-commerce systems, Google Docs. I'll be going over some of the trade-offs you need to consider when making choices while architecting these systems. First, storage architecture. What is storage architecture? Let's start with an example. I mentioned users accessing Gmail, Yelp, booking.com, or any of these services. A request goes through a load balancer, then an HTTP API call to a gateway, then to the microservices; all of these things happen in the backend system. Ultimately, all the data is written to some place, and the data is also read back when you retrieve it on GET API calls. That place where all the data is stored, the blue box in the diagram I'm showing, is the storage layer, and the storage layer has different types of databases. In the current world we use a lot of NoSQL and MySQL. Whether data moves from MySQL to NoSQL, or is written directly into a NoSQL store like Cassandra or Redis, is a totally different topic which I won't be covering here; MySQL is a typical relational database management system. In this storage space we already have a lot of options today, so you don't need to implement a storage system from scratch. What we do need to know are some technical details of how data in the storage system is saved and processed, so that we, as software architects, can make the right decisions when building a data-intensive system. There are two needs: select the apt storage engine based on the trade-offs, and then fine-tune that engine to perform well for our system's workload. For both, we need to understand the underlying implementation of the storage layer. Now let's look at some options. Ultimately, the storage layer I just talked about stores data in the form of some data structure. When you say MySQL or Cassandra or any database, that database is storing your data, your Gmail messages, the restaurants in the Yelp app, the places on Google Maps, in some data structure in the backend. That's what we'll focus on: the different types, because which storage you pick depends entirely on the data structure and the backend implementation of that structure. Now let's look at some examples. First, B-trees. This is a very classic data structure with a parent and children branching off it, and I'll go into each data structure in the upcoming sections.
Most relational database management systems use B-trees, and some examples of backend systems built on B-tree-based databases are Gmail, Yahoo Mail, or any mail system. There are other services too, but I picked email because it's familiar to all of us. The next ones are quadtrees. Think of services like Yelp, where you look for nearby restaurants; Google Maps, where you look around and search nearby places; or booking.com, where you select a region and look for hotels or anything you want to book. In the backend you're getting some data to view, so where is that data coming from? It comes from a data structure called a quadtree, and why quadtrees are used here, and which data structure you would pick if you were building a new system altogether, is something that should be running through your mind while I give these details. The third type of data structure is LSM trees. They're used for write-intensive applications because of a very unique design where data is first written into a temporary in-memory buffer and then flushed into secondary storage like a disk. For example, RocksDB is a database that implements log-structured merge trees in its backend, and I'll go into the details in the upcoming slides. A digital wallet is one service that uses them because of the heavy writes, and of course the Hadoop Distributed File System and Kafka also have write-intensive workloads, so they use LSM trees in their backend storage. The inverted index is another concept, although not strictly a data structure, that you need to consider when selecting and configuring storage systems. In search engines like google.com or amazon.com, when you search for certain items you immediately get results; in the backend, it's the inverted index doing the magic. So whenever you build any search engine, an inverted index is crucial and needs to be incorporated in your system design. Now let's get into the details one by one, starting with B-trees. As you can see in the picture, this is how a B-tree is structured. At the top you have some numbers, and anything to the left of a number is a node containing values less than it: here, less than 100 are 48, 50 and 79, and between 100 and 155 we have the second node at the second level, with 128 and 140. Anything greater lies to the right of a node's value, anything less lies to the left, and anything in between, here between 100 and 155, sits in the middle subtree. This way the data is hierarchically well organized, which makes it efficient for lookups, i.e. reads. Say you want to look for an email in a particular time range, like March 10 to March 15: that data lies in only one part of the B-tree, so you don't have to search through the entire database when you query.
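To make that concrete, here is a minimal Java sketch of the range-query pattern. It uses an in-memory TreeMap as a stand-in for a sorted index (a real RDBMS keeps a B-tree on disk, but the lookup shape is the same), and the dates and subjects are made up for illustration.

```java
import java.time.LocalDate;
import java.util.TreeMap;

public class RangeQueryDemo {
    public static void main(String[] args) {
        // Sorted index keyed by date, analogous to a B-tree index on a timestamp column.
        TreeMap<LocalDate, String> emailIndex = new TreeMap<>();
        emailIndex.put(LocalDate.of(2024, 3, 8),  "Weekly report");
        emailIndex.put(LocalDate.of(2024, 3, 12), "Design review notes");
        emailIndex.put(LocalDate.of(2024, 3, 14), "Oncall handoff");
        emailIndex.put(LocalDate.of(2024, 3, 20), "Quarterly planning");

        // Range query: emails from March 10 to March 15, inclusive.
        // Only the relevant part of the tree is visited; other branches are pruned.
        emailIndex.subMap(LocalDate.of(2024, 3, 10), true,
                          LocalDate.of(2024, 3, 15), true)
                  .forEach((date, subject) -> System.out.println(date + "  " + subject));
    }
}
```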
A B-tree also doesn't do a lot of seeking. If you think about it, the data is ultimately stored on some kind of secondary storage, and this data structure lets us use something like a disk. Why disk? Because disk-based storage is always cheaper than SSDs. But disk I/O operations add a lot of latency, and to avoid that, the way the B-tree organizes data in the storage lets the RDBMS search only the right subtrees and prune the irrelevant branches, so you go straight to a specific node and there isn't much I/O happening. That means you're optimizing the disk usage, and of course you're getting the data back very quickly: the time complexity is log n, as for any balanced tree, as is proven mathematically. And as just mentioned, it supports range queries, which is one important thing: in Gmail or Yahoo Mail, looking for emails from a particular sender or time range is a very frequent operation, so querying is really easy. Read-heavy operations are really efficient with the B-tree, and cheaper as well with the disk-based storage optimization I just described. Now moving on to the next data structure: quadtrees. Some databases use quadtrees to represent data; I gave examples here, MongoDB and PostGIS. So what is a quadtree? Look at the picture on this slide. The top node represents the entire map, a two-dimensional map; it can be the whole globe or a small region. Let's keep it simple and imagine the top node is the entire 2D map of the globe. Every map has four quadrants: top-left is northwest, top-right is northeast, bottom-left is southwest, and bottom-right is southeast. Now, if you want to go to North America on the globe, it's in the northwest, so you go to node A. Inside node A, North America, there are a lot of places and states, and when you have to zoom into some place you drill down further from node A, just like nodes C and D here in the northeast quadrant, which we can imagine as, say, China and Japan. If C represents China and you want to look at certain places inside China, C drills down into four more child nodes, and you keep drilling until you reach whatever places you wanted to see. That way the whole quadtree is structured so that you can keep zooming in, or zoom out and look at the country level or continent level. One thing to note is that every node has a value, and the higher the value, the more important the place. For example, if you're looking for restaurants, the top restaurant always has a high value and sits at the top of the node for a particular region: say you zoom into North America at A, under A you have A1 representing California, and under A1 the top restaurants, like a Brazilian steakhouse, so that steakhouse is immediately the first child of A1. As for the retrieval algorithm, a breadth-first search is done when an area is selected, and all the nodes at the periphery of the selected map are picked, which means the top places are always retrieved first. The reason the data comes back so quickly is that the quadtree supports something called spatial indexing; it's just like regular indexing, but applied to coordinate data on a 2D map, as I just explained. And similar to B-trees, range queries only touch certain nodes rather than the entire tree, again with log n complexity, so it's really quick. The other important property is density. You can see in this picture that node A is left as-is with no children, while the second node, the northeast region, has four children, and the southeast region goes down to level three, all the way to nodes E and F. If a user only zooms superficially around the 2D map at a very high level, like a country level, you're not going to see too many details; but if they drill all the way down to a pinpoint location, the tree gets denser in that particular node, like the southeast in this example, and it can get as dense as it needs to be. If there are a lot of things to retrieve, a node can be very dense; if there aren't many places to show, it stays low-density. So it's quite adaptable and flexible in that aspect. That's why you should consider quadtrees for a proximity service, or any service that involves coordinates, spatial indexing, and anything with maps and places.
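Here is a minimal quadtree sketch in Java showing the same ideas: four-way subdivision into NW/NE/SW/SE, pruning during a range query, and density that adapts to the data. The class name, the capacity threshold, and the point format are all illustrative, not any particular database's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal point quadtree: each node covers a square region of a 2D map and
// splits into four children (NW, NE, SW, SE) once it holds too many points,
// so dense areas get deeper subtrees while sparse areas stay shallow.
class QuadTree {
    private static final int CAPACITY = 4;      // illustrative split threshold
    private final double cx, cy, half;          // center and half-width of this region
    private final List<double[]> points = new ArrayList<>();
    private QuadTree nw, ne, sw, se;

    QuadTree(double cx, double cy, double half) {
        this.cx = cx; this.cy = cy; this.half = half;
    }

    void insert(double x, double y) {
        if (nw == null) {
            if (points.size() < CAPACITY) { points.add(new double[]{x, y}); return; }
            subdivide();
        }
        child(x, y).insert(x, y);
    }

    // Range query: collect points inside the rectangle, pruning any quadrant
    // that cannot overlap it -- this is the "zoom into one region" behavior.
    void query(double minX, double minY, double maxX, double maxY, List<double[]> out) {
        if (cx + half < minX || cx - half > maxX || cy + half < minY || cy - half > maxY) return;
        for (double[] p : points)
            if (p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY) out.add(p);
        if (nw == null) return;
        nw.query(minX, minY, maxX, maxY, out); ne.query(minX, minY, maxX, maxY, out);
        sw.query(minX, minY, maxX, maxY, out); se.query(minX, minY, maxX, maxY, out);
    }

    private void subdivide() {
        double h = half / 2;
        nw = new QuadTree(cx - h, cy + h, h); ne = new QuadTree(cx + h, cy + h, h);
        sw = new QuadTree(cx - h, cy - h, h); se = new QuadTree(cx + h, cy - h, h);
        for (double[] p : points) child(p[0], p[1]).insert(p[0], p[1]);
        points.clear();                          // points now live in the children
    }

    private QuadTree child(double x, double y) {
        return x < cx ? (y < cy ? sw : nw) : (y < cy ? se : ne);
    }
}
```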
Now, the next one is very interesting: LSM trees. LSM trees are the data structure used in storage systems for write-heavy applications, say Kafka or the Hadoop Distributed File System, where lots and lots of data, like stream data, keeps coming in. If there's that much data coming in, you don't want to write every single request to secondary storage immediately, because each write involves I/O and it won't be efficient to go to the disk and come back every time. The more efficient way is to buffer it, as shown here: incoming writes are first buffered in a small in-memory structure, a memtable, which is itself backed by a binary-tree-like data structure, and once it's full, the data is asynchronously flushed to disk. Here, levels 0, 1, 2 and 3 are on disk, the secondary storage. When the buffer is full, the data is flushed to disk, and each of the small objects you see at level zero is a sorted string table, an SSTable: they're immutable, they can't be changed, and the data inside is sorted when it's written out. Now, when you have a lot of such individual SSTables, writing is good, we're just buffering in the memtable and flushing to disk, but what if someone wants to read? Are you going to write logic that scans over all these SSTables? That's not efficient. So to make the system read-efficient, something called compaction happens asynchronously in the background. What it means is that at level zero, two to four SSTables are clubbed together using the merge sort algorithm; since they're already sorted, we can merge multiple SSTables into one and flush the result to level one. So what you see at level one is ex-level-zero data that was merged and flushed down, and similarly level two is ex-level-one data, where all the SSTables at level one were merged again, it's a classic merge sort, you can look it up online, and flushed from level one to level two. That way the data keeps getting consolidated, and when all the sorted data is clubbed together in one place, it's easy to search. So you get read performance from compaction and write performance from the memtable buffer. Compaction is an important point to note here, and this is the very reason LSM trees are used for any write-heavy application. Another important thing is that LSM trees are very configurable: you can set the buffer size of the memtable, how frequently to flush data from the buffer to disk, and how large the compactions should be.
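The following toy Java sketch shows that write path under the same assumptions: a sorted in-memory memtable, immutable sorted runs standing in for on-disk SSTables, and a compaction step that merges runs. A real engine such as RocksDB adds write-ahead logging, bloom filters, and leveled compaction on top of this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy LSM write path: writes land in a sorted in-memory buffer (memtable);
// when it fills up, it is frozen into an immutable sorted run (an "SSTable").
class ToyLsmTree {
    private static final int MEMTABLE_LIMIT = 3;   // illustrative; real limits are megabytes
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> sstables = new ArrayList<>();

    void put(String key, String value) {
        memtable.put(key, value);                  // fast: no disk I/O per write
        if (memtable.size() >= MEMTABLE_LIMIT) flush();
    }

    // Reads check the memtable first, then the runs from newest to oldest.
    String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (int i = sstables.size() - 1; i >= 0; i--) {
            v = sstables.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    private void flush() {
        sstables.add(memtable);                    // stand-in for an async flush to disk
        memtable = new TreeMap<>();
    }

    // Compaction: merge all sorted runs into one, keeping the newest value per key,
    // so reads no longer have to scan many small runs.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> run : sstables) merged.putAll(run);
        sstables.clear();
        sstables.add(merged);
    }
}
```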
Now moving on to the last one. This is again not a data structure, but I mentioned the inverted index, so why an inverted index in a search engine? First of all, what is an index? An index is used to speed up search in a database: you create a map from a keyword to a reference, say the ID of a document, and with that ID you can fetch the content. It exists for faster search in the backend or the database. Now, what is an inverted index? Let's take an example. We have a lot of documents, and you can imagine these documents as the web links that appear when you search on google.com. When you search "who is the highest paid actor", you get a list of links, and you can imagine all those links as documents, like a web crawler would: document one is web link one, document two is web link two, document three is web link three. To build the inverted index, you take all the searchable words, in this black table I pulled words like "he", "says", "this", "a", "all", "brown", "day" out of the documents, and the value for each keyword is the set of documents that contain it. It's the exact opposite of saying "document one has these keywords and document two has those keywords"; instead, one keyword maps to documents one, two and three, another keyword maps to documents two, three and four, and so on. Now, when someone searches with the keywords "all brown day", what comes back is documents three, one and two, because based on the keywords you're fetching the matching documents, just like google.com: you search for the highest paid actors, all the keywords are looked up, and whichever documents, whichever web links, contain those keywords are retrieved and displayed. That's what the inverted index is about. So when you're designing or building a search engine, one of the most important storage architecture considerations is building an inverted index.
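Here is a minimal Java sketch of that structure, reusing the toy keywords from the slide; the documents and IDs are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Minimal inverted index: keyword -> set of document IDs containing it.
public class InvertedIndexDemo {
    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
                1, "all brown day",
                2, "he says this day",
                3, "a brown fox");

        // Build the index by tokenizing each document.
        Map<String, Set<Integer>> index = new HashMap<>();
        docs.forEach((id, text) -> {
            for (String word : text.toLowerCase().split("\\s+"))
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(id);
        });

        // Single-keyword lookup: documents containing "brown" -> [1, 3]
        System.out.println(index.getOrDefault("brown", Set.of()));

        // Multi-keyword query "all brown day": union of the posting lists -> [1, 2, 3]
        Set<Integer> result = new TreeSet<>();
        for (String word : "all brown day".split("\\s+"))
            result.addAll(index.getOrDefault(word, Set.of()));
        System.out.println(result);
    }
}
```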
Now moving on to the second part. We're still in the storage department, it's a bit tangential, but still in the storage world: partitioning the databases. Until now we talked about the data structure behind the storage, where the data is being stored. But you can't use just one database or one storage system; with these large-scale, data-intensive systems, the data is always distributed. We all know about primary indexes, but I'd like to focus on something called the secondary index: what it is, why we need to know about it, and which architectural best practices to consider to use the right kind of secondary index. Let's take an example, an online book management system, where a user is trying to find all the books by a particular author, I wrote it in red here, named David. The user sends "find all books by David", it goes to the microservice, and the request goes to the backend storage. You can see there's a primary index for fast retrieval of the data. Now, this is how the data is partitioned: you don't have one server or one backend storage holding all the data; it's always distributed, with partition one located in, say, Virginia, partition two in Seattle, partition three in Europe, data scattered all across. Let's assume the data is partitioned based on genre, and now we have to find all the books by the author David. You need to go through each of the partitions, novel, science and biography, and in each partition there is something called a secondary index. Why? Because just reaching the right partition won't help you: there may be a lot of other authors there too, and you want to retrieve the books of only the one author being searched. Getting to the partition level is fine with the primary index, but within a partition you actually need a secondary index for faster retrieval and quick lookups. So what happens is you go to the secondary index and search for the author David, who appears in partitions one, two and three; the data is retrieved from all three partitions, aggregated in the microservice, and returned to the user. That's a lot of activity happening there, with the aggregation. It sounds fine, but let me explain. Because this secondary index lives in each partition, we call it a local secondary index; another name is partitioning secondary indexes by document. The querying process across multiple partitions is known as scatter and gather: the query is scattered, you query everything, gather the results together, aggregate, and send them back to the user. This involves parallel queries to each partition to collect the data needed.
Although parallelization can improve speed, scatter and gather can be resource-intensive, especially in these kinds of large databases. The main costs are associated with querying each partition separately and then consolidating the results, which often requires a lot of computational and network resources; you're actually making calls over the network to fetch all the data. You can definitely optimize it. Consequently, while local secondary indexes make partition-specific queries efficient, as I just explained, they also necessitate careful consideration of the overhead involved in scatter-and-gather operations in distributed data systems like this. So what's the solution? For this particular use case, what we need is not a local but a global secondary index. In this architecture, the secondary index is not local to a specific partition; it sits at the level of the primary index itself. Now, when a user tries to find all the books by a particular author, the microservice can query directly through the global secondary index: when the user asks for books by author David, you fetch them from partitions one, two and three with no aggregation needed, and all the book data is returned. At the same time, if you want to retrieve all the books in a particular genre like novels, the query goes through the primary index to, say, partition one, and all the books with the novel genre, with all their authors, are returned. So the global secondary index is the most ideal in cases where you're drilling down, querying a certain author's books within a particular genre, two levels of filters. This kind of partitioning is called partitioning secondary indexes by term. This approach offers a more efficient solution for queries that span multiple partitions, like searching by author across all genres, and it reduces the scatter-and-gather overhead I described, since the index is global and not confined to individual partitions. However, this method can introduce challenges in maintaining the global index: you already have a primary index, so you now need a separate data structure, like a map, and you have to decide where and how to store it, and in a large, data-intensive environment that becomes a challenge. So it's a trade-off between the ease of cross-partition queries you get with a global secondary index and the complexity of maintaining that global index. After all, everything in life is a trade-off.
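Here is a small Java sketch of the two query shapes, with hypothetical partitions and book data: the scatter-gather path queries every partition's local index in parallel and aggregates, while the global index answers the same question with a single lookup.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Contrast of local vs. global secondary indexes for "find all books by David".
public class SecondaryIndexDemo {
    public static void main(String[] args) {
        // Each partition holds its own local secondary index: author -> books.
        List<Map<String, List<String>>> localIndexes = List.of(
                Map.of("David", List.of("Novel A")),        // partition 1: novels
                Map.of("David", List.of("Science B")),      // partition 2: science
                Map.of("David", List.of("Biography C")));   // partition 3: biographies

        // Scatter and gather: query every partition in parallel, then aggregate.
        List<String> scattered = localIndexes.parallelStream()
                .flatMap(idx -> idx.getOrDefault("David", List.of()).stream())
                .collect(Collectors.toList());
        System.out.println("scatter-gather: " + scattered);

        // Global secondary index: one lookup already spans all partitions.
        Map<String, List<String>> globalIndex = Map.of(
                "David", List.of("Novel A", "Science B", "Biography C"));
        System.out.println("global index:   " + globalIndex.get("David"));
    }
}
```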
Now let's look at the differences with a couple of real-world use cases: when a local secondary index is used and when a global secondary index is used. Local is about querying within a particular partition. Say on an e-commerce website a buyer wants to look at all the electronic products from a particular company, say Sony. The first level is to go to the electronics partition, and then within the electronics partition you want to retrieve all the products that are Sony's. For that you need a secondary index, right there: the company name is the secondary index within the electronics partition, and we only need the data from that one partition. Now consider a use case that's more applicable to the global secondary index: a multinational firm has employees across different countries, like the US, UK, and Asia, and the HR department wants to view all the managers across the company irrespective of country. Let's assume the storage is actually sharded by country: US employee data sits in the US data center, UK data in the UK data center, and so on. For this particular HR query, viewing all managers across the company, you don't need a local secondary index; you don't want to look into the details of one particular partition, you want to look at all the partitions, and this is where the importance of the global secondary index comes into the picture: the data is obtained from all partitions. So those are the two use cases, and this multinational employee database is a common thing at a lot of companies, so I'm sure many of you can relate to it. Now we'll switch gears and move on to the next part. We're still in the storage or database world, and I'll be talking about conflict-free replicated data types. What are they? Before I talk about these data types, we should understand why we're even talking about them, and why architectural considerations should include something like CRDTs. Let's talk about a concept called replication. As you can see, I'm directly mentioning multi-leader replication here: the backend data is not present in one place, it's copied over to multiple servers, multiple data centers, multiple places. And multi-leader replication is needed in distributed systems for several reasons; replication in general is needed for several reasons, and multi-leader replication on top of that. First and foremost, high availability: system availability is improved by providing multiple independent leader nodes, so if one leader node fails, like one database node failing or becoming unavailable, other leaders can continue to accept write requests, ensuring uninterrupted access to data and services, so the user has a good experience. Another reason is fault tolerance: multi-leader replication especially improves fault tolerance by providing redundancy at the leader level itself; if one leader node fails, another leader node can continue to accept writes, preventing data loss and service disruptions and things like that. Another big important advantage is write scalability. This particular replication model allows write operations to be distributed across multiple leader nodes, which enables the system to handle large volumes of write operations by parallelizing writes across multiple nodes, improving the overall throughput and scalability. And then there's geographic distribution in general: as I mentioned earlier with the employee database example, the data is always distributed geographically, allowing leader nodes to be located in different regions or data centers. This enables the data to be replicated closer to the users, reducing latency and improving the overall user experience.
So that's why we have multiple nodes, and the data is always copied from one node to the other. Now let's take a use case where two users are trying to edit a common shared Google document. Let's assume the title of the document is A and the ID of the document is 123. In the red flow and the purple flow, step one happens at the same time: user one says, hey, I want to update the title to B for this particular Google Doc 123, and user two says, yeah, even I want to set the title, to C. What happens at step two? In red and purple, the data is updated by each leader node internally: leader one changes the title from A to B, and leader two changes the title from A to C. Done. Don't focus on the follower here; that's just asynchronous copying of data from leader to follower for read operations, the leader-follower model of general replication, which is again a different concept. Let's focus on steps four and five now. Like I said, replication involves copying data from one data center to the other asynchronously so that they all stay consistent, and the way the data is synchronized between the leaders, leader one, leader two, and there can be a lot of leaders, follows various topologies: star topology, ring topology, mesh topology. Again, that's a different concept I won't go into; the point here is that leader one and leader two should be coming into sync with each other. During this asynchronous operation, at the red steps four and five, leader one says to leader two: the old document title is A and I want to change it to B. But what happens? Leader two says: hey, what is A? I don't have A; it's already changed to C. So there's a conflict, and the change can't be applied. Then at the purple steps four and five, leader two goes to leader one and says: I want to change the document titled A to C, and leader one says: what are you talking about? I don't have any document titled A, I only have B, because it's already changed to B. So there's the conflict. How can such conflicts be resolved during multi-leader data replication? That's where we shouldn't use the regular data types, plain ints, maps, sets and all the data types we have, while actually doing the replication, while actually copying the data from leader one to leader two or leader two to leader one. We need to use specific data types called conflict-free replicated data types. They are different, special data types, and I gave an example here; I thought an integer was easiest to explain, so I put a code sample on the slide. You can see it's not a regular integer being declared; it's an atomic integer, coming from the java.util.concurrent.atomic utilities. In the main block, two objects are instantiated; imagine this program running as a multithreaded program in leader one and leader two. Leader one increments the counter by one, and leader two also increments the counter by one, as you can see in the main block, and after that each one has a value of one. The merge-leaders call is the actual replication concept we're talking about, merging is nothing but replication, and in the merge logic, the public void merge you can see at the top, you add the logic that keeps the data consistent: instead of ending up with conflicting, inconsistent values, you ensure the data converges across both nodes and is merged accordingly.
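The slide's code isn't reproduced here, but the following is a minimal reconstruction of the idea it describes, assuming a grow-only counter CRDT (a G-Counter): each leader increments only its own slot, and merge takes the per-leader maximum, so both replicas converge to the same total regardless of sync order. Class and method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Grow-only counter CRDT (G-Counter): each leader only increments its own
// slot, and merge takes the per-leader maximum, so replicas converge to the
// same value no matter the order in which they sync.
class GCounter {
    private final String nodeId;
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    GCounter(String nodeId) { this.nodeId = nodeId; }

    void increment() {
        counts.computeIfAbsent(nodeId, id -> new AtomicInteger()).incrementAndGet();
    }

    // Conflict-free merge: element-wise max of the two replicas' count vectors.
    void merge(GCounter other) {
        other.counts.forEach((id, c) ->
                counts.merge(id, new AtomicInteger(c.get()),
                        (mine, theirs) -> mine.get() >= theirs.get() ? mine : theirs));
    }

    int value() {                                  // total across all leaders
        return counts.values().stream().mapToInt(AtomicInteger::get).sum();
    }

    public static void main(String[] args) {
        GCounter leader1 = new GCounter("leader-1");
        GCounter leader2 = new GCounter("leader-2");
        leader1.increment();                       // each leader counts 1 locally
        leader2.increment();
        leader1.merge(leader2);                    // replication = merging state
        leader2.merge(leader1);
        System.out.println(leader1.value() + " " + leader2.value());  // 2 2
    }
}
```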
So by using conflict-free replicated data types, we can ensure these conflicts don't occur in any replication, particularly multi-leader replication. Now we'll switch gears again, move out of the storage world, and focus on some architectural considerations for the end-to-end design of a system. When considering end-to-end design, there are two types: inside-out and outside-in. Let me go into the details. The pictures here are very general, like everything I've been talking about until now. Look at the outside-in architecture first: a user makes a request, say on an e-commerce platform he wants to make a payment after selecting a list of things to pay for; there's a user interface where he does that, then the APIs are invoked, and the HTTP call arrives at a particular endpoint on the microservices. Internally there's some data manipulation, writing or reading happens, ultimately some CRUD operation: create, read, update or delete. Step four is the classes, the abstraction defined on top of the database, and then you actually write to the DB. That's the general flow. Now, what are outside-in and inside-out? There are two ways of designing a system. You can define it starting from the user: the user interacts with the system in a certain way, opens a certain page in the UI, clicks this button, and that results in calling a particular API on the backend system. If you go by that approach, a user-centric or user-interaction approach, that is outside-in architecture; I'd say it's more of a product-management or product-driven approach. When product managers come to a software engineering team and say, hey, this is what we want to build as a product, you'll most likely tend to go with outside-in architecture. Inside-out architecture is more engineering-driven: you don't start from the user or the user's perspective; instead you change the data models and data types, and accordingly change the abstractions, the backend classes like the data access layer. For example, you split the database into multiple databases linked by primary and foreign keys, write new classes on top of them, and then accordingly define some new microservices. It's domain-driven: the engineering team has decided to make architectural enhancements within the system, so you start with the database, then the classes wrapping the database, then the microservices, then the API changes, and ultimately it propagates all the way out to the user. That's inside-out architecture. An example of inside-out architecture can be, say, a monolith-to-microservices decomposition.
An outside-in architecture example can be product-driven or user-driven: an e-commerce platform where the user goes and purchases things, and how the user interacts with the system is what drives the design. Another example of inside-out can be a banking system. In developing a banking system, the core domain revolves around financial transactions, accounts, and customer relationships; all of these need an entity-relationship model to be defined first, and the domain-driven design starts by modeling these concepts and then defining the business rules governing them. Once the domain is well defined, the infrastructure concerns, such as storage, user interfaces, and all the external integrations, are addressed, and the design propagates outward. That's inside-out. For outside-in, I mentioned the e-commerce platform: imagine a company building an e-commerce platform that aims to provide a seamless shopping experience to its customers. To apply an outside-in approach to such a use case, development starts by identifying the key user interactions and requirements from the perspective of both customers and sellers, and then prioritizing features that directly impact the user experience, such as product search, browsing, purchasing, and order tracking. Yeah, that's what I've just described here, with some examples at point number four: inside-out, as in monolith-to-microservices or domain-driven applications like a banking system; and outside-in, user-centric, API-driven design. Inside-out, as you can imagine, is a push architecture or push strategy, because you're going from the database, which you can think of as inside the backend system, outward to the user; that's why it's inside-out, or push. Outside-in comes from the end user all the way into the system, so it's a pull: you can imagine yourself as the system, either pushing outward or pulling inward. That's how the strategies are defined. If you're doing something like a monolith-to-microservices rearchitecture, an inside-out change, you know all the details of the system and what needs to be changed, but you don't yet know the UI changes in that particular case; that's why you start with the internal things, and, as I mentioned in the second point, you forecast the demands of the UI. Outside-in is exactly the opposite: you start from the UI, you know what the UI demands, and then you go inside. These are the four topics I wanted to cover as part of the architectural practices: the important architectural trade-offs and design decisions one needs to consider, especially a software architect designing these data-intensive systems, knowing the details of the backend well enough to make the right decisions based on some crucial trade-offs. There are other things, like consistency and availability and a lot of other aspects of a system, that can be considered and for which architectural practices need to be employed, but I'll keep that for some other talk. Thanks a lot for listening to my session today; I really appreciate it. Thank you.
...

Santosh Nikhil Kumar

Senior Software Engineer @ ByteDance



