Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Building Real-Time Graph Analytics Platforms: Architectural Patterns for High-Performance Relationship Processing

Abstract

Your relational database is struggling with complex relationships. Discover how platform engineers build graph systems processing queries 100x faster, handling millions of entities in real-time. Battle-tested architectural patterns included.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good morning and good evening everyone, depending on where you're joining from, and thanks for joining the session. My name is Muthuselvam Chandramohan, and I have more than 20 years of experience in the data engineering domain. Today I'm going to talk about building a real-time graph database that can be leveraged for both real-time and analytical capabilities, the architectural patterns that can be followed, and the key design considerations that need to be accounted for. I wanted to start with the question of why we need to adopt a graph database at all. What is the problem statement? The key thing is handling relationship data. Today's data is very complex, and getting analytical information out of it is crucial, especially for complex data patterns: social network data, supply chain information, identifying fraud in a given data set, and building a knowledge graph that can surface information from the data. It is complex mainly because of the large volume of data that we handle on a daily basis. That is what brings in the need for a graph database: in a traditional relational database, this kind of complex multi-hop analysis is very difficult, and scaling it is even more challenging. That is one of the main reasons we adopt graph databases; we will get into more details as we go through the session. So as we move along: when we decide to build a graph database, what are the challenges? As platform engineers or data engineers, what are the common challenges we face across the lifecycle of the database? I wanted to cover the three points listed here.
One is that when we build a graph database for real-time analysis, the expectation is often to handle customer calls with sub-second query response times. This is a very common requirement in fraud and cybersecurity use cases, where response times need to be fairly quick. Second, how do we achieve seamless integration with different applications and systems? These can be microservices, API-based systems, batch-oriented data integration platforms, or even packaged applications. How do we integrate with different systems that bring heterogeneous data into the graph database? And third, how do we address scalability concerns? We can scale vertically, but that hits a limit at some point, so the most commonly adopted architecture is horizontal scalability, which puts no hard limit on how far we can grow. How do we adopt horizontal scalability for an ever-growing demand on the data side, especially with a graph database? Those are the three challenges we are going to discuss, and they are the key challenges that come up when we start implementing graph databases; very few technologies stand out in handling them. Moving on: what are the limitations of the traditional approaches, in other words the databases available today? The key one that comes up is the relational database. So what are the challenges with a relational database?
The use cases we adopt a graph database for often come with a lot of interconnected relationships among the data. The key thing is how we build a system that inherently lets users analyze the data and get information out of it, and that is where the relational database struggles. Yes, at a high level a relational database provides one-to-one or one-to-many relationships, but what we are talking about are complex relationships that are not direct. The struggle becomes obvious when you run queries to understand those relationships. For example: who is a friend of a friend, or find all friends of friends who share at least three interests with a particular user. That cannot be answered with a direct query on a relational database, which brings in the second point: exponential complexity. We often don't know in advance how many levels of analysis, and therefore how many joins, we want to perform, so in most relational implementations you end up assuming a fixed number of levels. That is where the relational database's limitation comes into play and where the graph database scores: multi-hop analysis across any number of hops is handled easily. The way a graph database addresses this problem is in how it stores the data: the relationship is stored as a first-class citizen rather than as a property of the data.
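As a rough illustration of why this query is natural for a graph: over an adjacency list it is a short two-hop traversal, where a relational database would need a self-join per hop. The data and function below are toy examples for illustration, not from any real system:

```python
# Sketch: the "friends of friends who share >= 3 interests" query from the
# talk, over a toy in-memory adjacency list. One hop = one dictionary lookup,
# instead of one SQL self-join per hop.

friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}

interests = {
    "alice": {"jazz", "golf", "chess", "sushi"},
    "dave": {"jazz", "golf", "chess"},
    "erin": {"golf"},
}

def friends_of_friends_sharing(user, min_shared=3):
    """2-hop traversal: friends of friends (excluding direct friends and
    the user themselves), keeping those who share >= min_shared interests."""
    direct = friends.get(user, set())
    two_hop = set()
    for f in direct:
        two_hop |= friends.get(f, set())
    two_hop -= direct | {user}
    mine = interests.get(user, set())
    return {p for p in two_hop
            if len(mine & interests.get(p, set())) >= min_shared}
```

Adding a third or fourth hop is another loop over the frontier, with no change to the data model, which is the point the talk makes about dynamic multi-hop analysis.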
Because the data is stored that way, traversal queries are handled easily, and any indirect relationships or multi-hop analysis that needs traversal driven by the dynamic results fetched at the previous step can be achieved with a graph database solution. That is how it differentiates itself from relational databases, document databases, or any sort of search database; graph stands out separately. Moving on: what are the core architectural patterns? We are going to talk about several, but the first one I wanted to cover is event-driven architecture. I picked it because it is common across graph implementations: often the expectation is to handle fraud and cybersecurity use cases, in financial and sometimes non-financial institutions, which require fast response times, and that is where event-driven architecture comes in. It covers integrating with an event-based system such as Apache Kafka or any other streaming platform. A good number of graph databases inherently support integrations with event-based systems; they have connectors for Kafka and the other streaming systems. Any event-driven updates coming from those systems can be handled by a distributed database that routes the calls to a specific partition, driven by the relationships identified while building the data model. It is a hand-in-hand process: you decide how you are going to model your data, and that model maps to a specific
storage unit, which is a partition in a graph database, and that partition can be addressed easily when handling real-time updates: events fed in by the event-driven architecture get routed to the specific graph partitions. What this brings is resilience. If there is an outage, we can replay the data from the streaming platform and quickly pick up from where we left off, and multiple consumers can consume the same events for different purposes. Even with a graph database, you can have different graph instances handling different customer expectations, one for fraud-related use cases and one for marketing use cases, both fetching data from the same Kafka instance or streaming platform. So that is one key consideration: at the time of defining your data model, you also want to consider which systems you are going to integrate with and what sort of event-driven architecture you want to adopt. There are cases where you don't need it and a pure batch-processing system is enough, but more often you do want an event-driven, real-time architecture. It depends on the situation, but compared to plain API-based integrations, event-driven architecture is a fairly robust and reliable architectural pattern to look at.
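As a rough illustration, the routing and replay ideas above can be sketched with an in-memory stand-in for the streaming platform. Kafka would normally play the role of the log; the partition count, event shape, and `EventLog` class here are illustrative assumptions, not any connector's real API:

```python
# Sketch of event-driven routing: relationship events are routed to a graph
# partition by a stable hash of the source vertex, and each consumer group
# keeps its own offset into the shared log so it can replay after an outage.

import hashlib

NUM_PARTITIONS = 4

def partition_for(vertex_id: str) -> int:
    """Stable hash routing: updates for one vertex always land on the same
    graph partition (a stand-in for real partition placement)."""
    digest = hashlib.sha256(vertex_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

class EventLog:
    """Append-only log with per-consumer-group offsets, mimicking the
    replayability a streaming platform provides."""
    def __init__(self):
        self.events = []
        self.offsets = {}          # consumer group -> next index to read

    def publish(self, event):
        self.events.append(event)

    def consume(self, group):
        start = self.offsets.get(group, 0)
        batch = self.events[start:]
        self.offsets[group] = len(self.events)
        return batch

log = EventLog()
log.publish({"src": "user42", "dst": "user7", "type": "FOLLOWS"})

# Two independent consumer groups (e.g. a fraud graph and a marketing graph)
# see the same event and can route it by partition key.
fraud_batch = log.consume("fraud-graph")
marketing_batch = log.consume("marketing-graph")
```

The key property mirrored here is that consumers are decoupled: each group tracks its own offset, so one graph instance replaying history does not disturb the other.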
Moving on: what are the different horizontal scaling strategies we can adopt? We talked about the growing needs of data coming from different sources, whether IoT systems or the large-volume applications these days that generate data in real time, and the way to handle that data without hitting a limit at some point is horizontal scaling through distributed systems. Within a horizontal scaling approach, what strategies can you use? The first is graph partitioning. A graph, as defined in the mathematical models, consists of vertices and edges and how they connect to each other. Graph partitioning is also decided at data modeling time: you define what drives your partitioning, whether it is vertex-cut partitioning or edge-cut partitioning, and that choice is largely driven by your query patterns, not by your data characteristics. It depends on how you are going to consume the data and what your querying patterns look like. Some graph databases support both schemes and some don't, so this is a key consideration when choosing a graph database: which type you want, and whether it supports vertex-cut or edge-cut partitioning. The second strategy is replication. With distributed databases, following the CAP theorem, there are different replication methodologies. One is full (strong) consistency; another is the eventual consistency model, where you write to a quorum and then read from a quorum.
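The quorum rule behind "write to a quorum, read from a quorum" can be sketched as a small check. The function names and replica counts below are illustrative, not from any particular database: with N replicas, a read that consults R nodes must overlap a write acknowledged by W nodes whenever R + W > N, so at least one replica in every read is up to date.

```python
# Sketch of the quorum-overlap rule for quorum-based replication.

def quorum_overlaps(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """True if every read quorum must intersect every write quorum,
    i.e. reads are guaranteed to see the latest acknowledged write."""
    return read_quorum + write_quorum > n_replicas

def majority_quorum(n_replicas: int) -> int:
    """The usual choice: a strict majority for both reads and writes."""
    return n_replicas // 2 + 1
```

For example, with 5 replicas a majority quorum of 3 for both reads and writes satisfies 3 + 3 > 5, while reading and writing to a single replica (1 + 1 > 5 is false) gives no such guarantee.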
That way, you are fairly confident you are reading the latest data. The eventually consistent model is a strongly recommended approach for a graph database, because it is usually not the system of record, just a system of reference, so an eventual model is fairly reasonable to go with. There is also leader-follower (master-replica) replication, where you write to one node and read from the replicas, and there are consensus protocols such as Raft, which are another replication strategy. All of these are key aspects to look into when you are evaluating graph databases that support horizontal scalability, which is essentially one of the key characteristics of distributed computing and distributed data storage. I have put it into two groups: how you store your data, which is the graph partitioning part, and what replication strategy you want to adopt, which is more on the compute and storage side. One more key thing to note: a graph is not a stateless system, it is stateful, so you want to think carefully about the kind of storage you choose so that your traversals stay fast. Either way, a distributed database is the way forward; that is what we wanted to cover here. Next is the kind of storage to adopt. You have decided to go for a distributed database, but if you are not sure what kind of storage to adopt, this is where the next design aspect comes into play. What use case are you going to handle? Is it going to need
sub-second queries, as in a fraud or cybersecurity use case? Or is it going to be a system that relies heavily on how you store the data but may not need a sub-second response time? There are use cases like that: you are okay to wait two, three, or even up to ten seconds, but you want the data stored persistently so that you can rely on it, with a faster recovery time. Then there are specialized storages like document stores, key-value stores, or columnar databases. Those are again flavors of distributed databases, maybe not truly graph, but if the use case doesn't involve a huge amount of data they can fairly outperform a relational database. So these are the different flavors of a hybrid storage approach: in-memory engines such as RedisGraph or Apache Giraph when you are looking for sub-second response; persistent storage such as Neo4j or Amazon Neptune; and the specialized storages. There are tons of options in the market, and often a NoSQL database comes with a graph layer; we often end up using the same NoSQL storage with a graph layer on top of it, because then you don't have to adopt multiple technologies to support your use case. That is another area to look into. Then come query optimization techniques. This is after you have adopted a graph database and fed a good amount of data into it: how do you optimize your queries? This is where you put on the developer's hat and ask what the best way is to define and design a query.
You can look at early termination: once you reach the query's result point, you don't have to continue. That is an advantage of a graph database, that you can terminate the traversal as soon as the result is reached. Then there is bidirectional search, meaning you can start from multiple points depending on the query criteria. This is where a graph database plays a critical role: if you want to start from a country and also from a particular age group, you can have those two search points and converge towards a single result set, so your query run times are faster. That is something quite different from traditional relational databases. Then there is cost-based optimization. You can look at the degree distribution: on a particular vertex, you can see how many edges are connected to it, which tells you how critical that vertex is. If it has a high degree, you probably don't want to start your query from there; you want to start from a much lower-degree vertex so that your results come back faster. These are some of the simpler techniques, but they can yield big results when it comes to query optimization.
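The bidirectional search and early termination just described can be sketched over a toy adjacency list. The graph data and function name are made up for illustration; expanding the smaller frontier first is also a small taste of the cost-based idea:

```python
# Sketch: bidirectional search with early termination. Frontiers expand from
# both endpoints and the traversal stops the moment they meet, instead of
# exploring one large frontier to full depth.

def bidirectional_hops(adj, start, goal):
    """Shortest hop count between start and goal, or None if unreachable."""
    if start == goal:
        return 0
    front_a, front_b = {start}, {goal}
    seen_a, seen_b = {start: 0}, {goal: 0}
    while front_a and front_b:
        # Cost-based choice: always expand the cheaper (smaller) frontier.
        if len(front_a) > len(front_b):
            front_a, front_b = front_b, front_a
            seen_a, seen_b = seen_b, seen_a
        next_front = set()
        for node in front_a:
            for nbr in adj.get(node, ()):
                if nbr in seen_b:                      # frontiers met:
                    return seen_a[node] + 1 + seen_b[nbr]  # terminate early
                if nbr not in seen_a:
                    seen_a[nbr] = seen_a[node] + 1
                    next_front.add(nbr)
        front_a = next_front
    return None

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

On a real graph engine the same ideas show up as traversal limits and planner hints rather than hand-written loops, but the effect on run time is the same.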
Hand in hand with query optimization go the memory management strategies. You can adopt vertex and edge pooling: you fragment your memory and decide which graph elements to keep in memory, which reduces garbage collection pressure. You will often find these databases use Java heavily, so once you address garbage collection, your query response times get faster; that is the main intention behind it. Then there are compressed representations: each graph database has its own compression logic, and one key one to look at is CSR (compressed sparse row), which uses a bit-packed adjacency list that compresses the data and helps make runtimes faster. Then there is hot/cold separation, which goes together with monitoring your database: you identify which vertices or partitions are hot and which are cold, in other words frequently used versus rarely used, and you split the hot partitions into multiple chunks with more compute power and more storage so that queries run faster. Those are the areas to look into from a memory management standpoint. Moving on to integration patterns, which is where we started: when you implement a graph database, you usually want to implement it in a near-real-time fashion, which is one of the more common cases we have seen, though it is not limited to that. Even within near real time there are multiple methodologies to choose from. One key one is change data capture (CDC), the logs retrieved from traditional databases; or you can have a custom connector that reads from a traditional database and starts pumping data into a streaming platform in the form of CDC. So you can build data integration pipelines that connect to graph databases, or use the connectors available on streaming platforms, as an integration pattern.
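The CDC pattern above can be sketched as applying relational change events to a graph. The event shape below (op/table/row fields) is a made-up, Debezium-like format rather than any specific connector's schema, and the "graph" is a plain dict for illustration:

```python
# Sketch: upserting vertices and edges from relational CDC events.
# Rows from an entity table become vertices; rows from a join table
# become edges.

graph = {"vertices": {}, "edges": set()}

def apply_cdc_event(event):
    """Apply a single change event to the in-memory graph."""
    if event["table"] == "users":
        key = ("user", event["row"]["id"])
        if event["op"] in ("insert", "update"):
            graph["vertices"][key] = event["row"]
        elif event["op"] == "delete":
            graph["vertices"].pop(key, None)
    elif event["table"] == "follows":        # join table -> edges
        edge = (("user", event["row"]["src"]), ("user", event["row"]["dst"]))
        if event["op"] == "insert":
            graph["edges"].add(edge)
        elif event["op"] == "delete":
            graph["edges"].discard(edge)

apply_cdc_event({"op": "insert", "table": "users", "row": {"id": 1, "name": "a"}})
apply_cdc_event({"op": "insert", "table": "users", "row": {"id": 2, "name": "b"}})
apply_cdc_event({"op": "insert", "table": "follows", "row": {"src": 1, "dst": 2}})
```

In a real pipeline the same mapping runs inside a streaming consumer, with the graph database's own write API in place of the dict updates.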
CDC is one of the most common patterns we see with graph implementations. The second is GraphQL integration. We often get the question: you have a graph database, so why do you need GraphQL? The heavy lifting, identifying unknown relationships, multi-hop analysis, scanning through a large volume of data and bringing that information out, is done by the graph database, but you often need a schematic layer through which you expose the data to your end consumers, whether in a UI or other integrations. That is where GraphQL comes into play. If you want to match customer platform data with cybersecurity or fraud analysis data, the graph brings those two datasets together, stitches them into a schematic layer, and presents them through the GraphQL integration. Again, it depends on the use case and how you are going to consume the data. Then there is analytical integration, integrating with warehouses and big data platforms. There we often go with Apache Spark connectors, which do the heavy lifting on large data volumes, and you can use connectors like Presto or Trino if you want to run SQL queries, because those connectors provide a schematic layer; some graph databases even let you run SQL directly and translate it under the hood into graph queries. Moving on to monitoring and observability. These are standard across different systems, but what specifically should you watch in a graph database?
The key things to look at are the query complexity metrics. You have tons of queries running, but you want to see which queries perform badly, what traversal paths they take, and which queries use a lot of memory, so you can improve them. This is the bottleneck identification layer: whether it needs one of the strategies we looked at, partitioning, addressing a hot/cold partition, or some compression logic. Identifying these complexity metrics is one thing to look after. Then there are heat maps of hot vertices and hot edges: if your customer base is concentrated in a particular region, that will show up in the heat maps. Then distributed tracing, which is a fine-grained analysis: when you have a query that doesn't perform well, you break it down into its individual layers to find out which layer performs badly or needs improvement. You can achieve that with any of the logging platforms, and often through the native graph logs themselves. The last point covers providing a custom dashboard that captures these use cases, with metrics such as clustering coefficient trends, degree distribution changes, and component size distribution.
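Two of those dashboard metrics, degree distribution and clustering coefficient, can be sketched over a toy undirected adjacency dict (the data is illustrative, and real systems would compute these incrementally rather than over the whole graph):

```python
# Sketch: degree distribution and local clustering coefficient,
# two of the graph-health metrics a monitoring dashboard can track.

from collections import Counter

adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def degree_distribution(adj):
    """Counter mapping degree -> number of vertices with that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())

def clustering_coefficient(adj, v):
    """Fraction of v's neighbour pairs that are themselves connected."""
    nbrs = list(adj.get(v, ()))
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))
```

A drifting degree distribution (e.g. a few vertices suddenly growing very high degree) is exactly the kind of change that feeds back into the hot-partition and cost-based-optimization decisions discussed earlier.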
Moving on: you have a graph database up and running, and customers rely on this data, so you need a reliability story. How do we back up and recover the data? One of the most common recommendations is to go for a full graph backup. You don't want a plain incremental backup, because this is not a traditional database where you pull the incremental logs and store them; the graph data is all interconnected. So to the extent possible we try to adopt full graph backups, which are the most reliable. But when you start implementing real-time solutions, that is not always directly possible, so there are event-sourced incremental backups, largely supported by these databases: you take the running event logs and build snapshots out of them, which is the next most reliable backup you can get. There are trade-offs to look at, though: what recovery time objective (RTO) and recovery point objective do you want to set as the business expectation for this strategy? Finally, there are some advanced techniques used in some cases, such as multi-version concurrency control, which basically takes backups in real time without blocking. But these are newer in the market and still being explored; for the critical use cases, I would say they are still at the experimentation stage. Next is capacity planning. You are ready to build a graph database; wearing the architect's hat, what should you look at, and what out-of-the-box strategies are available? There are proven methodologies and proven tools or accelerators, for example the synthetic graph generator. You feed in some information about
how many vertices and edges you want to build and what data size you are expecting, and it gives you an approximate figure of what you should be looking at. That is a great tool to start with. And one of the most reliable approaches, I would say, is comprehensive load testing: you actually start a POC and run different scenarios, like burst scenarios and chaos testing. You test and find out, because with a graph database the data is unpredictable; the whole point of a graph database is that you keep finding new patterns in the data, patterns that were previously unknown to you. So load testing is a great way in my view, although it takes time to mature. Then there is memory modeling, which is similar to load testing, but you build a complex scenario: you query at least one percent of your graph, including vertices with a large set of connected edges, so that you put your system under stress and find out how it performs. You are basically planning for the worst case, which matters especially when you want to use the graph for analytical use cases. So that is capacity planning: the tools are a good starting point, and once you reach some level of implementation maturity, you can go for the load testing or the stress testing.
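In the spirit of the synthetic graph generator above, a first-pass capacity estimate can be sketched as back-of-the-envelope arithmetic: feed in expected vertex and edge counts and get an approximate storage figure. The per-element byte sizes, index overhead, and replication factor below are illustrative assumptions, not numbers from any particular graph database:

```python
# Sketch: rough storage estimate for capacity planning.

def estimate_storage_gb(num_vertices, num_edges,
                        bytes_per_vertex=200,    # id + properties (assumed)
                        bytes_per_edge=50,       # ids + edge props (assumed)
                        replication_factor=3,    # copies across the cluster
                        index_overhead=0.3):     # ~30% for indexes (assumed)
    """Approximate replicated on-disk footprint in GiB."""
    raw = num_vertices * bytes_per_vertex + num_edges * bytes_per_edge
    total = raw * (1 + index_overhead) * replication_factor
    return total / 1024 ** 3

# e.g. 10M vertices and 100M edges
approx_gb = estimate_storage_gb(10_000_000, 100_000_000)
```

A figure like this only bounds the problem; as the talk notes, load testing and stress testing against real query patterns are what validate it.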
Then come the architectural trade-offs: how do you want your system to look from a consistency versus performance standpoint? Do you want full consistency or eventual consistency, or do you want to treat performance as the major concern? This is where you have to find out which target users and which use case you want to address, a real-time use case or an analytical use case; that drives the CAP trade-off, whether you put consistency first or performance first, and it is purely based on the user base you want to serve. One thing I would like to point out: even with the same set of data, you might want to consider multiple graph instances, because your target audiences can vary. Then there is flexibility, which is more about choosing your graph database: whether you want a typed graph, a property graph, or an adaptive-indexing graph database. These differ in how the graph stores the data. A typed graph has a vertex-and-edge-based schema; a property graph leans more towards knowledge graph building; and with adaptive indexing, you build an index, run the query, and the indexes get dropped at the end. So it is about choosing the type of graph database, and these are things to address when choosing the database, when implementing a particular graph use case, and when deciding which target users you want to address. Those are the key things that drive this particular strategy. So we have talked about how graph databases are used today and their current state; what are the future directions?
Graph databases have their own limitations. One thing that needs attention is that graph often requires heavy compute because of the way it handles complex queries, so there is ongoing work on hardware acceleration, for example using GPUs and handling graph algorithms natively. Machine learning, I would say, is already in a good state, because graph has long worked alongside machine learning integration, but it is being expanded so that graph and machine learning complement each other in building complex use cases; in some cases I have seen AI-based implementations use graph to a large extent. That is a major work area currently in progress. Then there are serverless graph databases: a lot of graph providers now offer their products in the cloud, where you don't have to worry about the hardware behind them, and that is coming as a serverless offering. That is another growing area I wanted to quickly touch on in terms of where graph databases are heading. Finally, a quick overview of the different areas we talked about: graph as a real-time or near-real-time implementation that often follows a distributed architecture, mainly to address the large amount of data it handles; integration through event-driven architecture to handle ML, fraud, and cybersecurity use cases; and, from a storage standpoint, balancing cost versus performance across the three scenarios we discussed, in-memory, hybrid, and persistent storage. Those are the key areas.
At the same time, to conclude, I want to make clear that not every use case should be implemented with graph. We need to be very clear about which use case we want to implement, which target audience we want to address, and the kind of strategy we want for data integration; that is what will eventually drive us to success. I hope this session was useful and informative, and I look forward to you applying these strategies in your future implementations. I wish you a great day and success. Thank you.

Muthuselvam Chandramohan

Senior Software Engineer @ Intuit MailChimp
