Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning and good evening, everyone, depending on where you're joining from.
Thanks for joining the session.
My name is Muthu Shandra Mohan, and I have more than 20 years of experience in the data engineering domain.
And today I'm going to be talking about building a real-time graph database that can be leveraged for both real-time and analytical capabilities: the architectural patterns that can be followed and the key design considerations that need to be accounted for when building a graph database.
I wanted to start by putting up the question of why we need to adopt a graph database. What is the problem statement that a graph database addresses?
The key driver is handling relationship data. In today's world, data is very complex, and getting analytical information out of it is crucial, especially when handling complex data patterns: social network data, supply chain information, identifying fraud in a given data set, and building a knowledge graph that can surface information from the data. This is hard mainly because of the large volume of data we handle on a daily basis.
That is what brings in the need for a graph database. In a traditional relational database, this kind of complex multi-hop analysis is very difficult, and scaling it is even more challenging. That is one of the main reasons we adopt a graph database.
We will get into more details as we go through the session.
So, as I move along: when we decide to build a graph database, what are the challenges? As platform engineers or data engineers, what are the common challenges we face with the database, or across the lifecycle of the database? I wanted to cover the three points listed here.
One: when we build a graph database for real-time analysis, the expectation is often to handle customer calls with sub-second query response times. This is a very common requirement in fraud and cybersecurity use cases, where response times need to be fairly quick.
Two: how do we adopt seamless integration with different applications and systems? It can be microservices, API-based systems, a batch-oriented data integration platform, or even a packaged application. How do we integrate with the different applications and systems that bring heterogeneous data into the graph database?
Three: how do we address the scalability concerns? We can scale vertically, but that hits a limit at some point, so the most commonly adopted architecture is horizontal scalability, where there is no hard limit on how far we can grow. How do we adopt horizontal scalability for the ever-growing demand on the data side, especially with a graph database?
How we handle these three challenges is what we are going to be discussing.
These are some of the key challenges that come up when we start implementing graph databases, and there are very few technologies that stand out in handling them.
Moving on: what are the limitations of the traditional approaches, or in other words, the databases currently available? Looking at the current landscape, the key one that comes up is the relational database. So what are its challenges?
The use cases we adopt a graph database for often come with a lot of interconnected relationships among the data. The key question is how to build a system that inherently gives users the ability to analyze the data and pull information out of it. That is where the relational database struggles: building the relationships within the data. Yes, at a high level a relational database provides one-to-one or one-to-many relationships, but what we are talking about are complex relationships that are not direct. That is where the relational database falls short.
This is most often felt when running queries to understand the relationships. Take one of the classic use cases: who is a friend of a friend? Or: find all friends of friends who share at least three interests with a particular user. There is no direct query for this on a relational database.
When you try to address this problem on a relational database, the second point comes in: exponential complexity. Often we don't know up front how many levels of analysis we want to perform, and therefore how many joins we need. In most cases, a relational database forces you to assume a fixed number of levels.
That's where the limitation of the relational database comes into play and where the graph database scores: multi-hop analysis over any number of hops is handled easily.
The way a graph database addresses this problem is in how it stores the data: it stores the relationship as a first-class citizen rather than as a property of the data. The relationship itself is stored, and the data is laid out so that traversal queries can be handled easily. Any sort of indirect relationship or multi-hop analysis, where each traversal step depends on the dynamic results fetched at the previous step, can be achieved with a graph database. That's how it differentiates itself from relational, document, or search databases; graph stands out separately.
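To make the friend-of-friend example above concrete, here is a minimal sketch in Python against a hypothetical Neo4j instance; the Person/Interest schema, connection details, and property names are illustrative assumptions, not part of the talk.

```python
# A minimal sketch, assuming a Neo4j instance reachable over Bolt and a
# hypothetical (:Person)-[:FRIEND]-(:Person), (:Person)-[:LIKES]->(:Interest)
# schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

FOAF_QUERY = """
MATCH (me:Person {id: $user_id})-[:FRIEND]-()-[:FRIEND]-(foaf:Person)
WHERE foaf <> me AND NOT (me)-[:FRIEND]-(foaf)
MATCH (me)-[:LIKES]->(i:Interest)<-[:LIKES]-(foaf)
WITH foaf, count(DISTINCT i) AS shared
WHERE shared >= 3
RETURN foaf.id AS foaf_id, shared
"""

with driver.session() as session:
    # The multi-hop traversal and the aggregation both run inside the
    # database; no join depth has to be assumed up front.
    for record in session.run(FOAF_QUERY, user_id="u-123"):
        print(record["foaf_id"], record["shared"])
```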
Moving on: what are the core architectural patterns? We are going to talk about several, but the first one I wanted to cover is event-driven architecture. I picked it because it is a common pattern across graph implementations: often the expectation is to handle fraud and cybersecurity use cases, for financial and sometimes non-financial institutions, which require faster response times.
That is where event-driven architecture comes in. It typically means integrating with an event-based system such as Apache Kafka or another streaming platform. A good number of graph databases inherently support integrations with event-based systems; they have connectors for Apache Kafka and other streaming platforms. Any event-driven updates coming from these systems can be handled by a distributed database that routes the calls to a specific partition within the database, driven by the relationships identified while building the data model.
So it's a hand-in-hand process: you decide how you are going to model your data, and that maps to a specific storage unit, which is a partition in the graph database. At the time of handling real-time updates, any event-driven architecture that feeds data in real time can have it routed easily to the specific graph partitions.
What this brings is resilience: if there are any outages, you can replay the data from the streaming platform, picking up quickly from where you left off. And multiple consumers can consume the same events for different purposes. Even with a graph database, you can have different graph instances that serve different customer expectations: one for fraud-related use cases and one for marketing use cases, both fetching data from the same Kafka instance or streaming platform.
So that is one key consideration to look into: at the time of defining your data model, you also want to consider which systems you want to integrate with and what sort of event-driven architecture you want to adopt. There can be cases where you don't need it; it can be a pure batch-processing system. But more often you want an event-driven, real-time architecture. It depends on the situation, but event-driven architecture is a fairly robust and much more reliable architectural pattern compared to other API-based integrations.
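As a sketch of this pattern, the snippet below consumes events from a Kafka topic and applies idempotent updates to a graph store; the topic name, event shape, and MERGE query are hypothetical, and production systems would usually rely on the database's native Kafka connector mentioned above.

```python
# A minimal sketch, assuming a Kafka topic "transactions" carrying JSON
# events with src/dst/tx_id/amount fields (a hypothetical schema).
import json
from kafka import KafkaConsumer
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    group_id="fraud-graph",  # a second group_id could feed a marketing graph
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

UPSERT = """
MERGE (a:Account {id: $src})
MERGE (b:Account {id: $dst})
MERGE (a)-[t:SENT {tx_id: $tx_id}]->(b)
SET t.amount = $amount
"""

with driver.session() as session:
    for msg in consumer:
        e = msg.value
        # Idempotent MERGE means replaying the log after an outage is safe.
        session.run(UPSERT, src=e["src"], dst=e["dst"],
                    tx_id=e["tx_id"], amount=e["amount"])
```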
Moving on: what are the different horizontal scaling strategies we can adopt? We talked about the growing demands on data coming from different sources: it can be an IoT system, or one of the many modern applications that generate large volumes of data in real time. How do we handle the data coming from these systems, and handle it correctly? The key is horizontal scaling, so that you are not hitting a limit at some point; handling it through distributed systems is the way to go forward.
Within a horizontal scaling approach, what are the different strategies you can use? One is graph partitioning. A graph, as we know, is defined by its mathematical model: vertices and edges, how they are connected, and how they relate to each other. Graph partitioning is decided at graph data modeling time: you define what drives your data partitioning, whether it is vertex partitioning or edge partitioning. That choice is largely driven by your query patterns: not your data characteristics, but how you are going to consume the data and what your query patterns are going to look like.
Depending on that, you might choose between vertex-based partitioning and edge-based partitioning. There are graph databases that support this, and some that don't. So this is one of the key considerations when adopting a graph database: which type of graph database you want to choose, and whether it can support vertex-cut or edge-cut partitioning. That is one.
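A minimal sketch of the vertex-partitioning idea, assuming a simple hash-based routing scheme; real systems typically use consistent hashing or smarter, locality-aware placement.

```python
# Edge-cut style partitioning: each vertex is owned by one shard, and
# edges whose endpoints land on different shards become the "cut".
import hashlib

NUM_SHARDS = 8  # illustrative shard count

def shard_for_vertex(vertex_id: str) -> int:
    """Route a vertex to a shard by hashing its id."""
    digest = hashlib.md5(vertex_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def is_cut_edge(src_id: str, dst_id: str) -> bool:
    """An edge is 'cut' when its endpoints live on different shards."""
    return shard_for_vertex(src_id) != shard_for_vertex(dst_id)

print(shard_for_vertex("customer-42"))       # shard that owns this vertex
print(is_cut_edge("customer-42", "acct-7"))  # True if the shards differ
```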
The second thing is the replication strategy. When we talk about distributed databases, the trade-offs follow the CAP theorem, and there are different replication methodologies. One is full (strong) consistency. Then there is the eventual consistency model: you write to a quorum and then read from a quorum, so you are fairly confident you are reading the latest data. The eventual consistency model is a strongly recommended approach for a graph database, because the graph is typically not a system of record, just a system of reference, so an eventual model is fairly reasonable to go with. There is also master-slave replication, where you write to one node and read from the replicas. And there are consensus protocols such as Raft, a different approach to replication. That's another replication strategy.
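The quorum idea can be stated in one line: with N replicas, a write quorum W and a read quorum R always overlap whenever R + W > N, which is what gives the confidence of reading the latest data. A tiny illustration, with assumed replica counts:

```python
# Quorum overlap check; N, W, R below are illustrative, not any
# database's defaults.
N = 3  # replicas
W = 2  # write quorum
R = 2  # read quorum

def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum intersects every write quorum."""
    return r + w > n

print(quorum_overlaps(N, W, R))  # True: 2 + 2 > 3
print(quorum_overlaps(3, 1, 1))  # False: a read may miss the latest write
```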
All of these are key aspects to look into when you are looking for horizontal scaling: graph databases that support horizontal scalability, which is essentially one of the key characteristics of distributed computing and distributed data storage. I have put them into two groups: one is how you are going to store your data, which is the graph partitioning part, and the other is the replication strategy you want to adopt, which is more on the computing and data storage side.
One key thing to keep in mind is that a graph is not a stateless system; it is a stateful system. So you want to think about what kind of storage you choose, so that your traversals remain fairly easy. That's something to look into. But a distributed database is the way to go forward; that's what we wanted to cover here.
The next topic is the kind of storage we want to adopt. You decided to go for a distributed database, but if you are not sure about the kind of storage, this is where the next design or architectural aspect comes into play. What use case are you going to handle? Is it sub-second queries that you want to address in a fraud or cybersecurity use case? Or is it a system that relies heavily on how you store the data but may not need a sub-second response time? There are use cases like that: often you look at a data store where you are okay to wait two or three seconds, or even up to 10 seconds, but you want the data stored persistently so you can rely on it and recover quickly.
If you have a use case like that, there are also specialized storages: document stores, key-value stores, or columnar databases. Those are again flavors of distributed databases. They may not be truly graph, but if the use case doesn't demand heavy traversal over huge data, they can fairly run better than a relational database. That's another specialized storage to look at. So these are the different flavors in which you can adopt a hybrid storage approach.
For in-memory, when you are looking for sub-second responses, you can go for RedisGraph or Apache Giraph as an option. When it comes to persistent storage, Neo4j or Amazon Neptune are options to look at.
And there are specialist storages; there are tons of options in the market. Often a NoSQL database comes with a graph capability, and there are cases where we use the same NoSQL storage with a graph layer on top of it, so you don't have to adopt multiple technologies to support your use case. That's another area to look into.
Next: query optimization techniques.
This comes after you have adopted a graph database and fed a good amount of data into it: how do you optimize your queries? This is where you put on the developer hat and look for the best way to define and design a query. One option is early termination: once you reach the query result point, you don't have to continue the query. That's an advantage of a graph database: you can terminate the query as soon as you have reached the result point.
Then there is bidirectional search. What I mean is that you can start from multiple points, depending on the query criteria. This is where a graph database plays a critical role: if you want to start from a country and also from a particular age group, you can take these two search points and converge toward a single result set. Because you search from both ends, your query run times are faster. This is something different from traditional relational databases.
And then there is cost-based optimization. You can find out the degree distribution: when you are on a particular vertex, you can see how many edges are connected to it, so you know the criticality of that vertex. If the degree is high, you probably don't want to start your query from there; you want to start from a much lower-degree vertex so your results come back faster. That is one way of doing cost-based optimization among the query optimization techniques.
These are some of the simpler techniques, but they can yield big results when it comes to query optimization; a sketch of the bidirectional idea follows.
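Here is a minimal sketch of bidirectional search on a plain in-memory adjacency map, a stand-in for a graph database's traversal engine; expanding both frontiers and stopping when they meet is what shortens the run time.

```python
# Bidirectional breadth-first search on a toy adjacency map.
from collections import deque

def bidirectional_hops(graph: dict, start: str, goal: str) -> int:
    """Return the number of hops between start and goal, or -1."""
    if start == goal:
        return 0
    seen_a, seen_b = {start: 0}, {goal: 0}
    frontier_a, frontier_b = deque([start]), deque([goal])
    while frontier_a and frontier_b:
        # Always expand the smaller frontier (a simple cost heuristic).
        if len(frontier_a) > len(frontier_b):
            frontier_a, frontier_b = frontier_b, frontier_a
            seen_a, seen_b = seen_b, seen_a
        for _ in range(len(frontier_a)):
            node = frontier_a.popleft()
            for nbr in graph.get(node, []):
                if nbr in seen_b:  # the two frontiers met: done early
                    return seen_a[node] + 1 + seen_b[nbr]
                if nbr not in seen_a:
                    seen_a[nbr] = seen_a[node] + 1
                    frontier_a.append(nbr)
    return -1

g = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(bidirectional_hops(g, "a", "d"))  # 3
```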
Next are memory management strategies. These go hand in hand with your query optimization.
You can adopt things like vertex and edge pooling: you partition your memory and decide which graph elements you want to keep in memory, which keeps garbage collection in check. You will often find these databases use Java heavily, so once you address garbage collection (GC), your query response times get faster. That is the main intention behind it.
Then there are compressed representations; each graph database has its own compression logic. One key technique to look for is CSR (Compressed Sparse Row), which uses a bit-packed adjacency list that helps compress the data, and that helps make the runtime faster.
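As an illustration of the CSR layout (without the bit-packing a real engine would add), the sketch below flattens adjacency lists into one neighbor array plus an offsets array:

```python
# CSR adjacency: one flat neighbor array plus per-vertex offsets,
# instead of a separate list object per vertex.
edges = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}  # toy input graph

num_vertices = len(edges)
offsets = [0] * (num_vertices + 1)
neighbors = []
for v in range(num_vertices):
    neighbors.extend(edges[v])
    offsets[v + 1] = len(neighbors)

def out_neighbors(v: int) -> list:
    """Neighbors of v are a contiguous slice: cache-friendly traversal."""
    return neighbors[offsets[v]:offsets[v + 1]]

print(out_neighbors(2))  # [0, 3]
```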
Then you have hot/cold separation. This goes hand in hand with monitoring your database: you identify which vertices or partitions are hot and which are cold, that is, often used versus less used. Once you have identified them, you can split the hot partitions into multiple chunks and give them more compute power and more storage so the queries can run faster. These are the different areas you can look into from a memory management standpoint that can help deliver better results.
Moving on to integration patterns. This is where we started: when you implement a graph database, you often want a near-real-time fashion; that's one of the more common cases we have seen, though it is not limited to that. Even within near real time, there are multiple methodologies we can opt into. A key one is change data capture (CDC), which reads the change logs from traditional databases; alternatively, you can have a custom connector that reads from a traditional database and starts pumping data through a streaming platform in the form of CDC. So you can have data integration pipelines connecting to graph databases, or use the connectors available on streaming platforms, as an integration pattern. This is the most common thing we see with graph implementations.
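As a sketch of the CDC path, the handler below maps a Debezium-style change event onto graph mutations; the envelope shape follows the common CDC convention, but the Customer table-to-graph mapping is hypothetical.

```python
# Turning one relational change event into a graph mutation.
def apply_cdc_event(session, event: dict) -> None:
    """session: an open neo4j driver session; event: a CDC envelope."""
    op = event["payload"]["op"]  # c=create, u=update, d=delete
    row = event["payload"].get("after") or event["payload"].get("before")
    if op in ("c", "u"):
        session.run(
            "MERGE (c:Customer {id: $id}) SET c.name = $name",
            id=row["id"], name=row["name"],
        )
    elif op == "d":
        session.run("MATCH (c:Customer {id: $id}) DETACH DELETE c", id=row["id"])

# Example envelope as produced by a typical CDC pipeline:
event = {"payload": {"op": "u", "after": {"id": 42, "name": "Acme Corp"}}}
# apply_cdc_event(session, event)
```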
The second thing is GraphQL integration. This is where the graph database retrieves the results and identifies the relationships. We often get the question: you already have a graph database, why do you need GraphQL? The heavy lifting is done by the graph database: identifying unknown relationships, running multi-hop analysis, scanning through a large volume of data, and bringing that information out. But you often need to expose the data to the end consumers, in the form of a UI or other integrations, through a schematic layer. That is where GraphQL comes into play. If you want to match data between a customer platform and cybersecurity or fraud analysis data, the graph brings the two data sets together and stitches them, and GraphQL presents them through that schematic layer. So that is another integration pattern to look into; again, it depends on the use case and how you are going to consume the data.
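A minimal sketch of that schematic layer using the graphene library; the Customer type and the fetch_customer helper are hypothetical stand-ins for a real multi-hop graph query.

```python
import graphene

def fetch_customer(customer_id):
    # Stand-in for a multi-hop graph query against the database.
    return {"id": customer_id, "name": "Alice", "risk_score": 0.12}

class Customer(graphene.ObjectType):
    id = graphene.ID()
    name = graphene.String()
    risk_score = graphene.Float()

class Query(graphene.ObjectType):
    customer = graphene.Field(Customer, id=graphene.ID(required=True))

    def resolve_customer(root, info, id):
        return Customer(**fetch_customer(id))

schema = graphene.Schema(query=Query)
result = schema.execute('{ customer(id: "c-1") { name riskScore } }')
print(result.data)  # {'customer': {'name': 'Alice', 'riskScore': 0.12}}
```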
Then there is a very common use case: analytical integration, integrating with warehouses or big data platforms. There we often go with Apache Spark connectors, which do the heavy lifting on large data volumes. You can also use connectors like Presto or Trino if you want to run SQL queries, because these connectors provide a schematic layer; some graph databases even provide the ability to run direct SQL queries that are translated under the hood into graph queries.
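As a sketch of the analytical path, here is a PySpark read through the Neo4j Connector for Apache Spark; the option names follow that connector's documented style but should be verified against the version you deploy, and the Customer label and country property are assumptions.

```python
# Assumes the Neo4j Connector for Apache Spark is on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graph-analytics").getOrCreate()

df = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("labels", "Customer")  # read all :Customer nodes as a DataFrame
    .load()
)

df.groupBy("country").count().show()  # bulk analytics on graph data
```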
Moving on to monitoring and observability. Monitoring and observability are standard across different systems, but what do you specifically look after in a graph database? The key things are the query complexity metrics. You have tons of queries being run, but you want to see which queries perform badly, what traversal paths those queries take, which queries use high memory, and how to improve them. This is the bottleneck identification layer: deciding whether it needs one of the strategies we looked at, say partitioning or a hot/cold split, or whether it can be handled through some compression logic. Identifying these complexity metrics is one thing to look after.
Then heat maps, again for hot vertices and hot edges. If you have a customer base that is heavily centered on a particular region, those are the things to look for in the heat maps.
And distributed tracing. This is fine-grained analysis: when you have a query that doesn't perform well, you break it down into individual layers. You can achieve this with any of the logging platforms, and often it can be done through the native graph logs themselves. Breaking down the different layers of the query shows which layer is performing badly or needs to be improved, so you get an idea of what can be done.
The next bullet point covers providing a custom dashboard that captures these use cases, and the different metrics that need to be adopted: things like clustering coefficient trends, degree distribution changes, and component size distribution. These are some of the metrics that need to be looked into; a small sketch of computing them follows.
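Here is a minimal sketch of those three dashboard metrics computed with networkx on a sampled subgraph; at production scale these would normally be computed inside the database or by a Spark job.

```python
import networkx as nx

G = nx.karate_club_graph()  # stand-in for a sampled subgraph

clustering = nx.average_clustering(G)   # clustering coefficient
degree_hist = nx.degree_histogram(G)    # degree distribution
components = [len(c) for c in nx.connected_components(G)]  # component sizes

print(f"avg clustering:   {clustering:.3f}")
print(f"degree histogram: {degree_hist}")
print(f"component sizes:  {sorted(components, reverse=True)}")
```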
Moving on. So you have a graph database up and running: how are we going to back up and recover the data? You have customers who rely on this data, so you need a reliability story for it. How do we do backup and recovery?
One of the most common recommendations is to go for a full graph backup. What that means is that you don't want to do an incremental backup, because this is not a traditional database where you get incremental logs out and store them; this is all interconnected graph data. So, to the extent possible, we try to adopt full graph backups; that is the most reliable option. But when you start implementing real-time solutions, that is often not directly possible, so there are event-sourced incremental backups, largely supported by these databases: you take the running event logs and build snapshots out of them. That's the next level of reliable backup you can get.
But there are some trade-offs that need to be looked at: what RTO, the recovery time objective, you want to set as the business expectation for this particular strategy. A sketch of the event-sourced idea follows.
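Here is a minimal sketch of the event-sourced incremental backup idea: periodic snapshots that record the event-log offset, so recovery is "restore snapshot, then replay the tail". The file layout and event shape are assumptions for illustration.

```python
import json

def apply_event(graph_state: dict, event: dict) -> None:
    """Trivial stand-in: each event adds one edge to an adjacency map."""
    graph_state.setdefault(event["src"], []).append(event["dst"])

def snapshot(graph_state: dict, offset: int, path: str) -> None:
    """Persist the materialized graph plus the log position it covers."""
    with open(path, "w") as f:
        json.dump({"offset": offset, "graph": graph_state}, f)

def recover(path: str, event_log: list) -> dict:
    """Restore the snapshot, then replay only the events after it."""
    with open(path) as f:
        saved = json.load(f)
    graph_state = saved["graph"]
    for event in event_log[saved["offset"]:]:
        apply_event(graph_state, event)
    return graph_state
```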
Finally, there are some advanced techniques being used in some cases, like multi-version concurrency control (MVCC), which basically takes backups in real time without blocking. But these are newer in the market and are still being explored across use cases; for some of the critical use cases, I would say they are still at the experimentation stage.
Next is capacity planning. You are ready to build a graph database: what are the key things to look at for capacity planning as you put on the architect hat? What out-of-the-box strategies are available when you want to do capacity planning?
There are some proven methodologies and proven tools or accelerators available. For example, a synthetic graph generator is a tool you can use: you feed in information about how many vertices and how many edges you want to build and the data size you are expecting, and it gives you an approximate figure of what you should be looking at. That is a great tool to start with; a sketch is shown below.
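A minimal sketch of synthetic graph generation using networkx's Barabási–Albert generator, a common stand-in for skewed real-world graphs; the per-vertex and per-edge byte counts are rough assumptions, not a vendor sizing formula.

```python
import networkx as nx

NUM_VERTICES = 100_000
EDGES_PER_NEW_VERTEX = 5

G = nx.barabasi_albert_graph(NUM_VERTICES, EDGES_PER_NEW_VERTEX, seed=42)

BYTES_PER_VERTEX = 200  # assumed property payload per vertex
BYTES_PER_EDGE = 50     # assumed payload per edge

estimate_gb = (G.number_of_nodes() * BYTES_PER_VERTEX +
               G.number_of_edges() * BYTES_PER_EDGE) / 1e9
print(f"{G.number_of_edges():,} edges, ~{estimate_gb:.2f} GB raw estimate")
```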
One of the most reliable things, I would say, is comprehensive load testing: actually start a POC and run different scenarios, things like burst scenarios and chaos testing. You test it and find out, because with a graph database the workload is often unpredictable: you keep finding new patterns in the data, which is the whole point of a graph database, and those patterns are unknown to you up front. So load testing is a great approach in my view, though it takes maturity and time.
Then there is memory modeling.
This is similar to load testing, but you build a complex scenario: you query at least one percent of your graph, targeting a slice with a large set of connected edges and vertices. You basically put your system under stress and find out how it performs, so you are planning for the worst case. This matters especially when you want to use the graph for analytical use cases; that is what this memory modeling stresses.
So, from a capacity planning standpoint: the tools are certainly a good starting point, and once you are at some level of maturity in the implementation, you can go for load testing or stress testing.
Then come the architectural trade-offs. How do you want your system to look from a consistency versus performance standpoint? We talked about horizontal scaling and distributed systems: do you want full consistency or eventual consistency, or do you want to treat performance as the major concern? This is where you have to find out the target users you want to address and the use case: whether you want to handle a real-time use case or an analytical use case. That drives the CAP trade-off: whether you put consistency first or performance first. That's one thing to look after.
It is purely based on the user base you want to handle. One thing I would like to point out: even with the same set of data, you might want to consider multiple graph instances, because your target audience can vary. That is where this trade-off shows up.
And then flexibility. This is more about choosing your graph database: whether you want a typed graph, a property graph, or an adaptive-indexing graph database. These differ in the way the graph stores the data. A typed graph has a vertex-and-edge-based schema; a property graph leans more toward knowledge graph building; and adaptive indexing builds an index, runs the query, and drops the index at the end. So it is about choosing what type of graph database you want. These are things to address when choosing the database, when implementing a particular graph use case, and when finding out which target users you want to address. Those are the key things that drive this particular strategy.
So we talked about how graph databases are being used and what the current state is. What are the future directions for graph databases? They have their own limitations. One thing being worked on is that graph often requires heavy compute because of the way it handles complex queries, so there is ongoing work on hardware acceleration: using GPUs and handling graph algorithms natively. Those are things that are in progress right now.
On machine learning, I would say it has already reached a good state, because graph has long worked alongside machine learning, but the integration is being expanded so that graph and machine learning complement each other to build complex use cases. In some cases, I have seen AI-based implementations use graph to a vast extent. That's a major work area currently in progress.
And then serverless graph databases: there are a lot of graph providers offering their databases in the cloud, where you don't have to worry about the hardware behind them; that's coming as a serverless offering. That's another growing area I wanted to quickly touch on in terms of where graph databases are heading.
At last, I wanted to give a quick overview of the different areas we talked about. Graph as a real-time or near-real-time implementation often follows a distributed implementation, mainly to address the large amount of data it handles. When it comes to integration, it integrates with event-driven architecture to handle ML, fraud, or cybersecurity use cases. And from a storage standpoint, we balance cost versus performance across the three scenarios we talked about: in-memory, hybrid, and persistent storage.
Those are the key areas. At the same time, as I conclude, I want to make sure it is clear that not every use case should be implemented with graph. We need to be very clear about the use case we want to implement, the target audience we want to address, and the strategy we want to adopt when we do data integration. That is what will eventually drive us to success.
I hope this session was useful and informative. I really look forward to you applying these strategies in any of your future implementations. I wish you a great day and success. Thank you.