Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
I'm Andrew Manko. In this talk, I want to cover the evolution of databases from early pre-relational systems to modern web-scale, distributed databases. I'll also discuss the tradeoffs you face and how to manage them.
Let me tell you a bit about myself. I've been working with distributed systems for over 10 years, and right now I'm focusing on building ML infrastructure. Our team works on two major products, Moj and ShareChat, which together have over three hundred million monthly active users.
Okay.
Before we dive into modern databases, let's take a quick look back at the first databases. These were complex structures where data was organized as nested trees or graphs with pointers, in hierarchical and network models.
Working with them was extremely difficult. First, you had to traverse the entire hierarchy: to get a record, you had to navigate to it step by step. Your code was tightly coupled to the data structure; imagine just renaming a field, and everything breaks. And finally, there was no separation between the logical and the physical layer. Applications had to know exactly where and how the data was stored, which made systems fragile and inflexible.
In the 1970s, Codd introduced something that fundamentally changed the way we work with data: the relational model. Instead of complex structures, he proposed a formal mathematical foundation based on set theory and first-order predicate logic. I'm sure many of you studied relational algebra at university. But it wasn't just a new format, it was a new philosophy: Codd defined 12 principles, and what's important is that they addressed nearly all the key problems we just talked about on the previous slide.
After Codd's breakthrough, we entered the era of relational databases. The ideas he introduced were very quickly implemented in real systems. Why did it work so well? Because relational databases had a clear, formal foundation: set theory and logic. This meant a database could be built not on intuition, but on mathematics. SQL emerged as a unified, standardized query language. For the first time, developers could work with data declaratively, without needing to know how it was stored internally. Relational databases also introduced transactional support based on the ACID principles: a transaction either completes fully or doesn't happen at all. No more half-paid orders in your shopping cart.
They became a symbol of reliability and consistency, and of course, normalization played a huge role. We stopped storing the same data in 10 different places, and the data structure became clean and predictable. Schema design was now based on the domain model, and again, I won't go into too much detail here; I assume many of you have worked with the fifth normal form at university. This was the golden age of databases, and many systems are still built on those foundations today. Some of those systems are still well known, like Oracle or FoxPro, which were developed at that time.
However, in the nineties, a new era began: the era of mass internet access. Companies started building web applications, online stores, and digital services, and the number of users grew drastically. With that came a dramatic increase in database load. So how did this affect databases? First, the volume of data skyrocketed. Performance demands became critical: users expected instant responses, and no one wanted to wait while the server figured out how to join five normalized tables. The key challenge was that relational databases were usually designed to run on a single server and process all data locally. As the volume and complexity of data grew, they began to struggle, and in response the industry began to adapt. First of all, relational databases started to struggle under heavy load.
The natural first step was simple, intuitive vertical scaling. What does it mean? We simply upgrade to a more powerful server: add more RAM, a faster disk, more CPU power, more CPU cores. It makes sense: everything stays the same, one database, one entry point. The advantages are obvious. It's simple, there's no need to change the architecture, the data is still stored centrally, we don't lose consistency, transactions remain intact, and we can still use complex SQL queries, including joins. But there's a serious downside: vertical scaling has a hard limit. A server with a thousand cores won't save you if your product has hundreds of thousands of users online and just wants to survive.
The next logical step after vertical scaling is logical database separation, but now we do it deliberately, along domain boundaries. For example, imagine we have three loosely coupled areas, say orders, reporting, and users, and each of them has its own data, logic, and lifecycle. So we make the decision to split them into separate databases. It's simple: you're not dealing with distributed transactions or complex tools, just clearly separated domains, minimal coupling, and cheap isolation. The load is now distributed: a spike in user traffic doesn't affect the reporting system, and if one service crashes, the others stay up and running. We deliberately avoid joins between these domains, and we don't need them; each domain is handled by its own system. However, there are limits. If the load on one domain increases, say the number of users grows tenfold, we're again in a situation where a single server can't handle it. This solution isn't about scaling indefinitely; it's about extracting separate logical workloads that we can optimize independently.
At this stage, many teams and engineers came to an important realization: the fifth normal form isn't mandatory. You can live quite well without full normalization, you can denormalize your data, and transactions aren't always necessary. Good performance for the product is more important than internal elegance or strict adherence to Codd's model. What does this mean in practice? First, we let go of strict ACID guarantees. It turned out that in most real-world product scenarios, perfect integrity is overkill. For example, if a user clicks like on a post, it's not the end of the world if that like gets lost due to a failure; you don't need full transactional support for every minor action. Second, we denormalize data. In other words, we copy the same fragments to different places. Yes, it breaks Codd's principles and may feel like a step back to an earlier era, but in return we get fast reads, fewer joins, and better scalability. It's not just a technical decision, it's a product decision: it allows us to handle massive load and respond to users quickly.
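A tiny, hypothetical illustration of what that copying looks like in practice: instead of joining posts with users on every read, we copy the author's display name into each post at write time.

    # Normalized: reading a feed needs a join (or a second lookup) per post.
    users = {42: {"name": "Andrew"}}
    posts_normalized = [{"id": 1, "author_id": 42, "text": "hello"}]

    # Denormalized: the name is copied into the post, so reads need no join.
    posts_denormalized = [
        {"id": 1, "author_id": 42, "author_name": "Andrew", "text": "hello"}
    ]

    # The cost: if the user renames themselves, every copied author_name has to
    # be updated eventually, or tolerated as temporarily stale.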
And of course, this approach has downsides. It requires careful thinking: you need a deep understanding of the user journey, and you must consciously relax constraints where appropriate while still ensuring consistency in the places that matter. There is also still a physical performance ceiling. At this point we still rely on a single server, and you can't add more than a thousand CPU cores or several terabytes of memory.
The next common step is caching. It's another way to reduce load on the database, especially in systems where reads greatly outweigh writes. How does it work? Very simply: each application instance stores frequently used data in memory, for example reference data, user profiles, settings, or query results. This gives great performance: instead of querying the database, the app fetches the data from RAM. It's fast, it's simple, and it doesn't even require a network request. It also reduces database load.
But, as always, there's a trade-off. First, data may become stale. If something changes in the database and the local cache doesn't know about it, the user will see an outdated version. That lag is part of the compromise.
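As a rough sketch of the idea, and of the staleness trade-off, here is a minimal per-instance in-memory cache with a TTL in Python; load_user_profile is a made-up placeholder for a real database query.

    import time

    class TTLCache:
        """A tiny in-process cache: values expire after ttl_seconds."""
        def __init__(self, ttl_seconds=30):
            self.ttl = ttl_seconds
            self.store = {}  # key -> (value, expires_at)

        def get(self, key, loader):
            value, expires_at = self.store.get(key, (None, 0.0))
            if time.time() < expires_at:
                return value                  # served from RAM, possibly stale
            value = loader(key)               # cache miss: go to the database
            self.store[key] = (value, time.time() + self.ttl)
            return value

    def load_user_profile(user_id):
        # Placeholder for a real SELECT against the primary database.
        return {"id": user_id, "name": "example"}

    cache = TTLCache(ttl_seconds=30)
    profile = cache.get("user:42", load_user_profile)  # may be stale for up to 30 seconds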
Second, caching doesn't scale the system itself. We still rely on the same single database. We raise the ceiling, but we're still bound by the limits of that server, and we're not getting past those thousand CPU cores.
Now we're approaching one of the most fundamental concepts in distributed systems: the CAP theorem. It says that in a distributed system you cannot achieve all three of the following guarantees at the same time. Consistency: all nodes see the same data at the same time. Availability: every request receives a response, though it might be a bit outdated. Partition tolerance: the system keeps operating even when there is a network partition. So as soon as a network partition happens, and in real systems it does happen, you are forced to pick just two of the three. That's what the famous triangle says: pick two. Or let's say you want your system to always respond and all nodes to stay in sync. In that case, you can't tolerate network partitions: if the network breaks, your system goes down. That's the classic CA model, and it's typical of traditional relational databases like Postgres or MySQL running as a single instance. Obviously, such a system guarantees strict data correctness, no garbage or stale data, but it can't survive network partitions or node failures.
And here's where things get interesting: once you want to scale, to distribute data across nodes, you are forced to sacrifice either availability or consistency. CAP forces us to make deliberate architectural tradeoffs. And maybe a small fun fact: the theorem was originally proposed purely from empirical observations, without a formal proof. It was just a hypothesis, but it described what engineers were seeing very well, and it was only formally proven a few years later.
The CAP theorem is a great starting point, but it has one significant limitation: it only considers what happens during a network partition. But what about normal operation, when the network is stable? After all, partitions don't happen every day. This is where a more complete model comes into play: the PACELC theorem. PACELC basically says: if there is a Partition, you have to choose between Availability and Consistency; Else, under normal conditions, you have to choose between Latency and Consistency. In other words, the model says that even when your network is healthy, you still face a trade-off between latency and consistency. Take a look at the chart on the right. If you want low latency, you get faster responses, possibly with stale or inconsistent data. If you want strict consistency, you'll need to wait: the write has to reach all replicas, synchronous replication is performed, acknowledgements are collected, and only then does the client get a response. This model gives us a more realistic picture of how distributed systems behave. It reminds us that tradeoffs aren't just for emergencies; they are the everyday reality of distributed architecture.
After all these tradeoffs and partitions, a question arises: when should we actually use relational databases? And the answer is: almost always, unless you are dealing with extreme load. If you're under, for example, a thousand requests per second, a relational database is likely your best bet. You don't really need to scale it out; maybe you just need a logical split, or maybe some caching. You get all the benefits of the relational model: transactions, strong consistency, normalization. And best of all, you don't have to think about CAP or PACELC at all; it just works out of the box. And, even more important, in that scenario you still need to account for possible failures, right? Servers crash, disks die. Relational databases aren't magic, and without replication they become a single point of failure. So you need some mirroring, or at least a smart backup strategy. And obviously you can build a scalable system on traditional relational databases; I'll actually touch on that a bit later. You can do it with MySQL or Postgres without any additional extensions, but you'll lose many of the core advantages. At that point you're basically using a NoSQL database with a SQL interface, and scaling will be more complex to configure and maintain.
Let's move on to the techniques for scaling out. When a single database instance is no longer enough, especially in terms of read performance, many teams turn to replication. So what does it actually mean? We have one primary node, the only node that handles writes. From there, the primary automatically replicates data to read-only replicas. This gives us virtually unlimited read scalability: if you need to handle more read traffic, you just add more replicas. Users read from the replicas; it's fast, cheap, and efficient. An important point: replication introduces latency. A write goes to the primary, then into a queue, and only after that is it propagated to the replicas. Depending on the system load and the way replication is configured, with asynchronous replication this delay can range from milliseconds to seconds. As a result, users might see stale data, especially if they write to one node and immediately read from another. This is the AP model in CAP terms: we gain availability but sacrifice consistency. Another key point: all writes still go to a single primary, so if the primary becomes overloaded, we hit a write-throughput ceiling. Replication is a good solution when the system is read-heavy and write-light, but it doesn't make your system fully scalable, and it requires careful consideration of consistency tradeoffs.
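To make the read/write split concrete, here is a hedged sketch of how an application might route queries: writes go to the primary, reads go to a randomly chosen replica. The connection strings and the connect helper are placeholders, not a specific driver API.

    import random

    PRIMARY = "postgres://primary:5432/app"
    REPLICAS = [
        "postgres://replica-1:5432/app",
        "postgres://replica-2:5432/app",
    ]

    class FakeConnection:
        """Stand-in for a real driver connection, e.g. one from psycopg."""
        def __init__(self, dsn):
            self.dsn = dsn
        def execute(self, query, params=()):
            return f"[{self.dsn}] {query}"

    def connect(dsn):
        return FakeConnection(dsn)

    def execute(query, params=()):
        # Naive routing: anything that is not a SELECT goes to the primary.
        is_read = query.lstrip().upper().startswith("SELECT")
        dsn = random.choice(REPLICAS) if is_read else PRIMARY
        return connect(dsn).execute(query, params)

    print(execute("SELECT * FROM users WHERE id = %s", (42,)))              # replica
    print(execute("UPDATE users SET name = %s WHERE id = %s", ("A", 42)))   # primary

Because replicas lag behind the primary, a user who writes and then immediately reads may not see their own write; sticking a session to the primary for a short while after a write is a common workaround.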
Sharding is when we take one large table and split it into many smaller ones, distributing them across different nodes. For example, you might route one group of users to one server, another group to a second server, and so on. The key point is that you start scaling not just reads, but writes too. Unlike replication, where all writes go to a single primary, now each shard handles its own portion of the user traffic. This allows us to scale horizontally, serve millions of users and, who knows, maybe even build our own version of Facebook. The first question is how to split the data, and that depends on the specific case. The most common approach is hash-based sharding, for example by user ID. Another option is a lookup table that records which shards hold which data. Some systems shard by geography, user type, or even by business domain.
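Here is a minimal sketch of hash-based shard routing, assuming four shards and a user ID as the shard key; the shard host names are invented, and a real system would also need a plan for resharding (for example, consistent hashing) when shards are added.

    import hashlib

    SHARDS = [
        "users-shard-0.db.internal",
        "users-shard-1.db.internal",
        "users-shard-2.db.internal",
        "users-shard-3.db.internal",
    ]

    def shard_for(user_id: str) -> str:
        # Stable hash of the shard key picks the shard that owns this user.
        digest = hashlib.md5(user_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("user:42"))    # every request for this user lands on one shard
    print(shard_for("user:1337"))  # a different user may live on another shard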
But of course, it's not all smooth sailing. First, in this design we prioritize consistency over availability: if a shard goes down, part of your data becomes temporarily unavailable. Second, you lose multi-shard transactions. You can't update two shards atomically with ACID semantics, which means your data model needs to be carefully designed, and analytical queries, even a simple SELECT COUNT(*), become much more complex to execute.
And so NoSQL emerged, the industry's response to the challenge of scaling databases. The industry began to look for new solutions, and a new class of data stores appeared, designed not around strict schemas and ACID transactions, but around scalability and performance. As I said, we can make our relational database scale using the previous techniques, but it will just be harder to do. The design philosophy is also different: it's query-first. We begin by asking what kind of queries our application will run, and we design the data model specifically to make those queries as fast and efficient as possible. This is the opposite of the traditional approach, where we build a clean, normalized schema first and only later try to optimize queries on top of it. Additionally, NoSQL systems don't necessarily use tables, and they don't require SQL. Each database has its own data model and interface: some work with JSON documents, like MongoDB; others use key-value pairs, like Redis. And as for transactions in the classical sense, they're often missing. Some systems do support local transactions within a single entity or partition, or simple atomic operations, but global ACID guarantees across nodes are the exception rather than the rule. Instead, NoSQL databases focus on simplicity, fault tolerance, and horizontal scalability.
But what if you do need transactions that span multiple nodes? Once your data is spread across multiple nodes or databases and a change touches several of them, how can you ensure that all of the changes are applied together, or none of them at all? To make this work, we use coordination protocols, for example two-phase commit, or more advanced ones like Paxos or Raft, which are used in consensus-based systems. And here's the thing: in this scenario, distributed transactions have serious downsides. First, performance. To commit a transaction, you have to wait for a response from every participant, even if just one of them is slow. Consensus is inherently slow, and in many cases resources are locked until the second phase is complete, which means other operations might be stuck waiting. Second, if the coordinator fails at exactly the wrong moment, the system can get stuck in an awkward middle state: the transaction is neither committed nor rolled back. This is a classic weakness of two-phase commit; it's not resilient to coordinator crashes. And third, complexity. Distributed transactions are hard to implement: you have to think about partial failures, retries, and duplicates at every single step, for every use case and every operation. So distributed transactions are possible, but they come with a lot of problems. That's why, in most distributed systems, we try to avoid them where we can.
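To show the shape of the protocol, and why it's fragile, here is a heavily simplified two-phase commit sketch with in-memory participants and no timeouts or recovery log; if the coordinator died between the two phases, the participants would be left holding their staged changes and locks with no decision.

    class Participant:
        def __init__(self, name):
            self.name = name
            self.staged = None

        def prepare(self, change):
            # Phase 1: stage the change, hold locks, and vote yes or no.
            self.staged = change
            return True

        def commit(self):
            # Phase 2a: make the staged change durable.
            print(f"{self.name}: committed {self.staged}")

        def rollback(self):
            # Phase 2b: discard the staged change and release locks.
            print(f"{self.name}: rolled back")
            self.staged = None

    def two_phase_commit(participants, change):
        votes = [p.prepare(change) for p in participants]  # blocks on the slowest one
        if all(votes):
            for p in participants:
                p.commit()
            return True
        for p in participants:
            p.rollback()
        return False

    two_phase_commit([Participant("orders-db"), Participant("payments-db")],
                     {"order_id": 1, "status": "paid"})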
But sometimes we do need them. The need for scalability kept growing, but so did the desire to keep transactional guarantees, and a new class of databases emerged. This class is known as Distributed SQL. The idea is simple: take a traditional SQL database and make it distributed, but keep the guarantees and the familiar interface. Everything works just like in Postgres or MySQL, only now it can scale across dozens or hundreds of nodes. Distributed SQL systems preserve transactional consistency: you can use familiar transactions, expect strict consistency, and design data models using well-known relational patterns, which makes migration from legacy systems much easier. And SQL remains the query interface, so there's no need to learn a new language or a special API. These systems are also horizontally scalable: they support automatic sharding, replication, and distributed transactions. Architecturally, they are quite close to NoSQL systems, but with a relational facade.

But of course, nothing comes for free. Even with automatic sharding, you still have to keep it in mind when designing your schema: if you frequently perform joins across tables that live on different shards, those queries become very slow, even though everything still works. The same goes for distributed transactions: yes, you can update multiple shards atomically, but it will be slower than performing the transaction inside a single node. That means schema design is absolutely critical, and you still need the query-first approach rather than the traditional domain-first approach of relational databases. From a CAP perspective, distributed SQL systems usually prioritize consistency over availability and latency: consistency is preserved, but due to network issues, part of the system may become temporarily unavailable. In practice, global transactions across shards are rarely needed; most workloads can be handled just fine by NoSQL databases with eventual consistency. And what about learning a new query language? To be honest, it's not really a problem for most developers; it's usually a matter of days.
Now let's move to something more practical. On this slide, we see how the CAP theorem applies to Redis in practice. Redis is a blazing-fast in-memory key-value store, and it can be configured in different ways depending on your needs. In this setup, Redis runs in a clustered configuration with sharding: each key is hashed and routed to a specific shard, which gives us horizontal scaling. Sorry, give me a second. Let's look at this through the lens of CAP. Consistency is prioritized: writes go to the primary node and stay correct, which is critical for things like distributed locks, for example. Availability is sacrificed: if a shard fails, part of the keyspace refuses operations instead of serving stale or inconsistent data. Partition tolerance is supported: Redis can keep operating with the healthy part of the system. And we're able to scale by increasing the number of shards. So in this configuration we sacrifice availability to maintain consistency, which matters for correctness-critical use cases like the distributed locks I mentioned before.
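On the client side this setup stays simple. Here is a sketch using the redis-py cluster client; the host and port are placeholder assumptions for a local cluster. The client hashes each key to a slot and sends the command to the node that owns that slot.

    from redis.cluster import RedisCluster

    # Any reachable node works as an entry point; the client discovers the rest.
    rc = RedisCluster(host="localhost", port=7000, decode_responses=True)

    rc.set("user:42:session", "abc123")   # routed to the shard owning this key's slot
    print(rc.get("user:42:session"))

    # If the shard that owns this slot is down, the command fails rather than
    # returning stale data: consistency is preferred over availability here.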
On this slide, we see Redis in a different configuration. We still use sharding, but now each shard is replicated: clients write to the primary but read from either the primary or the replicas. This boosts availability and partition tolerance: if some nodes go down, the system stays responsive. The tradeoff is consistency, because with asynchronous replication reads may return stale data. So this setup favors high availability over strict consistency, which is ideal for read-heavy systems where data freshness isn't critical. For contrast, here's another Redis configuration with no replication and no sharding. In this case Redis behaves more like a traditional single-node database: a single instance, and all writes go to one node.
Now let's take a moment and talk more specifically about consistency. What you see here is the classic classification of consistency models. On the left, we have the traditional relational models; these mostly describe different levels of transaction isolation. On the right, we have the consistency models used in distributed systems. At the top are the strongest models, where writes appear instantly on all nodes. The further down you go, the weaker the guarantees: data might take time to propagate or stay inconsistent for a while. But here is the thing: in practice, people often don't talk about consistency in such formal terms. So let's break down some of the most common consistency models used in practice. Strong consistency: the most reliable and strict model. After a write, all clients immediately see the same value. There is no delay and no uncertainty; everything is perfectly in sync. Eventual consistency: the weakest consistency model. After a write, data is gradually synchronized across nodes and eventually becomes consistent, but not right away. In practice, this delay is usually just a few milliseconds, and that's more than enough for most real-world use cases. This model is typical for distributed NoSQL databases like DynamoDB or Cassandra. And tunable consistency: here you can adjust the level of consistency per request. For example, consider a write successful only if it's acknowledged by at least two out of three replicas. This approach is supported by systems like Cassandra and ScyllaDB. It gives you the flexibility to balance performance and availability depending on the specific use case.
Later on, I basically want to show different configurations based on ScyllaDB. ScyllaDB is a modern distributed NoSQL database, inspired by Cassandra but designed for much higher performance. One of its key features is tunable consistency: at the query level, you can define how many replicas must acknowledge an operation. This gives you fine-grained control over the trade-off between speed, availability, and consistency. Here is how a request works. The client connects to any node, and that node becomes the coordinator. The coordinator figures out which nodes are responsible for the requested keys based on the hash. It then forwards the request to the appropriate replicas. For example, if the replication factor equals three, a write is sent to three nodes. The coordinator waits for acknowledgements based on the chosen consistency level, and once they arrive, it returns the result to the client. The key point is that the client is completely unaware of how the data is distributed; all the complexity is handled by the coordinator. This lets us scale horizontally without losing performance or resilience: as you add more nodes, there is no single point of failure, and all nodes are equal.
Now let's see how a write works with consistency level ALL. The client sends a write to the coordinator, and the coordinator forwards the write to all replicas. The write is only successful if all three replicas acknowledge the operation. If even one replica is unreachable, for example due to a network partition, the write is rejected. What does this mean? We get strong consistency, lower availability, and lower performance. So basically we prioritize consistency over availability in case of any partitioning, and over latency during normal operation. Consistency level ALL is used in critical scenarios such as bank transfers, payment processing, or any case where data correctness is more important than availability. In those situations it is better to fail the operation than to risk data inconsistency.
Another example: how a write works with consistency level ONE. The client sends the write to the coordinator, the coordinator sends the write to all replicas, but it returns success after just one acknowledgement; the remaining replicas are updated asynchronously. What does this mean? High availability: even if most nodes are down, writes can still proceed. Let's say we can keep working with just 33% of the nodes up. Consistency is not guaranteed at all: clients may read stale data. And maximum performance: responses are returned very quickly. This consistency level is ideal for news feeds, ads, analytics, and so on.
And finally, how a write works with consistency level QUORUM. The client sends the request to a coordinator, and the coordinator forwards it to all replicas. The write is successful once a majority of them confirm it, and the remaining replicas update asynchronously. That's basically a good balance: we get good consistency, better availability than with consistency level ALL, and still pretty good performance, faster than ALL though slower than consistency level ONE. So it's a balance between speed, availability, and consistency, ideal for important but not mission-critical operations.
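Here is a hedged sketch of setting these levels per statement with the Python cassandra driver, which also works with ScyllaDB; the contact point, keyspace, and table names are made-up assumptions, not part of the talk.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])       # assumed contact point
    session = cluster.connect("shop")      # assumed keyspace

    # Correctness-critical write: every replica must acknowledge (CL=ALL).
    payment = SimpleStatement(
        "INSERT INTO payments (id, amount) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ALL,
    )
    session.execute(payment, (1, 99.90))

    # Balanced default: a majority of replicas (CL=QUORUM).
    order = SimpleStatement(
        "UPDATE orders SET status = %s WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(order, ("paid", 1))

    # Loss-tolerant, latency-sensitive write: one acknowledgement (CL=ONE).
    like = SimpleStatement(
        "INSERT INTO post_likes (post_id, user_id) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(like, (42, 7))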
And one more thing about databases: I want to talk a little bit about storage engines. The B-tree is a classic data structure that is widely used in relational databases, and even in some NoSQL databases. Many of you probably learned it at university, so I won't go into the internals. In short, all data is stored on disk, but frequently accessed data is cached in memory for better performance; minimizing disk accesses is key to keeping reads fast. B-trees are optimized for reads: only a few disk lookups are needed to find a record. That's perfect for read-heavy workloads, but not ideal for heavy writes: with lots of inserts and updates, the B-tree is constantly rebalancing, which can slow things down significantly because it produces a lot of disk writes.

And there is another type of database engine, based on the LSM tree, the log-structured merge tree. The LSM tree is designed for high write throughput: writes go first into memory and are only later flushed to disk in batches, which greatly improves write performance and reduces disk wear. On reads, the system first checks the in-memory structures, the row cache, the memtable, and the immutable memtables, and if the data isn't found there, it reads from disk. A Bloom filter helps skip files that definitely don't contain the requested key. A Bloom filter is basically a probabilistic set: a data structure that tells you whether a key might exist in a file or definitely doesn't. So LSM trees are ideal for write-heavy workloads, but if your system is read-heavy with relatively few writes, a B-tree may be a better fit. So choose the storage engine based on your workload: write-heavy, go with an LSM tree; read-heavy, choose a B-tree.
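Since the Bloom filter came up, here is a toy sketch of the idea; a real LSM engine uses a carefully sized bit array and much faster hash functions. The point is that it can say "definitely not here" with certainty and "maybe here" with a small false-positive rate, which is what lets the read path skip whole files on disk.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1024, num_hashes=3):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = 0  # a big integer used as a bit array

        def _positions(self, key):
            for i in range(self.num_hashes):
                h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(h, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= 1 << pos

        def might_contain(self, key):
            # False means definitely absent; True means probably present.
            return all(self.bits & (1 << pos) for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("user:42")
    print(bf.might_contain("user:42"))    # True
    print(bf.might_contain("user:9999"))  # almost certainly False, so skip this file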
And a little more about modern trends before we finish. Today, many companies take popular, battle-tested databases and rewrite them from the ground up using fast languages like C++ and a heavily optimized architecture. What's important is that the API stays the same, so developers don't need to learn anything new, but under the hood everything is new: no locks, no contention, and full CPU utilization. The result is amazing: performance gains of sometimes 10x or more, with the same features and interfaces developers already know, and no surprises. We usually say that you should not rewrite everything from scratch, but right now there are quite a few companies that did exactly that: they took an existing technology, rewrote it completely on a modern architecture, and built a pretty nice business model on top of it.

And what's the core feature of every such system? A shared-nothing architecture. Take a look at this image. On the left, we see a typical multi-threaded system with shared resources and locks: the dogs are all fighting over the same ball. That means locks, contention, and CPU underutilization. Now look at the right side: each dog has its own ball. Each thread has its own data and its own CPU core. No contention, no locks, no synchronization, just pure efficiency. This model, as I said, is called thread-per-core. Each core handles its own workload, its own memory, its own responsibility: no context switching, no shared-memory bottlenecks, and everything stays within the CPU cache. It's radically fast, and it fits modern hardware perfectly.
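As a very loose illustration of the shared-nothing idea, here is a Python sketch (the real engines do this with threads in C++): one worker process per core, each pinned to its core and owning its own slice of the data, so nothing is shared and nothing needs a lock. os.sched_setaffinity is Linux-only, and the workload itself is a made-up placeholder.

    import multiprocessing as mp
    import os

    def worker(core_id, shard):
        # Pin this worker to one core; it exclusively owns its shard,
        # so there are no locks and no shared-memory contention.
        os.sched_setaffinity(0, {core_id})  # Linux-only
        print(f"core {core_id}: {len(shard)} items, sum={sum(shard)}")

    if __name__ == "__main__":
        data = list(range(1_000_000))
        cores = os.cpu_count() or 1
        shards = [data[i::cores] for i in range(cores)]  # one partition per core
        procs = [mp.Process(target=worker, args=(i, s)) for i, s in enumerate(shards)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()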
In conclusion, I want to say that there is no perfect database, only the right one for your specific needs. The tradeoffs between consistency, performance, and scalability are intentional, and understanding them is what enables us to design reliable, efficient, and well-justified architectures. That's all from me. Thank you. If you have any questions, I'll be happy to answer them. Feel free to reach out any time through any of my available contacts. Bye-bye.