Transcript
Hello, everyone. My name is Adam Furmanek. Thank you for coming to this talk, in which we are going to talk a little bit about effective database observability. No matter what SQL database you use, whether it's PostgreSQL, MySQL, Microsoft SQL Server, or maybe even IBM Db2, it doesn't matter. We are going to talk about how to effectively observe what's going on in our databases, how to debug our problems, and how to get an understanding of all the issues that happen in our database management system.
My name is Adam Furmanek, and I will be your guide for the
next 40 minutes. So let's jump straight to
it. So the very first thing I'd like to start with is the
world has changed significantly in the last years.
So previously, 20 years ago, all we had was
very simple two tier architecture. We had like a database.
We had maybe a couple of nodes talking to the same database.
Typically it was just one node, but sometimes we had like two or three of
them. And obviously we had a user that was using our application.
Nothing special here. If we wanted to have
another application, if we wanted to do something else,
typically we started building another monolith,
right? We had monolithical applications that were all talking to
one data store, and they were completely independent,
not talking much to each other. So if there was a need
to build something else, it was completely independent
from the previous monolith. The problem was that,
hey, those things were not necessarily scalable.
We didn't know, or we didn't have any means to deploy
them efficiently, to deploy them quickly, and to iterate
many times a day with our changes. Whenever we wanted to release a change, we basically had to go through a big, scripted deployment, deploying the whole application and then making sure everything works. And obviously it was time consuming, it was very slow. And the worst part was everything was tightly coupled with each other. So then
we started introducing DevOps. We realized that,
hey, the way in which we deploy applications, or the people that deploy applications, are completely independent from the people that develop those applications. This is not necessarily
the efficient way. So we wanted to bring them together.
So DevOps emerged, and what we wanted to do is we wanted to put
both teams side by side so they could closely cooperate
and deploy things faster and participate in both
development and deployment. And then we realized
we can bridge this gap even more. We can actually
come up with DevOps engineers that are both capable
of writing the application code and deploying
it. And at the same time, the world complexity
increased significantly. So what we have now
is our applications, our microservices, they are
much more complex, because they talk to many data
sources, many incompatible databases or data stores.
They use many different computation platforms. We have serverless, we have full-blown bare metal, we have edge computing, we have many other things. And most
importantly, everything talks to each other.
So that's the biggest issue. That's the biggest challenge we face.
Now, in order to deploy the changes, we don't need to deploy
everything. We can just deploy a small piece of
our modifications, like one microservice, one serverless
lambda, or function, and off we go. The problem though,
is that now everything talks to each other. The increased complexity,
we didn't get rid of it. We only moved it into a different part of the ecosystem.
And so we pushed the complexity from the
deployment to maintenance and operations.
Now everything talks to each other. And whenever there is
an issue, no matter whether it's like an issue with
database, with queuing system, with service bus,
we kind of don't know how to fix it,
how to deal with that, because we need to deal with logs
that are scattered throughout the system. We need to deal with pieces
of information scattered here and there, and not even being
able to tell what was the order of operations,
because everything is now distributed,
globalized, so we can't easily tell how the things
executed. And when there was an issue, then, was it just a
manifestation of some bug in this part of the system?
Or was it a coincidence? Because everything is now distributed
and moving forward, there is this thing that we know
as Conway's law. Conway's law basically tells us
that the architecture of our applications will
mimic the organization that we live in. Meaning that
if we want to deal with microservices
that all have their independent data stores,
that all have their independent, like code
base, are deployed completely independently,
then it means that we basically need to build
teams that have the same architecture,
same structure. Conway's law is not something that we impose as an axiom; it is not something that tells us we need to do it this way because otherwise we are doing something wrong. No, Conway's law is an observation,
telling us that if we stop fighting really,
really hard, then our applications will
resemble the structure of our organization.
This basically means that if we want to impose some
architecture, we should start with imposing the
right teams structure in our organization.
Meaning that if we take pictures from Team Topologies, a very nice book.
If we want to have microservices that are completely separate,
completely independent, and they deal with different independent databases
and code bases, then what we need to do is we need to take this
picture and basically turn it 90 degrees
so we get exactly the same team structure. Notice that those images are exactly the same; they are just rotated by 90 degrees, but they show exactly the same thing. If we want to have efficient, independent and decoupled applications, what we need to do is have decoupled and independent
teams. And taking into account what happened with the increased
complexity of our world, this means that
we can't rely on the solutions for observability, monitoring and maintenance that we used when we were dealing with monolithic applications, because they just don't work anymore. The same way we changed how we develop and deploy our applications, we need to change the way we maintain them and the way we structure our ops teams. This effectively means that because of the increased complexity, we can't rely on DBAs alone anymore. We need to change the way we structure our organization. And we need to do it with the help of platform engineers,
DevOps engineers and developers all working together to
reshape the world. Let's see how to reshape the world.
The important part that we all know
about is that communication is the
main factor in reducing efficiency and performance.
The things that we can automate, things that can be done
by tools, they are fast, because computers are fast
in general. But the things that we need to do manually,
or the things that we need to do by cooperating with other
teams, those things are slow.
This means that the highest bandwidth is inside our teams, where we can communicate easily and quickly. If we need to go across teams, then the communication is going to be way slower, which will lead to far slower troubleshooting and issue resolution. This increases the mean time to recovery, the time to fix an issue, basically the time to do anything, just because we spend time on communication and cannot speed it up. This is also another
outcome of Conway's law. So what we need to do is
we need to understand that in order to have efficient observability,
monitoring and efficient DevOps infrastructure or DevOps
culture, what we need to do is we need to minimize communication.
That's the very important part. And in order to
minimize communication, we need to basically minimize
the reasons to communicate. And those reasons are typically
that we understand or realize there is a bug,
but we figure it out very late in our pipeline.
So what we want to do is we want to shift left
all the checks. In order to have efficient communication,
we need to not communicate. The less we communicate,
the better. So we need to avoid communication at all.
And how can we avoid communication? We can
do it by shifting all the things to the left and
building teams that don't need to communicate
with other teams. We need to build small and highly efficient
teams with well defined scope so they can deal with
everything they need without the need
to go to some other teams and ask for help.
So this is why we need to have the shift left. And this is the important part when talking about databases. Now, the problem with databases is that we can't easily shift left our testing and our maintenance, because many times developers are very reluctant to test their solutions. They don't have the tools, they don't have the means to do so, and most importantly, they lack the working knowledge of how to do that.
So our ultimate goal is always self service.
We always want our teams to be small,
to be independent from other teams and
teams that can do everything they need on their
own, meaning that if they need to get some data, some logs, or access to some systems, they need to do it in a self-serve way.
They shouldn't ask for permissions, they shouldn't communicate
with others, they should do this in a self service
way. So this is the goal, this is how we need to
turn our organizations to have efficient DevOps
culture. And this affects everything we work on
like web services, desktop applications, and most importantly
for the purpose of this talk is databases.
We need to reshape our organizations,
to have new team structures that really
embrace the complexity that increased and make it straightforward
to maintain, deploy and troubleshoot databases.
Let's see how we can do all of that and let's see
what we can build now to shift left the ownership.
Make developers own their databases, make developers
self serve their issues, and most importantly,
finally fix all the problems as early as
possible. And this is a must, no matter whether we are a small company or
big Fortune 500 enterprise, we need this
shift in the organization because basically
that's a must have for us to move faster, to have
highly efficient DevOps culture that maximizes DORA metrics, so we can move fast in a reliable,
robust way. So the very first thing when
we talk about observability in general,
we said that we would like to do it around
databases, but what we have now around databases is we
typically have dashboards. This applies to the whole observability
in general. We have dashboards that present lots
of data, but they lack any information that
they should convey. We have charts; we are swamped with raw data, with signals, with metrics, with everything around, but we do not get the information about what's going wrong. We have lots of
signals. This includes like deployment signals,
metrics, traffic maybe sometimes like very detailed
metrics from particular regions or dimensions. But we don't
understand the big picture and we don't understand the coherent
story of what is going on. So we can't do efficient observability if we don't have better tools. With current tools, just like the one we see on the screen, we just can't do efficient observability. So let's
see how to build this efficient, effective observability
for databases. But in order to do that, we need to understand what
can go wrong. So let's see a couple of things that break in databases
and then let's see how we can actually build the stuff
better. So the problems in databases, they typically fall
into one of three areas. So first area is code changes.
So we change the code, we basically execute stuff. We execute
different queries because we modified our application and this obviously leads
to performance degradations and problems, right? So this is code changes.
The second area is schema changes. Whenever we change the schema of our applications, we change databases, columns, data types, and all of this can affect the way our applications execute and can break our applications.
But last but not least, there is yet another area which
we typically miss when it comes to building observability.
Building resilient testing suites and building all the processes around CI/CD for that. And this is how queries are
executed. And I'm not talking about what query change,
meaning that we added new columns, started joining another table.
No, what I mean here is that we have a query that, over days, changes the way it is being executed by the database engine; the execution plan changes because statistics changed, because indexes changed, because the surrounding configuration changed. So all those things may affect how the query is executed, and we won't notice that in our CI/CD, and our current DevOps culture cannot easily see that.
So let's see a couple of examples. Obviously we
have queries, for instance, like the query here on the left. So we get a user and we want to extract some details for this user. So we basically join many tables. This query, which comes from one of my production systems, effectively joins multiple tables and in turn extracts nearly 300,000 rows, even though we extract data just for one user. Effectively, this query was running for around 25 seconds. Why? The problem here is that we join many tables, and because we join many tables, we effectively explode the number of rows we extract from the database, most of the rows being duplicates. Because we join data from multiple tables, we effectively have an exponential increase in the number of rows, which are then later translated by our ORM library, and mostly discarded, to build just one entity representing this particular user. And now, can we avoid having an issue like this? Can we somehow realize that, hey, we have a query that is going to be super slow in production? The problem with our current approach for DevOps,
for CI CD, for automated testing is that we focus
on correctness of the data. We always
focus on whether the data that we extracted from
the database is correct or not. We do assertions
on the content that we extracted from the database,
but we do not assert the way the data was
extracted. Sometimes even worse. Sometimes we may say that hey,
this is an implementation detail that we shouldn't
care about. We don't care how things are executed by the database
engine as long as the data is correct and this is wrong.
Because yes, we shouldn't care how it's done,
but we need to care whether it's done fast enough.
And for that we need to have ways to
assess that those queries are executed fast
enough and that nothing we changed in our application code is
going to degrade the performance and basically kill it.
So in order to do that, we need to come up with new ways
of testing applications. And I'll speak later how to
achieve that. But apart from just testing the content, we need
to also test the performance of our applications. In this
case, the example, the solution to this problem was rather straightforward.
Instead of just joining all the tables, we basically run multiple
queries that execute multiple SQL statements against
the database. And you may think, okay, this is going
to be way slower because we run eight queries instead of one.
But the reality is it's completely the opposite, because we execute queries that each get a very small amount of data from these tables, and then we join this data in our application code. This actually executes in a fraction of a second, in a couple of milliseconds, so way faster than when we were running just one single query.
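To make the difference concrete, here is a rough sketch of the two shapes of the query. The table and column names are made up for illustration; the real production query joined far more tables.

```sql
-- One big join: the row count multiplies across the joined tables
-- (roles x permissions x addresses per user), and the ORM later
-- discards most of the duplicated values to build a single entity.
SELECT *
FROM users u
JOIN user_roles r       ON r.user_id = u.id
JOIN user_permissions p ON p.user_id = u.id
JOIN user_addresses a   ON a.user_id = u.id
WHERE u.id = 123;

-- Several small queries: each one returns only a handful of rows,
-- and the application code stitches the entity together.
SELECT * FROM users            WHERE id = 123;
SELECT * FROM user_roles       WHERE user_id = 123;
SELECT * FROM user_permissions WHERE user_id = 123;
SELECT * FROM user_addresses   WHERE user_id = 123;
```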
Another example, where we may not necessarily see the performance drop, is when we try rewriting our queries so they are more readable. There is a thing called a CTE, a common table expression. Let's say that we have a boarding passes table which contains something like 8 million rows, and what we would like to do is get those boarding passes and calculate the checksum of the ticket number, doing it twice using the MD5 algorithm, just for the sake of showing the performance issues. So we create a CTE, which is kind of like a temporary named table, here called cte_performance, that we then join three times on the same ticket number, flight id and boarding number, and we get the data for a particular checksum. So this is one way we can write the query, but we can also write the same query without using this CTE. We can basically join boarding passes three times and then calculate the MD5 checksum
in the WHERE condition. Those two queries are equivalent. The difference in performance, though, is tremendous. In this case, when we don't use the CTE, the query runs in 8 seconds. When we do use the CTE, that's 13 seconds, so over one and a half times as long as the previous query.
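A sketch of the two variants follows. The table and columns follow the well-known PostgreSQL demo bookings schema mentioned in the talk; the literal ticket number is just an example.

```sql
-- Variant 1: a CTE computes the checksum, and the CTE is joined three times.
WITH cte_performance AS (
    SELECT ticket_no, flight_id, boarding_no,
           md5(md5(ticket_no)) AS checksum
    FROM boarding_passes
)
SELECT b1.ticket_no
FROM cte_performance b1
JOIN cte_performance b2 USING (ticket_no, flight_id, boarding_no)
JOIN cte_performance b3 USING (ticket_no, flight_id, boarding_no)
WHERE b1.checksum = md5(md5('0005432000284'));

-- Variant 2: the same result without the CTE; boarding_passes is joined
-- three times and the checksum is computed directly in the WHERE clause.
SELECT b1.ticket_no
FROM boarding_passes b1
JOIN boarding_passes b2 USING (ticket_no, flight_id, boarding_no)
JOIN boarding_passes b3 USING (ticket_no, flight_id, boarding_no)
WHERE md5(md5(b1.ticket_no)) = md5(md5('0005432000284'));
```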
So if we now step into our system and take the query that we see at the bottom and decide to rewrite it to the query at the top because it's more readable, then yes, all our unit tests are still going to be green, because the data is exactly the same. Those queries extract exactly the same data. The problem is, again, with the performance.
We extract data differently and so we break
the performance of our system. The biggest issue is we won't
learn about that until we go to production.
Why is that? That is because in our local environment
we typically have very few rows,
so we won't see any performance issues. But when
we go to production and we move from like a couple
hundred rows to millions of rows, then we'll
immediately see things are not working well.
But it's going to be very late in our pipeline when we realize those things are slow. Moving on, another set of issues that may pop up when we are dealing with databases is around schema migrations.
So all those issues, whenever we add a column,
we drop a column, we change the column data type.
Whenever we do those things, we risk that we will need to
rewrite the table, meaning that we will need to take the data, copy it on the side, drop the table, recreate it with the proper schema, take the data back and put it in the table.
And when we are doing that, when the database engine is doing that,
we effectively need to take the database offline.
This table needs to be taken offline and this will lead
to the outage of our system. The problem
with this is we will take our databases
offline when going to production.
Even worse, it's not easy to
stop such a migration when we realize something is wrong,
because we can't just easily roll things back in
case of schema migrations. And the worst part, if we
wanted to avoid issues like this one, is that our
unit tests don't even check schema migrations.
Why is that? That is because all they do is
they verify the content of the database when the database is
already back online. They run against
the living database, so they don't verify stuff
when the schema migration is going through, meaning that
they will always just work either with the old schema
or with the new one, so they won't capture any performance
issues. Not to mention that if we try finding those performance issues in our local environments, then we probably have just a couple of rows, which are going to be migrated very, very fast. Until we go to
bigger databases, we won't see any of those issues.
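As a rough PostgreSQL-flavored sketch (hypothetical users table; the exact behavior depends on the engine and version), the form of the migration decides whether the engine has to rewrite the whole table:

```sql
-- Changing a column's data type generally forces a full table rewrite:
-- every row is copied, and the table is effectively offline for writes
-- for the duration of the migration.
ALTER TABLE users ALTER COLUMN account_id TYPE bigint;

-- Adding a nullable column (and, in recent PostgreSQL versions, even one
-- with a constant default) is only a catalog change and finishes almost
-- instantly regardless of how many rows the table holds.
ALTER TABLE users ADD COLUMN last_login timestamptz;
```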
Yet another problem we may have around the databases is
indexes. Obviously when we finally spot
a problem that something is slow, what we can do is
we can add indexes. We obviously need to answer questions like what
index to add, when to apply it, which columns
to include, et cetera, et cetera. And then we can run our select statements
and verify that yes, they now work much faster.
This is great. The problem with indexes though is that
indexes affect not only the
query that read the data, but also the
queries that modify the data. Meaning that if
we add an index, then yes, we improve the performance
of reading the data, but at the same time we decrease the
performance of modifying the data. And if we tell one of our junior developers, hey, if you have a slow query, add an index, then we end up with this. We have many, many indexes in
our database that are now competing and slowing us
down. Because yes, it may be that they speed things up
when it comes to select statements, but when it comes to
inserts, updates or deletes, they make it way
slower. Not to mention there is yet another problem with indexes.
When we add an index, we verify that our query
uses the index because we see the performance improvement. That is
cool. But what happens six months later when we change
the query and we do not notice
that the performance dropped? The query may stop using an
index. Maybe the index is not used anymore, not usable
anymore. Maybe we have too many rows and the database engine decides not to use such an index. Maybe we have many indexes, and generally using this index is inefficient. How do we verify that during our CI/CD pipeline? How do we notice that?
With our unit tests. Again, impossible,
because our unit tests, they focus on the content that we extracted from the
database, not on the way that we extracted the data.
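A small sketch, with hypothetical table and column names: the same index that speeds up the read is a cost on every write, and whether a query still uses it is something only the execution plan reveals.

```sql
-- Speeds up lookups by customer...
CREATE INDEX orders_customer_id_idx ON orders (customer_id);

-- ...but every INSERT, UPDATE, or DELETE on orders now also maintains this index.

-- The only reliable way to confirm the query still benefits from it:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Look for "Index Scan using orders_customer_id_idx" in the plan;
-- a "Seq Scan on orders" means the planner is no longer using the index.
```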
Moving on, ORMs increase the complexity even more, because, for instance, we may have the N+1 problem. With ORMs, N+1 means that, let's say, we have an aircrafts table that is connected one-to-many with a seats table, and we would like to get all the aircraft, iterate over those aircraft, and basically calculate the seats count.
Okay. The problem with this approach is that if our ORM works in so-called lazy mode, then we will have N+1 queries. We will have one query to load all the aircraft from the database, and then we will have N additional queries, one for each aircraft, to get the number of seats. Obviously, this is going to lead to N+1 queries, which is inefficient. We could change that and deal with this whole data using just one query, a SELECT with a JOIN. The problem is we won't notice
the problem when we are dealing with unit tests or integration tests, because once
again they just focus on the correctness of the data,
not the performance. But you might say, okay, let's switch
all the queries to always be eager, to always join the data, to run fast. Then we end up with the query that we already saw in our slides, which is not fast, because we join many tables and we kill the performance.
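Sketched as the SQL that the ORM ends up sending (tables follow the aircrafts and seats example; the exact statements depend on the ORM), the two modes look like this:

```sql
-- Lazy loading: one query for the aircraft, then N more, one per aircraft.
SELECT * FROM aircrafts;
SELECT count(*) FROM seats WHERE aircraft_code = 'A320';
SELECT count(*) FROM seats WHERE aircraft_code = 'B763';
-- ...and so on, one query per aircraft.

-- Eager loading: a single round trip that does the same work in the database.
SELECT a.aircraft_code, count(s.seat_no) AS seats_count
FROM aircrafts a
LEFT JOIN seats s ON s.aircraft_code = a.aircraft_code
GROUP BY a.aircraft_code;
```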
So it's not a silver bullet. We can't just fix this issue once
and for all and make it work all the time. This just doesn't
work like that. Another problem we have around ORMs is that they hide tons of complexity. They hide what transaction isolation levels we use, what transaction
scope we have when we start a transaction, when we commit it,
when we roll it back, who is in charge of committing
or rolling back the transaction? Can we have nested transactions?
How do we cache data? Can we cache data across nodes?
Do we have connection pooling? Do we use query hints? Do we use many,
many other things? Not to mention migrations.
Is our orm in charge of migrations? What if we have like heterogeneous
applications, many applications, talking to the same data source, who is
in charge of migrations? What happens if the ORM realizes
that the schema is different from expected? Is it going to just
deal with it or is it going to drop the table and recreate
it? So many companies actually faced this issue, where the ORM decided to drop a table just because it couldn't reconcile the schema when going to production. This is something we really should be aware of. How do we test all of that? How do we deal with migrations during testing? How do we fix errors when we spot those issues? How do we keep all of those things maintained and in place?
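As a rough illustration of the kind of decisions an ORM quietly makes for us, here is the same unit of work with the transaction boundaries and isolation level spelled out explicitly in plain SQL (the accounts table is made up):

```sql
-- Explicit transaction control that an ORM usually hides behind its session API.
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- The ORM decides when (and whether) this commit happens, and what to do
-- on a conflict: retry, roll back, or propagate the error to the caller.
COMMIT;
```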
ORMs are great, but they really increase the complexity by hiding the complexity, and we need to be aware of that. And now we come to the question:
are our tests working? We have the pyramid of tests: unit tests, integration tests, end-to-end tests. And obviously, as I stated many times,
none of them is going to work because those tests,
they focus on the correctness of the data, not on
what data or how we extract the data, not on the performance,
not on whether the indexes were used or whether the query plans
are performant. No, they just focus on the correctness of the
data. So we won't see any of the issues here.
But you may say that hey, we have load tests, so we could
go with load tests and we could really verify the
performance. That is true. There is a problem with load
tests though, or many problems. Let's see a couple of them. First, they are
slow and expensive. They are slow because
to do a successful load test we need to basically run it for
hours, right? They are expensive because we need to
build a separate fleet for those load tests. Not to
mention that we really want to hammer our
machines to verify if they can keep up with high load that we
expect to see in production. So we really need to pay for those
machines, for the traffic, for the cpu and resources for hours.
That is the first thing. Second, it's not easy to even run
proper load tests to get the data, anonymize it,
use it outside of the production environment. It is tricky, especially in the face of GDPR and other confidentiality policies. So generally, doing load tests is also a pretty demanding, expensive, and hard task. But last
not least, load tests happen at the very end of our pipelines, just before we go to production. If they find an issue, this is great, but it's way too late to fix that issue. We really need to go back to the whiteboard, fix the design, reimplement the changes, go through code reviews again, push to pre-production environments, run automated test suites and then repeat load tests. This is going to take time. We just can't deal with
that. We can't let that happen so late in
the pipeline. We need to change that. Yet another thing that we need to deal with in our observability is non-optimal configuration. We need to answer a couple of
questions. First, do we have the right solutions in our databases? Do we use proper indexes, like trigram indexes, JSONB indexes, or indexes that are tuned to our workload? Do we use proper databases? Should we use a separate vector database or should we use pgvector? Should we have a document database or should we just use JSON in PostgreSQL? Should we deal with object oriented
programming? Or maybe should we move the code to the database as stored
procedures? Those are the things that we need to understand,
and not only from the perspective of the correctness of the code,
but also from the performance perspective.
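A couple of PostgreSQL examples of what "the right solution" can mean in practice (the extension and index types are real; the tables are made up):

```sql
-- Trigram index to make LIKE '%...%' searches usable without a full scan.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_name_trgm_idx ON users USING gin (name gin_trgm_ops);

-- GIN index on a JSONB column, instead of reaching for a separate document database.
CREATE INDEX orders_payload_idx ON orders USING gin (payload jsonb_path_ops);
```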
Second thing is, okay, we know we have the right solutions,
but do we have the solutions done right?
How do we deal with maintenance windows, with vacuuming, with the fragmentation,
with inefficient indexes, with configuration that is non-optimal,
with like misconfigured extensions or connection
pools or whatever else or buffers? Generally those are the things
that can severely affect the performance. It can happen that even though we use good tools, we don't get good results. Finally,
load handling, how do we know that we handle
the load properly? How do we know whether we are using efficient hardware,
good hardware configuration, whether we have all those places configured
well? What about differences between regions?
What about differences between continents or locales or countries?
What to do if we need to deal with multitenancy, meaning that
we have two tenants, one of them consumes like very little of resources,
the other one is basically hogging everyone else,
right? How do we deal with that? What to do if we can't use proper editions, or we use non-optimal solutions just because we have an old version of the database, or a lower edition in the licensing model? Can we even change the database to match the way we work? Those are the things
that even though we do consider them,
like theoretically, we very rarely deal
with them in practice and we very rarely get proper
tooling that supports us in answering those
questions. And here comes the solution. All I
told you, all those inefficiencies, dashboards that do not show
the story, configuration that is unclear,
all of that can really be fixed. We just need to have proper database guardrails.
So let's see how to do that. The very first thing
to build the modern age observability in
DevOps culture is to understand the context.
We need to understand the context. We need to see
what's going on in our systems in order to be able to
efficiently monitor those solutions. The very
first thing to understand is there are many levels of observability
or monitoring or other stuff, and we need to understand the
differences. We start with telemetry. Telemetry is just the ability
to collect signals, logs, traces,
metrics, whatever else. If we have telemetry
in all the places, then we have visibility because
we really can go to any part of the system, use OpenTelemetry or the telemetry in general that is there,
and see what's inside the system.
Then if we start capturing those signals automatically and constantly, monitoring them and introducing alerting and other stuff, we turn this into application performance management, APM. Then we can finally have dashboards that can show us, hey, there is a fire, or everything is green. But those dashboards
the way we have them now, they are completely inefficient
because they just swamp us with raw data instead of
showing us the proper understanding: the story that, hey, you changed this code and deployed it last Wednesday, this in turn made the application stop using the index, and now we have peak traffic on Saturday evening, which led to a different traffic distribution, and that's why your database is slow.
And this is what we call observability. Observability is
the ability to tell what and why,
not only what. Monitoring is just alerting
about errors, swamping with raw data. We need to tune thresholds,
set metrics, set alarms manually.
Observability connects all those dots and shows them in a way that lets us finally tell the story. We can finally do all of that. Now, if we have proper observability
in place, then we do not need to
have multiple teams dealing with our
databases because if we have proper tooling
then we can just use these tools. We can fix issues in a self-serve way because we have tools that tell us what's going on
and those tools can minimize communication, can finally
free and unleash the power of our teams and
can finally make it possible for us to
reshape our organization so we finally have true
DevOps culture. So let's see what
to observe and to understand what we need to observe,
we need to understand how databases work. Whenever we send a query to the database, such a query goes through multiple stages, one of which is the planner. The planner is basically the stage where the database engine plans how to execute the query; it comes up with many plans and then picks the cheapest one, to execute the query in the fastest possible way. So if we take a
query like this, a select star from many tables, we can always prepend this query with the keyword EXPLAIN. EXPLAIN basically tells the database, hey, tell me how you are going to execute this query, and it
gives us the query plan. If we now take a look at what this query plan shows, we see that it represents many operations. Basically, each row in the plan is an operation, an operation like scanning the table using an index, sorting rows, joining two tables, filtering, et cetera, et cetera. Each operation also has a cost associated with it. The cost is basically an arbitrary number showing us how hard it is to execute the operation. It's not a number of cycles, memory used or anything like that. It's just an arbitrary number telling us how hard this is. The important part is that the database can take many plans and pick the cheapest one, and in this way decide on the plan that will lead to the fastest query execution. This is how databases work,
so we can ask the database hey, how are you
going to execute this query? And then based on the execution plan we
get, we can tell whether it's going to use an index or not.
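In PostgreSQL, for example, it looks like this (the table is illustrative, and the plan shown in the comment is just one possible output):

```sql
EXPLAIN SELECT * FROM bookings WHERE book_ref = '00000F';

-- The planner might answer with something like:
--   Index Scan using bookings_pkey on bookings  (cost=0.43..8.45 rows=1 width=21)
--     Index Cond: (book_ref = '00000F'::bpchar)
-- Each line is an operation; the cost is the planner's arbitrary unit of effort,
-- not milliseconds or bytes. A "Seq Scan on bookings" here would tell us
-- the index is not being used.
```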
How do we observe this data? Well, we can use OpenTelemetry. OpenTelemetry is basically an open standard that gives us SDKs that we can integrate with our applications to collect signals like logs, traces, metrics, et cetera, et cetera. The idea here is that it's standardized, meaning that most likely your applications already use OpenTelemetry, because OpenTelemetry works with so many languages: JavaScript, C++, Java or the JVM in general, .NET in general, Python, and others.
Right? What's more, we can enable OpenTelemetry even without changing our applications. We can enable OpenTelemetry by dropping the libraries next to our application in the deployment and then setting a couple of environment variables saying, hey, just enable instrumentation for your ORMs, for your web servers, and whatnot. You can take OpenTelemetry and enable telemetry without modifying a single line of code in your application.
That is a really great feature. So we can use OpenTelemetry and we can capture things that happen in the database. So how do we do that?
The idea is that we have a process, the software development lifecycle, that works in many, many steps. So we have our developer writing some code. They use their local environment, their local laptop. They modify the application code and basically change the way the application works. They next run the application locally, and the application talks to a local database, which has like a couple hundred rows and runs locally, whether natively or in Docker, it doesn't matter. The application gets the data
and answers to APIs. But then we enable the application to use OpenTelemetry to send the trace of what the application did. We send it to a telemetry collector. The telemetry collector can then go to the database and ask it, hey, I heard that the application was running select star from table; I would like you to tell me how you executed this query, how you would execute such a select statement. So we get the execution plan and we can deliver this execution plan to some centralized platform. What we can do next
is the developer can consult this platform, or be notified by this platform, or get insights from this platform telling them, hey, you have an inefficient query in here, or your database is not going to work well, right? So you need to change that. We don't need
to run slow and expensive load tests to get
that. We can just get those things without
modifying a single line of our application inside the
developer environment. What do we do next? The next step is, finally, we decide on the application code and how it works, and we just commit our changes to the repository, or create a pull request or feature branch, and our continuous integration pipeline is going to run tests. We can again let the application go talk to the database and execute all the tests. We can again capture
the telemetry and send it to the platform again to
confirm that everything worked the same way and nothing
changed. And what's more, we can also analyze
schema migrations in here because we can integrate
with the CI/CD pipeline to understand how things
were executed and understand what the schema
migration was supposed to do and how it did
that. So we can analyze those schema migrations.
And again we can notify the developer like
during pull request that hey, you're doing something
risky, something wrong, think twice before doing
that. Finally when we realize all is good, we can
push this code to the main branch and start the deployment with our CD pipeline, continuous deployment. Our application is now deployed, it talks to the databases, and
we can have yet another tool that
talks to the production databases, extracts the metrics, the configuration,
all the things around extensions, around schemas,
around shape of data, around the execution plans,
and deliver that to the platform again to see
what happened. And now the best part: because we capture the data from the production database, what we can do is tell the developer, hey, your query was working fine in your local database, but in the production database you have this many rows, and this is just not going to fly well. So now
we can tell the developer immediately that the query
they run, this is how it executed locally, this is how it
would execute in production. And this finally
closes the loop and shifts left everything
that we need, we now shift left the ownership, we shift
left the maintenance, we avoid issues and we
let developers own their databases. So now the thing is, platform engineers need to take tools like these, introduce them in their organizations, and basically let developers
start using them. Software managers can take tools
like these and make sure that their roadmaps are not derailed
because developers will be able to use those
tools and avoid issues. And this is how modern,
efficient observability works with whatever
database you have. Notice there is nothing specific to SQL databases.
We just want to ask the database, hey, how did you execute such a query? But this will work with whatever thing you have,
with whatever system you have, as long as the system can self report how
it performs and what's going on. So database
guardrails, they prevent bad code from reaching production
because they let us understand what's happening inside the application and
how things are going to work in production without leaving our local developer
environments. They monitor the system end to end because
they get all the insights from local environments, from continuous integration
pipeline, from continuous deployment, from the
actual production databases. And finally they can troubleshoot
things automatically because they have all the dots and
can connect all of them together and answer this is what's
going on, this is how things behave and this is what changed.
So this is how we can build successful database guardrails and, in turn, effective observability.
In order to do that, we really need to be proactive and we really
need to push to the left. We need to let developers own their databases, but we can't just swamp them with the raw data and metrics that we have now. We need to build and introduce tools that can do that, that can turn monitoring into observability, that can turn seeing into understanding. And this is exactly what Metis does.
Let me really quickly walk you through the system that we have at Metis and what it is capable of. So when you integrate with Metis, when you register with Metis,
you create a project that captures all the interactions with
your database. So for instance, we can see that, hey,
your application exposes this API. If you run
this API, this is the SQL statement that has been executed.
This is the visualization of the query and what happened.
Couple of statistics. But most importantly we tell you, hey,
you are using table scans here and there and this is basically going to
be slow. And we tell you what's the impact and how to remediate
that. We show you how many rows were read, how many rows were returned,
how the query was executed, and, best of all, we just tell you how to fix that. We integrate with CI/CD. So apart from analyzing performance,
we can, for instance, look at a schema migration and show you that, hey, you would like to drop an index, but notice this index was used in production over 6000 times in the last seven days. So this index is probably something you shouldn't drop, because it is being used. Figure out how the index is used instead of just dropping it.
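If you want to check this kind of thing by hand in PostgreSQL, the statistics views expose the raw numbers; this generic query is only a sketch, not necessarily how Metis implements it:

```sql
-- How many times each index has been scanned since statistics were last reset.
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
```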
But we can do even more. We can monitor the database and, for instance, show you the obvious infrastructure metrics, like CPU, memory, throughput, et cetera, et cetera. But we can also analyze database-oriented metrics. We can show the number of transactions, rows, cache hits, et cetera, et cetera. We can analyze the schema, we can analyze indexes. For instance, we can show you, hey, those are the indexes and they haven't been used recently, so maybe consider dropping those indexes, right? But we can also show you the top queries, the queries that have been executed recently, and the anomalies that we can detect in those queries.
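The same kind of raw data is available directly from PostgreSQL when the pg_stat_statements extension is enabled (the column names below are the PostgreSQL 13+ ones; older versions call them total_time and mean_time):

```sql
-- Top queries by total execution time.
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```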
And we can show you other stuff. We can show you like insights around the
queries. We can show you execution plan and visualization.
We can show you all of that, capture it, and build this true observability and understanding. So this is what we need.
We need a solution that integrates with the source code,
with CI/CD, with our activity
that we run locally or in pipelines or in production.
We need to have proper observability that analyzes
the metrics and figures out anomalies, and can do that
like automatically or on demand or whatever else.
And this is modern observability in
databases. This is how we would like to do that. This is how we can
do that using the solutions we saw during this presentation.
In summary, databases may break for so many
reasons. And our unit tests,
integration tests, or generally automated tests do not capture those issues
because they focus on correctness, not on how
the data was extracted. And even load tests,
they happen way too late. They are
very expensive, and if they find issues, then we need to start from scratch, so we waste a lot of time. In order
to build a proper DevOps culture, we need
to minimize communication. We need to build teams
that can deal with all the issues without the need
to go to other teams to ask for permissions or access
to something. But in order to build these teams,
we need to have tools that will let them see everything,
reason about the system, and finally self serve
all the issues. And I showed you how to build those things,
and Metis obviously covers all of that. So this is how we build effective observability for databases, no matter whether it's Postgres, MySQL, or anything else. With all of
that said, I'd like to thank you for attending this talk. I hope you
enjoyed it. Hope you learned something useful.
I encourage you to take a look at metisdata.io, join our Discord channel, ask questions, and let's stay in touch. Thank you, and enjoy the rest of your conference. Thank you.