Transcript
Hello, everyone. My name is Adam Furmanek. Thank you for coming to this talk, in which we are going to talk a little bit about effective database observability. No matter what SQL database you use, whether it's PostgreSQL, MySQL, Microsoft SQL Server, or maybe even IBM Db2, it doesn't matter. We are going to talk about how to effectively observe what's going on in our databases, how to debug our problems, and how to get an understanding of all the issues that happen in our database management system.
My name is Adam Furmanek, and I will be your guide for the
next 40 minutes. So let's jump straight to
it. So the very first thing I'd like to start with is the
world has changed significantly in the last years.
So previously, 20 years ago, all we had was
very simple two tier architecture. We had like a database.
We had maybe a couple of nodes talking to the same database.
Typically it was just one node, but sometimes we had like two or three of
them. And obviously we had a user that was using our application.
Nothing special here. If we wanted to have
another application, if we wanted to do something else,
typically we started building another monolith,
right? We had monolithical applications that were all talking to
one data store, and they were completely independent,
not talking much to each other. So if there was a need
to build something else, it was completely independent
from the previous monolith. The problem was that,
hey, those things were not necessarily scalable.
We didn't know, or we didn't have any means to deploy
them efficiently, to deploy them quickly, and to iterate
many times a day with our changes. Whenever we wanted to release a change, we basically had to go through a big, scripted deployment, deploying the whole application and then making sure everything works. And obviously it was time consuming, it was very slow. And the worst part was everything was tightly coupled with each other. So then
we started introducing DevOps. We realized that,
hey, the way in which we deploy applications, or the people that deploy applications, are completely independent from the people that develop those applications. This is not necessarily
the efficient way. So we wanted to bring them together.
So DevOps emerged, and what we wanted to do is we wanted to put
both teams side by side so they could closely cooperate
and deploy things faster and participate in both
development and deployment. And then we realized
we can bridge this gap even more. We can actually
come up with DevOps engineers that are both capable
of writing the application code and deploying
it. And at the same time, the world complexity
increased significantly. So what we have now
is our applications, our microservices, they are
much more complex, because they talk to many data
sources, many incompatible databases or data stores.
They use many different computation platforms. We have serverless, we have full-blown bare metal, we have edge computing, we have many other things. And most
importantly, everything talks to each other.
So that's the biggest issue. That's the biggest challenge we face.
Now, in order to deploy the changes, we don't need to deploy
everything. We can just deploy a small piece of
our modifications, like one microservice, one serverless
lambda, or function, and off we go. The problem though,
is that now everything talks to each other. The increased complexity,
we didn't get rid of it. We only moved it into a different part of the ecosystem.
And so we pushed the complexity from the
deployment to maintenance and operations.
Now everything talks to each other. And whenever there is
an issue, no matter whether it's like an issue with
database, with queuing system, with service bus,
we kind of don't know how to fix it,
how to deal with that, because we need to deal with logs
that are scattered throughout the system. We need to deal with pieces
of information scattered here and there, and not even being
able to tell what was the order of operations,
because everything is now distributed,
globalized, so we can't easily tell how the things
executed. And when there was an issue, then, was it just a
manifestation of some bug in this part of the system?
Or was it a coincidence? Because everything is now distributed
and moving forward, there is this thing that we know
as Conway's law. Conway's law basically tells us
that the architecture of our applications will
mimic the organization that we live in. Meaning that
if we want to deal with microservices
that all have their independent data stores,
that all have their independent, like code
base, are deployed completely independently,
then it means that we basically need to build
teams that have the same architecture,
same structure. Conway's law is not something that we impose as an axiom; it is not something that tells us we need to do it this way because otherwise we are doing something wrong. No, Conway's law is an observation,
telling us that if we stop fighting really,
really hard, then our applications will
resemble the structure of our organization.
This basically means that if we want to impose some
architecture, we should start with imposing the
right teams structure in our organization.
Meaning that if we take pictures from Team Topologies, a very nice book.
If we want to have microservices that are completely separate,
completely independent, and they deal with different independent databases
and code bases, then what we need to do is we need to take this
picture and basically turn it 90 degrees
so we get exactly the same team structure. Notice that those images are exactly the same; they are just rotated by 90 degrees, but they show exactly the same thing. If we want to have efficient, independent and decoupled applications, what we need to do is have decoupled and independent
teams. And taking into account what happened with the increased
complexity of our world, this means that
we can't rely on the solutions for observability, monitoring and maintenance that we used when we were dealing with monolithic applications, because they just don't work anymore. The same way we changed how we develop and deploy our applications, we need to change the way we maintain them and the way we structure our ops teams. This effectively means that because of the increased complexity, we can't rely on DBAs alone anymore. We need to change the way we structure our organization. And we need to do it with the help of platform engineers,
DevOps engineers and developers all working together to
reshape the world. Let's see how to reshape the world.
The important part that we all know
about is that communication is the
main factor in reducing efficiency and performance.
The things that we can automate, things that can be done
by tools, they are fast, because computers are fast
in general. But the things that we need to do manually,
or the things that we need to do by cooperating with other
teams, those things are slow.
This means that the highest bandwidth is inside our teams, where we can communicate easily and quickly. If we need to go across teams, then the communication is going to be way slower, which will lead to far slower troubleshooting and issue resolution. This increases the mean time to recovery, the time to fix an issue, basically the time to do anything, just because we spend time on communication and cannot speed it up. This is also another
outcome of Conway's law. So what we need to do is
we need to understand that in order to have efficient observability,
monitoring and efficient DevOps infrastructure or DevOps
culture, what we need to do is we need to minimize communication.
That's the very important part. And in order to
minimize communication, we need to basically minimize
the reasons to communicate. And those reasons are typically
that we understand or realize there is a bug,
but we figure it out very late in our pipeline.
So what we want to do is we want to shift left
all the checks. In order to have efficient communication,
we need to not communicate. The less we communicate,
the better. So we need to avoid communication at all.
And how can we avoid communication? We can
do it by shifting all the things to the left and
building teams that don't need to communicate
with other teams. We need to build small and highly efficient
teams with well defined scope so they can deal with
everything they need without the need
to go to some other teams and ask for help.
So this is why we need to have the shift left. And this is the important part when talking about databases. Now, the problem with databases is that we can't easily shift left our testing and our maintenance, because many times developers are very reluctant to test their solutions. They don't have the tools, they don't have the means to do so, and most importantly, they lack the working knowledge of how to do that.
So our ultimate goal is always self service.
We always want our teams to be small,
to be independent from other teams and
teams that can do everything they need on their
own, meaning that if they need to get some data, some logs, or access to some systems, they need to do it in a self-serve way.
They shouldn't ask for permissions, they shouldn't communicate
with others, they should do this in a self service
way. So this is the goal, this is how we need to
turn our organizations to have efficient DevOps
culture. And this affects everything we work on
like web services, desktop applications, and most importantly
for the purpose of this talk is databases.
We need to reshape our organizations,
to have new team structures that really
embrace the complexity that increased and make it straightforward
to maintain, deploy and troubleshoot databases.
Let's see how we can do all of that and let's see
what we can build now to shift left the ownership.
Make developers own their databases, make developers
self serve their issues, and most importantly,
finally fix all the problems as early as
possible. And this is a must, no matter whether we are a small company or
big Fortune 500 enterprise, we need this
shift in the organization because basically
that's a must have for us to move faster, to have
highly efficient DevOps culture that maximizes DORA metrics, so we can move fast in a reliable,
robust way. So the very first thing when
we talk about observability in general,
we said that we would like to do it around
databases, but what we have now around databases is we
typically have dashboards. This applies to the whole observability
in general. We have dashboards that present lots
of data, but they lack any information that
they should convey. We have charts; we are swamped with raw data, with signals, with metrics, with everything around, but we do not get the information about what's going wrong. We have lots of
signals. This includes like deployment signals,
metrics, traffic maybe sometimes like very detailed
metrics from particular regions or dimensions. But we don't
understand the big picture and we don't understand the coherent
story of what is going on. So we can't do efficient observability if we don't have better tools. With current tools, just like the one we see on the screen, we just can't do efficient observability. So let's
see how to build this efficient, effective observability
for databases. But in order to do that, we need to understand what
can go wrong. So let's see a couple of things that break in databases
and then let's see how we can actually build the stuff
better. So the problems in databases, they typically fall
into one of three areas. So first area is code changes.
So we change the code, we basically execute stuff. We execute
different queries because we modified our application and this obviously leads
to performance degradations and problems, right? So this is code changes.
The second area is schema changes. Whenever we change the schema of our applications, we change databases, columns, data types, and all of this can affect the way our applications execute and can break our applications.
But last but not least, there is yet another area which
we typically miss when it comes to building observability.
Building resilient testing suites and building all the processes around CI/CD for that. And this is how queries are
executed. And I'm not talking about what query change,
meaning that we added new columns, started joining another table.
No, what I mean here is that we have a query that, over days, changes the way it is being executed by the database engine; the execution plan changes because statistics changed, because indexes changed, because the surrounding configuration changed. So all those things may affect how the query is executed, and we won't notice that in our CI/CD, and our current DevOps culture cannot easily see that.
So let's see a couple of examples. Obviously we
have queries, for instance, like the query here on the left. So we get a user and we want to extract some details for this user. So we basically join many tables. This query, which comes from one of my production systems, effectively joins multiple tables and in turn extracts nearly 300,000 rows, even though we extract data just for one user. Effectively, this query was running for around 25 seconds. Why? The problem here is that we join many tables, and because we join many tables, we effectively explode the number of rows we extract from the database, most of the rows being duplicates. Because we join data from multiple tables, we effectively have an exponential increase in the number of rows, which are then later translated by our ORM library, and mostly discarded, to build just one entity representing this particular user. And now, can we avoid having an issue like this? Can we somehow realize that, hey, we have a query that is going to be super slow in production? The problem with our current approach for DevOps,
for CI CD, for automated testing is that we focus
on correctness of the data. We always
focus on whether the data that we extracted from
the database is correct or not. We do assertions
on the content that we extracted from the database,
but we do not assert the way the data was
extracted. Sometimes even worse. Sometimes we may say that hey,
this is an implementation detail that we shouldn't
care about. We don't care how things are executed by the database
engine as long as the data is correct and this is wrong.
Because yes, we shouldn't care how it's done,
but we need to care whether it's done fast enough.
And for that we need to have ways to
assess that those queries are executed fast
enough and that nothing we changed in our application code is
going to degrade the performance and basically kill it.
So in order to do that, we need to come up with new ways
of testing applications. And I'll speak later how to
achieve that. But apart from just testing the content, we need
to also test the performance of our applications. In this
case, the example, the solution to this problem was rather straightforward.
Instead of just joining all the tables, we basically run multiple
queries that execute multiple SQL statements against
the database. And you may think, okay, this is going
to be way slower because we run eight queries instead of one.
But the reality is it's completely the opposite, because we execute queries that each get a very small amount of data from these tables, and then we join this data in our application code. This actually executes in a fraction of a second, in a couple of milliseconds, so way faster than when we were running just one single query.
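To make the difference concrete, here is a rough sketch of the two shapes of the query. The table and column names are made up for illustration; the real production query joined far more tables.

```sql
-- One big join: the row count multiplies across the joined tables
-- (roles x permissions x addresses per user), and the ORM later
-- discards most of the duplicated values to build a single entity.
SELECT *
FROM users u
JOIN user_roles r       ON r.user_id = u.id
JOIN user_permissions p ON p.user_id = u.id
JOIN user_addresses a   ON a.user_id = u.id
WHERE u.id = 123;

-- Several small queries: each one returns only a handful of rows,
-- and the application code stitches the entity together.
SELECT * FROM users            WHERE id = 123;
SELECT * FROM user_roles       WHERE user_id = 123;
SELECT * FROM user_permissions WHERE user_id = 123;
SELECT * FROM user_addresses   WHERE user_id = 123;
```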
Another example, where we may not necessarily see the performance drop, is when we try rewriting our queries so they are more readable. There is a thing called a CTE, a common table expression. Let's say that we have a boarding passes table which contains something like 8 million rows, and what we would like to do is get those boarding passes and calculate the checksum of the ticket number, doing it twice using the MD5 algorithm, just for the sake of showing the performance issues. So we create a CTE, which is kind of like a temporary named table, here called cte_performance, that we then join three times on the same ticket number, flight id and boarding number, and we get the data for a particular checksum. So this is one way we can write the query, but we can also write the same query without using this CTE. We can basically join boarding passes three times and then calculate the MD5 checksum
in the WHERE condition. Those two queries are equivalent. The difference in performance, though, is tremendous. In this case, when we don't use the CTE, the query runs in 8 seconds. When we do use the CTE, that's 13 seconds, so over one and a half times as long as the previous query.
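A sketch of the two variants follows. The table and columns follow the well-known PostgreSQL demo bookings schema mentioned in the talk; the literal ticket number is just an example.

```sql
-- Variant 1: a CTE computes the checksum, and the CTE is joined three times.
WITH cte_performance AS (
    SELECT ticket_no, flight_id, boarding_no,
           md5(md5(ticket_no)) AS checksum
    FROM boarding_passes
)
SELECT b1.ticket_no
FROM cte_performance b1
JOIN cte_performance b2 USING (ticket_no, flight_id, boarding_no)
JOIN cte_performance b3 USING (ticket_no, flight_id, boarding_no)
WHERE b1.checksum = md5(md5('0005432000284'));

-- Variant 2: the same result without the CTE; boarding_passes is joined
-- three times and the checksum is computed directly in the WHERE clause.
SELECT b1.ticket_no
FROM boarding_passes b1
JOIN boarding_passes b2 USING (ticket_no, flight_id, boarding_no)
JOIN boarding_passes b3 USING (ticket_no, flight_id, boarding_no)
WHERE md5(md5(b1.ticket_no)) = md5(md5('0005432000284'));
```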
So if we now step into our system and take the query that we see at the bottom and decide to rewrite it to the query at the top because it's more readable, then yes, all our unit tests are still going to be green, because the data is exactly the same. Those queries extract exactly the same data. The problem is, again, with the performance.
We extract data differently and so we break
the performance of our system. The biggest issue is we won't
learn about that until we go to production.
Why is that? That is because in our local environment
we typically have very few rows,
so we won't see any performance issues. But when
we go to production and we move from like a couple
hundred rows to millions of rows, then we'll
immediately see things are not working well.
But it's going to be very late in our pipeline when we realize those things are slow. Moving on, another set of issues that may pop up when we are dealing with databases is around schema migrations.
So all those issues, whenever we add a column,
we drop a column, we change the column data type.
Whenever we do those things, we risk that we will need to
rewrite the table, meaning that we will need to take the data, copy it on the side, drop the table, recreate it with the proper schema, take the data back and put it in the table.
And when we are doing that, when the database engine is doing that,
we effectively need to take the database offline.
This table needs to be taken offline and this will lead
to the outage of our system. The problem
with this is we will take our databases
offline when going to production.
Even worse, it's not easy to
stop such a migration when we realize something is wrong,
because we can't just easily roll things back in
case of schema migrations. And the worst part, if we
wanted to avoid issues like this one, is that our
unit tests don't even check schema migrations.
Why is that? That is because all they do is
they verify the content of the database when the database is
already back online. They run against
the living database, so they don't verify stuff
when the schema migration is going through, meaning that
they will always just work either with the old schema
or with the new one, so they won't capture any performance
issues. Not to mention that if we try finding those performance issues in our local environments, then we probably have just a couple of rows, which are going to be migrated very, very fast. Until we go to
bigger databases, we won't see any of those issues.
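As a rough PostgreSQL-flavored sketch (hypothetical users table; the exact behavior depends on the engine and version), the form of the migration decides whether the engine has to rewrite the whole table:

```sql
-- Changing a column's data type generally forces a full table rewrite:
-- every row is copied, and the table is effectively offline for writes
-- for the duration of the migration.
ALTER TABLE users ALTER COLUMN account_id TYPE bigint;

-- Adding a nullable column (and, in recent PostgreSQL versions, even one
-- with a constant default) is only a catalog change and finishes almost
-- instantly regardless of how many rows the table holds.
ALTER TABLE users ADD COLUMN last_login timestamptz;
```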
Yet another problem we may have around the databases is
indexes. Obviously when we finally spot
a problem that something is slow, what we can do is
we can add indexes. We obviously need to answer questions like what
index to add, when to apply it, which columns
to include, et cetera, et cetera. And then we can run our select statements
and verify that yes, they now work much faster.
This is great. The problem with indexes though is that
indexes affect not only the
query that read the data, but also the
queries that modify the data. Meaning that if
we add an index, then yes, we improve the performance
of reading the data, but at the same time we decrease the
performance of modifying the data. And if we tell one of our junior developers, hey, if you have a slow query, add an index, then we end up with this. We have many, many indexes in
our database that are now competing and slowing us
down. Because yes, it may be that they speed things up
when it comes to select statements, but when it comes to
inserts, updates or deletes, they make it way
slower. Not to mention there is yet another problem with indexes.
When we add an index, we verify that our query
uses the index because we see the performance improvement. That is
cool. But what happens six months later when we change
the query and we do not notice
that the performance dropped? The query may stop using an
index. Maybe the index is not used anymore, not usable
anymore. Maybe we have too many rows and the database engine decides not to use such an index. Maybe we have many indexes, and generally using this index is inefficient. How do we verify that during our CI/CD pipeline? How do we notice that?
With our unit tests. Again, impossible,
because our unit tests, they focus on the content that we extracted from the
database, not on the way that we extracted the data.
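A small sketch, with hypothetical table and column names: the same index that speeds up the read is a cost on every write, and whether a query still uses it is something only the execution plan reveals.

```sql
-- Speeds up lookups by customer...
CREATE INDEX orders_customer_id_idx ON orders (customer_id);

-- ...but every INSERT, UPDATE, or DELETE on orders now also maintains this index.

-- The only reliable way to confirm the query still benefits from it:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Look for "Index Scan using orders_customer_id_idx" in the plan;
-- a "Seq Scan on orders" means the planner is no longer using the index.
```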
Moving on, ORMs increase the complexity even more, because, for instance, we may have the N+1 problem. With ORMs, N+1 means that, let's say, we have an aircrafts table that is connected one-to-many with a seats table, and we would like to get all the aircraft, iterate over those aircraft, and basically calculate the seats count.
Okay. The problem with this approach is that if our ORM works in so-called lazy mode, then we will have N+1 queries. We will have one query to load all the aircraft from the database, and then we will have N additional queries, one for each aircraft, to get the number of seats. Obviously, this is going to lead to N+1 queries, which is inefficient. We could change that and deal with this whole data using just one query, a SELECT with a JOIN. The problem is we won't notice
the problem when we are dealing with unit tests or integration tests, because once
again they just focus on the correctness of the data,
not the performance. But you might say, okay, let's switch
all the queries to always be eager, to always join the data, to run fast. Then we end up with the query that we already saw in our slides, which is not fast, because we join many tables and we kill the performance.
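Sketched as the SQL that the ORM ends up sending (tables follow the aircrafts and seats example; the exact statements depend on the ORM), the two modes look like this:

```sql
-- Lazy loading: one query for the aircraft, then N more, one per aircraft.
SELECT * FROM aircrafts;
SELECT count(*) FROM seats WHERE aircraft_code = 'A320';
SELECT count(*) FROM seats WHERE aircraft_code = 'B763';
-- ...and so on, one query per aircraft.

-- Eager loading: a single round trip that does the same work in the database.
SELECT a.aircraft_code, count(s.seat_no) AS seats_count
FROM aircrafts a
LEFT JOIN seats s ON s.aircraft_code = a.aircraft_code
GROUP BY a.aircraft_code;
```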
So it's not a silver bullet. We can't just fix this issue once
and for all and make it work all the time. This just doesn't
work like that. Another problem we have around ORMs is that they hide tons of complexity. They hide what transaction isolation levels we use, what transaction
scope we have when we start a transaction, when we commit it,
when we roll it back, who is in charge of committing
or rolling back the transaction? Can we have nested transactions?
How do we cache data? Can we cache data across nodes?
Do we have connection pooling? Do we use query hints? Do we use many,
many other things? Not to mention migrations.
Is our orm in charge of migrations? What if we have like heterogeneous
applications, many applications, talking to the same data source, who is
in charge of migrations? What happens if the ORM realizes
that the schema is different from expected? Is it going to just
deal with it or is it going to drop the table and recreate
it? So many companies actually faced this issue, where the ORM decided to drop a table just because it couldn't reconcile the schema when going to production. This is something we really should be aware of. How do we test all of that? How do we deal with migrations during testing? How do we fix errors when we spot those issues? How do we keep all of those things maintained and in place?
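As a rough illustration of the kind of decisions an ORM quietly makes for us, here is the same unit of work with the transaction boundaries and isolation level spelled out explicitly in plain SQL (the accounts table is made up):

```sql
-- Explicit transaction control that an ORM usually hides behind its session API.
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- The ORM decides when (and whether) this commit happens, and what to do
-- on a conflict: retry, roll back, or propagate the error to the caller.
COMMIT;
```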
ORMs are great, but they really increase the complexity by hiding the complexity, and we need to be aware of that. And now we come to the question:
are our tests working? We have the pyramid of tests: unit tests, integration tests, end-to-end tests. And obviously, as I stated many times,
none of them is going to work because those tests,
they focus on the correctness of the data, not on
what data or how we extract the data, not on the performance,
not on whether the indexes were used or whether the query plans
are performant. No, they just focus on the correctness of the
data. So we won't see any of the issues here.
But you may say that hey, we have load tests, so we could
go with load tests and we could really verify the
performance. That is true. There is a problem with load
tests though, or many problems. Let's see a couple of them. First, they are
slow and expensive. They are slow because
to do a successful load test we need to basically run it for
hours, right? They are expensive because we need to
build a separate fleet for those load tests. Not to
mention that we really want to hammer our
machines to verify if they can keep up with high load that we
expect to see in production. So we really need to pay for those
machines, for the traffic, for the cpu and resources for hours.
That is the first thing. Second, it's not easy to even run
proper load tests to get the data, anonymize it,
use it outside of the production environment. It is tricky, especially in the face of GDPR and other confidentiality policies. So generally, doing load tests is also a pretty demanding, expensive, and hard task. But last
not least, load tests happen at the very end of our pipelines, just before we go to production. If they find an issue, this is great, but it's way too late to fix that issue. We really need to go back to the whiteboard, fix the design, reimplement the changes, go through code reviews again, push to pre-production environments, run automated test suites and then repeat load tests. This is going to take time. We just can't deal with
that. We can't let that happen so late in
the pipeline. We need to change that. Yet another thing that we need to deal with in our observability is non-optimal configuration. We need to answer a couple of
questions. First, do we have the right solutions in our databases? Do we use proper indexes, like trigram indexes, JSONB indexes, or indexes that are tuned to our workload? Do we use proper databases? Should we use a separate vector database or should we use pgvector? Should we have a document database or should we just use JSON in PostgreSQL? Should we deal with object oriented
programming? Or maybe should we move the code to the database as stored
procedures? Those are the things that we need to understand,
and not only from the perspective of the correctness of the code,
but also from the performance perspective.
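A couple of PostgreSQL examples of what "the right solution" can mean in practice (the extension and index types are real; the tables are made up):

```sql
-- Trigram index to make LIKE '%...%' searches usable without a full scan.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_name_trgm_idx ON users USING gin (name gin_trgm_ops);

-- GIN index on a JSONB column, instead of reaching for a separate document database.
CREATE INDEX orders_payload_idx ON orders USING gin (payload jsonb_path_ops);
```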
Second thing is, okay, we know we have the right solutions,
but do we have the solutions done right?
How do we deal with maintenance windows, with vacuuming, with the fragmentation,
with inefficient indexes, with configuration that is non-optimal,
with like misconfigured extensions or connection
pools or whatever else or buffers? Generally those are the things
that can severely affect the performance. It can happen that even though we use good tools, we don't get good results. Finally,
load handling, how do we know that we handle
the load properly? How do we know whether we are using efficient hardware,
good hardware configuration, whether we have all those places configured
well? What about differences between regions?
What about differences between continents or locales or countries?
What to do if we need to deal with multitenancy, meaning that
we have two tenants, one of them consumes like very little of resources,
the other one is basically hogging everyone else,
right? How do we deal with that? What to do if we can't use proper editions, or we use non-optimal solutions just because we have an old version of the database, or a lower edition in the licensing model? Can we even change the database to match the way we work? Those are the things
that even though we do consider them,
like theoretically, we very rarely deal
with them in practice and we very rarely get proper
tooling that supports us in answering those
questions. And here comes the solution. All I
told you, all those inefficiencies, dashboards that do not show
the story, configuration that is unclear,
all of that can really be fixed. We just need to have proper database guardrails.
So let's see how to do that. The very first thing
to build the modern age observability in
DevOps culture is to understand the context.
We need to understand the context. We need to see
what's going on in our systems in order to be able to
efficiently monitor those solutions. The very
first thing to understand is there are many levels of observability
or monitoring or other stuff, and we need to understand the
differences. We start with telemetry. Telemetry is just the ability
to collect signals, logs, traces,
metrics, whatever else. If we have telemetry
in all the places, then we have visibility because
we really can go to any part of the system, use OpenTelemetry or the telemetry in general that is there,
and see what's inside the system.
Then if we start capturing those signals automatically and constantly, monitoring them and introducing alerting and other stuff, we turn this into application performance management, APM. Then we can finally have dashboards that can show us, hey, there is a fire, or everything is green. But those dashboards
the way we have them now, they are completely inefficient
because they just swamp us with raw data instead of
showing us the proper understanding: the story that, hey, you changed this code and deployed it last Wednesday, this in turn made the application stop using the index, and now we have peak traffic on Saturday evening, which led to a different traffic distribution, and that's why your database is slow.
And this is what we call observability. Observability is
the ability to tell what and why,
not only what. Monitoring is just alerting
about errors, swamping with raw data. We need to tune thresholds,
set metrics, set alarms manually.
Observability connects all those dots and shows them in a way that lets us finally tell the story. We can finally do all of that. Now, if we have proper observability
in place, then we do not need to
have multiple teams dealing with our
databases because if we have proper tooling
then we can just use these tools. We can fix issues in a self-serve way because we have tools that tell us what's going on
and those tools can minimize communication, can finally
free and unleash the power of our teams and
can finally make it possible for us to
reshape our organization so we finally have true
DevOps culture. So let's see what
to observe and to understand what we need to observe,
we need to understand how databases work. Whenever we send a query to the database, such a query goes through multiple stages, one of which is the planner. The planner is basically the stage where the database engine plans how to execute the query; it comes up with many plans and then picks the cheapest one, to execute the query in the fastest possible way. So if we take a
query like this, a select star from many tables, we can always prepend this query with the keyword EXPLAIN. EXPLAIN basically tells the database, hey, tell me how you are going to execute this query, and it
gives us the query plan. If we now take a look at what this query plan shows, we see that it represents many operations. Basically, each row in the plan is an operation, an operation like scanning the table using an index, sorting rows, joining two tables, filtering, et cetera, et cetera. Each operation also has a cost associated with it. The cost is basically an arbitrary number showing us how hard it is to execute the operation. It's not a number of cycles, memory used or anything like that. It's just an arbitrary number telling us how hard this is. The important part is that the database can take many plans and pick the cheapest one, and in this way decide on the plan that will lead to the fastest query execution. This is how databases work,
so we can ask the database hey, how are you
going to execute this query? And then based on the execution plan we
get, we can tell whether it's going to use an index or not.
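In PostgreSQL, for example, it looks like this (the table is illustrative, and the plan shown in the comment is just one possible output):

```sql
EXPLAIN SELECT * FROM bookings WHERE book_ref = '00000F';

-- The planner might answer with something like:
--   Index Scan using bookings_pkey on bookings  (cost=0.43..8.45 rows=1 width=21)
--     Index Cond: (book_ref = '00000F'::bpchar)
-- Each line is an operation; the cost is the planner's arbitrary unit of effort,
-- not milliseconds or bytes. A "Seq Scan on bookings" here would tell us
-- the index is not being used.
```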
How do we observe this data? Well, we can use OpenTelemetry. OpenTelemetry is basically an open standard that gives us SDKs that we can integrate with our applications to collect signals like logs, traces, metrics, et cetera, et cetera. The idea here is that it's standardized, meaning that most likely your applications already use OpenTelemetry, because OpenTelemetry works with so many languages: JavaScript, C++, Java or the JVM in general, .NET in general, Python, and others.
Right? What's more, we can enable OpenTelemetry even without changing our applications. We can enable OpenTelemetry by dropping the libraries next to our application in the deployment and then setting a couple of environment variables saying, hey, just enable instrumentation for your ORMs, for your web servers, and whatnot. You can take OpenTelemetry and enable telemetry without modifying a single line of code in your application.
That is a really great feature. So we can use OpenTelemetry and we can capture things that happen in the database. So how do we do that?
The idea is that we have a process, the software development lifecycle, that works in many, many steps. So we have our developer writing some code. They use their local environment, their local laptop. They modify the application code and basically change the way the application works. They next run the application locally, and the application talks to a local database, which has like a couple hundred rows and runs locally, whether natively or in Docker, it doesn't matter. The application gets the data
and answers to APIs. But then we enable the application to use OpenTelemetry to send the trace of what the application did. We send it to a telemetry collector. The telemetry collector can then go to the database and ask it, hey, I heard that the application was running select star from table; I would like you to tell me how you executed this query, how you would execute such a select statement. So we get the execution plan and we can deliver this execution plan to some centralized platform. What we can do next
is the developer can consult this platform, or be notified by this platform, or get insights from this platform telling them, hey, you have an inefficient query in here, or your database is not going to work well, right? So you need to change that. We don't need
to run slow and expensive load tests to get
that. We can just get those things without
modifying a single line of our application inside the
developer environment. What do we do next? The next step is, finally, we decide on the application code and how it works, and we just commit our changes to the repository, or create a pull request or feature branch, and our continuous integration pipeline is going to run tests. We can again let the application go talk to the database and execute all the tests. We can again capture
the telemetry and send it to the platform again to
confirm that everything worked the same way and nothing
changed. And what's more, we can also analyze
schema migrations in here because we can integrate
with the CI/CD pipeline to understand how things
were executed and understand what the schema
migration was supposed to do and how it did
that. So we can analyze those schema migrations.
And again we can notify the developer like
during pull request that hey, you're doing something
risky, something wrong, think twice before doing
that. Finally when we realize all is good, we can
push this code to the main branch and start the deployment with our CD pipeline, continuous deployment. Our application is now deployed, it talks to the databases, and
we can have yet another tool that
talks to the production databases, extracts the metrics, the configuration,
all the things around extensions, around schemas,
around shape of data, around the execution plans,
and deliver that to the platform again to see
what happened. And now the best part: because we capture the data from the production database, what we can do is tell the developer, hey, your query was working fine in your local database, but in the production database you have this many rows, and this is just not going to fly well. So now
we can tell the developer immediately that the query
they run, this is how it executed locally, this is how it
would execute in production. And this finally
closes the loop and shifts left everything
that we need, we now shift left the ownership, we shift
left the maintenance, we avoid issues and we
let developers own their databases. So now the thing is, platform engineers need to take tools like these, introduce them in their organizations, and basically let developers
start using them. Software managers can take tools
like these and make sure that their roadmaps are not derailed
because developers will be able to use those
tools and avoid issues. And this is how modern,
efficient observability works with whatever
database you have. Notice there is nothing specific to SQL databases.
We just want to ask the database, hey, how did you execute such a query? But this will work with whatever thing you have,
with whatever system you have, as long as the system can self report how
it performs and what's going on. So database
guardrails, they prevent bad code from reaching production
because they let us understand what's happening inside the application and
how things are going to work in production without leaving our local developer
environments. They monitor the system end to end because
they get all the insights from local environments, from continuous integration
pipeline, from continuous deployment, from the
actual production databases. And finally they can troubleshoot
things automatically because they have all the dots and
can connect all of them together and answer this is what's
going on, this is how things behave and this is what changed.
So this is how we can build successful database guardrails and, in turn, effective observability.
In order to do that, we really need to be proactive and we really
need to push to the left. We need to let developers own their databases, but we can't just swamp them with the raw data and metrics that we have now. We need to build and introduce tools that can do that, that can turn monitoring into observability, that can turn seeing into understanding. And this is exactly what Metis does.
Let me really quickly walk you through the system that we have at Metis and what it is capable of. So when you integrate with Metis, when you register with Metis,
you create a project that captures all the interactions with
your database. So for instance, we can see that, hey,
your application exposes this API. If you run
this API, this is the SQL statement that has been executed.
This is the visualization of the query and what happened.
Couple of statistics. But most importantly we tell you, hey,
you are using table scans here and there and this is basically going to
be slow. And we tell you what's the impact and how to remediate
that. We show you how many rows were read, how many rows were returned,
how the query was executed, and, best of all, we just tell you how to fix that. We integrate with CI/CD. So apart from analyzing performance,
we can, for instance, look at a schema migration and show you that, hey, you would like to drop an index, but notice this index was used in production over 6000 times in the last seven days. So this index is probably something you shouldn't drop, because it is being used. Figure out how the index is used instead of just dropping it.
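If you want to check this kind of thing by hand in PostgreSQL, the statistics views expose the raw numbers; this generic query is only a sketch, not necessarily how Metis implements it:

```sql
-- How many times each index has been scanned since statistics were last reset.
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
```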
But we can do even more. We can monitor the database and, for instance, show you the obvious infrastructure metrics, like CPU, memory, throughput, et cetera, et cetera. But we can also analyze database-oriented metrics. We can show the number of transactions, rows, cache hits, et cetera, et cetera. We can analyze the schema, we can analyze indexes. For instance, we can show you, hey, those are the indexes and they haven't been used recently, so maybe consider dropping those indexes, right? But we can also show you the top queries, the queries that have been executed recently, and the anomalies that we can detect in those queries.
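The same kind of raw data is available directly from PostgreSQL when the pg_stat_statements extension is enabled (the column names below are the PostgreSQL 13+ ones; older versions call them total_time and mean_time):

```sql
-- Top queries by total execution time.
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```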
And we can show you other stuff. We can show you like insights around the
queries. We can show you execution plan and visualization.
We can show you all of that, capture it, and build this true observability and understanding. So this is what we need.
We need a solution that integrates with the source code,
with CI/CD, with our activity
that we run locally or in pipelines or in production.
We need to have proper observability that analyzes
the metrics and figures out anomalies, and can do that
like automatically or on demand or whatever else.
And this is modern observability in
databases. This is how we would like to do that. This is how we can
do that using the solutions we saw during this presentation.
In summary, databases may break for so many
reasons. And our unit tests,
integration tests, or generally automated tests do not capture those issues
because they focus on correctness, not on how
the data was extracted. And even load tests,
they happen way too late. They are
very expensive, and if they find issues, then we need to start from scratch, so we waste a lot of time. In order
to build a proper DevOps culture, we need
to minimize communication. We need to build teams
that can deal with all the issues without the need
to go to other teams to ask for permissions or access
to something. But in order to build these teams,
we need to have tools that will let them see everything,
reason about the system, and finally self serve
all the issues. And I showed you how to build those things,
and Metis obviously covers all of that. So this is how we build effective observability for databases, no matter whether it's Postgres, MySQL, or anything else. With all of
that said, I'd like to thank you for attending this talk. I hope you
enjoyed it. Hope you learned something useful.
I encourage you to take a look at metisdata.io, join our Discord channel, ask questions, and let's stay in touch. Thank you, and enjoy the rest of your conference. Thank you.