Conf42 Observability 2023 - Online

The missing chapter in your CI playbook: database guardrails


Abstract

CI/CD is now a part of software development. We review our code and make sure it is safe to deploy to production. However, what about SQL queries? Do we make sure they will work in a production environment? In this talk we’ll track metrics, analyze queries, and use Metis for automation.

Summary

  • In the next 40 minutes or so, we are going to talk a little bit about CI/CD pipelines and how to make sure we prevent bad code from reaching production. How do you know that the code you deploy to production is actually going to work well?
  • The third area we'd like to discuss is the completely different context and completely different workload we face when we run in production: a different country, a different environment. What kinds of issues can you actually capture when running unit tests or integration tests?
  • We will cover databases, ORM libraries, and finally the lack of context and understanding of all the moving pieces around them. For instance, we may get slow queries that work well in the testing environment but not in production. How can we fix that?
  • Two queries can be equivalent and still perform completely differently. Unit tests, and especially your CI/CD pipeline, may not capture these performance characteristics. Another thing we may hit is incompatible schema changes when deploying to production.
  • The model we store in our object-oriented applications is different from the model in our SQL databases. Even strings can be prone to this impedance mismatch. Another issue with ORMs is the lack of visibility: there are so many moving pieces that ORMs take care of behind the scenes.
  • How do you deal with a heterogeneous scenario where multiple applications talk to the same database? And how do you test whether your migration is going to run fast enough? With a typical CI/CD pipeline and no database guardrails, we have no idea.
  • In order to understand how SQL queries are executed, we need to understand the SQL engine. To do that we can use modern observability tooling, namely OpenTelemetry, which allows us to capture logs, traces, and metrics from our applications for better observability.
  • Metis integrates with your source code across various languages, ORMs, and databases. It analyzes your pull requests and merge requests, for instance in GitHub Actions, to build proper database guardrails into your CI/CD pipeline. Enjoy the rest of the conference.

Transcript

Hey everyone, thank you for coming to this talk. In the next 40 minutes or so, we are going to talk a little bit about CI/CD pipelines and how to make sure that we prevent bad code from reaching production. We are going to see how to improve the monitoring and observability of our databases, and why we actually need a thing we call database guardrails very early in our pipelines. So let's go. The very first thing we would like to discuss is that whenever there is an issue, we would like to catch it before we actually go to production, right? We would like to have some automated checks, some guardrails, some safety nets that will prevent the bad code from reaching production. The important part here is that we would like to have these checks as early in our pipeline as possible. We would like to push them to the left so they get executed during the early development stage, not very late during the load tests or even after we deploy to production. This is super important because in today's world we simply can't afford to identify issues only after we deploy to production. We need to catch all these issues before going to prod. And in order to do that, we need automation, right? We don't want this process to be manual and handcrafted. We want it fully automated, working like a charm before we even go to load tests or pre-production environments, as early as possible. That's the first thing. The second thing is that whenever something breaks, we need to be able to automatically troubleshoot the issue. We need to know the context. We need to know everything that surrounds those things in our cluster, in our environment. And we need to be able to monitor and observe what's going on there. We would like to have good tooling that is going to pinpoint the issues directly and tell us very precisely what happened and what the reason is. So this is what we are going to discuss during the upcoming minutes. You might ask a very good question: how do you know that the code you deploy to production is actually going to work well? They tell you to deploy all the time, deploy constantly, deploy on Friday night, everything is going to be good. But how do you make sure that you are not going to take your production down and that everything you deploy is going to work well? Before answering that question, we need to understand what may go wrong, and there are various areas that may break during our deployment and when dealing with a production environment. The very first area is that the deployment itself may go wrong. You know, there are multiple operating systems: there is Windows, there is Linux, macOS, and others, and there are differences between them, right? Like the characters indicating the end of a line, different ways of setting permissions, Docker with different parameters, host networking (--net=host) that works on Linux but doesn't work somewhere else, and other things that may break along the way. Not to mention that during the deployment you also need to propagate the changes from your local environment and your staging environment to the production environment, and you may simply get that wrong or just forget to do so. For instance, how many times has it happened that everything worked well in the pre-production environment when we deployed it to the cloud, but once we went to production it simply stopped working, because we forgot to propagate things
like IAM roles, permissions, database settings, connection strings, and other stuff, right? Just getting the software deployed is not as straightforward as it seems. Your unit tests may capture that your logic is flawed, but they won't capture whether you got the deployment right. Okay, so that is the very first thing that may break. The other thing that may go wrong is that your code may just start failing in production. This happens because of various things you got wrong: bugs in your application, but also different assumptions and a different environment that you are now running in. Your locale may be different, the country you deploy to may be different, and, as already mentioned, things like end-of-line characters may be completely different, along with all the other stuff that means an application that worked well in the local or pre-production environment doesn't work well when you go to production. Those things we can capture, again to some extent, using automated tests, unit tests, integration tests, or whatnot. But generally what we need is to be able to tell very early that the things we are checking in our local environment will not break when deployed to production. The third area we'd like to discuss is the completely different context and completely different workload we face when we run in production. This may be, obviously, a different country, different data patterns, different data distribution. You tested things in Europe, but then you deploy to the US and you get completely different input to your application, because you have different clients working over there, right? Another thing is a different data load. You tested stuff on a very small database with, say, 100 rows in it, but when you go to production you get millions of rows, and your application just can't keep up with the load, with the pace you observe in a real, live environment. The last thing is edge cases. You have a completely different situation when you deploy to a big country versus a small country. You may lack some pre-populated content in your e-commerce shop or whatnot. You may get a different distribution, different characters, different encodings. Things may just break because you did not expect them to break, because you didn't foresee how they could break. That's because there's a completely different country, a completely different environment you're running in. So now the question is: how can we make sure that on Friday afternoon we can safely deploy to production? And the very first thing they tell you is to go with CI/CD, right? If you do CI/CD, everything is going to work well, right? So, just to set things straight, CI/CD stands for a couple of things. CI stands for continuous integration. Continuous integration is basically when, after we commit to the repository, we merge all the changes into one branch and then we build the package that we would like to deploy. We build the package, we run all the unit tests, all the automated tests, we run everything and prepare all of that so that we are pretty sure the package we have is going to work well. Okay, then we have another thing, which is continuous delivery. In continuous delivery, what we do is take the package and deploy it to all non-production environments.
We deploy the package, install all the dependencies, configure everything, and start testing whether the package works correctly. We also make sure that all our teammates and all the teams around us are using exactly the same code if they, for instance, don't deploy things locally but instead use developer or team environments deployed in the cloud, right? And finally, we have continuous deployment. Continuous deployment is when you take the package you have just built and deploy it to the production environment directly. And this is cool: you take the package you already tested in non-production environments, and then you go and deploy it to production. And this is when things may break. Obviously they tell you CI/CD is going to protect you from all the issues, but does it? What kind of issues can you capture when running unit tests or integration tests, what things can you spot, and what issues may just go unnoticed and appear in production? So what we are going to do over the next minutes of this talk is look at some specific issues that may actually happen because you can't catch them easily in your CI/CD pipeline. We will cover databases, we will cover ORM libraries, and we'll finally cover the lack of context and the lack of understanding of all the moving pieces around them. So let's jump straight to it. The very first thing with databases is that we do deploy them, but they consist of multiple parts. First, when we make a change, we typically change the code that is talking to the database, so we change the queries that we send to it. This may result in various things. For instance, the query may now just be inherently slower: if you restructure your query, or if you extract more data, or if you change columns, you may change how your query executes and you may, for instance, get lower performance, or just bugs. So that's the first thing. The other thing we may get is schema changes. With schema changes, whenever we deploy something to the database, we may end up in a situation where the schema modification is going to run for a very long time. If you run things locally with small tables, then adding a new column, changing a column type, or doing something else with the schema of your database is going to run really, really fast. But once you go to production and you have millions of rows, a seemingly innocent schema migration may literally take minutes or even hours to apply. Your table may need to be rewritten, meaning that your SQL engine may need to copy the data from the table to the side, recreate the table, and then restore the data back. And this is not something you can do in a split second; this is something that is going to take minutes. So this will take your application down. Another thing is query plan changes. It may be the case that something in your database has changed for whatever reason: indexes, configuration, installed extensions, views, stored procedures, triggers, whatnot. Multiple things may change, which will result in your query now being executed completely differently. For instance, previously it was using an index to scan your table; now it can't use this index for whatever reason. Or maybe it was doing a hash join or a merge join; now the statistics are out of date and it just goes with a nested loop join, which will simply be slower.
There are other things that may change in the database: statistics, bugs, missing indexes, data quality, configuration, locks, partitioning, and other stuff that lives in your database, and your unit tests will just not catch that. Your unit tests test your code and assume how it's going to work against your database, but they do not know how you configured your database, and especially they have no idea how you configured your production environment. So let's see what else may go wrong. For instance, we may get slow queries that work pretty well in your testing environment but do not work when you go to production. Imagine that we have this application code, where we have an aggregate root for the user, and we would also like to query for some details of the user: locations, pages, texts, whatnot. Because we use an ORM library, this in turn could create a query like this one, where we basically join multiple tables. This is an example from an actual production environment, and this is the code I was debugging once. Fetching just one entity into the application resulted in 300,000 rows being returned, because this entity had so much detail data: questions, texts, reports, and the other tables we joined together. And then the ORM worked hard for nearly 25 seconds to deduplicate everything and construct the final aggregate object, the aggregate root for this user. So this is what may happen. You won't catch this issue when you run your tests against a very small database available locally. You won't catch it when you have 100 rows in your database. You can catch it if and only if you have literally thousands or millions of rows, because only then do you see the cross-join product of all those tables. How can we fix that in this case? Well, once we identified what the issue was, the fix was pretty straightforward. Instead of getting the whole aggregate root in one go, I was getting it with multiple queries, as multiple objects. This resulted in sending multiple SQL statements to the database and then joining all the results in the application code. One might say, okay, but now this sends more SQL queries to the database, and yes, you'd be right. The thing worth noticing, though, is that those queries now run way faster: they execute in a split second, because they do not extract 300,000 rows. So all of that is now much, much faster, and the fact that we split one query into multiple ones is not a problem at all, because it ultimately leads to better performance. So this is something your unit tests probably won't tell you about. You could capture it with some load tests, and we'll get to that a little bit later on. But generally, by testing things locally in your CI/CD pipeline, you won't spot an issue like this one, because it will simply work fast enough.
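To make the fix just described a bit more concrete, here is a minimal sketch of the split-query approach. The table and column names (users, pages, texts) and the connection string are hypothetical, and this is plain psycopg2 rather than the original ORM code; it only illustrates the idea of several small queries assembled in application code instead of one exploding join.

```python
import psycopg2  # assumes a reachable PostgreSQL database

conn = psycopg2.connect("dbname=app")  # placeholder connection string
cur = conn.cursor()

user_id = 123  # hypothetical identifier

# Instead of one query joining users with every detail table (which multiplies
# rows into a huge cross-join product the ORM must deduplicate), issue a few
# small queries and assemble the aggregate in application code.
cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
user = cur.fetchone()

cur.execute("SELECT id, title FROM pages WHERE user_id = %s", (user_id,))
pages = cur.fetchall()

cur.execute("SELECT id, body FROM texts WHERE user_id = %s", (user_id,))
texts = cur.fetchall()

# Each result set stays small, so the total work is far lower than
# deduplicating 300,000 joined rows on the application side.
aggregate = {"user": user, "pages": pages, "texts": texts}
print(aggregate)
```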
Moving on: your libraries, your tools, or even you yourself may produce queries that are equivalent in terms of what they do but completely different in terms of performance. Let's say we have a table called boarding passes, with something like 8 million rows. We have a handcrafted query that takes all the boarding passes and calculates a checksum of the ticket number, applying the MD5 algorithm twice. What we do next is join this CTE three times and filter for one specific double hash of the ticket number. The query at the top uses a CTE, a common table expression, which is kind of like a temporary table used only for this query. This statement is equivalent to the one down below, which does not use a CTE but instead reads from the boarding passes table three times directly and then does the filtering in the WHERE condition. So those two queries are equivalent in terms of the logical output they produce. The only difference is that the query at the top runs in 13 seconds total, whereas the query down below runs in 8 seconds, so nearly half the time of the one above. Again, this is something you can't check with your unit tests, because both queries give you the proper answer, and unit tests check whether your code produces the expected result. But unit tests, and especially your CI/CD pipeline, will not capture the performance characteristics and will not spot that those two queries, even though they are equivalent, work completely differently. Another thing we may hit is incompatible changes in schemas. Adding a column seems like something that will not break your production database at all, but it may take a lot of time to apply: if you add a column, your table may need to be rewritten, so the data may need to be copied to the side, the table recreated, and the data restored back into it. And this is not something you do in a second; it will take minutes. Dropping a column seems like something that should be easy enough, but the problem is that if you deploy to a really big fleet of machines, you may end up with a split brain, meaning that half of your machines are already running the new code, whereas the other half still run the old version that expects the column to be there. Another scenario is when you have a heterogeneous setup, meaning one database accessed by multiple applications written in different technologies: one in JavaScript, another in Python, in Java, in .NET, in Rust, you name it. Those other technologies do not control the schema, so if you change the table schema in your database, all those applications need to be updated, and if you do not keep backwards compatibility, you may end up with issues. Another thing is changing a column type. If you change the column type, your table may get rewritten, but this may also lead to problems where you lose precision or data quality, or maybe your application cannot read the column anymore because the internal representation has changed. So those are the things that may go wrong when we change the table schema and deploy it to production. Another thing is indexes. If you lack an index in production, your query may just be very, very slow, and your unit tests are not going to catch that. How can you fix it? You need to configure a proper index of a proper type, whether it's a B-tree index, a hash index, a GIN index, or whatnot. If you configure a proper index, your application is going to work faster.
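Coming back to the boarding-passes comparison above, here is a rough reconstruction of the two equivalent statements. The transcript does not include the exact SQL from the slide, so the column names, the selected output, and the filter value are hypothetical, and the small timing loop is just one way to compare them; in the talk, the CTE version took about 13 seconds and the inline version about 8 seconds on an 8-million-row table.

```python
import time
import psycopg2

WITH_CTE = """
WITH passes AS (
    SELECT ticket_no, md5(md5(ticket_no)) AS double_hash
    FROM boarding_passes
)
SELECT p1.ticket_no
FROM passes p1
JOIN passes p2 ON p2.ticket_no = p1.ticket_no
JOIN passes p3 ON p3.ticket_no = p1.ticket_no
WHERE p1.double_hash = '<some md5 value>'
"""

WITHOUT_CTE = """
SELECT p1.ticket_no
FROM boarding_passes p1
JOIN boarding_passes p2 ON p2.ticket_no = p1.ticket_no
JOIN boarding_passes p3 ON p3.ticket_no = p1.ticket_no
WHERE md5(md5(p1.ticket_no)) = '<some md5 value>'
"""

conn = psycopg2.connect("dbname=app")  # placeholder connection string
cur = conn.cursor()

# Both statements return the same rows; only the execution differs.
for label, sql in [("with CTE", WITH_CTE), ("without CTE", WITHOUT_CTE)]:
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()
    print(label, round(time.perf_counter() - start, 2), "seconds")
```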
But what happens if we tell some developer: hey, if your query is slow, configure an index? Then we end up, obviously, with "index all the things", and you have too many indexes configured in your database. While an index can help speed up querying, it will slow down data modification, because now you not only need to modify the entity, you also need to update all the indexes around it, on every column you configured. So one data modification may now lead to multiple indexes being modified, which may be slower; again, something you won't catch with your CI/CD pipeline. We also get to the situation where we deploy our code to production and the code is buggy. There are different kinds of bugs. It may be buggy just because we didn't implement it properly; this is something our unit tests will be able to spot most of the time, obviously. But there may also be bugs in the engine we use, in the database we use. One of the examples is the Halloween problem. The Halloween problem was a case from a couple of decades back where, if you tried to update the salary of all the employees, just as you can see on the screen, the database engine updated the same row multiple times because it still met the condition. Here on the screen we want to update the salary when it's below 10,000. So imagine that we start with a salary of 1,000 and we want to increase it by 10%. We end up with 1,100, and this is still below 10,000. So the database engine kept updating this row over and over again until it finally stopped meeting the filtering condition, that is, until everyone was earning at least 10,000. That is what happened back then. Nowadays your databases protect themselves from the Halloween problem, but you may end up hitting other edge cases. Did you know that in your SQL database, when you use the read committed isolation level, then according to the standard and the implementation details of the databases, your application is allowed to read the same row twice, or to skip a row? And read committed is, most of the time, the default isolation level you use. But that's it; those are the big areas when we are talking about databases themselves. However, we often interact with databases using ORMs. ORMs, object-relational mappers, are libraries that help us simplify querying the database and mapping data back and forth between our application and the SQL engine. So let's see what may go wrong there. The very first thing that often breaks is the N+1 selects problem. Imagine that we start with the aircrafts table, which is in a one-to-many relationship with the seats table. What we would like to do is get all the aircraft, and then for every single aircraft get the number of seats. What may happen behind the scenes is that this will generate N+1 queries. Why? Because first it will go to the database to get all the aircraft, and then for every single aircraft, as we loop over all of them, another query goes to the database to get the seats for that particular aircraft from the seats table. So this results in one query to get all the aircraft and N additional queries to get the seats for every single aircraft. And this can be improved; the problem is easy to solve. Instead of going with N+1 queries, we can simply join the two tables together in one query and, bang, we are good to go.
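Here is a minimal sketch of both shapes of that traffic, the N+1 pattern and the single-join alternative, using hypothetical table and column names (aircrafts(aircraft_code) and seats(aircraft_code, seat_no)) and plain psycopg2 so the generated SQL is visible; an ORM would produce the same patterns behind the scenes depending on its lazy or eager configuration.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
cur = conn.cursor()

# The N+1 pattern: one query for the aircraft...
cur.execute("SELECT aircraft_code FROM aircrafts")
aircraft_codes = [row[0] for row in cur.fetchall()]

seat_counts = {}
for code in aircraft_codes:
    # ...plus one more query per aircraft, issued while looping in the application.
    cur.execute("SELECT count(*) FROM seats WHERE aircraft_code = %s", (code,))
    seat_counts[code] = cur.fetchone()[0]

# The eager alternative: a single round-trip that joins and aggregates in the database.
cur.execute("""
    SELECT a.aircraft_code, count(s.aircraft_code)
    FROM aircrafts a
    LEFT JOIN seats s ON s.aircraft_code = a.aircraft_code
    GROUP BY a.aircraft_code
""")
seat_counts = dict(cur.fetchall())
print(seat_counts)
```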
However, in order to generate this eager query that fetches all the data up front, instead of doing it in a lazy fashion, we need to reconfigure our ORM. Can we tell from the application code whether it's going to send lazy, N+1 queries or go with eager mode? We can't, and the ORM doesn't show it to us easily, because the configuration is obscured and stored somewhere else. And even if we reconfigured our ORM to always go eagerly, which may not be the best idea, we would end up with different issues, just like the query we saw a couple of slides back: we get multiple tables joined together, which slows down the performance, and the solution is to split that into multiple queries. So do you know how your ORM is going to work behind the scenes and what it's going to do? And more importantly, can you catch that automatically with your unit tests and your CI/CD pipeline? The answer is that most of the time you simply cannot. Moving on, other issues that ORMs introduce are related to the impedance mismatch. Impedance mismatch is a generic term meaning that the model we store in our object-oriented applications is different from the model in our SQL databases. They are different because in our applications we have, for instance, polymorphism: one class can inherit from another. And the question is how we represent that in our SQL database without losing data and without losing performance. There are obviously a couple of different approaches to do so, for instance table-per-hierarchy or table-per-type, but generally this is something that may lead to issues in terms of performance or data quality. Another thing we may run into is which data types to use. It sounds simple enough to store data in a SQL database, right? But how do you, for instance, store spatial data? Spatial data is basically a geographic location, like the longitude and latitude of a point somewhere around the globe. We can store that in our database, and SQL engines typically have a dedicated data type for it. But how is our ORM going to deal with that? Is it going to store this as a pair of numbers, or maybe as a string, or as something else entirely? And you may think, okay, I don't use spatial data, this is some weird edge case I'm not interested in. But even strings are something that can be very prone to the impedance mismatch. In our applications, in JavaScript, .NET, Java, whatever, we have just one string type. But in a SQL engine we can often configure a thing called collation. Collation is basically the ordering of characters: whether lowercase letters sort before uppercase ones, how national characters are ordered, how to compare those characters, what to do with the encoding, and whatnot. And you can configure that per database in your SQL engine, meaning you can reconfigure it dynamically, and your application should reflect that. But because you have just one string type in your application, things will not work well in that case. So this is another issue caused by the impedance mismatch. Yet another one is precision. In our applications we have floats and doubles, and that's typically all we can use. But in a SQL engine you can use decimal or numeric types with a precision you specify.
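As a tiny illustration of that precision gap, here is a sketch using Python's decimal module to stand in for a NUMERIC column and a plain float for the application-side field; the column definition it mimics is hypothetical.

```python
from decimal import Decimal

# What a NUMERIC(10, 2) column keeps exactly...
total_in_db = Decimal("0.10") + Decimal("0.20")   # Decimal('0.30'), exact

# ...versus the same arithmetic after mapping the values to application-side floats.
total_in_app = 0.10 + 0.20                        # 0.30000000000000004

print(total_in_db, total_in_app)
```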
You can configure the precision of your numbers in the database so you don't lose data, but once the values reach your application, you may simply degrade the quality of your entities. Another thing with ORMs is that they lack visibility, meaning there are so many moving pieces that ORMs take care of. Starting with the transaction isolation level, or transactions in general: the transaction scope, who starts the transaction, who rolls it back when something goes wrong, how it is rolled back, whether you can nest transactions, whether the data is cached for your transaction or not, what the pooling of your database connections looks like, query hints, stored procedures, and whatnot. All those things your ORM is taking care of. But the problem is: do you see this configuration? It is not stored very close to your application code; it's basically hidden somewhere in your app config, and it may differ between your environments. So again, your unit tests are not going to catch that. Moving on: migrations. Is your ORM handling migrations, or is it some set of SQL scripts doing all of that? Is your application even dealing with schema migrations and with the schema of your database, or maybe it's some other application doing that? What happens if your ORM recognizes there is a schema drift, some differences in columns or in data types? Is your ORM going to step in and fix those differences automatically, because some ORMs do that? How do you deal with a heterogeneous scenario, when you have multiple applications talking to the same database? Those are the things that are very hard to see, just because they are not clear from our application code. Not to mention: how do you test whether your migration is going to run fast enough? So generally, your ORM hides tons of configuration behind the scenes. It starts with the application code, where you have no idea whether it's going to produce N+1 queries or download the data in eager mode. We can carry on with models, migrations, the configuration of the ORM, stored procedures, functions, triggers, and other stuff. All those pieces are there, but they are hidden very deep inside your application. So how do you capture all of that? How do you verify whether what you ultimately implemented is going to work well in production or not? How is your CI/CD pipeline going to help you with that? The thing that we lack is context. We do not see how the things we implemented are going to work in production. We do not see the performance, we do not see what they actually do behind the scenes, and we lack this context. And we need this context in order to build proper database guardrails. We need the ability to understand everything that happens in our database, in our environment: what the running configuration is, what the query is, what the execution plan is, and so on. With a typical CI/CD pipeline and no database guardrails, we have no idea how any of this is working. So what we need to do is build database guardrails: elements that are going to protect us from deploying bad code to production, and elements that are going to observe how things work in your production SQL database, to let you know, right when you are typing the code against your local database, that these things are not going to work well in production. So let's see how we can do that.
In order to do that, we first need to understand how SQL queries are executed, which means understanding the SQL engine a little bit. Nearly every database, when it executes a query, pushes that query through multiple stages. The very first stage is the parser. The query is parsed, meaning its textual representation is turned into a thing we call an abstract syntax tree, AST for short. The AST is basically a representation of what the query is trying to do. Then the query gets rewritten by the rewriter, because you can write the same query in many ways: there are multiple queries that are equivalent but just written differently, using different aliases, numbering columns instead of naming them, and whatnot. Those queries do the same thing, but we would still like to be able to reason about them in the same way, so they need to be rewritten into some standardized form that is easier to process. Then there is the third step, which is called planning. The SQL engine plans how to execute the query: it goes through the database and figures out what indexes there are, what tables there are, how it can join tables, how it can extract the data, whether it can use indexes, whether it should cache the data, hash it, sort it, and whatnot. This planning is crucial to executing the query, because it produces the plan, the actual idea of how the query is going to be executed by the database. And finally, the executor takes the plan and simply carries it out. So let's see execution plans in action. Whenever we send a query like the one on the left, we can ask the database to explain how it's going to execute it. This example is from PostgreSQL, and what we can see is that it generates a very nice plan for us. Let's dive into this plan. Every plan consists of nodes; basically, every row you see in the plan is a node that represents some operation executed by the database. Every node has a type, meaning the kind of operation it performs: a sequential scan, a bitmap heap scan, a nested loop join, or whatever else; generally, the operations the SQL engine is going to execute. But apart from the type of the node, we also have a thing called cost. Cost is basically an arbitrary measure of how hard it is to execute a given operation, mostly in terms of I/O: how much data needs to be read from the drive, how much data needs to be spilled over to the drive, and whatnot. Generally it represents the complexity of the operation; the higher the cost, the slower the operation. So if we could ask our SQL engine, hey, this is the query that my ORM generated, please tell me how you are going to execute it, then we would end up with a query plan just like the one on the right, and based on this plan we could tell how expensive it is. Obviously, your SQL database is going to generate multiple plans and compare them based on cost to pick the cheapest one. But for us, what matters is that we can take this plan and see how the database decided to execute the query, which indexes it decided to use, and why. So this is the first thing we could do to start building our database guardrails.
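For reference, here is a minimal sketch of asking PostgreSQL for a plan from Python, with a hypothetical table and a made-up ticket number; plain EXPLAIN only estimates the plan and its costs, while EXPLAIN ANALYZE would also run the query and report actual timings.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
cur = conn.cursor()

# Ask the engine how it would execute the query; nothing is actually run.
cur.execute(
    "EXPLAIN SELECT * FROM boarding_passes WHERE ticket_no = %s",
    ("0005432000284",),  # made-up ticket number
)
for (line,) in cur.fetchall():
    print(line)

# Illustrative output shape only (node types and costs depend entirely on
# your data, indexes, and configuration):
#   Seq Scan on boarding_passes  (cost=0.00..52540.00 rows=1 width=25)
#     Filter: (ticket_no = '0005432000284'::text)
```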
The other question is how we are actually going to extract all of these pieces. To do that, we can use modern observability tooling called OpenTelemetry. OpenTelemetry allows us to capture logs, traces, and metrics from our applications to provide better observability. OpenTelemetry, OTel for short, is basically a standard provided by the CNCF, the Cloud Native Computing Foundation, defining and providing a set of SDKs for instrumentation, SDKs for basically every modern language you can think of. You generally drop an SDK into your application, and OpenTelemetry then processes things called signals. Signals are our traces, our metrics, our logs, and other data that we can combine together to understand what's going on behind the scenes in our application and how things are actually executed. With OTel, we can create spans and traces. Imagine that we have an application processing some workflow, say a checkout workflow in our e-commerce app. A user clicked the checkout button and wants to start processing the payment. Our application, say node A, which could be a web server, a load balancer, or whatever, gets this request, and then it needs to call some other microservices, some database, some log storage, or whatever. So we can see that this request coming into node A is propagated down the line to other services, and this is what OpenTelemetry can capture for us. We can capture the whole view of what's going on, and this whole view is what we call a trace. A trace represents one particular workflow in our application and consists of spans. Spans are the single pieces representing how a particular bit of code executed: how there was a call to some other node, like node C, how long it took, what the parameters were, and whatnot, showing exactly what's going on. We can also see what details we can capture using OpenTelemetry: a friendly name, timestamps, attributes, and so on. So this is what we can use to instrument our application and capture all of this.
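As a minimal sketch of what that instrumentation can look like with the OpenTelemetry Python SDK (the opentelemetry-api and opentelemetry-sdk packages), here is a hypothetical checkout workflow traced with nested spans; the span names and attributes are made up for illustration, and a console exporter stands in for whatever backend you actually ship the data to.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer provider; here spans are just printed to the console,
# in a real setup you would export them to your observability backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# One trace for the whole checkout workflow, with child spans for the pieces it calls.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("user.id", "123")  # made-up attribute
    with tracer.start_as_current_span("charge-payment"):
        pass  # call the payment service here
    with tracer.start_as_current_span("query-database") as db_span:
        db_span.set_attribute("db.statement", "SELECT ...")  # the captured SQL text
```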
Now, if we combine these two things, the execution plans we just discussed and OpenTelemetry, we can capture the true behavior of our application. The only problem is: when do we capture it? Obviously, we could go with load tests: get our application deployed, start load testing it, gather all the execution plans, all the statistics, all the metrics, and see how it works. The problem with load tests, though, is that, first, they are super expensive in terms of time and money. It takes hours to execute proper load tests, and it takes a lot of money to pay for the fleet, for the generated traffic, for the hardware; if we are load testing an ML-based application, we also need GPUs, which are crazy expensive. Second, we need to reproduce the traffic properly, so we need the proper cardinality and data distribution, we need to anonymize the data, we need to be GDPR compliant, and so on. It's not as straightforward as it sounds. And finally, load tests happen very, very late in our CI/CD pipeline: after we merge the branch, after we run unit and integration tests, after we do the code review, after we deploy to the pre-production environment; they happen at the very end of the pipeline. And this is typically too late to get meaningful, actionable feedback, because by the time the load tests tell us, hey, this thing is not going to work in production, we are probably hours, if not days, past the implementation phase of that particular feature. So load tests are expensive, slow, and way too late in our pipeline. What we need instead is to be proactive. We can't let issues appear in production; we need to find them during our CI/CD pipeline, as early as possible. We need to push all those checks to the left, find the issues automatically, and monitor and observe our applications constantly to get better troubleshooting and better root cause analysis. And we need that now, no matter what our application is, whether it's a small app or an enterprise application in a big Fortune 500 corporation. We need a completely new, modern approach to getting proper database guardrails. And Metis does exactly that. Metis is a solution that prevents bad code from reaching production, that can monitor and observe all your databases, and that can automatically troubleshoot them for you. It uses all the principles we've just discussed, OpenTelemetry, execution plans, and whatnot, and can improve your CI/CD pipeline by providing proper database guardrails. So let's see it a little bit in action. Metis prevents your database code from breaking production. Once you register in the application, you end up with a project; a project basically represents one of your applications interacting with a database. You drop one dependency into your application, a dependency that uses OpenTelemetry and does everything we discussed behind the scenes. What we can do then is show you the recent activity. For instance, we can tell you: hey, your application is exposing these REST endpoints, and when you called this endpoint you got an HTTP 200 response; but more importantly, these are the SQL queries that were generated behind the scenes by your ORM, by your SQL driver, whatever. We are developer centric, so we want to give very direct information: whether a query is going to work well or fail, what the impact is, and how to fix it. Metis does exactly that; it shows you all of this and gives you a very direct signal on whether the things you have are going to work well in production or not. But if you want to dig deeper, feel free to do so. You can, for instance, get all the SQL statements with all the tables listed, along with how they were accessed and executed. You can get metrics for your queries, so you can see exactly what the cost was, what the execution plan was, and whatnot. You can get a nice visualization of all the operations, showing how your SQL engine actually executed the statement. If you want, you can get the raw execution plan to process further. And you can also see all the tables, metrics, timings, and other details you have in your application. So this is what you can do, and it gives you the observability you need during your CI/CD pipeline. More than that, we can also integrate with your CI/CD actions.
For instance, if you use GitHub Actions, you can configure your pull requests so that Metis interacts with them, analyzing all the queries and all the schema migrations from your application automatically. For instance, it can point out: hey, you are trying to migrate this schema, these are the indexes you are trying to configure, and maybe this is not going to work, and whatnot. So this is how you build proper database guardrails into your CI/CD pipeline: you get immediate feedback as a developer, right when you are typing things down, but you also get a kind of database review for your SQL interactions during the CI/CD pipeline. But that's not the end of the story. After you deploy, Metis can also monitor and observe your databases. For instance, it can automatically analyze the schema of your database and suggest: hey, you do not have indexes configured here, maybe you should add them to improve performance. You also get a very nice observability dashboard showing you, for instance, the slowest queries, so you can see: hey, this is a query that was executed recently, and, bang, this is its performance. You get anomaly detection with details of deployments, the running configuration, and whatnot. You get statistics for your tables, like the number of dead rows, autovacuum activity, and other stuff. You get index usage: hey, you do have indexes configured in your database, but maybe this one hasn't been used for the last two weeks. You get extensions, database configuration, and all of that just by dropping in one open-source Docker container that runs alongside your database. So generally, this is what you can get with Metis, and this is exactly what we would like to have: all of our stuff covered, source code integration, pull request analysis, constant monitoring and observability running 24/7, the full cycle, preventing bad code from reaching production and automatically troubleshooting things if something breaks in production. Metis integrates with your source code across various languages, various ORMs, and various databases, no matter whether you host them on premises or in the cloud. It integrates with your CI/CD pipeline and analyzes your pull requests and merge requests, for instance in GitHub Actions. It can also give you observability through ad hoc analysis, a web interface, CI tools, and whatnot. So you get all of that at your fingertips, with the modern OpenTelemetry standard and a modern approach to building proper database guardrails into your CI/CD pipeline. And that's the idea. Databases may break: you may get slow queries, you may get a wrong SQL schema migration, you may break the configuration by dropping, removing, or no longer using an index. And you can't wait for these issues to pop up in production; you need to catch them as early as possible. You can't rely on load tests: they are way too late, way too slow, and super expensive. You need to be proactive and push things to the left. And for all of that you can use Metis. Metis covers all of it and makes sure you do not fly blind and do not deploy bad code to production. That being said, I'd like to thank you for tuning in and coming to this talk. I hope you enjoyed it, and please enjoy the rest of your conference. Thank you.
...

Adam Furmanek

DevRel @ Metis



