Conf42 Cloud Native 2025 - Online

- premiere 5PM GMT

Managing Databases in the Cloud is BROKEN!

Abstract

Managing databases in the cloud is more challenging than it should be—complexity, performance issues, and hidden costs create constant headaches. This session dives into why current approaches are broken and explores practical strategies to simplify operations, optimize performance, and regain control.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey guys, it's nice to see you all again. My name is Matan Nataf, co-founder and CEO of Rapydo. We started Rapydo about three years ago, when we saw companies struggling to manage fleets of databases in the cloud. Today we will dive into why. We're not going to talk about Kubernetes; we're going to talk about databases in cloud native architectures, the stepson of every CloudOps organization, and why today's way of managing databases in cloud native, microservices environments is broken.

So let's dive a little bit into the problem. Cloud native and multi-tenant apps have changed the way people consume databases. One of the biggest concerns when managing a multi-tenant app is the noisy neighbor issue, where one service can interfere with another. This is why autoscaling and scale-out capabilities are so valuable in cloud native environments, and yet the only piece of the entire stack that doesn't autoscale is the database. Managed services like RDS also help organizations shift more responsibility to developers: they write the queries, they use ORMs, and in many cases a specific dev team is in charge of managing a specific database, and so on. On the other side of the scale, we still see operational teams needing to take care of databases and make sure their uptime, cost, and performance stay on track. This creates a situation where, on one hand, we have SRE and CloudOps teams living in a world of minutes: they need to respond now, they need to do it efficiently, and it has to work, deterministically. CloudOps teams are essentially in charge of infrastructure costs, database SLAs, database uptime, and, as I said, performance; when queries are running slow,
the customer-facing team will typically reach out to the CloudOps guys. On the other side of the scale, we see developers or dev teams that are in charge of other things: they need to deliver new features, they need to work fast, they need to make sure their code runs efficiently, and so on. And they have almost 100% of the impact on the database. So on one hand we see these guys living in a timeframe of days, always waiting for another dev cycle, and they have all the impact: they generate the queries that will hit your database at the end of the day. On the other side we have the guys who typically have to take the consequences of what the devs are doing. This is not a healthy situation. It keeps SRE and CloudOps teams from meeting their KPIs and causes a lot of trouble managing databases.

The types of issues we see SRE and CloudOps teams facing today start with visibility. If I manage a fleet of databases in a multi-tenant environment, the first problem is: how do I see the most pressing issues across my fleet? How do I know where the slowest query is? How do I know where the latest CPU spike was, or where CPU spikes are getting close to the edge and jeopardizing my SLA? All of these activities are very hard when managing a fleet of databases, and by definition they force CloudOps teams to be reactive instead of proactive and preventive, responding to issues before they actually arise and take the database down. So managing a fleet of databases brings a completely different set of challenges for CloudOps teams, especially when they don't have DBA skills sitting within their teams, which is typical. And even if they do, it's not easy.
This is why we built a platform addressing this operational problem that CloudOps teams have when managing a fleet of databases. It always impacts the business in various ways: the performance of your application, the SLA of your application, your AWS bill. All of these KPIs are things Rapydo can solve for you; we address that operational challenge directly. In a nutshell, you would like your ideal tool to give you observability across multiple instances. There are some tools in the world that do that: you have Datadog DBM, you have SolarWinds, and I know New Relic recently released a new tool for managing a fleet of databases. But all these tools do only one thing, and that is to give you observability. The other thing you would like from the platform that manages your databases is automation: the ability to cache queries, to protect your database from CPU spikes, to accelerate slow queries, to optimize the underlying infrastructure cost. Above and beyond anything, to improve resiliency and the underlying costs through automation. Rapydo is the only tool in the world that does that, and today I'm going to show you a live demo of how it actually works in action.

What you can see here is a CPU graph from one of our biggest customers. They had a couple of hundred RDS instances of various sizes, but by and large 8xlarge, which they had grown all the way up to. They started smaller: they had a CPU spike, and a CPU spike is an emergency, so they went ahead and increased the underlying database infrastructure. They hit the roof again, they increased it again, and always twofold, because when you work with RDS, any scale-up increases your infrastructure by 100%. So what happened when they started to use Rapydo?
With that emergency, they inserted Rapydo into their r5.8xlarge instance, exactly three or four months ago, last December. As you can see, just by inserting Rapydo with a couple of rules, the CPU baseline dropped from 60-100% down to 20-30%, so they really moved away from the danger zone, from the area where something can go wrong with replication. At this point you can see they decided to shrink the DB instance, and the baseline went up a bit, from 20-30% to maybe a flat 30%. Then they shrank it again, and you can easily see the baseline grew. But even when the instance was 75% smaller than on day one, the day Rapydo got in, you can see they still didn't reach 50% CPU. They didn't shrink it one notch further because it was not that significant; once you've reduced by 75%, you are very happy. In total they saved more than half a million dollars per month on RDS compute. How exactly did that magic happen? Let's move on and see Rapydo in action.

Okay guys, so this is the Rapydo console. Here we have the master dashboard. As I told you, the Rapydo master dashboard is always focused on showing you the most pressing issues across your entire fleet of databases. I can see the average query time across all my databases, and I can always drill down and sort by average duration. Every line item here is essentially the average query runtime on each database, or on each schema if you're using MySQL terminology, on a DB instance. You can easily see that my demo DB is by far the slowest database in my fleet. I can drill down and see specific queries. If I sort by max duration, or by count, you can actually see which query is generating most of the stress on my database. So let's look at the queries on my demo DB.
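The "most pressing issue first" view described above boils down to ranking every database in the fleet by its average query runtime. A minimal Python sketch of that ranking; the instance names and durations below are made-up sample data, not figures from the demo:

```python
from statistics import mean

# Hypothetical per-instance stats: "instance/schema" -> observed query
# durations in seconds. In a real deployment these would come from the
# monitoring agent, not a hard-coded dict.
fleet_stats = {
    "demo-db/demo": [333.0, 1.2, 0.9, 210.5],
    "orders-db/classicmodels": [0.05, 0.12, 0.4],
    "users-db/users": [0.01, 0.02, 0.03],
}

def rank_by_avg_duration(stats):
    """Return (name, avg_seconds) pairs, slowest database first."""
    ranked = [(name, mean(durations)) for name, durations in stats.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

for name, avg in rank_by_avg_duration(fleet_stats):
    print(f"{name}: {avg:.2f}s avg")
```

The same sort key could be swapped for max duration or execution count, which is exactly the re-sorting shown in the console.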
You can see that my demo DB, on database demo, actually has a query that ran for 333 seconds, while sometimes it runs in one second, so we can see there is a parallelism issue. You can see the query is a SELECT SQL_CALC_FOUND_ROWS, a calculation query that takes time. So again, I look from the bird's-eye view, I dive in, and I see the specific query on the specific instance on the specific database that generated all the stress.

You can do the same trick with locks and blocks. For any lock that took place on my databases, you can see a flat list of all the locks here. When I group by database, you can see this Postgres database is the one with the highest number of locks. When I group by locking thread, you can actually see which query is generating most of the locks, and it's a specific UPDATE that is generating most of the locks across my entire database fleet. So again, the whole notion is: go from the bird's-eye view, dive in, and see the root cause, the most pressing issue on my databases this morning. Everything you see here is in the context of 12 hours; I can look at 48 hours, three hours, or whatever.

And as you can see, there is a stubborn server here. Out of my entire fleet there is one database, let's go back here, one database called demo DB, and you can see that, not constantly, but at least for the last few hours, it ran at 99% CPU. So we have a CPU spike on that database, and at this point we want to go from a historical view to a real-time view. I can just navigate to my instances view and sort by CPU, and as you can see, demo DB here is still at 100% CPU utilization. By the way, one great thing here: you can manage Postgres and MySQL instances in the very same pane of glass. You can see that we have queries running here for two minutes.
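Grouping locks by the locking thread, as shown a moment ago, is essentially a frequency count over (blocking query, blocked query) pairs. A small sketch of that idea, with entirely made-up lock events:

```python
from collections import Counter

# Hypothetical lock-wait events: (blocking_query, blocked_query) pairs.
lock_events = [
    ("UPDATE accounts SET balance = ? WHERE id = ?", "SELECT * FROM accounts"),
    ("UPDATE accounts SET balance = ? WHERE id = ?", "SELECT sum(balance) FROM accounts"),
    ("DELETE FROM sessions WHERE expired = 1", "SELECT * FROM sessions"),
    ("UPDATE accounts SET balance = ? WHERE id = ?", "UPDATE accounts SET flags = 0"),
]

def top_blockers(events):
    """Count how many waits each blocking query caused, worst offender first."""
    return Counter(blocker for blocker, _ in events).most_common()

for query, waits in top_blockers(lock_events):
    print(waits, query)
```

On a live system the same pairs would come from sources like Postgres's `pg_locks` joined to `pg_stat_activity`, or MySQL's `performance_schema` lock tables.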
So it's not only high CPU; there are actual performance issues on that specific box. I can click, drill down, and see what is happening on that specific database in real time. This refreshes every three seconds; obviously it's configurable, but currently it refreshes every three seconds. If I stop grouping, you see a flat list of all the queries running on that box right now, in real time. We see queries of two minutes, we see a query at 1:56, so we see some issues here. If I group again, you can easily see that it's one query, or two queries, that are generating most of the stress on my database. And I want to do something about it. If I look at that SELECT *, I can see it's running, but it's not the slowest thing; it's not very slow. On the other hand, the SQL_CALC_FOUND_ROWS queries are what is actually killing my database, as you can see here at the top.

One of the great things about Rapydo: typically, when you explore performance issues, you are really interested in what happened one hour ago, or two hours ago. Using this tool, I can easily navigate back in time and see what happened one or two hours ago, and I can always go back to live. If I go back to live, I can literally select the problem. You can see with your bare eyes how the problem started to compound here, and I can go and look second by second, keeping the group-by. As you can see here, we had four executions of that query and one execution of the SELECT. If I go five minutes ahead, we have 19 executions. A few minutes further ahead, you can see we have 47 or 50 executions, and so on and so forth. So you can easily see how the problem started to compound on my database and build into a real issue right now.
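Seeing "one query" with dozens of executions relies on query normalization: stripping literals so textually different statements collapse into one pattern and can be counted together. A toy normalizer sketching the idea; production tools (pt-fingerprint, pg_stat_statements) are far more thorough:

```python
import re

def fingerprint(sql: str) -> str:
    """Collapse literals and whitespace so query variants group together.
    Illustrative only: handles simple string and numeric literals."""
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)   # string literals -> ?
    s = re.sub(r"\b\d+\b", "?", s)   # numeric literals -> ?
    s = re.sub(r"\s+", " ", s)       # normalize whitespace
    return s

# Two different executions collapse into the same pattern:
print(fingerprint("SELECT * FROM orders WHERE id = 42"))
print(fingerprint("select *   from orders where id = 7"))
```

Counting executions per fingerprint over successive time slices is what makes the "4, then 19, then 50 executions" compounding visible.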
One of the things I can do is go back to real time, pull the trigger, and kill the query that is generating the stress on my database, from the very same spot where I got the idea that something was wrong. You can see how it's gradually going back down, but it will spike back again; I know my app, and this application is throwing that query at my database over and over. You can see it's growing back to 16 and 19 and so on. So what I want to do now is be preventive. I want to make sure that issue is not going to interfere. I still see the SELECT *, and I don't like that SELECT *, but we have more urgent issues to take care of, and that's the SELECT SQL_CALC_FOUND_ROWS. What I'm going to do now is copy that query text and go to the world of automation. And guys, this is where we go beyond any other product in the world. We show you what's happening on your database, but we know you are an operational team; you don't have time to mess around. What you really have to do is solve the issue, address it right away. This is where you can see the cortex rules and the ability to deploy preventive rules.

What I want to do is limit that query right there. This is the name of the rule, and I can now look at multiple triggers. We have a rule-based engine here that can do all kinds of things. I can look at a query pattern and say: whenever someone runs that query, or, as a matter of fact, I don't care about the LIMIT, so I use a wildcard there, and I don't care about the table either, so whenever someone runs SELECT SQL_CALC_FOUND_ROWS, I want to trigger an action. I'm going to call the rule "limit SQL_CALC_FOUND_ROWS". I can apply it at a specific time of day, or 24/7; let's keep it 24/7. And what do I want to do when that... sorry.
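A rule of that shape, match a query pattern with wildcards and decide an action, can be sketched with Python's fnmatch. The rule list and the decide helper below are hypothetical, just to illustrate the pattern-to-action idea, not Rapydo's actual rule format:

```python
from fnmatch import fnmatch

# Hypothetical rules: a wildcard pattern mapped to an action.
rules = [
    {"name": "limit SQL_CALC_FOUND_ROWS",
     "pattern": "select sql_calc_found_rows *",
     "action": "throttle"},
]

def decide(query: str) -> str:
    """Return the action of the first matching rule, or 'allow'."""
    normalized = " ".join(query.lower().split())
    for rule in rules:
        if fnmatch(normalized, rule["pattern"]):
            return rule["action"]
    return "allow"

print(decide("SELECT SQL_CALC_FOUND_ROWS * FROM versions LIMIT 10"))  # throttle
print(decide("SELECT id FROM users"))                                 # allow
```

The trailing wildcard is what makes the LIMIT clause and the table name irrelevant, exactly as in the demo.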
Oh, I lost my text. Let me grab that again; it's still running, sure enough. I'm going to grab it, go back, and build a new rule on the query pattern, using a wildcard on the LIMIT and... you know what, let's forget about the limit. Anyone that runs SQL_CALC_FOUND_ROWS, I don't care whether it has a LIMIT or not; whenever it looks like that, I want to block it. When I block, this query will not run. I can block it only at certain times of the day, say between 7:00 AM and 6:00 PM, or 4:00 PM, and I can do it only on weekdays. So if it's some kind of mumbo-jumbo OLAP analytics query, I can allow it to run on weekends or in the middle of the night, so it doesn't interfere with my business, because this is an OLTP database. Or I can be a bit more gentle and say rate limit, or throttle. In that case, let's choose throttle. Let's give this rule a real name, "limit SQL_CALC_FOUND_ROWS", and save it. Now, it's active between... we're not in the right time window, so let's enable a rule I prepared before and disable this guy. And now, guys, I'm going to go back to my database. Demo DB is still at a hundred percent CPU. I'm going to drill down, select that query, kill it, and let's see what happens. What I expect to see now is that my database goes back to a low execution count, and as you can see, sure enough, it's now running only two times in parallel. There are various rules you can deploy with Rapydo; some of them I showed you here, and others are related to controlling database connections, limiting users, limiting tenants. So there are all kinds of actions you can take with Rapydo to mitigate stress. I can, for instance, take that nasty query, the one I told you I don't like, and say: you know what, if that's what you execute, let's cache it.
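The "block only between 7:00 AM and 6:00 PM, weekdays only" scheduling described above is a simple time-window check. A sketch, with the hours taken from the talk's example; the function itself is illustrative, not Rapydo's API:

```python
from datetime import datetime

def rule_active(now, start_hour=7, end_hour=18, weekdays_only=True):
    """True if a 'block between 07:00 and 18:00, weekdays only' window
    applies at the given moment."""
    if weekdays_only and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return start_hour <= now.hour < end_hour

# Wednesday 10:00 falls inside the window; Saturday 10:00 and
# Wednesday 22:00 do not, so an OLAP query would be allowed then.
print(rule_active(datetime(2025, 3, 5, 10, 0)))   # True  (Wed 10:00)
print(rule_active(datetime(2025, 3, 8, 10, 0)))   # False (Sat 10:00)
print(rule_active(datetime(2025, 3, 5, 22, 0)))   # False (Wed 22:00)
```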
So using Rapydo, you can say: cache that query. I didn't select the text, bear with me; let me copy the text here. I go to automation, I go to add a new rule on the query pattern: whenever that query takes place, use table-based invalidation. Table-based invalidation means that whenever someone updates the versions table, I will go and invalidate my cache, making sure the data in the cache stays coherent with what's really going on.

So guys, this is, in a nutshell, what Rapydo is all about. I encourage you to hit me up with questions over email, over Slack, over LinkedIn; I'm available on all media. I look forward to getting your questions. Sending you warm wishes from New York City. Take care. Bye-bye.
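Table-based invalidation, as just described, can be sketched as a result cache that tracks which tables each cached query reads, and drops every dependent entry when one of those tables is written. A toy illustration of the idea, not of Rapydo's actual implementation:

```python
class QueryCache:
    """Toy result cache with table-based invalidation: a write to a table
    evicts every cached query that reads from it."""

    def __init__(self):
        self._cache = {}     # query text -> cached result
        self._by_table = {}  # table name -> set of cached query texts

    def put(self, query, tables, result):
        self._cache[query] = result
        for table in tables:
            self._by_table.setdefault(table, set()).add(query)

    def get(self, query):
        return self._cache.get(query)

    def invalidate_table(self, table):
        """Called when an UPDATE/INSERT/DELETE touches `table`."""
        for query in self._by_table.pop(table, set()):
            self._cache.pop(query, None)

cache = QueryCache()
cache.put("SELECT * FROM versions", ["versions"], [("v1",), ("v2",)])
print(cache.get("SELECT * FROM versions"))  # cache hit
cache.invalidate_table("versions")          # e.g. after an UPDATE on versions
print(cache.get("SELECT * FROM versions"))  # None: entry was invalidated
```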

Matan Nataf

Co-Founder & CEO @ Rapydo

Matan Nataf's LinkedIn account


