Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Things Fall Apart: Navigating Managed Databases for Over a Decade as a Non-DBA


Abstract

Learn to navigate database benchmarks wisely! From crashing managed instances to skyrocketing storage costs, I’ll share hard-earned lessons from a decade of managing production databases on the cloud without DBA expertise.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. My name is Renato Losio, and today I would like to tell you my story as a non-database-administrator running managed databases on the cloud, specifically on AWS. I will share my mistakes, my hardest challenges, and hopefully some lessons and things to learn along the way. A quick agenda: I will give you a brief review of my RDS journey so far and why that is important, why it helps to understand what I will explain later on. I will show some interesting cases, and by cases I mean support cases. I'll discuss some challenges and, hopefully, some solutions as well. I will go through what I call super simple experiments, I wouldn't call them benchmarks, but something to give an idea of how to run RDS and managed databases. And we'll recap with some lessons learned. But before starting, I would like you to remember one number: a hundred thousand records. Think about how you can store a hundred thousand records in a database. It sounds like a very simple task, but how do you do it? How long does it take? Milliseconds, ten seconds, one minute? Who knows; it depends how you do it. We'll see that later on.
My RDS journey so far: I've been running databases on the cloud for over a decade, since 2011. That makes me old; it doesn't make me an expert. I've been mostly running my databases on Amazon RDS, a service that Amazon introduced about 15 years ago. This is the post that Jeff Barr published in October 2009, and that's almost the entire post; it gives you an idea of what the service was at that time: a very simple managed database compatible with MySQL 5.1. Back in 2011, when I started, not long after it was announced, the key part of the service was simplicity. There was not much to choose; we'll see later on what the benefits and the limitations of that are. There were of course a lot of constraints, and when we talk about "once upon a time, the good times when the cloud was simpler", we tend to forget that it was also much more expensive at that time.
So anyway, what was the feature that drove me, and why? First of all, why did I start to use managed databases? I was, and somehow still am, a software engineer. I started as a Java developer, familiar with Java web applications, but databases? Yes, I was using them. Was I able to install MySQL or Postgres on my machine and run it? Yes, definitely. Was I able to manage a cluster? Absolutely not. I had no idea how to do failovers, I was scared of fault tolerance, I had no idea how to manage an active master, how to do a failover, how to manage a replica, or even what a replication delay really meant at that time. So the feature that drove me to move to the cloud was Multi-AZ. Multi-AZ was conceptually a very simple feature of Amazon RDS: here is your endpoint, a CNAME, that's your MySQL instance, and behind the scenes the failover is done for you, the replication is done for you out of the box. You can have backups, you can have whatever you want out of the box. That drove me, and many other people, to the cloud and to managed databases on RDS. When I say that the good old times, when there were not many choices, were also more expensive times, consider that, even without taking into account that the latest instances have much better processors than the old ones, the benefits go well beyond just the price.
Even the price of an instance, say an extra-large, went down over time: an m7g instance today is actually cheaper than an M1 instance was over a decade ago, and it's definitely much more performant. So when I say there were many limitations, I mean that in 2011 the job of a cloud architect was somehow simple. In a way it's much better for you now: if you start now, or if you work on the cloud now, you have many options, and with many options comes a lot of responsibility in your choices. At that time, once you had made the decision to move to the cloud, it was pretty simple: there was just MySQL, one engine, version 5.1, and not much else. There were no other engines: no Postgres, no SQL Server, no MariaDB. Many people think I'm a big fan of MySQL; actually, when I moved to the cloud I was running a Postgres database, and I moved to MySQL to be able to have a managed service on AWS. At that time there was no choice of storage class either: no gp2, no gp3, no provisioned IOPS, no io1 or io2. You had magnetic storage, and that was it; you were paying for how many operations you were doing on the storage. And of course there was no Amazon Aurora, no Performance Insights, no advanced monitoring, no premium features beyond what you would find in the community edition. So the very first lesson of this talk that you should take away with you is: keep iterating, keep yourself up to date. We love to go to re:Invent or some other big conference, not just for AWS, and we even clap when they announce new cool features. The reality is, that's the moment when your deployment might become a bit obsolete. I'm not saying change immediately, but if after two or three years you haven't changed anything in your deployment, in your database, you are probably missing out. It's not that it's more expensive than before; it's just that, as a cloud architect, it's your job to keep yourself up to date, take advantage of new features, and reduce costs.
I would like to move now to what I call interesting cases. I call them cases because they are actually AWS support cases. What do we have here? These are ten cases that I opened myself with AWS support. Some of them come from having enterprise support, some others business support, some others developer support; it doesn't matter that much, I just want to use them as examples. And the second lesson I want to share is that interacting with AWS support has been one of the biggest ways I've learned how to use the service, and its limitations as well. The subjects are as they were; I only removed the case numbers. Quickly scanning through them, you can see that some of them are issues I reported: a replica was broken after scaling up, or RDS was stuck at "storage optimization 99%" for almost 24 hours, or whatever else. But some of them, you can see, are just questions: the minor-version roadmap of Amazon Aurora, or binlog transaction compression support, or the InnoDB redo log capacity.
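To give an idea of what those last two questions were about, here is a minimal sketch of the corresponding community-edition MySQL settings; on RDS these are only controlled through the parameter group, if and when they are exposed at all:

  -- Compress transactions written to the binary log (community MySQL 8.0.20+):
  SET PERSIST binlog_transaction_compression = ON;

  -- Resize the InnoDB redo log without a restart (community MySQL 8.0.30+), here to 8 GiB:
  SET PERSIST innodb_redo_log_capacity = 8589934592;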
Why do I open those cases? Not just to get information: they also help you drive decisions. What do I mean by that? If you have an idea of how you are going to change your database, especially storage operations that might be hard to revert and might take a lot of time, write down your idea, how you're going to do it and the doubts you have, and double-check it with support. I'm not saying ask them "how do I do that?"; I'm saying describe what you have in mind and check whether they see an issue. They can help a lot. And maybe you find out that, for example, binlog transaction compression, which is available in the community edition, is not there on RDS. You can open a case and advocate for it, but you can also ask: since it's not there, how else can I achieve this? How can I reduce the size of my binary logs? There may be other ways; it's not going to be compression, but maybe they can help you in a different direction.
So let's go to some challenges and solutions, taking some of those cases we saw a few seconds ago. First one: when you run a managed database, you have to think about storage. On Aurora that's pretty simple: you pay for the storage you use, there's nothing to allocate, it's a no-brainer. What do you do on RDS? On RDS you allocate the storage up front, and you decide which storage type you want, which may determine how many IOPS you get. For example, on gp3, if your allocated storage is below 400 gigabytes you have 3,000 IOPS; if you're above, you have 12,000. There are many different combinations, but generally you're thinking about how much storage you need, and how much free storage you need to avoid filling up the disk and ending up with a crashed database. One option is to enable storage autoscaling. It has been there for a few years and it's cool: when you go below 10% free space, the storage is increased automatically. The problem is that it's pretty hard to go back. Until recently there was no easy way to reduce the storage at all; now you can do it with a blue/green deployment, but it still requires significant work, because you need to create the green environment and swap the endpoints, so it's still not an easy task for a large database. Here is an example: two terabytes of free storage. Is that a lot? Is it little? Why did it jump? It jumped because I made the mistake of running an OPTIMIZE on a large table, and the same happens with an ALTER TABLE that requires a copy of your table, one that is not in place, where you copy the data from the old table to the new one. What happened is that storage autoscaling kicked in on my instance. That's a rookie mistake: the moment you are doing something very intensive on the database, modifying your schema or your data, the last thing you want is, first, to increase free storage that you might not need, because maybe the free percentage you had was good enough; and second, while you are running an operation that is already very heavy on IOPS, to trigger a storage operation that applies changes to the volume and consumes your IOPS as well. So that's the first lesson I learned the hard way: disable storage autoscaling any time you do optimization or maintenance on your database. You may wonder, apart from application reasons, why and how I would want to alter or optimize an InnoDB table. We'll come back in a second to why optimizing a table to recover storage is important, but how do you do an ALTER TABLE or an OPTIMIZE? You have different options. You can do it directly with the engine: on MySQL, I just run an ALTER TABLE and change whatever I want.
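As a minimal sketch of what that looks like in MySQL (the table name here is just an example):

  -- Rebuild a large InnoDB table to reclaim space; internally this maps to a full table rebuild.
  OPTIMIZE TABLE orders;

  -- Some changes can run in place, without copying the whole table:
  ALTER TABLE orders ADD COLUMN note VARCHAR(255), ALGORITHM=INPLACE, LOCK=NONE;

  -- A copying ALTER rewrites the table, so you need roughly the table size in extra free storage:
  ALTER TABLE orders ENGINE=InnoDB, ALGORITHM=COPY;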
I can also use third-party tools that sometimes let me control better how the change is applied in terms of load on the production database: I might do it more slowly, but in a way that affects production less, or that is recoverable. For example, pt-online-schema-change, part of the Percona Toolkit, is perfect for that, but there are other options in the open source world as well. The other key question, being on the cloud where there are so many options, is: where do I apply my change? I might do it on the primary node if it's a small table, or if I want to take the risk, or if I want a simple approach: connect to the primary node of my Multi-AZ database and run the ALTER TABLE there. That has some disadvantages: it might kill my database, and rolling back may be challenging. The second approach is to create a replica node. If the change I'm applying, for example adding a column, is compatible with replication, I can stop replication, apply the change on the replica, re-enable replication, and at the end swap the endpoints. That's much safer, a bit more complex, and gives a lot of flexibility. Or I can do it with Amazon RDS Blue/Green Deployments, which have been around for a couple of years now and are essentially a wrapper around the replica approach: you have a blue and a green environment, you apply your changes on the green deployment, and then you switch over. That's cool as well. The end result is that the free storage space you have might change a lot when you do ALTER TABLE and optimizations. And remember how much storage you need: it's not just the storage of the table, because you are copying it, so you need at least that much extra storage. If you have a one-terabyte table on a three-terabyte database, you need quite some free storage on that database to be able to do it. And if, for example, you stop the replica for a certain amount of time, you might build up a lot of binary logs that still need to be applied on your replica, so you might need even more storage. So consider up front how much storage you need for those changes, and make sure there is no automatic storage change happening while you apply them.
Why do I say optimizing your tables is vital? One of the things you want from a managed database, and what you love about managed databases, is that they are managed, so backups come out of the box; there's nothing to be done, they are just there. And from day one Amazon tells you that up to 100% of your total database storage per region is free backup storage: nothing to be done, and they'll tell you the majority of customers don't pay anything. That's pretty much true if you don't write a lot and if you keep maybe one or two days of point-in-time recovery. If you keep a point-in-time recovery window of 30 or 35 days, which is the maximum, you might start to pay for it. And that's the next lesson I want to share: sometimes the CPU is overestimated, it's overrated. I hear a lot of conversations at the coffee machine at conferences where we discuss the best instance class or the best CPU utilization for a specific workload: have you already moved from m6g to m7g, are you using an R instance, and so on. I rarely hear people discussing storage and backups as often.
This is an example of a production database that I manage. You can see that, over three years, the CPU cost is going up; there are some reserved instances behind it; it's a healthy project that is growing, and costs grow with it. That's fine. But you can see that the CPU is not even half of the total cost of the database, and backup is almost as costly as the CPU. Why is that significant? Think about how a backup is done: a backup is a bit like an EBS snapshot, there's no magic in RDS. Yes, it's incremental, the data backup is incremental, but what does incremental mean? If your data is spread around, if your table is not optimized once in a while, the data you modify every day may be scattered across the table, so your daily incremental backup can end up being a lot of data. You can see here that the daily gigabytes of backup usage, in this month-by-month view, changed a lot, and those fluctuations and increases depend of course on usage, but also on the optimizations done on the tables; those drops came after some optimization. And when you spend 6,000 euros a month on backup, a 20% saving after some optimization is significant: that's one or two thousand euros a month you're saving.
Next one: Graviton. This is a provocative tweet from Corey Quinn; I think you're all familiar with Corey. The point is: you're using a managed database, so why do you care whether it runs on Graviton or Intel or whatever else behind the scenes? It's provocative, but it's a good point, and an interesting one. I was a very early adopter of Graviton, and the reason I adopted it early, moving from m5 Intel to m6g on RDS for MySQL, is that with a managed service it is incredibly simple: you don't have to think about rebuilding the JAR or anything, like with Lambda; you simply switch it in the console, and if the new instance is cheaper, that's it, nothing else to be done, they take care of the engine behind it. Why wouldn't you do it? You do it, but sometimes it's not that straightforward. Here's a post I wrote a few years ago about how we were managing, at that time, the scaling of RDS instances with some simple logic from the command line: using some metrics and CloudWatch we decided when it was time to scale the RDS cluster up and down, according to the time of day, memory, users, number of sessions. You build your own logic on a few metrics and you do it. That's cool, everything works. So now you switch to Graviton, and this is what I got at that time: a crash. Okay, a crash by itself is not your problem; of course there was a bug in the build, so you open a ticket, you talk to Amazon. But in the end what you realize is that, yes, it's true that moving to Graviton is transparent and free of effort, but the real issue was that I was running out of memory. And the reason I was running out of memory is that the instances were quite different: the ratio between memory and number of CPUs is different. So if you have tuned your MySQL parameters, your parameter group, to squeeze every single benefit out of your instance, you might reach a point where Graviton behaves slightly differently. Here is an example: I had optimized certain things, like the memory settings and the InnoDB buffer pool size, and I was a bit more aggressive than was probably a good idea.
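As a rough illustration, this is the kind of setting involved; on RDS the value comes from the parameter group, which by default derives it from the instance memory rather than hard-coding a number:

  -- Check how large the buffer pool currently is and how it is split:
  SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib,
         @@innodb_buffer_pool_instances AS buffer_pool_instances;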
But the point I want to make is that the optimization I had in place for Intel was perfectly fine for my auto-scaling logic; with Graviton, what happened on day one was that I hit a crash, running out of memory before being able to scale my instance. So the lesson learned: keep an eye on your optimizations. When you optimize, minimize the changes to your parameter group; there's a balance. Yes, you want to optimize, but keep in mind that you might need to change the instance, both in size and in class, so at a minimum keep the changes you make expressed as variables of the instance class and memory rather than hard-coded values.
One more challenge. You might think: okay, I want to move from MySQL on RDS to Aurora. That's something we have all done, and we all love Aurora. So how do you do it? According to Amazon, in 10 to 20 minutes, with intermediate experience, you learn how to do it without any downtime. How do you actually do it? If you watch the video and read the documentation, the way you do it is simple: create a replica, an Aurora read replica of your RDS instance, and when you're ready you just promote it, and that's your new primary. Easy. Any challenge here? The challenge is going too fast and missing the Aurora release. That's what happened to me the very first time I tried it. Just following the documentation, can I upgrade, in this more recent example, from MySQL 8.0.39 to Aurora MySQL 3? What's wrong there? The reality is that you have to think about how MySQL, or any database, is released: there's a foundation behind Postgres, there's Oracle behind MySQL, and they release a minor version every once in a while, plus major versions. For MySQL there's a minor every quarter, and after a few weeks that minor is usually supported on RDS, if there are no big bugs. Aurora may take longer; Aurora may be a few quarters behind in supporting that specific version, because it's a fork. And what you need to keep in mind is that the Aurora replica you create from RDS needs to be at least at the same level as your primary RDS instance. So if your primary is on 8.0.39, you need an Aurora release at or beyond that, and if that Aurora release is not out yet, you cannot migrate. So if you want to keep open the option of migrating your production database to Aurora, as a safety net or to grow, for whatever reason, remember that you cannot keep yourself too close to the latest minor: you need to stay on the latest minor supported by Aurora.
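A trivial but useful check before planning that path, as a sketch: compare the exact minor version of your source with the MySQL version the current Aurora MySQL release is based on.

  -- The Aurora MySQL release you migrate to must be based on a MySQL version
  -- at or above this one:
  SELECT @@version AS mysql_version, @@version_comment AS distribution;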
Now let's change topic, from minor releases to running out of IOPS on RDS. I've written many blog posts, but the one that is still today the most popular post on my own website is this one: "RDS is running out of IOPS. What can I do?" I wrote it six or eight years ago; the content is out of date, but the topic is absolutely still relevant, and it's still the most popular because it's a challenging one. So what have I learned in this decade about running out of IOPS? First: you always tend to underestimate IOPS. You always think you have enough, and you often hit the limit during an ALTER TABLE, a scale-up, an optimization of a table, or a peak in your usage. The total-IOPS metric is kind of newish, it wasn't there many years ago; it is very useful, but don't rely just on that one. It gives you a feeling: if you think you have 12,000 IOPS, looking at that metric you can get a sense of whether you're getting close. But remember that gp2 and gp3, by design, are not supposed to guarantee you those IOPS one hundred percent of the time; 99% is a lot of the time, but there are still moments when you might not hit those numbers. gp3 versus gp2: they are the same price on RDS, so how do you decide? They have different IOPS models. With gp3, once you are over 400 gigabytes you get 12,000 IOPS; with gp2 you get 3 IOPS per gigabyte. With some basic math, up to four terabytes of storage gp3 delivers more IOPS, and above four terabytes gp2 technically provides more by default. But you can't just choose based on that. The main difference is that on gp2 you cannot change the baseline, you cannot pay to have more IOPS, while gp3 is more flexible: you can configure throughput and IOPS separately from the size of the disk, and that's really helpful. Then there are io1 and io2, the provisioned-IOPS storage classes. io1 is the very first provisioned-IOPS storage available on RDS, it's been there for many years; last year io2 Block Express came out, and the key point is that io2 is at the same price point as io1 but it's better, so just use io2 whenever you can. Not every single configuration supports it, but usually the newer ones do. The key point about provisioned IOPS: yes, it's good to have, but it's expensive. And when I say that the CPU is overrated, think about the overall cost of your database solution: do you really need provisioned IOPS, or do you only need it temporarily because you have a badly designed database? Maybe sharding the database across multiple RDS instances gives you 24,000 IOPS without having to pay for provisioned IOPS, and so on and so forth. So the lesson is: use them, but be careful, and think whether there is a way to improve your product without them, above all on very large databases. And if you do use provisioned IOPS, remember to watch out for copies, replicas, test and staging environments, because when you create an instance from a snapshot, by default the storage is the same class as the one you had; you don't want a staging database with a lot of provisioned IOPS, paid quite expensively.
Next, I call this one blue, green and purple. Blue/green is the nice new approach to making changes to a database. Now, look at this screen and take a second to think about what we have here: what happened between August 25 and 28? This is Performance Insights, and there's no data before. Why is the database load so low, and why is there no data before? The reality is that when I did a blue/green switchover using AWS RDS Blue/Green Deployments, my Performance Insights data was gone: I have no data before August 24, and the data between August 25 and 28 is the data from the deployment that was the passive one, the green one. The documentation tells you that the green environment is a copy of the topology of the production environment and includes the features used by the instance, for example read replicas, storage, DB snapshots, automated backups, Performance Insights. So you expect that when you use a green deployment, everything is there.
Well, everything is there, but remember that when things get renamed, things change. When I asked AWS why I didn't have my old Performance Insights data, the answer was: Performance Insights is there and enabled, but the Performance Insights history of the old blue instance is not carried over to the new blue instance after the blue/green switchover. Is that an issue? As an engineer running operations, I love being able to look months back at my data. So if you only need to apply a minor upgrade, do you have to use blue/green? Maybe yes, it depends, or maybe not; but remember the drawback: you might lose certain data, and that is significant, because one of the reasons to keep the data is precisely to compare before and after a blue/green deployment. If I don't have the data from before, that's a bit of a problem. That doesn't apply to CloudWatch metrics, because they are attached to the name, but it does apply to Performance Insights. Same with backups. Say you have decided on a backup retention of 30 days; it might even be something you have by contract with your customer, that you cannot keep data for more than 30 days. What happens when you do a blue/green switchover? The old instance is still there; you may decide to stop it or delete it, and when you delete it, it asks whether you want to retain the automated backups. If you say yes, the data is there, but it stays there for another 30 days: retained automated backups are removed by the system only at the end of the retention period. So if you use blue/green, or any time you delete an instance with automated backups, remember that whatever retention period you set, say 30 days, is not applied backup by backup: it's not that the backup from 29 days ago is deleted tomorrow; all the retained backups are deleted 30 days from today. That may have cost implications, and some compliance implications as well, so keep it in mind; you might want to build your own logic to avoid that.
The last lesson I want to share is underestimating the time for a change, something I've done many times. I said before: we create a replica. So I create my replica, and you can see here from the metrics how long it takes. If you look at the console, maybe after half an hour it says the replica is available. But what happens to my replica after it becomes available? The replica lag keeps growing, because my production database is under heavy load, and because my disk, built from a snapshot, is cold: there's no magic, the replica starts with a cold disk and it takes time to retrieve the data from S3 and warm up. By the time I'm really able to use the replica, it's going to be days, or at least hours; in this case it was three days. After three days I was actually able to use my replica, because until the replica lag is zero I cannot really do any significant task; I can stop replication and do the ALTER TABLE, but at some point it will still have to catch up. So remember that the time for a change is not just the time for the replica to show as ready, or for the instance to be up and running from a snapshot, but the time until that instance really works.
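Two checks that help here, as a rough sketch and not the only way to do it (the table name is just an example): verify the replica has actually caught up before relying on it, and touch the data once so the lazy-loaded volume is not completely cold.

  -- Replica lag: Seconds_Behind_Source should be 0 and stay there under load
  -- (SHOW SLAVE STATUS on older MySQL versions).
  SHOW REPLICA STATUS\G

  -- A crude warm-up: CHECKSUM TABLE reads every row, pulling the table's pages
  -- off the cold volume before production traffic hits the instance.
  CHECKSUM TABLE orders;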
If you don't do that, when you swap between green and blue, you see what happened here around ten past three, at 15:15: the disk queue depth explodes, because my new blue instance still has a cold drive, and as soon as it's hit by the load of the production database, it takes ages to recover. For about four hours I was basically struggling with a database that was underperforming, and the scary part is that there is nothing you can do: you can just wait, you cannot modify the instance, you cannot do anything significant at that point.
So let's close with what I call super simple experiments. I said before that I want to save a hundred thousand records in a database, so let's do that experiment. Why do I want to do it? Because I want to test a couple of simple things: is Aurora faster than RDS for MySQL? Is MySQL better than MariaDB? Is io1 really better or worse than io2 Block Express? I just want to do a simple comparison, and to keep it consistent I decided to use a setup of 400 gigabytes, which gives me 12,000 IOPS on gp3, and a comparable instance class on Aurora and RDS: in this case r7g.large, the smallest instance I could use, since I didn't want to spend too many of my credits. But you might wonder how. Let's say I insert a hundred thousand records, or a million, whatever you want, and I want to insert one record at a time; those are my only rules. You'll say "it depends", because otherwise it depends on how you insert: you could bulk-load the whole thing. No, I want to insert them one at a time, one transaction each, with a single thread, so no pool, nothing fancy, and I want to stick to ACID, so I'm not loading my data in a single query. You might still say it depends: on the CPU, on the table, on the storage, on memory, latency, version, which database engine, on the developer and how the logic is implemented. So let's simplify even more. I create a simple table, and my table simply has a datetime and a value, and I just load my hundred thousand records, because I don't want to load more than that, it would take too long. Or not, I don't know. And I call it load_demo. Why do I do it with a stored routine? It's a simple experiment: if I do it server-side, the cool thing is that there's no client latency, there's nothing else, it's just that routine. So the question I have is: how long does it take? Seconds? Minutes? Many seconds? I don't know. Simple: let's compare the execution on the different setups. Using that large instance, is it going to take milliseconds, seconds, minutes? We could do a demo, but we have no time for a live demo.
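For reference, a minimal sketch of that kind of setup, a two-column table and a server-side routine inserting one row per iteration; the names and the row count are illustrative:

  CREATE TABLE load_demo_data (
    ts    DATETIME(6) NOT NULL,
    value INT         NOT NULL
  ) ENGINE=InnoDB;

  DELIMITER //
  CREATE PROCEDURE load_demo(IN how_many INT)
  BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < how_many DO
      -- One row, one statement, one transaction at a time (autocommit on): no batching.
      INSERT INTO load_demo_data (ts, value) VALUES (NOW(6), FLOOR(RAND() * 1000));
      SET i = i + 1;
    END WHILE;
  END //
  DELIMITER ;

  CALL load_demo(100000);

  -- The elapsed time can be read back from the data itself: row count plus the
  -- first and last insert timestamps.
  SELECT COUNT(*) AS rows_inserted,
         MIN(ts) AS first_insert,
         MAX(ts) AS last_insert,
         TIMESTAMPDIFF(SECOND, MIN(ts), MAX(ts)) AS elapsed_seconds
  FROM load_demo_data;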
So here are my results: on RDS MySQL 8.0.39 it takes five minutes and 49 seconds; on Aurora, 2.7 seconds. What? 2.7 seconds versus five minutes? Something must be wrong there. I'm just using the default configuration, and I'm doing an experiment that mostly makes no sense: I'm inserting the records one at a time, from the server, with a stored routine, so the difference here is really the server side. But is it true? Yes, it's actually true: if you run my demo, count the records and look at the minimum and maximum timestamps, the elapsed time really is that different, with the rows being inserted one at a time. So, cool. Any difference between MariaDB and MySQL? Actually no, it's the same time, and the reason is pretty simple: the difference here is really the storage layer, and the main difference between Aurora and RDS is the storage layer. Will you get that kind of benefit with a real production workload? Absolutely not. Latency and the other components of a real workload, multiple threads, are going to give a very different result. The reason I did this experiment was to show that you really need to be careful when you trust benchmarks, unless you know who is running them and how. Here is MariaDB versus MySQL: you can see the records per second are similar; I ran the experiment twice and the times are very similar, and the IOPS follow the same logic. But I said I wanted to test a third case, and the third case is io2 Block Express, the new version of io2 that was announced about a year ago. Is it better than io1? In the announcement they say something pretty cool: it's faster, at the same price. And indeed, if I switch from io1 to io2, for the same load I go from five minutes to two minutes and twenty seconds, for the very same price, on the same RDS instance, the same engine, no other difference. And if you look at the number of records per second, you really see that difference. So what's the key point? Switch from io1 to io2 whenever you can: you probably won't see the same factor of improvement on a real workload, but you will definitely see a big difference compared to io1, for the same price.
I'd like to close with the simple question I started from: are you really not a database administrator? Well, I'm not, but I would argue that with managed services, on AWS or any cloud provider, you still need to think like one. This is a slide that Amazon shared many years ago, and I still think it's relevant: on premises, a database administrator's time is spent mostly at the platform and OS level, with very little at the application level; on RDS, you spend most of your time in Performance Insights, actually helping developers optimize the application and the database schema, you do some monitoring, and you hardly care about the hardware and platform level at all. So what are the key takeaways of this session? Things I learned the hard way: allocating or changing RDS storage might take hours, so plan accordingly. RDS is a managed service, but logs, the performance schema and parameters are all available, so you can still look under the hood, and you can still ask support for help to understand what's going on behind the scenes. Avoid changing parameter group settings for short-term gains; you may need to scale your database, and, as I said about Graviton, try not to over-optimize: it's good to optimize for your specific use case, but always think long-term, because those short-term optimizations can bite you back later. Validate and stress-test your findings with AWS support: they also have access to the backend of your database, not to your data, but to how the database runs, down to the file system, and that helps a lot. Warm up your database replica before directing traffic to it; it's super important, because the storage might still be cold. Cost optimization is important, but don't focus only on the CPU and instance size; think about backups and storage as well. And finally, you can get crazy results from simple benchmarks, so approach any benchmark cautiously and ask what's behind those numbers and what message they're trying to sell you. But if I had to summarize the entire presentation in a single message, I would simply say: leverage managed databases, take advantage of them. Resist changes.
Apply them cautiously, and remember that defaults exist for a purpose. But above all, try to understand what runs under the hood: the fact that these services are managed is not an excuse to ignore what's happening. There's no magic, and understanding what happens behind the scenes is going to help you have a successful deployment. With that, a big thank you. It has been a pleasure. Goodbye.
...

Renato Losio

Principal Cloud Architect @ Funambol

Renato Losio's LinkedIn account


