Conf42 Cloud Native 2021 - Online

Save The World And Money with MongoDB Data Lakes



Data centers are expensive, and also not very good for the environment. By 2040, storing digital data is set to create 14% of the world’s greenhouse emissions. As a developer, you probably work with a lot of data. Your clusters balloon and become more expensive every day. Now is the time to be a hero, save the world and your wallet.

In this live coding session, I will show you how to archive your cold MongoDB data automatically to an AWS S3 bucket using Serverless Triggers. I will also demonstrate how to keep querying this archived data using MongoDB Atlas Data Lake with zero downtime.

You will walk away from this session with a clear understanding of data lakes, their features and capabilities. Join this session and be equipped to save the world.


  • Joe Karlsson is a developer advocate for MongoDB. He explains why growing data demands require more and more power and money to keep the lights on, and how data lakes can help. Any opinions in the talk are his own and don't reflect the opinions of his employer.
  • MongoDB has a new service called Atlas Data Lake. It lets you archive MongoDB data to AWS S3 blob storage and keep querying it. This saves money and the planet all at once.
  • The demo sets up a data lake so that archived data can be connected and queried. Data can also be archived automatically with serverless functions in Realm.
  • MongoDB Atlas, MongoDB's cloud-hosted database as a service, has an always-free dev tier. If you have any other questions, join the community at community.mongodb.com. You can also follow Joe on Twitter.


This transcript was autogenerated. To make changes, submit a PR.
Hello, everybody, and welcome to Save the World with MongoDB Data Lakes. My name is Joe Karlsson. I'm a developer advocate for MongoDB, and I'm just glad you're here. Let's jump in. So, first of all, why should you care about this? Well, the goal of this talk is to (a) save the world, great, and (b) save some money, also great.
And why is this a problem? Well, I'm sure you know that data needs on every online service are just growing and growing. We're adding more and more features, we're saving more and more things, and we need to get access to them and scale these data needs faster than ever before. Every application is saving more and more data. Over time, as we get more users and those users collect more data, the data demands on our systems just get bigger and bigger.
This gets exacerbated further when we look at database scaling strategies. So, for example, with MongoDB, you can shard your databases horizontally so that you can easily scale them up. But if you're sharding, you're also going to have to support those databases with replicated servers, usually two of them, in case anything goes wrong. You have two backups on hand that you can recover from if you need to. But these, of course, grow and grow too. It quickly becomes a big issue for our applications. Oh my gosh, right?
And I don't know if you know this, but these data demands on our systems require power and money to keep the lights on and keep those servers going. Data is the most important part of our application. We don't want to lose any of it, but we want to make sure that we're building and scaling these applications as quickly, as responsibly for the environment, and as cheaply as possible. If we can get all of those at the same time, that's a win for us, right? So let's explore how to do that with data lakes.
Now, before we do that, though, I just want to say hello. My name is Joe Karlsson. I'm a software engineer and a developer advocate, and I work for a company called MongoDB. Best way to get a hold of me is on Twitter. Otherwise, I'm on Twitch. MongoDB Twitch too, if you want to check that out. I'm also making funny videos on TikTok, and all my information is on my website. All of the resources, slides, and a recording of this video that you're watching right now can be found on my website. Excuse me, if you go to Jocarlson dev, artofmdb data lakes, or if you scan that QR code in the upper right-hand corner at any time, that will take you there as well. Lastly, I just want to say, any opinions in this piece are my own and don't reflect the opinions of my employer. All right, just saying. Don't fire me. I love my job.
Okay, so we have one planet we need to protect. As data centers grow, they're using more power, more electricity, and this is very expensive for us and makes babies sad. Do you want to make babies sad? I don't want to make babies sad. Let's protect babies, please.
Okay. We want to be leveraging something called a data lake. But before we get to that, I want to discuss the types of data that make a good use case for being saved in a data lake: hot data and cold data. I'm going to be using an example here today of IoT data. With IoT data, typically we're saving time series data to be displayed on a dashboard, so that we can understand trends over time in whatever our IoT devices are collecting. But one of the downsides of IoT data is that it gets out of date, or stale, and we don't care about it as much. We still want that data for historical analysis or whatever, but that data is no longer hot. It's not being actively used a lot. Just think of your tweets from four years ago. It's good to still have those on hand, but you're probably not accessing those tweets very often.
And we might want to do something different with that data so that it still gets saved, but it's cheaper for us to keep in the long run. So, hot data. Hot data is data that is saved closer to the CPU. And you know this too. If you've ever built your own PC, memory that is closer to the CPU is more expensive, right? We all know that motherboard is way more expensive than RAM, and RAM is way more expensive than hard drives. If we can offload some of this data that we don't need hot anymore to cheaper forms of data storage, that's going to be a win for us. We're going to have lower electricity usage, which is going to save the planet, it's going to be cheaper for us in the long run, and we can leverage that in the cloud. That's money, baby.
Okay, so data lakes. MongoDB actually has an amazing new service called Data Lake. Data Lake allows you to save MongoDB data. It can auto archive it into AWS S3 blob storage. That means that for the long run you can save things in very cheap and efficient S3 blob storage as JSON, just a giant JSON text file, and you could still query that data as if it was hot data, using your standard MongoDB queries to get that data out. But you're not paying for all that data to be saved actively in a MongoDB database. Again, that saves money and the planet all at once. Very cool.
For those who don't know, data lake is just a catch-all term. You've probably seen it a lot in industry right now, but it allows you to save lots of different, unspecified, semi-structured data all together. We can pull a bunch of different things together, it's all going to be in one place, and we can query that. With MongoDB Data Lake, you can query things across databases, across collections, into S3 blobs, whatever you need to do.
We could put all that unstructured data together and make our lives a little bit easier. Okay, so in this talk today, I want to explore MongoDB Data Lake a little bit further. We'll be writing a function to save some data into our S3 buckets, and then we'll be doing some queries on that. I will also be talking briefly about our new Online Archive, which makes this even easier to use.
So MongoDB Atlas Data Lake allows you to work with your data really easily. It scales automatically for you. We can automatically put all of your MongoDB data into the data lake, and it's integrated with all the MongoDB cloud services, which makes working with that data just super easy. It's so fun and so easy to work with. A lot of times what we're seeing is people building their own data lake solutions on MongoDB, and that's hard to do. We built it for you. You don't have to worry about archiving that anymore. We want to make sure that it's easy to work with your data. It's easy to archive it, easy to get it out, and hopefully it saves you some money too.
Also, as a developer, I just want it to be easy. We make that easy too. You can still access your data just like you could with any other MongoDB database. You can access it with all of the drivers you're used to. It's going to feel just like you're playing with a MongoDB database. It's just going to be saved somewhere else, and the queries might be a little slower. But again, if you're archiving data, that data doesn't need to be hot, and you're going to be saving so much money. So it's up to you to determine what kind of data works well to archive. We have a control plane that points to where our queries need to go and where that data is saved, consolidates all that, and then serves it up to you when you make those queries. Very cool, right? And it can break it down across a bunch of different data sources.
This isn't actually a live demo, it's prerecorded. I'm not really ready to live that dangerously yet. But firstly, what I want to do is set up a database with some hot and cold data and test it to make sure that it's going to work for our needs. I wrote a Python script that imports a massive amount of fake IoT data. We're going to be using this fake IoT data to auto archive cold, or old, IoT data into our S3 cluster, and we'll be able to query that. So you can see here every couple of seconds it's generating 2400 documents. It's a little slow, whatever. Every second it inserts a couple, but great. I think I insert like 2400 here. I can't remember. We'll find out in the next step.
So you can see here on our MongoDB database, we've inserted 24,000 documents, or 52 megabytes of data, into our collection. This is just a generic IoT database where you might be saving some of your IoT data or whatever it is. We're just using IoT for this example. Right, but we have 2400 documents in here that we've just inserted with our script. Not a problem. And again, if you want to check out the script at all, check out that page with all the resources.
All right, cool. So we got all of our data in there. Now we want to set up our data lake so that we can connect to it, query all this data, and auto archive the data that we need in there. So how do we do that? If you're on Atlas, you can see there's a Data Lake section on the right-hand side, and you can configure your data lake pretty easily. It's actually easier now than it was. You can just click around in the GUI and connect whatever data sources there are, whether it's an S3 cluster or a local one, which is pretty nice. You can also auto archive with our serverless functions in Realm, but we're going to skip through that today. Okay, so let's assume that we have our data lake set up.
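As a rough illustration of the kind of fake IoT data the demo script loads, here is a minimal Python sketch. The talk's actual script is on the speaker's resources page and is not reproduced here; the function name `make_fake_readings` and the field names (`device_id`, `timestamp`, `temperature_c`, `humidity_pct`) are assumptions made up for this example.

```python
import random
from datetime import datetime, timedelta, timezone


def make_fake_readings(count, start, step_seconds=60, seed=42):
    """Return `count` fake sensor documents, one every `step_seconds` from `start`.

    Hypothetical document shape; the real demo data may look different.
    """
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    docs = []
    for i in range(count):
        docs.append({
            "device_id": f"sensor-{rng.randint(1, 10):02d}",
            "timestamp": start + timedelta(seconds=i * step_seconds),
            "temperature_c": round(rng.uniform(15.0, 30.0), 2),
            "humidity_pct": round(rng.uniform(20.0, 80.0), 2),
        })
    return docs


start = datetime(2021, 1, 1, tzinfo=timezone.utc)
readings = make_fake_readings(2400, start)
print(len(readings), readings[0]["timestamp"].isoformat())
# prints: 2400 2021-01-01T00:00:00+00:00
```

With PyMongo installed and a connection to a cluster, a batch like `readings` could be written with `collection.insert_many(readings)`; the connection details are left out here because they depend on your own Atlas setup.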
Wonderful. We also need to configure it with our third-party services, which in this case is AWS S3. So let's see here. Okay, and we're back. Okay, so we're able to add S3 in MongoDB. Basically we're just allowing the IAM role, so we've configured in S3 that we have access to this. Then we're going to configure a new S3 data service with our Realm application, which hosts our serverless function. It allows us to automatically interact with our S3 blob storage. So you're going to put in wherever your blob is and your secret key. I'm not revealing that here today. And you can define some rules for that. Cool.
All right. You can also set up rules with everything in MongoDB. You can set up user-based roles that have configurable access, and that's for all the services. You configure that as well. We're going to define rules in here. So I'm going to use that service, the S3 service we just set up, and we're going to add a rule that when data reaches a certain age, we want to automatically move that data over to our S3 cluster. So let's see here. I think we can do that here.
So we're going to write a function. This is a serverless function that lives in the cloud. You don't need to set up a server for this. We'll call it realm retire, and this is going to automatically save the data that we need for us. So this is what that looks like. Again, all the code is available on the page that I linked earlier, but we're just saying where the data is going to get saved and what data needs to go there. So we're finding a date range, right? That's going to be whatever we decide. It's going to be within two months ago. We want to check to see if there's any data that fits a query that we write, and it's going to automatically put all that data for us in our S3 cluster. You can also do this on your own server.
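The date-range check described above can be sketched in a few lines. This is not the Realm function from the talk; it is a plain Python illustration of the selection logic, assuming documents carry a `timestamp` field and approximating "two months" as 60 days. The names `cold_data_filter` and `split_hot_and_cold` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def cold_data_filter(now, max_age_days=60):
    """Build a MongoDB-style filter matching documents older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return {"timestamp": {"$lt": cutoff}}


def split_hot_and_cold(docs, query):
    """Apply the `$lt` cutoff in plain Python, returning (hot, cold) lists."""
    cutoff = query["timestamp"]["$lt"]
    hot = [d for d in docs if d["timestamp"] >= cutoff]
    cold = [d for d in docs if d["timestamp"] < cutoff]
    return hot, cold


now = datetime(2021, 3, 1, tzinfo=timezone.utc)
docs = [
    {"device_id": "sensor-01", "timestamp": now - timedelta(days=90)},
    {"device_id": "sensor-01", "timestamp": now - timedelta(days=61)},
    {"device_id": "sensor-02", "timestamp": now - timedelta(days=5)},
]
query = cold_data_filter(now)
hot, cold = split_hot_and_cold(docs, query)
print(len(hot), len(cold))
# prints: 1 2
```

The dictionary returned by `cold_data_filter` has the same shape as a filter you could pass to a driver call such as PyMongo's `collection.find(query)`; the in-memory split is only there so the sketch runs without a database.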
Here's how you might want to do that, or what that query would look like if you're writing it in Python. Okay, so what we're going to do then is run this Python script that automatically archives our data to an S3 cluster. It's taking a date range from 2022 to the third, so just the days it's going to archive all that data for us. Okay, so it moves 240 documents to our S3 cluster, and this is actually in S3. Now we can see the data that we've just archived in there. If you notice, they're all JSON documents. And if we click into it, it's going to download for us and we can check that out. I'm going to open this up in Visual Studio Code, but you can see that this is just our IoT data that we've archived in a JSON format. So what we're going to be able to do now with our data lake is query this JSON text file as if it was a MongoDB database. Very cool, right?
So we can actually see now that 240 documents have been moved, right? We had 24,000 to begin with. Now we have 23,520. Very cool. Okay, so we've confirmed that they've moved, we've confirmed that they're in the S3 data cluster, and the world is happy. Now we're saving money. We don't have to pay for that storage space. We still have access to it, and we can still query all that data if we need it.
We also now have Online Archive. I wrote this from scratch. You do not need to do this. You can just configure Online Archive to follow rules through a GUI. It's a piece of cake to save all this data automatically. I highly recommend checking it out. It's incredible.
Okay, one more time, to sum up: we were able to archive data into an S3 data cluster by moving data over programmatically. We still have access to query all the data we archived. Even though it's in an S3 cluster and it's data in a JSON format, we can still query it like it was a normal MongoDB database.
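To make the "it's just JSON in S3" point concrete, here is a small Python sketch of turning a batch of cold documents into the JSON text that would become the body of an S3 object. It is an assumption-laden illustration, not the talk's script: the helper name `to_archive_json` and the sample documents are made up, and datetimes are written as ISO 8601 strings for simplicity.

```python
import json
from datetime import datetime, timezone


def to_archive_json(docs):
    """Serialize documents to a JSON array string, writing datetimes as ISO 8601."""
    def encode(value):
        # json.dumps calls this only for types it cannot handle itself
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"cannot serialize {type(value).__name__}")
    return json.dumps(docs, default=encode, indent=2)


cold = [
    {"device_id": "sensor-01",
     "timestamp": datetime(2020, 12, 1, tzinfo=timezone.utc),
     "temperature_c": 21.5},
    {"device_id": "sensor-02",
     "timestamp": datetime(2020, 12, 2, tzinfo=timezone.utc),
     "temperature_c": 19.0},
]
body = to_archive_json(cold)

# Round-trip check: the archived text parses back to the same number of documents.
restored = json.loads(body)
print(len(restored), restored[0]["timestamp"])
# prints: 2 2020-12-01T00:00:00+00:00
```

In a real archive script, `body` could be uploaded with boto3's `put_object(Bucket=..., Key=..., Body=body)` and the archived documents then removed from the live collection with PyMongo's `delete_many`. A production version would more likely serialize with `bson.json_util.dumps` so that dates keep their type as MongoDB Extended JSON rather than becoming plain strings.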
We're saving the world, and we're saving money by keeping cold data in a cheaper storage form that's more sustainable for growing data needs. That makes us happy, that makes babies happy. We're finally happy. Questions? I'm on Twitter if you have any questions.
So what's next? If I've inspired you at all to want to explore this, I would encourage you to go out there and do it. I think watching a talk like this is a good way to find out whether it's worth your time to learn. But if you're working for a place, or you have an application, with growing data demands that are costing you a lot of money, and especially if you're on MongoDB, I definitely want to encourage you to check out how to do this. We have an always-free dev tier with MongoDB Atlas, which is our cloud-hosted database as a service. You should definitely check it out. It's the easiest way to use MongoDB. I swear it's great. It's not a free trial, it's free forever as long as your database is under a certain size. It's the best way to learn it.
Join our community if you have any other questions, that's at community.mongodb.com. We have courses on MongoDB University, which are also free and a great way to learn MongoDB. And I write for the developer hub, which is where all of our amazing developer resources are located. If you want $100 in free AWS credits, be sure to use code Joe when you sign up for Atlas. All right, some more resources there, you can take a look at that. All of them are available on my website. And lastly, I just want to say my name is Joe, and please follow me on Twitter or wherever else. And thank you so much. I had a blast, I love being here, and y'all are so amazing. Have a great day, y'all.

Joe Karlsson

Software Engineer @ MongoDB

