Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Hello, everybody, and welcome to Save the World with MongoDB Data Lakes. My name is Joe Karlsson. I'm a developer advocate for MongoDB, and I'm just glad you're here. Let's jump in. So, first of all, why should you care about this? Well, the goal of this talk is to, a, save the world, great, and, b, save some money, also great.
            
            
            
And why is this a problem? Well, I'm sure you know that the data needs of every online service are just growing and growing and growing. We're adding more and more features, we're saving more and more things, and we need to access and scale that data faster than ever before. Every application is just saving more and more data. Over time, as we get more and more users, and those users collect more and more data, the data demands on our systems just get bigger and bigger and bigger.

This gets exacerbated further when we look at database scaling strategies. For example, with MongoDB you can shard your databases horizontally so that you can easily scale up. But if you're sharding, you also have to support those databases with replica servers, usually two of them, in case anything goes wrong; you have two backups on hand that you can recover from if you need to. But those, of course, grow and grow and grow too. It quickly becomes a big issue for our applications.
            
            
            
Oh my gosh, right? And I don't know if you know this, but these data demands on our systems require power and money to keep the lights on. Code and data are the most important parts of our applications. We don't want to lose any of it, but we do want to make sure we're building and scaling these applications as quickly, as responsibly for the environment, and as cheaply as possible. If we can get both of those at the same time, that's a win. That's a win for us, right? So let's explore how to do that with data lakes. Now, before we do that, though, I just want to say hello.
            
            
            
My name is Joe Karlsson. I'm a software engineer and a developer advocate, and I work for a company called MongoDB. The best way to get a hold of me is on Twitter. Otherwise, I'm on Twitch, on the MongoDB Twitch too, if you want to check that out. I'm also making funny videos on TikTok, and all my information is on my website, joekarlsson.com. All of the resources, slides, and a recording of the video you're watching right now can be found on my website: go to joekarlsson.dev/mdb-data-lakes, or scan the QR code in the upper right-hand corner at any time, and that will take you there as well. Lastly, I just want to say that any opinions in this piece are my own and don't reflect the opinions of my employer. All right, just saying. Don't fire me. I love my job.
            
            
            
Okay, so we have one planet, and we need to protect it. As data centers grow, they're using more power, more electricity, and that is very expensive for us and makes babies sad. Do you want to make babies sad? I don't want to make babies sad. Let's protect babies, please.

Okay. We want to leverage something called a data lake. But before we get to that, I want to discuss the types of data that make a good use case for being saved in a data lake: hot data and cold data. I'm going to use IoT data as the example here today. With IoT data, we're typically saving time series data to be displayed on a dashboard, so we can understand trends over time in whatever our IoT devices are collecting. But one of the downsides of IoT data is that it gets out of date, or goes stale, and we don't care about it as much. We still want that data for historical analysis or whatever, but it's no longer hot; it's not being actively used a lot. Think of your tweets from four years ago. It's good to still have those on hand, but you're probably not accessing them very often, and we might want to do something different with that data so it's cheaper for us to store in the long run.

So, hot data. Hot data is data that is stored closer to the CPU. You know this if you've ever built your own PC: memory that is closer to the CPU is more expensive, right? CPU cache is way more expensive than RAM, and RAM is way more expensive than hard drives. And if we can offload some of this hot data that we don't need anymore to cheaper forms of storage, that's going to be a win for us. We're going to have lower electricity usage, which is going to help save the planet, and it's going to be cheaper for us in the long run, and we can leverage that in the cloud. That's money, baby.
            
            
            
Okay, so: data lakes. MongoDB actually has an amazing new service called the Data Lake. The Data Lake lets you take MongoDB data and auto-archive it into AWS S3 blob storage. That means that for the long run, you can save things in very cheap and efficient S3 blob storage as JSON, just a giant JSON text file, and you can still query that data as if it were hot data, using your standard MongoDB queries. But you're not paying for all that data to be saved actively in a MongoDB database. Again, that saves money and the planet all at once. Very cool.

For those who don't know, data lake is just a catch-all term. You've probably seen it a lot in the industry right now. It allows you to save lots of different, unspecified, semi-structured data all together, so we can pull a bunch of different things into one place and query all of it. With the MongoDB Data Lake, you can query across databases, across collections, into S3 blobs, whatever you need to do. We can put all that unstructured data together and make our lives a little bit easier.

Okay, so in this talk today, I want to explore the MongoDB Data Lake a little bit further. We'll write a function to save some data into our S3 buckets, and then we'll run some queries on that data. I'll also talk briefly about our new Online Archive, which makes this even easier to use.
            
            
            
So, MongoDB Atlas Data Lake allows you to work with your data really easily. It scales automatically for you, it can automatically move your MongoDB data into the data lake, and it's integrated with all the MongoDB cloud services, which makes working with that data just super easy. It's so fun and so easy to work with. A lot of times, what we're seeing is people building their own data lake solutions on top of MongoDB, and that's hard to do. We built it for you, so you don't have to worry about building that anymore. We want to make sure it's easy to work with your data: easy to archive it, easy to get it out, and hopefully it saves you some money too. Also, as a developer, I just want things to be easy, and we make that easy too.

Yeah, you can still access your data just like you could with any other MongoDB database. You can access it with all of the drivers you're used to. It's going to feel just like you're working with a MongoDB database; it's just going to be saved somewhere else, and the queries might be a little slower. But again, if you're archiving data, that data doesn't need to be hot, and you're going to save so much money. So it's up to you to determine what kind of data will archive well. We have a control plane that works out where each query needs to go and where that data is saved, consolidates all of that, and then serves it up to you when you make those queries. Very cool, right? And it can fan out across a bunch of different data sources.

This isn't actually a live demo, by the way; it's prerecorded. I'm not really ready to live that dangerously yet.
            
            
            
But first, what I want to do is set up a database with some hot and cold data and test it to make sure this is going to work for our needs. I wrote a Python script that imports a massive amount of fake IoT data that we're going to be using. We're going to use this fake IoT data to auto-archive cold, or old, IoT data into our S3 bucket, and we'll be able to query it. You can see here that every couple of seconds it's generating 2,400 documents. It's a little slow, whatever; every second it inserts a batch. I think I insert about 2,400 per batch here, I can't remember. We'll find out in the next step.
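To give you an idea of what a generator script like that might look like, here's a minimal sketch, assuming pymongo is installed, a MONGODB_URI environment variable, and made-up database, collection, and field names; the speaker's actual script is linked on the resources page.

```python
# Minimal sketch of a fake IoT data generator, assuming pymongo is installed
# and MONGODB_URI points at an Atlas cluster. The database, collection, and
# field names are made up; the real script is on the resources page.
import os
import random
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
collection = client["iot"]["sensor_readings"]

BATCH_SIZE = 2400

def fake_reading(ts: datetime) -> dict:
    return {
        "device_id": f"sensor-{random.randint(1, 100)}",
        "temperature": round(random.uniform(15.0, 35.0), 2),
        "timestamp": ts,
    }

# Spread readings over the past 90 days so some of them will qualify
# as cold data when we archive by age later on.
now = datetime.utcnow()
batch = [
    fake_reading(now - timedelta(minutes=random.randint(0, 90 * 24 * 60)))
    for _ in range(BATCH_SIZE)
]
result = collection.insert_many(batch)
print(f"Inserted {len(result.inserted_ids)} documents")
```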
            
            
            
So you can see here, on our MongoDB database, we've inserted 24,000 documents, or 52 megabytes of data, into our collection. This is just a generic IoT database, like one you might be saving some of your own IoT data into, whatever it is; we're just using IoT for this example. But we have 24,000 documents in here that we've just inserted with our script. Not a problem. And again, if you want to check out the script at all, check out the page with all the resources, joekarlsson.dev/mdb-data-lakes. All right, cool.
            
            
            
So we've got all of our data in there. Now we want to set up our data lake so that we can connect it, query all this data, and auto-archive the data we need. So how do we do that? If you're on Atlas, there's a Data Lake section in the right-hand navigation, and you can configure your data lake pretty easily; it's actually even easier now. You can just click around in the GUI and connect whatever data sources you have, whether it's an S3 bucket or a local cluster, which is pretty nice. You can also auto-archive with our serverless Realm functions, but we're going to skip over that for today. Okay, so let's assume we have our data lake set up. Wonderful. We also need to configure it with our third-party services, which in this case is AWS S3. So let's see here. Okay, and we're back.
            
            
            
Okay, so we're able to add S3 in MongoDB. Basically, we're just allowing the IAM role; we've configured in S3 that we have access to this. Then we're going to configure a new S3 data service with our Realm application, which is our serverless function. It basically allows us to automatically interact with our S3 blob storage. You put in wherever your bucket is and your secret key; I'm not revealing that here today. And you can define some rules for that. Cool.

All right. You can also set up rules, like with everything in MongoDB: you can set up user-based roles that have configurable access, and you can configure that for all the services as well. We're going to define rules in here. We're going to use that S3 service we just set up, and we're going to add a rule that when data reaches a certain age, we automatically move that data over to our S3 bucket.
            
            
            
So let's see here. I think we can do that here. So we're going to write a function; this is a serverless function that lives in the cloud, so you don't need to set up a server for it, and we'll call it realm retire. This is going to automatically archive that data for us. So this is what that looks like. Again, all the code is available on the page I linked earlier, but we're just saying where the data is going to get saved and which data needs to go there. So we're finding a date range, right, and that's going to be whatever we decide; here, it's anything older than two months. We check to see if there's any data that matches the query we write, and the function automatically puts all of that data in our S3 bucket for us. You can also do this on your own server. Here's how you might do that, or what that query would look like if you're writing it in Python.
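As a rough illustration of that self-hosted approach, here's a minimal sketch, assuming pymongo and boto3, a hypothetical bucket name, and the same made-up collection and field names as before; the speaker's actual code is on the resources page.

```python
# Minimal sketch of a self-hosted archival job, assuming pymongo and boto3
# are installed and AWS credentials are configured. The bucket name and
# field names are hypothetical; the actual code is on the resources page.
import json
import os
from datetime import datetime, timedelta

import boto3
from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
collection = client["iot"]["sensor_readings"]
s3 = boto3.client("s3")
BUCKET = "my-iot-archive"  # hypothetical bucket name

# Treat anything older than roughly two months as cold data.
cutoff = datetime.utcnow() - timedelta(days=60)
query = {"timestamp": {"$lt": cutoff}}

docs = list(collection.find(query))
if docs:
    # Write the cold documents to S3 as newline-delimited JSON.
    body = "\n".join(json.dumps(doc, default=str) for doc in docs)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"archive/{cutoff.date()}.json",
        Body=body.encode("utf-8"),
    )
    # Only remove documents from the hot collection after the upload succeeds.
    collection.delete_many(query)
    print(f"Archived {len(docs)} documents to s3://{BUCKET}")
```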
            
            
            
Okay, so what we're going to do then is run this Python script that automatically archives our data to an S3 bucket. It's taking a date range from 2022, up to the 3rd, so just those few days, and it's going to archive all that data for us. Okay, so it moves 240 documents to our S3 bucket, and this is actually in S3 now; we can see the data that we've just archived in there. If you notice, they're all JSON documents. And if we click into one, it's going to download for us so we can check it out. I'm going to open this up in Visual Studio Code, and you can see that this is just our IoT data, archived in a JSON format. What we're going to be able to do now with our data lake is query this JSON text file as if it were a MongoDB database. Very cool, right?
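That query step might look something like this; a minimal sketch, assuming DATA_LAKE_URI holds the data lake's connection string from the Atlas connect dialog and reusing the hypothetical names from the earlier sketches.

```python
# Minimal sketch of querying the archived data through the data lake,
# assuming DATA_LAKE_URI is the data lake's connection string from Atlas.
# Database, collection, and field names are the hypothetical ones used above.
import os

from pymongo import MongoClient

# The data lake endpoint speaks the same wire protocol, so the driver and
# query syntax are exactly what you'd use against a normal cluster.
client = MongoClient(os.environ["DATA_LAKE_URI"])
archive = client["iot"]["sensor_readings"]

# A standard MongoDB query, even though the underlying storage is JSON in S3.
for doc in archive.find({"device_id": "sensor-42"}).limit(5):
    print(doc["device_id"], doc["timestamp"])
```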
            
            
            
So we can actually see now that documents have been moved, right? We had 24,000 to begin with, and now we have 23,520. Very cool. Okay, so we've confirmed they've moved, we've confirmed they're in the S3 bucket, and the world is happy; now we're saving money. We don't have to pay for that storage space, we still have access to it, and we can still query all that data if we need it.

We also now have Online Archive. I wrote this from scratch, but you do not need to do that: you can just configure Online Archive to follow rules through a GUI. It's a piece of cake, and it saves all this data automatically for you. I highly recommend checking it out. It's incredible.
            
            
            
Okay, one more time, to sum up: we were able to archive data into an S3 bucket by moving data over programmatically. We still have access to query all the data we archived; even though it's in an S3 bucket, stored as JSON, we can still query it like it was a normal MongoDB database. And we're saving the world, and saving money, by keeping cold data in a cheaper storage form that's more sustainable for growing data needs. That makes us happy, that makes babies happy. We're finally happy. Hit me up on Twitter if you have any questions.
            
            
            
So what's next? If I've inspired you at all to explore this, I would encourage you to go out there and do it. I think watching a talk like this is a good way to figure out whether it's worth your time to learn. But if you work somewhere, or you have an application, with growing data demands that are costing you a lot of money, and especially if you're on MongoDB, I definitely encourage you to check out how to do this. We have an always-free dev tier on MongoDB Atlas, which is our cloud-hosted database-as-a-service. You should definitely check it out; it's the easiest way to use MongoDB. I swear it's great. It's not a free trial; it's free forever, as long as your database is under a certain size.
            
            
            
Please, it's the best way to learn it. Join our community if you have any other questions; that's at community.mongodb.com. We have courses at university.mongodb.com, which are also free and a great way to learn MongoDB. And I primarily write for developer.mongodb.com, which is where all of our amazing developer resources are located. If you want $100 in free credits, be sure to use code Joe when you sign up for Atlas. All right, there are some more resources there you can take a look at; all of them are available at my website. And lastly, I just want to say: my name is Joe, please follow me on Twitter or wherever else. Thank you so much. I had a blast, I love being here, and y'all are so amazing. Have a great day, y'all.