Transcript
Hello, everybody, and welcome to Save the World with MongoDB Data Lakes. My name is Joe Karlsson. I'm a developer advocate for MongoDB, and I'm just glad you're here. Let's jump in. So, first of all, why should you care about this? Well, the goal of this talk is to, A, save the world, great, and B, save some money, also great. And why is this a problem?
Well, I'm sure you know that the data needs of every online service are just growing and growing and growing. We're adding more and more features, we're saving more and more things, and we need to get access to them and scale these data needs faster than ever before. Right? Every application, we're just saving more and more data. Over time, as we're getting more and more users, and more and more users are collecting more and more data, the data demands on our system just get bigger and bigger and bigger.
Right. This even gets exacerbated further when we look at database scaling strategies. So, for example, with MongoDB, you can shard your databases horizontally so that you can easily scale up. But if you're sharding, you're also going to have to be supporting these databases with replicated servers, usually two of them, in case anything goes wrong; you have two backups on hand that you can recover from if you need to. But these, of course, grow and grow and grow too. It quickly becomes a big issue for our applications.
Oh my gosh, right? Which, I don't know if you know this, but these data demands on our system require power and money to keep the lights on. And our data is the most important part of our application. We don't want to lose any of it, but we want to try to make sure that we're building and scaling these applications as quickly, as responsibly for the environment, and as cheaply as possible. If we can get both of those at the same time, that's a win. That's a win for us, right? So let's explore how to do that with data lakes. Now, before we do that, though, I just want to say
hello. My name is Joe Karlsson. I'm a software engineer and a developer advocate, and I work for a company called MongoDB. The best way to get a hold of me is on Twitter. Otherwise, I'm on Twitch, on the MongoDB Twitch too, if you want to check that out. I'm also making funny videos on TikTok, and all my information is on my website, joekarlsson.com. All of the resources, slides, and a recording of this video that you're watching right now can be found on my website. Excuse me, if you go to joekarlsson.dev/mdb-data-lakes, or if you scan that QR code in the upper right hand corner at any time, that will take you there as well. Lastly, I just want to say, any opinions in this talk are my own and don't reflect the opinions of my employer. All right, just saying. Don't fire me. I love my job.
Okay, so we have one planet we need to protect. As data centers grow, they're using more power, more electricity, and this is very expensive for us and makes babies sad. Do you want to make babies sad? I don't want to make babies sad. Let's protect babies, please. Okay. We want to be leveraging something called a data lake. But before we get to that, I want to discuss the types of data that make a good use case for being saved in a data lake: hot data and cold data. And I'm going to be using an example here today of IoT data. With IoT data, typically we're saving time series data to be displayed on a dashboard, so we can understand trends over time with whatever our IoT data is collecting. But one of the downsides of IoT data is that it gets out of date, or stale, and we don't care about it as much. We still want that data for historical analysis or whatever, but that data is no longer hot. It's not being actively used a lot. Right? Just think of your tweets from four years ago. It's good to still have those on hand, but you're probably not accessing those tweets very often. And we might want to do something different with that data so that it still gets saved, but it's cheaper for us to save in the long run. So, hot data.
Hot data is data that is saved closer to the CPU. And you know this too: if you've ever built your own PC, memory that is closer to the CPU is more expensive, right? We all know that CPU cache is way more expensive than RAM, and RAM is way more expensive than hard drives. And if we can offload some of this data, some of this data that doesn't need to be hot anymore, to cheaper forms of data storage, that's going to be a win for us. We're going to have lower electricity usage, which is going to save the planet, and it's going to be cheaper for us to save that data in the long run, and we can leverage that in the cloud.
That's money, baby. Okay, so, data lakes. MongoDB actually has an amazing new service called the Data Lake. The Data Lake allows you to take MongoDB data and auto-archive it into AWS S3 blob storage. That means, for the long run, you can save things in very cheap and efficient S3 blob storage as JSON, just a giant JSON text file, and you could still query that data as if it was hot data, using your standard MongoDB queries to get that data out. But you're not paying for all that data to be saved actively in a MongoDB database. Again, that saves money and the planet all at once. Very cool.
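Just to make that concrete, here's a minimal sketch of what querying archived data can look like with the standard Python driver; the connection string, database, and collection names are placeholders I've made up, not the exact ones from this talk.

```python
# A minimal sketch: querying Data Lake-backed (archived) data looks the
# same as querying a live cluster. The URI below is a placeholder.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://<user>:<password>@datalake0.example.query.mongodb.net/?ssl=true"
)

# This collection can be backed by S3 under the hood; the query is unchanged.
readings = client["iot"]["sensor_readings"]
for doc in readings.find({"device_id": 42}).limit(5):
    print(doc)
```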
For those who don't know, too, data lake is just a catch-all term. You've probably seen it a lot in industry right now, but it allows you to save lots of different, unspecified, semi-structured data all together. This allows us to pull a bunch of different things together, and it's all going to be in one place and we can query it. With the MongoDB data lake, you could query things across databases, across collections, into S3 blobs, whatever you need to do. We could put all that unstructured data together and make our lives a little bit easier.
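As a rough illustration of that kind of cross-source query, assuming a live collection and an S3-backed archive are both mapped into the same federated database, and that the $unionWith stage (MongoDB 4.4+) is available there, a single aggregation can span both. All names here are made up.

```python
# Illustrative sketch: one aggregation spanning a live collection and an
# S3-backed archive, assuming both are exposed in one federated database.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://<user>:<password>@datalake0.example.query.mongodb.net/?ssl=true"
)
db = client["iot"]

pipeline = [
    # Pull the archived collection in alongside the live one.
    {"$unionWith": {"coll": "sensor_readings_archive"}},
    {"$match": {"device_id": 42}},
    {"$sort": {"timestamp": -1}},
]
for doc in db["sensor_readings"].aggregate(pipeline):
    print(doc)
```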
Okay, so in this talk today, I want to explore the MongoDB data lake a little bit further. We'll be writing a function to archive some data into our S3 buckets, and then we'll be doing some queries on that. I'll also be talking briefly about our new Online Archive, which makes this even easier to use.
So MongoDB Atlas Data Lake allows you to work with your data really easily. It scales automatically for you. We can automatically put all of your MongoDB data into the data lake, and it's integrated with all the MongoDB cloud services, which makes working with that data just super easy. It's so fun and so easy to work with. A lot of times, what we're seeing is people building their own data lake solutions on top of MongoDB, and that's hard to do. We built it for you, so you don't have to worry about that anymore. We want to make sure that it's easy to work with your data, easy to archive it, easy to get it out, and hopefully it'll save you some money too. Also, as a developer, I just want it to be easy. We make that easy too. Yeah, you can still access your data just like you could with any other MongoDB database.
You can access it with all of the drivers you're used to. It's going to feel just like you're playing with a MongoDB database; it's just going to be saved somewhere else, and the queries might be a little slower. But again, if you're archiving data, that data doesn't need to be hot, and you're going to save so much money doing that. So it's up to you to determine what kind of data is going to work well, that you can archive. But we have a control plane that points to where our queries need to go and where that data is saved, consolidates all that, and then serves it up to you when you make those queries. Very cool, right? And it can break that down across a bunch of different data sources. This isn't actually a live demo, it's prerecorded; anyways, I'm not really ready to live that dangerously yet. But firstly, what I want to do is set up a database with some hot and cold data and test it to make sure that it's going to work for our needs.
What we're going to do: I wrote a Python script that imports a massive amount of fake IoT data that we're going to be using. We're going to use this fake IoT data to auto-archive cold, or old, IoT data into our S3 bucket, and we'll be able to query that. So you can see here, every couple of seconds it's generating 2,400 documents. It's a little slow, whatever; every second it inserts a couple. But, great. I think I insert like 2,400 here, I can't remember. We'll find out in the next step.
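The exact script is on the resources page; as a stand-in, a generator along these lines (the connection string and field names are my own assumptions) produces the same kind of fake IoT time series:

```python
# A stand-in sketch of the import script: generate fake IoT time series
# data and bulk-insert it. Names and the URI are assumptions.
import random
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
collection = client["iot"]["sensor_readings"]

now = datetime.utcnow()
docs = [
    {
        "device_id": random.randint(1, 100),
        "temperature": round(random.uniform(15.0, 35.0), 2),
        # Spread timestamps into the past so some of the data is "cold."
        "timestamp": now - timedelta(minutes=i),
    }
    for i in range(24_000)
]

result = collection.insert_many(docs)  # batched bulk insert
print(f"Inserted {len(result.inserted_ids)} documents")
```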
So you can see here, on our MongoDB database, we've inserted 24,000 documents, or 52 megabytes of data, into our collection. So we just have an IoT collection. This is just a generic IoT database where you might be saving some of your IoT data, or whatever it is; we're just using IoT for this example. Right, but we have 24,000 documents in here that we've just inserted with our script. Not a problem. And again, if you want to check out the script at all, check out that page with all the resources: joekarlsson.dev/mdb-data-lakes. All right,
cool. So we got all of our data in there. Now we want to set up our data lake so that we can connect it, query all this data, and auto-archive the data we need in there. So how do we do that? If you're on Atlas, you can see here there's a Data Lake section in the right-hand menu, and you can configure your data lake pretty easily; it's actually even easier now. You can just click around in the GUI and connect whatever data sources there are, whether it's an S3 bucket or a local one, which is pretty nice. You can also auto-archive it with our serverless functions with Realm, but we're going to skip through that today. Okay, so let's assume that we have our data lake set up. Wonderful. We also need to configure it with our third-party services, which in this case is AWS S3.
So let's see here. Okay, and we're back. Okay, so we're able to add S3 in MongoDB. So basically, we're just allowing the IAM role; we've configured in S3 that we have access to this. Then we're going to configure a new S3 data service with our Realm application, which is our serverless function. It basically allows us to automatically interact with our S3 blob storage. So you're going to put in wherever your bucket is and your secret key. I'm not revealing that here today. And you can define some rules for that. Cool.
All right. So you can also set up rules, too, with everything in MongoDB. You can set up user-based roles that have configurable access, and that's for all the services; you configure that as well. We're going to define rules in here. So I'm going to use that service, the S3 service we just set up, and we're going to add a rule that when data reaches a certain age, we automatically move that data over to our S3 bucket.
So let's see here. I think we can do that here. So we're going to write a function. This is a serverless function that lives in the cloud; you don't need to set up a server for this. We'll call it realm retire, and this is going to automatically save that data for us that we need. So this is what that looks like. Again, all the code is available on the page that I linked earlier, but we're just saying where the data is going to get saved and what data needs to go there. So we're finding a date range, right? That's going to be whatever we decide; here, it's going to be within two months ago. We want to check to see if there's any data that fits a query that we write, and it's going to automatically put all that data in our S3 bucket for us. You can also do this on your own server. Here's how you might want to do that, or what that query would look like if you're writing it in Python.
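Something along these lines would do it; this is a hedged sketch rather than the talk's exact script, and the bucket name, field names, and URI are my own assumptions:

```python
# A rough sketch of the archival logic in Python: find cold documents,
# write them to S3 as JSON, then remove them from the hot database.
# Bucket name, field names, and the URI are assumptions.
from datetime import datetime, timedelta
import json

import boto3
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
collection = client["iot"]["sensor_readings"]
s3 = boto3.client("s3")

# Anything older than two months is "cold" and eligible for archiving.
cutoff = datetime.utcnow() - timedelta(days=60)
cold_docs = list(collection.find({"timestamp": {"$lt": cutoff}}))

for doc in cold_docs:
    doc["_id"] = str(doc["_id"])              # make ObjectId JSON-serializable
    doc["timestamp"] = doc["timestamp"].isoformat()
    s3.put_object(
        Bucket="my-iot-archive",              # assumed bucket name
        Key=f"archive/{doc['_id']}.json",
        Body=json.dumps(doc),
    )

# Once safely archived, remove the cold documents from the hot database.
result = collection.delete_many({"timestamp": {"$lt": cutoff}})
print(f"Archived and removed {result.deleted_count} documents")
```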
Okay, so what we're going to do then is run this Python script that automatically archives our data to an S3 bucket. It's taking a date range, so for just those days, it's going to archive all that data for us. Okay, so it moves 240 documents to our S3 bucket, and this is actually in S3. Now we can see the data that we've just archived in there.
These are just JSON; if you notice, they're all JSON documents. And if we click into one, it's going to download for us and we can check it out. I'm going to open this up in Visual Studio Code, but you can see that this is just our IoT data that we've archived in a JSON format. So what we're going to be able to do now, with our data lake, is query this JSON text file as if it was a MongoDB database. Very cool, right?
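And a quick count on the hot collection is enough to verify the move; here's a minimal sketch, again with assumed names:

```python
# Sanity check after archiving: count what's left in the hot collection.
# Connection string and names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
collection = client["iot"]["sensor_readings"]

print(collection.count_documents({}), "documents remain in the hot collection")
```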
So we can actually see now that 240 documents have been moved, right? We had 24,000 to begin with; now we have 23,520. Very cool. Okay, so we've confirmed that they've moved, we've confirmed that they're in the S3 bucket, and the world is happy now; we're saving money. We don't have to pay for that storage space, we still have access to it, and we can still query for all that data if we need it. We also now have Online Archive. I wrote all this from scratch, but you do not need to do that. You can just configure Online Archive to follow rules through a GUI. It's a piece of cake to save all this data automatically for you. I highly recommend checking it out. It's incredible.
Okay, we're going to try this one more time. To sum up: we were able to archive data into an S3 bucket by moving data over programmatically. We still have access to query all the data we archived; even though it's in an S3 bucket and it's data in a JSON format, we can still query it like it was a normal MongoDB database. We're saving the world, and we're saving money, by keeping cold data in a cheaper storage form that's more sustainable for growing data needs. That makes us happy, that makes babies happy. We're finally happy.
Questions? Hit me up on Twitter if you have any questions. So, what's next? If I've inspired you at all to want to explore this, I would encourage you to go out there and do it. I think watching a talk like this is a good way to learn if it's worth your time to learn more. But if you're working at a place, or you have an application, with growing data demands that are costing you a lot of money, and especially if you're on MongoDB, I definitely want to encourage you to check out how to do this. We have an always-free dev tier with MongoDB Atlas, which is our cloud-hosted database as a service. You should definitely check it out; it's the easiest way to use MongoDB. I swear it's great. It's not a free trial; it's free forever as long as your database is under a certain size.
It's the best way to learn it. Join our community if you have any other questions; that's at community.mongodb.com. We have courses on university.mongodb.com, which are also free and a great way to learn MongoDB. And I write for developer.mongodb.com, which is where all of our amazing developer resources are located. If you want $100 in free Atlas credits, be sure to use code Joe when you sign up for Atlas. All right, there are some more resources there you can take a look at; all of them are available on my website. And lastly, I just want to say, my name is Joe, and please follow me on Twitter or wherever else. And thank you so much. I had a blast, and I love being here, and y'all are so amazing. Have a great day, y'all.