Conf42 Python 2023 - Online

Time series database: Should I use one in my application architecture?

Abstract

Why do you need a specialized database for time series data? How do I know if a time series database is right for my application? These questions are top of mind for every developer designing an application architecture around time series data, and in this presentation we'll help answer them for you.

Summary

  • Anais Dotis-Georgiou is a developer advocate at InfluxData. Today she'll cover what a time series database is and how to decide whether to use one in your application. She encourages you to connect with her on LinkedIn if you have any questions about time series, InfluxDB, or anything else.
  • InfluxData focuses on developers who build real-time applications. The platform is widely adopted by open source developers and paying customers alike, with over 550,000 unique deployments and over 1,300 customers. Customers include Google, Cisco, SAP, Tesla, Disney, and many others.
  • Time series data is a collection of observations obtained through repeated measurements over time. One of the most classic examples is weather stations. Another time series example is predictive maintenance. Another great example is in the healthcare space.
  • All time series data is represented on a two dimensional graph. The x axis is actually not an independent variable, but a dependent variable. There are often correlations between the value of something and the time. This makes time series analysis much more complicated.
  • There are two different types of time series. There are metrics, and metrics occur in regular time intervals. And then we also have events, and events occur at specific points in time. Both events and metrics are important time series elements that you need to monitor health.
  • The source of data isn't just humans anymore, it's machines and devices. DevOps and the Internet of Things, or IoT, have made a huge impact here. The number of devices is growing dramatically faster than the number of people. The amount of data that each device can generate is exponentially more than one human can generate.
  • Time series databases are especially good at being able to handle really high ingest volumes. Because time series database records are organized by time, this means that time ranges of information are stored together. Another typical benefit of time series databases is the reduced storage size.
  • Time series is definitely a growing category for IoT use cases and for anything that has events and metrics. I recommend using the free tier cloud option first. I also want to make you aware of resources that you can use to learn more about InfluxDB.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everybody, and welcome. Today we're going to be talking about time series databases: should I use one in my application architecture? My name is Anais Dotis-Georgiou, and I'm a developer advocate at InfluxData, and I want to encourage you to connect with me on LinkedIn in case you have any questions about time series, InfluxDB, or anything else that you want to talk about today. So for today's agenda, first we're going to be talking about who InfluxData is and what InfluxDB is. Then I'm going to be going into what time series is. Next, I'll cover what a time series database is and how to know if a time series database is right for your application. All right, let's dive in. So, first I wanted to give a brief introduction to InfluxData, which is the creator of InfluxDB. We were founded in 2013 by Paul Dix, who is our CTO and founder. We focus on developers who build real-time applications, and we are most widely known for our time series data platform, InfluxDB. But we also have an incredibly popular tool called Telegraf, which is an open source collection agent for metrics and events. InfluxDB is where developers can build and scale applications with time as the foundational component. And InfluxDB is one platform with one API across products and environments. The platform is widely adopted by open source developers and paying customers alike, with over 550,000 unique deployments and over 1,300 customers. Some of those customers include Google, Cisco, SAP, Tesla, Disney, and many others. So even if you haven't heard of us before, chances are you've probably used products that actually use InfluxDB. The first example that I want to talk about is Tesla, specifically Tesla Powerwalls for the home. Tesla pulls time series data from connected Powerwalls in solar-powered user homes, and they monitor the health, availability and performance of those solar panels and battery setups with InfluxDB.
They collect all of that data at the edge into the InfluxDB that runs in their backend system. The second example that I want to talk about today is Nest. Nest is a smart thermostat for the home. Nest monitors the infrastructure that powers the IoT data collection system-wide for all Nest devices. This includes their use of Kubernetes and other software infrastructure that is run by Nest to collect, manage, transform, analyze and aggregate device data. Next is Disney+. As we know, Disney+ is an entertainment streaming service and delivery application. Disney+ uses a global content delivery network to distribute its video series, movies and shorts to its users worldwide, and they monitor the movement and performance of their video content throughout this global CDN using InfluxDB. Then there's Rappi. Rappi is an on-demand food and goods delivery application, and Rappi uses InfluxDB to monitor, react to, and adjust to the fluctuations in the on-demand pricing of their driver and rider delivery network. They collect time series data from the mobile delivery network, from all of their mobile delivery riders, and then send it to InfluxDB and into the cloud-based application they've built on it to run the Rappi service. This is an awesome service that exists in Latin America, and I hope one day they also expand to North America. Okay, so now that we understand roughly what InfluxData is and what they create, InfluxDB and also Telegraf, and we understand some of the applications and some of the customers that use InfluxDB, let's take a step back and ask: what really is time series? Time series data is a collection of observations obtained through repeated measurements over time. Okay, that's great, but what does that really mean? Well, the best way to understand something is through examples. So one of the most classic examples of time series data is weather stations.
So I don't know if you've ever received a weather station as a gift or have ever used a weather station, but it's a pretty neat gift and a pretty good idea in case you want to get something for someone you love. Essentially it's a console that sits inside the house and displays the current temperature, humidity, wind, and other conditions. But it also optionally gives you instructions on how to set up the weather station outside and send its results to Weather Underground. So you can do that as well. You can view your data and send data every 5 minutes: temperature, dew point, humidity, wind speed, gusts, pressure, precipitation rate, et cetera. All of this data has a timestamp associated with it, and that is what makes it time series data. Additionally, having all these snapshots of data over time lets Weather Underground applications show much more than just the current weather conditions. You can see your historical data, and you can see the high and low temperatures for the day and even graph it out for the user. Another time series example is predictive maintenance. One thing that I was asking myself is, do I think about time series data all the time because I think about it for work, or do a lot of other people as well? So I was thinking about a friend who has a predictive maintenance company, and does he think about time series every day? And then I realized that he does, because his entire business is based on time series data. So here's a screenshot from his website where he's showing vibration analysis over time. He uses this type of data to help companies predict the life of things like bearings, fan ventilation systems, and many other types of equipment.
So he attaches sensors to this equipment to measure things like vibrations and fluctuations over time, to help his customers decide when they might need to do some sort of predictive maintenance and replace a component that will likely fail in the future, so that they can ensure steady and consistent operation. And then I was thinking about other examples. Another great example of time series data is in the healthcare space. Specifically, when we think about heart health, we look at things like pulse, blood pressure, and temperature. That's all time series data that you need to chart, and it's critical for delivering care to someone. You need to be able to monitor the change in those values over time. All right, so time series data is generally categorized into two different types: metrics and events. First and foremost, time series data is almost always represented on a two-dimensional graph, where we have a y value and an x value. The y value is typically the value of the time series data itself, and the x value is the actual time. One thing that is really unique about time series data is that the x axis, the x value, is actually not an independent variable, but a dependent variable as well, because there are often correlations between the value of something and the time. For example, it is almost always colder at night. So that's one unique property of time series data that is fun to mention, because it makes time series analysis much more complicated from a statistical perspective. But back to metrics and events. There are two different types of time series. There are metrics, and metrics occur at regular time intervals. Those are examples like temperature, pressure, or flow rate that you are gathering at a particular time. And then we also have events, and events occur at specific points in time.
So in our health chart, for example, an event might be something like a seizure or maybe an arrhythmia. Both events and metrics are important time series elements that you need to monitor health, because a regular metric might just be monitoring your pulse. So why do we need a time series database? The best way to answer this question is to understand that time series data is advancing in a lot of different areas. The first and foremost is consumer and industrial IoT. When we think of manufacturing and industrial platforms, renewable and alternative energy systems, or fleet management and telematics, we know that these have sensors that are collecting data like pressure, flow rate, temperature, humidity, concentration, all these things about your environment, rotations per minute, vibrations, et cetera. Then we have software infrastructure, which is a huge source of time series as well. DevOps monitoring is a huge source of time series data, whether you're monitoring your containers, Kubernetes, the availability of endpoints, or CI/CD. We also see time series data showing up in real-time applications. Fintech is kind of an obvious example of time series data, and it's a unique one as well, where you are very likely collecting data at really, really high precision, even nanosecond precision, which is a billion points in a single second. Then we also look at network monitoring and gaming applications as well. Some of our largest customers are banks and crypto companies. Capital One, Bank of America, and Crypto.com, to name-drop a few, are some companies that use InfluxDB. So I wanted to talk about additional data sources as well. We have seen an absolute explosion of data, and I still remember when megabytes of data was considered huge.
But now people talk about petabytes and exabytes of data, and one reason for this is that the source of data isn't just humans anymore, it's machines and devices. DevOps and the Internet of Things, or IoT, have made a huge impact here. The number of devices is growing dramatically faster than the number of people. And the amount of data that each device can generate is, in most cases, exponentially more than one human can generate. I'm a home automation hobbyist, and I have dozens of devices, like smart light switches, that are sending data every few minutes, each one generating more data than I ever could as a human. But I think all of you know and understand this already, probably just as well or better than I do. So let's move on to the next slide. The next thing I wanted to talk about within the context of time series databases is scalability. One thing that time series databases are especially good at is the ability to scale, to handle really high ingest volumes. It's not uncommon that people might want to store data at a minute interval, but maybe also at a second, millisecond, microsecond, or even nanosecond interval. If that's the case, just take a look at the number of records that you're generating per day. If you are collecting data at nanosecond precision, then you are writing trillions of points per day, which is just an insane amount of volume. So one problem that time series databases have to solve is supporting that high ingest volume. The way that time series databases achieve this is by making certain design assumptions and design trade-offs, and those are typically around deprioritizing updates and deletes in favor of increased ingest and query performance. And the way that is largely executed is through the fact that time series databases are indexed by time as well.
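To make the ingest volumes above concrete, here is a minimal Python sketch of the arithmetic. It assumes a single series that writes exactly one point per interval; the interval names and the helper `points_per_day` are illustrative and not part of any InfluxDB API.

```python
# Number of nanoseconds in one day.
NANOSECONDS_PER_DAY = 24 * 60 * 60 * 1_000_000_000

# Length of each collection interval, expressed in nanoseconds.
INTERVAL_NS = {
    "minute": 60 * 1_000_000_000,
    "second": 1_000_000_000,
    "millisecond": 1_000_000,
    "microsecond": 1_000,
    "nanosecond": 1,
}


def points_per_day(interval_ns: int) -> int:
    """Points written per day by one series sampling once per interval."""
    return NANOSECONDS_PER_DAY // interval_ns


for name, interval_ns in INTERVAL_NS.items():
    print(f"{name:>12}: {points_per_day(interval_ns):,} points per day")
```

At a one-minute interval this is 1,440 points per day for one series; at nanosecond precision it is 86.4 trillion, which is where the "trillions of points per day" figure comes from.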
So we also need to be able to prioritize queryability and query performance. Time series databases are typically organized by time. When you query for data, you are typically interested in a collection of time series data within a certain time range. And because time series database records are organized by time, time ranges of information are stored together, and therefore you're able to retrieve that data more quickly. By contrast, this wouldn't always be the case if the time is just being stored in one column in a relational database, for example. So when analyzing time series data, you typically want to look at a range of data, and because of its organization within a time series database, this becomes a really quick operation. Another reason why time series databases get used is your ability to actually manage your time series data lifecycle. One way that you do that is through tools that allow you to automatically expire old data. That's critical, especially if you are collecting data at nanosecond precision. You might not want to retain all that data. More often than not, you don't need to retain it for a long period of time, and you want to be able to automatically expire it in a reliable fashion. The other thing that you typically want to be able to do is perform downsampling. So what is downsampling? Downsampling is the process of taking high-precision data in its raw form, applying an aggregate on top of that data to create a lower-precision summary of it, and then only storing that lower-precision summary. So, for example, maybe today we are storing one sample of temperature every 5 minutes, which would provide a total of 288 records per day, and we downsample that data to the average for the entire day.
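The three ideas above (time-range retrieval, expiring old data, and downsampling) can be sketched with a small in-memory example. This is only an illustration using the Python standard library and made-up temperature values; the function names `query_range`, `downsample_daily_mean`, and `expire_before` are hypothetical and do not reflect how InfluxDB or any other engine implements these features.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from statistics import mean

# One day of made-up temperature readings, one every 5 minutes (288 points),
# stored in time order as a time series database would keep them.
start = datetime(2023, 1, 1)
timestamps = [start + timedelta(minutes=5 * i) for i in range(288)]
values = [20.0 + (i % 12) * 0.5 for i in range(288)]


def query_range(t_from, t_to):
    """Return (timestamp, value) pairs with t_from <= timestamp < t_to."""
    lo = bisect_left(timestamps, t_from)
    hi = bisect_left(timestamps, t_to)
    return list(zip(timestamps[lo:hi], values[lo:hi]))


def downsample_daily_mean():
    """Collapse the stored points into one mean value per calendar day."""
    by_day = {}
    for ts, value in zip(timestamps, values):
        by_day.setdefault(ts.date(), []).append(value)
    return {day: mean(day_values) for day, day_values in by_day.items()}


def expire_before(cutoff):
    """Drop every point older than cutoff as one contiguous block."""
    idx = bisect_left(timestamps, cutoff)
    del timestamps[:idx]
    del values[:idx]


hour = query_range(start + timedelta(hours=6), start + timedelta(hours=7))
print(len(hour))             # points in a one-hour window

daily = downsample_daily_mean()
print(len(daily), daily)     # 288 raw points reduced to one daily record

expire_before(start + timedelta(hours=12))
print(len(timestamps))       # points left after expiring the first half of the day
```

Because the points are kept sorted by time, both the range query and the expiry only need to locate two boundaries and then work on one contiguous slice, which is the intuition behind storing time ranges together.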
So we've reduced that data from 288 records to one record, and especially if you're using OSS versions of InfluxDB, that helps you reduce your disk size as well as the index of your database, which would in turn increase your query performance. With a time series database, these types of actions typically happen at the database engine level, and they shouldn't require that you build these solutions yourself. Another typical benefit of time series databases is the reduced storage size that they can often have. We talked about downsampling already, and how you can use downsampling to reduce your disk size and the amount of data you need to retain. But compression is another big benefit of a time series database. Time series records often have similar data. Think about a railroad track sensor as an example. It might show a zero, meaning no train, for almost all of the day. It is critically important to know when the train was there so that the gate gets closed and the lights come on. But there may be hours of samples of zero before there is a one. The only difference between some of the records may just be the timestamp, and the time interval also has a pattern that can be used in compression. So you typically see great compression with at least some time series databases, just because of the nature of time series data itself. There are going to be neighboring values that are identical or extremely similar, which also makes scanning the data easier and more efficient. Time series databases also typically have data retention policies so that you can automatically delete your data after a period of time, like we mentioned. You could certainly build an application to go and delete old records, but it's just easier to have the database do that for you.
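As a rough illustration of why the railroad sensor data compresses so well, the sketch below applies two simple techniques: run-length encoding of repeated values and delta encoding of regularly spaced timestamps. The sensor readings, the epoch start value, and the helper names are assumptions made for this example; real time series engines use more sophisticated schemes, and this is not a description of InfluxDB's storage format.

```python
from itertools import groupby

# Hypothetical track sensor: one reading per minute for a day.
# 0 = no train, 1 = train present (here, during minutes 600 to 602).
readings = [1 if 600 <= minute < 603 else 0 for minute in range(1440)]
timestamps = [1_700_000_000 + 60 * minute for minute in range(1440)]  # epoch seconds


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def delta_encode(numbers):
    """Keep the first number, then only the difference from the previous one."""
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


encoded_values = run_length_encode(readings)
encoded_times = run_length_encode(delta_encode(timestamps))

print(encoded_values)  # 1,440 readings collapse into three runs
print(encoded_times)   # the first timestamp, then one run of identical 60-second deltas
```

Hours of identical zeros become a single pair, and the regular sampling interval turns the timestamp column into one repeated delta, which is the pattern the talk refers to.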
And because time series databases typically organize data by time, you don't have the inefficiency issues you see with some other types of databases when you delete large amounts of data from them. One of those trade-offs that I was talking about is deprioritizing deletes and updates in favor of ingest and querying. But when we talk about those deletes, we mean individual points, whereas time series databases are better at deleting a large group of data. Okay, so you're a software engineer, and you want to know the following: how do I know if time series is right for my application? In order to answer that, we should start by asking ourselves a few questions. The first question to ask is, does my data have a time element? Most data does, but certainly not all of it. And sometimes it's not critical for understanding or problem solving. If there's no time element, then a time series database is not a good fit and you should find a different database for it. The next question I would ask is, do I care about changes in my data over time? Let's look at that basic weather station example. Its primary feature is to display the current temperature and humidity inside a house, and if those were the only features the weather station had, would there be a need for a time series database? Maybe not. You could simply have a current temperature field in a relational database and just update that field whenever there is a change in the metrics. But if you do care about other samples of the data besides the current one, then there are still some more questions to ask, like: do I think that I will get a feature request for my application that will need to know about changes over time? What if someday we wanted to enhance the application to show today's high and low temperatures on the console? Now, you could still use a time series database even if you didn't care about changes over time.
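The weather station trade-off can be shown with a small Python sketch that contrasts a single overwritten "current temperature" field with a list of timestamped readings. The sample values and variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# Option 1: a single "current temperature" field, overwritten on every update.
current_temperature = None

# Option 2: keep every reading together with its timestamp.
readings = []

start = datetime(2023, 1, 1, 0, 0)
samples = [12.0, 10.5, 9.0, 14.5, 19.0, 16.0]  # made-up values, one every 4 hours

for i, temperature in enumerate(samples):
    ts = start + timedelta(hours=4 * i)
    current_temperature = temperature   # previous value is lost here
    readings.append((ts, temperature))  # history is kept here

# The time series can still answer "what is the temperature now?" ...
latest = readings[-1][1]

# ... and it can also answer the later feature request for today's high and low.
todays = [t for ts, t in readings if ts.date() == start.date()]
high, low = max(todays), min(todays)

print(current_temperature, latest, high, low)
```

With only the overwritten field, the high and low for the day cannot be recovered after the fact; with timestamped readings, the current value is simply the last record.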
Certainly with time series data you can often just retrieve the last record, which in this case would contain the same temperature as the current temperature column previously described. But the need to use a time series database is less in that case. Another question that you would ask yourself is, how much data will I be working with? Will my application be working with a large amount of data now or in the future, or only a very small amount? If the amount of data is small, it doesn't mean that there is no need to use a time series database. It still might be the most efficient way to build your application. But if you're working with a large amount of data with a time element, then the scalability that time series databases can bring will definitely help to build the case. So there's kind of a trade-off with the familiarity and comfort of the tools that you already know. You may be more comfortable working with a different database, and you might be able to make it work if you have a small amount of data. But if you have a large amount of data, then you probably need to consider making that shift. Another question to ask is, will there be any need to do analytics with my data in the future? Data analytics are often performed by looking at data over a period of time, which makes a time series database a really great choice for real-time analytics. Really, any comparison of metrics or events in your data over time benefits from the quick retrieval of time series data. Another question you might ask is, am I concerned with storage costs? I know this sounds like a question where the answer is always yes, but let me state it better. Am I concerned with storage costs that could be reduced by a time series database? This doesn't always have an easy answer and sometimes takes some prototyping to figure out. But very often with time series data, the answer is yes, and sometimes by a dramatic amount.
That saving comes from downsampling, retention policies, and the compression benefits that time series databases provide. Another question to ask yourself is, am I concerned about application performance with the time series data? The answer is yes in almost all cases. But let's restate it. Do I need to retrieve data in blocks of time for analytics, to build a graph or do some other analysis over blocks of time and data? And is the speed at which I can do that important? This is where time series databases will shine. Your application may or may not have this need, but if it does, it's a good indicator to consider a time series database. Okay, so before we wrap up, let's go over a summary of some of the other great options out there. There are places where relational databases are still far and away the best choice, and there are other great categories as well. But time series is definitely a growing category for IoT use cases and for anything that has events and metrics. If you have other types of data, you should use the right solution for it. Now, if you want to start playing around with a time series database, some of the vendors out there offer a free hosted database version. An example is InfluxDB. We offer an on-prem, open source free version and a free basic-usage hosted cloud option. I actually recommend using the free tier cloud option first. That's usually what most people do, because you don't have to download anything and you can just play around with it and get an intuitive feel for whether or not it's something that you want to consider investing in. From there I usually see users install the open source version until they have a requirement where they don't want to host it themselves, in which case they sometimes return to the cloud version. I also want to make you aware of resources that you can use to learn more about InfluxDB. We have blogs, we have Slack, the Discourse forums at community.influxdata.com, and Reddit as well.
We offer InfluxDB University as well, which is a platform for learning about all things related to InfluxData's technology, including InfluxDB. You can earn badges for any course that you complete there, and our documentation is also excellent. So with that I want to thank you so much, and I look forward to seeing you next time.
...

Anais Dotis-Georgiou

Developer Advocate @ InfluxData

Anais Dotis-Georgiou's LinkedIn account


