Conf42 Golang 2024 - Online

Leveraging the Apache Arrow Flight Go Client and InfluxDB

Abstract

In this talk, we will learn about the advantages of Apache Arrow and Arrow Flight as a data format and framework for transport of large datasets. Then we’ll learn how to leverage the Arrow Flight Go Client to build an InfluxDB Go Client and use it.

Summary

  • Anais Dotis-Georgiou is a developer advocate at InfluxData. In this talk, we'll learn how to use the InfluxDB 3.0 Go client library. InfluxDB University offers free, live, and self-paced training on a wide variety of topics related to InfluxDB.
  • Time series data is any data that has a timestamp associated with it. Industrial IoT or IoT data can include anything that comes from a sensor. A time series database has four components or pillars. The first is ingesting time series data; another is accommodating really high throughput.
  • InfluxDB is built on DataFusion, the query execution framework that allows InfluxDB users to query InfluxDB with SQL. Columnar format is really well suited to time series use cases for multiple reasons. We really believe in open architecture and open data formats.
  • The InfluxDB Go client library allows any InfluxDB user working in Go to write to and query InfluxDB v3. You'll also need an authentication token. And last but not least, I'll share some resources and some places where you can get additional help.
  • Line protocol is just the ingest format for InfluxDB. It consists of the following components: measurements, tags, fields, and timestamps. The only difference between tags and fields is that tags are used to store metadata while fields contain your actual time series data.
  • The InfluxCommunity organization on GitHub contains a variety of examples for how to use InfluxDB with a bunch of projects. Also see our Go client library documentation at docs.influxdata.com. Last but not least, I want to encourage you to try the free trial and get started for yourself.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome. In this talk, we'll learn how to use the InfluxDB 3.0 Go client library. My name is Anais Dotis-Georgiou and I'm a developer advocate at InfluxData. InfluxData is the creator of InfluxDB, and we'll be learning about how to use InfluxDB's client libraries, specifically the Go one. For those of you who aren't familiar with developer advocacy, a developer advocate is someone who represents the community to the company and the company to the community. The way that I do that is through giving talks like these and through creating technical code examples for how to use InfluxDB with a variety of other tools for a variety of use cases, whether that's an industrial IoT example, a predictive maintenance example, a data science example, or a visualization example. I also answer questions on forums and Slack, so if you go to the InfluxDB forums or community and you have questions, I will likely be answering one of them there. I encourage you to reach out and connect with me on LinkedIn if you have any questions about today's presentation or any feedback that you want to give about InfluxDB. That's another part of my role: passing feedback back to the product teams so that they can make more informed decisions.
Before I begin and talk about how to use the Go client library, I want to take a quick second to introduce InfluxDB University. InfluxDB University offers free, live, and self-paced training on a wide variety of topics related to InfluxDB: how to get started, how to build a hybrid architecture with InfluxDB, how to query, how to use the client libraries, and how to use additional tools and protocols. If you have a time series use case and you want to learn more about how to use InfluxDB, I highly recommend that you check out this resource. It's entirely free, and it allows you to earn digital badges so that you can share your achievements for the classes that offer them.
Before I go into how we are going to use the Go client library to query and write data to InfluxDB, I wanted to take a step back and talk about time series data in general, because that gives us some context for why you need a time series database. So what is time series data? Time series data is any data that has a timestamp associated with it. Probably one of the most classic examples of time series data is stock market data. The other interesting thing about time series data is that individual points are not really that valuable: a single stock value is not that interesting. What we're really focusing on is the trend of the stock over time, because that helps us determine whether or not we should buy or sell. So really, time series data is a sequence of data points, and that makes it quite different from data that you would store in a relational database, where a single record or row is usually everything you need. This is also true of any IoT data. Industrial IoT or IoT data can include anything that's coming from a sensor: vibration, temperature, pressure, humidity, concentration, light, velocity, etcetera. Anything that we are measuring about our physical environment with a sensor is likely time series data.
We also get a lot of time series data coming out of the virtual world, whether that's DevOps monitoring, application monitoring, performance monitoring, or network monitoring. That's also a huge source of time series data. Time series data usually falls into two categories: metrics and events. Metrics are usually predictable, and they occur at the same time interval. If we think of an IoT use case, maybe we have a sensor on a robot and we are looking at the vibration of that robot, so we have a vibration sensor. We might be polling that sensor at a fixed rate, getting a reading per second from an accelerometer. In that case, that's a metric. An event, however, is an unpredictable type of time series data. We cannot predict when an event will occur, but we can still store the event data. Here we might be thinking of something like a machine fault alert: we don't know when that machine will next register a fault, but we can store that time series event when it does. One interesting relation between events and metrics, though, is that we can transform an event into a metric. If we did a daily count of how many machine faults occurred, then we would be getting a regular count of zero or more every single day, and we'd be converting that event into a metric (see the sketch after this paragraph).
So what is a time series database? Basically, a time series database has four components or pillars. The first is that it ingests timestamped data, time series data. Every data point in the time series database is associated with a timestamp, and it allows you to query in time order. That's similar to a data historian; one big difference between InfluxDB and data historians and legacy tooling like that is the lack of vendor lock-in. Especially with InfluxDB v3, we're really focused on interoperability with other tools, and later this year we'll even allow you to query Parquet directly from InfluxDB so that you can use it with other machine learning tools, for example, and also put it into other data stores as you need. So one component of a time series database is ingesting time series data. Another is accommodating really high throughput. In most time series use cases, we're going to get really high volumes of batch data or real-time streams from multiple endpoints, so think thousands of sensors, or sensors with high throughput. For example, the industrial vibration sensor that we were talking about earlier can sample at up to 10 kHz; that's 10,000 points every second that you'd be writing, and presumably not just for that one sensor but for multiple. So you can see how high the throughput requirements for a time series database have to be. Additionally, it makes sense that if you are writing a lot of data to a database, you also want to be able to query it with the same sort of efficiency. Specifically with time series, you want access to performant aggregations over time, whether that's averages, sums, or global minimums and maximums, so you need to be able to scan over large ranges of this data and return the results very quickly. And last but not least, you need to think about scalability and performance: a time series database should be designed to scale horizontally to handle the increased load that is often associated with time series use cases, often across distributed clusters of machines.
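To make that event-to-metric conversion concrete, here is a minimal Go sketch. The dailyFaultCounts helper is purely illustrative, not part of any InfluxDB API: it rolls unpredictable events (fault timestamps) up into a regular daily metric.

```go
package main

import (
	"fmt"
	"time"
)

// dailyFaultCounts is a hypothetical helper: it converts an event stream
// (machine-fault timestamps) into a metric (a count per calendar day).
// Days with no faults simply read back as zero.
func dailyFaultCounts(faults []time.Time) map[string]int {
	counts := make(map[string]int)
	for _, t := range faults {
		day := t.UTC().Format("2006-01-02") // bucket each event by day
		counts[day]++
	}
	return counts
}

func main() {
	faults := []time.Time{
		time.Date(2024, 4, 1, 3, 12, 0, 0, time.UTC),
		time.Date(2024, 4, 1, 19, 47, 0, 0, time.UTC),
		time.Date(2024, 4, 3, 8, 5, 0, 0, time.UTC),
	}
	fmt.Println(dailyFaultCounts(faults)) // map[2024-04-01:2 2024-04-03:1]
}
```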
So let's talk about InfluxDB 3.0 specifically. This is what our architecture diagram looks like for the 3.0 build. We really believe in open architecture and open data formats, which means that we are part of the Apache ecosystem and contribute to those upstream projects. InfluxDB is built on DataFusion, the query execution framework that allows InfluxDB users to query InfluxDB with SQL. Then we have Apache Arrow, which is what we use for our in-memory columnar format. Columnar format is really well suited to time series use cases for multiple reasons. The first is that when we are monitoring our environment to watch for changes, we'll typically have a lot of repeated values. If we think of monitoring the temperature of a room, for example, most of the time that room temperature is going to stay the same, so we're going to have a lot of repeated values, and if we store things in a columnar format, then we get really cheap compression. Additionally, you're likely going to want to do things like find the max and min values across, sometimes, millions and millions of points, and not having to scan through every single row in order to find global mins and maxes makes time series data additionally well suited to a columnar format. Then our durable file format is Parquet, and we're also branching out into leveraging Iceberg as well. Those of you who are familiar with Parquet will know just how efficient it is and how much cheaper it is to store than a counterpart like CSV. Being able to eventually have a feature where we can query Parquet directly from InfluxDB and use those Parquet files for further data processing or in conjunction with other machine learning tools just really increases the interoperability of InfluxDB.
Today, though, we're going to focus on the Go client library. We're going to talk about what it is, then the requirements to actually use it; we'll learn how we can write data to InfluxDB v3 with the Go client and how we can query; and last but not least, I'll share some resources and some places that you can get additional help if you have any questions about this presentation. So what is the InfluxDB Go client library? It's essentially a software package that provides a set of tools and functions that allow any InfluxDB user using Go to write to and query InfluxDB v3. Some of the advantages of using the InfluxDB v3 Go client library: first and foremost, it wraps the Apache Arrow Flight client in a convenient InfluxDB v3 interface, and it allows you to execute SQL and InfluxQL queries. InfluxQL is a query language that is specific to InfluxDB. It's very SQL-esque, and v1 users will be familiar with InfluxQL, so we wanted to carry that over into InfluxDB v3. And essentially you get to leverage Apache Arrow and Apache Arrow Flight, and the Flight protocol over gRPC.
So it really enables you to transport really large datasets over a network interface by leveraging Arrow, and the Flight protocol also provides really efficient serialization and deserialization as well as bidirectional streaming, which makes it really efficient for querying really large datasets from InfluxDB, touching on that high-throughput use case we were talking about. So how does the Go client library work? Writes are implemented via our own write API endpoint. You write data with line protocol, which is the ingest format for InfluxDB; of course, we have methods that will generate that format for you within the Go client, so we'll talk about that. And queries are implemented, like I mentioned, via the Apache Arrow Flight client.
Now, some requirements for getting started. The first is to actually have an InfluxDB Cloud 3.0 account. If you sign up, you can get a free tier, and that's probably the easiest way to get a feel for whether or not you even like InfluxDB, because you don't have to download or install anything; you can just get started. Then you'll need to actually create a database. This is also referred to as a bucket; they're the same thing. The only difference between a bucket here and a database in something like SQL is that within the context of InfluxDB, a bucket or database also has a retention policy associated with it. A retention policy just determines how long you'll keep the data before you automatically expire it, and that's a handy, convenient part of having a time-series-specific database, because automatically expiring old data is a common function that you're going to want to perform. You'll also need an authentication token. So let's actually learn how we can create a bucket and get our authentication token. This is what the UI looks like: we'll go to the Load Data page, hit the Buckets tab, and select Create Bucket. We'll give our bucket a name; maybe our bucket is going to be called "my-bucket". We'll assign a retention policy; the default is 30 days, and we'll just leave it that way. We can also create a token on the Load Data page: we'll say Generate API Token, and you can generate an all-access token or customize it to read and write just from a specific bucket. Then you have the token that you'll need for actually initializing your Go client.
Okay, so now let's talk about installation. Basically, you're going to want to add the latest version of the client package to your project's dependencies with go get, and then we're ready to actually write data. We'll import our packages and then instantiate the client to write to and query InfluxDB v3 by providing our credentials. Instantiating the client is all boilerplate: we import our package, then get our environment variables, which include the URL associated with your InfluxDB Cloud account, the InfluxDB token that we just generated, and the database that we want to write to. And then we can actually instantiate that client.
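As a rough sketch of that boilerplate, assuming the InfluxCommunity influxdb3-go package: the import path, the ClientConfig field names, and the environment variable names below may differ slightly between client versions, so treat this as a sketch rather than a definitive implementation.

```go
// Install the client first:
//   go get github.com/InfluxCommunity/influxdb3-go/influxdb3
package main

import (
	"fmt"
	"os"

	"github.com/InfluxCommunity/influxdb3-go/influxdb3"
)

func main() {
	// Credentials come from environment variables; the variable names here
	// are illustrative, not required by the client.
	client, err := influxdb3.New(influxdb3.ClientConfig{
		Host:     os.Getenv("INFLUXDB_URL"),      // your InfluxDB Cloud region URL
		Token:    os.Getenv("INFLUXDB_TOKEN"),    // the API token generated in the UI
		Database: os.Getenv("INFLUXDB_DATABASE"), // the bucket/database name, e.g. "my-bucket"
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()
	fmt.Println("client ready")
}
```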
Now we'll talk about writing data. We can write data in the line protocol format; line protocol is just the ingest format for InfluxDB, and it consists of the following components: measurements, tags, fields, and timestamps. When we write a point to InfluxDB, the only difference between tags and fields is that tags are used to store metadata about your instance, and fields are where you contain your actual time series data. However, in practice, both fields and tags get converted into columns within a table in InfluxDB, so once you query they're really identical; this distinction is only for organizational purposes when you're writing your line protocol, and if you get confused about what to make a tag versus a field, it's not really a big deal. Measurements are basically what you would think of as your table name. And your timestamp is going to be in Unix format, and you'll obviously need one with every data point, because we are writing time series data. So here's an example of what line protocol looks like. We have our table, our measurement, called stat. Then we separate our tags, the tag keys and tag values, with a comma; in this case, we have one tag called unit with a tag value of temperature. And then we have two fields, average and max, with two actual time series values, 23.5 and 45.0. If you don't provide a timestamp, one will be generated at the time of the write.
We can also use the point method instead, and this is probably what you would do if you were using the Go client. We'll use the write points method, and you can use the new point method to create a point; you can also append points to an array to write an array of points and pass those into the write points method as well. Data is written synchronously. So if you wanted to create a point similar to the line protocol that we just wrote, you would use that new point method and include the measurement, any tags that you have, any fields, and any timestamp, and then you'd pass that into the write points method along with the database that you want to write to, and write your point that way.
Now, there's an important note about upserts: you can always upsert a field, but not a tag. For example, if we add a second point where a 2 has been appended to each field value, we would upsert those field values, and the previous values would be overwritten. So the first line, where I have an average of 23.5 and a max of 45.0, would be overwritten by the second point that I wrote, with 23.52 and 45.2. Notice that the timestamps are the same as well; that's why the upsert works. However, if we add a second point and we change the tags, then we're not upserting those values. So that is another thing to think about when you're considering the difference between tags and fields: upserts. In this case, if I added a second point with unit=temperature2, that's an entirely different tag value, and you can think of that as a different time series. Maybe a more concrete example for a tag would be a sensor ID: if I had a sensorid tag with one value equal to one and one value equal to two, it makes sense that we don't want to upsert, because that is most likely in fact a truly new time series.
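Pulling those write examples together, here is a minimal sketch, again assuming the influxdb3-go client; the exact Write and WritePoints signatures have shifted a little between client versions, so check the documentation for yours.

```go
package main

import (
	"context"
	"os"
	"time"

	"github.com/InfluxCommunity/influxdb3-go/influxdb3"
)

func main() {
	// Client setup as in the earlier sketch (credentials via env vars).
	client, err := influxdb3.New(influxdb3.ClientConfig{
		Host:     os.Getenv("INFLUXDB_URL"),
		Token:    os.Getenv("INFLUXDB_TOKEN"),
		Database: os.Getenv("INFLUXDB_DATABASE"),
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()
	ctx := context.Background()

	// Option 1: raw line protocol -- measurement,tags fields [timestamp].
	// Omitting the trailing timestamp means one is assigned at write time.
	if err := client.Write(ctx, []byte("stat,unit=temperature avg=23.5,max=45.0")); err != nil {
		panic(err)
	}

	// Option 2: build the equivalent point programmatically.
	p := influxdb3.NewPoint("stat",
		map[string]string{"unit": "temperature"},  // tags: metadata, part of the series key
		map[string]any{"avg": 23.52, "max": 45.2}, // fields: the actual time series values
		time.Now())

	// Same measurement, tags, and timestamp as an existing row means the field
	// values are upserted (overwritten); a changed tag value (for example
	// unit=temperature2) starts a new series instead.
	if err := client.WritePoints(ctx, []*influxdb3.Point{p}); err != nil {
		panic(err)
	}
}
```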
Now let's talk about how we can query data with the Go client library. We can use SQL to query. Following the example that we've been using, I can provide a simple SQL query like: select everything from that stat table where the time is between now and five minutes ago and the unit tag is temperature. We pass that query into our query method and then print the values. There are some additional method parameters that you can use: the context for the request, the database that you want to use for the database operations, and the query itself. Those are all the things that we included, and here's how you can query explicitly in that instance, actually providing those options as well as the query type that you want, so you can switch between querying with SQL or not. Here's an example of how you can query with InfluxQL instead if you wanted to. We're using the InfluxQL command SHOW MEASUREMENTS, which shows all the measurements, or tables, within our database, and we're specifying the query-type option to be InfluxQL instead of Flight SQL, and we're able to return all of our measurements. And similarly, there are some method parameters here as well.
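Here is a sketch of both query paths, once more assuming the influxdb3-go client. The iterator and the QueryOptions type below are from that community client as best I recall them, and the way the query type is passed has changed between versions (older releases take a *QueryOptions struct, newer ones take variadic options), so verify against the docs for your version.

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/InfluxCommunity/influxdb3-go/influxdb3"
)

func main() {
	// Client setup as in the earlier sketches (credentials via env vars).
	client, err := influxdb3.New(influxdb3.ClientConfig{
		Host:     os.Getenv("INFLUXDB_URL"),
		Token:    os.Getenv("INFLUXDB_TOKEN"),
		Database: os.Getenv("INFLUXDB_DATABASE"),
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()
	ctx := context.Background()

	// SQL: everything from stat over the last five minutes for one tag value.
	it, err := client.Query(ctx,
		"SELECT * FROM stat WHERE time >= now() - interval '5 minutes' AND unit = 'temperature'")
	if err != nil {
		panic(err)
	}
	for it.Next() {
		fmt.Println(it.Value()) // one row as a map of column name to value
	}

	// InfluxQL: switch the query type via options instead of the SQL default.
	it2, err := client.QueryWithOptions(ctx,
		&influxdb3.QueryOptions{QueryType: influxdb3.InfluxQL},
		"SHOW MEASUREMENTS")
	if err != nil {
		panic(err)
	}
	for it2.Next() {
		fmt.Println(it2.Value())
	}
}
```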
I encourage you to check out full code examples at InfluxCommunity: the InfluxCommunity organization on GitHub contains a variety of examples for how to use InfluxDB with a bunch of projects, on everything from IoT to data science and machine learning. I highly encourage you to go there, because if there's something that you're looking to do with InfluxDB, there's a high probability that we've already created a sample project, which is a great place for you to get started. Last but not least, I wanted to share some final resources for getting help. The first is InfluxCommunity itself; that's where all of our client libraries for v3 are maintained, so please go there and take a look. There's also our Go client library documentation at docs.influxdata.com; just make sure that you are using the right documentation for the right product by selecting your product in the upper right-hand corner before you continue using the docs. I also want to encourage you to join our community Slack workspace and participate in conversations about InfluxDB, Go, Python, whatever language you're using, and any questions that you have about time series in general. And I also wanted to let you know about our forums; those are a great resource. I mentioned GitHub already, our docs, and also our blogs, where you can find all these examples and more for how to get started using InfluxDB. Last but not least, I want to encourage you to try the free trial and get started for yourself. You can scan that QR code to learn more about how to use InfluxDB.
...

Anais Dotis-Georgiou

Developer Advocate @ InfluxData



