Conf42 Python 2024 - Online

Leveraging the Apache Arrow Flight Python Client and InfluxDB

Abstract

In this talk, we will learn about the advantages of Apache Arrow and Arrow Flight as a data format and framework for transporting large datasets. Then, we’ll explore how to leverage the Arrow Flight Python Client to build an InfluxDB Python Client, work with Pandas and InfluxDB, and create a CLI.

Summary

  • Today we'll be talking about how we can leverage the Apache Arrow Flight Python client and InfluxDB. We'll talk about InfluxDB's commitment to open data architecture. I encourage you to reach out on LinkedIn. I'd love to learn about what you're doing and any questions that you have.
  • Time series data is any data that really has a timestamp associated with it. You can have event data too, as a type of time series data. Your time series database should be both scalable and performant.
  • InfluxDB is a time series database platform built on Apache Arrow, Arrow Flight, DataFusion, and Parquet. We believe in open architecture and an open data format. When you contribute to these upstream projects, so many other tools benefit from it.
  • Arrow Flight provides a Python client library that can query InfluxDB and other databases. The Arrow Flight SQL client basically just wraps the Arrow Flight client. How streams of data are returned differs a good amount between languages.
  • Mage is an open source data pipelining tool for transforming and integrating data. By contributing the Flight SQL plugin to Grafana, we provide this benefit to a number of different open source tools that leverage Arrow Flight. I also want to encourage you to visit the InfluxCommunity organization on GitHub.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
So welcome, everyone. Today we'll be talking about how we can leverage the Apache Arrow Flight Python client and InfluxDB. We'll also talk about how we can leverage the InfluxDB v3 Python client and the Arrow Flight SQL client as well, so we'll go over all three. Just a little bit about me: for those of you who don't know what a developer advocate is, a developer advocate is someone who represents the community to the company and the company to the community. I do that through webinars like this, and also by creating POCs or demos that showcase code examples of how to use a variety of technologies together, so that you can understand how to leverage something like InfluxDB. I work for InfluxData, and InfluxData is the creator of InfluxDB. So if you enjoy any of this content today, or you want to learn more about Apache Arrow, about the Apache ecosystem in general, about InfluxDB, or perhaps have any questions about time series or data science in the time series space, I encourage you to reach out on LinkedIn. Please connect with me there. I'd love to learn about what you're doing and any questions that you have. For today's agenda, we're going to start with a quick introduction to InfluxDB and time series, because InfluxDB is a time series database, and you can't understand what InfluxDB is without understanding what time series data is. Then I'm going to talk about InfluxDB's commitment to open data architecture, along with the FDAP stack: Arrow Flight, DataFusion, Apache Arrow, and Parquet. Then I'll talk about leveraging the Arrow Flight client and the InfluxDB v3 Python client. And last but not least, I'll share some projects that leverage all of these tools. So let's get right into it with an introduction to InfluxDB and time series data. Time series data is any data that has a timestamp associated with it. The earliest or simplest example is probably stock market data. The unique thing about time series data is that a single value is not really that interesting; a single stock price, for example, doesn't have a lot of meaning. What's really important is the trend of the stock over time, because that tells us whether or not we should be buying or selling. And this is true for any time series data: it's defined by the sequence of data points, usually consisting of successive measurements from a data source over time. That holds whether we're looking at stock data, industrial IoT data, or application monitoring data. So that's time series data, and when we think about it, we really think of it as falling into two categories. The first is metrics. Metrics are time series data that are predictable and occur at the same time interval. So if we were monitoring, let's say, a machine on some plant floor, maybe we're looking at the vibration reading per second from an accelerometer, and we're pulling that at a regular interval. That would be a metric. But you can have event data too, as a type of time series data. Events are unpredictable, so we can't predict when an event will occur, but we can still store events, and we can also, through aggregation, convert any event data into a metric. So, for example, we could count the number of events that occur per day.
Maybe an event in this machine use case is when we have some sort of machine fault, and we can record how many faults happen per day. In this way, we're converting our event data into a metric, where we're getting a count on a regular basis. So what is a time series database? Well, a time series database has four components, or really pillars, that comprise it. The first is that it must be able to handle time series data, perhaps obviously. Every data point in a time series database is associated with a timestamp, and you should be able to query this data in a time-ordered fashion, because, as we mentioned, time series data is really meaningless unless it's in that time-ordered fashion. Additionally, it has really high write throughput. In most cases, we're talking about really high volumes of batch data or real-time streams from multiple endpoints. So we could think of thousands of sensors writing pressure, concentration, temperature, light, et cetera. Or we might be thinking about just one sensor, or a few sensors, with really high throughput, like an industrial vibration sensor; that type of sensor typically writes around 10,000 points per second. Another pillar of a time series database is that we need to handle really efficient queries over time ranges, and we also need to perform aggregation over time, things like averages, sum, min, max, with a variety of different granularities, whether that's seconds, minutes, hours, days, or nanoseconds. And last but not least, your time series database should be both scalable and performant. You need to be able to scale it horizontally, often across distributed clusters of machines, in order to handle the increased load, and also to handle the really high write throughput and query requirements of time series use cases, which are often really high dimensionality, high volume use cases. So let's talk about InfluxDB specifically. InfluxDB v3, and InfluxDB in general, is a time series database platform, and it is built on Apache Arrow, Arrow Flight, DataFusion, and Parquet. I'll talk in detail about what all of these are. But one thing that is at our core is that we believe in open architecture and an open data format, and we believe these technologies really enable this. Before I go into those technologies, I just wanted to highlight some of our customers, to give some context around how InfluxDB is used. We're primarily used in IoT monitoring, but also in application and software monitoring. For some examples, Wayfair uses us for application monitoring. Tesla uses us to monitor all of their Powerwall batteries. BBOXX, which is not a logo that's up here, but a company that I think is really cool, develops and manufactures products to provide affordable, clean solar energy to off-grid communities in developing countries, and we help them monitor all their solar panels. We have companies doing indoor agriculture that are using us, community members monitoring endangered birds, and community members monitoring their barbecue at home. So there are so many different use cases, but the one thread is that they all involve time series data.
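Circling back to the events-versus-metrics point for a moment, here is a minimal sketch in pandas, not from the talk, of what that aggregation looks like; the fault timestamps and machine IDs are made up purely for illustration:

```python
import pandas as pd

# Hypothetical fault events: unpredictable timestamps, one row per machine fault.
events = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2024-01-01 03:17", "2024-01-01 09:42", "2024-01-02 14:05"]
        ),
        "machine_id": ["M1", "M1", "M2"],
    }
)

# Aggregate the irregular events into a regular metric:
# a fault count per day that can be stored and queried like any other series.
faults_per_day = events.set_index("time").resample("1D").size()
print(faults_per_day)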
And just to highlight the high throughput requirements that time series demands and that InfluxDB can provide: in the latest benchmark for our latest version, InfluxDB v3, which is built upon DataFusion, Apache Arrow, Arrow Flight, and Parquet, with a dimensionality of 160,000 we are able to ingest around 329,000,000 rows per hour, or 4 million values per second. So that's really what we're looking at when we think about the high volume use cases that time series databases like InfluxDB can support. So let's talk about this commitment to open data architecture with Apache Arrow, Arrow Flight, DataFusion, and Parquet. What we mean by this is the ability to have good interoperability with a bunch of other tools and to easily transport data to and from other sources, so that people can develop their architecture with the tools that are most aligned with their project and the problem they're trying to solve. The way this stack enables InfluxDB to do that is that, essentially, Apache Arrow is an in-memory columnar format for defining data, and Apache Arrow Flight is a way to transport large datasets like Arrow over a network interface. InfluxDB uses Parquet as the durable file format on disk; it's also columnar. And DataFusion is the query execution framework that allows developers to query InfluxDB v3 in both SQL and InfluxQL; InfluxQL happens to be a SQL-like query language that's specific to InfluxDB. Essentially, what this stack allows developers to do is ingest, store, and analyze all types of time series data, handle it at really high speed and high volume, and also have increased interoperability with a bunch of tools and be part of the Apache ecosystem. That means that when you contribute to these upstream projects, so many other tools benefit from it, and any other tools that are also leveraging these technologies give you a standard for transporting these datasets to and from all of those different tools. So we think of things like Dremio, which leverages Apache Arrow and Apache Arrow Flight. A whole bunch of machine learning tools leverage things like Parquet; I know H2O does, I know Google BigQuery does, I believe Cassandra as well. There are so many different tools you can use: if you can extract Parquet files from InfluxDB and then quickly leverage them in other use cases and other tools, you just have access to them, and you can build the architecture as you need. The other thing these tools enable for InfluxDB specifically is schema on write. Schema on write means you don't have to define your schema beforehand, so you can modify the schema as you go. Like we said, this stack allows you to query and write millions of rows per second, since we're on the bleeding edge of throughput through the use of this columnar store, and InfluxDB is a database purpose-built for handling time series data at a massive scale. So what I wanted to do is take a step back and highlight why leveraging Apache Arrow specifically is so important, because it is the way that we define our data in memory in this columnar format. The reason why columnar data works so well for InfluxDB and helps us achieve this throughput is this: imagine that we had this data that we were writing to InfluxDB. This just happens to be the ingest format for InfluxDB, line protocol.
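Since the slide with the sample data isn't reproduced in this transcript, here is a rough, made-up illustration of that ingest format with a hypothetical machine_data measurement; it also illustrates the schema-on-write point from a moment ago, since a later point can simply introduce a new field without any migration:

```python
# Hypothetical line protocol points (placeholders, not the talk's slide).
# Anatomy: measurement,tag_set field_set timestamp(ns)
points = [
    "machine_data,machine_id=M1 temperature=72.1,vibration=0.12 1718035200000000000",
    "machine_data,machine_id=M1 temperature=72.3,vibration=0.11 1718035201000000000",
    # Schema on write: this point adds a brand-new "pressure" field on the fly,
    # without declaring it ahead of time.
    "machine_data,machine_id=M1 temperature=72.2,vibration=0.12,pressure=101.3 1718035202000000000",
]
for p in points:
    print(p)
```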
Well, we might have different measurements, which are essentially tables, and we want to write different fields with a timestamp and some metadata. So this is maybe what a table would look like once we wrote this data. And you can imagine that if we wanted to query this data for, say, the max value of field one, and we stored our data in a row format, we would run into some problems: we would have to iterate through every single row and every single column in order to find the max value associated with one column. But if you store things in a columnar fashion instead, the data is laid out column by column. The other thing to note is that neighboring values are the same data type, and oftentimes the same value itself. This is especially true for time series data when you are monitoring things like your environment, where you're maybe measuring the temperature at a one-minute interval; oftentimes the temperature in a specific environment stays the same for periods of time. So what does this mean? It provides the opportunity for really cheap compression, which enables the really high cardinality, high volume use cases that we're talking about. It also enables faster scans by using SIMD instructions. Depending on how your data is stored, you may only have to look at the one column of data to find the max value of a particular field. Contrast that to row-oriented storage, where you have to look at every field, every tag, and every timestamp in order to find the max field value for that one column. So that's just a little sidebar on why columnar storage specifically suits something like InfluxDB, and why Arrow is really purpose-built for InfluxDB. So now let's talk about actually leveraging the Arrow Flight client with InfluxDB, and also the Python client library. This is what the Python Arrow Flight client looks like. To install it, you do something like pip install pyarrow, and you can query a database like InfluxDB v3, or any other database that leverages Arrow Flight. What you do is first create a Flight client by passing in the URL and instantiating it after you've imported the library. Then you write a JSON ticket that has the necessary information for running the query on a specific platform, including the namespace name, the SQL query that you want to run, and the query type. So notice here how we are specifying the SQL query type. If you're querying InfluxDB specifically, you could easily switch this between SQL and InfluxQL, because you can query InfluxDB with both. And this makes Arrow Flight a convenient tool for querying InfluxDB, or any other database, just because you have that option to specify the query type, and this boilerplate query is basically database agnostic. And this is what it would look like to use the Arrow Flight SQL client. The Arrow Flight SQL client basically just wraps the Arrow Flight client, and the protocol here is pretty similar. We instantiate a Flight SQL client that's configured for a particular database. Then we execute a query to retrieve the flight info, we extract a ticket for retrieving the data, and we use that ticket to request an Arrow data stream. In this example, we get back the FlightStreamReader for the streaming results, read all the data into a PyArrow table, and print that table.
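To make that concrete, here is a minimal sketch of querying InfluxDB v3 with the plain PyArrow Flight client, following the flow just described. The host, token, database name, and the home table in the SQL are placeholders, not values from the talk, and the authorization header and exact JSON key spellings are assumptions consistent with how the ticket is described above:

```python
import json

from pyarrow import flight

# Placeholder connection details; swap in your own region URL, token, and database.
HOST = "grpc+tls://us-east-1-1.aws.cloud2.influxdata.com:443"
TOKEN = "my-influxdb-token"
DATABASE = "my-database"

# 1. Instantiate a Flight client against the server's gRPC endpoint.
client = flight.FlightClient(HOST)

# 2. Pass the auth token as a gRPC header on every call.
options = flight.FlightCallOptions(
    headers=[(b"authorization", f"Bearer {TOKEN}".encode())]
)

# 3. The ticket is a JSON payload naming the database, the query,
#    and the query type ("sql" here; "influxql" also works against InfluxDB v3).
ticket_payload = json.dumps(
    {
        "namespace_name": DATABASE,
        "sql_query": "SELECT * FROM home WHERE time >= now() - interval '1 hour'",
        "query_type": "sql",
    }
)

# 4. do_get() streams Arrow record batches; read_all() collects them
#    into a single in-memory PyArrow table.
reader = client.do_get(flight.Ticket(ticket_payload.encode()), options)
table = reader.read_all()
print(table)
```

And here is a rough equivalent for the Arrow Flight SQL client. This sketch assumes the flightsql-dbapi package (pip install flightsql-dbapi); again, the connection details and table name are placeholders rather than the talk's slide code:

```python
from flightsql import FlightSQLClient

# Placeholder connection details.
client = FlightSQLClient(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="my-influxdb-token",
    metadata={"database": "my-database"},
)

# execute() returns FlightInfo describing where the result lives.
info = client.execute("SELECT * FROM home WHERE time >= now() - interval '1 hour'")

# Each endpoint in the FlightInfo carries a ticket, which is what we redeem
# for the actual Arrow data stream.
ticket = info.endpoints[0].ticket

# do_get() returns a stream reader; read_all() pulls it into a PyArrow table.
reader = client.do_get(ticket)
table = reader.read_all()
print(table)
```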
So another important thing to note about the Flight SQL client specifically is that it returns a stream of data, and how those streams of data are returned differs a good amount between languages. That's just something to keep in mind. It can be a little bit harder to use and a little bit less flexible, especially if you want to query a variety of different databases and use the same boilerplate in your Python script against any data stores that leverage Arrow Flight. And then this is what it looks like to use the InfluxDB v3 Python client library. This basically just wraps the Arrow Flight client. We first import our library like we did before. Then we initialize our client for InfluxDB, which includes providing the token, the URL that InfluxDB is running on, and the database name that we are querying data from. We initialize that client, we provide a SQL query, and then we actually return our query results. In the query method, you can specify whether you want to use SQL or InfluxQL. You can also specify the mode that you want your data returned in; it supports both Polars and pandas, so you can return a Polars or pandas DataFrame directly, and that just increases interoperability with a bunch of Python libraries that you can use for things like anomaly detection and forecasting.
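As a rough illustration of that client, here is a minimal sketch using the influxdb3-python package (pip install influxdb3-python). The host, token, database, and table name are placeholders; the exact code shown on the talk's slides isn't reproduced here:

```python
from influxdb_client_3 import InfluxDBClient3

# Placeholder connection details.
client = InfluxDBClient3(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="my-influxdb-token",
    database="my-database",
)

# language switches between "sql" and "influxql"; mode controls the return
# type: "pandas" and "polars" hand back a DataFrame directly, while the
# default returns a PyArrow table.
df = client.query(
    query="SELECT * FROM home WHERE time >= now() - interval '1 hour'",
    language="sql",
    mode="pandas",
)
print(df.head())
```

With mode="polars", the same call would hand back a Polars DataFrame instead, which is the interoperability point made above.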
So before I go, I want to talk about some projects that leverage the Arrow Flight client. The first one involves using Grafana and InfluxDB. One thing we did was contribute a plugin to Grafana, the Flight SQL plugin, and you can use that Flight SQL connector to connect any data store that leverages Arrow Flight to Grafana. This is part of what we mean by our commitment to open data architecture: by contributing this plugin, we provide this benefit to a number of different open source tools that leverage Arrow Flight. I also want to encourage you to visit the InfluxCommunity organization on GitHub. That organization has a bunch of different POCs and demos for how to use InfluxDB with a variety of different tools. One repo in particular is the Grafana Quickstart, which shows you how to use that Flight SQL plugin, how to build dashboards and query InfluxDB using SQL, and also how to leverage the InfluxDB v3 Grafana plugin, which is specific to just InfluxDB. Another fun project is one that uses InfluxDB and Mage. Mage is an open source data pipelining tool for transforming and integrating data. In essence, you can think of it as an open source alternative to Apache Airflow; it contains a UI that simplifies the ETL creation process, and it has documentation on how to deploy these pipelines on AWS, Azure, DigitalOcean, and GCP. I think they provide both Terraform and Helm charts, so you can leverage those. One specific demo that we have in the InfluxCommunity organization is one on anomaly detection, where we have some machines generating different data, and we use half-space trees to identify anomalous behavior in the machine data. It actually lets you use a little interface to generate anomalies in real time and also get alerts on those anomalies. So I encourage you to try it yourself. Another really cool project is one that uses Quix. Quix is similar to Mage in that it's also a solution for building, deploying, and monitoring event streaming applications. But the cool thing is that it uses Kafka under the hood and allows you to control all the elements of your pipeline with just Python. It's specifically designed for processing time series data, and it comes in both cloud and on-prem offerings. It also offers a UI that makes building these ETL pipelines much simpler. The really cool thing about it is that it's specifically built to handle event streaming and leverages Kafka so that you don't have to be a domain expert to use it. This particular demo also leverages HiveMQ, which is an MQTT broker. So we get data from a lot of different machines using MQTT, pass it into the HiveMQ broker, then centralize all that data in Quix and perform anomaly detection there. Quix also has a built-in integration with Hugging Face, which is basically like a GitHub on steroids for data scientists, where they can publish their algorithms and anomaly detection or forecasting tools. So Quix does all of the anomaly detection after pulling data out of InfluxDB, where all of the MQTT data is written directly, and it performs this anomaly detection and then also creates alerts. In this example, we use an autoencoder instead, which is a type of unsupervised neural network. And this is essentially what the architecture for this project looks like: we have all of these machines, and these robot arms in this example, generating data and pushing that data to HiveMQ with MQTT, and then we use MQTT clients to write the data to InfluxDB. We take advantage of Quix to actually perform the querying and apply the machine learning model, which is stored in Hugging Face, and then we use Grafana to visualize all the data. The cool thing about this solution architecture is that while we're only performing this on three types of machines, it could easily be scaled out for a real-world use case. So in general, I recommend that you check out this demo, and also the variety of other demos that exist at InfluxCommunity, if you want to learn more about how we're leveraging Apache Arrow, DataFusion, Parquet, and Arrow Flight to increase interoperability with other tools and provide solutions like some of the projects I mentioned here today. Last but not least, I want to encourage you to join the InfluxDB community. You can get started with InfluxDB by visiting influxdata.com, and you can also check out our documentation. InfluxDB University is a resource that offers free and live training on a variety of different topics, including some of the projects that I mentioned today. And please join our community: you can join our community Slack at influxcommunity.slack.com, or our Discourse forums as well. It's entirely up to you what you want to use. Thank you so much.
...

Anais Dotis-Georgiou

Developer Advocate @ InfluxData


