Conf42 Machine Learning 2023 - Online

Polars: A highly optimized dataframe library

Video size:

Abstract

Get ready to revolutionize your data analysis with Polars - the newest, most highly optimized dataframe library on the market! In this talk, we’ll introduce you to the power of Polars and show you how it compares to the popular Pandas library.

Summary

  • Matt Harrison is the author of effective pandas Machine Learning, Pocket reference and illustrate guide to Python three among other books. He is an advisor to a company called Ponder that creates an enterprise version of pandas that lets you run pandas on your bigquery or on your snowflake data sets. He talks about Polars today.
  • The data set is vehicle data from the US government about cars and their mileage. We're going to look at the types, talk about this thing called chaining and talk about function application and we'll talk about aggregation as well.
  • Polar's gives us the ability to be lazy. And it actually has a query engine that will optimize what's going on there. We call this chaining, and it's also called flow programming. It can actually make your queries run faster when you do that.
  • Use a pivot table or a grouping if you're familiar with Excel or SQL. Now, one of the things that pullers doesn't give you that pandas does is the ability to plot. Once you've got these figured out, it's really easy to change mean for standard deviation.
  • Polars is a super powerful library and progressing fast. It has a lot of capabilities and is a lot faster than pandas for a many operations. Check this out if you need an option for pandas.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, let's talk about polars. So again, I'm Matt Harrison. I'm the author of effective pandas Machine Learning, Pocket reference and illustrate guide to Python three among other books. I'm an advisor to a company called Ponder that creates an enterprise version of pandas that lets you run pandas on your bigquery or on your snowflake data sets. And I also do corporate training. So I teach people Python and teach people data science. So I'm excited to talk about Polars today's. Okay, quick history of polars or relevant background. So my background is I've worked with data since 1999. In 2006, I created my own OLAP engine in Python. Later I heard about pandas and I started using pandas. In 2016, I wrote a book about pandas. 2019 did some spark training. 2020, wrote the second edition of the Pandas cookbook. In 2021, I wrote a book called Effective Pandas. And recently I've done a bunch of training around QDF mode and pullers as well. So I've got a bunch of opinions here and I'm going to just walk you through a data set and we're going to look at the data set and look at the types, talk about this thing called chaining, we'll talk about function application and we'll talk about aggregation as well. Okay, so let's run our data. Okay, so we're going to load our libraries here. I'm running Polar's version zero point 17 and I'm going to download my data. This is vehicle data from the US government about cars and their mileage. So it looks something like this. We've got 41,000 rows and 83 columns. So we're going to look and there's a bunch of columns in here. I'm not concerned with all of those, but our initial data size of the, this is like 33 megs. Okay, so I'm going to walk through this data set and one of the things you want to do is get the types right. I'm going to use a subset of the columns here. So instead of all 83 columns, I'm just going to use this subset here. And you can see that we've got various types of columns in this polar's data frame. If you're familiar with pandas, this looks pretty familiar. We've got Int 64s, float 64s, UTF eight. And one thing to note is that polars does have a native stream type, which in pandas one wasn't the case. We look at the size of this, we're looking at around eight megs for this. So what I'd do is I'd just go through this data. Let's pull out the integers. So I'll do something like this. I'm going to say pull out the columns I'm interested in and then select the columns that are in 64. So these are the columns that are in 64. We've got city mileage, combined mileage, highway mileage, cylinders, the fuel cost per year, the range, that's how far an electric car will go and the year. I might want to do a describe on this. And this looks very similar to what we'd see in pandas. Here we get summary statistics of this. Now one thing to note about polars is if you're familiar with pandas, pandas has an index. We don't have an index here, we just have column names and we have a bunch of columns. Now I wouldn't write this like this. I would write it like this. Second option here, it's the same code, but I put parentheses around it and the parentheses allow us to not worry about white space. So what I can do is put each operation on its own line. You can see that that gives me the same result. Now one things to be aware of is you can see that the types for each of these. And I probably don't need to use a floating point, a 64 bit floating point for city mileage that goes up to, it looks like 150. Combined mileage goes up to 136. So I could probably reduce these cylinders goes up to 16. And I'm going to just see if I could use like an int eight, an eight bit integer. And you can see that that goes from like negative 128 to 127. There's also an unsigned eight bit integer which goes up to 255, which might be more appropriate. So I'm going to say let's take my data frame, pull out the columns I want, and then we're going to use this with columns method and we're going to try and convert the combined mileage to an int eight. And then we'll do a describe on that. When I do that I get an error and the error here says that this strict conversion failed. We've got values like 131, 31 and that did not work. So instead of doing that with an int eight, let's try unsigned integer eight. And looks like that does indeed work. So this is kind of nice. Pandas one didn't check those types, it would just allow you to do those. We can use other types as well. So int 16. So I might do something like this where I say let's take the city mileage, the combined mileage, the highway mileage, the cylinders and displacement, convert those to unsigned eight and then range, fuel, cost and year. We'll convert that to unsigned 16 bit integers and we'll look at the size. We're now down to about 5.8 megabytes just by doing those operations without losing any of our data. Okay, let's look at strings here and I'm going to just select the string types. So again you can see that we've got a native string type here. We've got drive, which appears to be categorical. We've got this engine description column which looks like it's pseudocategorical. It's got a bunch of different entries in there and parentheses and commas, probably free form text make looks categorical. Model is probably categorical. Tranny actually looks like it has two pieces of information, whether it's manual or automatic, and the number of speeds. And then this created on actually looks like a date. So we'll look at these and try and deal with these. Okay, so let's just look at our size. Remember we have 5.8 megs and I'm going to convert drive, make and model to categoricals and look at our size after that. We're down to four megs by doing that. So categorical is just a string value that doesn't repeat itself a lot. That looks pretty good. Let's look at the FFS column here. I'm going to look at these remaining ones. So I've got engine description that has FFS, I've got transmission. We want to pull that into two, the manual and the speeds and then create it on. We'll deal with that when we talk about dates. So I'm just going to make a column here. This is an expression so I can pull up the documentation and you can see that we have like off of a column expression, we have this Str and off of Str we have various things that we can do. You can see that we can extract values from that. So I'm going to do an extraction here. I'm going to say, let's extract. Then I'm using a raw stream to say in parentheses, backslash, d plus, that will match all the numeric values inside of the transmission. And then I'm also saying alias that as the speeds column. And then from that same tranny column I'm going to see whether it contains the manual string and alias that as the manual column. And it looks like that does indeed work. If I do that okay, let's look at the columns after I do that. These are the columns that I have. And what I'm going to do is I'm going to change my code a little bit here. I'm going to use that columns I just printed out here and put it in the select method down below that to select those columns. Let's look at our estimated size. We're now down to 2.5 megabytes. Now we might be missing some values. So I'm going to use a filter method here and I'm going to say where is drive null? Let's run that. And you can see here are the values where it's null. Looks like a lot of those are electric vehicles. So what am I going to do here? I'm going to say, okay, in drive fill null with other that's in this value right here. You can see this. And we're also going to say and put in engine description. You can see the engine description is whether we contain it in ffs. Let's just run that and see if that works. So do we have a drive here? And it looks like drive worked there. It's not complaining. And we've got some missing cylinders as well. So let's just find out where those are missing. And it looks like those are electric vehicles as well. So I'm just going to come in here and save cylinders. If we have missing values, fill those in with zero and cast that to a un eight. And that looks like that works as well. Okay, let's look at our dates. To do this, I'm going to have to clean up this date column a little bit here. And I need to know about this replace method here. So I'm going to use this replace method on created on. If I try and convert the dates, it's not going to like those EDT and EST. So I'm just going to replace those with the time zone offsets there. And then I'm going to call the stir stir p time to convert it to a date time and say UTC is equal to true. You can see that created on now says that it is a date time. So that's pretty cool. If I want to convert this to a New York City time zone, after I've converted it to a date time, I can say DT convert time zone. And that looks like that works as well. Okay. At this point my data is looking pretty good. I'm going to convert this into a function here. And here is my function. So one of the things I like to do is work with the raw data here you can see that I've chained up my operations starting from autos and then doing all these operations as I've been creating these. I've been checking them to see that it works along the way. This is my recommended way and the polar's recommended way. We call this chaining, and it's also called flow programming. And you can see how I created this chain as I was going through. Now, one of the things that Polars gives us, that pandas doesn't give us is polar's gives us the ability to be lazy. And it actually has a query engine that will optimize what's going on there. So here I'm saying make a function called tweak autos lazy. It's actually taking in a path. This is a path to the file. And then you can see I'm saying scan CSV and then I'm saying lazy. And then down here I'm saying collect. So what this is going to do is it's not going to execute anything until I run collect. And then it's going to look at the columns that I use down here and it's only going to read from the CSV, the column that I specified down here. So even though they're 83 columns, it's not going to read everything. And it can do other things like predicate, push down, et cetera. But you can run this and see that it does work. So this is one of the reasons you'll want to chain with polars, because it can actually make your queries run faster when you do that. Now, a lot of people say this isn't easy to debug. I disagree with that. I mean, if I want to debug this, I can, I can stick in a pipe here. Pipe allows you to take the current state of the data frame and just return whatever you want. So here I'm just printing out the shape and then returning the data frame using a short circuit. You can see that I'm doing that before this operation. I'm doing that down here and I'm doing this at the end. So I should be able to see how big each of these are. I'm also piping in the git bar here if you want the intermediate state. This is just using the globals from Python to make a variable called DF two with the intermediate state right there. Let's run that and see what happens. Okay, you can see that we printed out the size as we're going through here. And if we check DF two, this is that intermediate state if we wanted to. So I don't think that chaining is necessarily hard to debug. It's just something that as you're building up your code, you will debug it as you're going. And then if you put it into a function, that makes it really easy to use. Now another thing you want to be aware of is function application. So I'm going to make this variable called autos two, which is our cleaned up data set. And this is again pretty us centric here. Let's try and convert this to liters per 100. We could do this by saying apply on that column and that looks like that is working. Our city is converted to liters per 100. However, we can get the same result by doing this instead by just saying 235 divided by that column. Now when you're doing function application, you need to be aware that that can be a slow process because what you're doing is you're taking your data out of the backing store. The case of polars, it's using the arrow has a rust implementation of the arrow memory representation of this. And you're crossing that boundary, converting each individual value to a python value, running the function and sticking it back in. So you can see that this took in this case four milliseconds to run the slow code and 150 microseconds to run the fast code. The last time I run this, it's similar order of magnitude there, almost 300 times slower to do the function application. So just be aware of that. You're going to want to stay away from that function application if you can. Okay. Another thing you'll want to do is master this thing called aggregation. And that's like a pivot table or a grouping if you're familiar with Excel or SQL. And so here we go. We're going to say let's get the mean by year and what you want to do is group by the year column and then we can do the mean of that. You can see this makes a column called year that has each year. And then for each value we have the numeric value for that. This is super powerful and it's basically one line of code, but I've written it as this chain. Now one of the things I do like to do is visualize that. So I'm going to load some plotting code here. Now, another way to do this, if I don't want to get the mean of everything, I can specify the columns that I want using this Ag method here. So here I'm saying pull off the combined mileage column and the speeds column and take the mean of both of those. Now, one of the things that pullers doesn't give you that pandas does is the ability to plot. So if I want to plot this, I'm generally going to stick it back into pandas. So here I call two pandas, which actually gives me a data frame. And because I don't have an index there, I'm going to stick the year into the index and then I'm going to call plot on that. And when I do this, the plot is kind of ugly. The issue here is that if you look at the data, the index is not sorted because of the optimizations that polars makes. It tries to run as fast as possible. So I'm going to just stick in a pandas sort index there. And you can see that we get a pretty plot coming out of this, allowing us to quickly see around the year 2010 or so, the combined mileage shot up quite a bit for the average value there. Okay, here I'm going to group by year and I'm going to, instead of taking the mean, I'm going to take the standard deviation. So once you've got these figured out, it's really easy to change mean for standard deviation or for other operations that you might want to do. Now in this case I'm going to try and add a country and so we might want to know about this win operation and we're going to use the is in method on a column as well. So here I'm going to try and add a country column and I'm going to say with columns and I'm going to say PL win. This is how you do an if statement and we're going to say take that make column. And if it's in Chevy, Ford, Dodge, GMC or Tesla, then I want a value of us, otherwise I want a value of other and we're going to call that country. And when I run this, I get an error and the error is that it didn't like that I used a categorical and did this on a categorical. So to get around this, I'm going to actually cast this back to UTF eight and run this. I should get a country column now. And then I'm going to say group by. We're going to group by both year and country, then we're going to aggregate that and here's the aggregation for that. Now in this case, if I wanted to plot this by year, I would have to basically pull out that country and I could do that in pandas. But if I wanted to do that in polars, I actually want to use what's called a pivot to do that. So here I'm going to say let's make our column, our country column and then we're going to call pivot. We're going to stick year into the index even though polars doesn't have an index. When we say pivot, we can specify what we want in the year. And here we're going to say the values are the combined mileage and the speeds and the columns is the country column and we're going to aggregate that with the mean function. So this columns here is going to take the values of country and stick them up into the columns. Let me just maybe run this for you by commenting out these other values so you can see what's going on here. And then we'll just step through the chain. Okay, so this is the result that we get from doing this. And then if we convert this to pandas, it looks pretty similar. We need an index here. So we're going to stick the year into the index. It looks semi sorted, but it's not. So we're going to sort it and then we'll call plot on this. That looks pretty good, but the legends in the middle. So we'll just stick the legend off onto the side there. Okay, so this lets us look and see that around 2008 or so, we did see a bump up in the combined mileage. Also looks like speeds has an upward trend or bend at that point as well. Okay, so we've looked at the polar's library. A couple of things to note about this. If you change your types, the correct types will allow you to save space, which can allow you to load more data and run things at a quicker clip. Chaining operations is going to make your code more readable. It's going to make it look like a recipe of operations. I also think it removes bugs and makes it easier to debug, unlike pandas. Pandas, I recommend chaining as well. But polars, you actually get an additional benefit from chaining because it does have a query planner and it can do things like predicate, push down, et cetera. So you will want to chain in polars to make the most of it. And then remember that that function application is slow for math operations. You want to try and avoid crossing that, what I would call the rust to python boundary there. And aggregations are super powerful. If you're not familiar with them, play around with them. Once you get down the pattern of them. They're going to be super powerful and help you answer a lot of questions that you might have. And then I do like to visualize polars does not have visualization, so I'm going to jump back to pandas to visualize that. Well, I hope you've enjoyed this quick introduction to polars. It's a super powerful library and progressing fast. It has a lot of capabilities and is a lot faster than pandas for a lot of operations. So check this out if you need an option for pandas. Pandas isn't working for you. Polar's, I think, is in a great place. If you're interested in more content like this, you can follow me on Twitter Dunder M. Harrison, where I talk a lot about Python and data science science. Have a great rest of your day and enjoy the conference. Thanks everyone.
...

Matt Harrison

Python & Data Science Corporate Trainer & Consultant @ MetaSnake

Matt Harrison's LinkedIn account Matt Harrison's twitter account



Awesome tech events for

Priority access to all content

Video hallway track

Community chat

Exclusive promotions and giveaways