Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Hi, let's talk about polars. So again, I'm Matt Harrison.
            
            
            
              I'm the author of effective pandas Machine Learning, Pocket reference
            
            
            
              and illustrate guide to Python three among other books.
            
            
            
              I'm an advisor to a company called Ponder that creates an
            
            
            
              enterprise version of pandas that lets you run pandas
            
            
            
              on your bigquery or
            
            
            
              on your snowflake data sets. And I
            
            
            
              also do corporate training. So I teach people Python and teach people data science.
            
            
            
              So I'm excited to talk about Polars today's.
            
            
            
              Okay, quick history of polars or relevant background.
            
            
            
              So my background is I've worked with data since
            
            
            
              1999. In 2006, I created my own OLAP
            
            
            
              engine in Python. Later I heard about pandas and I
            
            
            
              started using pandas. In 2016, I wrote a book
            
            
            
              about pandas. 2019 did some spark training.
            
            
            
              2020, wrote the second edition of the Pandas cookbook.
            
            
            
              In 2021, I wrote a book called Effective Pandas.
            
            
            
              And recently I've done a bunch of training around QDF
            
            
            
              mode and pullers as well.
            
            
            
              So I've got a bunch of opinions here and I'm going to just walk you
            
            
            
              through a data set and we're going to look at the data
            
            
            
              set and look at the types, talk about this thing called
            
            
            
              chaining, we'll talk about function application and
            
            
            
              we'll talk about aggregation as well.
            
            
            
              Okay, so let's run our data. Okay,
            
            
            
              so we're going to load our libraries here.
            
            
            
              I'm running Polar's version zero point 17
            
            
            
              and I'm going to download my data. This is vehicle
            
            
            
              data from the US government about cars and
            
            
            
              their mileage. So it looks something like this. We've got 41,000 rows
            
            
            
              and 83 columns. So we're going to look and there's a bunch of
            
            
            
              columns in here. I'm not concerned with all of those, but our initial data
            
            
            
              size of the, this is like 33 megs. Okay, so I'm
            
            
            
              going to walk through this data set and one of the things you want to
            
            
            
              do is get the types right. I'm going to use a subset of the columns
            
            
            
              here. So instead of all 83 columns, I'm just going to use this subset here.
            
            
            
              And you can see that we've got various types of columns in this polar's
            
            
            
              data frame. If you're familiar with pandas, this looks pretty familiar.
            
            
            
              We've got Int 64s, float 64s,
            
            
            
              UTF eight. And one thing to note is
            
            
            
              that polars does have a native stream type,
            
            
            
              which in pandas one wasn't the case. We look at the size
            
            
            
              of this, we're looking at around eight megs for this. So what I'd do is
            
            
            
              I'd just go through this data. Let's pull
            
            
            
              out the integers. So I'll do something like this. I'm going to say pull out
            
            
            
              the columns I'm interested in and then select the columns that are in 64.
            
            
            
              So these are the columns that are in 64. We've got city mileage,
            
            
            
              combined mileage, highway mileage, cylinders, the fuel
            
            
            
              cost per year, the range, that's how far an electric car will go and the
            
            
            
              year. I might want to do a describe on this. And this looks very
            
            
            
              similar to what we'd see in pandas. Here we get summary statistics of this.
            
            
            
              Now one thing to note about polars is if you're familiar with pandas,
            
            
            
              pandas has an index. We don't have an index here, we just have
            
            
            
              column names and we have a bunch of columns.
            
            
            
              Now I wouldn't write this like this. I would write it like this.
            
            
            
              Second option here, it's the same code, but I put parentheses around it
            
            
            
              and the parentheses allow us to not
            
            
            
              worry about white space. So what I can do is put each operation on
            
            
            
              its own line. You can see that that gives me the same result.
            
            
            
              Now one things to be aware of is you can see that the types for
            
            
            
              each of these. And I probably
            
            
            
              don't need to use a floating point, a 64
            
            
            
              bit floating point for city mileage that goes up to, it looks like 150.
            
            
            
              Combined mileage goes up to 136. So I could probably reduce
            
            
            
              these cylinders goes up to 16. And I'm going to
            
            
            
              just see if I could use like an int eight,
            
            
            
              an eight bit integer. And you can see that that goes from like negative 128
            
            
            
              to 127. There's also an unsigned eight bit integer which
            
            
            
              goes up to 255, which might be more appropriate. So I'm going to say let's
            
            
            
              take my data frame, pull out the columns I want, and then we're going to
            
            
            
              use this with columns method and we're going to try and convert
            
            
            
              the combined mileage to an int eight. And then we'll do a describe
            
            
            
              on that. When I do that I get an error and the error here
            
            
            
              says that this strict conversion failed. We've got values like
            
            
            
              131, 31 and that did not work.
            
            
            
              So instead of doing that with an int eight, let's try unsigned
            
            
            
              integer eight. And looks like that does indeed work. So this is kind of nice.
            
            
            
              Pandas one didn't check those types, it would just allow you to do
            
            
            
              those. We can use other types as well. So int 16.
            
            
            
              So I might do something like this where I say let's take the city mileage,
            
            
            
              the combined mileage, the highway mileage, the cylinders and displacement,
            
            
            
              convert those to unsigned eight and then
            
            
            
              range, fuel, cost and year. We'll convert that to
            
            
            
              unsigned 16 bit integers and we'll look at the size.
            
            
            
              We're now down to about 5.8 megabytes
            
            
            
              just by doing those operations without losing any of
            
            
            
              our data. Okay, let's look at strings
            
            
            
              here and
            
            
            
              I'm going to just select the string types. So again
            
            
            
              you can see that we've got a native string type here. We've got
            
            
            
              drive, which appears to be categorical. We've got this engine
            
            
            
              description column which looks like it's pseudocategorical.
            
            
            
              It's got a bunch of different entries in there and parentheses and commas,
            
            
            
              probably free form text make looks categorical.
            
            
            
              Model is probably categorical.
            
            
            
              Tranny actually looks like it has two pieces of information, whether it's manual or
            
            
            
              automatic, and the number of speeds. And then this created on actually looks
            
            
            
              like a date. So we'll look at these and try and deal with these.
            
            
            
              Okay, so let's just look at our size.
            
            
            
              Remember we have 5.8 megs and I'm going to convert
            
            
            
              drive, make and model to categoricals and look at our size
            
            
            
              after that. We're down to four megs by doing that. So categorical
            
            
            
              is just a string value that doesn't
            
            
            
              repeat itself a lot. That looks pretty good.
            
            
            
              Let's look at the FFS column here.
            
            
            
              I'm going to look at these remaining ones. So I've got engine description
            
            
            
              that has FFS, I've got transmission. We want to pull that
            
            
            
              into two, the manual and the speeds and then create it on. We'll deal with
            
            
            
              that when we talk about dates. So I'm just
            
            
            
              going to make a column here. This is an expression so I
            
            
            
              can pull up the documentation and you can see that we have like
            
            
            
              off of a column expression, we have this Str and off of Str
            
            
            
              we have various things that we can do. You can see that we can extract
            
            
            
              values from that. So I'm going to do an extraction here.
            
            
            
              I'm going to say, let's extract. Then I'm using a raw stream to say in
            
            
            
              parentheses, backslash, d plus, that will match all the
            
            
            
              numeric values inside of the transmission. And then I'm
            
            
            
              also saying alias that as the speeds column.
            
            
            
              And then from that same tranny column I'm going to see whether
            
            
            
              it contains the manual string and alias that as the manual
            
            
            
              column. And it looks like that does indeed work.
            
            
            
              If I do that okay,
            
            
            
              let's look at the columns after I do that. These are the columns that I
            
            
            
              have. And what I'm going to do is I'm going to change my code a
            
            
            
              little bit here. I'm going to use that columns I
            
            
            
              just printed out here and put it in the select method down below that
            
            
            
              to select those columns. Let's look at our estimated size. We're now
            
            
            
              down to 2.5 megabytes. Now we might be missing
            
            
            
              some values. So I'm going to use a filter method here and I'm going to
            
            
            
              say where is drive null? Let's run that. And you can
            
            
            
              see here are the values where it's null. Looks like a lot of those are
            
            
            
              electric vehicles. So what am I going to do here?
            
            
            
              I'm going to say, okay, in drive fill null with other
            
            
            
              that's in this value right here. You can see this.
            
            
            
              And we're also going to say and put
            
            
            
              in engine description. You can
            
            
            
              see the engine description is whether we contain it in ffs.
            
            
            
              Let's just run that and see if that works.
            
            
            
              So do we have a drive here?
            
            
            
              And it looks like drive worked there.
            
            
            
              It's not complaining. And we've got some missing cylinders as
            
            
            
              well. So let's just find out where those are missing. And it
            
            
            
              looks like those are electric vehicles as well. So I'm just going to come in
            
            
            
              here and save cylinders. If we have missing values,
            
            
            
              fill those in with zero and cast that to a un eight. And that
            
            
            
              looks like that works as well. Okay, let's look
            
            
            
              at our dates. To do this, I'm going to have
            
            
            
              to clean up this date column a little bit here.
            
            
            
              And I need to know about this replace method here.
            
            
            
              So I'm going to use this replace method on created on. If I try
            
            
            
              and convert the dates, it's not going to like those EDT and EST.
            
            
            
              So I'm just going to replace those with the time zone offsets there.
            
            
            
              And then I'm going to call the stir stir p time to convert it
            
            
            
              to a date time and say UTC is equal to true.
            
            
            
              You can see that created on now says that it is a date time.
            
            
            
              So that's pretty cool. If I want to convert this to
            
            
            
              a New York City time zone, after I've converted it to a date time,
            
            
            
              I can say DT convert time zone. And that looks like that
            
            
            
              works as well. Okay. At this point my data
            
            
            
              is looking pretty good. I'm going to convert this into a function here. And here
            
            
            
              is my function. So one of the things I like to do is work
            
            
            
              with the raw data here you can see that I've chained up my
            
            
            
              operations starting from autos and then doing all these
            
            
            
              operations as I've been creating these. I've been checking them to see that
            
            
            
              it works along the way. This is my recommended way and
            
            
            
              the polar's recommended way. We call this chaining,
            
            
            
              and it's also called flow programming. And you
            
            
            
              can see how I created this chain as I was going through. Now, one of
            
            
            
              the things that Polars gives us, that pandas doesn't give us is polar's
            
            
            
              gives us the ability to be lazy. And it actually has
            
            
            
              a query engine that will optimize what's going on there.
            
            
            
              So here I'm saying make a function called tweak autos lazy.
            
            
            
              It's actually taking in a path. This is
            
            
            
              a path to the file. And then you can see I'm saying scan CSV and
            
            
            
              then I'm saying lazy. And then down here I'm saying collect.
            
            
            
              So what this is going to do is it's not going to execute anything until
            
            
            
              I run collect. And then it's going to look at
            
            
            
              the columns that I use down here and it's only going to read from the
            
            
            
              CSV, the column that I specified down
            
            
            
              here. So even though they're 83 columns, it's not going to read everything. And it
            
            
            
              can do other things like predicate, push down, et cetera. But you can
            
            
            
              run this and see that it does work. So this is one of the reasons
            
            
            
              you'll want to chain with polars, because it can actually
            
            
            
              make your queries run faster when you do that. Now, a lot of people
            
            
            
              say this isn't easy to debug. I disagree with that. I mean,
            
            
            
              if I want to debug this, I can, I can stick in a pipe here.
            
            
            
              Pipe allows you to take the current state of the data frame and just return
            
            
            
              whatever you want. So here I'm just printing out the shape and then returning the
            
            
            
              data frame using a short circuit. You can see that I'm doing that before this
            
            
            
              operation. I'm doing that down here and I'm doing this at the end.
            
            
            
              So I should be able to see how big each of these are.
            
            
            
              I'm also piping in the git bar here if you want the
            
            
            
              intermediate state. This is just using the globals from Python to
            
            
            
              make a variable called DF two with the intermediate state right there.
            
            
            
              Let's run that and see what happens.
            
            
            
              Okay, you can see that we printed out the size as we're going through
            
            
            
              here. And if we check DF two, this is that intermediate state
            
            
            
              if we wanted to. So I don't think that chaining
            
            
            
              is necessarily hard to debug. It's just something that
            
            
            
              as you're building up your code, you will debug it as you're going. And then
            
            
            
              if you put it into a function, that makes it really easy to use.
            
            
            
              Now another thing you want to be aware of is function application.
            
            
            
              So I'm going to make this variable called autos two, which is our cleaned
            
            
            
              up data set. And this is again pretty us centric here. Let's try
            
            
            
              and convert this to liters per 100. We could
            
            
            
              do this by saying apply on that column and that looks like that
            
            
            
              is working. Our city is converted to liters
            
            
            
              per 100. However, we can get the same result by doing this
            
            
            
              instead by just saying 235 divided by that column.
            
            
            
              Now when you're doing function application,
            
            
            
              you need to be aware that that can be a slow process because what you're
            
            
            
              doing is you're taking your data out of the backing store.
            
            
            
              The case of polars, it's using the
            
            
            
              arrow has a rust implementation of the arrow memory
            
            
            
              representation of this. And you're crossing that boundary, converting each
            
            
            
              individual value to a python value, running the function and sticking
            
            
            
              it back in. So you can see that this
            
            
            
              took in this case four milliseconds to run the slow code
            
            
            
              and 150 microseconds
            
            
            
              to run the fast code. The last time I run this, it's similar order
            
            
            
              of magnitude there, almost 300 times slower to do
            
            
            
              the function application. So just be aware of that. You're going to
            
            
            
              want to stay away from that function application if
            
            
            
              you can.
            
            
            
              Okay. Another thing you'll want to do is master this thing called aggregation.
            
            
            
              And that's like a pivot table or a grouping if
            
            
            
              you're familiar with Excel or SQL.
            
            
            
              And so here we go. We're going to say let's get the mean
            
            
            
              by year and what you want to do is group by the
            
            
            
              year column and then we can do the mean of that. You can see this
            
            
            
              makes a column called year that has each year. And then
            
            
            
              for each value we have the numeric value for that.
            
            
            
              This is super powerful and it's basically one line
            
            
            
              of code, but I've written it as this chain. Now one of the things
            
            
            
              I do like to do is visualize that. So I'm going to
            
            
            
              load some plotting code here. Now, another way to do this, if I don't want
            
            
            
              to get the mean of everything, I can specify the
            
            
            
              columns that I want using this Ag method here. So here
            
            
            
              I'm saying pull off the combined mileage column
            
            
            
              and the speeds column and take the mean of both of those.
            
            
            
              Now, one of the things that pullers doesn't give you that pandas does is the
            
            
            
              ability to plot. So if I want to plot this, I'm generally
            
            
            
              going to stick it back into pandas. So here I call two pandas, which actually
            
            
            
              gives me a data frame.
            
            
            
              And because I don't have an index there,
            
            
            
              I'm going to stick the year into the index and then I'm going
            
            
            
              to call plot on that. And when I do this, the plot is kind of
            
            
            
              ugly. The issue here is that if you look at the data,
            
            
            
              the index is not sorted because of the optimizations that polars
            
            
            
              makes. It tries to run as fast as possible. So I'm going
            
            
            
              to just stick in a pandas sort index there. And you can see that we
            
            
            
              get a pretty plot coming out of this, allowing us to quickly see around the
            
            
            
              year 2010 or so, the combined mileage shot up
            
            
            
              quite a bit for the average value there.
            
            
            
              Okay, here I'm going to group by year and I'm going to,
            
            
            
              instead of taking the mean, I'm going to take the standard deviation.
            
            
            
              So once you've got these figured out, it's really easy to change mean for
            
            
            
              standard deviation or for other operations that you
            
            
            
              might want to do. Now in this case I'm going to try and
            
            
            
              add a country and so we might want to know about this win
            
            
            
              operation and we're
            
            
            
              going to use the is in method
            
            
            
              on a column as well. So here I'm going to try and add
            
            
            
              a country column and I'm going to say with columns and I'm going
            
            
            
              to say PL win. This is how you do an if statement and we're going
            
            
            
              to say take that make column. And if it's in Chevy,
            
            
            
              Ford, Dodge, GMC or Tesla, then I want a
            
            
            
              value of us, otherwise I want a value of other and
            
            
            
              we're going to call that country. And when I run this, I get an error
            
            
            
              and the error is that it didn't like that I used a categorical
            
            
            
              and did this on a categorical. So to get around this, I'm going to
            
            
            
              actually cast this back to UTF eight and run this. I should get
            
            
            
              a country column now. And then I'm going to say group by. We're going to
            
            
            
              group by both year and country, then we're going to aggregate that and here's the
            
            
            
              aggregation for that. Now in this case, if I wanted to
            
            
            
              plot this by year, I would have to basically pull
            
            
            
              out that country and I could do that in pandas.
            
            
            
              But if I wanted to do that in polars,
            
            
            
              I actually want to use what's called a pivot to do that. So here I'm
            
            
            
              going to say let's make our column, our country column and then we're going to
            
            
            
              call pivot. We're going to stick year into the index even though polars
            
            
            
              doesn't have an index. When we say pivot, we can specify what we want in
            
            
            
              the year. And here we're going to say the values are the combined
            
            
            
              mileage and the speeds and the columns is the country column
            
            
            
              and we're going to aggregate that with the mean function.
            
            
            
              So this columns here is going to take the values of country and stick them
            
            
            
              up into the columns. Let me just maybe run this for you by
            
            
            
              commenting out these other values so you can see what's going on here.
            
            
            
              And then we'll just step through the chain. Okay, so this is the result
            
            
            
              that we get from doing this. And then
            
            
            
              if we convert this to pandas, it looks pretty similar. We need an index here.
            
            
            
              So we're going to stick the year into the index. It looks semi
            
            
            
              sorted, but it's not. So we're going to sort it and then we'll call
            
            
            
              plot on this. That looks pretty good, but the legends in the middle. So we'll
            
            
            
              just stick the legend off onto the side there.
            
            
            
              Okay, so this lets us look and see that around
            
            
            
              2008 or so, we did see a bump up in the
            
            
            
              combined mileage. Also looks like speeds has
            
            
            
              an upward trend or bend at that point as well.
            
            
            
              Okay, so we've looked at the polar's library.
            
            
            
              A couple of things to note about this. If you
            
            
            
              change your types, the correct types will allow you to save space,
            
            
            
              which can allow you to load more data and run things
            
            
            
              at a quicker clip. Chaining operations
            
            
            
              is going to make your code more readable. It's going to make it look like
            
            
            
              a recipe of operations. I also think it removes bugs and makes it
            
            
            
              easier to debug,
            
            
            
              unlike pandas. Pandas, I recommend chaining as
            
            
            
              well. But polars, you actually get an additional benefit from chaining because
            
            
            
              it does have a query planner and it can do things like predicate, push down,
            
            
            
              et cetera. So you will want to chain
            
            
            
              in polars to make the most of it. And then remember
            
            
            
              that that function application is slow for math
            
            
            
              operations. You want to try and avoid crossing that,
            
            
            
              what I would call the rust to python boundary there.
            
            
            
              And aggregations are super powerful. If you're not familiar with
            
            
            
              them, play around with them. Once you get down the pattern of
            
            
            
              them. They're going to be super powerful and help you answer a lot of questions
            
            
            
              that you might have. And then I do like to visualize
            
            
            
              polars does not have visualization, so I'm going to jump back to pandas to
            
            
            
              visualize that. Well, I hope you've enjoyed this
            
            
            
              quick introduction to polars. It's a super powerful library
            
            
            
              and progressing fast. It has a lot of
            
            
            
              capabilities and is a lot faster than pandas
            
            
            
              for a lot of operations. So check this out if you need
            
            
            
              an option for pandas. Pandas isn't working for you. Polar's, I think,
            
            
            
              is in a great place. If you're interested in more content like this,
            
            
            
              you can follow me on Twitter Dunder M. Harrison, where I talk a lot
            
            
            
              about Python and data science science. Have a
            
            
            
              great rest of your day and enjoy the conference. Thanks everyone.