Conf42 Python 2021 - Online

Idiomatic Pandas

Abstract

Pandas can be tricky, and there is a lot of bad advice floating around. This talk will cut through some of the biggest issues I’ve seen with Pandas code after working with the library for a while and writing two books on it.

Summary

  • Matt Harrison runs a company that does corporate training and consulting in Python and data science. In this talk on idiomatic pandas, he covers using the correct types for your data and other ways to make your pandas code better.
  • Getting the right types in pandas can be a challenge. But if you can do that, it's going to save you memory. It's also going to make your code easier to work with and it can save you computation time as well.
  • In this data set there are 206 entries where the cylinders are missing; they show up as floats because the lowercase int64 type does not support missing numbers. Pandas also allows you to stick Python objects into each cell, but that uses a lot more memory.
  • The object columns are drive, engine description, make, model, transmission, and created on. An object column usually represents string data, though it can also hold numeric data. Converting these to categorical types saves memory, and converting dates lets you do date operations on them.
  • Categoricals are a way to save a lot of memory in pandas. In order to use pandas, you need to be able to load your data in memory. Using the right types can allow you to process more data.
  • Jupyter makes it really nice to build a chain up incrementally. I like to make a function that takes my raw data, read directly out of a CSV file, and returns the cleaned-up version. One thing that sticks out in the transmission column is whether it's automatic or manual. We've almost halved our memory by doing these little operations.
  • Pandas generally does not mutate data; rather, it returns new data frames. The pipe call is a nice way to inline any arbitrary code. Chained code is easy to read and easy to debug.
  • In place rarely actually does something in place. You are thinking you're saving memory, but you're not. Pandas is not really doing an in-place operation, even though it says it is. Inplace also prevents chaining, which I think is going to really mess up your code.
  • The next section is: don't use apply if you can avoid it. For numeric work, vectorized operations are much faster, so avoid apply if possible and save it for string situations.
  • Pandas has two ways to do grouped aggregations: a group by or a pivot table. With a group by, a mean, and a plot, the visualization tells me a lot more. Why was there an inflection point around 2007-2008 that made things start ticking up?
  • Don't mutate. Apply is generally your enemy, so if you're trying to do math operations with apply, you're probably doing it the wrong way. Aggregations are very powerful. If you use this chain style and start building them up, it's going to help you understand what's going on.
  • I hope you enjoyed this. If you want to learn more about pandas, you can follow me on Twitter. I also have a discount code for my pandas course; it's going to be 50% off for people who are watching this.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Okay. Hi, I'm Matt Harrison and we're going to be talking about idiomatic pandas. So I'm excited to be here today. I run a company called Metasnake that does corporate training and consulting in Python and data science. Some of the reasons you might want to listen to me talking about pandas is because I've written a couple of books on pandas. I co-wrote the Pandas 1.x Cookbook, also wrote Learning the Pandas Library, and I have this Machine Learning Pocket Reference that also uses a lot of pandas as well. In addition, I've taught pandas to thousands of people over the years and used it for various consulting projects. So I've seen it basically since it was released and used it from about that time as well. I've seen a lot of good pandas code and bad pandas code. So what I want to do today is talk about some ways that you can make your pandas code better. Let me just share my notebook here, and here's the outline of what we're going to do. We're going to be looking at loading our data really quickly. Then we'll talk about using the correct types for your data. We'll talk about chaining, mutation, application or the apply method, and finally aggregation. Okay, I'm going to do my imports here at the top. We're going to be using numpy and pandas. We are using pandas version 1.2.3. I'm going to set some options here just so I can see a few more rows than the default, and load our data set. This data set is from fueleconomy.gov. It's got information about cars that were sold in the US, I believe from 1984 to 2018 or so. So it looks something like this, and it's got quite a bit of data in it: 40,000 rows and 83 columns about these cars. So this is a nice little data set that lets us explore various features with different types. If you look at the columns in here, here are some of the columns we have: how many barrels of gas does a car consume in a year? What's the city mileage, highway mileage, engine description, the drive? So a bunch of good information about cars. Okay, so that's the data. Let's dive in. The first part here is using the right types. Getting the right types in pandas can be a challenge, but if you can do that, it's going to save you memory, it's going to make your code easier to work with, and it can save you computation time as well. I'm just going to limit what we're going to do here to a few columns. Again, we had 83 columns in that data set, and I don't want all 83 columns here. So what I'm going to do is pull off the columns I want to look at and look at the dtypes attribute. This is going to give me a series that has the column names in the index and the types for those columns on the right-hand side. So you can see I've got a bunch of int64, some float64, some objects, and those are the types. One thing you might want to do is look at how much memory this is using. So here's our memory usage for each column. If we sum that up, we can see the amount of memory we're using. Just look at that number: it's around 19-something, so we'll just say it's 19 megabytes right now. And we're going to basically try and lower that if we can. So the first thing is considering our integer columns. Again, I can look at some of the integer columns here. I'm going to use select_dtypes, pass an int into it, and then chain that with the describe method here. And here are the integer types that are currently on this data frame that I have.
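In code, that setup looks roughly like the sketch below. It assumes the fueleconomy.gov extract is saved locally as vehicles.csv (the file name is an assumption), and the column list is an illustrative subset, named as those columns appear in that file:

    import numpy as np
    import pandas as pd

    # Load the fueleconomy.gov data (file name assumed for illustration).
    autos = pd.read_csv('vehicles.csv')

    # A subset of the 83 columns, named as they appear in that file.
    cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ',
            'drive', 'eng_dscr', 'fuelCost08', 'make', 'model',
            'trany', 'range', 'createdOn', 'year']

    autos[cols].dtypes                         # column names -> types
    autos[cols].memory_usage(deep=True).sum()  # total bytes; deep counts strings too

    # The chained version: grab the columns, keep only ints, describe them.
    (autos
     [cols]
     .select_dtypes(int)
     .describe()
    )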
We've got city mileage, combined mileage, highway mileage, fuel cost, the range and the year. So fuel cost is how much you would spend on fuel, range is how far an electric car will go, and city, combined and highway, that's miles per gallon: how many miles these cars go on a gallon of gas or petroleum. And then year, that's the year the car was released. Okay? Now, first of all, I probably wouldn't write this code like this. I would write it in this chain style right here. You'll see throughout the course today that I do a lot of this chaining, and I just think it makes it easier to read. So you can see, first I'm going to take the autos, then I'm going to pull off the columns, and I'm going to select the dtypes. And then finally I'm going to describe that. It's a few more lines; we're just splitting it into a few more lines, and we're also wrapping it with parentheses, which allows us to put it on multiple lines. Now, one thing you want to do when you do this describe is look at the min and max values, because that can indicate what types you can use for integers. By default, pandas is going to use this lowercase int64 type, and you might want to use a different type. So there are other types in here. Let's just show some of them. For example, this one down here, an int8: if you use that type, you can go from negative 128 to 127. So if we look at our values here, for example highway mileage, we might get away with using this int8. There's also an int16 here. An int16 goes up to around 32,000, so it looks like all of our values there would support being changed to int16 instead. So what we can do is stick this astype call into our chain, and then in select_dtypes we'll just select int and int8. And you can see that this is now showing us what is either an int or an int8. It's not showing us the int16 there, but we did okay. And if you're not familiar with this astype, here's the documentation of what it's doing: basically you can pass in a dtype, or you can pass in a dictionary of types. So what we're doing above with this astype is passing in a dictionary with the columns and the types that we want those columns to be. You can see that highway, we're casting that to an int8; city and combined, we're changing those to int16s. That's why city and combined don't show up here, because when we said select_dtypes, we did not select the int16 dtype. Okay, so here I'm just fleshing this out a little bit more. I'm filling in range and year as well. Note that I'm now saying select_dtypes with the string 'integer'. If we do that, this is going to pull off anything that's integer-esque, and so it will get all those types in there. This is looking okay. We haven't lost any precision by casting our integer columns to these other types. We still have the same values in them, but now we should be using a lot less memory. Let's just look at the memory here. And now we're using 18 instead of 19, so we've got a little bit of savings in that. Let's look at floating point numbers. I'm going to say, let's describe what's a float, and we've got cylinders and displacement in there. So I'm going to jump into cylinders. One thing that sticks out to me as I look at this is that cylinders actually looks integer-like. If you look at the values that came up there, those look very integer-esque. What I'm going to do is just stick a describe on here to look at cylinders.
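A sketch of that downcasting chain, with numpy's iinfo used to check the type limits first (the exact column-to-type mapping follows the ranges that describe showed):

    np.iinfo('int8')    # spans -128 to 127
    np.iinfo('int16')   # spans -32768 to 32767

    (autos
     [cols]
     .astype({'highway08': 'int8', 'city08': 'int16', 'comb08': 'int16',
              'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16'})
     .select_dtypes('integer')   # the string 'integer' catches int8/int16/int64
     .describe()
    )

    # Floats next: cylinders looks suspiciously integer-like.
    autos.cylinders.describe()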
And this is confirming my suspicion: the mins and the quartiles there are all whole numbers. So why is that being represented as a float instead of an integer? That might be the question that pops up in your mind. Well, one thing you can do is use this value_counts call. If I do that and say dropna is equal to False, we can see why this is represented as a float. The reason why is that in pandas, the lowercase int64 type does not have support for missing numbers. And you can see right here, there are 206 entries where the cylinders are missing. These are cars that do not have cylinders in them. So you might want to dive into that and figure out why those cylinders are missing. This code right here will just say: let's query where cylinders is NA. And so these are the cars where cylinders are missing. It looks like, if you look in this model column right here, these are mostly electric cars, which makes sense. I'm not a super car buff, but apparently electric motors do not have cylinders in them. So at this point we need to figure out what we want to do with that cylinders column. What I'm going to do is make an executive decision. Rather than having missing numbers in there, I'm just going to fillna with this little operation here, and with that I should be able to cast it to an int8. Also, with this displacement here, I'm going to do a similar thing: I'm going to fill in those missing values as well. Let's just look at the integer types after we do that, and we should see that cylinders pops up here as an integer type. The lowest value for a cylinder is zero, that's right here, and the highest value is 16. Okay, so we've converted cylinders successfully from a floating point to an integer. Let's see what else we have. If we want to dive into these float types, the default float type is float64. If we wanted to look at, say, a float16, we can do this numpy call here to see what the minimum and maximum values for a float16 can be, and whether we can support that here. It looks like for the displacement we should be able to use a float16, so I'm going to stick that in there and we'll have a float16 for that. Now let's look at our memory usage. After we've done that, our memory usage is now 17.5. So again we've saved a little bit more memory. We started at 19 and we've gone down a little bit.
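The missing-value checks and the casts might look like this (filling with 0 is the executive decision from the talk; the float16 bounds come from np.finfo):

    # Why is cylinders a float? dropna=False shows the missing entries.
    autos.cylinders.value_counts(dropna=False)

    # The rows with missing cylinders turn out to be mostly electric cars.
    autos[cols][autos.cylinders.isna()]

    np.finfo('float16')   # max is around 65500, plenty for engine displacement

    (autos
     [cols]
     .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
             displ=autos.displ.fillna(0).astype('float16'))
     .memory_usage(deep=True)
     .sum()
    )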
The next type that I want to look at is object types. Pandas supports these native float and integer types, and basically it's allocating a block of memory in the specific type. So if it's an int8, it's going to use eight bits of memory for each integer rather than a whole Python object. But pandas also allows you to stick Python objects into each cell there. When you're doing that, you're going to use a lot more memory, because Python objects take up more memory, and you're also going to lose speed: you're executing things at the Python level rather than at pandas' optimized, basically C-level speed that you get if you stay at those low levels. So let's look at what the object types are in these columns we care about. We've got drive, engine description, make, model, transmission and created on. Generally, when you read a CSV file, an object represents string data, though it could also represent numeric data that had string-esque values in it. You can also see that created on looks like a date, but by default read_csv does not convert that into a date. Now, I like to generally treat string types as basically a few types. One is categorical. And what do I mean by categorical? Well, if you look at make, there are only a few different makes in there, so that could be categorical. Model, on the other hand, is more free-form. Basically, for every make there'll be a couple of models for every year. So it's kind of a gray area. It's a little free-form; it could be categorical-esque, but it's going to have a higher cardinality, a higher number of unique values. Transmission looks categorical. Created on, you could say it's categorical, but it's actually a date type. These entries all look the same, and because there are so many entries that look the same, you might think that's categorical, but I'm going to convert it to a date, because then I can leverage pandas date operations on it. We also have this eng_dscr, the engine description, which looks like it's got parentheses and strings inside of it. We'll dig into that a little bit more. And then we have drive, which looks categorical as well. So we need to figure out what we want to do with these. We could keep them as strings; if we convert them to categorical types, we'll save memory; if we convert them to dates, we'll be able to do date operations on them. Let's just start off with drive. What I like to do when I come across a string column is stick it into this value_counts call. That's going to give me the frequency and let me evaluate whether it is indeed categorical or not. Again, I put in that dropna is equal to False to see if there are missing values in there. So these are the unique entries for drive, and I would call this categorical. Now, it's interesting that again, there are some missing values in there. NaN is used to represent missing values. Small aside: an object type in pandas can mean string, but it can also mean a mixed type. In this case, drive is actually a mixed type: it's strings plus floating point NaNs representing the missing values. So let's just dive into that. Where is the drive missing? I'm going to do this little call here, query where drive is NA. Looks like a lot of these are electric, but some of them are not. There's a Nissan 300, and I'm not sure why the drive is missing from that. There are some Corvettes where the drive is missing. So is this front wheel drive or rear wheel drive? Why is this missing? Again, I'm not a car expert per se, but if you talked to a subject matter expert, they could probably give you some insight into figuring out the appropriate label for those various types. What I'm going to do, given that I'm not really a subject matter expert here, is just fill in everything that's missing with the string 'Other', and we're going to convert it to a category type. So I'm going to say astype category. Let's just look at the memory after we do that, and look at that, just by changing that type there. Well, we also did change the make type down here, so we changed basically two types, drive and make. We went from, I believe, around 17, and we've now dropped to twelve. So categoricals are a way to save a lot of memory in pandas.
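A sketch of the drive and make conversion: value_counts to confirm it's categorical, fillna for the missing drives, then the category cast:

    # Frequencies, including the NaNs, to judge whether it's categorical.
    autos.drive.value_counts(dropna=False)

    (autos
     [cols]
     .assign(drive=autos.drive.fillna('Other').astype('category'),
             make=autos.make.astype('category'))
     .memory_usage(deep=True)
     .sum()
    )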
Why do we care about saving memory in pandas? Again, I talked about that before: it might make things run faster. But the other thing is that pandas is an in-memory tool. In order to use pandas, you need to be able to load your data into memory. So if you're sort of pushing that edge where you're running out of memory, using the right types can allow you to process more data. Okay, let's jump into trany, the transmission column. Again, value_counts is my go-to here. Here's the call for value_counts, and it looks like there are a bunch of entries in there. Does it say at the bottom here? Sometimes it'll say how many rows there are, but I'm going to estimate there are around 30 different entries in here. So is this categorical? Maybe. It looks a little free-form too. There are these parentheticals in there. I'm not quite sure what the 4 versus the (S7) versus the (AM S7) means per se, but to me, as sort of a novice, one thing that sticks out is whether it's automatic or manual. The other thing is there's this number in there, right? So those are both things that stick out, that I might want to keep track of in this data set. So what I'm going to do is put in a new column here called automatic, and that's just whether it contains 'Auto' in there. And then I'm going to put in another new column here called speeds, and that's just using this str.extract, passing in a regular expression and pulling out that value. Where there are missing values, I'm sticking in the string 20 instead. It looks like the ones that are missing, if you dive into them, are probably variable speeds, so I'm just putting in 20 as a large number for speeds, and then I'm casting that as an int8. So if we do that and we look at our memory, our memory now has gone down from 19 to 10, so we've almost halved our memory by doing these little operations here. Note that I also now have a few more columns that might have more interesting information. Is there more that you could do with this column, the trany column? Yeah, there might be. So again, talking to a subject matter expert might reveal more insight, other features that you can create for your data. Let's look at the date columns here. I do have this created on column. So what I'm going to do is run this to_datetime call, which is a function in pandas, and then I'm going to call tz_localize down here, because it looked like it was in the eastern time zone. So let's just run that, and I get a warning here. It complains about that, but if you look at my memory usage again, I'm going down quite a bit; I'm now at 7,000 here. This warning is due to a Python issue where Python doesn't understand this EDT or EST in there. So with this little change here, replacing EDT and EST with their numeric UTC offsets, when I rerun this, I don't have any issues. Now, in this case, I'm pulling off the value_counts and just looking at the engine description. But let's just drop the engine description; that's what I'll do down here, because this looks free-form. Before I drop it, though, I'm going to add this little thing here: does it contain FFS? Again, I'm not a pro at these engine things, but it looks like there's a bunch of FFS entries in there, so that might be something I want to keep. So this is an example of making a new column for whether it contains FFS or not. Let's run that. And we're now down to 8, so we're not quite at a third of our original data size, but we're getting close to that. Okay, so this is kind of nice.
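Sketched out, those transmission, date, and engine-description steps might look like the following. The 'Auto' substring test, the regular expression, and the EDT/EST offset replacements are reconstructions of what the talk describes, not the exact code:

    (autos
     [cols]
     .assign(
         # Flag automatics; na=False treats the missing entries as manual.
         automatic=autos.trany.str.contains('Auto', na=False),
         # Pull the speed count out of strings like 'Automatic 4-spd';
         # missing ones (variable speeds) get the large placeholder 20.
         speeds=autos.trany.str.extract(r'(\d+)', expand=False)
                    .fillna('20').astype('int8'),
         # Python's parser chokes on EDT/EST, so swap in numeric offsets.
         createdOn=pd.to_datetime(
             autos.createdOn
                  .str.replace('EDT', '-04:00', regex=False)
                  .str.replace('EST', '-05:00', regex=False),
             utc=True).dt.tz_convert('America/New_York'),
         # Keep an FFS flag before dropping the free-form description.
         ffs=autos.eng_dscr.str.contains('FFS', na=False),
     )
     .drop(columns=['trany', 'eng_dscr'])
    )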
Now, one thing you may or may not have noticed is that I've just sort of built this up as we've been going along, and Jupyter makes it really nice to do that. What I like to do now is make a function that will take my raw data that I've read directly out of a CSV file, and I generally just call it tweak-whatever-the-name-is. I pass in my raw data and apply all these chains right here, and this gives me my cleaned-up data. So it's really nice to have a function like this in one place, in one cell. Generally, I'll put it right after the load, up at the top. Then when I want to revisit my data, I know that I can load the raw data and immediately get my cleaned-up data. This also allows for what's called data provenance, where I can track through and see what happened, where every row is coming from. Generally your pointy-headed boss, after you do an analysis or calculation, is going to ask you to explain it, and if you don't have this provenance where you can come back to the original data, that might be problematic and bite you. So this is what I like to do, and it should read pretty clearly here: I'm taking autos, I'm pulling off these columns, I'm assigning a cylinders column, a displacement column, a drive column, an automatic column, a speeds column, a created on column, and an ffs column; I'm changing the types for various ones of these; and I'm dropping these columns. So this sort of reads like a recipe: do these steps and then you'll have your cleaned-up data. Okay, so that is my first thing to do with pandas: you probably want to look at your data and make sure that your types are correct. The next section here is to talk about chains. Chains are something that I actually just showed you. Sometimes this is called flow programming. What we're doing is leveraging the fact that pandas generally does not mutate data; rather, it returns new data frames. And so we're just going to take the new data frame and do operations on that. On that note, because I said pandas is an in-memory tool, you want to have three to ten times the amount of memory as the data frame you're working with, because you need some overhead for doing these intermediate operations. I know some of you who use pandas are like, no, Matt, that's not the way to do it; there's an inplace parameter that you can use on various methods, and pandas will do things in place. Yeah, there is an inplace parameter. However, the inplace parameter generally doesn't do things in place as you might think it does. So I do have a hint here: if you find some code that you can't change, sometimes I like to stick in a pipe call. Pipe is a nice way to inline any arbitrary code; basically, you can stick in a function that takes a data frame and returns a data frame. So this is my chain here. Now, I think this actually reads very cleanly. I just walked through it in the last section here.
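Put together, the tweak function might look something like this sketch, reconstructing the chain described above (column names from the fueleconomy.gov file; the fill values and regex are the assumptions noted earlier):

    def tweak_autos(autos):
        """Raw CSV frame in, cleaned-up frame out: the whole recipe in one place."""
        cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ',
                'drive', 'eng_dscr', 'fuelCost08', 'make', 'model',
                'trany', 'range', 'createdOn', 'year']
        return (autos
                [cols]
                .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
                        displ=autos.displ.fillna(0).astype('float16'),
                        drive=autos.drive.fillna('Other').astype('category'),
                        automatic=autos.trany.str.contains('Auto', na=False),
                        speeds=autos.trany.str.extract(r'(\d+)', expand=False)
                                   .fillna('20').astype('int8'),
                        createdOn=pd.to_datetime(
                            autos.createdOn
                                 .str.replace('EDT', '-04:00', regex=False)
                                 .str.replace('EST', '-05:00', regex=False),
                            utc=True).dt.tz_convert('America/New_York'),
                        ffs=autos.eng_dscr.str.contains('FFS', na=False))
                .astype({'highway08': 'int8', 'city08': 'int16',
                         'comb08': 'int16', 'fuelCost08': 'int16',
                         'range': 'int16', 'year': 'int16',
                         'make': 'category'})
                .drop(columns=['trany', 'eng_dscr']))

    autos2 = tweak_autos(pd.read_csv('vehicles.csv'))

Everything needed to go from raw CSV to clean frame lives in one cell, which is the data provenance point: rerun the load, rerun this function, and you are back to a known state.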
Now compare this with what I generally see in pandas code in the wild. Basically, I've taken just these few lines here and split them out into something like this, making all these intermediate variables, changing them and then sticking them back in. And I'm not even doing that for the rest of the code; for the rest of the code I'm not even making the intermediate variables, just assigning these directly. And generally, just for the sake of typing this in as an example, you could imagine that what I see when I'm working with students or clients is that this code is going to be like three times as long as it is here. So contrast that with this, which I think is really easy to read, is easy to step through, and is going to make your code and life a lot easier. Now let's just run this here. And actually, when you run it, you'll notice all these little errors here; that's sort of the bane of pandas people everywhere, the SettingWithCopyWarning. You'll note that I didn't run into that at all, because I used assign here. So if you use assign and you're chaining, you're never going to run into this SettingWithCopyWarning that's kind of confusing, where you go and read the descriptions and the Stack Overflows and you're still kind of confused about what to do. This is also easy to debug. People sometimes say, Matt, that's not easy to debug, because I don't have the intermediate objects. I say it is easy to debug; let me just show you some things you can do. One thing I can do is just comment out everything and walk through it as I'm going. So here I've got a call to tweak autos. Let me just show you: here is tweak autos. If I want to, I can come in here and say, well, what did it do before this? This is what it did, right? And I can look at that intermediate state and walk through from the start to the end. So that's one way to debug it, which I think is actually very useful, more so than having all these arbitrary variables sitting around. Another thing I can do, if I do want an arbitrary variable sticking around, is make a function like this get_var. It takes a data frame and a variable name, and look what it does. This is sort of a dirty hack: it basically injects that variable name, pointing to the current data frame, into my globals. So guess what: right down here I've got this pipe get_var 'df3', right after getting the columns. And if I come down below this and look at df3, this is the intermediate state. So I can get that intermediate state very easily by piping in this little hack function that updates it if I need it. So again, I claim that this is a lot cleaner. It's going to allow you to very easily look at your code, but pull out the stuff that you need to. One more thing that I have in here is this pipe; this is a pipe to display a data frame, so I can display one as well. I didn't really show this, but if you look at the output here, there are actually two data frames. So here's the first one, and if I scroll down a little bit more, let me rerun this; this was commented out, so I'm going to rerun it. Okay, so here's the first one, and there's actually the second one here. So I put a little function in here that I piped in. It's just a lambda function that calls display and then returns the data frame after displaying it. So the pipe takes a data frame and returns a data frame; we're just shimming in a print along the way so we can inspect it. So this is the way that's going to make it very easy for you to debug these chained operations.
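The two debugging helpers might be sketched like this. get_var is the globals hack from the talk; display is the Jupyter built-in, so this runs in a notebook (autos and cols are from the earlier sketches):

    def get_var(df, var_name):
        # Dirty hack: inject the current state of the chain into globals,
        # then hand the frame straight back so the chain keeps going.
        globals()[var_name] = df
        return df

    (autos
     [cols]
     .pipe(get_var, 'df3')                 # snapshot right after column selection
     .pipe(lambda df: display(df) or df)   # shim in a print; display returns None
     .assign(cylinders=autos.cylinders.fillna(0).astype('int8'))
    )

    df3.head()   # the intermediate frame is now an ordinary variable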
Okay, the next step that I have is: do not mutate. And if you look at my chained operations, because I'm using chains, chains don't mutate, so I'm obeying this. I've got a quote here from one of the core developers of pandas, and I've got the link to the bug as well, that's talking about getting rid of the inplace option in pandas. The quote, from Jeff Reback, says: you're missing the point. Inplace rarely actually does something in place. You are thinking you're saving memory, but you're not. And I come across this all the time when I'm teaching pandas or looking at code. People will use inplace and they think, oh, it says in place, so it's just updating the memory in place. Generally it's not doing that. Generally, when you call inplace, it's making the new data frame that you'd get anyway and then shimming it back in. So it's not saving you any memory over what you'd get by chaining. But by forcing yourself to chain, you're going to make that clean code, and you're also going to get rid of the bane of Python developers everywhere, that annoying error that's like, what does this mean? By using chains and assigns, you get rid of that. This is a short section, but I've sort of demoed it up to this point. In general, there are no performance benefits to doing inplace. So if you think you're getting them, you're just fooling yourself; pandas is not really doing an in-place operation, even though it says it is. It also prohibits chaining, which I think is going to really mess up your code, make it uglier and harder to read. And I think that's one of the most important things: to have code that's easy to read. And then, again, you will get rid of that SettingWithCopyWarning that is the bane of pandas developers everywhere. Okay, the next section is: don't apply if you can. So let's just make our data frame here; I'm going to call it autos2. And this is sort of US-centric, but maybe I want to move to something that's more world-centric. A lot of people, instead of doing miles per gallon (they don't even know what a mile, let alone a gallon, is) will use liters per 100 kilometers. As a US person, I don't really know what that is, but Americans are kind of weird that way. I've got this function here that will convert miles per gallon to liters per 100 kilometers; the conversion is 235 divided by the miles per gallon. So what I can do is take, say, the city mileage and call the apply method on it, and there are our conversions. It looks like that works. Note that I can also do this without apply: I can just do the arithmetic on the column directly, and this gives me the same values. Now, the output is the same, but what's going on under the covers is actually very different. Because I'm calling apply, what apply is doing is taking each individual entry in that column and running it through this Python function in this cell down here. When I operate on the column directly, what I get is a vectorized or broadcast operation. Remember, pandas stores this as a block of data. It's going to take that block of data, do the arithmetic on it, and give me a new block of data without going back into Python; we're saving sort of the serialization-deserialization step. In fact, we can just time this. I'm going to run this little timeit magic and we'll see how long this takes with both of these options. So it looks like the apply takes about six milliseconds. Let's look at the vectorized version: it takes about 143 microseconds. Sometimes this changes, and by my calculations this is from a previous run of numbers here, but it's pretty similar.
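That comparison, sketched with the %timeit magic from the talk's notebook (235.215 is the usual mpg-to-liters-per-100km constant; autos2 comes from tweak_autos above):

    def to_lper100km(mpg):
        return 235.215 / mpg   # liters per 100 km

    # apply: every value round-trips through a Python function call
    %timeit autos2.city08.apply(to_lper100km)

    # vectorized: one broadcast operation on the whole block, no Python loop
    %timeit 235.215 / autos2.city08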
It's around a 50x difference between doing an apply and doing the vectorized operation. So my point here is: anytime you find yourself doing apply, check whether that is what you want to do. Now, there are some cases where apply sort of makes sense. Let's just talk about them. Let's say I want to check whether the make is American or not, and I'm going to apply this little function here that just says: is the value American? And I'll also do this other way; there's also an isin on make. So if I do apply, that's 1.5 milliseconds, and you can notice that if I do isin, that's about half the amount of time. So even with this string operation, by doing an isin I am saving some time, because pandas can do it a little bit faster. Now, if we do something a little bit more complicated, we might do something like: I want to make a country column in here, and it's either going to be US or Other. Again, I'm sort of US-centric; apologies if that offends you. The timing for that is like 2.6 milliseconds. Generally, if I was doing a calculation and it had some conditional like this, I would use a where statement, so this is using the where down here. This is how I would canonically do this with a calculation. However, in this case, because it's working with strings instead of numbers, it's going to be slower; this is almost two times as slow with the where. If you were to do a where on a numeric column rather than a string, you would actually see an improvement. Let's just do one more operation here. There's also this np.select, which is a nice little tool, and there's np.where, which again are very useful for building up conditional statements. They're a little bit more powerful than the where in pandas. In this case, for strings, they are slower, but if you're doing numeric calculations, all of these will be faster than the apply operation. My takeaway with apply: I see a lot of people using apply. Don't use it unless you're in a string situation. Otherwise you're going to want to use one of these np.select, pandas where, or np.where options, and your code is going to run a lot faster. I get that where might be a little bit confusing, and your code might be a little bit harder to read, but if speed is of the essence, try and avoid apply if possible. And really, with string operations, because strings are stored as Python objects, you're going back to that Python layer anyway, so in that case I'll say, yeah, apply is okay there. But generally I try to avoid apply for numeric operations.
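A sketch of those variants; the list of American makes here is illustrative, not the one from the talk:

    us_makes = ['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla']   # illustrative

    # Vectorized membership test instead of .apply(lambda v: v in us_makes)
    is_us = autos2.make.isin(us_makes)

    # pandas where: keep 'US' where the condition holds, fall back to 'Other'
    country = pd.Series('US', index=autos2.index).where(is_us, 'Other')

    # numpy equivalents: more powerful for building up conditionals
    country2 = np.where(is_us, 'US', 'Other')
    country3 = np.select([is_us], ['US'], default='Other')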
Okay, the last section here is aggregation. Let's talk about aggregation. Let's say that I want to compare mileage by country, by year. When I start hearing 'by this' or 'by that', it basically means to me that I want to do these group by aggregations. Pandas kind of has two ways to do this: one is with a group by, the other is with a pivot table. I'm a software engineer by training, and so group by makes sense to me; I'm not really an Excel master. People who are very adept with Excel tend to favor the pivot table way of doing this, but both of them will get you to the same place. I'm just going to show the group by way here. So what I'm going to say is group by, and then I'm going to take the mean. If you're not familiar with this, this is really powerful. What it's doing is taking the year and putting it in the index; you can see on the left-hand side, the bolded thing is the index. And then it's going to say: okay, for every row that had the year of 1984, get all the numeric columns, take the mean of each numeric column, and put that in the column. So this is actually pretty cool. I've written it out as three lines of code, but you could write it as a single line, a single operation, to do this aggregation. Now, the sky's sort of the limit with what you do with these. Instead of doing all the columns, I can say I want to pull off combined and speeds and just get the mean for those if I want to. So I stick in this little operation here. Note that I am chaining these; I'm writing these out each on their own line, and I can sort of walk through and debug that if I need to. Now, one thing I like to do is just stick on a plot. Humans aren't optimized for looking at tables of data. I can look at this table and see that it looks like speed is going up, but it's kind of hard for me to get a feel for what's going on. But if I just stick a plot on the end of this, what this is going to do is take the index by default, and each column it will plot as a line. So look at this: there's a visualization. This visualization tells me a lot more, because I can visually see what's going on there: our speeds are going up over time, and our combined mileage was sort of static until 2010, then it started ticking up. Now, again, I'm not a car guru by any means, but this is kind of interesting just from looking at the visualization. So what happened? Why was there that inflection point around 2007 or 2008 that made things start ticking up? It could also be the data. We're taking the mean, so there could be outliers in that. I'm imagining what this is, is electric cars, and electric cars tend to skew the mean a lot, so maybe we'd look at the median instead. And the median is still going up a little bit, but it doesn't look like it's going up quite as much as the mean. And so in this case, what I'm going to do down here: we said by country, right? And I've just been doing this by year. So now I'm going to change some operations in here. I'm going to make a country column, so I'm going to do that apply and get the country column in there, and then I'm going to group by year and country. So let's look at what this gives me. This gives me this data frame with the index, and you can see that there's both the year and the country in the index. This is a hierarchical or multi-index, which is a little bit confusing to use, but hopefully you can see that there's power there. We can see, for 1984, here's the summary for the US rows, and there's a summary for other, non-US, as well. We can go a little bit deeper too; we can go down this rabbit hole. I'm going to say, instead of just doing the mean here, I'm going to call agg, which allows me to do multiple aggregations. So I'm going to call the minimum, the mean, and I'm even going to pass in my own defined function here that's going to take a series and aggregate it, collapse it to a value. In this case, my aggregation is a little bit weird: it takes the second-to-last value. This is just showing that you can pass in whatever function you want to; it just has to take a series and collapse it to a single value. And so here we go.
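The aggregation steps, sketched. The country column reuses the isin flag from the previous sketch, and second_to_last is the deliberately weird custom aggregation described above:

    autos3 = autos2.assign(
        country=np.where(autos2.make.isin(us_makes), 'US', 'Other'))

    # Mean per model year for two columns, then one line per column.
    (autos3
     .groupby('year')
     [['comb08', 'speeds']]
     .mean()
     .plot())

    # Multiple aggregations at once, including a user-defined one that
    # takes a series and collapses it to a single value.
    def second_to_last(ser):
        return ser.iloc[-2]

    (autos3
     .groupby(['year', 'country'])
     [['comb08', 'speeds']]
     .agg(['min', 'mean', second_to_last]))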
Now you can see that, in addition to a hierarchical index, we have hierarchical columns. So for every numeric column here, we have the minimum value, the mean value, and the second-to-last value as well. The sky's sort of the limit with what you want to do with these things. What I'm going to do is just go back to what I had, year and country, take the mean of that, and then plot it. And we get this plot here that's a little bit hard to grok; I'm not quite sure what's going on. The issue here is that, remember, pandas is going to plot the index along the x axis. So now it's actually plotting these tuples of year and country, like US or Other, which makes it a little bit hard to read. So what you might want to do is one of these more advanced operations, where I'm going to use this unstack. If you're not familiar with unstack, let me just comment it out here and show you. Here is grouping by year and country, and then we're going to take the mean. If we come in here and say unstack, what that's going to do is take the innermost index, which is that country, and pop it up into the columns. And you can see that we now have country in the columns, and only a single thing in the index: we've got year. And if we wanted to look at, for example, just the city mileage per year from that, we could pull off the city column, which is going to give us the sub-columns there, because that's hierarchical. Let me just comment that out and show you what that's giving us. It gives us that: again, we're going from this unstacked version, we're going to say city, and you can see I'm using this debug style to pull that out. Now we're going to plot that: it's going to plot the index, and it's going to plot each of those columns as a line, and we get something that looks like this. So I've sort of used this construct of grouping by and unstacking to illustrate some of the power. And this allows me to quickly visualize things and see what's going on with the US. We can see that the US city mileage is going up, maybe faster than the non-US mileage, at least in this data set that I have here.
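And the unstack trick, sketched against the same autos3 frame:

    # Pop the innermost index level (country) up into the columns, then pull
    # out the city-mileage sub-columns: one line per country on the plot.
    (autos3
     .groupby(['year', 'country'])
     [['city08', 'highway08']]
     .mean()
     .unstack()
     ['city08']
     .plot())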
Okay, so in summary: pandas is super powerful. You can do a lot of things with it with very few lines of code. In my experience, I've seen good pandas code and I've seen some bad pandas code. If you follow these steps, I guarantee that your code will be better. It will have fewer bugs, it will probably run faster, and your colleagues will probably be happier with you, because they're going to be able to understand your code. So, correct types: those save memory and time, make things fast, and also give you functionality like date and string manipulation. Chaining operations: that's to make your code more readable; it's going to remove bugs, and it's going to be easier to debug as well. I showed some examples of that. Don't mutate: if you find yourself mutating, you should take a step back and think about how to do it without mutating. Apply is generally your enemy, so if you're trying to do math operations with apply, you're probably doing it the wrong way; apply is okay for string operations, but generally should be avoided. And then aggregations are very powerful. I get that they can be confusing, but if you use this chain style and start building them up, it's going to help you understand what's going on, and you're going to be very empowered. So thanks for watching. I hope you enjoyed this. If you want to learn more about pandas, you can follow me on Twitter: @__mharrison__. I also have a discount code for my pandas course. It's going to be 50% off for people who are watching this. The code is learn pandas, and it will be active for the next week. Good luck with your pandas. It's been a pleasure being with you.
...

Matt Harrison

Python & Data Science Corporate Trainer @ MetaSnake

Matt Harrison's LinkedIn account


