Conf42 Python 2022 - Online

Strategies for working with data as it grows

Abstract

Nowadays data is getting bigger and bigger, making it almost impossible to process it on desktop machines.

To solve this problem, many new technologies (Hadoop, Spark, Presto, Dask, etc.) have emerged over the last years to process all this data using clusters of computers. The challenge is that you need to build your solutions on top of these technologies, which requires designing data processing pipelines and, in some cases, combining multiple technologies.

However, in some cases we don’t have enough time or resources to learn to use and set up a full infrastructure just to run a couple of experiments. Maybe you are a researcher with very limited resources or a startup with a tight schedule to launch a product to market.

The objective of this talk is to present multiple strategies to process data as it grows, both within the limitations of a single machine and with the use of clusters. The strategies focus on technologies such as Pandas, PySpark, Vaex and Modin.

Outline

1. Introduction (2 mins)
2. Vertical scaling with Pandas and the Cloud (3 mins)
3. Keeping the memory under control by reading the data by chunks (5 mins)
4. Processing datasets larger than the available memory with Vaex (5 mins)
5. Scaling Pandas with Modin and Dask (5 mins)
6. All-in with PySpark (5 mins)

Summary

  • Marco Carranza is the technical cofounder of Teamcore Solutions, which processes sales information through machine learning algorithms. The objective of this talk is to review alternatives for processing data using different types of technologies.
  • Data is getting bigger and bigger, making it almost impossible to process it on regular desktop machines. In this talk, we will show you multiple strategies to process data locally and review some alternative tools that could help us process large data sets using a distributed environment.
  • Vertical scaling is the ability to increase the capacity of existing hardware or software; horizontal scaling involves adding machines to the pool of existing resources. Modin is a multiprocess data frame library with an API identical to pandas that can speed up pandas workflows.

Transcript

Strategies for working with data as it grows. Hello everyone, my name is Marco Carranza. I'm an entrepreneur and also the technical cofounder of Teamcore Solutions. I have been a user of Python for more than 15 years and I have had the opportunity to use it extensively to develop many technological solutions for the retail industry on a global scale. At Teamcore Solutions, we process sales information through machine learning algorithms, giving companies visibility into the execution of their products at each store and generating insights and specific actions for the office and the field teams.

The objective of this talk is to review alternatives for processing data using different types of technologies and data frames. First we will take a look at different techniques to keep pandas memory usage under control and allow us to process larger files. Then we will see how we can vertically scale our pandas workloads using Jupyter notebooks. Next we will learn about the Python library Vaex, so we can process data that cannot fit in memory. Then we will try Modin, so we can speed up our pandas code with a minimal change to our Python source code. And finally, we will take a look at PySpark and understand in which cases it is a great alternative.

Introduction. Nowadays, data is getting bigger and bigger, making it almost impossible to process it on regular desktop machines. To solve this problem, during the last years a lot of new technologies have emerged to process all the data using multiple clusters of computers. The challenge is that you will need to build your own solution on top of these technologies, which requires designing data processing pipelines and in some cases combining multiple technologies. However, in some cases we don't have enough time or resources to learn to use and set up a full infrastructure to run a couple of experiments. Maybe you are a researcher with very limited resources or a startup with a tight schedule to launch a product to the market. Usually the software that processes data works fine when it's tested with a small sample file, but when you load the real data, the program crashes. In some cases some simple optimization can help to process the data, but when the data is much larger than the available memory, the problem is harder to solve. In this talk, we will show you multiple strategies to process data locally and review some alternative tools that could help us process large data sets using a distributed environment.

Pandas is the de facto tool when we are working with data in Python environments. Now we're going to see a couple of tricks that will help us to control the memory of our workloads in a better way.

Trick number one: sparse data structures. Sometimes we have data sets that come with many, many empty values. Usually these empty values are represented as NaN values, and using a sparse column representation can help us save some memory. In pandas, sparse objects use much less memory, especially if we save these objects as pickles on disk and when we are using them inside a Python interpreter. Let's take a quick look at a small example. As we can see in this data frame, when we list the content of the column named education 2003 revision, we see that there are a lot of rows with NaN values. And if we take a deeper look at how much memory we are using, we realize that we are consuming a lot of memory, something close to 19 megabytes. So with a very simple command we can change the data type of that column and tell pandas to use a sparse type. After doing that, if we look at the memory usage again, we'll see that it has been reduced a lot. Basically, after changing only the data type, we have reduced the memory by 41%. This may not look like much, but it's very useful, especially when you have very large data sets that come with a lot of empty values.
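The exact commands are shown on the slides rather than in the audio, but a minimal sketch of this sparse-column trick could look like the following; the dataframe and its contents here are illustrative placeholders, not the actual dataset from the talk.

```python
import numpy as np
import pandas as pd

# Placeholder dataframe with a column that is mostly empty (NaN) values.
df = pd.DataFrame({
    "education_2003_revision": [np.nan] * 900_000 + [3.0] * 100_000,
})
print(df["education_2003_revision"].memory_usage(deep=True))  # dense float64

# Trick 1: switch the column to a sparse dtype so the NaN values are not stored.
df["education_2003_revision"] = df["education_2003_revision"].astype(
    pd.SparseDtype("float64", np.nan)
)
print(df["education_2003_revision"].memory_usage(deep=True))  # much smaller
```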
Trick number two: sampling. Sampling is a very interesting and useful technique that will help us create a smaller data set from a larger one, and if it's done in the right way, it will help us run a faster analysis without sacrificing the quality of the results. Pandas has a special function for that, named sample. Let's see an example. In this example we have one large data frame, and we are creating a sample of 1,000 rows. But before sampling, you need to be careful, because a common mistake is to think that picking only the first thousand rows gives you a correct sample. In reality, the right way is to use this function, because you will get a more uniform sample that is better for further analysis. For example, if we later run the describe function, pandas will give us some descriptive statistics, and if we compare the result of the original data frame with the sampled one, the results are pretty similar. Also, if we take a look at the histograms, we'll see that both are very similar. But if we make the mistake of only picking the first n rows, we will get a completely different result.

Trick number three: load only the columns that you need. In some cases we have very large data sets that come with many columns, sometimes hundreds of them, and there's no point in loading all these columns into memory. The basic rule is: fewer columns, less memory. Let's take a quick look at an example. In this small example we have a large text file that is 3.8 GB on disk. After reading it we realize that we have 77 columns, and if we analyze the memory usage we see that we're using 4.5 GB of memory. A quick way to reduce the memory usage is to select only the columns that we are going to work with. After selecting the four columns that you can see in the example, we have reduced the memory from 4.5 GB to a little bit more than 300 megabytes. This can also be done directly when reading the CSV file.

Trick number four: use smaller numerical data types. There are multiple numerical data types in pandas, and depending on the type we can store more or less information. For example, if we want to store the age of a person, there's no point in using the int64 data type, because we would be wasting a lot of memory. In that case it's much better to use a smaller data type such as int8. Let's take a quick look at an example. In this data frame we can see the column named detail age type, and we realize that all the values are between one and nine, but when we analyze the data type of that specific column we see that we are using int64 and consuming 196 megabytes of memory. After changing it to a smaller data type, we will be using only 24 megabytes. That means that with only one line we can reduce the memory consumption by 87.5%.
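Putting tricks two to four together, a rough pandas sketch might look like this; the file path, column names and dtypes are assumptions for illustration, not the exact code from the slides.

```python
import pandas as pd

# Trick 3: load only the columns we actually need (names are placeholders).
columns = ["sex", "detail_age", "detail_age_type", "education_2003_revision"]

# Trick 4: ask for smaller numerical dtypes directly while reading the CSV.
df = pd.read_csv(
    "large_file.csv",              # placeholder path
    usecols=columns,
    dtype={"detail_age_type": "int8", "detail_age": "int16"},
)

# Trick 2: analyze a uniform random sample instead of just the first N rows.
sample = df.sample(n=1000, random_state=42)
print(sample.describe())           # should be close to df.describe()
```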
Trick number five: use categorical data types. In some cases it's possible to shrink non-numerical data and reduce the total memory footprint. For these cases pandas has a custom categorical data type. Let's take a quick look at an example. In this example we can see that there is a column named sex that can have only two categorical values, F and M. When we analyze the data type that we are using, it's object, and if you remember, objects can consume a lot of memory. Looking more deeply, we realize that we are using more than 142 megabytes of memory just for that column. But if we change the data type to a categorical type, we will see a huge memory reduction: from 142 megabytes to only 2.4. That means we are reducing the total memory of the column by around 98%.

Trick number six: read the data by chunks. Attempting to read a large file can lead to a crash if there's not enough memory for the entire file to be read at once. Reading the file in chunks makes it possible to access very large files by reading one part of the file at a time. Let's take a look at a small example. In this example we have a very large file that is almost 4 GB on disk. What we are doing in pandas is reading the CSV file, but adding a parameter named chunk size, so we iterate over the file and read the rows in blocks of 500,000. Every time we read a part of the CSV file, we count the values and accumulate the partial results in a separate variable. Each time the loop continues, the memory is released and the next chunk is read. After the whole process is finished, we get the result of the desired calculation.
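A quick sketch of tricks five and six, again with placeholder data and file names rather than the talk's own dataset:

```python
import pandas as pd

# Trick 5: store a low-cardinality string column as a categorical dtype.
df = pd.DataFrame({"sex": ["F", "M"] * 500_000})
print(df["sex"].memory_usage(deep=True))    # object dtype, large
df["sex"] = df["sex"].astype("category")
print(df["sex"].memory_usage(deep=True))    # categorical, much smaller

# Trick 6: read a huge CSV in blocks of 500,000 rows and aggregate as we go,
# so only one chunk is held in memory at a time.
counts = None
for chunk in pd.read_csv("large_file.csv", chunksize=500_000):  # placeholder
    partial = chunk["sex"].value_counts()
    counts = partial if counts is None else counts.add(partial, fill_value=0)
print(counts)
```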
Vertical scaling with Jupyter on the cloud. The easiest solution to not having enough RAM is to throw money at the problem; that's basically vertical scaling, and thanks to cloud computing it's a very easy task. Vertical scaling is the ability to increase the capacity of existing hardware or software by adding resources: CPU, memory, disk, et cetera. On the other hand, we have horizontal scaling, which involves adding machines to the pool of existing resources. Jupyter is a very popular tool that helps us create documents with live code, and thanks to this tool it's very easy to run this code in the cloud, where there are plenty of large and cheap machines available from the cloud providers. What is the advantage of this approach? Basically, no code change is needed. It's very easy if you are using the right cloud tools, and there are plenty of options like Binder, Kaggle Kernels, Google Colab, Azure Notebooks, et cetera. It's very good for testing, cleaning and visualizing data, and the good thing is that you only pay for what you use. Of course, if you forget to turn off your machine, you'll have to pay for that too. The disadvantage of this approach is that in the long run it's very expensive, because you haven't really optimized your code, you're only throwing more money at the problem. It does not scale very well, and it's not production ready: normally your code is not optimized for production.

Speeding up pandas with Modin. Modin is a multiprocess data frame library with an API identical to pandas. That means you don't need to change your code, because the syntax is the same. Modin allows users to speed up their pandas workflows because it unlocks all the cores of your machine. It's very easy to install and only requires changing a single line of code: you change the pandas import to import modin.pandas as pd. The pandas implementation is single threaded, which means that only one of your CPU cores can be utilized at any given point in time. But if you use the Modin implementation, you will be able to use all the available cores of your machine, or even all the cores available in an entire cluster. The main advantages of Modin are that it unlocks all the CPU power of your machine, only one import change is needed, so no changes in the code are required, and it's really fast when reading data. It also has multiple compute engines available to distribute the computation across clusters; these clusters can be implemented with Dask or Ray. The main disadvantages of Modin are that you need to spend extra effort on the compute engine setup, Dask or Ray, plus the configuration of the clusters. Also, distributed systems are complex, so debugging can be a little bit tricky. And finally, Modin, just like pandas, requires a lot of memory.

Processing large data sets with Vaex. Vaex is a Python library with a syntax similar to pandas that helps us work with large data sets on machines with limited resources, where the only limitation is the size of the hard drive. Vaex provides memory mapping, so it will never touch or copy the data to memory unless it is explicitly requested. Vaex also provides some cool features like virtual columns, and it can calculate statistics such as mean, sum, count and standard deviation at a blazing speed of up to a billion rows per second. For the best performance, Vaex recommends using the HDF5 format. HDF5 is able to store a large amount of numerical data together with its metadata. But what do I do if my data is in CSV format? No problem: Vaex includes functionality to read a large CSV file in chunks and convert it to HDF5 files on the fly. To do this is very easy, we only need to add the parameter convert equals true to the from csv function. Also, Vaex provides a data frame server, so calculations and aggregations can run on a different computer; Vaex provides a Python API with websockets and a REST API to communicate with the Vaex data frame server. The advantages of Vaex are that it helps control the memory usage with memory mapping, and there are amazing examples on their website. Vaex allows computing on the fly with lazy evaluation and virtual columns, it's possible and easy to build visualizations with a data set larger than memory, machine learning is also available through the vaex.ml package, and data can be exported to a pandas data frame if you need a feature that is only available in pandas. The disadvantages of Vaex are that you need to make some modifications to your code, although the syntax is quite similar to pandas; Vaex is not as mature as pandas, but it is improving every day; and it's a little bit tricky to work with some of the data types in HDF5.
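As a rough sketch of the CSV-to-HDF5 conversion and lazy computation described above (the file path and column names are assumptions, not the speaker's dataset):

```python
import vaex

# Convert a large CSV to HDF5 on the fly; Vaex reads it in chunks and then
# memory-maps the resulting .hdf5 file instead of loading it into RAM.
df = vaex.from_csv("large_file.csv", convert=True)   # placeholder path

# Virtual column: defined lazily, no extra memory is allocated for it.
df["age_months"] = df["age"] * 12                     # assumes an 'age' column

# Statistics are computed out of core, without materializing the data set.
print(df["age_months"].mean())
```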
All in with PySpark. When you need to work with data at a very large scale, it's mandatory to distribute both the data and the computation across a cluster, and this cannot be achieved with pandas. Spark is an analytics engine used for large-scale data processing. It lets you spread both data and computation over clusters to achieve a substantial performance increase. PySpark is the Python API for Spark. It combines the simplicity of Python with the high performance of Spark, and it also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark's features, such as Spark SQL and DataFrames, streaming, the machine learning library and Spark Core. The advantages of PySpark are: first, you will get great speed with a large data set; also, it's a very rich and mature ecosystem with a lot of libraries for machine learning, feature extraction and data transformation; and Spark runs on top of Hadoop, so it can access other tools in the Hadoop ecosystem. On the other hand, we have some disadvantages. The first thing you will notice is that you need to modify your source code, because the PySpark syntax is very different from the pandas syntax. PySpark can also have very bad performance if the data set is very small; in that case it's much better to keep using pandas. Spark includes its own machine learning library, but at this moment it has fewer algorithms. And Spark requires a huge amount of memory to process all the data, so in some cases it may not be very cost effective.
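To give an idea of how different the syntax is from pandas, a minimal PySpark sketch of reading a CSV and aggregating it could look like this; the file path and column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; on a real cluster the master URL would differ.
spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Read a CSV into a distributed DataFrame (placeholder path and columns).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Group and aggregate; the work is distributed across the executors.
totals = df.groupBy("store_id").agg(F.sum("units_sold").alias("total_units"))
totals.show()
```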
Final notes. There are multiple options to scale up your workloads. The easiest one is to vertically scale your resources with Jupyter and a cloud provider, but first, don't forget to optimize your data frames or you will be wasting money. There are also some powerful alternatives for working with large data sets, like Vaex, so it's worth giving it a try if you need to process a huge amount of data. You can use Modin with Ray or Dask to distribute your workload, but this will add some extra complexity. Finally, you can rewrite your pandas logic to make it run over Spark data frames and take advantage of the many Spark platforms available from almost all the cloud providers. Thank you very much for your attention.

Marco Carranza

Co-founder @ Teamcore Tech



