Datasets keep growing, making it increasingly impractical to process them on a single desktop machine.
To solve this problem, many new technologies (Hadoop, Spark, Presto, Dask, etc.) have emerged in recent years to process data across clusters of computers. The challenge is that you need to build your solutions on top of these technologies, which means designing data processing pipelines and, in some cases, combining several technologies.
However, sometimes we don't have enough time or resources to learn and set up a full infrastructure just to run a couple of experiments. Maybe you are a researcher with very limited resources, or a startup on a tight schedule to bring a product to market.
The objective of this talk is to present multiple strategies for processing data as it grows, whether within the limits of a single machine or by using a cluster. The strategies focus on technologies such as Pandas, Pyspark, Vaex and Modin.
Outline
1. Introduction (2 mins)
2. Vertical scaling with Pandas and the Cloud (3 mins)
3. Keeping the memory under control by reading the data by chunks (5 mins)
4. Processing datasets larger than the available memory with Vaex (5 mins)
5. Scaling Pandas with Modin and Dask (5 mins)
6. All-in with Pyspark (5 mins)
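As a taste of the chunked-reading strategy from item 3, here is a minimal sketch of how Pandas can iterate over a CSV in pieces instead of loading it whole. The `chunksize` argument makes `pd.read_csv` return an iterator of DataFrames, so only one chunk lives in memory at a time; the in-memory `StringIO` buffer below is a stand-in for a large file on disk.

```python
import io
import pandas as pd

# A small buffer stands in for a large CSV on disk (10 rows, 0..9).
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
n_chunks = 0
# chunksize=4 yields DataFrames of up to 4 rows each.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()  # replace with any per-chunk aggregation
    n_chunks += 1

print(n_chunks, total)  # 3 chunks (4 + 4 + 2 rows), sum of 0..9 = 45
```

The same loop pattern scales to files far larger than RAM, as long as the computation can be expressed as an aggregation over chunks.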