Conf42 Python 2023 - Online

Tips and tricks for data science projects with Python

Abstract

Python has become the most widely used language for machine learning and data science projects due to its simplicity and versatility. For this purpose, Python provides access to great libraries and frameworks for AI and machine learning (ML), flexibility and platform independence.

Summary

  • Python has become the most widely used language for machine learning and data science projects. For this purpose, Python provides access to great libraries and frameworks for artificial intelligence and machine learning. Jose Manuel Ortega has written a book with tips and tricks for data science projects with Python.
  • The machine learning lifecycle is basically the cyclical process that data science projects follow. It defines each step that an organization should follow to take advantage of machine learning and artificial intelligence to derive practical business value.
  • Scikit-learn is an open source tool for data mining and data analysis. Scikit-learn's main features include classification, regression, clustering, dimensionality reduction, model selection and preprocessing. Scikit-learn is a great library to master for machine learning beginners and professionals.
  • We continue by commenting on the libraries we have in Python for deep learning. TensorFlow is an open source library that is based on a neural network system. Another interesting library is PyTorch. Theano is a Python library that allows you to define, optimize and evaluate mathematical expressions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my name is Jose Manuel Ortega and I'm a Python developer from Spain. This talk is called Tips and tricks for data science projects with Python. Basically, Python has become the most widely used language for machine learning and data science projects due to its simplicity and versatility. For this purpose, Python provides access to great libraries and frameworks for artificial intelligence and machine learning, flexibility and platform independence. This year I have written and published a book titled Big Data, Machine Learning and Data Science with Python. This book is published in Spanish. In this book, basically, you can find practical examples with pandas, PySpark, scikit-learn, TensorFlow, Hadoop, Jupyter Notebook and Apache Flink. These are the main talking points: I will start by introducing Python as a programming language for machine learning projects, I will then comment on the main stages of a machine learning project and on the main Python libraries for your project, and finally I will comment on the main Python tools for deep learning in data science projects.

Well, Python's simplicity allows developers to write reliable systems, and developers get to put all their effort into solving the machine learning problem instead of focusing on the technical nuances of the language. Since Python is a general purpose language, it can handle complex machine learning tasks and enables you to build prototypes quickly, which allows you to test your product for machine learning purposes. There are some fields where artificial intelligence and machine learning techniques are applied, for example spam filters, recommendation systems, search engines, personal assistants and fraud detection systems. In this table we can see the main libraries, the main modules we have in Python for each domain. For example, for machine learning we have Keras, TensorFlow and Theano; for high performance in scientific computing we have NumPy and SciPy; for computer vision we have OpenCV; for data analysis we have NumPy and pandas; and for natural language processing we have spaCy. Now we are going to review some of these libraries.

For example, NumPy is the fundamental package required for high performance scientific computing and data analysis in the Python ecosystem. The main data structure in NumPy is the ndarray, which is a shorthand name for n-dimensional array. When working with NumPy, such data is simply referred to as an array, and you can create, for example, one-dimensional, two-dimensional and three-dimensional arrays. The main advantage of NumPy is its speed, mainly due to the fact that it is developed in the C programming language, and for data science and machine learning tasks it provides a lot of advantages. Another module we have in Python, one of the most popular libraries for scientific Python, is pandas, which is built upon NumPy arrays, thereby preserving fast execution speed and offering many data engineering features, including reading and writing data in many formats, selecting subsets of data, calculating across rows and columns, finding and filling missing data, applying operations to independent groups within the data, and other tasks related, for example, to combining multiple data sets together. One of the structures that is very useful in pandas is the DataFrame, which is the most widely used data structure. You can imagine it as a table in a database or a spreadsheet with rows and columns.
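As a minimal sketch of these ideas (the array shapes and the small data set below are invented for illustration, not taken from the talk), we can create NumPy arrays and a pandas DataFrame and apply some of the operations mentioned above:

    import numpy as np
    import pandas as pd

    # One-dimensional and two-dimensional NumPy arrays
    a1 = np.array([1, 2, 3])
    a2 = np.array([[1, 2, 3], [4, 5, 6]])

    # A small, invented data set with a missing value
    df = pd.DataFrame({
        "name": ["Ana", "Luis", "Marta"],
        "age": [34, np.nan, 29],
        "city": ["Madrid", "Valencia", "Sevilla"],
    })

    # Select a subset of the data
    subset = df[["name", "age"]]

    # Find and fill missing data
    df["age"] = df["age"].fillna(df["age"].mean())

    # Summary statistics computed over the rows of each numeric column
    print(df.describe())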
Basically, a DataFrame is a two-dimensional data structure with potentially heterogeneous data. Its main feature is that it is a size-mutable structure, which means data can be added to it or deleted from it in a simple way. Pandas provides another interesting project that is called pandas-profiling, an open source Python module with which we can quickly perform an exploratory data analysis with just a few lines of code. In addition, it can generate interactive reports in web format that can be presented to anyone. In short, what pandas-profiling does is save us all the work of visualizing and understanding the distribution of each variable in our data set, generating a report with all the information easily visible.

Now I'm going to comment on the main stages of a machine learning project. Machine learning is the study of certain algorithms and statistical techniques that allow computers to perform complex tasks without receiving instructions beforehand. Instead of using pre-programmed rules dictating certain behavior under a certain set of circumstances, machine learning relies on pattern recognition and associated inferences. In this diagram we can see the main stages of a machine learning project. We start with labeled observations, and in stage two we split these labeled observations into training and test data sets. In step three, our model is built using the training data, and for validating the model we use the test data set. In the last step, basically, the model is evaluated on the degree to which it arrives at the correct output. In this diagram we can see these stages in a more general way. The machine learning lifecycle is basically the cyclical process that data science projects follow. It defines each step that an organization should follow to take advantage of machine learning and artificial intelligence to derive practical business value. These are the five major steps in the machine learning lifecycle, all of which have equal importance and go in a specific order. We start by getting data from various sources. In step two, we try to clean the data to have homogeneity. In step three, we try to build our model, selecting the right machine learning algorithm depending on our data. In step four, we try to gain insights from the model's results, and in step five, we have basically data visualization, transforming the results into visual graphs.

In a more detailed way, in this diagram we can see the specific tasks for each stage. The first step of the lifecycle is related to defining the project objectives. In the second step, we acquire and explore data, where we collect and prepare all of the relevant data for use in machine learning algorithms. In the third step, we build our model; in order to gain insights from your data with machine learning, you must determine your target variable, which is the factor of which you wish to gain a deeper understanding. In the fourth step, we try to interpret and communicate the results of the model. Basically, the more interpretable your model is, the easier it will be to meet regulatory requirements and communicate its value to management and other key stakeholders. And finally, the last step is to implement, document and maintain the data science project so that the project can continue to leverage and improve upon its models. Now we are going to comment on the main libraries, the main modules we have in Python for these tasks. We start with scikit-learn.
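Before going into detail, here is a minimal sketch of the lifecycle steps just described, using scikit-learn; the data set and the choice of classifier are only illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Labeled observations
    X, y = load_iris(return_X_y=True)

    # Split the labeled observations into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Build the model using the training data
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Evaluate the model on the test set (data it has not seen)
    print(accuracy_score(y_test, model.predict(X_test)))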
Scikit-learn is an open source tool for data mining and data analysis and provides a consistent and easy-to-use API for tasks related to preprocessing, training and predicting data. Scikit-learn's main features include classification, regression, clustering, dimensionality reduction, model selection and preprocessing. It provides a range of supervised and unsupervised learning algorithms via a consistent interface. The library provides a lot of algorithms for classification, regression and clustering; for example, for clustering the K-means algorithm and DBSCAN are very useful. It is also designed to work with the Python numerical and scientific libraries like NumPy and SciPy. Scikit-learn is a great library to master for machine learning beginners and professionals. However, even experienced machine learning practitioners may not be aware of all the hidden gems of this package which can aid in their tasks significantly. I am going to comment on the most relevant features that we can find in this library.

For example, pipelines are very useful for chaining multiple estimators. If we have multiple estimators in our pipeline, we can use this feature to chain these estimators. This is useful when we need to fix a sequence of steps in processing the data, for example feature selection, normalization and classification. At this point, the utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and produces a pipeline built from those steps. The use of pipelines to split, train, test and select the models and hyperparameters makes it much easier to keep track of the outputs at every stage, as well as to report why you chose specific hyperparameters. Hyperparameters basically are parameters that are not learned directly within estimators; instead, they are passed as an argument to the constructor of the estimator classes. At this point, it is possible to search the hyperparameter space for the best cross-validation score. Any hyperparameter provided when constructing an estimator may be optimized in this way, and we can use the get_params method to find the names and current values of all parameters for a given estimator.

Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets, and the variance of an estimator indicates how sensitive it is to varying training sets. At this point, it can be helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some parameter values. The function validation_curve can help us in this case, returning training and validation scores. If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise, we suppose that the estimator is working well. Another interesting feature is one-hot encoding, which is a very common data preprocessing task to transform input categorical features into binary encodings for use in classification or prediction tasks. For example, let us assume that we have two categorical values; in this table we can see that we have one column with yes and no values, and this column is transformed into new columns, one for each category.
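A minimal sketch of this kind of transformation with scikit-learn's OneHotEncoder might look like the following (the yes/no column is an invented example):

    from sklearn.preprocessing import OneHotEncoder

    # A single categorical column with yes and no values
    values = [["yes"], ["no"], ["yes"]]

    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(values).toarray()

    print(encoder.categories_)  # the categories detected in the column
    print(encoded)              # one binary column per category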
For example, for the yes value we have the values one and zero, and for the no value we have zero and one in the two new columns created, and in this easy way we can use the OneHotEncoder to apply these transformations. Scikit-learn also includes random sample generators that can be used to build artificial data sets of controlled size and complexity. It has functions for classification, clustering, regression, matrix decomposition and manifold learning.

Another technique can be useful when we have a large data set and we need to reduce the dimensionality of the data: in these cases, we can apply principal component analysis (PCA). Basically, PCA works by finding the directions of maximum variance in the data and projecting the data onto those directions. The amount of variance explained by each direction is called the explained variance. The explained variance can be used to choose the number of dimensions to keep in a reduced data set. It can also be used to assess the quality of a machine learning model. In general, a model with high explained variance will have good predictive power, while a model with low explained variance may not be as accurate. In this diagram we can see that we have two independent principal components, PC1 and PC2. PC1 represents the vector which explains most of the information (variance), and PC2 represents less of the information. In this example, we are following the classical machine learning pipeline, where we first import libraries and the data set, perform exploratory data analysis and preprocessing, and finally train our models, make predictions and evaluate accuracy. At this point we can use PCA to find the optimal number of features before we train our models. Performing PCA is as easy as following a two-step process: first, we initialize the PCA class by passing the number of components to the constructor, and in the second step we call the fit and transform methods, passing the feature set to these methods. The transform method returns the specified number of principal components for this data set.

Another interesting library we can find in Python for tasks related to obtaining statistical data for data exploration is statsmodels. Statsmodels is another great library which focuses on statistical models and can be used for predictive and exploratory analysis. If you want, for example, to fit linear models, do statistical analysis, or maybe a bit of predictive modeling, then statsmodels is great. We continue by commenting on the libraries we have in Python for deep learning. We start with TensorFlow, which is an open source library that is based on a neural network system. This means that it can relate several network data simultaneously, in the same way that the human brain does. For example, it can recognize several words of the alphabet because it relates letters and phonemes. Another case is that of images and text, which can be related to each other thanks to the association capacity of the neural network system. Internally, what TensorFlow uses is tensors for building the neural network. A tensor basically is a mathematical object represented as an array of higher dimensions, and these arrays of data with different sizes and ranks are fed as input to the neural network.
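As a minimal sketch (assuming TensorFlow is installed; the values are illustrative), tensors of different ranks can be created and inspected like this:

    import tensorflow as tf

    # Tensors are arrays of data with a shape and a rank
    scalar = tf.constant(3.0)                       # rank 0
    vector = tf.constant([1.0, 2.0, 3.0])           # rank 1
    matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank 2

    print(matrix.shape)     # (2, 2)
    print(tf.rank(matrix))  # rank of the tensor, returned as a tf.Tensor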
TensorFlow has become an entire machine learning ecosystem for all kinds of artificial intelligence technology. For example, here are some of the features the community has added to the original TensorFlow package: we have TensorFlow Lite for working with smartphone operating systems and IoT devices. Since the TensorFlow 2.0 version, Keras has been adopted as the main API to interact with TensorFlow. The main difference is that TensorFlow works at a low level and Keras works at a high level for building neural networks, interacting with TensorFlow through the API that TensorFlow provides. This is maybe the best choice for any beginner in machine learning. It offers an easy way to express neural networks compared to other libraries. Basically, it provides an interface for interacting with TensorFlow in an easy way. In this code, we can see a little code example for Keras. Keras, as I commented before, is the ideal tool for rapid experimentation, and the most common way to define your model is by building a graph of layers, which corresponds to the mental model we normally use when we think about deep learning. The simplest type of model is a stack of layers, and you can define such a model using the Sequential API, as we can see in this code. The advantage of this process is that it is easy to visualize and build a deep learning model using the different methods and classes that the Keras API provides.

Another interesting library is PyTorch. PyTorch is similar to TensorFlow, and we can use TensorFlow or PyTorch in all the stages of a machine learning project. We can use PyTorch for getting the data ready, building or picking a pretrained model, fitting the model to the data and making predictions, evaluating the model, improving through experimentation, and saving and reloading your trained model. All these stages can be executed with PyTorch. If we compare the three libraries, as I commented, we can see that Keras works at a high level, normally in conjunction with TensorFlow, while PyTorch works at a lower level. At the architecture level, PyTorch and TensorFlow are more complex to use, and Keras is simpler and more readable. Regarding speed, Keras offers lower performance compared with the others, while TensorFlow and PyTorch offer fast, high-end performance. And regarding training models, all three libraries offer this feature.

Finally, commenting on Theano: it is a Python library that allows you to define, optimize and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano's main features include tight integration with NumPy, transparent use of GPUs, efficient symbolic differentiation, speed and stability optimizations, dynamic C code generation, and extensive unit testing and self-verification. It provides many tools to define, optimize and evaluate mathematical expressions, and numerous other libraries can be built upon Theano to exploit its data structures. Theano is one of the most mature machine learning libraries, since it provides nice data structures like tensors, like the structure we have in TensorFlow, to represent layers of neural networks, and they are efficient in terms of linear algebra, similar to NumPy arrays. There are a lot of libraries built on top of Theano exploiting its data structures, and as I commented before, it has support for GPU programming out of the box as well. And that's all. Thank you very much for attending this presentation. In this slide you can see my contact details if you want to contact me on social networks like Twitter and LinkedIn, and if you have any question or any doubt, you can use these channels to resolve your questions. Thank you very much.
...

Jose Manuel Ortega

Python Developer & Security Researcher

Jose Manuel Ortega on LinkedIn and Twitter


