Conf42 Python 2021 - Online

Beginning your own data engineering projects

Abstract

Collecting and maintaining quality data is at the heart of data engineering. Before we can perform data science or data analytics, we need data to begin with, and with Python we can automate the data collection process and keep data up to date. Data engineers are in high demand because they lay the foundation for data-driven enterprises. If you're interested in building up your portfolio of data engineering projects, watch my talk :)

Summary

  • Data engineering helps ensure a consistent flow of data to data scientists and other data users. It enables large-scale data mining, analytics and reporting. If you're thinking of collecting more data for your own data science projects, this talk might also be useful for you.
  • A good starting point is to make use of Google Sheets. The Google Sheets API is free, and Google Data Studio is also free. If you are considering starting your own data engineering projects, this is something you can consider.
  • How to use the Google API: set up a new Google Sheet first, then scrape the data from the website. Some parts of the website require interaction in order to load; this is not done by me manually but by the script.
  • From this Google Sheet we can use Google Data Studio to connect to the sheet and create a nicer-looking interface for users to explore. From there, if you're interested, you can use the data for text mining to identify some of the more popular domains within data science at the moment.
  • The second use case is creating a blended face based on multiple images. Here we are also getting data, in this case images off the Internet. After that I made use of an open-source library hosted on GitHub. There is no one single rule when it comes to data engineering.
  • I maintain a data science blog called Data Double Confirm, where I cover many other things related to various data science tasks. Web scraping is one big component of data collection. From there, with the data we have, we can perform data visualization or statistical analysis.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, a very good day to everyone. My name is Hui Xiang, I'm based in Singapore, and today I'll be talking about beginning your own data engineering projects. Before we start, let me talk a little bit about what data engineering is. In my current role I help drive a lot of data engineering projects at work, mainly to help my company become more data driven. Data engineering helps ensure a consistent flow of data to data scientists and other data users, typically in the form of a data warehouse, to enable large-scale data mining, analytics and reporting. If you're thinking of collecting more data for your own data science projects, this talk might also be useful for you. Data engineers are important because they help build the foundation for data-driven enterprises: before we get to do data science or data analytics, the most important thing is to have quality data. With these data engineering projects, what we are trying to do is build data pipelines that transfer data from one place to another, so instead of transporting liquid or gas, we are transporting data. The typical process involves extracting data from one place and transforming it before loading it into the data warehouse. This process is called ETL: extract, transform, load.
At my workplace we make use of APIs that point to different sources. We have different kinds of campaigns being run on different advertising platforms, and these platforms hold a lot of data related to the campaigns, so we build pipelines that pull data from these platforms and store it in our own data warehouse, which is on Google Cloud Platform. Depending on your context and your company, you might instead have in-house systems that contain transactional data you want to use for analysis, so you extract the data, do some cleaning or transformation, and then load it into your data warehouse for analysis. We want to automate this process as much as possible to save time and ensure consistency and accuracy. Source data, transactional data for example, might contain erroneous records or test cases here and there that should not be used for analysis, so the transformation step is where we do the data cleaning. We want the data we analyse to be as accurate as possible: the data warehouse should be the single source of truth and contain the most accurate data.
Today I'll be sharing a few use cases. The first one relates to Towards Data Science. For those who are not so familiar with it, Towards Data Science is a publication hosted on Medium where different authors write about what they have done and learned through the data science projects they do on their own or for work. A lot of articles are published every day, and what I want to do is extract the titles of these articles and store them somewhere so that I can use them for further analysis later on.
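As a rough illustration of the flow just described (this is not the speaker's actual script), a minimal ETL skeleton in Python might look like the sketch below; the function names, column names and placeholder row are assumptions made for the example.

    import pandas as pd

    def extract() -> pd.DataFrame:
        # In the real pipeline this would be the Selenium + BeautifulSoup scraping step;
        # a single placeholder row keeps the sketch self-contained and runnable.
        return pd.DataFrame([{"author": "A. Writer",
                              "recency": "1 hour ago",
                              "post_title": "An example post"}])

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Clean the scraped rows and stamp them with the extraction time.
        df = df.drop_duplicates(subset=["author", "post_title"])
        df["time_extracted"] = pd.Timestamp.now()
        return df

    def load(df: pd.DataFrame) -> None:
        # In the real pipeline this would push the rows to a Google Sheet via gspread.
        print(df)

    if __name__ == "__main__":
        load(transform(extract()))

The extract step (web scraping) and the load step (Google Sheets) are the parts covered in the rest of the talk.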
Because sometimes we just want to practise with our own side projects, there are some constraints, for example a limited budget, where we might not be able to afford a data warehouse or storage space. A good starting point is to make use of Google Sheets: the Google Sheets API is free, and Google Data Studio is also free, so if you are considering starting your own data engineering projects, this is something you can consider. What I do here is run the ETL process from the Towards Data Science website, the Medium publication: I do some transformation, data cleaning and data preparation in a Python script, and then load the cleaned data into a Google Sheet. Subsequently I can use Data Studio to connect to this Google Sheet and present it as a dashboard that is friendly for deep dives. This is what Towards Data Science looks like in terms of the web interface; if you go to towardsdatascience.com you can see there are many, many different articles published, and this is just a screenshot that I've taken. What we want to do is extract the data related to the author, the recency and the post title, and then I add a column for the time extracted, so I know exactly when the script was run if I look back in the future. We need some web scraping packages within Python to extract the data from the website: as you can see, I have imported BeautifulSoup and Selenium, the two main packages necessary for the web scraping, while NumPy and pandas are used for the transformation, and gspread and gspread-dataframe are used to push the data to the Google Sheet. This talk will perhaps be more relevant to you if you are already familiar with Python, so you know exactly what the syntax means. I have also shared the instructions on how to use the Google API in this slide, so if you download the slides and click on the links, they will direct you to the Google API GitHub page, which contains more documentation on exactly how to get started. To use gspread and gspread-dataframe you need to install the two packages, so you run pip install gspread and pip install gspread-dataframe. The instructions on that page are pretty detailed and clear, so if you just follow through you should be able to get the API working, and there are no charges for using the Google Sheets API. The authentication part is probably the most important part, because if it doesn't work you won't be able to do the engineering side of it effectively: what we are trying to do is automate the process, and if you can't authenticate against the API, you can't automate the process and the data won't get loaded in. What you see over here is also something I took from the tutorial on how to use the Google API. There are some parameters you have to change: create a new Google Sheet first, then look at the ID of the sheet, which should be the last part of the URL, copy that sheet ID and paste it in to replace the one shown, so that it is applicable in your case.
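As a hedged sketch (not the speaker's exact code), a minimal gspread setup could look like the following; the credentials filename, sheet ID and worksheet name are placeholders, and the talk's own slides may use the older oauth2client flow with explicit scopes rather than the service_account helper shown here.

    import gspread
    from gspread_dataframe import get_as_dataframe

    # Authenticate with the downloaded service-account JSON key (placeholder filename)
    # and open an existing sheet by its ID, i.e. the last part of the sheet's URL.
    gc = gspread.service_account(filename="credentials.json")
    sh = gc.open_by_key("YOUR_SHEET_ID")
    ws = sh.worksheet("Sheet1")   # default worksheet name

    # Pull whatever rows are already in the sheet into a DataFrame.
    existing = get_as_dataframe(ws).dropna(how="all")
    print(existing.head())

gspread's service_account helper reads the JSON key and applies the spreadsheet scopes for you, which is the authentication step the speaker stresses next.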
After you have set up the API, there should be a credentials token file, a JSON file, which you have to download to your local machine; in order for the API to work, the credentials are read from this JSON file. Since we are using the Spreadsheets API, we need to include it in our scopes. Once the authentication is done, the gspread and gspread-dataframe packages will work. Over here I have already created the Google Sheet and I am reading in Sheet1; I didn't change the name of the sheet, so if you change the worksheet name, just replace 'Sheet1' with whatever worksheet name you have. I retrieve whatever existing data there is. If there is no existing data, it's an empty spreadsheet and there isn't anything, but because I am going to run the update on a daily basis, there is already data in the sheet, so I want to retrieve it first and then append more data to it. The next step, after I get the data from the sheet, is to scrape the data from the website; this is where the web scraping comes in. I access the Towards Data Science site first, and you'll notice I have to make use of Selenium, because there are certain parts of the website that need to be triggered in order for more content to load. How do I scrape the titles off the website? I have to find out which HTML element all this data, the title and the author, is tied to. When I click Inspect, the browser shows me the HTML behind the page, and I see that the data is tied to the a tag; over here you can see the a class. As I mentioned, certain parts of the website require some interaction in order to load. In this case I have to click 'Show more' to load more articles, because a lot of articles are published in one day, and since I'm running this script daily I want to make sure that all the articles published in a day are shown to me. If you just load the page on its own, I think only about ten articles are shown without clicking 'Show more', so I have to make the script interact with the page and click 'Show more' so that more articles get scraped. I set an arbitrary number, 20, and click the 'Show more' button 20 times; each click loads roughly another ten articles, and then I click again to load the next ten. I do this about 20 times, but it is not done by me manually, it is done by the script, so it's automated and very easy. In between each click I have a lag of 2 seconds, which is the time.sleep(2) you can see over here. I do this because I want to make sure the page doesn't hang and doesn't lock me out: sometimes there are checks behind the scenes that look for bots, which in this case it is, but we want to overcome that and make it look more human. By building in this kind of time lag, the script mimics more human behaviour and is able to run through.
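A hedged sketch of that Selenium step is below; the 'Show more' button locator is an assumption you would confirm by inspecting the page, and the talk's original script may use Selenium 3's find_element_by_* calls rather than the newer By locators.

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()   # assumes a matching ChromeDriver is available
    driver.get("https://towardsdatascience.com")

    for _ in range(20):           # arbitrary number of "Show more" clicks, as in the talk
        try:
            # Locator is a guess; inspect the page to find the real button.
            driver.find_element(By.XPATH, "//button[contains(., 'Show more')]").click()
        except Exception:
            break                 # button not found, so stop clicking
        time.sleep(2)             # 2-second lag between clicks to mimic a human

    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()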
Next, what I do over here is a loop: I look for the a tags, extract them and get the text, and then there is a second for loop that I run through. Let me show you what the output looks like after I have scraped the text tied to the a tags. This is what you get after running the script: the data returned on the author, the recency and the title, plus some other, more random text that appears and is not very consistent. Some of that other text is tied to, for example, the caption of a photo or image that was added to the article. What I realised is that there is a pattern: there are always three blanks in between the data for each post. So what we need is just the first three elements of each chunk separated by those three blanks. What I did is check whether the current element is the same as the next two elements; over here you can see the check that element i equals element i+1, and that element i+1 equals element i+2. As I run through the loop, I check whether this element is the same as the next one, which should be a blank, and the same as the one after that. If all three are the same, they are all blanks, and that gives me a position I want. I store all these element positions, then do another loop to extract the text tied to those positions. After that I convert it into a DataFrame and add a new column with the current time, just to indicate when the job was run. I append the newly scraped data to the existing data in the Google Sheet, and then check for duplicates based on author and post title, because since I arbitrarily chose to click 'Show more' 20 times, some of the data could already have been scraped the previous day. To make sure there are no overlaps or duplicates, I drop the duplicates, again using pandas, fill missing values with a space if there are any, and drop rows whose index is missing. In certain cases the first row after sorting is empty, so I drop it, and after that I update my Google Sheet; that's what the last line of code, w.update, is doing. This is the output based on the code just now, where I loop through the element positions and get the text tied to each position. A small note: there are a few cases where the title is not in the third line of the chunk scraped for a post; the data is not so clean there, and the image caption is extracted in an earlier position while the title appears later. Those captions generally have fewer than 20 characters, so this is where I built in an if-else: if the number of characters is fewer than 20, the title should be in the seventh position, and if not, it should be earlier; this is how I identify it.
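Loosely reconstructing that 'three blanks' idea (the sample list below is made-up placeholder data, and the exact offsets and column names are assumptions; the talk's own indexing may differ slightly):

    import pandas as pd

    # Rough shape of the scraped <a> texts: three blank strings separate each post's
    # author / recency / title, with stray caption text sometimes mixed in.
    texts = ["", "", "", "A. Writer", "1 hour ago", "An example post", "photo caption",
             "", "", "", "B. Author", "2 hours ago", "Another example post"]

    # Positions where three consecutive elements are identical blanks mark a new post.
    starts = [i for i in range(len(texts) - 2)
              if texts[i] == texts[i + 1] == texts[i + 2] == ""]

    # Take the first three elements after each run of blanks: author, recency, title.
    rows = [texts[i + 3:i + 6] for i in starts if len(texts[i + 3:i + 6]) == 3]

    new_df = pd.DataFrame(rows, columns=["author", "recency", "post_title"])
    new_df["time_extracted"] = pd.Timestamp.now()
    print(new_df)

The combined old and new rows would then be deduplicated with pandas' drop_duplicates on author and post title before being written back with the worksheet's update call; the under-20-characters check the speaker mentions would be an extra refinement on top of this.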
To automate the entire process, now that I have the Python script that does the web scraping and updates the Google Sheet, I want it to run on a daily basis, so I have to schedule it. It's actually very easy: with the Python script ready, you can just open the command prompt, as I have over here. I have a virtual environment, so I activate it first and then run my Python script. Alternatively, you can set up a cron job locally. In my case I'm using Windows, so I set it up using the Windows Task Scheduler. Looking at the instructions over here, what you need to do is create a basic task; this is what you see when you launch Administrative Tools and create a basic task, and there are certain boxes you have to fill in. You locate where your python.exe is and paste that path over here, then identify where your script is located and paste the location of the script in the arguments. The rest of the settings in the Task Scheduler are pretty straightforward, like when you want to run it, at what time, et cetera. If you search for this Stack Overflow link, you should be able to see the full instructions. So this is what we get. Of course, there are still certain cases where, based on the logic I've come up with, for example the check on the number of characters, there could be outliers here and there, and those might still need some manual intervention, but on the whole I would say this is 90% or even 95% clean data.
From this Google Sheet we can use Google Data Studio to connect to the sheet and create a nicer-looking interface for users to explore the data. Over here you can see that what I have done is collect the data over a six-week period, between 21 November 2020 and 22 January 2021, and there are so many articles being published in a day, which to me is pretty crazy. Just give me a moment to load the Data Studio page. This is what we have in the Google Sheet. What's interesting is that there are multiple articles on pretty similar topics, and from here, if you're interested, you can make use of the data to do some text mining to identify some of the more popular domains within data science at the moment, whether relating to certain kinds of tools or algorithms; you can search for those keywords within the article titles. As you can see, there really are quite a number of articles published over those six weeks: over 1,000, around 1,680 in total.
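For reference, a hypothetical version of that daily-run setup might look like the notes below; every path and filename here is a placeholder, not the speaker's actual environment.

    # Manual run from the command prompt: activate the virtual environment, then run the script.
    #   > venv\Scripts\activate
    #   > python tds_scraper.py
    #
    # Windows Task Scheduler > Create Basic Task, filled in roughly as:
    #   Program/script:  C:\path\to\venv\Scripts\python.exe
    #   Add arguments:   C:\path\to\tds_scraper.py
    #   Trigger:         Daily, at a fixed time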
Now I'll talk about the second use case: creating a blended face based on multiple images. In this case we are also getting data, this time images off the Internet, and saving them into my local drive, my local folder. What I did after that was make use of an open-source library hosted on GitHub to create the blended face. The process here is a little different, in the sense that we do the extract and load first, before the transformation. There is actually no one single rule when it comes to data engineering; it doesn't necessarily have to be ETL, extract, transform, load. It can also be extract and load first, with the transformation done afterwards. Some organisations have data lakes: data from many different places is pulled into one single location first, the data lake, which can hold clean and unclean data together, and the transformation is done after that. Based on my work experience so far, I typically do ETL most of the time, extract, transform, then load, but again it depends on your organisation and how you want to store the data.
In this case, what I'm trying to do is create a blended, average face for the candidates of the different political parties in Singapore. Last year, in 2020, there was a general election, and a news outlet designed a website to let people know who the candidates from the different political parties are and where they are contesting. They uploaded images of all these candidates, and the images are publicly accessible, so what I did was scrape the images of all these candidates off the website into my local drive. Again I made use of the web scraping tools BeautifulSoup and Selenium, and I also imported re, for regular expressions, because I need to extract the format of the image later based on the URL; I also used urllib's request module for accessing the website. What you see over here is the same as in the previous use case. This page is also interactive, in the sense that I have to scroll down for the page to load entirely, so when we access it with Python we need to include more arguments in the options, so that when the webdriver accesses the page, the page loads entirely. What I added over here are the headless argument and disable-gpu, to overcome this. Then, similarly, using BeautifulSoup I look for the tag all these images are tied to, which in this case is img, and I run a loop to scrape the images. There is also a bit of trial and error here, where I need to identify the location of the first image that I'm interested in, because when I extract the elements tied to the img tag there are other images I'm not interested in, for example the map. After some trial and error, I identified that the first image I want is in the very first position of the img tag, and there are a certain number of images I need to scrape, with the last element in position 369. So I run the loop and filter out what isn't necessary: you can see there is also additional subtext, like alt attributes that are blank, and I skip those, because they shouldn't be blank. Then I get the source URL of the image and scrape it, saving it under the name tied to the image.
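Here is a hedged sketch of that headless setup and image loop; the URL, output folder and attribute handling are placeholder assumptions, and the talk's own script will differ in its exact positions and filtering.

    import os
    import urllib.request
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Headless Chrome so the page can fully load without opening a browser window.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")

    driver = webdriver.Chrome(options=options)            # assumes a ChromeDriver is available
    driver.get("https://example.com/ge2020-candidates")   # placeholder URL
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    os.makedirs("faces", exist_ok=True)
    for img in soup.find_all("img"):
        name = img.get("alt", "").strip()
        src = img.get("src", "")
        if not name or not src:
            continue        # skip decorative images with blank alt text
        # Keep the original format by slicing the last four characters of the URL
        # (".jpg" / ".png"), as described in the talk.
        urllib.request.urlretrieve(src, os.path.join("faces", name + src[-4:]))

The folder of saved images is then what the face-averaging library gets pointed at.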
If you can see, I also included this [-4:] slice, which extracts the format of the image, such as .jpg or .png, so that I save each file accordingly into my local drive. After all the images have been scraped successfully, I make use of the open-source library that someone created and hosted on GitHub, which I'm very thankful for because it makes things much easier. I think this is the wonder of open source: you can do a lot of things much more quickly and easily based on what other people have developed. If you go to this URL, you will see the scripts developed to create a blended, average face. What you need to do is just change the file path to point to the images you saved locally, then run the script, run the package; the instructions are all on GitHub for this library, and if you go through them on how to download and run it, you'll be able to get the output very quickly as well. This is what I got after running the script: for each political party with at least ten candidates, I created one face, the average face. Just wrapping up: for data engineering projects, first we have to start off by thinking about what analysis you or your data science team want to do. From there we think about what data is necessary, and then we write a Python script to automate the ETL or ELT process. This allows us to generate insights much more quickly from the data we have collected. I also maintain a data science blog called Data Double Confirm, where I cover a lot of other things related to various data science tasks. Web scraping is one big component related to data collection, because while there is a lot of development in the algorithms realm and in deep learning, sometimes what we really need is the data, which is why I'm still talking about data engineering. From there, with the data we have, we can perform data visualization or statistical analysis that gives us insights and helps us make more informed decisions. This is the URL of my data science blog, and feel free to connect with me via LinkedIn or Twitter. Thank you.

Hui Xiang Chua

Senior Analytics Manager @ Essence



