Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Hi, a very good day to everyone. My name is Ang, I'm based in Singapore, and today I'll be talking about beginning your own data engineering projects.
            
            
            
Before we start, let me talk a little bit about what data engineering is. Currently, in my role, I help drive a lot of data engineering projects at work, mainly to help my company become more data driven.
            
            
            
Data engineering actually helps to ensure a consistent data flow for data scientists or data users, typically in the form of a data warehouse, to enable large-scale data mining, analytics and also reporting. If you're thinking of collecting more data on your own for your own data science projects, this talk might also be useful for you. As you know, data engineers are important because they help build the foundation for data-driven enterprises. Usually, before we get to do data science or data analytics, the most important thing is to have quality data.
            
            
            
So with all these data engineering projects, what we are trying to do is build data pipelines, and we help transfer data from one place to another through these pipelines. Instead of transporting liquid or gas, what we are transporting here is data. The typical process involves extracting data from one place and then transforming it before loading it into the data warehouse. This process is called ETL: extract, transform, load.
            
            
            
At my workplace currently, we make use of APIs that point to different sources. What we have is different kinds of campaigns being run on different advertising platforms, and these advertising platforms hold a lot of data relating to the campaigns. So we build pipelines that pull data from these platforms and then store it in our own data warehouse, which is on Google Cloud Platform.
            
            
            
Depending on your context and your company, you might have in-house systems that contain transactional data, and you want to use that data for analysis. What you have to do is extract it, probably do some cleaning or transformation, and then load it into your data warehouse, which you can then use for data analysis. We want to automate this process as much as possible to save time and ensure consistency and accuracy. As you know, the source data, for example if it contains transactional data, might have some erroneous records here and there, or some test cases that were stored and should not be used for analysis. The transformation step is where we do the data cleaning: we want to check that whatever data we have for our analysis is as accurate as possible. The data warehouse should be the single source of truth, and it should contain the most accurate data.
            
            
            
So today I'll be sharing a few use cases. The first one relates to Towards Data Science. For those who are not so familiar with Towards Data Science, it is a publication hosted on Medium where different writers share what they have done and what they have learned through the data science projects they do, be it on their own or for work. There are a lot of articles being published on a day-to-day basis.
            
            
            
What I'm trying to do here is extract the titles of these articles and store them somewhere so that I can use them for further analysis later on. Sometimes we want to do our own side projects, just to practice.
            
            
            
There are some constraints, for example a limited budget, where we might not be able to afford a data warehouse or storage space. So I think a good starting point is to make use of Google Sheets. The Google Sheets API is free, and Google Data Studio is also free.
            
            
            
So if you are considering starting your own data engineering projects, this can be something you consider. What I do over here is the ETL process: I extract from the Towards Data Science website, the Medium publication, then do some transformation, data cleaning and data preparation in a Python script, and after that I load this clean data into a Google Sheet.
            
            
            
Subsequently I can use Data Studio to connect to this Google Sheet and present it in a dashboard format that's friendly for a deep dive.
            
            
            
This is what Towards Data Science looks like in terms of the web interface. If you go to towardsdatascience.com you can see there are many, many different articles published; this is just a screenshot that I've taken. What we want to do is extract the data related to the author, the recency and the post title. I also added a column for the time of extraction, so I know exactly when the script was run, now or in the future, if I look back.
            
            
            
So we need to use some web scraping packages within Python to extract the data from the website. As you can see, I have imported Beautiful Soup and Selenium; these are the two main packages necessary for the web scraping. NumPy and pandas are used for the transformation, and gspread together with gspread-dataframe is used to push the data to the Google Sheet.
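As a rough sketch, the setup described here could look something like this (package and function names as commonly used with these libraries; this is not the speaker's exact script):

```python
import time

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup                # parse the HTML of the loaded page
from selenium import webdriver               # drive a real browser to load dynamic content
import gspread                               # talk to the Google Sheets API
from gspread_dataframe import get_as_dataframe, set_with_dataframe
```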
            
            
            
I would say this talk will probably be more relevant to you if you are already familiar with Python, so you know exactly what the syntax means. I have also shared the instructions on how to use the Google API in this slide.
            
            
            
If you download the slides and click on the links here, they will direct you to the Google API GitHub page, which contains more documentation on how to get started. To use gspread and gspread-dataframe you need to install the two packages, so you run pip install gspread and also pip install gspread-dataframe.
            
            
            
I would say the instructions on that page are pretty detailed and pretty clear, so if you just follow through you should be able to get the API working. There are no charges for using the Google Sheets API. The authentication part is probably the most important part, because if it doesn't work you won't be able to do the engineering side of it effectively. What we are trying to do is automate the process, right? So if you can't authenticate with the API, you won't be able to automate the process, and the data won't get loaded in.
            
            
            
Next, what you see over here is also something that I have taken from the tutorial, the instructions on how to use the Google API. As you can see, there are some parameters you have to change. You can create a new Google Sheet first and then look at the ID of the sheet; it should be the last part of the URL. You copy that sheet ID and paste it here, replacing it with whatever is applicable in your case.
            
            
            
After you have set up the API, there should be a credential token file, a JSON file, which you have to download to your local machine. In order to get the API working, the credentials will be read from this JSON file. Currently we are using the Spreadsheets API, so we need to include it in our scopes. After we do the authentication, the gspread and gspread-dataframe packages will work.
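A minimal sketch of that authentication step, assuming a service-account key saved as credentials.json and the standard Sheets and Drive scopes (the file name and sheet ID here are placeholders):

```python
import gspread
from google.oauth2.service_account import Credentials

# Scopes for the Sheets API (the Drive scope is needed for opening sheets by name).
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]

# The credentials are read from the JSON key file downloaded from the Google console.
creds = Credentials.from_service_account_file("credentials.json", scopes=SCOPES)
gc = gspread.authorize(creds)

# The sheet ID is the last part of the spreadsheet's URL.
sh = gc.open_by_key("your-sheet-id-here")
```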
            
            
            
Over here, what I'm doing is: I've created the Google Sheet already, and I am reading in Sheet1. I didn't change the name of the sheet, so if you change the name of your sheet, you need to use your worksheet name instead of Sheet1. I then get whatever existing data there is. If there is no existing data, it's an empty spreadsheet, so there isn't anything. But because I am going to do the update on a daily basis, there is existing data already in the sheet, so I want to retrieve it first, and what I do next is append more data into the sheet.
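A sketch of that read step, assuming the worksheet is still called Sheet1 and reusing the gc client and sh spreadsheet from the authentication sketch above:

```python
from gspread_dataframe import get_as_dataframe

# Open the first worksheet (change the name if you renamed the default "Sheet1").
ws = sh.worksheet("Sheet1")

# Pull whatever rows are already in the sheet into a pandas DataFrame;
# on an empty sheet this simply comes back with no rows.
existing = get_as_dataframe(ws, evaluate_formulas=True).dropna(how="all")
```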
            
            
            
The next step, after I get the data from the sheet, is to scrape the data from the website. This is where web scraping comes in. I access the Towards Data Science site first, and you'll notice I actually have to make use of Selenium, because there are certain parts of the website that we have to trigger for it to load even more. So how do I scrape the titles off the website? I have to find out which HTML tag all these titles, or all this data related to the title or author, is tied to. When I click Inspect, it shows me the HTML behind it, and I see that it is tied to the a tag. Okay, so over here you can see there is the a class and so on.
            
            
            
What I mentioned just now is that there are certain parts of the website that require some interaction in order to load more. In this case, I have to click Show more to load more articles, because there are a lot of articles being published in one day. Since I'm running this script on a daily basis, I want to make sure that all the articles published in one day are shown to me. I think if you just load the page on its own, only about ten articles are shown without clicking Show more. So I have to make the script interact with the page and click Show more so that I can get more articles scraped.
            
            
            
So I actually set an arbitrary figure, 20, where I click on the Show more button 20 times. Every time I click Show more, roughly another ten articles load, and then I click Show more again to load the next ten articles. I do this about 20 times, but it's not done manually by me, it's done by the script; it's automated, so it's very easy. In between each click I have a lag of 2 seconds; you can see over here there's a time.sleep(2). I do this because I also want to make sure the page doesn't hang and doesn't lock me out. Sometimes there are checks behind the scenes for whether it's a bot, which, in this case, it is, but we want to try to overcome this and make it look more human. By building in this kind of time lapse, it mimics more human behaviour and ensures that the script is able to run through.
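A sketch of that interaction loop, assuming a Chrome driver and a button that can be located by its visible text (the exact selector on the live site may differ):

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://towardsdatascience.com")

# Click "Show more" an arbitrary 20 times, pausing 2 seconds between clicks
# so new articles have time to load and the behaviour looks less bot-like.
for _ in range(20):
    try:
        driver.find_element(By.XPATH, "//button[contains(text(), 'Show more')]").click()
    except Exception:
        break            # button not found / nothing more to load
    time.sleep(2)

html = driver.page_source   # hand the fully loaded page over to Beautiful Soup
```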
            
            
            
So next, what I do over here is actually a loop. I look for the a tags, extract them and get the text, and then over here you can see there is another loop that I run through, a second for loop. Let me just show you what the output is after I have scraped the text tied to the a tags.
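A sketch of that extraction step, reusing the html captured by the driver above (the texts list then feeds the position-checking loop described next):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# collect the text of every <a> tag; post data and blank entries come through in order
texts = [a.get_text() for a in soup.find_all("a")]
```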
            
            
            
This is what you will get after you run the script: the data returned on the author, the recency and the title. There is also some other, more random text that appears, and it's not very consistent. Some of this other text that you see being returned is tied to, for example, the caption of a photo, or the source of a photo or image that was added into the article.
            
            
            
What I realized is that there is a pattern: there are always three blanks in between each post, or rather three blanks in between the data for each post. So what we need is just the first three elements of each chunk that is separated by the three blanks. What I did is check whether the current element is the same as the next two elements. Okay, so over here you can see == i + 1, and also a check whether i + 1 == i + 2. So as I run through the loop, I check whether this element is the same as the next one, which should be a blank, and whether it's also the same as the one below that. If all three are the same, that means they are all blanks, and that gives me the element position I want. I store all these element positions, and then I do another loop to extract the text tied to each element position.
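A sketch of that blank-detection logic, assuming texts is the list of strings pulled from the a tags (the slicing details here are illustrative, not the exact script):

```python
# texts: strings extracted from the <a> tags, where each post's
# author / recency / title is followed by three blank entries.
boundaries = []
for i in range(len(texts) - 2):
    # three consecutive identical (blank) entries mark the end of a post's chunk
    if texts[i] == texts[i + 1] and texts[i + 1] == texts[i + 2]:
        boundaries.append(i)

posts = []
start = 0
for b in boundaries:
    chunk = texts[start:b]
    if len(chunk) >= 3:
        posts.append(chunk[:3])   # keep only author, recency, title
    start = b + 3                 # skip past the three blanks
```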
            
            
            
After that I convert it into a DataFrame, and I also add a new column with the current time, just as an indication of when this job was run. I then append the latest data that I have scraped to the existing data in the Google Sheet, and I check whether there are duplicates based on author and post title. Since I have arbitrarily chosen to click Show more 20 times, there could be cases where the data was already scraped the previous day, so I just want to make sure there are no overlaps, no duplicates. I drop the duplicates, again using pandas. I also fill missing values with a space if there are any, and I drop rows where the index is missing. In certain cases the first row after the sorting is empty, so I drop it, and after that I update my Google Sheet; that is what the last line of code, the worksheet update, is doing.
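A sketch of that load step, reusing posts from the parsing loop and existing and ws from the earlier sketches (column names are placeholders; the talk's script pushes the data with the worksheet's update call, while set_with_dataframe from gspread-dataframe achieves the same thing):

```python
from datetime import datetime

import pandas as pd
from gspread_dataframe import set_with_dataframe

# Turn the scraped chunks into a DataFrame and stamp the extraction time.
new = pd.DataFrame(posts, columns=["author", "recency", "post_title"])
new["time_extracted"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Append to what is already in the sheet, then drop anything scraped twice
# (e.g. articles already picked up on the previous day's run).
combined = pd.concat([existing, new], ignore_index=True)
combined = combined.drop_duplicates(subset=["author", "post_title"])
combined = combined.fillna(" ").reset_index(drop=True)

# Push the cleaned table back up to the Google Sheet.
set_with_dataframe(ws, combined)
```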
            
            
            
Okay, so this is the output based on the code just now, where I loop through the element positions and get the text tied to each position.
            
            
            
A small note: there are also a few cases where the title is not in the third line of the chunk that is scraped for a post. There could be certain cases where the data is not so clean, where it actually extracts the image caption in an earlier position and the title appears later. In those cases, those captions generally have fewer than 20 characters, so this is where I built in an if-else. If the number of characters is less than 20, that means the title should be in the seventh position; if not, it should be earlier. That is how I identify it.
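A sketch of that heuristic, assuming chunk is the full list of text entries scraped for one post, before it is trimmed down (the positions and the 20-character threshold follow the talk; the helper name is made up for illustration):

```python
def pick_title(chunk):
    # If the entry in the usual title position is suspiciously short (< 20 chars),
    # it is probably an image caption and the real title sits further down.
    candidate = chunk[2]                      # third entry is normally the title
    if len(candidate) < 20 and len(chunk) > 6:
        return chunk[6]                       # fall back to the seventh entry
    return candidate
```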
            
            
            
In order to automate the entire process, now that I have the Python script that does the web scraping and updates the Google Sheet, I want to make it run on a daily basis, so I have to schedule it. It's actually very easy: now that we have the Python script, you can just open the command prompt. In my case I have a virtual environment, so I activate my virtual environment first and then run my Python script. Or, if you want, you can also set up a cron job locally.
            
            
            
Okay, so for example, in my case I'm using Windows, so what I can do is set it up using the Windows Task Scheduler. Looking at the instructions over here, what you need to do is create a basic task; this is what you will see when you launch Administrative Tools and create a basic task. There are certain boxes which you have to fill in: you locate where your python.exe is and paste that path over here, then identify where your script is located and paste the location of your script in the arguments. The rest of the settings you can configure in the Task Scheduler are pretty straightforward, like when you want to run it, what time, et cetera. If you search for this Stack Overflow link, you should be able to see the entire set of instructions. So this is what we will get.
            
            
            
Okay. And then of course, I think there are still certain cases where, based on the logic that I've come up with, for example the check on the number of characters, there could still be outliers here and there. In those cases they might still need some manual intervention. But on the whole, I would say this is 90% or even 95% cleaned data.
            
            
            
From this Google Sheet, we can use Google Data Studio to connect to the sheet and create a nicer-looking interface for users to play around with and explore the data.
            
            
            
Okay, so over here you can see that what I have done is collect the data over a six-week period, between 21st November 2020 and 2nd January 2021. There are so many articles being published in a day, which to me is pretty crazy. So just give me a while to load the Data Studio page. This is what we have in the Google Sheet.
            
            
            
What's interesting is that there are multiple articles on pretty similar topics. From here, if you're interested, you can make use of the data to do some text mining and identify some of the more popular domains within data science at the moment, whether that relates to certain kinds of tools or algorithms; you can search for the keywords within these articles. As you can see, there are really quite a number of articles published over the six weeks, over 1,000; it's about 1,680 over the six-week period.
            
            
            
Okay, so now I'll be talking about the second use case, on creating a blended face based on multiple images. In this case we are also getting data, in this case images, off the Internet, and I am saving them into my local drive, my local folder. What I did after that was make use of an open-source library hosted on GitHub to create a blended face.
            
            
            
In this case, the process is a little bit different, in the sense that we do the extract and load first, before we do the transformation. There is actually no one single rule when it comes to data engineering; it doesn't necessarily have to be ETL, extract, transform, load. It can also be extract and load first, before doing the transformation. In some organizations they actually have data lakes: they have data from different places and they pull it all into one single location first, which they call the data lake, so there can be clean or unclean data altogether in this data lake, and then they do the transformation after that.
            
            
            
But based on my work experience so far, I typically do ETL most of the time, which is extract, transform, then load the data. Again, it depends on your organization, your structure and how you actually want to store the data.
            
            
            
So in this case, what I'm trying to do is create a blended, average face based on the candidates of the different political parties in Singapore. Last year, in 2020, there was a general election, and The Straits Times, which is a news outlet, designed a website to basically let people know who the candidates from the different political parties are and where they are contesting. They uploaded images of all these political candidates, and the images are publicly accessible. So what I did was scrape the images of all these political candidates off this website onto my local drive.
            
            
            
So again, I made use of the web scraping tools Beautiful Soup and Selenium. I also imported re, which is for regular expressions, because I need to extract the format of the image later, based on the URL. There is also the urllib request library that I use for accessing the website.
            
            
            
What you see over here is the same thing as in the previous use case. This page is also interactive, in the sense that I have to actually scroll down in order for the page to load entirely. So what we need to do when we use Python to access it is include more arguments in the options, so that when we use the webdriver to access the page, it will allow the page to load entirely. What I added over here are the headless argument and disable-gpu, to overcome this.
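A sketch of those webdriver options (the two flags are the ones mentioned in the talk; the URL variable is a placeholder, not the real address):

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")       # run without opening a browser window
options.add_argument("--disable-gpu")    # often needed alongside headless mode

candidates_page_url = "https://example.com/ge2020-candidates"  # placeholder URL
driver = webdriver.Chrome(options=options)
driver.get(candidates_page_url)
```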
            
            
            
Similarly, using Beautiful Soup, I look for the tag that all these images are tied to, which in this case is img, and then I run a loop to scrape the images. If you can see, there's a bit of trial and error here, where I need to identify the location of the first image that I'm interested in, because when I extract the information tied to the img tag, there is other information, other images that I am not interested in; for example, it can be the map. After a bit of trial, I identified that the first image that I want is in the very first position of the img tags, and then there are x number of images I need to scrape, so I identify that, okay, the last element is at position 369.
            
            
            
So I run the loop and basically filter out things that are not necessary. You can see there is also additional subtext, like alt attributes that are blank, and in this case I skip those, because they shouldn't be blank. Then I get the source URL of the picture, the image, and I scrape it, saving it with the name tied to the image. If you can see, I have also included this [-4:], which is to extract the format of the image, like JPG or PNG, so I save it accordingly into my local drive.
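A sketch of that download loop, reusing the page source from the headless driver above (the folder name is a placeholder, and the alt text is assumed to carry the candidate's name):

```python
import os
import urllib.request
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")
os.makedirs("faces", exist_ok=True)

for img in soup.find_all("img"):
    name = img.get("alt", "")
    if not name:                      # skip entries with a blank alt (not a candidate photo)
        continue
    src = img.get("src")
    ext = src[-4:]                    # e.g. ".jpg" or ".png", taken from the URL
    # save each candidate's photo locally under their name
    urllib.request.urlretrieve(src, os.path.join("faces", name + ext))

driver.quit()
```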
            
            
            
After all the images have been scraped successfully, I make use of this open-source library that was created by this person, who hosted it on GitHub, which I'm very thankful for because it makes things much easier. I think this is also the wonder of open source, right? You can do a lot of things much quicker and much easier based on what other people have developed or created.
            
            
            
If you go to this URL, you will see the scripts that were developed to create a blended image, an average face. What you need to do is just change the file path that points to the images you have saved locally, and then you run the script, you run the package. The instructions are actually all on GitHub for this library, so if you go through the instructions on how to download it and how to run it, you'll be able to get the output very quickly as well.
            
            
            
So this is what I got after running the script: for each political party with at least ten candidates, I created one face for each of them, which is the average face.
            
            
            
Okay, so just wrapping up: in terms of data engineering projects, I guess we first have to start off by thinking about what analysis you or your data science team want to do. From there, we think about what data is necessary, and then we write a Python script to automate the ETL or ELT process.
            
            
            
This will allow us to generate insights much more quickly based on the data that we have collected. Okay, so I actually maintain a data science blog called Data Double Confirm, and over here I cover a lot of other things related to various data science tasks. Web scraping is one big component related to data collection, because I think there is a lot of development in the algorithms realm and also in deep learning, but sometimes what we really need is the data, which is very important, which is why I'm still talking about data engineering. From there, with the data we have, we can perform data visualization or statistical analysis that can give us insights and help us make more informed decisions. So this is the URL of my data science blog, and feel free to connect with me via LinkedIn or Twitter. Thank you.