Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to the session on Data to Discovery: Unveiling Clustering in BERTopic Topic Modeling.
This session will be presented by me, Abhiram, along with my co-presenter, Jaspal. So, a little bit about ourselves.
I work as a cloud machine learning engineer at Collinson and hold a master's in data science from
King's College London. Previously, I was a research fellow at SAP Labs in Bangalore.
I have also published courses on LinkedIn Learning as an instructor on Rust programming, with over 40,000 participants.
I love to volunteer at DataKind Bangalore on nonprofit products and projects, and in my free time I play badminton
and listen to '80s rock music. Along with me is Jaspal,
who works as a lead data scientist with me at
Collinson. His expertise lies in the areas of
AI, Python, AWS and building data products
from scratch. He holds a CBA in Advanced Business Analytics from the Indian School of Business.
He loves to play football, and his favorite football club is Arsenal. Both of us are reachable on Twitter,
and our handles are mentioned below. So let's look at
today's agenda. First, we will present the problem statement and the topic modeling use case, following which we will
explain why BERTopic is well suited to solve this problem and walk through the end-to-end flow, or the different processes
involved in BERTopic. One such process is clustering, an integral part of the BERTopic technique, and here we will explain HDBSCAN.
Then we will go into a hands-on session where we look at Amazon Alexa reviews and how we can extract topics of interest
from that dataset published on Kaggle. We will close by discussing the future scope of topic modeling.
So let's get started. Let's imagine that you have
a product like the Alexa Echo Dot, which is basically a Bluetooth-enabled voice assistant. People buy it mostly online,
and also in offline stores, and leave reviews on websites like Amazon and on Twitter, so on both e-commerce and social media sites.
In this scenario, the first review, on the Amazon platform, is a positive review that talks about the usability of the Echo Dot
and how it's helping customers: kids are learning current affairs and general knowledge and also improving their English skills.
So this is definitely a pleasure point for Amazon. Whereas in the tweet below, a customer is complaining that the Echo Dot
is not working with his phone and that customer care is not picking up his calls, both of which are crucial pain points
that Amazon has to address.
It is humanly impossible to sift through hundreds of thousands of reviews like these and pick out pain points.
That is where topic modeling comes into play: it can process millions of reviews and intelligently pick out topics of interest
for an organization like Amazon to process and act on. So now that we've established a use case,
or the problem statement of why topic modeling is necessary, let's look at why BERTopic is one of the best-suited techniques
in this area right now. The first requirement of such topic modeling is that it has to work on unstructured data.
As you saw in the review samples before, the reviews are unstructured text, and they might contain emojis and hashtags,
all of which need to be processed. BERTopic is capable of doing that.
It is also capable of taking advantage of transformer models like BERT and other embedding models,
and efficiently converting the text into contextual embeddings, that is, a vector or numeric format.
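As a rough illustration of that embedding step, here is a minimal sketch assuming the sentence-transformers package; the model name is just an example, not necessarily the one used in the talk:

```python
from sentence_transformers import SentenceTransformer

# Example pretrained sentence embedding model (an assumption, not the exact model from the talk).
model = SentenceTransformer("all-MiniLM-L6-v2")

reviews = [
    "Love my Echo Dot, the kids use it for general knowledge questions.",
    "The Echo Dot won't connect to my phone and customer care never picks up.",
]

# Each review becomes a fixed-length numeric vector (384 dimensions for this model).
embeddings = model.encode(reviews)
print(embeddings.shape)  # (2, 384)
```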
BERTopic also offers modularity. For example, the image lists the different processes that happen as part of the BERTopic technique:
embeddings is one such process, dimensionality reduction is another, followed by clustering. All three of these processes can easily
be interchanged with the most advanced or state-of-the-art algorithm out there. For example, take UMAP for dimensionality reduction:
if you're not satisfied with UMAP and you want to try out something like PCA, which is principal component analysis, that can be used as well.
And for now, the state of the art for clustering seems to be HDBSCAN, but a year from now, or even two months from now,
a far more advanced clustering algorithm might come up, and you just have to plug and play, replacing HDBSCAN with the latest algorithm,
and the whole BERTopic technique will work seamlessly as expected. That is one of the crucial aspects of this technique:
new advancements in clustering can be adopted very easily, as the sketch below shows.
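To make that modularity concrete, here is a minimal sketch assuming the bertopic, sentence-transformers, scikit-learn, and hdbscan packages; the specific model names and parameter values are illustrative, not the ones from the talk:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from hdbscan import HDBSCAN

# Each component is pluggable: swap the embedding model, the dimensionality
# reduction step (UMAP by default, PCA here), or the clustering algorithm.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")   # example model
dim_model = PCA(n_components=5)                             # instead of the default UMAP
cluster_model = HDBSCAN(min_cluster_size=15, prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=dim_model,         # any fit/transform-style reducer can go here
    hdbscan_model=cluster_model,  # any clusterer with a compatible API can go here
)
```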
BERTopic also uses an improved version of TF-IDF, called c-TF-IDF, as the weighting scheme for extracting topic representations,
and that is another motivating factor for using BERTopic. c-TF-IDF works quite well at extracting topic representations
from clusters of documents without relying on centroid-based extraction, which, as you may be aware, has its own share of problems.
A simplified sketch of the weighting idea is shown below.
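The exact implementation lives inside BERTopic, but a simplified toy illustration of the class-based weighting idea (my own sketch, not the library code) looks something like this:

```python
import numpy as np

# Toy term-frequency matrix: rows are clusters of documents, columns are words.
tf = np.array([
    [5.0, 0.0, 2.0],   # word counts for cluster 0
    [1.0, 4.0, 0.0],   # word counts for cluster 1
])

# Simplified c-TF-IDF idea: weight a word highly if it is frequent in a cluster
# but rare across all clusters (avg_words is the average word count per cluster).
avg_words = tf.sum(axis=1).mean()
word_freq_across_clusters = tf.sum(axis=0)
ctfidf = tf * np.log(1 + avg_words / word_freq_across_clusters)
print(ctfidf)
```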
Moving to the end-to-end flow: what are the different parts of this whole BERTopic picture? We pick up untokenized reviews
and convert them into a numerical format in the vectorization process, using sentence transformers, a count vectorizer,
or a similar technique. Then we do dimensionality reduction, which reduces these high-dimensional vectors into smaller dimensions
so that they can be processed by the clustering algorithms. We will look at clustering and HDBSCAN in detail, and Jaspal will present
that in a short while. The last part is the importance of words, or the topic representation with class-based term frequency-inverse
document frequency, which is also taken care of. Finally, we get the required topics.
So, a quick preview on clustering here. You may have heard of a famous clustering algorithm called k-means.
It lets you select how many clusters you would like and forces every single point into a cluster, so there are no outliers at all.
But in real-life situations that is not realistic. K-means also assumes that all the data points form a neat Gaussian sphere or circle,
which is also not the case; the shapes of these clusters may be very different in the real world. That is where HDBSCAN is quite popular
and quite effective. There are other algorithms, like agglomerative clustering, which can take care of these things as well,
but it depends on how you plug and play these algorithms and see what works best for your use case.
The sketch below contrasts the two behaviors.
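A minimal sketch of that contrast, assuming scikit-learn and the hdbscan package, on a toy two-moons dataset chosen purely for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import hdbscan

# Two crescent-shaped clusters: not Gaussian spheres at all.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# k-means forces every point into one of k roughly spherical clusters, no outliers.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# HDBSCAN finds arbitrarily shaped dense regions and labels outliers as -1.
hdbscan_labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)
print(set(kmeans_labels), set(hdbscan_labels))
```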
So let's jump into the hands-on part, where we have the dataset description. Here we pick up the Amazon Alexa reviews dataset,
which is available on Kaggle. This dataset has five columns, and for us, verified_reviews contains the textual reviews;
that is the column we will be working with today. So without further ado, let's jump to the Jupyter notebook on Google Colab.
All right, so this is the Kaggle website where the dataset is available, and this is how I have retrieved the data.
Don't worry, this notebook is available on GitHub for your perusal later, after the discussion. First I get the data,
and this is how it looks after I load it into a pandas DataFrame. We are focused on this verified_reviews column.
column. And then we install build topic and again check
our reviews data frame. And this is where the magic
happens. So first we instantiate a vectorizing model,
vectorizer model, which basically allows us to define certain
parameters on what are the ngrams
and certain configurations we can declare over here. And this
is where we are declaring the build topic model
with what is the language and certain config options
that are available in the documentation. And yeah,
this is where on line 19 we do the fit function fit transform
gets us basically the topics and the probabilities.
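The notebook itself is on GitHub, but the core of what was just described looks roughly like this sketch; the file name and column name follow the Kaggle dataset, and the n-gram and stop-word settings are just an example:

```python
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# The Kaggle Amazon Alexa reviews file is tab-separated; the path here is illustrative.
df = pd.read_csv("amazon_alexa.tsv", sep="\t")
docs = df["verified_reviews"].astype(str).tolist()

# Vectorizer configuration: e.g. unigrams and bigrams, English stop words removed.
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

topic_model = BERTopic(language="english", vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # most frequent topics and their keywords
```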
Moving ahead, this is what our topics look like. The most popular topic, with 284 occurrences, is Alexa and love,
and then there are echo dot, music, and smart hub. All of these are topics that have been generated organically from our dataset.
Then over here we have the BERTopic intertopic distance map. All of these circles are representations of topics.
For example, this circle refers to topic 30, where the keywords are hub, hue, and plus; this one here is topic 15,
which talks about Hulu and streaming; and the one next to it, topic 10, talks about stick and fire stick.
We can also zoom in to see why these topics are so close to each other. As we can see, there are two different topics here:
the first one is about the Fire Stick and the second one is about streaming, which are similar topics of interest.
To wrap this up, the distance between these circles indicates how close or far, how similar or dissimilar, one topic is to another.
There is also a hierarchical representation
of all the topics that have been generated. Here, as you can see, easy setup is one of the topics that got picked up,
and all of these related topics are listed together to form the bigger hierarchy. Moving ahead, we have the final visualization,
the topic word scores, where we display the most frequently occurring topics and, within those topics, the different keywords
and the frequency of their occurrence. Over here
you can see Alexa is, without doubt, the most popular topic. One of the interesting things that comes out of this is the gift topic,
topic four, which talks about love, gift, and bought, suggesting that this product is being purchased by customers as a gift
for their friends and family; that is also captured in topic number four. So that's about it for the hands-on with BERTopic.
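For reference, the interactive views shown above come from BERTopic's built-in visualization helpers, roughly like this sketch, reusing the topic_model fitted earlier:

```python
# Intertopic distance map: circles are topics, distance reflects similarity.
topic_model.visualize_topics()

# Hierarchical view of how topics merge into bigger groups.
topic_model.visualize_hierarchy()

# Topic word scores: top keywords and their weights per topic.
topic_model.visualize_barchart()
```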
These code snippets and notebooks are available on GitHub for your reference. I will now pass it on to Jaspal
for a detailed overview of HDBSCAN. Over to you, Jaspal.
Thank you, Abhiram, for your explanation of BERTopic. Now get ready with your swimsuits, everyone,
because we are going to take a deep dive into HDBSCAN.
So what exactly is HDBSCAN? If you look at the full form, it stands for hierarchical density-based spatial clustering
of applications with noise. Quite a mouthful, right? Basically, HDBSCAN is a long acronym, and in addition to that,
it's a clustering algorithm; to be exact, a density-based clustering algorithm. To understand HDBSCAN,
we first need to know what DBSCAN is. On the right-hand side of your screen,
you can see data points scattered around the graph. You also see small circles encircling the data points.
That's what DBSCAN does: it creates a small circle around every data point, and if there is another data point within that circle,
it traverses to it, and it keeps traversing as long as it keeps finding a next data point. What this does is help us find
high-density regions inside a dataset, like you see on the graph here, where there are a lot of data points close to each other:
there are a lot of circles, and that is the result of the DBSCAN algorithm. It's a good algorithm, and it is quite robust, flexible,
and fairly outlier resistant as well. We don't need to define a predefined number of clusters, which is a really good advantage.
But it has got a couple of problems. First, you need to define the radius of the circle. That is okay if you have already seen
the dataset, but what if you have not reduced the dimensionality and have not seen the data? How do you know what the radius
of the circle should be? So that's one of the problems. The other is that it's a bit slow.
A rough sketch of DBSCAN in use is shown below.
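A minimal sketch of DBSCAN with scikit-learn, showing the fixed radius (eps) that has to be chosen up front; the values and toy data here are purely illustrative:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# eps is the fixed circle radius; min_samples is how many neighbours
# must fall inside that circle for a point to sit in a dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # points labelled -1 are treated as noise
```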
So what if there were a mechanism where we don't need to define the radius of the circle, no fixed radius,
but at the same time we could still identify dense regions? This would also help us be faster, and that's what HDBSCAN does.
Here we take help from an old friend, the k-nearest-neighbors (KNN) algorithm. KNN basically identifies a point's nearest neighbors.
So we define a k, and using KNN, we draw a circle around each data point such that the k-th nearest data point lies on its boundary;
the distance from the center, which is the actual data point, to that last point becomes the radius.
That radius is the core distance. When we define this minimum number of points, we find that in some places the circle is huge;
like you see on the right-hand side, the green circle is really, really large, while on the left-hand side the blue circle is small
because the data points are closer to each other. Now, the distance between two points is defined by another measure,
called the mutual reachability distance. To explain this, we need to imagine a scenario. Imagine yourself with a friend inside a park
with a lot of other people. We know that some human beings are short and some are tall, and their arm lengths differ as well.
Now imagine the arm length being the core distance of a human being. Another human being within that arm's length is within your
mutual reachability distance. That's the concept, basically: the points within your radius are where dense regions form,
and if they are outside your radius, they are away from you. This mutual reachability distance helps us identify the dense regions;
a small sketch of both distances follows.
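A small sketch of those two distances, assuming scikit-learn and SciPy; the toy data and the k value are just examples:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import cdist

X = np.random.RandomState(0).rand(10, 2)
k = 3

# Core distance: distance from each point to its k-th neighbour
# (the point itself counts as the first neighbour in this query).
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
core_dist = nbrs.kneighbors(X)[0][:, -1]

# Mutual reachability distance between a and b:
# max(core_dist[a], core_dist[b], distance(a, b))
dist = cdist(X, X)
mutual_reachability = np.maximum(dist, np.maximum.outer(core_dist, core_dist))
print(mutual_reachability.shape)  # (10, 10)
```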
Once we have this mutual reachability distance, we know which items are closer to each other and by what measure.
In the park, you have your arm's length to identify who is close to you and who is not. Now let's take that park example
to a different place. Imagine a room full of objects scattered around. You have a string and you want to tie these objects
to each other using the shortest possible paths. How would you do that? You would try to do it by calculating the distance
between these objects, and now you have the mutual reachability distance defined between each pair of objects.
each object and each cluster. So there
is an algorithm called minimum spanning tree
algorithm, which finds the shortest possible
path for you. And that minimum spanning tree algorithm
is really useful to identify the density and the hierarchy of the
cluster. Now, you might ask the density. Okay, you have the mutual
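A tiny sketch of building such a tree with SciPy; here it runs over plain pairwise distances on toy data, whereas in HDBSCAN these would be the mutual reachability distances:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

X = np.random.RandomState(0).rand(8, 2)
dist = cdist(X, X)  # pairwise distances between all points

# Keep the cheapest set of edges that still connects every point.
mst = minimum_spanning_tree(dist)
print(mst.toarray())  # nonzero entries are the edges kept in the tree
```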
Now, you might ask: the density comes from the mutual reachability distances, but how do we get the hierarchy?
For that, I need you to imagine another scene.
Imagine that you have tied all the objects together using the minimum spanning tree algorithm. What you do next is pick up
these objects, and you see a hierarchy is formed, something like this: the objects that are closer to each other come together
and form a cluster, and the objects that are farther apart lie lower in the hierarchy. That's how the clustering hierarchy is built.
Now, let's say
you got the hierarchy, but you are still not really happy with the number of clusters, because there are a lot of levels
in that hierarchy. So how do you decide what the optimum size of the clusters should be? For that, you basically need to define
how many objects you need in one cluster, and once you define that minimum cluster size, you get a pruned, or condensed, tree,
something like this. This tree is the hierarchy of the objects where you have removed the branches whose clusters are not big enough,
plotted against a lambda value.
Now you'll ask, what is this lambda value? Once you have pruned those objects, you want to capture the clusters that are
the most meaningful or reliable, and the way you do that is by using the stability score. The stability score is calculated
for each cluster in the hierarchy based on two factors: one is the density of the points, and the other is the persistence,
basically how long the cluster lasts in the hierarchy. Here, the y axis is the lambda value against which stability is measured,
and we look for the clusters with the highest stability score. As you can see, the first two clearly have the highest stability scores,
and the width and color of a cluster give you the number of points in it. The third cluster we choose is the green one
on the right-hand side, which is quite stable and also has a lot of points in it.
These are the clusters that get chosen based on the stability score. Once we have chosen these clusters, this is what the final
clustering looks like. There are a few points which you can see are grayish in color; it might be a little difficult to see,
but they are there, somewhere near the greens, on the edges of the greens. Those gray points are the outliers.
But we are able to find highly well-defined clusters here. Now, just a recap of what we did.
We transformed the space according to density and identified the highest-density regions. Then we built a minimum spanning tree,
constructed a cluster hierarchy, condensed that hierarchy based on the minimum cluster size, and finally extracted the stable
clusters from the condensed tree. That's what HDBSCAN does; put together in code, it looks roughly like the sketch below.
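A sketch of those steps using the hdbscan package; min_cluster_size and the toy data are illustrative, and the condensed tree plot needs matplotlib installed:

```python
import hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# min_cluster_size is the main knob: the smallest group of points
# we are willing to call a cluster.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)
print(set(labels))  # -1 marks outliers; the rest are the stable clusters

# Inspect the condensed tree and the stability-based cluster selection.
clusterer.condensed_tree_.plot(select_clusters=True)
```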
Now let's see how HDBSCAN performs. As you can see in this graph, HDBSCAN performs better than DBSCAN: the blue line is DBSCAN
and the green line is HDBSCAN. On the x axis you see the number of data points, and on the y axis, the time taken to perform
the clustering. As you see, fastcluster does not scale that well to larger numbers of data points, DBSCAN is better than that,
and HDBSCAN is even better. And then the last two lines are k-means. K-means is the fastest, no doubt about it,
but it's not really that great at giving you better clusters.
Now let's look at the strengths and weaknesses of HDBSCAN. HDBSCAN focuses on high-density clustering, and as a result it reduces
the noise problem; the minimum cluster size parameter can be set, and it's relatively fast compared to others, as we saw in the
performance comparison. It does have problems handling very large amounts of data, but you can use cuML HDBSCAN, which uses GPU
acceleration, and I've given the link in the description as well, so you can click on it and look at it in your free time;
a rough sketch of the GPU variant follows.
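A rough sketch of how the GPU variant might be invoked, assuming RAPIDS cuML is installed and a GPU is available; this is illustrative, not taken from the talk's materials:

```python
import numpy as np
from cuml.cluster import HDBSCAN as cuHDBSCAN  # GPU-accelerated HDBSCAN from cuML

X = np.random.rand(100_000, 10).astype(np.float32)  # toy data, illustrative size

labels = cuHDBSCAN(min_cluster_size=50).fit_predict(X)
```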
Now, knitting it back to what we discussed previously with BERTopic: we know that every component of BERTopic is modular,
scalable, and flexible. If we were not using HDBSCAN, we might have used some other algorithm, and because clustering is modular,
that gives you a really good advantage. One assumption here is that every document contains one topic, so when we analyze
each document, we get only one topic for it.
Also, ChatGPT might impact topic modeling; it does the job, but there are some biases, ethical issues, and so on.
Topic modeling is just topic modeling: you identify which topics are important and you get them. So it's really useful for that,
and it's easier to operationalize as well. There are a lot of cloud vendors where you can run BERTopic,
and you can do it with a really simple architecture, using something like Lambda functions, and it's really quick.
I have given some of the references and resources as well. So that's about it. Thank you so much for joining my session;
I hope you enjoyed it and took away some material that might be useful for your further research in the future.
Thank you so much for joining, and see you later.