Conf42 Machine Learning 2024 - Online

Topic modelling for text documents using NLP techniques


I am a data engineering professional and have helped companies like AWS manage streaming data at scale and support analytics over it. My solutions have helped organisations convert batch processing pipelines into real-time pipelines, reducing data freshness from 75+ hours to less than 40 minutes.


  • The topic of today's talk is topic modelling for text documents using NLP techniques. Innovate UK is a government-backed fund which helps organisations by providing funding to execute their projects. Akshay Jain shares his journey of how the team is trying to solve some of its text analysis use cases.
  • There are four kinds of use case to solve right now. The first is to identify the entity names within the documents. The second is to clean the application data and identify similarity between documents. The third is to understand which industry sectors applications belong to, and the fourth is to support an ecosystem view of where funding and applications are concentrated. Some of the use cases are still a work in progress.
  • Two kinds of methodology are used to identify similar documents: the built-in similarity functionality of an NLP library, and custom pipelines that pre-process the text and apply TF-IDF and vectorisation techniques. The resulting similarity score can then be used to flag applications that cross a chosen threshold.


This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Akshay Jain, a data engineering manager at Innovate UK. Innovate UK is a government-backed fund which helps organisations by providing funding to execute their projects. At present I work on data migration projects, where my primary responsibilities include executing data migrations and data integrations, and implementing some NLP techniques to help resolve machine learning related use cases. The topic of today's talk is topic modelling for text documents using NLP techniques. In this seminar I'm going to share my journey of how we are trying to solve some of the use cases we have around text document analysis, and how we are solving them using NLP techniques. Let me walk you through the use cases and the challenges we have; along the way I'll also show you some of the solutions we are implementing to meet those challenges.

On the use cases side, we primarily have four kinds of use case to solve right now. The first is how we can identify the entity names within the documents. What generally happens is that, because we operate like a venture funding organisation, lots of people submit applications with the intention of raising funds, so a lot of documentation comes in where people describe things like the description and purpose of their work. All of that arrives in the form of large documents, and what we try to identify is which different entities are involved, just to make sure we are not putting government money towards any sanctioned companies or sanctioned persons. The second use case is that, across the applications we receive, we want to identify whether documents have a certain similarity or not. We have two purposes for this: one is to identify the different segments or industry sectors from which we are getting applications, and the other is to spot cases where people submit the same application multiple times with only wording changes, so we don't spend effort assessing the same application again and again. So the purpose of the second use case is to clean the application data and identify the similarity between documents. The third use case is to understand which sectors people are submitting applications in, how those applications relate to each other from an industry sector perspective, and where the funding is going, just to understand market conditions. And the fourth use case is to support a kind of ecosystem view, where we can say in which particular subcategory under the industry codes more money is being funded, or more applications are coming into the market. From all those aspects we try to analyse all the applications we receive, and for that purpose we are building a system which can help us resolve all of these problems and give us a concrete solution.
So let me walk you through the journey of how we have solved some of these use cases; some of them are still a work in progress. The first use case is to identify the entities in the documents. We generally get a lot of textual information in the documents: what the purpose of the fundraising is, how it is going to help them, what kind of work they are building, what kind of partnerships they have, and which different people and entities are involved. What we try to identify from those documents is which entities are involved, in terms of which country the funding is requested from, who the people asking for funding are, and other details, to ensure we are working on the applications as per government guidelines and that no compliance issues arise.

For that, one of the things we have done with the documentation and text we receive is the following. Here is an example text which I took at random from the internet, an extract from a news article about Andy Murray and how things are going on the tennis side. On this extract, if I want to identify which entities are mentioned, in terms of people, countries, dates and other factors, I can use some of the available NLP libraries. One library that supports identifying all these details with a very minimal amount of coding is the spaCy library. With spaCy, we simply load whichever model we would like to use; spaCy provides multiple models, and in this example I'm showing the usage of the en_core_web_lg model. You can use any other language model that works for you; there are models built specifically for analysing news articles, web text and so on, so you can choose whichever model works best and load it into spaCy. Once the model is loaded, you can very easily analyse the textual information with the library, and it will tell you which entities and entity types are present in the document, based on the supported entity types. If I run this code on the article I showed earlier, it produces an output which looks something like this: it shows which different persons are involved and which geopolitical locations have been identified; in this article it identified Dubai as a location, and it was even able to identify events, for example the Qatar Open.
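As a rough illustration of that step, here is a minimal sketch of the spaCy entity-extraction flow described above. The model name is the one mentioned in the talk; the sample sentence is an illustrative assumption, not the article used in the presentation.

    # Minimal sketch of named-entity extraction with spaCy.
    # Assumes the model has been downloaded first:
    #   python -m spacy download en_core_web_lg
    import spacy

    nlp = spacy.load("en_core_web_lg")

    # Illustrative text only, not the article from the talk.
    text = "Andy Murray won two matches at the Qatar Open before flying to Dubai on Sunday."

    doc = nlp(text)

    # Each entity carries the matched text plus a label such as PERSON, GPE,
    # EVENT, DATE or CARDINAL, which can be stored and queried later for
    # compliance-style checks.
    for ent in doc.ents:
        print(ent.text, ent.label_)

The exact labels returned depend on the chosen model, so in practice the output would be checked against the entity types the model supports.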
The output also includes details about the various dates, such as whether something is a particular day or another date-related expression, and it is able to identify cardinals, the numerical information in the text, like two or six or any other numbers that appear. So that kind of information is something we can identify very easily, and once identified, it can be stored for querying purposes, just to check whether a particular kind of entity, person or organisation name is involved in a particular application or not. spaCy has built-in support for all these different entity types, such as person, organisation, geopolitical location, product, law, date, time, et cetera. You can categorise whatever textual information you have into these categories, and based on that you can store the data and use it further for querying purposes, to perform compliance-related checks and things like that. This is the way we have started at Innovate UK, and we are progressing further with using the information in this manner.

Now, the next use case we have is to identify similar documents. As I mentioned, what happens in this scenario is that an application gets submitted and generally goes through a cycle where it is reviewed by multiple subject matter experts, depending on which field the applicants are looking for funding in, and based on that a decision is generally taken on whether to give funding or not. What we commonly see is that people submit applications, and if an application gets rejected, they just make wording changes here and there: they change some words, move a paragraph around, add some additional details, and then resubmit the application. So what we try to identify is how similar two applications are to each other, and if applications are similar, or have been submitted across multiple categories, we want to identify them so we can arrange and manage them properly. In order to identify this kind of textual similarity, we use two kinds of methodology. The first is to use the built-in functionality of a specific library, where we again use some kind of language model: we provide a particular document as an input, the library processes the document and applies its built-in algorithm, which is primarily along the lines of TF-IDF style scoring and vectorisation, and as an output it tells us how similar two documents, or the raw information in those text documents, are to each other. If that match crosses a certain threshold, we flag that the two applications may be similar across categories.
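As a sketch of what that built-in check can look like, the snippet below uses spaCy's Doc.similarity, which compares two documents via the model's word vectors; the 0.9 threshold is purely an assumed value for illustration, not the one used at Innovate UK.

    # Minimal sketch of a built-in document-similarity check with spaCy.
    import spacy

    nlp = spacy.load("en_core_web_lg")  # model that ships with word vectors

    doc1 = nlp("I like salty fries and hamburgers.")
    doc2 = nlp("Fast food tastes very good.")

    score = doc1.similarity(doc2)  # cosine similarity of the averaged vectors
    print(f"similarity = {score:.3f}")

    SIMILARITY_THRESHOLD = 0.9  # assumption; would be tuned on real application data
    if score > SIMILARITY_THRESHOLD:
        print("Flag: possible duplicate or cross-category resubmission.")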
Alternatively, if an application is submitted in the future which has a high similarity to a previously rejected application, we process it in a different way and dive deeper into whether it is the same application resubmitted from the same source with some changes, or a new application altogether. Those are the kinds of use case we can solve with this functionality, and because it comes built into the library, you can solve this problem and arrive at a solution with a very minimal amount of code. In the example here, the first sentence is "I like salty fries and hamburgers" and the second one is "Fast food tastes very good." The library takes those words, applies its tokenisation and lemmatisation related techniques to bring the words into a comparable form, and then calculates a similarity score. The scores it generates can then be used further to flag documents at a particular threshold.

Generally, when we apply this kind of process with a library like spaCy, it is fine to apply it directly on the raw data to a certain extent, because the library uses its own internal algorithms to perform tokenisation and lemmatisation before applying the respective algorithms, so things work out smoothly. But for more advanced use cases, for example when we implement our own logic to compare text documents using cosine-similarity-based algorithms and vectorisation techniques, the first thing we generally do is pre-process the text. We apply some data cleansing: making sure everything is lowercase, stripping the punctuation and the commonly used stop words, and applying tokenisation and lemmatisation. What lemmatisation does is bring the words within a sentence down to their root form, so that when we compare words they are on a level playing field, and any algorithm we apply afterwards, such as TF-IDF or some kind of n-gram technique, becomes much more effective at calculating the scores we use further on. NLTK is a general-purpose library that provides a lot of functionality to implement these steps with a very minimal amount of coding.
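A rough sketch of that pre-processing step with NLTK might look like the following; the example sentence is a placeholder, and it assumes the usual NLTK data packages (punkt, stopwords, wordnet) have already been downloaded.

    # Minimal pre-processing sketch: lower-case, strip punctuation, remove
    # stop words, tokenise and lemmatise, using NLTK.
    import string

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    def preprocess(text: str) -> list[str]:
        # Lower-case and drop punctuation before tokenising.
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        tokens = word_tokenize(cleaned)
        # Remove stop words and reduce each remaining word to its root form.
        return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

    print(preprocess("The applications were resubmitted with minor wording changes."))

The cleaned tokens would then feed whatever vectorisation and scoring step comes next.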
Now, once we have cleaned the data, we move to the next use case. Here, what we try to identify at the next level is how the applications, details and textual information we receive are segregated across the different industry sectors, so that we understand from which particular sectors more funding requests are coming, and what kind of growth we are seeing. That way, (a) we can understand the industry trend, and (b) we can manage our capacity to assess those applications. To do that, we are building techniques using clustering algorithms, which help us identify which clusters the textual information we receive belongs to. To take this forward, as I mentioned earlier, we first clean the data, then we apply some kind of vectorisation on it, and after the vectorisation we run clustering algorithms to see what kind of clustering works best for us. We eventually started with k-means clustering; this is one of the outputs on some sample data, where you can see very clearly that with k-means clustering we are able to segment the data and tell which clusters those applications belong to. That segmentation can then be used further and helps us understand the applications properly.

The other reason we need this is that there is a chance a company is working in sector A, but when they submit their application, that application may actually belong to sector B. Those situations are quite common, because companies generally innovate either in their own field or in other fields as well, so we try to capture that, and we also try to understand what kind of overlap we are seeing between industries in terms of innovation and the kind of work they are doing. This kind of clustering technique helps us identify those things and provide answers on that side. So k-means is one of the clustering techniques we have used, and it has worked very well for segregating our data. The other thing we have tried is fuzzy c-means clustering, which generated a comparatively better output for our datasets, because of the characteristics of the data and how the words in the textual information are connected to each other in the vector space. Based on those inputs we were able to generate clusters and understand the data in a more appropriate manner. With this information we are now able to categorise the data with some confidence, say which industry categories these applications belong to, and use that further downstream. All of this is implemented using Python libraries, and on top of AWS we can build the pipelines using SageMaker notebooks or something similar, so the whole pipeline can be operated in that manner.
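As an illustration of that vectorise-then-cluster step, here is a minimal sketch using scikit-learn; the application summaries and the number of clusters are hypothetical placeholders, and the real pipeline would run on the pre-processed text.

    # Minimal sketch: TF-IDF vectorisation followed by k-means clustering.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "battery storage for renewable energy grids",
        "low cost solar panels for residential energy",
        "machine learning for medical image diagnostics",
        "ai driven drug discovery platform",
    ]  # hypothetical cleaned application summaries

    # Turn the cleaned text into TF-IDF vectors.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)

    # Group the applications into clusters; k would be tuned on real data.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)

    for doc, label in zip(documents, labels):
        print(label, doc)

A fuzzy c-means variant (available, for example, in the scikit-fuzzy package) follows the same shape but yields per-cluster membership degrees rather than hard labels.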
The other thing we want to check is that whatever ends up in a cluster really is similar, and how it helps us understand the data. In order to check that, after clustering the applications we also generate bigrams and trigrams on top of each cluster, to identify the frequently used keywords present in those clusters. Using the libraries and techniques around this, we can identify the common keywords we are seeing, and there is a certain set of keywords we take out of it just to understand which are the most heavily used keywords in these applications. We implement this using bigram and trigram techniques, and around that we can also get some numbers, like the figures you can see here for some dummy data, showing how those particular n-grams are being used, how many occurrences we are seeing, and generally the term frequency and inverse document frequency style measures. We then look at how those terms are used across applications to understand application similarity, dependency and the sectors' influence on each other, which helps us gather that information and process it further.

Once we have this information, the next level is something that is still in progress; we are still working on it. What we are trying to identify is this: we have identified which clusters we have, and which topics are present in those clusters, but the missing part is what kind of hierarchy exists between those topics or clusters. For that purpose we are experimenting with some algorithms and techniques; one of the approaches we have used so far is agglomerative clustering. With this technique we are trying to understand how these topics relate to each other and whether we can build some kind of graph around them, where at the root node we might see the cluster as a category, at a sub-level the different hierarchies, and at the leaf level the different tags or topics identified within that cluster. This is still in progress, and once we crack it, I will be happy to share in a future machine learning talk what worked out and what did not. Right now, the results we are seeing with agglomerative clustering are not that satisfactory, and we are trying to improve them with other approaches. So that's how we are trying to solve our machine learning use cases using topic modelling.
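For context on what that experiment can look like, here is a minimal sketch of agglomerative (hierarchical) clustering over topic keywords using SciPy's linkage and dendrogram utilities; the topic labels are hypothetical placeholders, and this is an assumed setup, not the exact one used at Innovate UK.

    # Minimal sketch: build a topic hierarchy with agglomerative (hierarchical)
    # clustering over TF-IDF vectors of topic labels.
    from scipy.cluster.hierarchy import dendrogram, linkage
    from sklearn.feature_extraction.text import TfidfVectorizer

    topics = [
        "solar panel manufacturing",
        "wind turbine blade design",
        "grid scale battery storage",
        "medical imaging diagnostics",
        "ai driven drug discovery",
    ]  # hypothetical topics extracted from the clusters

    X = TfidfVectorizer().fit_transform(topics).toarray()

    # The linkage matrix records how topics merge step by step into a tree:
    # cluster -> sub-groups -> individual topics.
    Z = linkage(X, method="ward")
    print(Z)

    # Build the tree structure without plotting; with matplotlib available,
    # dendrogram(Z, labels=topics) would draw it instead.
    tree = dendrogram(Z, labels=topics, no_plot=True)
    print(tree["ivl"])  # leaf labels in dendrogram order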
In order to build all of this, we generally use open source Python libraries, and while implementing these clustering and bigram/trigram techniques we also got exposure to other libraries such as KeyBERT. Those libraries generate a reasonable set of results, but one thing we found is that, for our particular domain, rather than going for KeyBERT or other off-the-shelf keyword extraction libraries, we got more positive results using basic Python techniques around n-grams and TF-IDF style frequency measures. So that's all for this presentation. My goal was to give you some detail about the kinds of use cases that exist and the kinds of approaches that generally work out in industry. If you want to learn more about it or just want to be in touch, please see my contact details: feel free to connect with me on LinkedIn or by email to discuss any of the challenges you would like to talk about around topic modelling and analytics on this kind of data, and I would be happy to connect and share more details. Thank you for your time, and thank you for listening to this session.

Akshay Jain

Data Engineering Manager @ Innovate

Akshay Jain's LinkedIn account
