Conf42 Machine Learning 2021 - Online

Natural language modelling with Amazon SageMaker BlazingText algorithm


Abstract

In this session, we showcase the power of machine learning by taking a large corpus of foreign-language text (we use the entire Wikipedia in that language) and automatically learning word embeddings for it.

This is typically the first key step in building natural language processing (NLP) solutions such as text classification or topic modelling. You will see how easy it is to apply the BlazingText algorithm built into Amazon SageMaker to process the entire contents of Wikipedia in this language and visualize the results. You can then apply these learnings to any language of your choice.

Summary

  • Session on natural language modelling with Amazon SageMaker and the BlazingText algorithm. We will use Tamil, one of the world's longest-surviving classical languages. To make this magic happen, we need to introduce word embeddings with the word2vec algorithm.
  • Many machine learning techniques require numeric rather than text input. Words have different lengths, and even written representations differ dramatically from language to language. Randomly naming words with numerical labels is not a great way of solving the problem; the typical approach is to use one-hot encoding.
  • The word2vec algorithm tries to solve the problem of finding the words that are in a close relationship to an input word. The end goal is to obtain word embeddings. How can we visualize this high-dimensional result on a two-dimensional picture?
  • Amazon SageMaker is a platform that was built from the ground up for developers. The primary ambition was to make machine learning development easy. It packs a lot of services that make the development lifecycle much easier.
  • SageMaker allows you to build, train, and deploy ML models at scale. The built-in algorithms take advantage of the distributed computing infrastructure we have in the cloud. You can kick off training with just one line of Python code.
  • The BlazingText algorithm was published back in 2017 by a couple of Amazonians. It provides highly optimized implementations of word2vec and a text classification algorithm, and can be 21 times faster and 20% cheaper than fastText on a single c4 instance. Of course, throughput is not the only factor in choosing a particular algorithm; we also need to consider accuracy and cost.
  • We use a large corpus of Tamil text for data ingestion, extracted from the Wikipedia dump with a WikiExtractor script. The more data you have, the more the model can learn, and the better the inferences.
  • Using a built-in algorithm is just one line of Python code, and the whole thing can be orchestrated in a CI/CD fashion via a feature within SageMaker. All of these tasks can be automated in a totally hands-off fashion.
  • SageMaker can make your machine learning development lifecycle a lot simpler. Explore natural language processing using some of the built-in algorithms within SageMaker, play with a favorite language of your choice, and explore the world of machine learning within the AWS ecosystem.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to this session on natural language modelling with Amazon SageMaker and the BlazingText algorithm. My name is Dinesh Kumar. I'm an aspiring ML specialist, and I'm hoping to convince you by the end of this session that you don't need a machine learning degree to take advantage of the tooling AWS has put at your disposal. As part of this session, we will purposefully pick a foreign language likely unfamiliar to you and apply machine learning to create some magic. We will use the Tamil language. Tamil is one of the world's longest-surviving classical languages, with a history dating back to 300 BC. It is rich in literature and has evolved over thousands of years. The reason I picked it is that I'm familiar with this language, and I was keen to see whether machine learning could figure out the relationships between the words of this language. I wanted to apply a known machine learning algorithm, in this case BlazingText within SageMaker, to discover these relationships. What's cool about today's talk is that after this session you will be able to apply these techniques to a language of your own choice and discover the relationships yourself.

Now, what are the prerequisites for this session? It will be good to have the fundamentals of machine learning or deep learning, an understanding of typical natural language processing problems (for example, search), and some familiarity with AWS services — not necessarily SageMaker, which I will cover as part of my session. Knowledge of Python and NumPy helps, but as long as you know any other programming language you will not be lost, and familiarity with the Jupyter notebook environment will be very handy. To make this magic happen, we need to introduce word embeddings with the word2vec algorithm, then understand a bit about Amazon SageMaker and its unique capabilities, especially in the NLP area. Then we can spend the bulk of our time on the BlazingText algorithm and a quick demo.

That takes us to the interesting topic of word embeddings. Let's dive in. Natural language text consists of words, so we need to represent individual words, sentences, and collections of words in some way. Couldn't we just use strings containing the words? First of all, words have different lengths, and even written representations differ dramatically from language to language. If you look at Tamil, some words are written one way in one region of Tamil Nadu, while the same words have a different representation, or a different way of being pronounced or spelled, in another region. On top of these complications, and more importantly, many machine learning techniques require numeric rather than text input: computers are good with numbers, and not with natural language vocabulary the way we are. That brings us to representing these words in numerical form. If I want to represent the phrase you see there, "to be or not to be", in numerical form, how can I go about it? Let's say I give each word an individual label: "to" gets 0, "be" gets 1, "or" gets 2, "not" gets 3. That gives me 0, 1, 2, 3, 0, 1 — a random set of numbers.
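As a minimal sketch of that naive labelling (the phrase and the label assignments are the example above; everything else is illustrative):

```python
# Naive integer labelling of "to be or not to be" (the example above).
phrase = "to be or not to be".split()

vocab = {}  # word -> arbitrary integer label, assigned in order of appearance
for word in phrase:
    vocab.setdefault(word, len(vocab))

labels = [vocab[w] for w in phrase]
print(vocab)   # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
print(labels)  # [0, 1, 2, 3, 0, 1]

# The labels imply spurious structure: |0 - 1| < |1 - 3|, i.e. "be" looks
# numerically "closer" to "to" than to "not", which means nothing linguistically.
```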
Representing words with arbitrary unique labels like this actually creates, or introduces, random relationships. For example, "be" as a word is now closer to "to" and "or" but far away from "not". Is that true? No. So how did we even arrive at these labels when no such relationship exists? Randomly naming words with numerical labels is not a great way of solving the problem. Now let's say we use a vector instead of a single number. The typical approach is one-hot encoding: each word gets an index into a vector of all zeros, with only the single element at that index set to one. You will have already guessed the problem with the one-hot encoding approach. The example here has just four words, but a typical language has hundreds of thousands of words. In the case of Tamil I don't have a count, but it's a very rich language, as I said earlier, with a large vocabulary. So representing words with ones and zeros is not an efficient way of solving the problem at hand.

With that in mind, let's look at the other side of the coin. Given a sentence, what are our chances of maximizing the probability of predicting the context words? Let's say I introduce the words "Tom Hanks". How can I predict the context of these words? What is the probability that somewhere around "Tom Hanks" we will find words like "great" or "actor"? Quite high, because he's an actor, so there is a good chance those words appear somewhere near "Tom Hanks". But what are the chances of finding something like "quantum physics" next to "Tom Hanks"? I would not say zero, but it is relatively low compared to the word "actor". That's the point we are trying to make: how do I figure out that a particular word has a stronger relationship, and is hence contextually closer, to another word? In the deep learning world, we typically use a fully connected network. Say the vocabulary has 10,000 words; for every input word the network receives, its output should be able to figure out which of those 10,000 words the input word is close to. Hidden layers of the network try to extract the context, and the network then spits out, for each word, the probability of it sitting contextually close to the input word. After training such a network, we can quickly compute a dense output vector for the sparse input vector we had after one-hot encoding.

So we have now grasped the problem we are trying to solve. The problem statement is: given an input word, we want the probability of each other word being in a close relationship with it — the set of words that sit contextually close to the word coming in for inference. Now it's time to understand the word2vec algorithm itself. The dimensionality of the output vector is a parameter we choose. This is why we say "embedding": we embed a high-dimensional object, a one-hot encoded word in this case, into a small-dimensional space. Turning a sparse vector into a much denser representation is what we are trying to achieve. Once this representation is computed, we can simply convert every word into a point in an n-dimensional space. In the end, words that appear in similar contexts will likely be mapped to similar vectors.
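To make that concrete, here is a minimal sketch of training such embeddings with the open-source gensim library (this is not the BlazingText container used later in the session; the toy corpus and parameter values are my own):

```python
# Minimal word2vec sketch with the open-source gensim library (gensim >= 4.0).
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["tom", "hanks", "is", "a", "great", "actor"],
    ["the", "actor", "won", "an", "award"],
    ["quantum", "physics", "is", "hard"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the embedding space (we choose this)
    window=2,        # context window around each word
    min_count=1,     # keep every word, even rare ones (tiny toy corpus)
    sg=1,            # 1 = skip-gram, 0 = CBOW
)

vec = model.wv["actor"]                # dense 50-dimensional vector
print(model.wv.most_similar("actor"))  # nearest neighbours in embedding space
```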
Words close to each other in vector space are likely to be similar in meaning, as they tend to be used in similar contexts. This is where we get closer to the magic: a machine learning system automatically discovering words that appear to have similar meaning. In theory, we also expect certain vector relationships to hold. This doesn't have to be exact, and depends entirely on the corpus used for training, but we can discover word relationships with transitive properties, as shown on the slide. For example, take the vector that corresponds to the word "king" — the classic example we usually see. If I take the vector for "king" and subtract the vector for "man", I get something whose meaning is roughly "royalty"; if I then add the vector for "woman", I arrive at the vector for "queen". This is the magic here: we have managed to capture the meaning of a word, and we can then do the kind of addition and subtraction we do with numbers, but with words and the inherent meaning they carry. That's exactly what the word2vec algorithm is trying to solve. You may have heard about newer models such as BERT and RoBERTa, but the end goal is still word embeddings.

Here I'll show you a completed word embedding for the English language, mapped into a 100-dimensional vector space. How can we visualize this high-dimensional result on a two-dimensional picture? This is where we can use another trick of the trade: a t-distributed stochastic neighbor embedding (t-SNE) plot. That might sound like a mouthful, but it is just a way of visualizing all those relationships in a two-dimensional space. As you see here, the model has figured out that "american", "british", "English", "London", "England", "French", "France", "German" are all close to each other. It has learned that they can all appear in similar contexts, or are related in some form or shape. These clusters are interesting, and they emerge from the corpus that was thrown at the model. The most striking one I'd point out is that the model has found a relationship between "son", "family", "children", "father", "death", "life". It has figured out that these words are related and close in context, and that there is a higher probability of a word from this cluster appearing next to a word from the same cluster that it has just come across.
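A plot like the one on this slide can be produced with scikit-learn's t-SNE; here is a rough sketch, assuming `embeddings` is a dict mapping each word to its (say, 100-dimensional) NumPy vector:

```python
# Project word vectors to 2-D with t-SNE and plot them.
# Assumes `embeddings` is a dict of word -> 1-D numpy vector (e.g. 100-d).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = list(embeddings.keys())
vectors = np.stack([embeddings[w] for w in words])

# perplexity must be smaller than the number of points being plotted
coords = TSNE(n_components=2,
              perplexity=min(30, len(words) - 1),
              random_state=0).fit_transform(vectors)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=8)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=8)
plt.title("t-SNE projection of word embeddings")
plt.show()
```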
Now, enough of word2vec and word embeddings, the actual problem we are trying to solve. Let's try to understand the tooling at our disposal, how SageMaker itself as a service can help us solve this problem, and how the BlazingText algorithm then fits into it. This is the typical AWS AI/ML stack. At the top you see AI services. Most of these are out-of-the-box services, in the sense that you do not need any sort of machine learning skill. Say you are a team of developers with zero machine learning experience who want to see how you could use it for your own application, problem, or workload — these are the go-to AI services. They span different areas: we have AI services for vision, speech, language, chatbots, forecasting, and recommendations. All you need to do is make an API call. These models are up and running, and always learning, because so many of our other customers are using them. With an API call you get the inferences back, and you do not have to go down the route of building a model, training it, validating it, and monitoring it in any way — you just consume it.

But let's say you are developers who are already into machine learning and want to make your life simpler; that's where the ML services layer, with Amazon SageMaker as a platform, comes into the picture. Amazon SageMaker is a platform that was built from the ground up for developers. The primary ambition was to make machine learning development easy for developers, and hence it packs a lot of services that make the development lifecycle a lot easier than it was before. We will dive into that part in detail. Otherwise, if you are an expert and want to do things at your own pace with the frameworks of your choice, we fully support TensorFlow, MXNet, and the other popular frameworks, with a wide range of instances, including GPU-based instances, available for you to leverage. But we will primarily focus on SageMaker as a platform in this session, because it's a very big world to explore on its own.

So let's take ourselves to the main hero of today's subject: Amazon SageMaker. As I said, you can build, train, and deploy ML models at scale — and "at scale" is the key term there. Whichever part of the journey you are in, be it building, training, or deploying, SageMaker as a platform offers the right set of APIs for each stage. With a bit of Python code you can build your model, train it, validate it, and deploy it. As part of today's demo, we will show you how this is done, but that's how easy it is to get going in SageMaker. To start with, we offer pre-built notebooks. These pre-built notebook examples are available within the SageMaker environment, in your own AWS console, and machine learning developers can use these examples to start experimenting for their own specific use cases. So if you are new to SageMaker, you do not have to start from scratch: you can leverage these pre-built notebook examples as a starter to explore the SageMaker environment.

Now, let's say you have explored it and want to move forward. You first need to settle on your training data — that's the most important bit. Your models are only as good as your training data, so many of our customers spend a lot of time ensuring they have the right data. Let's say you have the training data you need. You then need to choose the right algorithm to go with it. When it comes to algorithms, we have lots of choices within the SageMaker world: regression, classification, image- and vision-based models, and more, with plenty of built-in algorithms that come in very handy. These algorithms come in the form of containers; all you need to do is refer to the container registry and pull the right container for the algorithm, and you should be good to go with it. Otherwise, if none of the existing built-in algorithms caters to your needs, you can always bring your own algorithm in the form of a container and still leverage SageMaker as a platform. What's unique about these built-in algorithms is that they take advantage of the distributed computing infrastructure we have in the cloud.
You might sometimes find equivalents of these algorithms in open source as well, but the ones built into SageMaker are validated for their efficiency in utilizing the distributed compute environment and infrastructure the cloud offers, and hence tend to be a lot more performant than the equivalent open-source versions you will find out in the market. So let's say you have determined the algorithm; next you train the model. You tell SageMaker the number of machines you want to use for training, and then you can kick off the training with just one line of Python code. Yes, you heard that right: it is just one line of Python code with the SageMaker SDK, or a click in the console, and that's all it takes to get the training going and create your model. There are many deep learning framework containers supported in SageMaker. As I said earlier, it uses Docker containers designed for a specific algorithm and built to support a particular framework. So let's say you have a particular TensorFlow algorithm you want to use for your use case: you will find a container for that framework and algorithm, and as long as you refer to it in your code, you can pick it up and run with it.

When executing the training, as I said, you just select the right container and provide the data as files in S3 (Amazon Simple Storage Service). SageMaker then launches a cluster of training machines. It will not just launch machines arbitrarily, so you do not have to worry about cost: it launches exactly the number of machines and the instance types you specified in your code, and uses those machines to train on the data from S3. It can also perform distributed training, as I said earlier: you can train on multiple instances, or you can always choose to train on a single machine. At the end of the day, the resulting model lands back in S3, so you can always look at the model, see how accurate its inferences are, and decide whether to go ahead or to retrain with a new set of data.

Now, what if the model is not optimal enough? There are many optimization techniques available to you. Usually, when a model is not performing well, what ML developers do is tune the hyperparameters associated with that algorithm. For people who may not know what these are: there is usually a set of parameters that controls the behavior of a given algorithm, and machine learning developers are often left in the dark as to what values they should use to get their model optimized. That's where automated model tuning comes to the rescue. The reality is that even experienced machine learning practitioners often don't know what to do; in these cases, they just rely on random or grid-based hyperparameter search strategies. SageMaker lets you kick off a so-called hyperparameter optimization job, where you specify how many machines it runs on to control the cost, and it ultimately helps you identify the right parameters to optimize your model. It does this with a Bayesian search algorithm. Basically, say it ran the training with one set of parameters and the resulting model accuracy is not fitting the bill: it remembers the parameters it ran with in previous iterations, and chooses to move towards or away from them based on what it is learning as it goes. It is an efficient way of tuning a model, and many of our customers leverage it for their machine learning training.
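As a rough sketch of what kicking off such a tuning job looks like with the SageMaker Python SDK — assuming `bt_estimator` and `s3_train_data` are defined as in the demo later in this session, and with illustrative parameter ranges and the objective metric BlazingText reports in word2vec mode:

```python
# Hypothetical hyperparameter-tuning sketch (SageMaker Python SDK).
# Assumes bt_estimator and s3_train_data exist as in the demo below.
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=bt_estimator,
    # Objective metric emitted by BlazingText in word2vec mode.
    objective_metric_name="train:mean_rho",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.005, 0.05),
        "window_size": IntegerParameter(3, 8),
    },
    max_jobs=10,          # total training jobs the Bayesian search may run
    max_parallel_jobs=2,  # cap parallelism to control cost
)

tuner.fit({"train": s3_train_data})
print(tuner.best_training_job())
```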
Amazon SageMaker Neo is something I will briefly touch on. With Neo, you can compile your models to be ported to any of the target processors you may choose to run your model on. This way, your model is not only smaller and deployment-ready, but also more performant. It doesn't matter which target architecture you want to port your model to: Neo can help you do that, and you can carry your model with you and deploy it, in a much lighter form, on any architecture of your choice. The list of supported target architectures is always being added to, so check the AWS pages to see which ones we currently support.

So: we have the data, we trained the model, we validated it after fine-tuning thanks to hyperparameter optimization, and we now have the accuracy and performance we need to deploy it to production. As you will see in the demo, with one line of Python code you can take this model to production. You can then manage it and easily scale it with Amazon Web Services. That's the beauty of SageMaker: everything is simplified so developers can leverage this distributed computing platform and focus on the business outcomes they want, rather than all the undifferentiated heavy lifting of building, training, and validating these models. That's the bit about SageMaker I wanted to cover before we dive into the world of the BlazingText algorithm.

As I said, the BlazingText algorithm was published back in 2017 by a couple of Amazonians. This is the paper, released back in 2017, that discusses how BlazingText goes about this particular problem. The key thing to note here is that the algorithm provides highly optimized implementations of word2vec and a text classification algorithm. Using BlazingText, you can train a model on more than a billion words in a couple of minutes using multi-core CPUs or a GPU, and you can achieve performance on par with the state-of-the-art deep learning text classification algorithms out there. The other important thing it offers is an implementation of a supervised multi-class, multi-label text classification algorithm, extending the fastText implementation by using GPU acceleration with custom CUDA kernels, while also relying on multiple CPUs for certain modes of operation. In this particular demo we will be using distributed training, just so you know, but there is no hard rule: you can always do it on a single machine if that suffices for your needs. These are some highlights of BlazingText I would love to call out: you can run it on single-CPU instances, or with multiple GPUs for acceleration if needed. And the interesting thing you see on this slide is that it can be 21 times faster and 20% cheaper than fastText on a single c4 instance.
And if you go down the distributed training route, it can achieve a training speed of up to 50 million words per second. That is a speedup of eleven times over the single c4 instance we saw in the previous comparison, which is amazing — the kind of efficiency they have managed to harness from the BlazingText algorithm. Now, how do they do that? In our demo, as I said, we will use BlazingText on multiple CPUs, but even on a single CPU, BlazingText takes certain steps to optimize its performance, and that's what we see here: it uses Intel's optimized BLAS routines, and hence it is a lot more efficient in its CPU utilization. As you see here, this picture shows how word2vec is optimized by sharing the k negative samples across a batch, taking advantage of those BLAS routines from Intel.

This next slide compares throughput on the one-billion-word benchmark dataset. On the right-hand side you see the throughput characteristics of the published fastText implementation; because fastText cannot be distributed over multiple CPUs or GPUs, it is benchmarked on a single machine. Compare and contrast that with the left-hand side, where the algorithm has been run on multiple multi-core GPU machines. In the middle section, the yellow bars, we benchmarked batch skip-gram; here you see the results of running the algorithm in distributed fashion on multiple CPUs. Of course, throughput is not the only factor in choosing to run with a particular algorithm, because we also need to consider accuracy and cost. So let me show you another set of benchmark results. The right way to interpret this diagram is that the size of each circle denotes the throughput that the algorithm achieves with a given configuration, the horizontal axis is cost, and the vertical axis is accuracy. If you compare number eight — which, I believe, is the run of BlazingText with batch skip-gram — with number two, you get a lot more throughput, at almost the same accuracy, for much less cost. It just goes to confirm the previous claim: BlazingText is a lot more performant and a lot cheaper compared to the fastText algorithm.

With that, we move on to the demo part of our discussion — which is called "semurai" in Tamil, by the way. Let me bring up the demo page for you. I hope you're able to see this: this is the notebook I created in Amazon SageMaker. You can simply go to the Amazon SageMaker service — I could show you that quickly, but I'll just continue with the notebook, because it might be a bit tricky for me to share that screen. It's very simple: in the Amazon SageMaker service within the AWS console, go to "Notebook" in the left panel and create a Jupyter notebook. This is as simple as any Jupyter notebook you have seen, nothing special about it. For anyone not aware of it, it is just an environment for data scientists to share and do data science in a collaborative way. That's all about this environment.
Now, if you look at this, as I said, we are going to take a large corpus of Tamil text and use it for data ingestion. In this case, I took the dump from this particular URL. You could very well get yours from wherever you choose, but as I said, your model is only as good as your data, so always please be careful with the data you pick. In my case I chose the Wikipedia dump. It's totally up to you where you get it from, but the more data you have, the more the model can learn, and the better the inferences will be. So we have downloaded this wiki dump. Now, there is the WikiExtractor.py script created by Attardi; you can find it on GitHub. What we are doing here is passing in the dump we downloaded and extracting the data with that script. The extractor simply cleanses the data and makes it easy for us to do the machine learning model training. As you see, it has picked up the file and given us the list of words we can pass into our training stage. So we have all these Tamil words: mudarpakam, katirakalai, katirangalin, patiyal, poviyal, varala, arupuri — words it has picked up from the dump. Now, this is a very big dump that I downloaded, and I didn't want to waste time during the demo, so I did the hard work of downloading it and training the model prior to this session. I'll just scroll to the bottom of the extraction part — or rather the cleansing part — and there we go, the extraction is done. In fact, it was running for a long time; I decided I had got enough data, so I killed it to get on with the next stage. But you can leave it running for a long time if you want a lot more data. As I said, the more data, the better.

Now this is where SageMaker comes to the party. If you see here, we are importing the SageMaker SDK, creating a SageMaker session, and creating a default bucket that we will use for this particular training. We then upload the data: the cleansed data we got from running WikiExtractor.py is pushed to the bucket, and we also set the S3 output location. Those are the basic constructs we need from the SageMaker service before we can get to the algorithm side of things. Now, as I said, using a built-in algorithm is just one line of Python code. There you see it: from sagemaker we import image_uris, and we retrieve the image for "blazingtext". That gives us a pointer to the BlazingText algorithm, and here you see we are using the SageMaker BlazingText container from eu-west-1. The container object now essentially identifies the kind of algorithm that is going to be applied for this training. Again, in this case we have chosen BlazingText, but you could go and choose another built-in algorithm within the Amazon SageMaker world, or bring in a container of your own, perhaps one you have on premises, and use that for your training.
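A condensed sketch of those setup steps might look like this (the local file name and S3 prefix are illustrative, not from the session):

```python
# Sketch of the SageMaker setup described above. Assumes the cleansed
# corpus from WikiExtractor.py is in a local file "tamil_corpus.txt"
# (an illustrative name) and that the notebook role has S3 access.
import sagemaker
from sagemaker import image_uris

session = sagemaker.Session()
bucket = session.default_bucket()  # default bucket for this account/region
prefix = "blazingtext/tamil"       # illustrative S3 key prefix

# Upload the training corpus and choose where the model artifact should land.
s3_train_data = session.upload_data(
    path="tamil_corpus.txt", bucket=bucket, key_prefix=f"{prefix}/train"
)
s3_output_location = f"s3://{bucket}/{prefix}/output"

# One line to resolve the BlazingText container for the current region.
region = session.boto_region_name  # e.g. "eu-west-1" in the session
container = image_uris.retrieve("blazingtext", region)
print(container)
```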
And now we create the estimator object. To the estimator we pass the container object we just created, and the role that the training job will assume while it is training. This role is what allows the job to retrieve the data from S3, push the trained model back to S3, and interact with any other service it needs to; it controls the permissions associated with that particular training job. We also give it the number of instances we want to use for training. As I said, this is totally dictated by you: which instances are used, how many, what instance type, and what the input mode is. You can choose "File" mode, or there is another, more performance-oriented input mode you could go for. Once you choose these parameters and create the estimator object, you then set the hyperparameters. In this case, we have set the hyperparameters ourselves, but as I said, we could use the hyperparameter tuning, or hyperparameter optimization, option we mentioned and discussed earlier, which iterates to lock in the hyperparameters that give you the best accuracy and the best-performing model to choose from. Once the hyperparameters are set, you point to the training data to be used and kick off the training by calling the fit method. Once you call fit, as you see here, it starts the training job and completes it, and you are charged only for the time the training runs. As you see here, the total training time in seconds is 3286, so the four instances of type c4.2xlarge that you chose are charged only for the seconds the actual training job took.

Once the training is completed, the trained model is uploaded to S3; it now resides there. And as I said earlier in our session, right after training completes — if you are happy with the accuracy — usually our customers choose to have a validation stage: one of your data engineers or data scientists, whoever controls what model gets deployed to production, receives an approval task, checks whether the model's accuracy is good enough to be deployed to production, and gives it a go. This whole thing can be orchestrated in a CI/CD fashion. We have something separate called SageMaker Pipelines, a feature within SageMaker that you can leverage; there are no charges for it, and it is simply the way you can do CI/CD for machine learning. All of these tasks — ingestion, training, validation, deployment — can be orchestrated in a totally automated fashion if you want. But deployment itself is just the one line of code you see there: I'm happy with this model and I want to deploy it on this particular instance type, and that's it, it gets deployed. Once it is deployed, you now have an endpoint to run inference against. So if you see here, I am creating a set of words that I want to use for inference.
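Pieced together, the training, deployment, and inference steps just described might look like the following sketch (hyperparameter values are illustrative; the request body follows the documented BlazingText JSON inference format, here with a few of the Tamil words described next):

```python
# Sketch of the train/deploy/infer flow described above. Assumes
# `container`, `s3_train_data`, and `s3_output_location` from the earlier
# setup sketch; hyperparameter values here are illustrative.
import json
import sagemaker

bt_estimator = sagemaker.estimator.Estimator(
    container,
    role=sagemaker.get_execution_role(),  # permissions for the training job
    instance_count=4,                     # distributed training on 4 machines
    instance_type="ml.c4.2xlarge",
    input_mode="File",
    output_path=s3_output_location,
)

bt_estimator.set_hyperparameters(
    mode="batch_skipgram",  # distributed word2vec variant
    epochs=5,
    vector_dim=100,         # dimensionality of the word embeddings
    min_count=5,            # ignore words rarer than this
)

bt_estimator.fit({"train": s3_train_data})

# One line to deploy the trained model behind a real-time endpoint.
predictor = bt_estimator.deploy(initial_instance_count=1,
                                instance_type="ml.m5.large")

# BlazingText word2vec inference: send words, get their vectors back.
payload = {"instances": ["isai", "paadal", "arasiyal"]}  # music, song, politics
response = predictor.predict(
    json.dumps(payload),
    initial_args={"ContentType": "application/json"},
)
vectors = json.loads(response)  # list of {"word": ..., "vector": [...]}
print(vectors[0]["word"], vectors[0]["vector"][:5])
```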
Here you see the words: the first is "Tamil"; the second is "language", which in Tamil is "mozhi"; "music", which in Tamil is "isai"; "song", which is "paadal"; "politics", which is "arasiyal"; "leader", which is "thalaivar"; "year"; and "century", which is "nootrandu". These are random words: some of them are related, some are not. We will see how the inference behaves based on the context it has learned with the BlazingText algorithm we used for training. We point at the endpoint we just created by deploying the model, and what happens here is that as I pass in these words, it creates vector representations of them. For example, starting from here and ending here is the vector representation of one word: what it is doing is mapping the word "Tamil" into an n-dimensional space, and that's why you see this long list of numbers — it is represented as a list here. You get this kind of list for each word you vectorize. Then we map these words from n-dimensional space down to two-dimensional space for visualization. But this is where word2vec does its magic: once the vectorization is completed, you have a numerical representation of each word — not just zeros and ones, but that long list of numbers you see at the top.

Now this is the real fun. If you look at music, the word "isai" is close to the word for song, "paadal"; they are close, and hence the vector difference gives you 6.17. Forget the exact number — what matters is how close the model has inferred them to be. Now look at politics and leader: yes, they are close, and the vector for leader minus the vector for politics gives you 5.8. They are a lot closer, because these are words that appear in the same context. When I try music and politics, the model has figured out that they are further apart than politics and leader. So what we have achieved is a vector representation of each of these words, and from that, the contextual distance between them — where they sit relative to each other in n-dimensional space. That's what our inference achieved. Now, because I restricted myself to a modest corpus and did not bother much about accuracy, we are seeing what we are seeing; with a bit more effort on hyperparameter optimization, this can be a lot more accurate, and very interesting inferences could be made. Another trick I mentioned earlier is that you could pull down the trained model, unpack it, and apply matplotlib techniques to create a two-dimensional representation of these words later on.

I think with that, we come to the end of this session. Just to summarize: we started with a language likely unknown to you, Tamil; we understood what word2vec and word embeddings are and why we need them; and we introduced SageMaker as a platform, the many features that come packed into it, and how SageMaker can help make your machine learning development lifecycle a lot simpler by offloading the undifferentiated heavy lifting you would otherwise be doing.
We also explored, through the simple demo in the later part of the session, how easy it is to apply the BlazingText algorithm to data of your choice. I believe this has sparked some interest in you to go and explore natural language processing using some of the built-in algorithms we have within SageMaker, or an algorithm of your choice — play with a favorite language of yours and explore the world of machine learning within the AWS ecosystem. Thanks for joining the session. It was my pleasure to present it to you, and I wish you a great day ahead. Bye now.
...

Dinesh Subramani

Solutions Architect @ AWS

Dinesh Subramani's LinkedIn account


