Conf42 Machine Learning 2021 - Online

Multilingual Natural Language Processing using Python

Natural Language Processing (NLP) is an interesting and challenging field. It becomes even more interesting and challenging when we take more than one human language into consideration. When we perform NLP on a single language, interesting insights from other human languages may be missed. Valuable information may be available in other major languages of the world, such as Spanish, Chinese, French, and Hindi. Also, the information may be available in various formats such as text, images, audio, and video.

In this talk, I will discuss techniques and methods that help perform NLP tasks on multi-source and multilingual information. The talk begins with an introduction to natural language processing and its concepts, then addresses the challenges of multilingual and multi-source NLP. Next, I will discuss various techniques and tools to extract information from audio, video, images, and other types of files using the PyScreenshot, SpeechRecognition, Beautiful Soup, and PIL packages, as well as extracting information from web pages and source code using pytesseract. I will then discuss concepts such as translation and transliteration, which help bring the information into a common language format; once the text is in a common language, it becomes easy to perform NLP tasks. Finally, I will walk through code that generates a summary from multi-source and multilingual information in a specific language using the spaCy and Stanza packages.

Outline
1. Introduction to NLP and concepts (05 minutes)
2. Challenges in multi-source multilingual NLP (02 minutes)
3. Tools for extracting information from various file formats (04 minutes)
4. Extracting information from web pages and source code (04 minutes)
5. Methods to convert information into a common language format (05 minutes)
6. Code walkthrough for multi-source and multilingual summary generation (10 minutes)
7. Conclusion and questions (05 minutes)


  • Gajendra Deshpande presents a talk on multilingual natural language processing using Python, discussing in brief natural language processing and its concepts, then the challenges in multi-source multilingual natural language processing and the relevant tools.
  • googletrans 3.0.0 is a free and unlimited Python library that implements the Google Translate API. It uses the Google Translate Ajax API to make calls to methods such as detect and translate. Once we translate the information into one language, we can process the text.
  • Stanza is a Python NLP package for many human languages. iNLTK is a natural language toolkit for Indic languages. These libraries are limited in features.


This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Gajendra Deshpande, and today I will be presenting a talk on multilingual natural language processing using Python. In today's talk we will discuss, in brief, natural language processing and its concepts; then challenges in multi-source multilingual natural language processing; tools for extracting information from various file formats; extracting information from web pages and source code; and finally, methods to convert information into a common language format. Let us first look at a few basic concepts of natural language processing. The first is tokenization. In tokenization, we split a paragraph into words and sentences. You will typically be given a huge text, and it is not possible to process the entire text at once, so we need to tokenize it; for example, we may have to compute word frequency and sentence frequency, and we may also need to perform n-gram analysis. Next is word embeddings, where we represent words as vectors, that is, we convert words into numeric form for computation. Then text completion, where we try to predict the next few words in a sentence. Then sentence similarity, where we try to find a similarity score between two sentences, on a scale from zero to one. Then normalization, where we transform text into its canonical form. Then transliteration: writing text of language A using the script of language B; for example, you can write Korean using the English script. Then translation: converting text in language A to language B, that is, directly converting Korean text into English. So there is a difference between transliteration and translation; the two are totally different. Then phonetic analysis, where we try to determine how characters sound when spoken. Next is syllabification: converting text into syllables. Then lemmatization.
Here we convert words into their root form. For example, for the word "running", the root form is simply "run". The next concept is stemming. It is a bit similar to lemmatization, but in stemming we just remove the last few characters from the word. Sometimes the result matches lemmatization, but stemming is not always accurate, so you will not always get the true root form. Then language detection: detecting the language of a text or of individual words. Then dependency parsing: analyzing the grammatical structure of a sentence. Then named entity recognition: recognizing the entities in a text, for example names, places, et cetera. Then part of speech: tagging the parts of speech in a text. Now, challenges in multilingual NLP. The first challenge is that language is ambiguous: the same sentence may mean different things in different languages, and even within one language we need to identify the context of the words. Then, languages have different structure, grammar, and word order. For example, we have Western languages, Indian languages, and others; each language has its own grammar, and the syntax differs. Some languages are read left to right and some right to left. Then, it is hard to deal with mixed-language information. We know that due to globalization, people speak multiple languages, and when they do, there is a general tendency to mix words from different languages. When we have data of that kind and need to process it, it may create problems, because our libraries may not be able to detect or identify the words of other languages. Next, translation from one language to another is not accurate: translation is not always word-for-word, so the meaning needs to be taken into account.
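The basic concepts above (tokenization, word frequency, n-grams, stemming versus lemmatization) can be illustrated with a minimal, dependency-free sketch; the suffix list and lemma table here are toy stand-ins for illustration, not from the talk:

```python
import re
from collections import Counter

def tokenize(text):
    # Split a paragraph into lowercase word tokens.
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def naive_stem(word):
    # Stemming: chop a matching suffix -- crude, and not always accurate.
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Lemmatization via a tiny hand-made lookup table (a real lemmatizer
# uses a dictionary plus morphological rules).
LEMMAS = {"running": "run", "studies": "study", "better": "good"}

def lemmatize(word):
    return LEMMAS.get(word, word)

tokens = tokenize("The quick brown fox jumps over the lazy dog.")
print(Counter(tokens).most_common(1))  # [('the', 2)] -- word frequency
print(ngrams(tokens, 2)[:2])           # first two bigrams
print(naive_stem("running"))           # 'runn' -- stemming can miss the root
print(lemmatize("running"))            # 'run'  -- lemmatization finds it
```

The last two lines show the difference discussed above: suffix stripping yields "runn", while a dictionary-backed lemmatizer returns the true root "run".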
Language semantics also need to be taken into account. That is what I was just discussing: translation is not word-for-word, so we need to take the context into account for a more accurate translation. Then, lack of libraries and features: of course there are many libraries in Python, but they may not support all languages and they may be limited in features, so we may have to use multiple libraries or hard-code many features. Now let us consider a scenario of multi-source, multilingual information processing. Say, for example, we need to generate a summary. These are the steps: first, the information source, which is in different formats; then extract the text; then identify the language; then translate to a source language; then process the text; and finally translate to the target language. Let us discuss these steps in detail. First, the information source: our information may be present in various formats, such as text, audio, video, or images, but for processing we need it in textual format. If the information is already text, there is no problem; but if it is audio, video, or an image, we need to extract the text from it. And since there are many formats for audio, video, and images, we should try to extract information from as many formats as possible. In the second step, we extract the text using libraries available in Python. Next, we identify the language, that is, we try to detect the language of the text. This is a bit challenging, because the text may not all be in a single source language, and some words may not be identified because of feature limitations; in that case we may have to hard-code some features. The next step is translating to a source language.
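The language-identification step can be approximated by looking at which Unicode script a text's characters belong to; this is a simplified sketch (real detectors such as those in polyglot or googletrans use statistical models, and the script ranges below cover only a few examples):

```python
def detect_script(text):
    # Map characters to a script by Unicode code-point range.
    ranges = {
        "Latin":      (0x0041, 0x024F),
        "Devanagari": (0x0900, 0x097F),
        "Kannada":    (0x0C80, 0x0CFF),
        "Hangul":     (0xAC00, 0xD7AF),
    }
    counts = {name: 0 for name in ranges}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in ranges.items():
            if lo <= cp <= hi:
                counts[name] += 1
    # Return the script with the most matching characters.
    return max(counts, key=counts.get)

print(detect_script("नमस्ते दुनिया"))  # Devanagari
print(detect_script("hello world"))   # Latin
```

Note that script detection is weaker than language detection: Hindi and Marathi both use Devanagari, which is one reason mixed-language text is hard to handle, as discussed above.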
Here there is one very important step: we need to translate the entire text into one single language so that we can process it. Once we translate the information into one language, we can process the text. For example, to generate a summary, we can perform tokenization and lemmatization, calculate word frequency and sentence frequency, and perform n-gram analysis; based on these steps, we can pick the top n sentences for the summary. Finally, we translate to the target language: you can generate the summary in the source language, or in a specified destination language. Now let us look at a few Python packages that help us achieve this task. The first is googletrans 3.0.0. It is a free and unlimited Python library that implements the Google Translate API. It uses the Google Translate Ajax API to make calls to methods such as detect and translate. It is compatible with Python 3.6 and higher versions, and it is fast and reliable because it uses the same servers that translate.google.com uses. Auto language detection is supported, bulk translations are possible, a customizable service URL is supported, and it also supports HTTP/2. You can install it using pip: pip install googletrans. If the source language is not given, Google Translate attempts to detect the source language. You can see the source code here. First we import Translator from googletrans, then we use the translate function. Note that we have specified only the text, which is Korean, but we have not mentioned the source language. It detects that it is Korean, and since the destination is not specified, by default it is converted to English.
In the next case we have not specified the source language, but we have specified the destination language, Japanese; so the detected Korean text will be converted into Japanese text. In the next example we have specified some text and declared that the source language is Latin; since we have not specified the destination language, the Latin text will be converted to English. You can also use another Google Translate domain for translation: if multiple service URLs are provided, it randomly chooses a domain, so you can specify either one domain or multiple domains. Then the detect method, as its name implies, identifies the language used in the given sentence, so you can use it to identify the language of a given text. An important point here is that the library is unofficial and unstable, and the maximum character limit on a single text is 15k characters. The solution is to use Google's official Translate API for your requirements. The next package is SpeechRecognition, with which we extract text from an audio file. It is a library for performing speech recognition with support for several engines and APIs, available both online and offline. The engines and APIs supported by the SpeechRecognition package are CMU Sphinx (works offline), Snowboy hot-word detection (also works offline), Google Speech Recognition, Google Cloud Speech API, Wit.ai, Microsoft Bing Voice Recognition, Houndify API, and IBM Speech to Text. This is how we write the code: first we import the speech_recognition package, then we specify the audio file from which we want to extract the text, then we use the Recognizer class to initialize a recognizer.
Then we use the record function, where we specify the source audio, and finally we call the recognize_google function, which converts the speech to text and stores it in a text variable. Finally we can print the text or store it in a variable for further processing. The next package is pytesseract, which is used to extract text from an image file. pytesseract is an optical character recognition (OCR) tool for Python: it recognizes and reads text embedded in images. It is a wrapper for Google's Tesseract-OCR engine. It is also useful as a standalone invocation script, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including JPEG, PNG, GIF, BMP, TIFF, and others. Additionally, when used as a script, pytesseract will print the recognized text instead of writing it to a file. Just three lines are enough to extract the text from an image file: first import the pytesseract package, then set the tesseract command path, then specify the image path. We use the image_to_string method, which reads the image and converts the data in it to a string. The next package is Beautiful Soup 4, which is used to extract information from a web page. If you have done web scraping, then you are familiar with this package. It is a library that makes it easy to scrape information from web pages: it sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. This is how we write the code: first we import the requests package, then Beautiful Soup, then we specify the URL from which we want to extract information, and then we specify the parser; here the HTML parser is used. The thing to note is that when you use Beautiful Soup, it extracts the source code.
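A runnable sketch of this extraction, assuming Beautiful Soup 4 is installed; the HTML snippet is inlined here for illustration, whereas with a live page you would first fetch the markup, for example with requests.get(url).text:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Multilingual NLP</h1>
  <script>console.log('noise');</script>
  <p>Information may be in <b>any</b> language.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Drop script/style noise before extracting the visible text.
for tag in soup(["script", "style"]):
    tag.decompose()

text = " ".join(soup.get_text().split())
print(text)  # "Multilingual NLP Information may be in any language."
```

Dropping the script and style tags first matters because, as noted above, the parser gives you the whole source code, not just the readable content.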
It is not the server-side source code, but the code rendered by the web browser. So the next task will be to remove all the unwanted markup and navigate to the appropriate location in the web page. If you are using XML, you can use XPath queries to navigate to a particular location in the page and extract its content. The next library we will see is Stanza, a Python NLP package for many human languages. Stanza is by Stanford; it was earlier known as StanfordNLP, but they have changed its name, so now it is known as Stanza. It is a collection of accurate and efficient tools for many human languages in one place: starting from raw text, through syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to the languages of your choosing. It is a native Python implementation requiring minimal effort to set up, with a full neural network pipeline for robust text analytics, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. Pretrained neural models support 66 human languages, and it is a stable, officially maintained Python interface to CoreNLP. You can refer to the GitHub repo for more information, and you can also visit the stanza.run website for a live demo. On this slide you can see the output of Stanza. Here I pasted one story you all know: the race between the tortoise and the rabbit. You can see it showing the part-of-speech output, the lemmas, and the named entity recognition; for example, it says that "two" is a cardinal. It also shows the universal dependencies between the words in a sentence. Next, I converted the same text, the same story, into another language: Hindi.
You can see here that part of speech is working fine; it has successfully identified the parts of speech. Lemmas are also working fine. But if you look at named entity recognition, this feature is not yet supported, so those who want to contribute can think of contributing in this particular area for the Hindi language. Likewise, if you consider some other languages, features are lacking. This is what I was saying earlier: libraries are limited by features, and we may have to hard-code some features. It also shows universal dependencies, so that is not a problem. Next is iNLTK, a natural language toolkit for Indic languages, created by Gaurav Arora. It aims to provide out-of-the-box support for the various NLP tasks that an application developer might need for Indic languages, that is, the languages used in India. India is very rich in terms of languages; it has around 22 official languages. iNLTK supports both native-language and code-mixed text. Native language means text in a single language, whether Kannada, Hindi, Marathi, Tamil, Telugu, or some other language. Code-mixed means words from two or more languages are mixed; for example, Hinglish, which is a combination of Hindi and English, or Kanglish, which is a combination of Kannada and English, meaning the script is Kannada but some English words are used in between. iNLTK is currently supported only on Linux and Windows 10, with Python version greater than or equal to 3.6. The next library is the Indic NLP Library. You can see the language support here: there are different classifications, namely Indo-Aryan, Dravidian, and others, along with the features supported for the various languages of India. Among Dravidian languages, most features are supported. In the Indo-Aryan category, Hindi, Bengali, Gujarati, Marathi, and Konkani
support all the features. Even Punjabi supports features like script information, normalization, tokenization, word segmentation, romanization, and so on. Then there are some bilingual features: script conversion is possible among the above-mentioned languages, except for Urdu and English, where it is not possible; transliteration is possible; and translation is also possible. This library was created by Anoop Kunchukuttan. The goal of the Indic NLP Library is to build Python-based libraries for common text processing and natural language processing in Indian languages. Indian languages share a lot of similarity in terms of script, phonology, language syntax, et cetera, and this library is an attempt to provide general solutions to the commonly required toolsets for Indian language text. Then polyglot is another interesting library. It is a very vast library, and it supports most of the human languages in the world; it is really massive. It was developed by Rami Al-Rfou. It supports various features, and you can see in brackets how many languages each feature supports: tokenization for 165 languages, language detection for 196 languages, named entity recognition for 40 languages, part-of-speech tagging for 16 languages, sentiment analysis for 136 languages, word embeddings for 137 languages, morphological analysis for 135 languages, and transliteration for 69 languages. Again, note that the features are limited for some of the languages, so there is scope for contribution here as well. Finally, the summary: performing NLP tasks on multiple human languages at a time is hard, especially when the text includes mixed languages. The information needs to be extracted from multiple sources and multiple languages, and should be converted to a common language. Multilingual NLP helps to generate output in a target language.
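The frequency-based summary generation described earlier in the talk (tokenize, count word frequencies, score sentences, pick the top n) can be sketched in plain Python; the exact scoring heuristic below is an assumption for illustration, not the talk's code:

```python
import re
from collections import Counter

def summarize(text, n=2):
    # Split into sentences, score each by the average corpus frequency
    # of its words, then return the top-n sentences in original order.
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    top = sorted(sentences, key=score, reverse=True)[:n]
    return ". ".join(s for s in sentences if s in top) + "."

text = "Python is great. Python is popular. Cats sleep."
print(summarize(text, n=2))  # "Python is great. Python is popular."
```

In the full multilingual pipeline, this step would run after the text has been translated into one common language, and the resulting summary would then be translated into the target language.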
What we are doing here is converting the information to a source language, processing it, and then converting it to a specific target language. There are various libraries offering different features, but no single library offers all of them; that means there is a lot of scope for contribution, and also that a lot of features need to be hard-coded. Thank you everyone for attending my talk.

Gajendra Deshpande

Assistant Professor @ KLS Gogte Institute of Technology

