Word2Vec Implementation using Python Gensim and Google Colab
This walkthrough shows how to implement Word2Vec with the Python library Gensim on Colab, the free cloud-based notebook environment provided by Google. The aim is only to learn what a basic Word2Vec model implementation needs and looks like.
As explained in the previous article ‘Introducing Word2Vec & Word Embedding- Detailed Explanation’, we now know the idea behind the word2vec model. Let us now understand how it can be implemented.
Gensim library: It is an open-source Python library used for Natural Language Processing (NLP) tasks such as building word vectors, indexing a document, and other unsupervised topic modeling activities.
Pre-requisites: Any web browser.
Step 1: Start with Google Colab
All you need to do is search for Google Colab in your web browser, sign in with your Google account, and create a new notebook. That notebook is your working space.
Step 2: Import article to prepare corpus
For a word2vec model to work, we need a data corpus that acts as the training data for the model. It is from this data that our model learns the contexts and semantics of each word. Google's pretrained word2vec model, for instance, has a vocabulary of 3 million words and phrases learned from a Google News dataset of about 100 billion words. For ease, however, we’ll use a simple and easily available Wikipedia article.
Before we can build a corpus from a web page, we first need to retrieve the text on that page. For this purpose, we use the Python library BeautifulSoup. To install the library, run the command !pip install beautifulsoup4 in a cell of your Colab notebook.
This library helps in scraping the webpage for the data that we need for our implementation.
The next library we need is lxml, which can be installed by running the command !pip install lxml in a notebook cell.
The chief function of the lxml library is to process XML and HTML in Python.
Now, we import all our necessary libraries such as urllib, beautifulsoup, nltk using the following code:
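A minimal sketch of the import cell, assuming the packages above are installed. The alias ‘bs’ and the extra ‘punkt_tab’ download are assumptions: newer NLTK releases look for ‘punkt_tab’ instead of ‘punkt’, so requesting both is a cautious choice rather than something the article prescribes.

```python
import re                # regular expressions, used later for text cleaning
import urllib.request    # downloading the article

import bs4 as bs         # BeautifulSoup, for parsing the HTML
import nltk              # tokenization and stop words

# Download the NLTK resources used in the preprocessing step.
nltk.download("punkt")
nltk.download("punkt_tab")   # assumption: needed by newer NLTK releases
nltk.download("stopwords")
```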
The ‘punkt’ resource of NLTK is used for tokenization, and the ‘stopwords’ resource lists the stop words of a given language.
Once we’re through with library installation, we proceed towards building our corpus. Our corpus is built using the Wikipedia page for Machine Learning. The Python script for building the corpus is as follows:
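The corpus-building script could look roughly like the sketch below. The helper names ‘fetch_html’ and ‘extract_article_text’, the exact article URL, and the User-Agent string are all assumptions made for illustration. Paragraphs are joined with a space so that the last word of one paragraph does not fuse with the first word of the next.

```python
import urllib.request

import bs4 as bs

# Assumed URL of the Wikipedia page for Machine Learning.
ARTICLE_URL = "https://en.wikipedia.org/wiki/Machine_learning"


def fetch_html(url):
    """Download the raw HTML of a page with urlopen."""
    # Assumption: some sites reject requests that lack a User-Agent header.
    request = urllib.request.Request(
        url, headers={"User-Agent": "word2vec-tutorial/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def extract_article_text(html, parser="lxml"):
    """Collect the text of every <p> tag into a single string."""
    parsed_article = bs.BeautifulSoup(html, parser)
    paragraphs = parsed_article.find_all("p")
    return " ".join(p.text for p in paragraphs)
```

In the notebook, article_text = extract_article_text(fetch_html(ARTICLE_URL)) would then hold the scraped text.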
The above code uses the ‘urlopen’ function provided by the ‘request’ module of the ‘urllib’ library to download the Wikipedia article. Then we create a ‘BeautifulSoup’ object to parse the content of the article. Wikipedia conventionally stores the textual content of an article inside ‘p’ HTML tags, so we use the ‘find_all’ method of the BeautifulSoup object to retrieve the text of every paragraph tag in the document. In the final step, we accumulate all the scraped text into ‘article_text’ for use in later stages of the processing. The result of this step is that we have imported our desired article to build the corpus.
Step 3: Preprocessing
In this stage of processing, we perform dataset cleaning to make our corpus provide better results. This task can be performed using the following code:
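A minimal sketch of the cleaning step, using only the standard ‘re’ module. The function name ‘clean_text’ and the short stand-in string are assumptions; in the notebook the input would be the scraped ‘article_text’.

```python
import re


def clean_text(text):
    """Lower-case the text and keep only letters separated by single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)   # digits, punctuation, symbols -> space
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


# Stand-in for the scraped article text.
article_text = "Machine Learning (ML): 2 Models, 1 Dataset!"
processed_article = clean_text(article_text)
print(processed_article)
```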
When cleaning the text, we first convert all of the gathered text to lower case and then remove all digits, special characters, and extra whitespace. Now, our dataset consists only of words. This step keeps variants such as ‘Machine’ and ‘machine’ from being counted as different words when training the model.
The dataset we retrieved consists of paragraphs. However, what we need is words that we can represent in vector format, so we first break the paragraphs down into sentences using the ‘sent_tokenize’ utility and then break those sentences down into words using the ‘word_tokenize’ utility.
Finally, we obtain our corpus by keeping only those words from the dataset that are not stopwords in the English language. Stopwords are the most common words in a language, such as ‘the’, ‘an’, ‘in’, etc.
The result of this step is a clean, ready-to-use data corpus, which will be used to create and train the model.
Step 4: Creating the Word2Vec model
Gensim makes word vectorization with word2vec straightforward. This is done by using the ‘Word2Vec’ class provided by the ‘gensim.models’ package and passing the list of tokenized sentences to it, as shown in the code below:
The only parameter that we need to specify is ‘min_count’. When ‘min_count’ is 2, the model builds vectors only for words that occur in the dataset at least 2 times.
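To see what this threshold does, the following sketch applies the same rule by hand with a plain word counter. It only illustrates the filtering rule on a made-up two-sentence corpus and is not how Gensim implements it internally.

```python
from collections import Counter

# Made-up corpus for illustration.
sentences = [
    ["machine", "learning", "uses", "data"],
    ["machine", "learning", "builds", "models"],
]
min_count = 2

# Count every word across all sentences, then keep the frequent ones.
counts = Counter(word for sentence in sentences for word in sentence)
kept = sorted(word for word, n in counts.items() if n >= min_count)
print(kept)
```

Only ‘learning’ and ‘machine’ occur twice here, so only those two words would receive vectors.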
If you wish to view the dictionary comprising all the unique words appearing at least twice in the corpus, run the following code:
This generates a list of all words in the dataset that occur at least two times in the entire document.
The result of this step is the successful creation of our word2vec model.
Step 5: Analyzing the model
Now that our model is ready, we will explore the functionalities of our model.
Vector representation of a word:
The primary purpose of creating this model is to represent words in a multi-dimensional space. So, we first retrieve the vector for a word as follows:
When ‘v1’ is printed, we get the vector representation of the word ‘machine’, learned from the semantic and syntactic contexts in which it appears in the Wikipedia article on Machine Learning.
The default number of dimensions in a vector created by the Gensim Word2vec model is a hundred. Thus, the vector representation of the word ‘machine’ looks like this:
Finding similar words
Now that we have understood the word-to-vector conversion, let us understand how this model helps us find similar words, which is an essential feature of the word2vec model. To find words similar to the word ‘machine’ and print them, we use the following Python code:
The output obtained will help us comprehend which words are similar to the word ‘machine’ and how similar they are.
As we can observe, the output consists of a series of words accompanied by their similarity indices with respect to the word ‘machine’, sorted in descending order of similarity index. The similarity index is a cosine similarity, so its maximum value is 1. The highest similarity index for the word ‘machine’ is possessed by ‘learning’, which makes sense because ‘machine’ is so often accompanied by the word ‘learning’. Similarly, we see that the word ‘data’ also has a high similarity index. Thus, we have successfully created our word2vec model for word vectorization.
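The similarity index reported by ‘most_similar’ is the cosine similarity between two word vectors. The sketch below computes it by hand with NumPy on three made-up vectors; the function name ‘cosine_similarity’ and the vectors are assumptions for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
same_direction = np.array([2.0, 4.0, 6.0])
opposite_direction = np.array([-1.0, -2.0, -3.0])

print(cosine_similarity(a, same_direction))      # vectors pointing the same way
print(cosine_similarity(a, opposite_direction))  # vectors pointing opposite ways
```

Vectors pointing in the same direction score 1 and vectors pointing in opposite directions score -1, regardless of their lengths.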
This article focused on the implementation of the word2vec model using the Gensim Python library, with Google Colab as our working environment. We carried out this process by first retrieving data from the Wikipedia article for Machine Learning. This was followed by some preprocessing of the data, such as tokenization and stop word removal, to produce the data corpus. Then we built the word2vec model using the Gensim library and explored a couple of its functions.