Stanford’s GloVe Implementation using Python
Let us understand how to use a pre-built GloVe model in Python to perform word embedding in Google Colab.
As we already know from previous articles, word embedding represents each word as a corresponding vector so that a machine can process it easily. There are two methods to implement GloVe for word embedding:
- Using pre-built models.
- Building the model from scratch.
In this guide, we use the pre-built model, as our goal is to understand what a GloVe implementation looks like. Before we get started, go through this Guide on How to use Google Colab so that following the implementation is easy.
Step 1: Install Libraries
The first step in any Python program is to install the libraries that are not already present and import everything the application needs. The GloVe implementation needs the following libraries:
- The GloVe library: provides the pre-built GloVe model, which performs word embedding by factorizing the logarithm of the co-occurrence matrix built from the words in the corpus.
- NLTK: the Natural Language Toolkit, which helps perform tasks and analyses that involve linguistic and language aspects.
- The tokenizer: helps us perform word-level and sentence-level tokenization.
- The stopwords list: contains the set of stopwords for the desired language, which we remove when pre-processing the data.
- The lemmatizer data: provides the data and methods we use to lemmatize our data.
Step 2: Define the Input Sentence
Now that all our libraries have been installed and imported, we begin by taking an input and cleaning it, so that the embedding is efficient and no processing is wasted on noisy data. The input that we have considered is a list of strings, each of which forms a meaningful sentence. An input can be taken in many ways, for example through text or speech.
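As an illustration, the input could be defined as a plain Python list of strings. The sentences below are made-up placeholders, not the data used in this guide.

```python
# Hypothetical example input: each string is one sentence to embed.
lines = [
    "Samsung and Apple are popular smartphone makers.",
    "The players are practising for the final match.",
    "Word embeddings represent words as vectors of numbers.",
]

print(len(lines), "sentences loaded")
```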
Step 3: Tokenize
From this step on, we apply the data pre-processing methods, beginning with tokenization. Tokenization is the process of breaking text down into smaller units. Since our input is already split into individual sentences and we need to represent each word uniquely, we perform word tokenization. Here, each word in a sentence is split off and treated as a single unit.
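The idea of word tokenization can be sketched with the standard library alone. Note that this is a simple regular-expression stand-in, not the tokenizer installed in Step 1, and the function name `word_tokenize_simple` is hypothetical.

```python
import re

def word_tokenize_simple(sentence):
    """Lowercase a sentence and split it into word tokens."""
    # \w+ matches runs of letters, digits and underscores; punctuation is dropped.
    return re.findall(r"\w+", sentence.lower())

lines = ["The players are practising for the final match."]
tokenized = [word_tokenize_simple(line) for line in lines]
print(tokenized)
```

A dedicated tokenizer handles contractions, abbreviations and punctuation more carefully than this pattern does.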
Step 4: Stop Word Removal
The entire input that we started with has now been converted into a sequence of words that are no longer tied to each other. The next step is to remove stopwords. Stopwords are words that are used to frame a sentence but carry little meaning on their own. For example, the English language has stopwords like ‘a’, ‘this’, ‘of’, etc. So, here we scan each word in the input and check whether it belongs to the set of English stopwords. This set is already provided by the ‘stopword’ library, so the user need not define it explicitly.
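A minimal sketch of the filtering step is shown below. It assumes a small hand-written stopword set purely for illustration; the library's list is much larger, and the names `STOPWORDS` and `remove_stopwords` are hypothetical.

```python
# Small hand-written stopword set, for illustration only (an assumption).
STOPWORDS = {"a", "an", "the", "this", "of", "for", "are", "is", "and", "in", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ["the", "players", "are", "practising", "for", "the", "final", "match"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set rather than a list keeps each membership check constant-time, which matters once the corpus grows.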
Step 5: Lemmatize
What we now have is a set of only the important words that define the meaning of the sentence, and these are the only words that we need to embed. But before that, we convert all the words into a standard dictionary form by removing suffixes, prefixes, tense affixes, etc. For this, we perform lemmatization, where each word is transformed from the form in which it occurs in the sentence to the form in which it appears in a dictionary. For example, the lemmatized output for the word ‘players’ is ‘player’.
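To make the idea concrete, here is a toy lemmatizer that strips a suffix and accepts the result only if it is a known dictionary form. This is a sketch under strong assumptions: the vocabulary and suffix rules are made up for the example, and a real lemmatizer relies on a full lexical database instead.

```python
# Toy vocabulary of valid dictionary forms (hypothetical; a real lemmatizer
# consults a full lexical database).
VOCABULARY = {"player", "match", "city", "final", "practise"}

# (suffix, replacement) rules, tried in order.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", "e"), ("ing", "")]

def lemmatize_simple(word, vocabulary=VOCABULARY):
    """Return the dictionary form of a word if a rule produces a known lemma."""
    if word in vocabulary:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return word  # unknown words are left unchanged

words = ["players", "matches", "cities", "practising", "final"]
lemmas = [lemmatize_simple(w) for w in words]
print(lemmas)
```

Checking each candidate against a vocabulary is what separates lemmatization from plain stemming, which would strip suffixes without verifying that the result is a real word.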
On successful completion of this step, our data is clean: it is free of noise, contains only the important words, and is in a standard format. We can now proceed towards building our model.
Step 6: Build the Model
In order to build the model, we begin by importing the ‘Corpus’ and ‘Glove’ classes from the GloVe library installed in Step 1. These classes help us define a corpus and configure the pre-defined model according to our requirements.
The corpus.fit method builds the word dictionary and the co-occurrence matrix from the pre-processed input. It takes two arguments:
- input_array: the 2D array (a list of token lists) that we pre-processed and need word embeddings for.
- window: the maximum distance between two words within which the algorithm counts them as co-occurring.
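The role of the window can be illustrated with a small pure-Python sketch that counts co-occurrences by hand. This is not the library's code: the function name `build_cooccurrence` is hypothetical, and weighting each pair by 1/distance is an assumption borrowed from the Stanford reference implementation.

```python
from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    """Return (dictionary, counts) for a list of tokenized sentences.

    dictionary maps each word to an integer id; counts maps an (i, j) id pair
    with i < j to its distance-weighted co-occurrence count.
    """
    dictionary = {}
    counts = defaultdict(float)
    for tokens in sentences:
        # Assign the next free id to every word seen for the first time.
        ids = [dictionary.setdefault(tok, len(dictionary)) for tok in tokens]
        for pos, left in enumerate(ids):
            for offset in range(1, window + 1):
                if pos + offset >= len(ids):
                    break
                right = ids[pos + offset]
                if left == right:
                    continue
                # Store each unordered pair once; closer words count more.
                pair = (min(left, right), max(left, right))
                counts[pair] += 1.0 / offset
    return dictionary, dict(counts)

sentences = [["samsung", "phone", "camera"], ["apple", "phone", "camera"]]
dictionary, counts = build_cooccurrence(sentences, window=2)
print(dictionary)
print(counts)
```

A larger window captures looser, more topical relationships between words, while a smaller one focuses on immediate neighbours.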
Next comes the ‘Glove’ constructor, which defines the format and dimensions of the output. It takes the following parameters:
- no_components: the number of dimensions that each word vector will have.
- learning_rate: the step size with which the algorithm moves towards the minimum of its loss function, that is, towards the best possible vector representation. It is needed because the algorithm is trained with gradient descent.
Moving further, the next method that is used is ‘glove.fit’. Using this, we specify how the model training will take place and on what input. The following parameters are used:
- Co-occurrence matrix: the word matrix that holds the co-occurrence counts gathered from the global statistics of the corpus.
- epochs: the number of times the algorithm scans through the dataset.
- no_threads: the total number of threads used by the algorithm for execution.
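To show what training does with these parameters, the sketch below fits GloVe-style vectors to a tiny co-occurrence table in pure Python. It is an illustration under stated assumptions, not the library's implementation: it uses plain single-threaded SGD, and the function name `train_glove`, the `x_max` and `alpha` weighting parameters, and the toy counts are all hypothetical choices for this example.

```python
import math
import random

def train_glove(counts, vocab_size, no_components=5, learning_rate=0.05,
                epochs=50, x_max=100.0, alpha=0.75, seed=0):
    """Fit GloVe-style vectors to co-occurrence counts with plain SGD.

    counts maps (i, j) word-id pairs to positive co-occurrence values.
    Returns (vectors, loss_history); vectors[i] is word vector + context vector.
    """
    rng = random.Random(seed)

    def init():
        # Small random starting values for every vector component.
        return [[(rng.random() - 0.5) / no_components
                 for _ in range(no_components)] for _ in range(vocab_size)]

    word, context = init(), init()
    word_bias = [0.0] * vocab_size
    context_bias = [0.0] * vocab_size
    # Use each pair in both directions so the fit is symmetric.
    pairs = [(i, j, x) for (i, j), x in counts.items()]
    pairs += [(j, i, x) for (i, j), x in counts.items()]
    loss_history = []
    for _ in range(epochs):
        rng.shuffle(pairs)
        total = 0.0
        for i, j, x in pairs:
            # Weighting function: rare pairs contribute less, capped at 1.
            weight = (x / x_max) ** alpha if x < x_max else 1.0
            dot = sum(a * b for a, b in zip(word[i], context[j]))
            error = dot + word_bias[i] + context_bias[j] - math.log(x)
            total += weight * error * error
            grad = weight * error  # the constant 2 is folded into the rate
            for k in range(no_components):
                w, c = word[i][k], context[j][k]
                word[i][k] -= learning_rate * grad * c
                context[j][k] -= learning_rate * grad * w
            word_bias[i] -= learning_rate * grad
            context_bias[j] -= learning_rate * grad
        loss_history.append(total)
    vectors = [[w + c for w, c in zip(word[i], context[i])]
               for i in range(vocab_size)]
    return vectors, loss_history

# Toy co-occurrence counts for a four-word vocabulary (made-up values).
counts = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 2.0, (1, 3): 1.0, (2, 3): 0.5}
vectors, losses = train_glove(counts, vocab_size=4, epochs=100, x_max=2.0)
print(len(vectors), "vectors of dimension", len(vectors[0]))
print("loss went down:", losses[-1] < losses[0])
```

Each epoch is one full pass over the co-occurrence pairs, so more epochs give the vectors more chances to fit the logarithm of the counts; the learning rate controls how far each update moves.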
At this point in the implementation, the glove object holds the word vectors corresponding to the lines that we used as input, but the dictionary is still held by the corpus object. So, we add the dictionary to the glove object using the ‘add_dictionary’ function in order to make the representation complete.
On successfully reaching this point, your model is ready to provide word embeddings for the words in its dictionary. Thus, we proceed towards evaluation.
Step 7: Evaluate the Model
The model is ready to be used for the intended application, as it can now produce word embeddings for any word that appears in the training corpus. For example, to find the vector representation of the word ‘samsung’, we look up the word's index in the model's dictionary and read the corresponding row of the word-vector array.
The output is an array of numbers. Together, these numbers form the vector for the input word ‘samsung’.
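The lookup itself can be sketched as follows. The names `dictionary` and `word_vectors` are illustrative stand-ins for what a trained model holds, and the numbers are made up rather than real training output.

```python
import math

# Toy stand-ins for a trained model (values are made up):
# a word -> row index dictionary and one vector per row.
dictionary = {"samsung": 0, "apple": 1, "player": 2}
word_vectors = [
    [0.12, -0.40, 0.33],
    [0.10, -0.35, 0.30],
    [-0.50, 0.20, 0.05],
]

def get_vector(word):
    """Return the embedding for a word, or None if it is out of vocabulary."""
    index = dictionary.get(word)
    return None if index is None else word_vectors[index]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(get_vector("samsung"))
print(cosine_similarity(get_vector("samsung"), get_vector("apple")))
```

A common sanity check on trained embeddings is that related words end up with a higher cosine similarity than unrelated ones; a word that never appeared in the corpus has no vector at all.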
Using the method explained above, we can incorporate GloVe word embeddings into an application by modifying a few parameters to suit it. The resulting embeddings can serve as input features for machine learning algorithms such as KNN, K-means, and SVM, and for tasks such as document classification and sentiment analysis.
Read more about the detailed explanation of GloVe for Word Embedding