NLP: Stanford’s GloVe for Word Embedding
If you are looking to embed words into vector representations using global corpus statistics, this is the right place for you. In this article, I hope to walk you through how GloVe works, what its benefits are, and some of its applications.
Global Vectors (GloVe) is an unsupervised learning algorithm for representing words in a more machine-understandable format, i.e., as vectors. It is a word embedding technique: each input word is mapped to a vector whose dimensions jointly define the word, a format a machine can interpret without any extra effort.
Its primary aim is to derive word vectors from global statistics rather than from one particular local context, so that the results generalize well. GloVe is trained on aggregated global word-word co-occurrence statistics, and the relations between words show up as a variety of linear substructures in the resulting vector space.
GloVe is a log-bilinear model trained with a weighted least-squares objective. The fundamental idea behind it is that ratios of word-word co-occurrence probabilities can encode at least some aspect of the meaning of the words being vectorized.
How does GloVe work?
Like most unsupervised learning models, this method uses statistics of word occurrences in a corpus as its chief source of information. A co-occurrence matrix records how often a particular pair of words occurs together; each entry is the number of times the corresponding words co-occurred, and from these counts the relationship between the words can be inferred.
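As a rough sketch of how such a matrix is built, the counting step can look like this. The toy corpus and the window size are illustrative assumptions, not part of GloVe itself:

```python
# Sketch: building a word-word co-occurrence matrix from a toy corpus.
# The corpus and the symmetric window size are made-up choices for illustration.
from collections import defaultdict

corpus = ["ice is solid", "steam is gas", "ice melts into water"]
window = 2  # symmetric context window around each word

# Build the vocabulary and a word -> index mapping
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

# X[(i, j)]: how often word j appears in the context window of word i
X = defaultdict(float)
for sent in corpus:
    words = sent.split()
    for i, wi in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                X[(index[wi], index[words[j]])] += 1.0
```

Note that the matrix comes out symmetric here because the window is symmetric; the real GloVe implementation additionally weights context words by their distance from the center word.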
GloVe is essentially a count-based model. Count-based models generally learn vectors by performing dimensionality reduction on the co-occurrence matrix.
The method begins by constructing a large matrix of co-occurrence information, which records how often each word (rows) is used in each context (columns). This matrix is then factorized to yield a lower-dimensional matrix of words and features, in which each row is the vector for the corresponding word.
The result is achieved by minimizing a ‘reconstruction loss’, which seeks the lower-dimensional representation that best explains the variance in the high-dimensional data.
In GloVe, the count matrix is preprocessed by log-smoothing and normalizing the counts.
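The weighted least-squares objective mentioned earlier can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the reference implementation: the toy count matrix, the embedding size, the learning rate, and the `x_max`/`alpha` weighting defaults are all illustrative choices.

```python
# Minimal sketch of GloVe's weighted least-squares objective, fit with plain
# gradient descent on a small dense count matrix. All sizes and hyperparameters
# are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                                      # vocabulary size, embedding dim
X = rng.integers(1, 10, (V, V)).astype(float)    # toy positive co-occurrence counts

def f(x, x_max=100.0, alpha=0.75):
    # GloVe's weighting function: damps the influence of very frequent pairs
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

W = rng.normal(scale=0.1, size=(V, d))   # word vectors
C = rng.normal(scale=0.1, size=(V, d))   # context vectors
b = np.zeros(V)                          # word biases
c = np.zeros(V)                          # context biases
lr = 0.05

def loss_and_err():
    # error term per pair: w_i . c_j + b_i + c_j_bias - log X_ij
    err = W @ C.T + b[:, None] + c[None, :] - np.log(X)
    return float(np.sum(f(X) * err ** 2)), err

loss_start, _ = loss_and_err()
for _ in range(300):
    _, err = loss_and_err()
    g = f(X) * err                       # weighted residual
    gW, gC = g @ C, g.T @ W              # gradients before any update
    W -= lr * gW
    C -= lr * gC
    b -= lr * g.sum(axis=1)
    c -= lr * g.sum(axis=0)
loss_end, _ = loss_and_err()
```

After training, GloVe typically uses the sum `W + C` as the final word vectors, since both matrices capture similar information.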
To better understand how it works, let’s define a few terms first.
- X_ij: the number of times word j occurs in the context of word i
- X_i = ∑_k X_ik: the total count of all words appearing in the context of word i
- P_ij = P(j|i) = X_ij / X_i: the probability that word j appears in the context of word i
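Using these definitions, the probabilities and their ratios can be computed directly from a count matrix. The counts below are made up purely for illustration:

```python
# Illustrative computation of the co-occurrence probabilities defined above.
# The count matrix values are assumptions chosen to mimic the ice/steam example.
import numpy as np

words = ["ice", "steam", "solid", "gas"]
idx = {w: i for i, w in enumerate(words)}

# X[i, j]: toy counts of word j occurring in the context of word i
X = np.array([
    [0, 2, 8, 1],   # ice
    [2, 0, 1, 9],   # steam
    [8, 1, 0, 1],   # solid
    [1, 9, 1, 0],   # gas
], dtype=float)

Xi = X.sum(axis=1, keepdims=True)   # X_i = sum_k X_ik
P = X / Xi                          # P_ij = X_ij / X_i

# Ratio P(solid | ice) / P(solid | steam): large, since "solid" relates to ice
ratio = P[idx["ice"], idx["solid"]] / P[idx["steam"], idx["solid"]]
```

Each row of `P` sums to 1, since it is a conditional distribution over context words.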
GloVe encodes this probability-ratio information into the word vectors. For two words i and j and a probe word k, the ratio of co-occurrence probabilities can be represented as P_ik / P_jk.
As can be understood from the above formula, word correlation in GloVe is extracted from the ratio of the probabilities of words occurring together in the global statistics.
Let us consider an example with two words, i = ice and j = steam. The correlation between these words can be understood by studying the ratio of their co-occurrence probabilities with a variety of probe words k.
As per the results reported in the original GloVe paper, where the probe words solid, gas, water, and fashion were studied, the following probabilities and ratios were obtained:

Probability and ratio       k = solid    k = gas     k = water    k = fashion
P(k|ice)                    1.9e-4       6.6e-5      3.0e-3       1.7e-5
P(k|steam)                  2.2e-5       7.8e-4      2.2e-3       1.8e-5
P(k|ice) / P(k|steam)       8.9          8.5e-2      1.36         0.96
This shows that the probability of solid occurring with ice is higher than that of gas occurring with ice. Similarly, the probability of gas occurring with steam is higher than that of solid occurring with steam. On the other hand, for the probe words water and fashion, the ratio of their probabilities with ice and steam is close to 1 (as seen in the lowest row), which means these words are either related to both ice and steam or to neither.
How is GloVe better than other models?
As an alternative to GloVe, the word2vec model can also be used for word embedding. However, weighing their pros and cons will give you a better idea of the benefits of GloVe. One practical difference is that GloVe allows for parallel implementation, making it faster to train on large corpora.
GloVe also extracts co-occurrence probabilities from global statistics, whereas word2vec is trained only on the local context windows of the dataset it is given.
GloVe is credited with combining the merits of word2vec’s skip-gram model on word analogy tasks with those of matrix factorization methods, which exploit global statistical information.
The skip-gram model may be better at word analogy tasks, but it makes poor use of corpus statistics because it is not trained on global co-occurrence counts. The prime strength of GloVe is capturing word similarity, while that of word2vec is analogical reasoning.
However, GloVe has a few limitations:
- It uses a large amount of memory, since training operates on the global co-occurrence statistics.
- It can be somewhat sensitive to the initial learning rate.
- Co-occurrence matrices train quickly, but they focus on word similarity and give a little too much weight to common words.
- Extracting the context of words requires extra effort.
Because GloVe can efficiently extract relations, it can be used to capture linear relationships such as those between zip codes and cities, companies and products, gender pairs (king and queen), or even synonyms. GloVe has been used as the word embedding framework in offline and online systems designed to detect psychological discomfort in patients. It has also been used by spaCy to build semantic word vectors, ranking candidate words by distance measures such as Euclidean distance and cosine similarity.
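As a sketch of the similarity lookups mentioned above, cosine similarity over word vectors can be implemented as follows. The three-dimensional vectors here are made-up stand-ins; in practice you would load pretrained GloVe vectors (e.g. from one of the published `glove.*.txt` files) instead:

```python
# Sketch: ranking words by cosine similarity over word vectors.
# The tiny 3-d vectors below are invented for illustration; real GloVe
# vectors typically have 50-300 dimensions and come from a pretrained file.
import numpy as np

vectors = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "ice":   np.array([0.10, 0.20, 0.90]),
}

def cosine(a, b):
    # cosine similarity: dot product of the normalized vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, vocab=vectors):
    """Rank all other words in the vocabulary by similarity to `word`."""
    return sorted(
        (w for w in vocab if w != word),
        key=lambda w: cosine(vocab[word], vocab[w]),
        reverse=True,
    )
```

With real pretrained vectors, the same two functions are enough to reproduce nearest-neighbor lookups like those spaCy performs.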
Like everything else, GloVe has its own merits and demerits, and it is up to developers to decide how to use them. Implementing GloVe in Python is an easy way to get started and to build creative applications that need word embeddings.