Introducing Word2Vec & Word Embedding: A Detailed Explanation
This article aims to throw light on the basic concepts behind the Word2Vec model, covering topics that will give you a clear idea of the technique.
What is Word Embedding?
It is nearly impossible for a machine to understand words the way humans do; words must first be converted to some numerical representation, and this is where word embedding comes in. Broadly speaking, word embedding is a technique for representing a word in vector format. By mapping each word or phrase into a multidimensional space, it supports tasks such as predicting a word or its context from an input sequence.
The simple idea behind word embedding is that words with similar semantics are placed close together in the multidimensional representation. This permits machine learning algorithms to recognize words that are semantically homogeneous.
In a technical sense, word embedding provides a vector representation of words or phrases as real numbers that machines can work with, obtained using probabilistic models, neural networks, or dimensionality reduction on the word co-occurrence matrix. The resulting vector space contains clusters of words with similar meanings.
For example, the vectors representing cars will be placed in one tight neighborhood, while the vectors representing schools will form another densely clustered neighborhood. Several models exist for word embedding, such as word2vec (published by Google), GloVe (published by Stanford), and FastText (developed by Facebook).
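As a toy illustration of these neighborhoods (the words, dimensions, and values below are invented for the example; a real model learns hundreds of dimensions from data), nearby vectors can be found with a simple distance measure:

```python
import math

# Hypothetical 2-d embeddings; real models use 100-300 dimensions.
embeddings = {
    "car":    [2.0, 0.1],
    "truck":  [2.1, 0.3],
    "school": [0.2, 3.0],
    "class":  [0.1, 2.8],
}

def euclidean(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Words from the same cluster sit much closer together.
print(euclidean(embeddings["car"], embeddings["truck"]))   # small
print(euclidean(embeddings["car"], embeddings["school"]))  # large
```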
What is the Word2Vec model?
Before this model came into being, natural language processing treated each word as an individual unit and could not capture the similarity between words. This model, in contrast, builds on the fact that the meaning of each word depends on the context it is used in: words that appear close together in a corpus end up in adjacent clusters in the vector space.
For example, let us take the two sentences “The tree is protected by its bark” and “We heard a dog bark”. Both sentences use the word “bark” with a different meaning. From this, it can be understood that the word2vec model focuses on extracting the syntactic and semantic relationships linking words, in order to produce an accurate vector representation of each word. It is a two-layer neural network, which means that before giving an output, two layers process the input: one hidden layer and one output layer.
The output so obtained is organized so that words with similar meanings appear near each other while words with dissimilar meanings are distant from each other. This is known as a semantic relationship.
How does Word2Vec work?
The fundamental idea of word2vec exploits the sequential nature of text. Each sentence or phrase has a sequential structure that makes it syntactically correct and understandable to the human mind. Word2vec therefore learns the meaning of each word by predicting the text that comes before and after that word. Given the sentence “She likes roses”, suppose we need the vector representation for the word “likes”.
So, the system examines every occurrence of “likes” in the corpus to retrieve its context: the words that come before and after it, their probabilities, and the most common contexts in which it is used. After the machine has learned this over a large number of examples, we can give it an input like “He likes …” and it will predict what comes after “likes”, taking into account the contexts found in the training corpus.
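As a loose sketch of this idea (Word2Vec itself is a neural model, not a count table, so this is only an analogy), a next-word predictor can be built from counts of which word follows which in a toy corpus:

```python
from collections import Counter, defaultdict

# Toy corpus; real Word2Vec training uses millions of sentences.
corpus = [
    "she likes roses",
    "he likes tulips",
    "she likes roses very much",
]

# Count which word follows each word across the corpus.
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("likes"))  # "roses" appears twice after "likes"
```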
The output is usually compared mathematically using cosine similarity, where vectors with complete similarity form a 0-degree angle, and vectors sharing no similarity are orthogonal, forming a 90-degree angle.
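Cosine similarity itself is straightforward to compute; here is a minimal implementation, with hypothetical 3-dimensional word vectors for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction (0 degrees), 0.0 means orthogonal (90 degrees)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional word vectors for illustration.
v_car = [0.9, 0.1, 0.0]
v_truck = [0.8, 0.2, 0.1]
v_school = [0.0, 0.2, 0.9]

print(round(cosine_similarity(v_car, v_truck), 3))   # close to 1
print(round(cosine_similarity(v_car, v_school), 3))  # close to 0
```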
What is the architecture of the Word2Vec model?
Word2vec architecture comprises two techniques: the Skip-gram model and the Continuous Bag of Words (CBOW) model. Learning the context and semantics of words for vector representation is chiefly an unsupervised task, but labels are needed to train the model efficiently.
Thus, the skip-gram and/or CBOW formulations are used to convert this unsupervised training problem into a supervised one. During the initial phase of training, each word is assigned a vector based on some parameters.
With each prediction made after this, the model adjusts the components of these vectors, aiming to place words with homogeneous meaning and contextual use close together.
Continuous Bag of Words (CBOW) model
This model aims to predict the current word for a given set of context words enclosed within a specific window. The input to this type of model is the set of words that make up the surrounding context, the hidden layer has as many units as the number of dimensions we want the output vector to have, and finally the output layer gives out the desired vector.
According to a study, the best output was obtained with a window of 4, i.e. the four words before and the four words after each word are considered when extracting its context.
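The windowing step can be sketched in a few lines. The function below (an illustrative name, not part of any library) builds CBOW-style training pairs, with the context words as input and the centre word as the label, using a window of 2 for brevity:

```python
def cbow_pairs(tokens, window=2):
    """For each position, pair the surrounding context words (input)
    with the centre word (label), as CBOW training does."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

sentence = "she likes red roses".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# first pair: ['likes', 'red'] -> 'she'
```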
Skip-gram model
This can be understood as the reverse of CBOW, in the sense that this model predicts the context within a specific window given any input word. The input layer for this model is a single word, unlike CBOW, which takes the context as input.
The next layer, the hidden layer, has as many units as the number of dimensions in which we wish to express the input word, and the final layer, the output layer, consists of the context words.
In essence, given any input word, this model predicts the words likely to appear around it, based on the training data used.
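The same windowing, reversed, yields skip-gram training pairs: each centre word (input) is paired with every word in its window (label). A minimal sketch with illustrative names:

```python
def skipgram_pairs(tokens, window=2):
    """For each position, pair the centre word (input) with each
    context word (label), as skip-gram training does."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for ctx in context:
            pairs.append((center, ctx))
    return pairs

sentence = "she likes red roses".split()
print(skipgram_pairs(sentence)[:3])
# [('she', 'likes'), ('she', 'red'), ('likes', 'she')]
```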
In comparison, CBOW trains faster than skip-gram and provides higher accuracy for frequent words, but skip-gram is more efficient with respect to training data: it requires less of it and can represent even infrequent or rare words.
This model also proves beneficial when an application demands mathematical operations on word vectors. For example, vectorOf(King) – vectorOf(Man) + vectorOf(Woman) should give a resultant vector in the vicinity of vectorOf(Queen).
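This analogy arithmetic can be sketched with hand-crafted toy vectors; the two dimensions here (roughly "royalty" and "gender") and their values are purely illustrative, whereas real embeddings learn many uninterpretable dimensions:

```python
# Toy 2-d vectors: [royalty, gender]; values invented for illustration.
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}

def add_sub(a, b, c):
    """Component-wise a - b + c."""
    return [x - y + z for x, y, z in zip(a, b, c)]

def nearest(query, vocab):
    """Word whose vector has the smallest Euclidean distance to query."""
    def dist(v):
        return sum((x - y) ** 2 for x, y in zip(query, v)) ** 0.5
    return min(vocab, key=lambda w: dist(vocab[w]))

result = add_sub(vectors["king"], vectors["man"], vectors["woman"])
print(nearest(result, vectors))  # 'queen'
```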
But, the Word2Vec model has its limitations. So, what are they?
- Output evaluation: The output format may not be easily understandable for every type of user, so evaluating whether the system does what it was programmed to do becomes a tedious task.
- Lack of proper training data: An accurate and efficient system needs a suitable training corpus, but for a huge number of applications such data is not easily available. For example, to detect the emotion of a given input sequence, it is very difficult to obtain a sufficiently vast corpus, since emotional expression varies from person to person, resulting in inefficient emotion-detection systems.
- Computational complexity: Even though a huge data corpus is desirable and favored, it increases the computational complexity of training and running the model, as the degree of infrequency and variety in the vocabulary increases too.
- Hard time dealing with ambiguities: This model cannot deal with ambiguous data efficiently; a word with multiple meanings receives a single vector that averages its senses and may not closely represent any one of them.
Finally, where can you use the Word2Vec model?
This is a very generic model that can be adapted to almost any application with just a few changes. Thus, its uses vary widely, ranging from recommender systems and automatic summarization to information retrieval from documents and question answering.
About the Author:
If machines are going to take over the world, why not make amazing ones? With this thought, I am pursuing a B.Tech in Computers. My areas of greatest interest are Machine Learning and Android Development. Using the knowledge I have gained throughout my studies, I plan not just to advance the AI and Machine Learning community but also to help those who wish to be a part of it.