A Quick Guide to Transformer Models
Natural Language Processing is at an all-time peak today, and a lot happens behind the scenes for you to take advantage of the progress made to date. The architecture behind much of this progress is the Transformer model and its variants.

NLP is what helps a computer understand human language. It requires transforming the input language into a numerical representation, such as vectors, and then analyzing those values to extract the semantics and context needed to produce the desired output. Since its inception, NLP has seen many advancements, driven largely by deep learning. One of these revolutionary creations is the Transformer model.
But what makes it such a groundbreaking development?
Among the first models used for NLP was the sequence-to-sequence model, which converts one representation into another, for example, English into French. These models were based on the Recurrent Neural Network (RNN). A key requirement for this kind of translation is the ability to capture the dependency relationships between the words of a phrase.
For example, consider an input phrase along these lines: "The horse could not climb the hill because it was too steep, and it eventually gave up, exhausted."
If a human were to analyze this sentence, they would easily understand that the first 'it' refers to the hill and the second 'it' refers to the horse. This relationship is called a dependency relation. But when it comes to computer analysis, resolving such references is far from easy.
Thus, RNNs were used to identify these relationships, and their performance improved further when the attention mechanism was introduced in 2015.
However, a few challenges demanded the development of a better model. The sequence-to-sequence model, even with the attention mechanism, could not process long-range dependencies accurately, and it did not let developers parallelize the computation. So, Google Brain's Transformer model came into existence to overcome these issues.
Transformer Model
This is a relatively new NLP model that processes sequential data while solving the issues of the sequence-to-sequence model. Its chief applications are machine translation and text processing. In published evaluations, the transformer has outperformed earlier models on nearly all standard benchmarks.
Unlike the models discussed so far, the transformer does not employ any recurrent connections, and hence keeps no memory of the states before the current one. The apparent downside of this memoryless design is overcome by processing the entire sequence simultaneously. Sequence transduction tasks such as text-to-speech and speech recognition had previously relied on recurrent and convolutional neural networks (CNNs); the transformer was designed to handle such tasks without either.
Because a transformer works on a feed-forward mechanism, it benefits from hardware designed to suit modern ML algorithms. A transformer may need a lot of memory during training, but training at lowered precision can reduce that memory demand.
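The storage half of that claim is easy to verify; real mixed-precision training involves extra machinery such as loss scaling, but the byte saving itself is this direct:

```python
# A quick illustration of the memory argument: halving the precision of the
# parameters halves the bytes needed to store them.
import numpy as np

params32 = np.zeros(1_000_000, dtype=np.float32)
params16 = params32.astype(np.float16)
print(params32.nbytes, params16.nbytes)  # 4000000 vs 2000000 bytes
```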
The transformer model combines the attention mechanism with a network of feed-forward connections, and the entire processing is divided into two sections: the encoder and the decoder. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, and the decoder then produces the output sequence one symbol at a time.
Attention Mechanism
The various building blocks of the model are easily comprehensible, but the attention mechanism is a small yet crucial component behind the transformer's performance.
When an input phrase is taken from the user in a given language, not every word in it holds equal weight in conveying the meaning and context of the sentence. For the computer to account for this, the attention mechanism is used.
When a human translates from one language to another, they pay special attention to the word they are translating at that moment. Neural networks mimic this focus with the attention mechanism, which concentrates on only part of the information given to it.
Besides using attention to compute representations from the encoder's hidden state, the transformer uses attention to compute the encoder's hidden state itself, a technique known as self-attention. The merit of this is that it removes the sequential structure of the RNN, which was the chief obstacle to parallelization, and thereby greatly speeds up translation from one language to another.
Thus, attention selectively weighs the elements of the input phrase so that each has a calculated effect on the hidden state of the downstream layers, and it allows the model to look at other words when extracting the meaning of the current word in the sentence.
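To make this concrete, here is a minimal NumPy sketch of the scaled dot-product attention the original paper builds on, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V; the toy shapes and random inputs are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    # so the softmax does not saturate for large d_k.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # one weight per (query, key) pair
    return weights @ V                   # weighted sum of the values

# Toy usage: a 4-token sequence with d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Each row of the weight matrix says how strongly one position attends to every other position, which is exactly the "look at other words" behavior described above.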
Model Architecture
The transformer model consists of two stages: encoding and decoding. Each encoder layer is made up of a multi-head attention layer and a feed-forward network, whereas each decoder layer has a masked multi-head attention layer in addition to those two. The encoder and the decoder are each a stack of identical layers working together, with the same number of layers on both sides (six each in the original paper). Self-attention is computed multiple times in parallel and independently, hence the name multi-head attention.
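A simplified NumPy sketch of the multi-head idea follows; the random matrices stand in for learned projection weights, and the head count h = 8 and dimension d_model = 512 follow the original paper's defaults.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, as in the earlier sketch.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(x, h=8):
    d_model = x.shape[-1]
    d_k = d_model // h                     # each head works in a smaller subspace
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        # Random stand-ins for the learned per-head projections Wq, Wk, Wv.
        Wq, Wk, Wv = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                      for _ in range(3))
        heads.append(attention(x @ Wq, x @ Wk, x @ Wv))
    # Concatenate the h head outputs and mix them with a final projection Wo.
    Wo = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

x = np.random.default_rng(1).normal(size=(5, 512))  # 5 tokens, d_model = 512
print(multi_head_attention(x).shape)                # (5, 512)
```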
Encoder
The input from the user enters the system through this block. Each encoder layer has two sub-layers: multi-head self-attention and a position-wise feed-forward network. A residual connection wraps each of these sub-layers, which requires every layer to produce output of a predefined dimension (512 in the original paper) so that the data passing through the model stays uniform. The encoder analyzes each word and produces a new representation of it after the word has passed through all of the encoder's components.
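The residual wiring can be sketched as follows; the identity and ReLU functions below are runnable placeholders for the real learned sub-layers, and the layer normalization matches the "Add & Norm" step of the original architecture.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector to zero mean and unit variance.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_layer(x, self_attention, feed_forward):
    # Sub-layer 1: multi-head self-attention, wrapped in a residual connection.
    x = layer_norm(x + self_attention(x))
    # Sub-layer 2: position-wise feed-forward network, wrapped the same way.
    # The residual sums only work because every sub-layer keeps d_model fixed.
    return layer_norm(x + feed_forward(x))

x = np.random.default_rng(0).normal(size=(5, 512))   # 5 tokens, d_model = 512
out = encoder_layer(x, self_attention=lambda t: t,   # placeholder sub-layers
                    feed_forward=lambda t: np.maximum(t, 0.0))
print(out.shape)  # (5, 512)
```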
The input phrase is first cleaned to remove outliers and stop words, and tokenization is performed to extract each word individually. Then, using a word-embedding method such as GloVe or Word2vec, vector representations of those words are created. Once the vectorized input is produced, the model gets to the actual transformation work, deciding the weight of each word under the attention mechanism. The output is passed to the decoder to be translated into the target language.
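A toy version of that pipeline might look like this, with naive whitespace splitting standing in for a real tokenizer and a random matrix standing in for pretrained GloVe or Word2vec embeddings.

```python
import numpy as np

sentence = "the horse climbed the hill"
tokens = sentence.split()                       # naive whitespace tokenization
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
ids = np.array([vocab[t] for t in tokens])      # token -> vocabulary id

d_model = 512
embedding_matrix = np.random.default_rng(0).normal(size=(len(vocab), d_model))
vectors = embedding_matrix[ids]                 # one d_model vector per token
print(vectors.shape)                            # (5, 512)
```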
Decoder
The decoder is made up of the same number of layers as the encoder, performing similar functions, plus a multi-head attention layer that attends over the encoder's output. Its self-attention uses a masking function which, combined with offsetting the output embeddings by one position, prevents any position from attending to the positions after it.
This ensures that the prediction for the word at position 'i' is computed only from the words that appeared before it. It also makes it easier to visualize how information flows, by showing how much attention the transformer pays to other parts of the input sentence. As in almost all other sequence transduction models, learned embeddings are used to convert the input and output tokens to their corresponding vector representations.
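A common way to realize this masking, sketched below, is to set the attention scores for all future positions to negative infinity before the softmax, so the corresponding attention weights come out exactly zero.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
# True strictly above the diagonal, i.e. wherever key position j > query position i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf
weights = softmax(scores)
print(np.round(weights, 2))  # each row i has zero weight after column i
```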
Once the decoder's main function is done, i.e. translating the individual tokens from the input language into the desired target language, a linear transformation followed by a softmax function makes the output human-comprehensible by predicting a probability of occurrence for each token the model could possibly emit.
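A minimal sketch of that output head, with random values standing in for a trained decoder state and projection matrix:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, vocab_size = 512, 10_000
decoder_output = np.random.default_rng(0).normal(size=(7, d_model))  # 7 positions
W = np.random.default_rng(1).normal(scale=d_model ** -0.5, size=(d_model, vocab_size))

probs = softmax(decoder_output @ W)     # (7, vocab_size), each row sums to 1
next_token = probs[-1].argmax()         # greedy pick for the latest position
print(probs.shape, next_token)
```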
Positional Embedding
Unlike earlier models, the transformer makes no use of recurrent or convolutional networks; the input sequence enters as vectors at the lowermost layer of the encoder and decoder stacks.
These vectors alone carry no notion of word order, so a positional embedding is used to mark where each word sits in the sequence relative to the others. What happens here is that positional information is folded into the vectors themselves, so the relative distances between words are preserved after projection into the multi-dimensional space.
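The sinusoidal encoding from the original paper can be written compactly: position pos gets sin(pos/10000^(2i/d_model)) in each even dimension 2i and the matching cosine in the odd dimension 2i+1, and the result is simply added to the token embeddings.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512), added element-wise to the embeddings
```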
These components are combined to transform an input language into the desired target language. To lower the prediction error, an optimizer is used when training the model, usually the Adam optimizer.
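The original paper pairs Adam with a warmup schedule: the learning rate rises linearly for the first warmup steps and then decays with the inverse square root of the step number.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 40000):
    print(step, round(transformer_lr(step), 6))  # peaks at step == warmup_steps
```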
Limitations
The transformer is undoubtedly an improvement over the RNN-powered sequence-to-sequence model. However, it still has a limitation to overcome, one that stems mainly from the attention mechanism: the length of input the attention mechanism can process is fixed and limited.
This requires the text to be split into chunks before being translated. The splitting leads to context fragmentation, and the model may not extract the meaning of the entire passage correctly. To resolve this problem, Transformer-XL was developed, in which the hidden states from the segments before the current one are reused as information for the current segment, enabling the model to recognize long-term dependencies that cross from one fragment to another.
It can be said that the development of the transformer model is a groundbreaking achievement for the NLP industry, one that lets developers adapt the model as required by their applications.