LSTM: Sentiment Analysis on the IMDB Dataset Using PyTorch
Long Short Term Memory (LSTM) is considered to be among the best models for sequence prediction. In this article, I hope to help you clearly understand how to implement sentiment analysis on the IMDB movie review dataset using Python.
Sequence prediction problems have been around for a while now, be it stock market prediction, text classification, sentiment analysis, language translation, etc. One of the most commonly and effectively used models for these tasks is the LSTM.
To understand LSTM, we must start at the very root: neural networks.
An Artificial Neural Network (ANN) is a structure of connected neurons. It relies not on just one algorithm but on a combination of algorithms that together enable advanced computation. One type of ANN is the Recurrent Neural Network (RNN), designed to process temporal (sequential) data: its neurons possess a memory of previous inputs. However, the RNN's major drawback is the vanishing gradient problem, and it also fails to maintain long-term dependencies. Moreover, to add new information, an RNN modifies its entire state rather than simply inserting the new data, so it does not differentiate between the important and negligible parts of the information. This is what led to the creation of Long Short Term Memory networks (LSTMs).
LSTM is a special category of RNN that can capture long-term dependencies. Its selective-remembering property enables it to focus only on the parts of the input that matter for prediction.
What is sentiment analysis?
Sentiment analysis is a language processing task in which the polarity of an input is assessed as positive, negative, or neutral. This proves fruitful for businesses that analyze customer reviews to provide better customer service, aggregate movie ratings, and so on. Sentiment analysis is not to be confused with emotion analysis, where the input is classified into categories such as joy, sadness, anger, surprise, and many more.
Basic LSTM working model
A very general LSTM module consists of a cell state and three gates, which together enable the selective-remembering property by deciding what information to learn, unlearn, or retain. The cell state lets information flow through largely unchanged, with only a few linear interactions. Each unit has a forget gate, an input gate, and an output gate.
Fig : LSTM architecture (Source:Github.io)
1. Forget Gate:
Used to eliminate unnecessary or less important information from the cell state. It takes two inputs: the hidden state from the previous cell (h_t-1) and the input at the current timestep (x_t). These inputs are multiplied by weight matrices, a bias is added, and a sigmoid function is applied, giving a vector of values between 0 and 1, one for each number in the cell state. This sigmoid function is what decides which values to keep and which to ignore: '0' means forget the value and '1' means retain it. For example, consider the sentence "I like watching action movies. And this was a DC movie, that is why I liked this movie a lot". To determine whether the person responded to the movie positively or negatively, we do not need to remember that it was a DC movie; all that matters is that the person likes action movies and therefore liked this one. Thus, the forget gate discards the DC-movie phrase.
2. Input Gate:
Used for adding information to the cell state which takes three steps to happen:
- Apply a sigmoid function to decide what part of new information to learn and what can be discarded.
- Then, create a vector of all such values that can be possibly added to the cell state using the tanh function, which gives an output in the range from -1 to 1.
- Finally, multiply the output of the sigmoid function with the created vector and add useful information to the cell state.
3. Output Gate:
This gate is responsible for selecting only the useful information from the current cell state and producing it as the output. The output gate also functions in three steps:
- First, apply the tanh function to the cell state to get values in the range -1 to 1.
- Then, apply the sigmoid function to the inputs from the previous cell (h_t-1 and x_t).
- Finally, multiply the output of the sigmoid function with the tanh output and send the result as the cell's output.
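The three gates above can be sketched numerically. Here is a minimal, purely illustrative single-timestep LSTM in plain Python, using scalar weights chosen arbitrarily for demonstration (a real LSTM learns weight matrices):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy scalar LSTM step; all weights and the bias are arbitrary illustrative values.
def lstm_step(x_t, h_prev, c_prev,
              w_f=0.5, w_i=0.6, w_c=0.7, w_o=0.8, b=0.1):
    f_t = sigmoid(w_f * (h_prev + x_t) + b)      # forget gate: 0 = drop, 1 = keep
    i_t = sigmoid(w_i * (h_prev + x_t) + b)      # input gate: how much new info to add
    c_hat = math.tanh(w_c * (h_prev + x_t) + b)  # candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * c_hat             # new cell state
    o_t = sigmoid(w_o * (h_prev + x_t) + b)      # output gate
    h_t = o_t * math.tanh(c_t)                   # new hidden state
    return h_t, c_t

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.3]:       # a short toy input sequence
    h, c = lstm_step(x, h, c)
```

Because the output gate multiplies a sigmoid (0 to 1) by a tanh (-1 to 1), the hidden state always stays in (-1, 1).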
For this implementation, we will use the IMDB movie review dataset, so download the dataset and bring it onto your working system.
Step 1: Import libraries
As with every other program, we first import all the necessary libraries, including NumPy, pandas, PyTorch, and scikit-learn. These libraries give us prebuilt methods that make tasks like reading CSV files and analyzing numerical data easy.
Step 2: Load the data
To begin designing our model, we need to import the dataset files into the project code; that is what this step is about. Read the reviews and their corresponding labels from the .txt files into the program.
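A sketch of this step, assuming the dataset ships as plain-text files (the file names `reviews.txt` and `labels.txt` are assumptions; adjust them to your download). The snippet writes tiny stand-in files first so it is self-contained; with the real dataset, skip that part and open the files directly:

```python
# Stand-in files so the snippet runs on its own; with the real dataset,
# skip this block and open the downloaded reviews.txt / labels.txt.
with open('reviews.txt', 'w') as f:
    f.write('great movie\nterrible plot\n')
with open('labels.txt', 'w') as f:
    f.write('positive\nnegative\n')

# The actual loading step: read reviews and labels as raw strings.
with open('reviews.txt') as f:
    reviews = f.read()
with open('labels.txt') as f:
    labels = f.read()
```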
Step 3: Preprocessing the data
In this step, we bring all the data into a uniform representation by removing punctuation such as !, #, [, ], *, etc., and converting all words to lower case so that no incorrect identification occurs due to case differences. Once the data is clean, we create a list of all reviews in the dataset and print the number of reviews.
On running this, an output similar to ‘Number of reviews: 25001’ can be seen.
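One way to sketch this cleaning step (the sample string is a placeholder; in practice `reviews` holds the full file contents, one review per line):

```python
from string import punctuation

def preprocess(text):
    # Lowercase and strip punctuation so 'Great!' and 'great' map to the same token.
    text = text.lower()
    return ''.join(ch for ch in text if ch not in punctuation)

reviews = "Great movie!\nTerrible plot, bad acting.\n"   # placeholder data
cleaned = preprocess(reviews)
reviews_list = cleaned.split('\n')[:-1]   # one review per line; drop trailing empty entry
print('Number of reviews:', len(reviews_list))
```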
Step 4: Tokenization
Now, we first create a vocabulary to integer mapping using the ‘Counter’ method from the ‘collections’ library. This is done such that all words with higher occurrence frequency are assigned a lower index. Then, we encode the words in our ‘review’ dataset to be represented as integers for better machine understanding and then encode the labels from the ‘label’ dataset.
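This mapping can be sketched as follows (toy reviews stand in for the cleaned dataset; index 0 is left free for padding later):

```python
from collections import Counter

reviews_list = ['great movie great acting', 'terrible movie']   # placeholder data
words = ' '.join(reviews_list).split()

# More frequent words get lower indices; 0 is reserved for padding.
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}

# Encode each review as a list of integers.
reviews_ints = [[vocab_to_int[w] for w in review.split()] for review in reviews_list]

# Encode labels: positive -> 1, negative -> 0.
labels = ['positive', 'negative']
encoded_labels = [1 if l == 'positive' else 0 for l in labels]
```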
Step 5: Remove outliers and analyze mean review length
At this stage, we have successfully represented the textual information as integers.
We now analyze the average review length to determine which reviews can be used for analysis. For our dataset, the mean review length is 240.
The dataset, however, contains reviews with sequence lengths of only 1-2, which we discard as they do not help make the system efficient.
It also contains extremely long reviews (sequence length up to 2514) that only distract the model from converging, so these are discarded too.
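A sketch of this filtering step (toy encoded reviews stand in for the real ones; the cut-off of zero-length reviews mirrors the idea, and the length threshold would be tuned on the real data):

```python
# Toy encoded reviews; in practice these come from the tokenization step.
reviews_ints = [[1, 2, 3], [], [4, 5], [6] * 300]
encoded_labels = [1, 0, 1, 0]

lengths = [len(r) for r in reviews_ints]
print('Mean review length:', sum(lengths) / len(lengths))

# Drop empty/too-short reviews (and their labels) -- they carry no signal.
keep = [i for i, r in enumerate(reviews_ints) if len(r) > 0]
reviews_ints = [reviews_ints[i] for i in keep]
encoded_labels = [encoded_labels[i] for i in keep]
```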
Step 6: Pad or truncate the remaining data.
To make the data uniform in length, we pad reviews shorter than the chosen sequence length with zeros and truncate longer reviews to their first words, up to that sequence length.
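One common way to implement this (a sketch; here short reviews are left-padded with zeros, which is why index 0 was kept out of the vocabulary):

```python
def pad_features(reviews_ints, seq_length):
    # Left-pad short reviews with 0s; keep only the first seq_length tokens of long ones.
    features = []
    for review in reviews_ints:
        if len(review) < seq_length:
            row = [0] * (seq_length - len(review)) + review
        else:
            row = review[:seq_length]
        features.append(row)
    return features

features = pad_features([[1, 2], [3, 4, 5, 6, 7]], seq_length=4)
# Every row now has exactly 4 entries.
```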
Step 7: Splitting the data into train, validation, and test sets.
Now that our data is uniform in all aspects, we split it in an 8:1:1 ratio into training, validation, and test sets respectively.
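The 8:1:1 split can be sketched like this (toy lists stand in for the padded feature matrix and label array):

```python
# Toy data; in practice these are the padded features and encoded labels.
features = list(range(100))
labels = [i % 2 for i in range(100)]

split_frac = 0.8
n_train = int(len(features) * split_frac)
n_val = int(len(features) * 0.1)

train_x, train_y = features[:n_train], labels[:n_train]
val_x, val_y = features[n_train:n_train + n_val], labels[n_train:n_train + n_val]
test_x, test_y = features[n_train + n_val:], labels[n_train + n_val:]
```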
Step 8: Dataloader and batching
To feed the data to the network in batches, we create data loaders and extract one batch of training data for visualization. This can be easily done using TensorDataset and DataLoader.
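A sketch of the batching step with PyTorch's TensorDataset and DataLoader (random tensors stand in for the real padded features and labels; shapes are illustrative):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy tensors standing in for the padded features (40 reviews, length 10) and labels.
train_x = torch.randint(0, 100, (40, 10))
train_y = torch.randint(0, 2, (40,))

train_data = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_data, shuffle=True, batch_size=8)

# Grab one batch for inspection.
sample_x, sample_y = next(iter(train_loader))
```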
Step 9: Creating LSTM architecture
At this stage, we have everything we need to design an LSTM model for sentiment analysis. The model processes data in the following structure:
Fig: LSTM model flowchart
Step 10: Define the model class
Here, we define the exact specifications of the model: its layers and the processing that happens inside each of them.
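One possible sketch of such a model class in PyTorch (the layer sizes, dropout probability, and the embedding -> LSTM -> dropout -> linear -> sigmoid layout are illustrative choices, not the only valid ones):

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Embedding -> LSTM -> dropout -> linear -> sigmoid: one sketch of the layout."""

    def __init__(self, vocab_size, embedding_dim=64, hidden_dim=128,
                 n_layers=2, output_size=1, drop_prob=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embeds = self.embedding(x)                    # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embeds)               # (batch, seq_len, hidden_dim)
        out = self.fc(self.dropout(lstm_out[:, -1]))  # use only the last timestep
        return self.sigmoid(out).squeeze(1)           # (batch,) probabilities

model = SentimentLSTM(vocab_size=1000)
probs = model(torch.randint(0, 1000, (4, 20)))   # batch of 4 reviews, length 20
```

Each output is a probability in [0, 1]; values above 0.5 would be read as a positive review.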
Step 11: Network training
- Instantiate the network:
This initializes the network with values called hyperparameters, which can be tuned according to the model's training requirements.
- The training loop:
This consists of the standard deep learning code used with PyTorch: defining the optimizer, calculating loss statistics, performing backpropagation, and so on.
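The standard PyTorch training loop looks roughly like this (a tiny stand-in model and random data keep the sketch self-contained; in practice, use the sentiment model and the training DataLoader from earlier steps):

```python
import torch
import torch.nn as nn

# Tiny stand-in model and data; replace with the LSTM model and DataLoader batches.
model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())
criterion = nn.BCELoss()   # binary cross-entropy for positive/negative labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 10)
y = torch.randint(0, 2, (32, 1)).float()

for epoch in range(5):
    optimizer.zero_grad()          # clear accumulated gradients
    output = model(x)              # forward pass
    loss = criterion(output, y)    # loss calculation
    loss.backward()                # backpropagation
    optimizer.step()               # weight update
```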
Step 12: Testing
Our model is successfully built and ready to be tested. For this, we first evaluate it on the test split of the downloaded dataset and then on input from a real user.
- On testing data
- On data taken from the user
For this, we need to clean the user's input and preprocess it so that it is represented the same way as the data in the dataset. It is then padded or truncated to the uniform sequence length, if needed. The input is then ready for prediction.
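A sketch of that preparation step. The tiny `vocab_to_int` mapping and `seq_length` here are assumptions standing in for the artifacts produced during training; unknown words are simply dropped:

```python
from string import punctuation

# Assumed stand-ins for the real vocabulary mapping and training sequence length.
vocab_to_int = {'no': 1, 'story': 2, 'not': 3, 'watch': 4}
seq_length = 6

def encode_input(text):
    # Same cleaning as the training data: lowercase, strip punctuation.
    text = ''.join(ch for ch in text.lower() if ch not in punctuation)
    # Map known words to integers; words outside the vocabulary are dropped.
    ints = [vocab_to_int[w] for w in text.split() if w in vocab_to_int]
    # Pad with zeros / truncate to the training sequence length.
    return [0] * (seq_length - len(ints)) + ints[:seq_length]

encoded = encode_input("There was no story, wouldn't watch it again")
```

The resulting integer sequence can then be turned into a tensor and passed through the trained model for prediction.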
Results for user data:
user_input = “I did not really get what I was expecting here. There was no story, wouldn’t watch it again”
Negative review detected
We have successfully implemented a model that analyzes the sentiment of a movie review using Long Short Term Memory.
Benefits of using LSTM over other models:
- LSTM has memory and can store information from previous timesteps, which lets the network learn efficiently from sequences.
- With a few modifications, the model can be made bi-directional to capture the future and past context for each word which better helps understand the importance of each information unit.
- It captures the long-term dependency in any given information.
- It exploits the sequential nature of data such as speech: no two words are placed next to each other at random, and their co-occurrence defines a relationship that may be important for context extraction.
But there are some limitations too:
- Predictions on time-series data can never be made with complete assurance.
- No dataset gives access to all possible data, and training LSTMs is computationally expensive.
So, what alternatives can also be considered?
- Gated Recurrent Units (GRU): need no separate memory cell and are faster to train than LSTMs.
- Independently Recurrent Neural Network (IndRNN): can process longer sequences, up to 10 times faster.
- Residual Network (ResNet): helps mitigate the vanishing gradient problem using skip connections.