Sentiment Analysis of IMDB Movie Reviews using a Convolutional Neural Network (CNN) with Hyperparameter Tuning

Alireza Bagheri

Data

In this project, I will use the IMDB movie review dataset. It contains 50,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The dataset can be loaded and split into training and test sets as follows.

Load IMDB movie reviews

In [1]:
from keras.datasets import imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data()

print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
25000 train sequences
25000 test sequences

Let us have a look at the first sample of the training set.

In [2]:
print(X_train[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

As is clear, the review text is integer-encoded: each integer represents a specific word in a dictionary.
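The corresponding labels are binary: 0 for a negative review and 1 for a positive one. A quick check of the label values and their counts (a small sketch, not one of the original notebook cells):

import numpy as np

# Labels are 0 (negative) or 1 (positive); the two classes are balanced
print(np.unique(y_train, return_counts=True))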

Decode reviews from index

We can convert the integers back to words as follows.

In [3]:
INDEX_FROM = 3
word_index = imdb.get_word_index()
word_index = {key:(value+INDEX_FROM) for key,value in word_index.items()}
word_index["<PAD>"] = 0    # the padding token
word_index["<START>"] = 1  # the starting token
word_index["<UNK>"] = 2    # the unknown token
reverse_word_index = {value:key for key, value in word_index.items()}

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

decode_review(X_train[0])
Out[3]:
"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

From here on, I will only consider the top 5,000 most common words. I will also set aside 20% of the training set for validation.

In [4]:
vocab_size = 5000 
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words= vocab_size)

X_train, X_val = X_train[:-5000], X_train[-5000:]
y_train, y_val = y_train[:-5000], y_train[-5000:]

print(len(X_train), 'train sequences')
print(len(X_val), 'val sequences')
print(len(X_test), 'test sequences')
20000 train sequences
5000 val sequences
25000 test sequences
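Note that the split above simply holds out the last 5,000 training reviews as the validation set. An equivalent shuffled split could also be done with scikit-learn's train_test_split (a sketch, assuming scikit-learn is available):

from sklearn.model_selection import train_test_split

# Hold out 20% of the training data for validation, shuffled with a fixed seed
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)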

Let us inspect what the first review looks like when we only consider the top 5,000 most frequent words.

In [5]:
decode_review(X_train[0])
Out[5]:
"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly <UNK> was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little <UNK> that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big <UNK> for the whole film but these children are amazing and should be <UNK> for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was <UNK> with us all"

Truncate and pad the review sequences

Movie reviews can have different lengths. We will use the pad_sequences function to standardize the length of the reviews.
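Before padding, it is worth checking the distribution of review lengths to confirm that a maximum length of 500 is reasonable (a small sketch, not one of the original cells):

import numpy as np

# Review lengths before any padding/truncation
lengths = [len(review) for review in X_train]
print('mean: %.1f, median: %d, max: %d' % (np.mean(lengths), np.median(lengths), np.max(lengths)))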

In [6]:
from keras.preprocessing.sequence import pad_sequences

maximum_sequence_length = 500 # maximum length of all review sequences

X_train = pad_sequences(X_train, value= word_index["<PAD>"], padding= 'post', maxlen= maximum_sequence_length)
X_val = pad_sequences(X_val, value= word_index["<PAD>"], padding= 'post', maxlen= maximum_sequence_length)
X_test = pad_sequences(X_test, value= word_index["<PAD>"], padding= 'post', maxlen= maximum_sequence_length)

print('X_train shape:', X_train.shape) # (n_samples, n_timesteps)
print('X_val shape:', X_val.shape)
print('X_test shape:', X_test.shape)
X_train shape: (20000, 500)
X_val shape: (5000, 500)
X_test shape: (25000, 500)

Let us check the first padded review.

In [7]:
print(X_train[0])
[   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
    2   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117    2   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141    6  194    2   18    4  226   22   21  134  476
   26  480    5  144   30    2   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16    2   19  178   32    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0]

Build the model

Create the model

In this project, I will consider a Convolutional Neural Network (CNN) for text classification.

In [8]:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import ParameterGrid
from keras.callbacks import EarlyStopping

embedding_dim = 16

def create_model(filters = 64, kernel_size = 3, strides=1, units = 256, 
                 optimizer='adam', rate = 0.25, kernel_initializer ='glorot_uniform'):
    model = Sequential()
    # Embedding layer
    model.add(Embedding(vocab_size, embedding_dim, input_length= maximum_sequence_length))
    # Convolutional Layer(s)
    model.add(Dropout(rate))
    model.add(Conv1D(filters = filters, kernel_size = kernel_size, strides= strides, 
                     padding='same', activation= 'relu'))
    model.add(GlobalMaxPooling1D())
    # Dense layer(s)
    model.add(Dense(units = units, activation= 'relu', kernel_initializer= kernel_initializer))
    model.add(Dropout(rate))
    # Output layer
    model.add(Dense(1, activation= 'sigmoid'))
    
    # Compile the model
    model.compile(loss='binary_crossentropy',
                  optimizer= optimizer,
                  metrics=['accuracy'])
    return model
# Build the model
model = KerasClassifier(build_fn= create_model)
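
To inspect the resulting architecture (layer output shapes and parameter counts), we can build an instance with the default arguments and print its summary (a quick sketch, not one of the original cells):

# Optional: inspect the architecture built with the default hyperparameters
create_model().summary()
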
Tune hyperparameters

Now, it is time to tune the hyperparameters to improve accuracy on the validation set.
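scikit-learn's ParameterGrid expands a dictionary of hyperparameter lists into the Cartesian product of settings; with the lists below, only the number of dense units varies, giving two candidate configurations. A tiny illustration (a sketch):

from sklearn.model_selection import ParameterGrid

# ParameterGrid yields one dict per combination of the listed values
print(list(ParameterGrid({'units': [128, 512], 'epochs': [5]})))
# e.g. [{'epochs': 5, 'units': 128}, {'epochs': 5, 'units': 512}]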

In [9]:
# Set the hyperparameters
filters = [128] #[64, 128, 256]
kernel_size = [5] #[3, 5, 7]
strides= [1] # [1, 2, 5]
Dense_units = [128, 512]
kernel_initializer = ['TruncatedNormal'] #['zero', 'glorot_uniform', 'glorot_normal','TruncatedNormal']
rate_dropouts = [0.25] #[0.1, 0.25, 0.5]
optimizers = ['adam'] #['adam','rmsprop']
epochs = [5]
batches = [64] #[32, 64, 128]
# ----------------------------------------------
# Exhaustive Grid Search
param_grid = dict(optimizer= optimizers, epochs= epochs, batch_size= batches,
                  filters = filters, kernel_size = kernel_size, strides = strides, 
                  units = Dense_units, kernel_initializer= kernel_initializer, rate = rate_dropouts)

grid = ParameterGrid(param_grid)
param_sets = list(grid)

param_scores = []
for params in grid:

    print(params)
    model.set_params(**params)

    earlystopper = EarlyStopping(monitor='val_acc', patience= 0, verbose=1)
    
    history = model.fit(X_train, y_train,
                        shuffle= True,
                        validation_data=(X_val, y_val),
                        callbacks= [earlystopper])

    param_score = history.history['val_acc']
    param_scores.append(param_score[-1])
    print('+-'*50) 

print('param_scores:', param_scores)
# Choose the best parameter set
p = np.argmax(np.array(param_scores))
best_params = param_sets[p]
print("best score:", param_scores[p])
print("best parameter set", best_params)
{'batch_size': 64, 'epochs': 5, 'filters': 128, 'kernel_initializer': 'TruncatedNormal', 'kernel_size': 5, 'optimizer': 'adam', 'rate': 0.25, 'strides': 1, 'units': 128}
Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 57s 3ms/step - loss: 0.5705 - acc: 0.6634 - val_loss: 0.3607 - val_acc: 0.8482
Epoch 2/5
20000/20000 [==============================] - 60s 3ms/step - loss: 0.3236 - acc: 0.8582 - val_loss: 0.2914 - val_acc: 0.8792
Epoch 3/5
20000/20000 [==============================] - 62s 3ms/step - loss: 0.2426 - acc: 0.9029 - val_loss: 0.2723 - val_acc: 0.8894
Epoch 4/5
20000/20000 [==============================] - 62s 3ms/step - loss: 0.2053 - acc: 0.9207 - val_loss: 0.2776 - val_acc: 0.8902
Epoch 5/5
20000/20000 [==============================] - 65s 3ms/step - loss: 0.1698 - acc: 0.9348 - val_loss: 0.2765 - val_acc: 0.8908
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
{'batch_size': 64, 'epochs': 5, 'filters': 128, 'kernel_initializer': 'TruncatedNormal', 'kernel_size': 5, 'optimizer': 'adam', 'rate': 0.25, 'strides': 1, 'units': 512}
Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 67s 3ms/step - loss: 0.5520 - acc: 0.6775 - val_loss: 0.3555 - val_acc: 0.8404
Epoch 2/5
20000/20000 [==============================] - 65s 3ms/step - loss: 0.2975 - acc: 0.8766 - val_loss: 0.2785 - val_acc: 0.8852
Epoch 3/5
20000/20000 [==============================] - 66s 3ms/step - loss: 0.2270 - acc: 0.9097 - val_loss: 0.2581 - val_acc: 0.9000
Epoch 4/5
20000/20000 [==============================] - 66s 3ms/step - loss: 0.1868 - acc: 0.9285 - val_loss: 0.2644 - val_acc: 0.8972
Epoch 00004: early stopping
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
param_scores: [0.8908, 0.8972]
best score: 0.8972
best parameter set {'batch_size': 64, 'epochs': 5, 'filters': 128, 'kernel_initializer': 'TruncatedNormal', 'kernel_size': 5, 'optimizer': 'adam', 'rate': 0.25, 'strides': 1, 'units': 512}

Train the model

Here, I train the model with the best hyperparameters found above on the combined training and validation sets.

In [10]:
model.set_params(**best_params)
model.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))
Epoch 1/5
25000/25000 [==============================] - 47s 2ms/step - loss: 0.4977 - acc: 0.7292
Epoch 2/5
25000/25000 [==============================] - 47s 2ms/step - loss: 0.2748 - acc: 0.8858
Epoch 3/5
25000/25000 [==============================] - 49s 2ms/step - loss: 0.2203 - acc: 0.9132
Epoch 4/5
25000/25000 [==============================] - 52s 2ms/step - loss: 0.1850 - acc: 0.9295
Epoch 5/5
25000/25000 [==============================] - 56s 2ms/step - loss: 0.1573 - acc: 0.9400
Out[10]:
<keras.callbacks.History at 0x45e29470>
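The fitted KerasClassifier wrapper keeps the trained Keras model in its model attribute, so it could be saved for later reuse (a sketch; the filename is arbitrary):

# Persist the trained network for later reuse (hypothetical filename)
model.model.save('imdb_cnn_sentiment.h5')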

Evaluate the model

Finally, I evaluate the performance of the trained model on the unseen test set.

In [11]:
print("Test accuracy = %f%%" % (accuracy_score(y_test, model.predict(X_test))*100))
Test accuracy = 89.220000%
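
As a final usage example, a new review can be scored by encoding it with the same word index, mapping out-of-vocabulary words to <UNK>, and padding to the same length (a rough sketch; the simple lowercase/whitespace tokenization only approximates the preprocessing used to build the dataset):

def encode_review(text):
    # <START> token followed by word indices; unknown or rare words become <UNK>
    tokens = [word_index["<START>"]] + [word_index.get(w, word_index["<UNK>"])
                                        for w in text.lower().split()]
    tokens = [t if t < vocab_size else word_index["<UNK>"] for t in tokens]
    return pad_sequences([tokens], value=word_index["<PAD>"],
                         padding='post', maxlen=maximum_sequence_length)

print(model.predict(encode_review("a brilliant and deeply moving film")))  # predicted class: 1 = positive, 0 = negative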