forked from susanli2016/NLP-with-Python
Commit
1 parent b804513, commit 6ce1a5b
Showing 1 changed file with 347 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,347 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## RNNs\n", | ||
"\n", | ||
"We will use Recurrent Neural Networks, and in particular LSTMs, to perform sentiment analysis in Keras. Conveniently, Keras has a built-in IMDb movie reviews dataset that we can use." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from keras.datasets import imdb" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Loaded dataset with 25000 training samples, 25000 test samples\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"vocabulary_size = 5000\n", | ||
"\n", | ||
"(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)\n", | ||
"print('Loaded dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
" Inspect a sample review and its label" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"---review---\n", | ||
"[1, 2, 365, 1234, 5, 1156, 354, 11, 14, 2, 2, 7, 1016, 2, 2, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 2, 1117, 1831, 2, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 2, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 2, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 2, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]\n", | ||
"---label---\n", | ||
"1\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print('---review---')\n", | ||
"print(X_train[6])\n", | ||
"print('---label---')\n", | ||
"print(y_train[6])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Map word IDs back to words" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"---review with words---\n", | ||
"['the', 'and', 'full', 'involving', 'to', 'impressive', 'boring', 'this', 'as', 'and', 'and', 'br', 'villain', 'and', 'and', 'need', 'has', 'of', 'costumes', 'b', 'message', 'to', 'may', 'of', 'props', 'this', 'and', 'and', 'concept', 'issue', 'and', 'to', \"god's\", 'he', 'is', 'and', 'unfolds', 'movie', 'women', 'like', \"isn't\", 'surely', \"i'm\", 'and', 'to', 'toward', 'in', \"here's\", 'for', 'from', 'did', 'having', 'because', 'very', 'quality', 'it', 'is', 'and', 'and', 'really', 'book', 'is', 'both', 'too', 'worked', 'carl', 'of', 'and', 'br', 'of', 'reviewer', 'closer', 'figure', 'really', 'there', 'will', 'and', 'things', 'is', 'far', 'this', 'make', 'mistakes', 'and', 'was', \"couldn't\", 'of', 'few', 'br', 'of', 'you', 'to', \"don't\", 'female', 'than', 'place', 'she', 'to', 'was', 'between', 'that', 'nothing', 'and', 'movies', 'get', 'are', 'and', 'br', 'yes', 'female', 'just', 'its', 'because', 'many', 'br', 'of', 'overly', 'to', 'descent', 'people', 'time', 'very', 'bland']\n", | ||
"---label---\n", | ||
"1\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"word2id = imdb.get_word_index()\n", | ||
"id2word = {i: word for word, i in word2id.items()}\n", | ||
"print('---review with words---')\n", | ||
"print([id2word.get(i, ' ') for i in X_train[6]])\n", | ||
"print('---label---')\n", | ||
"print(y_train[6])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Maximum review length and minimum review length" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Maximum review length: 2697\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print('Maximum review length: {}'.format(\n", | ||
"len(max((X_train + X_test), key=len))))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Minimum review length: 14\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print('Minimum review length: {}'.format(\n", | ||
"len(min((X_test + X_test), key=len))))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Pad sequences\n", | ||
"\n", | ||
"In order to feed this data into our RNN, all input documents must have the same length. We will limit the maximum review length to max_words by truncating longer reviews and padding shorter reviews with a null value (0). We can accomplish this using the pad_sequences() function in Keras. For now, set max_words to 500." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from keras.preprocessing import sequence\n", | ||
"\n", | ||
"max_words = 500\n", | ||
"X_train = sequence.pad_sequences(X_train, maxlen=max_words)\n", | ||
"X_test = sequence.pad_sequences(X_test, maxlen=max_words)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### TODO: Design an RNN model for sentiment analysis\n", | ||
"\n", | ||
"Build our model architecture in the code cell below. We have imported some layers from Keras that you might need but feel free to use any other layers / transformations you like.\n", | ||
"\n", | ||
"Remember that our input is a sequence of words (technically, integer word IDs) of maximum length = max_words, and our output is a binary sentiment label (0 or 1)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"_________________________________________________________________\n", | ||
"Layer (type) Output Shape Param # \n", | ||
"=================================================================\n", | ||
"embedding_1 (Embedding) (None, 500, 32) 160000 \n", | ||
"_________________________________________________________________\n", | ||
"lstm_1 (LSTM) (None, 100) 53200 \n", | ||
"_________________________________________________________________\n", | ||
"dense_1 (Dense) (None, 1) 101 \n", | ||
"=================================================================\n", | ||
"Total params: 213,301\n", | ||
"Trainable params: 213,301\n", | ||
"Non-trainable params: 0\n", | ||
"_________________________________________________________________\n", | ||
"None\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from keras import Sequential\n", | ||
"from keras.layers import Embedding, LSTM, Dense, Dropout\n", | ||
"\n", | ||
"embedding_size=32\n", | ||
"model=Sequential()\n", | ||
"model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))\n", | ||
"model.add(LSTM(100))\n", | ||
"model.add(Dense(1, activation='sigmoid'))\n", | ||
"\n", | ||
"print(model.summary())" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To summarize, our model is a simple RNN model with 1 embedding, 1 LSTM and 1 dense layers. 213,301 parameters in total need to be trained." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Train and evaluate our model\n", | ||
"\n", | ||
"We first need to compile our model by specifying the loss function and optimizer we want to use while training, as well as any evaluation metrics we'd like to measure. Specify the approprate parameters, including at least one metric 'accuracy'." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"model.compile(loss='binary_crossentropy', \n", | ||
" optimizer='adam', \n", | ||
" metrics=['accuracy'])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Once compiled, we can kick off the training process. There are two important training parameters that we have to specify - batch size and number of training epochs, which together with our model architecture determine the total training time.\n", | ||
"\n", | ||
"Training may take a while, so grab a cup of coffee, or better, go for a run!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Train on 24936 samples, validate on 64 samples\n", | ||
"Epoch 1/3\n", | ||
"24936/24936 [==============================] - 225s 9ms/step - loss: 0.5240 - acc: 0.7362 - val_loss: 0.2415 - val_acc: 0.9219\n", | ||
"Epoch 2/3\n", | ||
"24936/24936 [==============================] - 239s 10ms/step - loss: 0.3327 - acc: 0.8587 - val_loss: 0.3031 - val_acc: 0.9062\n", | ||
"Epoch 3/3\n", | ||
"24936/24936 [==============================] - 233s 9ms/step - loss: 0.2578 - acc: 0.8985 - val_loss: 0.2591 - val_acc: 0.9062\n" | ||
] | ||
}, | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"<keras.callbacks.History at 0x1a9529dea20>" | ||
] | ||
}, | ||
"execution_count": 11, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"batch_size = 64\n", | ||
"num_epochs = 3\n", | ||
"\n", | ||
"X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]\n", | ||
"X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]\n", | ||
"\n", | ||
"model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"scores[1] will correspond to accuracy if we pass metrics=['accuracy']" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Test accuracy: 0.86964\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"scores = model.evaluate(X_test, y_test, verbose=0)\n", | ||
"print('Test accuracy:', scores[1])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
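
The "Map word IDs back to words" cell looks up the IDs from `imdb.load_data()` directly in the inverted word index. By default that loader shifts every index by 3 (`index_from=3`) and reserves 0 for padding, 1 for the start marker, and 2 for out-of-vocabulary words, which is probably why the decoded sample reads like unrelated words. Below is a minimal sketch of an offset-aware decoder. The `toy_word_index` dictionary, the `decode_review` helper, and the placeholder token strings are hypothetical stand-ins so the example runs without downloading the dataset.

```python
# Hypothetical stand-in for imdb.get_word_index(): word -> rank (1 = most frequent).
toy_word_index = {"the": 1, "and": 2, "movie": 3, "great": 4}

INDEX_FROM = 3  # default offset applied by imdb.load_data (assumed unchanged)
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}  # placeholder names, chosen here


def decode_review(encoded, word_index, index_from=INDEX_FROM):
    """Map encoded IDs back to words, undoing the index_from offset."""
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    id_to_word.update(SPECIAL_TOKENS)
    return [id_to_word.get(i, "<unk>") for i in encoded]


# An encoded review in the loader's convention: start marker, then shifted ranks.
sample = [1, 4, 6, 2, 7]
decoded = decode_review(sample, toy_word_index)
print(decoded)
```

With the real dataset, the same idea would amount to replacing `id2word.get(i, ' ')` with a lookup on `i - 3` for IDs of 3 and above.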
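
The two review-length cells apply `+` to the train and test collections. If those are NumPy object arrays of lists, as some Keras versions return, `+` may concatenate matching reviews pairwise instead of joining the two splits, so the printed lengths could describe pairs of reviews. This sketch measures each review on its own and uses small made-up lists (`train_reviews`, `test_reviews`) in place of the real data.

```python
from itertools import chain

# Made-up encoded reviews standing in for X_train and X_test.
train_reviews = [[1, 14, 22], [1, 4, 9, 7, 3]]
test_reviews = [[1, 8], [1, 5, 6, 2]]

# Walk over both splits one review at a time; no array arithmetic involved.
lengths = [len(review) for review in chain(train_reviews, test_reviews)]

max_len = max(lengths)
min_len = min(lengths)
print('Maximum review length: {}'.format(max_len))
print('Minimum review length: {}'.format(min_len))
```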
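
The "Pad sequences" step relies on `pad_sequences()`, which by default pads at the front and also truncates from the front, keeping the last `maxlen` tokens. To make that behaviour concrete, here is a pure-Python sketch of the same default policy. The `pad_or_truncate` helper is hypothetical and only illustrates the idea; it is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)
    if len(seq) >= maxlen:
        # Drop tokens from the front so the end of the review survives.
        return seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq


reviews = [[5, 8, 13], [1, 2, 3, 4, 5, 6, 7]]
padded = [pad_or_truncate(r, maxlen=5) for r in reviews]
print(padded)
```

Every row now has the same length, which is what the embedding layer expects when `input_length=max_words` is set.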
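
The parameter counts in the model summary can be checked by hand. The sketch below redoes the arithmetic for the three layers, assuming a standard LSTM cell with four gates, each holding input weights, recurrent weights, and a bias. The variable names mirror the notebook, with `lstm_units` added here for readability.

```python
vocabulary_size = 5000
embedding_size = 32
lstm_units = 100  # name introduced for this sketch; the notebook passes 100 directly

# Embedding: one vector of size embedding_size per vocabulary entry.
embedding_params = vocabulary_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

These figures match the 160000, 53200, 101, and 213,301 reported by `model.summary()`.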