diff --git a/keras/2.1-a-first-look-at-a-neural-network.ipynb b/keras/2.1-a-first-look-at-a-neural-network.ipynb new file mode 100644 index 0000000..c211e3f --- /dev/null +++ b/keras/2.1-a-first-look-at-a-neural-network.ipynb @@ -0,0 +1,425 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**First of all, set environment variables and initialize spark context:**" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: PYSPARK_PYTHON=/usr/bin/python3.5\n", + "env: PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n" + ] + } + ], + "source": [ + "%env PYSPARK_PYTHON=/usr/bin/python3.5\n", + "%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n", + "\n", + "from zoo.common.nncontext import *\n", + "sc = init_nncontext(init_spark_conf().setMaster(\"local[4]\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A first look at a neural network\n", + "\n", + "----\n", + "\n", + "We will now take a look at a first concrete example of a neural network, which makes use of Keras (v1.2.2) API in [Analytics Zoo](https://github.com/intel-analytics/analytics-zoo) to learn to classify hand-written digits. Unless you already have experience with Keras or similar libraries, you will not understand everything about this first example right away. You probably haven't even installed Analytics zoo yet. Don't worry, that is perfectly fine. In the next chapter, we will review each element in our example and explain them in detail. So don't worry if some steps seem arbitrary or look like magic to you! We've got to start somewhere.\n", + "\n", + "The problem we are trying to solve here is to classify grayscale images of handwritten digits (28 pixels by 28 pixels), into their 10 categories (0 to 9). 
The dataset we will use is the MNIST dataset, a classic dataset in the machine learning community, which has been around for almost as long as the field itself and has been very intensively studied. It's a set of 60,000 training images, plus 10,000 test images, assembled by the National Institute of Standards and Technology (the NIST in MNIST) in the 1980s. You can think of \"solving\" MNIST as the \"Hello World\" of deep learning -- it's what you do to verify that your algorithms are working as expected. As you become a machine learning practitioner, you will see MNIST come up over and over again, in scientific papers, blog posts, and so on.\n", + "\n", + "The MNIST dataset comes pre-loaded in the Keras API of Analytics Zoo, in the form of a set of four Numpy arrays:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Datasets import\n", + "_In Keras one could use the following code to import the datasets:_\n", + "\n", + " from keras.datasets import mnist\n", + "_Just replace it with the following in Analytics Zoo:_" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/.zoo/dataset/mnist/train-images-idx3-ubyte.gz\n", + "Extracting /tmp/.zoo/dataset/mnist/train-labels-idx1-ubyte.gz\n", + "Extracting /tmp/.zoo/dataset/mnist/t10k-images-idx3-ubyte.gz\n", + "Extracting /tmp/.zoo/dataset/mnist/t10k-labels-idx1-ubyte.gz\n" + ] + } + ], + "source": [ + "from zoo.pipeline.api.keras.datasets import mnist\n", + "(train_images, train_labels), (test_images, test_labels) = mnist.load_data()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`train_images` and `train_labels` form the \"training set\", the data that the model will learn from. The model will then be tested on the \n", + "\"test set\", `test_images` and `test_labels`. 
Our images are encoded as Numpy arrays, and the labels are simply an array of digits, ranging \n", + "from 0 to 9. There is a one-to-one correspondence between the images and the labels.\n", + "\n", + "Let's have a look at the training data:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(60000, 28, 28, 1)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_images.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "60000" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(train_labels)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_labels" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(10000, 28, 28, 1)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_images.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10000" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(test_labels)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([7, 2, 1, ..., 4, 5, 6], dtype=uint8)" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_labels" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "Our workflow will be as follows: first we will present our neural network with the training data, `train_images` and `train_labels`. The \n", + "network will then learn to associate images and labels. Finally, we will ask the network to produce predictions for `test_images`, and we \n", + "will verify whether these predictions match the labels from `test_labels`.\n", + "\n", + "Let's build our network -- again, remember that you aren't supposed to understand everything about this example just yet." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Module import\n", + "_In Keras one could use the following code to import the modules we need to build the network:_\n", + "\n", + " from keras import models\n", + " from keras import layers\n", + "_Just replace it with the following in Analytics Zoo:_" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras import models\n", + "from zoo.pipeline.api.keras import layers" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "network = models.Sequential()\n", + "network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))\n", + "network.add(layers.Dense(10, activation='softmax'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The core building block of neural networks is the \"layer\", a data-processing module which you can think of as a \"filter\" for data. Some \n", + "data comes in, and comes out in a more useful form. 
Precisely, layers extract _representations_ out of the data fed into them -- hopefully \n", + "representations that are more meaningful for the problem at hand. Most of deep learning really consists of chaining together simple layers \n", + "which will implement a form of progressive \"data distillation\". A deep learning model is like a sieve for data processing, made of a \n", + "succession of increasingly refined data filters -- the \"layers\".\n", + "\n", + "Here our network consists of a sequence of two `Dense` layers, which are densely-connected (also called \"fully-connected\") neural layers. \n", + "The second (and last) layer is a 10-way \"softmax\" layer, which means it will return an array of 10 probability scores (summing to 1). Each \n", + "score will be the probability that the current digit image belongs to one of our 10 digit classes.\n", + "\n", + "To make our network ready for training, we need to pick three more things, as part of the \"compilation\" step:\n", + "\n", + "* A loss function: this is how the network will be able to measure how good a job it is doing on its training data, and thus how it will be \n", + "able to steer itself in the right direction.\n", + "* An optimizer: this is the mechanism through which the network will update itself based on the data it sees and its loss function.\n", + "* Metrics to monitor during training and testing. Here we will only care about accuracy (the fraction of the images that were correctly \n", + "classified).\n", + "\n", + "The exact purpose of the loss function and the optimizer will be made clear throughout the next two chapters."
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createRMSprop\n", + "creating: createZooKerasSparseCategoricalCrossEntropy\n", + "creating: createZooKerasAccuracy\n" + ] + } + ], + "source": [ + "network.compile(optimizer='rmsprop',\n", + " loss='sparse_categorical_crossentropy',\n", + " metrics=['accuracy'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before training, we will preprocess our data by reshaping it into the shape that the network expects, and scaling it so that all values are in \n", + "the `[0, 1]` interval. Previously, our training images, for instance, were stored in an array of shape `(60000, 28, 28, 1)` of type `uint8` with \n", + "values in the `[0, 255]` interval. We transform it into a `float32` array of shape `(60000, 28 * 28)` with values between 0 and 1." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "train_images = train_images.reshape((60000, 28 * 28))\n", + "train_images = train_images.astype('float32') / 255\n", + "\n", + "test_images = test_images.reshape((10000, 28 * 28))\n", + "test_images = test_images.astype('float32') / 255" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are now ready to train our network, which in the Keras API of Analytics Zoo is done via a call to the `fit` method of the network: \n", + "we \"fit\" the model to its training data."
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "network.fit(train_images, train_labels, nb_epoch=5, batch_size=128)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The message below is the last INFO line of training. The full training log is written as INFO messages to your terminal or IDE console (it is not part of the program's output).\n", + "\n", + "_INFO - Trained 128 records in 0.018066358 seconds. Throughput is 7084.992 records/second. Loss is 0.012087556._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We quickly reach an accuracy of 0.989 (i.e. 98.9%) on the training data. Now let's check that our model performs well on the test set too:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "test_acc: 0.9797000288963318\n" + ] + } + ], + "source": [ + "test_loss, test_acc = network.evaluate(test_images, test_labels, batch_size=32)\n", + "print('test_acc:', test_acc)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This concludes our very first example -- you just saw how we could build and train a neural network to classify handwritten digits, in \n", + "fewer than 20 lines of Python code. In the next chapter, we will go in detail over every moving piece we just previewed, and clarify what is really \n", + "going on behind the scenes. You will learn about \"tensors\", the data-storing objects going into the network, about tensor operations, which \n", + "layers are made of, and about gradient descent, which allows our network to learn from its training examples."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/keras/3.5-classifying-movie-reviews.ipynb b/keras/3.5-classifying-movie-reviews.ipynb new file mode 100644 index 0000000..2a2d22c --- /dev/null +++ b/keras/3.5-classifying-movie-reviews.ipynb @@ -0,0 +1,1740 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**First of all, set environment variables and initialize spark context:**" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: SPARK_DRIVER_MEMORY=32g\n", + "env: PYSPARK_PYTHON=/usr/bin/python3.5\n", + "env: PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n" + ] + } + ], + "source": [ + "%env SPARK_DRIVER_MEMORY=32g\n", + "%env PYSPARK_PYTHON=/usr/bin/python3.5\n", + "%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n", + "\n", + "from zoo.common.nncontext import *\n", + "sc = init_nncontext(init_spark_conf().setMaster(\"local[4]\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that you need to set `SPARK_DRIVER_MEMORY` to 32g in order to run every cell in this notebook. If your machine does not have that much memory available, see the memory-saving approach at the end of this notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Classifying movie reviews: a binary classification example\n", + "\n", + "----\n", + "\n", + "Two-class classification, or binary classification, may be the most widely applied kind of machine learning problem. 
In this example, we \n", + "will learn to classify movie reviews into \"positive\" reviews and \"negative\" reviews, just based on the text content of the reviews." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The IMDB dataset\n", + "We'll be working with the \"IMDB dataset\", a set of 50,000 highly-polarized reviews from the Internet Movie Database. They are split into 25,000 \n", + "reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.\n", + "\n", + "Why do we have these two separate training and test sets? You should never test a machine learning model on the same data that you used to \n", + "train it! Just because a model performs well on its training data doesn't mean that it will perform well on data it has never seen, and \n", + "what you actually care about is your model's performance on new data (since you already know the labels of your training data -- obviously \n", + "you don't need your model to predict those). For instance, it is possible that your model could end up merely _memorizing_ a mapping between \n", + "your training samples and their targets -- which would be completely useless for the task of predicting targets for data never seen before. \n", + "We will go over this point in much more detail in the next chapter.\n", + "\n", + "Just like the MNIST dataset, the IMDB dataset comes packaged with the Keras API of Analytics Zoo. 
It has already been preprocessed: the reviews (sequences of words) \n", + "have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.\n", + "\n", + "The following code will load the dataset (when you run it for the first time, about 80MB of data will be downloaded to your machine):" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras.datasets import imdb\n", + "(train_data, train_labels), (test_data, test_labels) = imdb.load_data(nb_words=10000)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The argument `nb_words=10000` means that we will only keep the top 10,000 most frequently occurring words in the training data. Rare words \n", + "will be discarded. This allows us to work with vector data of manageable size.\n", + "\n", + "The variables `train_data` and `test_data` are lists of reviews, each review being a list of word indices (encoding a sequence of words). \n", + "`train_labels` and `test_labels` are lists of 0s and 1s, where 0 stands for \"negative\" and 1 stands for \"positive\":" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since we restricted ourselves to the top 10,000 most frequent words, no word index will exceed 10,000:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9999" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "max([max(sequence) for sequence in train_data])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For kicks, here's how you can quickly decode one of these reviews back to English words:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\"distracting ? a evil entertainment ? ? might ? 
might an films ? tries who because truly tries talent too br she man ? steven how determination will ? looks world's which can ? it screen that in have way gonna of of least ? want take toxic even paint ? similar ? japanese that ? would ? ? charles cover movie even ? moment ear ? ? not wanted involved ? ? ? ? quality ? ? ? point and sequences will ? bruckheimer how actually he way kinds ? genre fact fine l a either her ? ? movie ? cover and minute ? ending ? ? favorite ? of of private ? spoiler down remember and while ? ? having 200 while movie ? prove charisma pretty ? chuck and perspective seriously if a bed movie in ? cover was most five springer for to free film been but woods was showing director movie this ? ? display sinister much details to and ? ? many no there if i which explore is will ? paramount the we without on most ? eddie urban just the ? harold came even like cares charged ? be most comparison buck hollywood or mixed well 1 or have doesn't comes and point more ? meant scenes you'd for work doesn't school is pants ? was ? mask of of example is ? to friend flying making br any or ? seems as sending and you it this ? mcintire ? is style it role think parts guy was most feeling ? and awful if is close and down meets and shoot more lies ? anything question making for or ? try finally the way plot way car of of if assistant ? as a first is victim ? ? most corrupt be ? does few in a ? former teeth completely and top ? pets ? 50 score seems as will scenes ? plot done independent to ? waste the ? all john is talent just am ? for reading ? we ? this writing guy maintain jokes has on think team ? nudity been its film guy is 3 ? ? as and ? let's body from ground film was it terrified voice throughout distracting ? ? there and ? negative of of ? bland began been his most yourself was most enjoy the across cried ? luis for br been his it restaurant or better but shows ? 
very an off and comments in\"" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# word_index is a dictionary mapping words to an integer index\n", + "word_index = imdb.get_word_index()\n", + "# We reverse it, mapping integer indices to words\n", + "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n", + "# We decode the review; note that our indices were offset by 3\n", + "# because 0, 1 and 2 are reserved indices for \"padding\", \"start of sequence\", and \"unknown\".\n", + "decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])\n", + "decoded_review" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We cannot feed lists of integers into a neural network. We have to turn our lists into tensors. There are two ways we could do that:\n", + "\n", + "* We could pad our lists so that they all have the same length, and turn them into an integer tensor of shape `(samples, word_indices)`, \n", + "then use as first layer in our network a layer capable of handling such integer tensors (the `Embedding` layer, which we will cover in \n", + "detail later in the book).\n", + "* We could one-hot-encode our lists to turn them into vectors of 0s and 1s. Concretely, this would mean for instance turning the sequence \n", + "`[3, 5]` into a 10,000-dimensional vector that would be all-zeros except for indices 3 and 5, which would be ones. Then we could use as \n", + "first layer in our network a `Dense` layer, capable of handling floating point vector data.\n", + "\n", + "We will go with the latter solution. 
Let's vectorize our data, which we will do manually for maximum clarity:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "def vectorize_sequences(sequences, dimension=10000):\n", + " # Create an all-zero matrix of shape (len(sequences), dimension)\n", + " results = np.zeros((len(sequences), dimension))\n", + " for i, sequence in enumerate(sequences):\n", + " results[i, sequence] = 1. # set specific indices of results[i] to 1s\n", + " return results\n", + "\n", + "# Our vectorized training data\n", + "x_train = vectorize_sequences(train_data)\n", + "# Our vectorized test data\n", + "x_test = vectorize_sequences(test_data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's what our samples look like now:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0., 1., 1., ..., 0., 0., 0.])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x_train[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We should also vectorize our labels, which is straightforward:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "y_train = np.asarray(train_labels).astype('float32')\n", + "y_test = np.asarray(test_labels).astype('float32')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now our data is ready to be fed into a neural network." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Building our network\n", + "\n", + "\n", + "Our input data is simply vectors, and our labels are scalars (1s and 0s): this is the easiest setup you will ever encounter. 
A type of \n", + "network that performs well on such a problem would be a simple stack of fully-connected (`Dense`) layers with `relu` activations: `Dense(16, activation='relu')`.\n", + "\n", + "The argument being passed to each `Dense` layer (16) is the number of \"hidden units\" of the layer. What's a hidden unit? It's a dimension \n", + "in the representation space of the layer. You may remember from the previous chapter that each such `Dense` layer with a `relu` activation implements \n", + "the following chain of tensor operations:\n", + "\n", + "`output = relu(dot(input, W) + b)`\n", + "\n", + "Having 16 hidden units means that the weight matrix `W` will have shape `(input_dimension, 16)`, i.e. the dot product with `W` will project the \n", + "input data onto a 16-dimensional representation space (and then we would add the bias vector `b` and apply the `relu` operation). You can \n", + "intuitively understand the dimensionality of your representation space as \"how much freedom you are allowing the network to have when \n", + "learning internal representations\". Having more hidden units (a higher-dimensional representation space) allows your network to learn more \n", + "complex representations, but it makes your network more computationally expensive and may lead to learning unwanted patterns (patterns that \n", + "will improve performance on the training data but not on the test data).\n", + "\n", + "There are two key architecture decisions to be made about such a stack of dense layers:\n", + "\n", + "* How many layers to use.\n", + "* How many \"hidden units\" to choose for each layer.\n", + "\n", + "In the next chapter, you will learn formal principles to guide you in making these choices. \n", + "For the time being, you will have to trust us with the following architecture choice: \n", + "two intermediate layers with 16 hidden units each, \n", + "and a third layer which will output the scalar prediction regarding the sentiment of the current review. 
\n", + "The intermediate layers will use `relu` as their \"activation function\", \n", + "and the final layer will use a sigmoid activation so as to output a probability \n", + "(a score between 0 and 1, indicating how likely the sample is to have the target \"1\", i.e. how likely the review is to be positive). \n", + "A `relu` (rectified linear unit) is a function meant to zero-out negative values, \n", + "while a sigmoid \"squashes\" arbitrary values into the `[0, 1]` interval, thus outputting something that can be interpreted as a probability." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's what our network looks like:\n", + "\n", + "![3-layer network](https://s3.amazonaws.com/book.keras.io/img/ch3/3_layer_network.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And here's the Analytics zoo implementation, very similar to the MNIST example you saw previously:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from zoo.pipeline.api.keras import models\n", + "from zoo.pipeline.api.keras import layers\n", + "\n", + "model = models.Sequential()\n", + "model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))\n", + "model.add(layers.Dense(16, activation='relu'))\n", + "model.add(layers.Dense(1, activation='sigmoid'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lastly, we need to pick a loss function and an optimizer. 
Since we are facing a binary classification problem and the output of our network \n", + "is a probability (we end our network with a single-unit layer with a sigmoid activation), it is best to use the `binary_crossentropy` loss. \n", + "It isn't the only viable choice: you could use, for instance, `mean_squared_error`. But crossentropy is usually the best choice when you \n", + "are dealing with models that output probabilities. Crossentropy is a quantity from the field of information theory that measures the \"distance\" \n", + "between probability distributions, or in our case, between the ground-truth distribution and our predictions.\n", + "\n", + "Here's the step where we configure our model with the `rmsprop` optimizer and the `binary_crossentropy` loss function. Note that we will \n", + "also monitor accuracy during training." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "model.compile(optimizer='rmsprop',\n", + " loss='binary_crossentropy',\n", + " metrics=['accuracy'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Validating our approach\n", + "\n", + "In order to monitor during training the accuracy of the model on data that it has never seen before, we will create a \"validation set\" by \n", + "setting apart 10,000 samples from the original training data:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "x_val = x_train[:10000]\n", + "partial_x_train = x_train[10000:]\n", + "y_val = y_train[:10000]\n", + "partial_y_train = y_train[10000:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will now train our model for 20 epochs (20 iterations over all samples in the 
`partial_x_train` and `partial_y_train` tensors), in mini-batches of 512 \n", + "samples. At the same time we will monitor loss and accuracy on the 10,000 samples that we set apart. This is done by passing the \n", + "validation data as the `validation_data` argument:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Accuracy checkout\n", + "_To check the behavior of this model in Keras, one could use the following code, together with the `matplotlib` library, to plot the resulting `history` object:_\n", + " \n", + " history = model.fit(partial_x_train,\n", + " partial_y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_val, y_val)\n", + " )\n", + "_After the `fit` method finishes, the results are stored in `history` and can thus be visualized. Currently in Analytics Zoo, the `fit` method does not return anything. Results can only be checked by setting up tensorboard._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To visualize training, you can configure tensorboard on the model. The code that sets up tensorboard and trains the model follows. Note that `set_tensorboard` needs to be called before the `fit` method:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "dir_name = '3-5 ' + str(time.ctime())\n", + "model.set_tensorboard('./', dir_name)\n", + "model.fit(partial_x_train,\n", + " partial_y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_val, y_val))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 512 records in 0.020173091 seconds. Throughput is 25380.344 records/second. 
Loss is 0.0092472015.\n", + "Top1Accuracy is Accuracy(correct: 8707, count: 10000, accuracy: 0.8707)_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The results can then be visualized in either of the following ways: \n", + "\n", + "* Start the tensorboard web interface from a terminal with `tensorboard --logdir ./` and open the URL `localhost:port_number` shown in your terminal in a web browser.\n", + "* Use the Analytics Zoo built-in method `get_scalar_from_summary` with parameter `Loss` or `Validation` to get the array of scalars, then visualize it via `matplotlib`.\n", + "\n", + "We use the second approach here in order to show the result directly in this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "train_loss = np.array(model.get_train_summary('Loss'))\n", + "val_loss = np.array(model.get_validation_summary('Loss'))\n", + "\n", + "import matplotlib.pyplot as plt\n", + "plt.plot(train_loss[:,0],train_loss[:,1],label='train loss')\n", + "plt.plot(val_loss[:,0],val_loss[:,1],label='validation loss',color='green')\n", + "plt.title('Training and validation loss')\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The blue line is the training loss, while the green line is the validation loss. Note that your own results may vary \n", + "slightly due to a different random initialization of your network." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, the training loss decreases with every epoch. That's what you would \n", + "expect when running gradient descent optimization -- the quantity you are trying to minimize should get lower with every iteration. But that \n", + "isn't the case for the validation loss: it seems to reach its minimum about one fifth of the way through training, i.e. around epoch 4 of 20. This is an example of what we were warning \n", + "against earlier: a model that performs better on the training data isn't necessarily a model that will do better on data it has never seen \n", + "before. In precise terms, what you are seeing is \"overfitting\": after the fourth epoch, we are over-optimizing on the training data, and we \n", + "end up learning representations that are specific to the training data and do not generalize to data outside of the training set.\n", + "\n", + "In this case, to prevent overfitting, we could simply stop training after four epochs. 
In general, there is a range of techniques you can \n", + "leverage to mitigate overfitting, which we will cover in the next chapter.\n", + "\n", + "Let's train a new network from scratch for 4 epochs, then evaluate it on our test data:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "model = models.Sequential()\n", + "model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))\n", + "model.add(layers.Dense(16, activation='relu'))\n", + "model.add(layers.Dense(1, activation='sigmoid'))\n", + "\n", + "model.compile(optimizer='rmsprop',\n", + " loss='binary_crossentropy',\n", + " metrics=['accuracy'])\n", + "\n", + "model.fit(x_train, y_train, nb_epoch=4, batch_size=512)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 512 records in 0.023978868 seconds. Throughput is 21352.133 records/second. Loss is 0.108611815._" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.3063896596431732, 0.8806399703025818]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results = model.evaluate(x_test, y_test)\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our fairly naive approach achieves an accuracy of 88%. With state-of-the-art approaches, one should be able to get close to 95%." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using a trained network to generate predictions on new data\n", + "\n", + "After having trained a network, you will want to use it in a practical setting. You can generate the likelihood of reviews being positive \n", + "by using the `predict` method:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Predict result\n", + "_In Keras, one could just call the following code to predict on the test data:_\n", + "\n", + " model.predict(x_test)\n", + "_In Analytics Zoo, `predict` returns an RDD, so you need to call its `collect` method to get the result:_" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[array([0.6597656], dtype=float32),\n", + " array([0.97529125], dtype=float32),\n", + " array([1.39369495e-05], dtype=float32),\n", + " array([0.9499197], dtype=float32),\n", + " array([0.69558215], dtype=float32),\n", + " array([0.98174447], dtype=float32),\n", + " array([0.01318819], dtype=float32),\n", + " array([0.9626703], dtype=float32),\n", + " array([0.98742026], dtype=float32),\n", + " array([0.00059057], dtype=float32),\n", + " array([0.6133139], dtype=float32),\n", + " array([0.978926], dtype=float32),\n", + " array([0.99840707], dtype=float32),\n", + " array([0.07168697], dtype=float32),\n", + " array([0.89191675], dtype=float32),\n", + " array([0.48994958], dtype=float32),\n", + " array([0.02672931], dtype=float32),\n", + " array([0.78033304], dtype=float32),\n", + " array([0.07513892], dtype=float32),\n", + " array([0.00686305], dtype=float32),\n", + " array([0.04119945], dtype=float32),\n", + " array([1.49263915e-05], dtype=float32),\n", + " array([0.99980336], dtype=float32),\n", + " array([0.8471216], dtype=float32),\n", + " array([0.00010777], dtype=float32),\n", + " array([0.9340031], dtype=float32),\n", + " array([0.8214722], dtype=float32),\n", + " array([0.9786547], 
dtype=float32),\n", + " array([0.00837058], dtype=float32),\n", + " array([0.9238503], dtype=float32),\n", + " array([0.00408007], dtype=float32),\n", + " array([0.18840362], dtype=float32),\n", + " array([0.999974], dtype=float32),\n", + " array([0.9948447], dtype=float32),\n", + " array([0.8062789], dtype=float32),\n", + " array([0.11027395], dtype=float32),\n", + " array([0.04690371], dtype=float32),\n", + " array([0.07576486], dtype=float32),\n", + " array([0.9307181], dtype=float32),\n", + " array([0.9578869], dtype=float32),\n", + " array([4.3841228e-05], dtype=float32),\n", + " array([0.97011423], dtype=float32),\n", + " array([0.372595], dtype=float32),\n", + " array([0.08670929], dtype=float32),\n", + " array([0.9922921], dtype=float32),\n", + " array([0.00444584], dtype=float32),\n", + " array([0.9995722], dtype=float32),\n", + " array([0.90575284], dtype=float32),\n", + " array([0.03082987], dtype=float32),\n", + " array([8.931183e-06], dtype=float32),\n", + " array([0.01119773], dtype=float32),\n", + " array([0.9681336], dtype=float32),\n", + " array([0.839909], dtype=float32),\n", + " array([0.00667274], dtype=float32),\n", + " array([0.99168044], dtype=float32),\n", + " array([0.99999154], dtype=float32),\n", + " array([1.9037469e-05], dtype=float32),\n", + " array([0.9974356], dtype=float32),\n", + " array([0.00046782], dtype=float32),\n", + " array([0.00524331], dtype=float32),\n", + " array([0.8870116], dtype=float32),\n", + " array([0.9076144], dtype=float32),\n", + " array([0.02826679], dtype=float32),\n", + " array([0.95415473], dtype=float32),\n", + " array([0.3839109], dtype=float32),\n", + " array([0.99069595], dtype=float32),\n", + " array([0.06462941], dtype=float32),\n", + " array([0.99408925], dtype=float32),\n", + " array([0.00728586], dtype=float32),\n", + " array([0.9963102], dtype=float32),\n", + " array([0.88912857], dtype=float32),\n", + " array([0.99318165], dtype=float32),\n", + " array([0.98711836], dtype=float32),\n", + " 
array([0.9997482], dtype=float32),\n", + " array([0.12893666], dtype=float32),\n", + " array([2.553328e-05], dtype=float32),\n", + " array([0.81136394], dtype=float32),\n", + " array([0.6672609], dtype=float32),\n", + " array([0.6661795], dtype=float32),\n", + " array([0.03229121], dtype=float32),\n", + " array([0.56833935], dtype=float32),\n", + " array([0.23906621], dtype=float32),\n", + " array([0.9886596], dtype=float32),\n", + " array([0.9827251], dtype=float32),\n", + " array([0.08567941], dtype=float32),\n", + " array([0.37140584], dtype=float32),\n", + " array([0.00025531], dtype=float32),\n", + " array([0.99791545], dtype=float32),\n", + " array([0.02411093], dtype=float32),\n", + " array([0.9877809], dtype=float32),\n", + " array([0.908092], dtype=float32),\n", + " array([0.8383248], dtype=float32),\n", + " array([0.00739653], dtype=float32),\n", + " array([0.00090695], dtype=float32),\n", + " array([0.9652902], dtype=float32),\n", + " array([0.01431155], dtype=float32),\n", + " array([0.93294597], dtype=float32),\n", + " array([0.99896336], dtype=float32),\n", + " array([0.9984067], dtype=float32),\n", + " array([0.93452567], dtype=float32),\n", + " array([0.99430794], dtype=float32),\n", + " array([0.36339617], dtype=float32),\n", + " array([0.8769031], dtype=float32),\n", + " array([0.9518878], dtype=float32),\n", + " array([0.83151025], dtype=float32),\n", + " array([0.9985399], dtype=float32),\n", + " array([0.0002125], dtype=float32),\n", + " array([0.714252], dtype=float32),\n", + " array([0.27901366], dtype=float32),\n", + " array([0.8523226], dtype=float32),\n", + " array([0.99559104], dtype=float32),\n", + " array([0.18001182], dtype=float32),\n", + " array([0.9432954], dtype=float32),\n", + " array([0.8350808], dtype=float32),\n", + " array([0.00853516], dtype=float32),\n", + " array([0.15583186], dtype=float32),\n", + " array([0.92990994], dtype=float32),\n", + " array([0.7541111], dtype=float32),\n", + " array([0.69654137], 
dtype=float32),\n", + " array([0.01848821], dtype=float32),\n", + " array([0.59170055], dtype=float32),\n", + " array([0.9971204], dtype=float32),\n", + " array([0.9903796], dtype=float32),\n", + " array([0.9991167], dtype=float32),\n", + " array([0.9316476], dtype=float32),\n", + " array([0.06031401], dtype=float32),\n", + " array([0.02550006], dtype=float32),\n", + " array([0.9999504], dtype=float32),\n", + " array([0.00857145], dtype=float32),\n", + " array([0.47920564], dtype=float32),\n", + " array([0.9485018], dtype=float32),\n", + " array([0.00464081], dtype=float32),\n", + " array([0.08251999], dtype=float32),\n", + " array([0.98797554], dtype=float32),\n", + " array([0.97623616], dtype=float32),\n", + " array([0.00270883], dtype=float32),\n", + " array([0.41065904], dtype=float32),\n", + " array([0.00041126], dtype=float32),\n", + " array([0.9735677], dtype=float32),\n", + " array([0.01444051], dtype=float32),\n", + " array([0.1193343], dtype=float32),\n", + " array([0.94883794], dtype=float32),\n", + " array([0.81132954], dtype=float32),\n", + " array([0.9701367], dtype=float32),\n", + " array([0.99988973], dtype=float32),\n", + " array([0.95782846], dtype=float32),\n", + " array([0.9999559], dtype=float32),\n", + " array([0.02463553], dtype=float32),\n", + " array([0.80905896], dtype=float32),\n", + " array([0.00272602], dtype=float32),\n", + " array([0.9443275], dtype=float32),\n", + " array([0.6925543], dtype=float32),\n", + " array([0.96254104], dtype=float32),\n", + " array([0.9993697], dtype=float32),\n", + " array([0.90027475], dtype=float32),\n", + " array([0.05616611], dtype=float32),\n", + " array([1.1050109e-05], dtype=float32),\n", + " array([0.8539005], dtype=float32),\n", + " array([0.7169908], dtype=float32),\n", + " array([0.06052893], dtype=float32),\n", + " array([0.03273512], dtype=float32),\n", + " array([0.98712534], dtype=float32),\n", + " array([0.00043659], dtype=float32),\n", + " array([0.9919195], dtype=float32),\n", + " 
array([0.5189989], dtype=float32),\n", + " array([0.01810263], dtype=float32),\n", + " array([0.00150598], dtype=float32),\n", + " array([0.06606124], dtype=float32),\n", + " array([0.00081787], dtype=float32),\n", + " array([0.01792734], dtype=float32),\n", + " array([0.9788325], dtype=float32),\n", + " array([0.95970446], dtype=float32),\n", + " array([0.09366837], dtype=float32),\n", + " array([0.01276378], dtype=float32),\n", + " array([0.9993555], dtype=float32),\n", + " array([0.027029], dtype=float32),\n", + " array([0.56499213], dtype=float32),\n", + " array([0.99708503], dtype=float32),\n", + " array([0.00154167], dtype=float32),\n", + " array([0.2801673], dtype=float32),\n", + " array([0.52925706], dtype=float32),\n", + " array([0.0010483], dtype=float32),\n", + " array([0.9990589], dtype=float32),\n", + " array([0.00761955], dtype=float32),\n", + " array([0.936439], dtype=float32),\n", + " array([0.9875731], dtype=float32),\n", + " array([0.05203724], dtype=float32),\n", + " array([0.9949458], dtype=float32),\n", + " array([0.12733188], dtype=float32),\n", + " array([0.01648956], dtype=float32),\n", + " array([0.7714576], dtype=float32),\n", + " array([0.7118609], dtype=float32),\n", + " array([0.09135327], dtype=float32),\n", + " array([0.94923663], dtype=float32),\n", + " array([0.00418737], dtype=float32),\n", + " array([0.39404547], dtype=float32),\n", + " array([0.98599905], dtype=float32),\n", + " array([0.7954801], dtype=float32),\n", + " array([0.42050537], dtype=float32),\n", + " array([0.02979656], dtype=float32),\n", + " array([0.9153005], dtype=float32),\n", + " array([0.7568136], dtype=float32),\n", + " array([0.5575319], dtype=float32),\n", + " array([0.9995894], dtype=float32),\n", + " array([0.9746347], dtype=float32),\n", + " array([6.51397e-05], dtype=float32),\n", + " array([0.14501932], dtype=float32),\n", + " array([0.97661], dtype=float32),\n", + " array([0.01651403], dtype=float32),\n", + " array([0.73719937], dtype=float32),\n", + 
" array([0.9063153], dtype=float32),\n", + " array([0.997982], dtype=float32),\n", + " array([0.91056806], dtype=float32),\n", + " array([0.00447078], dtype=float32),\n", + " array([0.09257668], dtype=float32),\n", + " array([0.9366054], dtype=float32),\n", + " array([0.9811677], dtype=float32),\n", + " array([0.0012391], dtype=float32),\n", + " array([0.00391587], dtype=float32),\n", + " array([0.00012618], dtype=float32),\n", + " array([0.0366583], dtype=float32),\n", + " array([0.00550616], dtype=float32),\n", + " array([0.890634], dtype=float32),\n", + " array([0.00715845], dtype=float32),\n", + " array([0.72381204], dtype=float32),\n", + " array([0.19576788], dtype=float32),\n", + " array([0.99990416], dtype=float32),\n", + " array([0.0158124], dtype=float32),\n", + " array([0.61522424], dtype=float32),\n", + " array([0.9689464], dtype=float32),\n", + " array([0.04064468], dtype=float32),\n", + " array([0.00022891], dtype=float32),\n", + " array([0.02944768], dtype=float32),\n", + " array([0.999653], dtype=float32),\n", + " array([0.40116826], dtype=float32),\n", + " array([0.9913776], dtype=float32),\n", + " array([0.0029448], dtype=float32),\n", + " array([0.32557806], dtype=float32),\n", + " array([0.6863088], dtype=float32),\n", + " array([0.00081112], dtype=float32),\n", + " array([0.97927356], dtype=float32),\n", + " array([0.19653757], dtype=float32),\n", + " array([0.9705768], dtype=float32),\n", + " array([0.04453946], dtype=float32),\n", + " array([0.00284266], dtype=float32),\n", + " array([0.03559921], dtype=float32),\n", + " array([0.9526187], dtype=float32),\n", + " array([0.7230885], dtype=float32),\n", + " array([0.8201464], dtype=float32),\n", + " array([0.00017875], dtype=float32),\n", + " array([0.97747767], dtype=float32),\n", + " array([0.5449069], dtype=float32),\n", + " array([0.09639208], dtype=float32),\n", + " array([0.90544367], dtype=float32),\n", + " array([0.167667], dtype=float32),\n", + " array([0.9997439], dtype=float32),\n", + 
" array([0.9310318], dtype=float32),\n", + " array([0.37656942], dtype=float32),\n", + " array([0.0002848], dtype=float32),\n", + " array([0.0001366], dtype=float32),\n", + " array([0.7440771], dtype=float32),\n", + " array([0.88802665], dtype=float32),\n", + " array([0.9152749], dtype=float32),\n", + " array([0.5734805], dtype=float32),\n", + " array([0.9993099], dtype=float32),\n", + " array([0.49408263], dtype=float32),\n", + " array([0.8506351], dtype=float32),\n", + " array([0.00250183], dtype=float32),\n", + " array([0.9945287], dtype=float32),\n", + " array([0.9684286], dtype=float32),\n", + " array([0.90822536], dtype=float32),\n", + " array([0.9937883], dtype=float32),\n", + " array([0.99190396], dtype=float32),\n", + " array([0.01760691], dtype=float32),\n", + " array([0.5422416], dtype=float32),\n", + " array([0.29439396], dtype=float32),\n", + " array([0.99019873], dtype=float32),\n", + " array([0.06950508], dtype=float32),\n", + " array([0.00818285], dtype=float32),\n", + " array([0.9632261], dtype=float32),\n", + " array([0.99473333], dtype=float32),\n", + " array([0.25060079], dtype=float32),\n", + " array([0.00048786], dtype=float32),\n", + " array([0.01472425], dtype=float32),\n", + " array([0.00318411], dtype=float32),\n", + " array([0.00093868], dtype=float32),\n", + " array([0.83109117], dtype=float32),\n", + " array([0.00123343], dtype=float32),\n", + " array([0.9713263], dtype=float32),\n", + " array([0.04610278], dtype=float32),\n", + " array([0.05665827], dtype=float32),\n", + " array([0.5868943], dtype=float32),\n", + " array([0.98522806], dtype=float32),\n", + " array([0.03351312], dtype=float32),\n", + " array([0.02006613], dtype=float32),\n", + " array([0.00033519], dtype=float32),\n", + " array([0.67317265], dtype=float32),\n", + " array([0.30107507], dtype=float32),\n", + " array([3.784242e-05], dtype=float32),\n", + " array([0.6087148], dtype=float32),\n", + " array([0.997804], dtype=float32),\n", + " array([0.32963577], 
dtype=float32),\n", + " array([0.03810342], dtype=float32),\n", + " array([0.99538136], dtype=float32),\n", + " array([0.5548133], dtype=float32),\n", + " array([0.9353912], dtype=float32),\n", + " array([0.9966528], dtype=float32),\n", + " array([0.00378726], dtype=float32),\n", + " array([0.43726218], dtype=float32),\n", + " array([0.95121735], dtype=float32),\n", + " array([0.9728295], dtype=float32),\n", + " array([3.875886e-06], dtype=float32),\n", + " array([0.98975426], dtype=float32),\n", + " array([0.9864806], dtype=float32),\n", + " array([0.00165366], dtype=float32),\n", + " array([0.1064606], dtype=float32),\n", + " array([0.89174306], dtype=float32),\n", + " array([0.00587977], dtype=float32),\n", + " array([0.98498905], dtype=float32),\n", + " array([0.06515972], dtype=float32),\n", + " array([0.06025562], dtype=float32),\n", + " array([0.0166713], dtype=float32),\n", + " array([0.93327284], dtype=float32),\n", + " array([0.36270353], dtype=float32),\n", + " array([0.99993503], dtype=float32),\n", + " array([0.75670844], dtype=float32),\n", + " array([0.8717547], dtype=float32),\n", + " array([0.3455405], dtype=float32),\n", + " array([0.79031855], dtype=float32),\n", + " array([0.28538352], dtype=float32),\n", + " array([0.9997949], dtype=float32),\n", + " array([0.26040974], dtype=float32),\n", + " array([0.9983621], dtype=float32),\n", + " array([0.04919887], dtype=float32),\n", + " array([0.00535334], dtype=float32),\n", + " array([0.33617225], dtype=float32),\n", + " array([0.07422278], dtype=float32),\n", + " array([0.15734425], dtype=float32),\n", + " array([0.8681399], dtype=float32),\n", + " array([3.36514e-05], dtype=float32),\n", + " array([0.220001], dtype=float32),\n", + " array([0.03030171], dtype=float32),\n", + " array([0.00071725], dtype=float32),\n", + " array([0.20411605], dtype=float32),\n", + " array([0.38738677], dtype=float32),\n", + " array([0.99825364], dtype=float32),\n", + " array([0.97874314], dtype=float32),\n", + " 
array([0.9536651], dtype=float32),\n", + " array([0.99999595], dtype=float32),\n", + " array([0.9274589], dtype=float32),\n", + " array([0.67642564], dtype=float32),\n", + " array([0.86876076], dtype=float32),\n", + " array([0.99380374], dtype=float32),\n", + " array([0.00764247], dtype=float32),\n", + " array([0.00141049], dtype=float32),\n", + " array([0.44760624], dtype=float32),\n", + " array([0.7392404], dtype=float32),\n", + " array([0.94820905], dtype=float32),\n", + " array([0.01543296], dtype=float32),\n", + " array([0.0030313], dtype=float32),\n", + " array([0.9983657], dtype=float32),\n", + " array([0.9877472], dtype=float32),\n", + " array([0.14449687], dtype=float32),\n", + " array([0.0175909], dtype=float32),\n", + " array([0.9933814], dtype=float32),\n", + " array([0.1099957], dtype=float32),\n", + " array([0.502743], dtype=float32),\n", + " array([0.0021092], dtype=float32),\n", + " array([0.4014902], dtype=float32),\n", + " array([8.531843e-05], dtype=float32),\n", + " array([0.0042778], dtype=float32),\n", + " array([0.91485137], dtype=float32),\n", + " array([0.02211919], dtype=float32),\n", + " array([0.00567074], dtype=float32),\n", + " array([0.06237838], dtype=float32),\n", + " array([0.9416742], dtype=float32),\n", + " array([0.0665731], dtype=float32),\n", + " array([0.8300122], dtype=float32),\n", + " array([0.93574494], dtype=float32),\n", + " array([0.99325573], dtype=float32),\n", + " array([0.24700274], dtype=float32),\n", + " array([0.99896765], dtype=float32),\n", + " array([0.93945384], dtype=float32),\n", + " array([0.18341716], dtype=float32),\n", + " array([0.00710799], dtype=float32),\n", + " array([0.00717159], dtype=float32),\n", + " array([0.9978796], dtype=float32),\n", + " array([0.39169902], dtype=float32),\n", + " array([0.9921503], dtype=float32),\n", + " array([0.33547845], dtype=float32),\n", + " array([0.97284275], dtype=float32),\n", + " array([0.99999547], dtype=float32),\n", + " array([0.04805868], 
dtype=float32),\n", + " array([0.6807831], dtype=float32),\n", + " array([0.38082442], dtype=float32),\n", + " array([0.7750744], dtype=float32),\n", + " array([0.99722785], dtype=float32),\n", + " array([0.77780694], dtype=float32),\n", + " array([0.9519044], dtype=float32),\n", + " array([0.00215464], dtype=float32),\n", + " array([0.29531085], dtype=float32),\n", + " array([0.9999316], dtype=float32),\n", + " array([0.7214245], dtype=float32),\n", + " array([0.8033163], dtype=float32),\n", + " array([0.6166736], dtype=float32),\n", + " array([0.26327613], dtype=float32),\n", + " array([0.21962917], dtype=float32),\n", + " array([0.10679483], dtype=float32),\n", + " array([0.04216451], dtype=float32),\n", + " array([0.00307667], dtype=float32),\n", + " array([0.99923015], dtype=float32),\n", + " array([0.00597921], dtype=float32),\n", + " array([0.99360764], dtype=float32),\n", + " array([0.973897], dtype=float32),\n", + " array([0.13671698], dtype=float32),\n", + " array([0.44968152], dtype=float32),\n", + " array([0.07701934], dtype=float32),\n", + " array([0.05103498], dtype=float32),\n", + " array([0.9994609], dtype=float32),\n", + " array([0.07936312], dtype=float32),\n", + " array([0.8839954], dtype=float32),\n", + " array([1.365624e-06], dtype=float32),\n", + " array([0.00480004], dtype=float32),\n", + " array([0.12765045], dtype=float32),\n", + " array([0.9904794], dtype=float32),\n", + " array([0.6438497], dtype=float32),\n", + " array([0.8862176], dtype=float32),\n", + " array([7.784928e-05], dtype=float32),\n", + " array([0.19045115], dtype=float32),\n", + " array([0.00067149], dtype=float32),\n", + " array([0.9358372], dtype=float32),\n", + " array([0.02452566], dtype=float32),\n", + " array([0.9958995], dtype=float32),\n", + " array([0.550974], dtype=float32),\n", + " array([0.30900526], dtype=float32),\n", + " array([0.99798125], dtype=float32),\n", + " array([0.01287526], dtype=float32),\n", + " array([0.01379994], dtype=float32),\n", + " 
array([0.12119947], dtype=float32),\n", + " array([0.665414], dtype=float32),\n", + " array([0.00102568], dtype=float32),\n", + " array([0.2067204], dtype=float32),\n", + " array([0.0050051], dtype=float32),\n", + " array([0.00433443], dtype=float32),\n", + " array([0.39867714], dtype=float32),\n", + " array([0.00024582], dtype=float32),\n", + " array([0.00571835], dtype=float32),\n", + " array([0.00590702], dtype=float32),\n", + " array([0.5449246], dtype=float32),\n", + " array([0.97699547], dtype=float32),\n", + " array([0.00366751], dtype=float32),\n", + " array([0.13479914], dtype=float32),\n", + " array([0.98704463], dtype=float32),\n", + " array([0.0312269], dtype=float32),\n", + " array([0.00039572], dtype=float32),\n", + " array([0.7193606], dtype=float32),\n", + " array([0.07044102], dtype=float32),\n", + " array([0.03585317], dtype=float32),\n", + " array([0.17524014], dtype=float32),\n", + " array([0.14926364], dtype=float32),\n", + " array([0.21622558], dtype=float32),\n", + " array([0.47393447], dtype=float32),\n", + " array([0.8796138], dtype=float32),\n", + " array([0.57277304], dtype=float32),\n", + " array([0.9692422], dtype=float32),\n", + " array([0.9952886], dtype=float32),\n", + " array([0.95525163], dtype=float32),\n", + " array([0.3414528], dtype=float32),\n", + " array([0.6035593], dtype=float32),\n", + " array([0.03257844], dtype=float32),\n", + " array([0.01301803], dtype=float32),\n", + " array([0.47819394], dtype=float32),\n", + " array([1.6677832e-08], dtype=float32),\n", + " array([0.22340754], dtype=float32),\n", + " array([0.9999951], dtype=float32),\n", + " array([0.96137166], dtype=float32),\n", + " array([0.9981943], dtype=float32),\n", + " array([0.05160893], dtype=float32),\n", + " array([0.99629396], dtype=float32),\n", + " array([0.9625849], dtype=float32),\n", + " array([0.0002911], dtype=float32),\n", + " array([0.980667], dtype=float32),\n", + " array([0.9892765], dtype=float32),\n", + " array([0.9987301], 
dtype=float32),\n", + " array([0.9874142], dtype=float32),\n", + " array([0.9936329], dtype=float32),\n", + " array([0.997771], dtype=float32),\n", + " array([0.5043148], dtype=float32),\n", + " array([0.8399789], dtype=float32),\n", + " array([0.9929483], dtype=float32),\n", + " array([0.31873196], dtype=float32),\n", + " array([0.0675632], dtype=float32),\n", + " array([0.00233161], dtype=float32),\n", + " array([0.98852634], dtype=float32),\n", + " array([0.9999845], dtype=float32),\n", + " array([0.08548676], dtype=float32),\n", + " array([0.00016344], dtype=float32),\n", + " array([0.06375157], dtype=float32),\n", + " array([0.98533106], dtype=float32),\n", + " array([0.9875267], dtype=float32),\n", + " array([0.02328171], dtype=float32),\n", + " array([0.7528208], dtype=float32),\n", + " array([0.6718994], dtype=float32),\n", + " array([0.7016442], dtype=float32),\n", + " array([0.29562166], dtype=float32),\n", + " array([0.21487534], dtype=float32),\n", + " array([0.05325569], dtype=float32),\n", + " array([0.98829865], dtype=float32),\n", + " array([0.0206712], dtype=float32),\n", + " array([0.39194584], dtype=float32),\n", + " array([0.05182257], dtype=float32),\n", + " array([0.12892328], dtype=float32),\n", + " array([0.98039585], dtype=float32),\n", + " array([0.07023581], dtype=float32),\n", + " array([0.998417], dtype=float32),\n", + " array([0.7812852], dtype=float32),\n", + " array([0.09137525], dtype=float32),\n", + " array([0.8678507], dtype=float32),\n", + " array([0.9933328], dtype=float32),\n", + " array([0.3079019], dtype=float32),\n", + " array([0.8708483], dtype=float32),\n", + " array([0.9929174], dtype=float32),\n", + " array([0.85494846], dtype=float32),\n", + " array([0.9882675], dtype=float32),\n", + " array([0.9930362], dtype=float32),\n", + " array([0.44101492], dtype=float32),\n", + " array([0.00028029], dtype=float32),\n", + " array([0.98733073], dtype=float32),\n", + " array([0.94348913], dtype=float32),\n", + " 
array([3.119138e-05], dtype=float32),\n", + " array([0.980949], dtype=float32),\n", + " array([0.9913406], dtype=float32),\n", + " array([0.99495846], dtype=float32),\n", + " array([0.9629638], dtype=float32),\n", + " array([0.0100573], dtype=float32),\n", + " array([0.02189975], dtype=float32),\n", + " array([0.99831617], dtype=float32),\n", + " array([0.98490876], dtype=float32),\n", + " array([0.54414076], dtype=float32),\n", + " array([0.06107181], dtype=float32),\n", + " array([0.9978096], dtype=float32),\n", + " array([0.9745584], dtype=float32),\n", + " array([0.00242021], dtype=float32),\n", + " array([0.03076136], dtype=float32),\n", + " array([0.35039175], dtype=float32),\n", + " array([0.83999205], dtype=float32),\n", + " array([0.99990547], dtype=float32),\n", + " array([0.05263938], dtype=float32),\n", + " array([0.8979464], dtype=float32),\n", + " array([0.03534276], dtype=float32),\n", + " array([0.00471485], dtype=float32),\n", + " array([0.99737906], dtype=float32),\n", + " array([0.929945], dtype=float32),\n", + " array([0.01993787], dtype=float32),\n", + " array([0.9856134], dtype=float32),\n", + " array([0.7457446], dtype=float32),\n", + " array([0.99158585], dtype=float32),\n", + " array([0.9860604], dtype=float32),\n", + " array([0.03886136], dtype=float32),\n", + " array([0.96496195], dtype=float32),\n", + " array([0.31795], dtype=float32),\n", + " array([0.99946743], dtype=float32),\n", + " array([0.996521], dtype=float32),\n", + " array([0.03773015], dtype=float32),\n", + " array([0.00583928], dtype=float32),\n", + " array([0.99041665], dtype=float32),\n", + " array([0.9955739], dtype=float32),\n", + " array([0.01058325], dtype=float32),\n", + " array([0.00011865], dtype=float32),\n", + " array([0.8401856], dtype=float32),\n", + " array([0.63474256], dtype=float32),\n", + " array([0.9829626], dtype=float32),\n", + " array([0.01037378], dtype=float32),\n", + " array([0.26479724], dtype=float32),\n", + " array([0.21121329], 
dtype=float32),\n", + " array([0.9914016], dtype=float32),\n", + " array([0.9588108], dtype=float32),\n", + " array([0.99756277], dtype=float32),\n", + " array([0.30543897], dtype=float32),\n", + " array([0.99640626], dtype=float32),\n", + " array([0.30586973], dtype=float32),\n", + " array([0.9993086], dtype=float32),\n", + " array([0.9949649], dtype=float32),\n", + " array([0.6421015], dtype=float32),\n", + " array([0.14092435], dtype=float32),\n", + " array([0.01815344], dtype=float32),\n", + " array([0.00090887], dtype=float32),\n", + " array([0.9869277], dtype=float32),\n", + " array([0.22545609], dtype=float32),\n", + " array([0.9994192], dtype=float32),\n", + " array([0.10223134], dtype=float32),\n", + " array([0.9989011], dtype=float32),\n", + " array([0.02059738], dtype=float32),\n", + " array([0.88542646], dtype=float32),\n", + " array([0.9960936], dtype=float32),\n", + " array([0.9262567], dtype=float32),\n", + " array([0.9434017], dtype=float32),\n", + " array([0.98046255], dtype=float32),\n", + " array([0.9889431], dtype=float32),\n", + " array([0.7408156], dtype=float32),\n", + " array([0.00285646], dtype=float32),\n", + " array([0.9890942], dtype=float32),\n", + " array([0.7398897], dtype=float32),\n", + " array([0.9671184], dtype=float32),\n", + " array([0.99998057], dtype=float32),\n", + " array([0.9491266], dtype=float32),\n", + " array([0.54299086], dtype=float32),\n", + " array([0.00412416], dtype=float32),\n", + " array([0.6694579], dtype=float32),\n", + " array([0.95415497], dtype=float32),\n", + " array([0.01549284], dtype=float32),\n", + " array([0.0003646], dtype=float32),\n", + " array([0.99999607], dtype=float32),\n", + " array([0.9999577], dtype=float32),\n", + " array([0.00113213], dtype=float32),\n", + " array([0.9941749], dtype=float32),\n", + " array([0.9958812], dtype=float32),\n", + " array([0.99189734], dtype=float32),\n", + " array([0.0017188], dtype=float32),\n", + " array([0.985795], dtype=float32),\n", + " array([0.9998721], 
dtype=float32),\n", + " array([0.99999976], dtype=float32),\n", + " array([0.98263794], dtype=float32),\n", + " array([0.58947575], dtype=float32),\n", + " array([0.00927054], dtype=float32),\n", + " array([0.9716789], dtype=float32),\n", + " array([0.84313625], dtype=float32),\n", + " array([0.96165526], dtype=float32),\n", + " array([0.9851811], dtype=float32),\n", + " array([0.9854842], dtype=float32),\n", + " array([0.00322469], dtype=float32),\n", + " array([0.9309462], dtype=float32),\n", + " array([0.20306274], dtype=float32),\n", + " array([0.04456307], dtype=float32),\n", + " array([0.9654337], dtype=float32),\n", + " array([0.01055153], dtype=float32),\n", + " array([0.99989104], dtype=float32),\n", + " array([0.03129936], dtype=float32),\n", + " array([0.2108155], dtype=float32),\n", + " array([0.98949], dtype=float32),\n", + " array([0.99999154], dtype=float32),\n", + " array([0.94526803], dtype=float32),\n", + " array([0.99107426], dtype=float32),\n", + " array([0.99824476], dtype=float32),\n", + " array([0.99930096], dtype=float32),\n", + " array([0.9494158], dtype=float32),\n", + " array([0.59529406], dtype=float32),\n", + " array([0.00836287], dtype=float32),\n", + " array([0.99950933], dtype=float32),\n", + " array([0.8118227], dtype=float32),\n", + " array([0.6227854], dtype=float32),\n", + " array([0.9727045], dtype=float32),\n", + " array([0.99001336], dtype=float32),\n", + " array([0.61210626], dtype=float32),\n", + " array([0.00018276], dtype=float32),\n", + " array([0.09038408], dtype=float32),\n", + " array([0.08299794], dtype=float32),\n", + " array([0.0105845], dtype=float32),\n", + " array([0.16678979], dtype=float32),\n", + " array([0.9531919], dtype=float32),\n", + " array([0.9998332], dtype=float32),\n", + " array([5.249855e-05], dtype=float32),\n", + " array([0.00057517], dtype=float32),\n", + " array([0.997013], dtype=float32),\n", + " array([0.12925929], dtype=float32),\n", + " array([0.07413327], dtype=float32),\n", + " 
array([0.98919934], dtype=float32),\n", + " array([0.5382614], dtype=float32),\n", + " array([0.9996692], dtype=float32),\n", + " array([0.8613375], dtype=float32),\n", + " array([0.71423596], dtype=float32),\n", + " array([0.09667405], dtype=float32),\n", + " array([0.9979893], dtype=float32),\n", + " array([0.00794561], dtype=float32),\n", + " array([0.00175152], dtype=float32),\n", + " array([0.21769904], dtype=float32),\n", + " array([0.94123036], dtype=float32),\n", + " array([0.96663105], dtype=float32),\n", + " array([0.01070287], dtype=float32),\n", + " array([0.07400733], dtype=float32),\n", + " array([0.012168], dtype=float32),\n", + " array([0.01236583], dtype=float32),\n", + " array([0.998744], dtype=float32),\n", + " array([0.00689602], dtype=float32),\n", + " array([0.9943845], dtype=float32),\n", + " array([0.81676173], dtype=float32),\n", + " array([0.9999511], dtype=float32),\n", + " array([0.00488768], dtype=float32),\n", + " array([0.33030394], dtype=float32),\n", + " array([1.3143154e-05], dtype=float32),\n", + " array([0.97804296], dtype=float32),\n", + " array([0.97198254], dtype=float32),\n", + " array([0.98943126], dtype=float32),\n", + " array([0.9713336], dtype=float32),\n", + " array([0.44176972], dtype=float32),\n", + " array([0.9996177], dtype=float32),\n", + " array([0.97341985], dtype=float32),\n", + " array([0.9889993], dtype=float32),\n", + " array([0.9999981], dtype=float32),\n", + " array([0.9906634], dtype=float32),\n", + " array([0.7313937], dtype=float32),\n", + " array([4.6735212e-07], dtype=float32),\n", + " array([0.9986381], dtype=float32),\n", + " array([0.5398831], dtype=float32),\n", + " array([0.5327877], dtype=float32),\n", + " array([0.99454075], dtype=float32),\n", + " array([0.7781688], dtype=float32),\n", + " array([0.00171901], dtype=float32),\n", + " array([0.9790917], dtype=float32),\n", + " array([1.7694707e-05], dtype=float32),\n", + " array([0.22174618], dtype=float32),\n", + " array([0.00032948], 
dtype=float32),\n", + " array([0.98750776], dtype=float32),\n", + " array([0.9930167], dtype=float32),\n", + " array([0.7805735], dtype=float32),\n", + " array([0.874757], dtype=float32),\n", + " array([0.10298155], dtype=float32),\n", + " array([0.00014828], dtype=float32),\n", + " array([0.01591556], dtype=float32),\n", + " array([0.96804875], dtype=float32),\n", + " array([0.91091835], dtype=float32),\n", + " array([0.00087433], dtype=float32),\n", + " array([0.02600787], dtype=float32),\n", + " array([0.00168016], dtype=float32),\n", + " array([0.93263006], dtype=float32),\n", + " array([0.19706792], dtype=float32),\n", + " array([0.9951959], dtype=float32),\n", + " array([0.024617], dtype=float32),\n", + " array([0.9766921], dtype=float32),\n", + " array([0.04694933], dtype=float32),\n", + " array([0.9548745], dtype=float32),\n", + " array([0.01036863], dtype=float32),\n", + " array([0.9931427], dtype=float32),\n", + " array([0.01541146], dtype=float32),\n", + " array([0.02152353], dtype=float32),\n", + " array([0.78170955], dtype=float32),\n", + " array([0.5403529], dtype=float32),\n", + " array([0.22647694], dtype=float32),\n", + " array([0.00169189], dtype=float32),\n", + " array([0.9999125], dtype=float32),\n", + " array([0.00411672], dtype=float32),\n", + " array([0.92826104], dtype=float32),\n", + " array([0.78801906], dtype=float32),\n", + " array([0.9407639], dtype=float32),\n", + " array([0.98959863], dtype=float32),\n", + " array([0.9553592], dtype=float32),\n", + " array([0.01293176], dtype=float32),\n", + " array([0.01023797], dtype=float32),\n", + " array([0.03475741], dtype=float32),\n", + " array([0.9997676], dtype=float32),\n", + " array([0.97791255], dtype=float32),\n", + " array([0.00023193], dtype=float32),\n", + " array([0.00889389], dtype=float32),\n", + " array([0.957821], dtype=float32),\n", + " array([0.8767215], dtype=float32),\n", + " array([0.12694164], dtype=float32),\n", + " array([0.00611601], dtype=float32),\n", + " 
array([0.3953812], dtype=float32),\n", + " array([0.0004641], dtype=float32),\n", + " array([0.9987463], dtype=float32),\n", + " array([0.01019047], dtype=float32),\n", + " array([0.5764202], dtype=float32),\n", + " array([0.01138657], dtype=float32),\n", + " array([0.5458222], dtype=float32),\n", + " array([0.9942966], dtype=float32),\n", + " array([0.00240233], dtype=float32),\n", + " array([0.7553514], dtype=float32),\n", + " array([0.0881868], dtype=float32),\n", + " array([0.34226933], dtype=float32),\n", + " array([0.4873583], dtype=float32),\n", + " array([0.33895075], dtype=float32),\n", + " array([0.03251609], dtype=float32),\n", + " array([0.00574167], dtype=float32),\n", + " array([0.988293], dtype=float32),\n", + " array([0.00064724], dtype=float32),\n", + " array([0.17580858], dtype=float32),\n", + " array([0.94925475], dtype=float32),\n", + " array([0.18242276], dtype=float32),\n", + " array([0.9117029], dtype=float32),\n", + " array([0.717524], dtype=float32),\n", + " array([0.9948232], dtype=float32),\n", + " array([0.41392937], dtype=float32),\n", + " array([0.39889827], dtype=float32),\n", + " array([0.21468543], dtype=float32),\n", + " array([0.00194653], dtype=float32),\n", + " array([0.8318713], dtype=float32),\n", + " array([0.9048981], dtype=float32),\n", + " array([0.00159927], dtype=float32),\n", + " array([0.01717789], dtype=float32),\n", + " array([0.99441886], dtype=float32),\n", + " array([0.9672044], dtype=float32),\n", + " array([0.9958038], dtype=float32),\n", + " array([0.8791019], dtype=float32),\n", + " array([0.9852657], dtype=float32),\n", + " array([0.09051052], dtype=float32),\n", + " array([0.00520805], dtype=float32),\n", + " array([0.4179241], dtype=float32),\n", + " array([0.02102872], dtype=float32),\n", + " array([0.999458], dtype=float32),\n", + " array([0.07276529], dtype=float32),\n", + " array([0.89086306], dtype=float32),\n", + " array([0.58678746], dtype=float32),\n", + " array([0.9981602], dtype=float32),\n", + " 
array([0.98019546], dtype=float32),\n", + " array([0.81768113], dtype=float32),\n", + " array([0.3091106], dtype=float32),\n", + " array([0.7304271], dtype=float32),\n", + " array([0.00713154], dtype=float32),\n", + " array([0.10799696], dtype=float32),\n", + " array([0.00034327], dtype=float32),\n", + " array([0.97954047], dtype=float32),\n", + " array([0.9953832], dtype=float32),\n", + " array([0.06257767], dtype=float32),\n", + " array([0.8372882], dtype=float32),\n", + " array([1.8113557e-05], dtype=float32),\n", + " array([0.04951284], dtype=float32),\n", + " array([0.04139359], dtype=float32),\n", + " array([0.5803639], dtype=float32),\n", + " array([0.01002938], dtype=float32),\n", + " array([0.44129696], dtype=float32),\n", + " array([0.88426584], dtype=float32),\n", + " array([0.01807107], dtype=float32),\n", + " array([0.87367356], dtype=float32),\n", + " array([0.09437197], dtype=float32),\n", + " array([0.98715776], dtype=float32),\n", + " array([0.06557368], dtype=float32),\n", + " array([0.9997048], dtype=float32),\n", + " array([0.5877887], dtype=float32),\n", + " array([0.10160982], dtype=float32),\n", + " array([0.2194032], dtype=float32),\n", + " array([0.996086], dtype=float32),\n", + " array([0.70603895], dtype=float32),\n", + " array([0.0575645], dtype=float32),\n", + " array([0.58087355], dtype=float32),\n", + " array([0.9330629], dtype=float32),\n", + " array([0.004917], dtype=float32),\n", + " array([0.19366205], dtype=float32),\n", + " array([0.99521846], dtype=float32),\n", + " array([0.9976768], dtype=float32),\n", + " array([0.01894422], dtype=float32),\n", + " array([0.5626045], dtype=float32),\n", + " array([0.99873656], dtype=float32),\n", + " array([0.98620087], dtype=float32),\n", + " array([0.20380375], dtype=float32),\n", + " array([0.00324226], dtype=float32),\n", + " array([0.03813465], dtype=float32),\n", + " array([0.07607552], dtype=float32),\n", + " array([0.02199142], dtype=float32),\n", + " array([0.7561464], 
dtype=float32),\n", + " array([0.9669124], dtype=float32),\n", + " array([0.86246103], dtype=float32),\n", + " array([0.189888], dtype=float32),\n", + " array([6.4221174e-05], dtype=float32),\n", + " array([0.61084515], dtype=float32),\n", + " array([0.9931891], dtype=float32),\n", + " array([0.95753783], dtype=float32),\n", + " array([0.96757764], dtype=float32),\n", + " array([0.99537355], dtype=float32),\n", + " array([0.05853846], dtype=float32),\n", + " array([0.9369336], dtype=float32),\n", + " array([0.99967706], dtype=float32),\n", + " array([0.48768336], dtype=float32),\n", + " array([0.38854727], dtype=float32),\n", + " array([0.16301523], dtype=float32),\n", + " array([0.44746688], dtype=float32),\n", + " array([0.9951616], dtype=float32),\n", + " array([0.9310025], dtype=float32),\n", + " array([0.9793833], dtype=float32),\n", + " array([0.9996581], dtype=float32),\n", + " array([0.06153212], dtype=float32),\n", + " array([0.99993515], dtype=float32),\n", + " array([8.6169755e-05], dtype=float32),\n", + " array([0.14121674], dtype=float32),\n", + " array([0.001046], dtype=float32),\n", + " array([0.96887445], dtype=float32),\n", + " array([0.9940006], dtype=float32),\n", + " array([0.20827933], dtype=float32),\n", + " array([1.3143304e-05], dtype=float32),\n", + " array([0.70770514], dtype=float32),\n", + " array([0.00062637], dtype=float32),\n", + " array([0.09923268], dtype=float32),\n", + " array([0.00062528], dtype=float32),\n", + " array([0.9974062], dtype=float32),\n", + " array([0.6399337], dtype=float32),\n", + " array([0.9582232], dtype=float32),\n", + " array([5.5980826e-07], dtype=float32),\n", + " array([0.98064935], dtype=float32),\n", + " array([0.9810916], dtype=float32),\n", + " array([0.02825824], dtype=float32),\n", + " array([0.00210933], dtype=float32),\n", + " array([0.03763315], dtype=float32),\n", + " array([0.9897635], dtype=float32),\n", + " array([0.38776097], dtype=float32),\n", + " array([0.01495247], dtype=float32),\n", + " 
array([0.00611806], dtype=float32),\n", + " array([0.998847], dtype=float32),\n", + " array([0.01276735], dtype=float32),\n", + " array([0.00442079], dtype=float32),\n", + " array([0.2124616], dtype=float32),\n", + " array([0.01237443], dtype=float32),\n", + " array([0.01144132], dtype=float32),\n", + " array([0.92837715], dtype=float32),\n", + " array([0.02206292], dtype=float32),\n", + " array([0.98381627], dtype=float32),\n", + " array([0.00593874], dtype=float32),\n", + " array([0.26435003], dtype=float32),\n", + " array([0.02000471], dtype=float32),\n", + " array([0.84790653], dtype=float32),\n", + " array([0.9852173], dtype=float32),\n", + " array([0.9987846], dtype=float32),\n", + " array([0.99995625], dtype=float32),\n", + " array([0.17164613], dtype=float32),\n", + " array([0.18840362], dtype=float32),\n", + " array([0.9717937], dtype=float32),\n", + " array([0.74185556], dtype=float32),\n", + " array([0.00340732], dtype=float32),\n", + " array([0.01526649], dtype=float32),\n", + " array([0.61485744], dtype=float32),\n", + " array([0.9119215], dtype=float32),\n", + " array([0.02722141], dtype=float32),\n", + " array([0.39047685], dtype=float32),\n", + " array([0.19983715], dtype=float32),\n", + " array([0.00018045], dtype=float32),\n", + " array([0.76507735], dtype=float32),\n", + " array([0.00108664], dtype=float32),\n", + " array([0.8838372], dtype=float32),\n", + " array([0.9674925], dtype=float32),\n", + " array([0.00014587], dtype=float32),\n", + " array([0.01428808], dtype=float32),\n", + " array([0.6684856], dtype=float32),\n", + " array([0.03062288], dtype=float32),\n", + " array([0.46116126], dtype=float32),\n", + " array([0.16899237], dtype=float32),\n", + " array([0.9975586], dtype=float32),\n", + " array([0.91609216], dtype=float32),\n", + " array([0.9852622], dtype=float32),\n", + " array([0.5730661], dtype=float32),\n", + " array([0.19011642], dtype=float32),\n", + " array([0.9962901], dtype=float32),\n", + " array([0.00494908], 
dtype=float32),\n", + " array([0.9681047], dtype=float32),\n", + " array([0.03208594], dtype=float32),\n", + " array([0.00147857], dtype=float32),\n", + " array([0.12340485], dtype=float32),\n", + " array([0.996431], dtype=float32),\n", + " array([0.9512111], dtype=float32),\n", + " array([0.9922307], dtype=float32),\n", + " array([0.02449521], dtype=float32),\n", + " array([0.9568155], dtype=float32),\n", + " array([0.99991953], dtype=float32),\n", + " array([0.9982376], dtype=float32),\n", + " array([0.1572257], dtype=float32),\n", + " array([0.34052122], dtype=float32),\n", + " array([0.6778389], dtype=float32),\n", + " array([0.9513396], dtype=float32),\n", + " array([0.99644357], dtype=float32),\n", + " array([0.3379453], dtype=float32),\n", + " array([0.9816772], dtype=float32),\n", + " array([0.01320378], dtype=float32),\n", + " array([0.00027732], dtype=float32),\n", + " array([0.99997675], dtype=float32),\n", + " array([0.49815693], dtype=float32),\n", + " array([0.00038428], dtype=float32),\n", + " array([0.03885539], dtype=float32),\n", + " array([0.5476643], dtype=float32),\n", + " array([0.9998455], dtype=float32),\n", + " array([0.9970118], dtype=float32),\n", + " array([0.5124474], dtype=float32),\n", + " array([0.38307184], dtype=float32),\n", + " array([0.99099356], dtype=float32),\n", + " array([0.25695708], dtype=float32),\n", + " array([0.9953335], dtype=float32),\n", + " array([0.97055674], dtype=float32),\n", + " array([0.4068285], dtype=float32),\n", + " array([1.4898453e-06], dtype=float32),\n", + " array([0.66622144], dtype=float32),\n", + " array([0.99686724], dtype=float32),\n", + " array([0.00997034], dtype=float32),\n", + " array([0.2946419], dtype=float32),\n", + " array([0.70338255], dtype=float32),\n", + " array([0.02406825], dtype=float32),\n", + " array([0.99934345], dtype=float32),\n", + " array([0.03414964], dtype=float32),\n", + " array([0.00095879], dtype=float32),\n", + " array([0.99705076], dtype=float32),\n", + " 
array([0.21492238], dtype=float32),\n", + " array([0.87716794], dtype=float32),\n", + " array([0.47392538], dtype=float32),\n", + " array([0.24244678], dtype=float32),\n", + " array([0.03492213], dtype=float32),\n", + " array([0.9038005], dtype=float32),\n", + " array([0.51358217], dtype=float32),\n", + " array([0.3492779], dtype=float32),\n", + " array([0.37952748], dtype=float32),\n", + " array([0.9956209], dtype=float32),\n", + " array([0.05870749], dtype=float32),\n", + " array([0.93354183], dtype=float32),\n", + " array([0.45190257], dtype=float32),\n", + " array([0.99952877], dtype=float32),\n", + " array([0.35226253], dtype=float32),\n", + " ...]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prediction = model.predict(x_test)\n", + "result = prediction.collect()\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Further experiments\n", + "\n", + "\n", + "* We were using 2 hidden layers. Try to use 1 or 3 hidden layers and see how it affects validation and test accuracy.\n", + "* Try to use layers with more or fewer hidden units: 32 units, 64 units...\n", + "* Try to use the `mse` loss function instead of `binary_crossentropy`.\n", + "* Try to use the `tanh` activation (an activation that was popular in the early days of neural networks) instead of `relu`.\n", + "\n", + "These experiments will help convince you that the architecture choices we have made are all fairly reasonable, although they can still be \n", + "improved!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusions\n", + "\n", + "\n", + "Here's what you should take away from this example:\n", + "\n", + "* There's usually quite a bit of preprocessing you need to do on your raw data in order to be able to feed it -- as tensors -- into a neural \n", + "network. 
In the case of sequences of words, they can be encoded as binary vectors -- but there are other encoding options too.\n", + "* Stacks of `Dense` layers with `relu` activations can solve a wide range of problems (including sentiment classification), and you will \n", + "likely use them frequently.\n", + "* In a binary classification problem (two output classes), your network should end with a `Dense` layer with 1 unit and a `sigmoid` activation, \n", + "i.e. the output of your network should be a scalar between 0 and 1, encoding a probability.\n", + "* With such a scalar sigmoid output, on a binary classification problem, the loss function you should use is `binary_crossentropy`.\n", + "* The `rmsprop` optimizer is generally a good enough choice of optimizer, whatever your problem. That's one less thing for you to worry \n", + "about.\n", + "* As they get better on their training data, neural networks eventually start _overfitting_ and end up obtaining increasingly worse results on \n", + "never-before-seen data. Make sure to always monitor performance on data that is outside of the training set.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## \\* Memory saving\n", + "Running this notebook with the code above requires 32g of `SPARK_DRIVER_MEMORY`, which is expensive. The following approach reduces the required `SPARK_DRIVER_MEMORY` to 12g.\n", + "\n", + "Recall the point at which you had compiled the model and prepared the datasets as `ndarray`s. In the code above, the next step was to call `fit`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model.fit(partial_x_train,\n", + " partial_y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_val, y_val))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Just hold on here! 
Instead of calling `fit` directly on the arrays, use the following code to train with less memory:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from bigdl.util.common import to_sample_rdd\n", + "\n", + "train = to_sample_rdd(partial_x_train, partial_y_train)\n", + "val = to_sample_rdd(x_val, y_val)\n", + "\n", + "model.fit(train, None,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=val)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This code zips the training data and labels into an RDD. It works because every time the `fit` method takes an `ndarray` as input, it transforms the `ndarray` into an RDD, and some memory is taken up for caching in the process. In this notebook, we use the same dataset as input repeatedly. If we perform this conversion only once and reuse the RDD afterwards, all of that subsequent memory use is saved." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/keras/3.6-classifying-newswires.ipynb b/keras/3.6-classifying-newswires.ipynb new file mode 100644 index 0000000..09ac085 --- /dev/null +++ b/keras/3.6-classifying-newswires.ipynb @@ -0,0 +1,554 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First of all, set environment variables and initialize spark context:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: SPARK_DRIVER_MEMORY=8g\n", + "env: PYSPARK_PYTHON=/usr/bin/python3.5\n", + "env: 
PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n" + ] + } + ], + "source": [ + "%env SPARK_DRIVER_MEMORY=8g\n", + "%env PYSPARK_PYTHON=/usr/bin/python3.5\n", + "%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n", + "\n", + "from zoo.common.nncontext import *\n", + "sc = init_nncontext(init_spark_conf().setMaster(\"local[4]\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Classifying newswires: a multi-class classification example\n", + "\n", + "----\n", + "\n", + "In the previous section we saw how to classify vector inputs into two mutually exclusive classes using a densely-connected neural network. \n", + "But what happens when you have more than two classes? \n", + "\n", + "In this section, we will build a network to classify Reuters newswires into 46 different mutually-exclusive topics. Since we have many \n", + "classes, this problem is an instance of \"multi-class classification\", and since each data point should be classified into only one \n", + "category, the problem is more specifically an instance of \"single-label, multi-class classification\". If each data point could \n", + "belong to multiple categories (in our case, topics) then we would be facing a \"multi-label, multi-class classification\" problem." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The Reuters dataset\n", + "\n", + "\n", + "We will be working with the _Reuters dataset_, a set of short newswires and their topics, published by Reuters in 1986. It's a very simple, \n", + "widely used toy dataset for text classification. There are 46 different topics; some topics are more represented than others, but each \n", + "topic has at least 10 examples in the training set.\n", + "\n", + "Like IMDB and MNIST, the Reuters dataset comes packaged as part of the Keras API of Analytics Zoo. 
Let's take a look right away:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras.datasets import reuters\n", + "(train_data, train_labels), (test_data, test_labels) = reuters.load_data(nb_words=10000)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As with the IMDB dataset, the argument `nb_words=10000` restricts the data to the 10,000 most frequently occurring words found in the \n", + "data.\n", + "\n", + "We have 8,982 training examples and 2,246 test examples. The next cell builds a reverse word index, which maps integer indices back to words:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "word_index = reuters.get_word_index()\n", + "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparing the data\n", + "\n", + "We can vectorize the data with the exact same code as in our previous example:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "def vectorize_sequences(sequences, dimension=10000):\n", + "    results = np.zeros((len(sequences), dimension))\n", + "    for i, sequence in enumerate(sequences):\n", + "        results[i, sequence] = 1.\n", + "    return results\n", + "\n", + "x_train = vectorize_sequences(train_data)\n", + "x_test = vectorize_sequences(test_data)\n", + "# TODO: label encoding (one-hot vs. 
integer) is still to be settled" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Building our network\n", + "\n", + "\n", + "This topic classification problem looks very similar to our previous movie review classification problem: in both cases, we are trying to \n", + "classify short snippets of text. There is, however, a new constraint here: the number of output classes has gone from 2 to 46, i.e. 
the \n", + "dimensionality of the output space is much larger. \n", + "\n", + "In a stack of `Dense` layers like what we were using, each layer can only access information present in the output of the previous layer. \n", + "If one layer drops some information relevant to the classification problem, this information can never be recovered by later layers: each \n", + "layer can potentially become an \"information bottleneck\". In our previous example, we were using 16-dimensional intermediate layers, but a \n", + "16-dimensional space may be too limited to learn to separate 46 different classes: such small layers may act as information bottlenecks, \n", + "permanently dropping relevant information.\n", + "\n", + "For this reason we will use larger layers. Let's go with 64 units:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from zoo.pipeline.api.keras import models\n", + "from zoo.pipeline.api.keras import layers\n", + "\n", + "model = models.Sequential()\n", + "model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))\n", + "model.add(layers.Dense(64, activation='relu'))\n", + "model.add(layers.Dense(46, activation='softmax'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are two other things you should note about this architecture:\n", + "\n", + "* We are ending the network with a `Dense` layer of size 46. This means that for each input sample, our network will output a \n", + "46-dimensional vector. 
Each entry in this vector (each dimension) will encode a different output class.\n", + "* The last layer uses a `softmax` activation. You have already seen this pattern in the MNIST example. It means that the network will \n", + "output a _probability distribution_ over the 46 different output classes, i.e. for every input sample, the network will produce a \n", + "46-dimensional output vector where `output[i]` is the probability that the sample belongs to class `i`. The 46 scores will sum to 1.\n", + "\n", + "The natural loss function for this setup is categorical crossentropy. It measures the distance between two probability distributions: \n", + "in our case, between the probability distribution output by our network, and the true distribution of the labels. By minimizing the \n", + "distance between these two distributions, we train our network to output something as close as possible to the true labels. Because our labels are kept as integers rather than one-hot vectors, the cell below uses the `sparse_categorical_crossentropy` variant." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createRMSprop\n", + "creating: createZooKerasSparseCategoricalCrossEntropy\n", + "creating: createZooKerasSparseCategoricalAccuracy\n" + ] + } + ], + "source": [ + "model.compile(optimizer='rmsprop',\n", + " loss='sparse_categorical_crossentropy',\n", + " metrics=['accuracy'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Validating our approach\n", + "\n", + "Let's set aside 1,000 samples in our training data to use as a validation set:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "x_val = x_train[:1000]\n", + "partial_x_train = x_train[1000:]\n", + "\n", + "y_val = train_labels[:1000]\n", + "partial_y_train = train_labels[1000:] # this slice returns a list\n", + "partial_y_train = np.array(partial_y_train) # convert list to ndarray" + ] + }, + { + "cell_type": "markdown", + "metadata": 
{}, + "source": [ + "Now let's train our network for 20 epochs:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "dir_name = '3-5 ' + str(time.ctime())\n", + "model.set_tensorboard('./', dir_name)\n", + "model.fit(partial_x_train,\n", + " partial_y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_val, y_val))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 512 records in 0.03322949 seconds. Throughput is 15408.001 records/second. Loss is 0.36856997.\n", + "Top1Accuracy is Accuracy(correct: 808, count: 1000, accuracy: 0.808)_" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "train_loss = np.array(model.get_train_summary('Loss'))\n", + "val_loss = np.array(model.get_validation_summary('Loss'))\n", + "\n", + "import matplotlib.pyplot as plt\n", + "plt.plot(train_loss[:,0],train_loss[:,1],label='train loss')\n", + "plt.plot(val_loss[:,0],val_loss[:,1],label='validation loss',color='green')\n", + "plt.title('Training and validation loss')\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It seems that the network starts overfitting after 8 epochs. Let's train a new network from scratch for 8 epochs, then let's evaluate it on \n", + "the test set:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasSparseCategoricalCrossEntropy\n", + "creating: createZooKerasSparseCategoricalAccuracy\n" + ] + } + ], + "source": [ + "model = models.Sequential()\n", + "model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))\n", + "model.add(layers.Dense(64, activation='relu'))\n", + "model.add(layers.Dense(46, activation='softmax'))\n", + "\n", + "model.compile(optimizer='rmsprop',\n", + " loss='sparse_categorical_crossentropy',\n", + " metrics=['accuracy'])\n", + "model.fit(partial_x_train,\n", + " partial_y_train,\n", + " nb_epoch=8,\n", + " batch_size=512,\n", + " validation_data=(x_val, y_val))\n", + "y_test = np.array(test_labels).astype('float32')\n", + "results = model.evaluate(x_test, y_test)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + 
{ + "data": { + "text/plain": [ + "[0.9659086465835571, 0.8032057285308838]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our approach reaches an accuracy of ~80%. With a balanced binary classification problem, the accuracy reached by a purely random classifier \n", + "would be 50%, but in our case it is closer to 19%, so our results seem pretty good, at least when compared to a random baseline:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.19011576135351738" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import copy\n", + "\n", + "test_labels_copy = copy.copy(test_labels)\n", + "np.random.shuffle(test_labels_copy)\n", + "float(np.sum(np.array(test_labels) == np.array(test_labels_copy))) / len(test_labels)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generating predictions on new data\n", + "\n", + "We can verify that the `predict` method of our model instance returns a probability distribution over all 46 topics. 
Let's generate topic \n", + "predictions for all of the test data:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "predictions = model.predict(x_test).collect()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each entry in `predictions` is a vector of length 46:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(46,)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "predictions[0].shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The coefficients in this vector sum to 1:" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.99999994" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.sum(predictions[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The largest entry is the predicted class, i.e. the class with the highest probability:" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "4" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.argmax(predictions[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Further experiments\n", + "\n", + "* Try using larger or smaller layers: 32 units, 128 units...\n", + "* We were using two hidden layers. Now try to use a single hidden layer, or three hidden layers." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Wrapping up\n", + "\n", + "\n", + "Here's what you should take away from this example:\n", + "\n", + "* If you are trying to classify data points between N classes, your network should end with a `Dense` layer of size N.\n", + "* In a single-label, multi-class classification problem, your network should end with a `softmax` activation, so that it will output a \n", + "probability distribution over the N output classes.\n", + "* _Categorical crossentropy_ is almost always the loss function you should use for such problems. It minimizes the distance between the \n", + "probability distributions output by the network, and the true distribution of the targets.\n", + "* There are two ways to handle labels in multi-class classification:\n", + " ** Encoding the labels via \"categorical encoding\" (also known as \"one-hot encoding\") and using `categorical_crossentropy` as your loss \n", + "function.\n", + " ** Encoding the labels as integers and using the `sparse_categorical_crossentropy` loss function.\n", + "* If you need to classify data into a large number of categories, then you should avoid creating information bottlenecks in your network by having \n", + "intermediate layers that are too small." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/keras/3.7-predicting-house-prices.ipynb b/keras/3.7-predicting-house-prices.ipynb new file mode 100644 index 0000000..0f37622 --- /dev/null +++ b/keras/3.7-predicting-house-prices.ipynb @@ -0,0 +1,797 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First of all, set environment variables and initialize spark context:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: SPARK_DRIVER_MEMORY=8g\n", + "env: PYSPARK_PYTHON=/usr/bin/python3.5\n", + "env: PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n" + ] + } + ], + "source": [ + "%env SPARK_DRIVER_MEMORY=8g\n", + "%env PYSPARK_PYTHON=/usr/bin/python3.5\n", + "%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n", + "\n", + "from zoo.common.nncontext import *\n", + "sc = init_nncontext(init_spark_conf().setMaster(\"local[4]\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Predicting house prices: a regression example\n", + "\n", + "\n", + "----\n", + "\n", + "\n", + "In our two previous examples, we were considering classification problems, where the goal was to predict a single discrete label of an \n", + "input data point. Another common type of machine learning problem is \"regression\", which consists of predicting a continuous value instead \n", + "of a discrete label. 
For instance, predicting the temperature tomorrow, given meteorological data, or predicting the time that a \n", + "software project will take to complete, given its specifications.\n", + "\n", + "Do not mix up \"regression\" with the algorithm \"logistic regression\": confusingly, \"logistic regression\" is not a regression algorithm, \n", + "it is a classification algorithm." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The Boston Housing Price dataset\n", + "\n", + "\n", + "We will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the \n", + "suburb at the time, such as the crime rate, the local property tax rate, etc.\n", + "\n", + "The dataset we will be using has another interesting difference from our two previous examples: it has very few data points, only 506 in \n", + "total, split between 404 training samples and 102 test samples, and each \"feature\" in the input data (e.g. the crime rate is a feature) has \n", + "a different scale. 
For instance, some values are proportions, which take values between 0 and 1; others take values between 1 and 12, \n", + "others between 0 and 100...\n", + "\n", + "Let's take a look at the data:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras.datasets import boston_housing\n", + "(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(404, 13)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_data.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(102, 13)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_data.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, we have 404 training samples and 102 test samples. The data comprises 13 features. The 13 features in the input data are as \n", + "follows:\n", + "\n", + "1. Per capita crime rate.\n", + "2. Proportion of residential land zoned for lots over 25,000 square feet.\n", + "3. Proportion of non-retail business acres per town.\n", + "4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n", + "5. Nitric oxides concentration (parts per 10 million).\n", + "6. Average number of rooms per dwelling.\n", + "7. Proportion of owner-occupied units built prior to 1940.\n", + "8. Weighted distances to five Boston employment centres.\n", + "9. Index of accessibility to radial highways.\n", + "10. Full-value property-tax rate per $10,000.\n", + "11. Pupil-teacher ratio by town.\n", + "12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.\n", + "13. 
% lower status of the population.\n", + "\n", + "The targets are the median values of owner-occupied homes, in thousands of dollars:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([22.6, 50. , 23. , 8.3, 21.2, 19.9, 20.6, 18.7, 16.1, 18.6, 8.8,\n", + " 17.2, 14.9, 10.5, 50. , 29. , 23. , 33.3, 29.4, 21. , 23.8, 19.1,\n", + " 20.4, 29.1, 19.3, 23.1, 19.6, 19.4, 38.7, 18.7, 14.6, 20. , 20.5,\n", + " 20.1, 23.6, 16.8, 5.6, 50. , 14.5, 13.3, 23.9, 20. , 19.8, 13.8,\n", + " 16.5, 21.6, 20.3, 17. , 11.8, 27.5, 15.6, 23.1, 24.3, 42.8, 15.6,\n", + " 21.7, 17.1, 17.2, 15. , 21.7, 18.6, 21. , 33.1, 31.5, 20.1, 29.8,\n", + " 15.2, 15. , 27.5, 22.6, 20. , 21.4, 23.5, 31.2, 23.7, 7.4, 48.3,\n", + " 24.4, 22.6, 18.3, 23.3, 17.1, 27.9, 44.8, 50. , 23. , 21.4, 10.2,\n", + " 23.3, 23.2, 18.9, 13.4, 21.9, 24.8, 11.9, 24.3, 13.8, 24.7, 14.1,\n", + " 18.7, 28.1, 19.8, 26.7, 21.7, 22. , 22.9, 10.4, 21.9, 20.6, 26.4,\n", + " 41.3, 17.2, 27.1, 20.4, 16.5, 24.4, 8.4, 23. , 9.7, 50. , 30.5,\n", + " 12.3, 19.4, 21.2, 20.3, 18.8, 33.4, 18.5, 19.6, 33.2, 13.1, 7.5,\n", + " 13.6, 17.4, 8.4, 35.4, 24. , 13.4, 26.2, 7.2, 13.1, 24.5, 37.2,\n", + " 25. , 24.1, 16.6, 32.9, 36.2, 11. , 7.2, 22.8, 28.7, 14.4, 24.4,\n", + " 18.1, 22.5, 20.5, 15.2, 17.4, 13.6, 8.7, 18.2, 35.4, 31.7, 33. ,\n", + " 22.2, 20.4, 23.9, 25. , 12.7, 29.1, 12. , 17.7, 27. , 20.6, 10.2,\n", + " 17.5, 19.7, 29.8, 20.5, 14.9, 10.9, 19.5, 22.7, 19.5, 24.6, 25. ,\n", + " 24.5, 50. , 14.3, 11.8, 31. , 28.7, 16.2, 43.5, 25. , 22. , 19.9,\n", + " 22.1, 46. , 22.9, 20.2, 43.1, 34.6, 13.8, 24.3, 21.5, 24.4, 21.2,\n", + " 23.8, 26.6, 25.1, 9.6, 19.4, 19.4, 9.5, 14. , 26.5, 13.8, 34.7,\n", + " 16.3, 21.7, 17.5, 15.6, 20.9, 21.7, 12.7, 18.5, 23.7, 19.3, 12.7,\n", + " 21.6, 23.2, 29.6, 21.2, 23.8, 17.1, 22. , 36.5, 18.8, 21.9, 23.1,\n", + " 20.2, 17.4, 37. , 24.1, 36.2, 15.7, 32.2, 13.5, 17.9, 13.3, 11.7,\n", + " 41.7, 18.4, 13.1, 25. , 21.2, 16. 
, 34.9, 25.2, 24.8, 21.5, 23.4,\n", + " 18.9, 10.8, 21. , 27.5, 17.5, 13.5, 28.7, 14.8, 19.1, 28.6, 13.1,\n", + " 19. , 11.3, 13.3, 22.4, 20.1, 18.2, 22.9, 20.6, 25. , 12.8, 34.9,\n", + " 23.7, 50. , 29. , 30.1, 22. , 15.6, 23.3, 30.1, 14.3, 22.8, 50. ,\n", + " 20.8, 6.3, 34.9, 32.4, 19.9, 20.3, 17.8, 23.1, 20.4, 23.2, 7. ,\n", + " 16.8, 46.7, 50. , 22.9, 23.9, 21.4, 21.7, 15.4, 15.3, 23.1, 23.9,\n", + " 19.4, 11.9, 17.8, 31.5, 33.8, 20.8, 19.8, 22.4, 5. , 24.5, 19.4,\n", + " 15.1, 18.2, 19.3, 27.1, 20.7, 37.6, 11.7, 33.4, 30.1, 21.4, 45.4,\n", + " 20.1, 20.8, 26.4, 10.4, 21.8, 32. , 21.7, 18.4, 37.9, 17.8, 28. ,\n", + " 28.2, 36. , 18.9, 15. , 22.5, 30.7, 20. , 19.1, 23.3, 26.6, 21.1,\n", + " 19.7, 20. , 12.1, 7.2, 14.2, 17.3, 27.5, 22.2, 10.9, 19.2, 32. ,\n", + " 14.5, 24.7, 12.6, 24. , 24.1, 50. , 16.1, 43.8, 26.6, 36.1, 21.8,\n", + " 29.9, 50. , 44. , 20.6, 19.6, 28.4, 19.1, 22.3, 20.9, 28.4, 14.4,\n", + " 32.7, 13.8, 8.5, 22.5, 35.1, 31.6, 17.8, 15.6])" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_targets" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The prices are typically between \\$10,000 and \\$50,000. If that sounds cheap, remember this was the mid-1970s, and these prices are not \n", + "inflation-adjusted." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparing the data\n", + "\n", + "\n", + "It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to \n", + "automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. 
A widespread best practice to deal \n", + "with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we \n", + "will subtract the mean of the feature and divide by the standard deviation, so that the feature will be centered around 0 and will have a \n", + "unit standard deviation. This is easily done in Numpy:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "mean = train_data.mean(axis=0)\n", + "train_data -= mean\n", + "std = train_data.std(axis=0)\n", + "train_data /= std\n", + "\n", + "test_data -= mean\n", + "test_data /= std" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the quantities that we use for normalizing the test data have been computed using the training data. We should never use in our \n", + "workflow any quantity computed on the test data, even for something as simple as data normalization." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Building our network\n", + "\n", + "\n", + "Because so few samples are available, we will be using a very small network with two \n", + "hidden layers, each with 64 units. In general, the less training data you have, the worse overfitting will be, and using \n", + "a small network is one way to mitigate overfitting." 
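To get a feel for what "very small" means here, the sketch below counts the learnable parameters of a stack of fully connected layers. It assumes 13 input features (the shape printed earlier), two hidden layers of 64 units, and a single output unit as in the model defined in this notebook; `dense_param_count` is a hypothetical helper written for this illustration, not part of Analytics Zoo.

```python
def dense_param_count(input_dim, units):
    # A fully connected layer has one weight per (input, unit) pair
    # plus one bias per unit.
    return input_dim * units + units

num_features = 13          # columns in train_data
layer_units = [64, 64, 1]  # two hidden layers and the single output unit

total = 0
input_dim = num_features
for units in layer_units:
    total += dense_param_count(input_dim, units)
    input_dim = units  # this layer's output feeds the next layer

print(total)  # 896 + 4160 + 65 = 5121
```

Even this small network has roughly twelve times more parameters than the 404 training samples, which is one reason overfitting remains a concern and careful validation matters.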
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras import models\n", + "from zoo.pipeline.api.keras import layers\n", + "\n", + "def build_model():\n", + "    # Because we will need to instantiate\n", + "    # the same model multiple times,\n", + "    # we use a function to construct it.\n", + "    model = models.Sequential()\n", + "    model.add(layers.Dense(64, activation='relu',\n", + "                           input_shape=(train_data.shape[1],)))\n", + "    model.add(layers.Dense(64, activation='relu'))\n", + "    model.add(layers.Dense(1))\n", + "    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])\n", + "    return model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our network ends with a single unit, and no activation (i.e. it will be a linear layer). \n", + "This is a typical setup for scalar regression (i.e. regression where we are trying to predict a single continuous value). \n", + "Applying an activation function would constrain the range that the output can take; for instance if \n", + "we applied a `sigmoid` activation function to our last layer, the network could only learn to predict values between 0 and 1. Here, because \n", + "the last layer is purely linear, the network is free to learn to predict values in any range.\n", + "\n", + "Note that we are compiling the network with the `mse` loss function -- Mean Squared Error, the mean of the squared differences between the \n", + "predictions and the targets, a widely used loss function for regression problems.\n", + "\n", + "We are also monitoring a new metric during training: `mae`. This stands for Mean Absolute Error. It is simply the mean of the absolute values of the \n", + "differences between the predictions and the targets. For instance, a MAE of 0.5 on this problem would mean that our predictions are off by \n", + "\\$500 on average."
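The two quantities described above can be written out in a few lines of plain Python. This is a minimal sketch with made-up predictions and targets (in thousands of dollars, like the dataset's targets); the function names are chosen for this illustration and are not the library's own implementations.

```python
def mean_squared_error(predictions, targets):
    # Average of the squared differences.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def mean_absolute_error(predictions, targets):
    # Average of the absolute differences.
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# Made-up values for illustration only.
targets = [21.5, 20.0, 30.0]
predictions = [22.0, 19.5, 31.0]

mse = mean_squared_error(predictions, targets)
mae = mean_absolute_error(predictions, targets)
print(mse, mae)
```

Because MSE squares each error, the single 1.0 error dominates it, while MAE stays in the same units as the targets and can be read directly as "thousands of dollars off on average".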
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Validating our approach using K-fold validation\n", + "\n", + "\n", + "To evaluate our network while we keep adjusting its parameters (such as the number of epochs used for training), we could simply split the \n", + "data into a training set and a validation set, as we were doing in our previous examples. However, because we have so few data points, the \n", + "validation set would end up being very small (e.g. about 100 examples). A consequence is that our validation scores may change a lot \n", + "depending on _which_ data points we choose to use for validation and which we choose for training, i.e. the validation scores may have a \n", + "high _variance_ with regard to the validation split. This would prevent us from reliably evaluating our model.\n", + "\n", + "The best practice in such situations is to use K-fold cross-validation. It consists of splitting the available data into K partitions \n", + "(typically K=4 or 5), then instantiating K identical models, and training each one on K-1 partitions while evaluating on the remaining \n", + "partition. The validation score for the model used would then be the average of the K validation scores obtained." 
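The partitioning described above can be sketched independently of any model. The generator below is a hypothetical helper (`k_fold_indices` is not an Analytics Zoo function) that mirrors the slicing used in this notebook's training loop: each fold holds `num_samples // k` samples for validation and everything else is used for training.

```python
def k_fold_indices(num_samples, k):
    # Yield (train_indices, val_indices) for each of the k folds.
    # Each validation fold holds num_samples // k samples, so any
    # remainder samples are only ever used for training.
    fold_size = num_samples // k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        val_indices = list(range(start, stop))
        train_indices = list(range(0, start)) + list(range(stop, num_samples))
        yield train_indices, val_indices

# 404 training samples and K=4, as in this example.
folds = list(k_fold_indices(num_samples=404, k=4))
for train_indices, val_indices in folds:
    print(len(train_indices), len(val_indices))
```

With 404 samples and K=4, every fold validates on 101 samples and trains on the other 303, and each sample is used for validation exactly once.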
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then let's start our training:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "processing fold # 0\n", + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n", + "processing fold # 1\n", + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n", + "processing fold # 2\n", + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n", + "processing fold # 3\n", + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "\n", + "k = 4\n", + "num_val_samples = len(train_data) // k\n", + "num_nb_epoch = 50\n", + "all_scores = []\n", + "for i in range(k):\n", + " print('processing fold #', i)\n", + " # Prepare the validation data: data from partition # k\n", + " val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]\n", + " val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]\n", + "\n", + " # Prepare the training data: data from all other partitions\n", + " partial_train_data = 
np.concatenate(\n", + " [train_data[:i * num_val_samples],\n", + " train_data[(i + 1) * num_val_samples:]],\n", + " axis=0)\n", + " partial_train_targets = np.concatenate(\n", + " [train_targets[:i * num_val_samples],\n", + " train_targets[(i + 1) * num_val_samples:]],\n", + " axis=0)\n", + "\n", + " # Build the model (already compiled)\n", + " model = build_model()\n", + " # Train the model (in silent mode, verbose=0)\n", + " #model.fit(partial_train_data, partial_train_targets,\n", + " # nb_epoch=num_nb_epoch, batch_size=1, verbose=0)\n", + " model.fit(partial_train_data, partial_train_targets,\n", + " nb_epoch=num_nb_epoch, batch_size=16)\n", + "\n", + " # Evaluate the model on the validation data\n", + " #val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)\n", + " val_mse, val_mae = model.evaluate(val_data, val_targets)\n", + " all_scores.append(val_mae)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 16 records in 0.011235845 seconds. Throughput is 1424.0139 records/second. Loss is 8.708786._\n", + "\n", + "_INFO - Trained 16 records in 0.009535034 seconds. Throughput is 1678.0223 records/second. Loss is 5.3613434._\n", + "\n", + "_INFO - Trained 16 records in 0.008636178 seconds. Throughput is 1852.6713 records/second. Loss is 18.106756._\n", + "\n", + "_INFO - Trained 16 records in 0.009207628 seconds. Throughput is 1737.6897 records/second. 
Loss is 7.0931993._" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[3.291872501373291, 2.496018171310425, 2.221175193786621, 2.6994853019714355]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "all_scores" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.677137792110443" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.mean(all_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, the different runs do indeed show rather different validation scores, from 2.2 to 3.3. Their average (2.7) is a much more \n", + "reliable metric than any single one of these scores -- that's the entire point of K-fold cross-validation. In this case, we are off by \\$2,700 on \n", + "average, which is still significant considering that the prices range from \\$10,000 to \\$50,000. \n", + "\n", + "Let's try training the network for a bit longer: 500 epochs. 
To keep a record of how well the model did at each epoch, we will modify our training loop \n", + "to save the per-epoch validation score log:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "processing fold # 0\n", + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n", + "processing fold # 1\n", + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n", + "processing fold # 2\n", + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n", + "processing fold # 3\n", + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n" + ] + } + ], + "source": [ + "num_epochs = 500\n", + "all_mae_histories = []\n", + "for i in range(k):\n", + " print('processing fold #', i)\n", + " # Prepare the validation data: data from partition # k\n", + " val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]\n", + " val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]\n", + "\n", + " # Prepare the training data: data from all other partitions\n", + " partial_train_data = np.concatenate(\n", + " [train_data[:i * 
num_val_samples],\n", + " train_data[(i + 1) * num_val_samples:]],\n", + " axis=0)\n", + " partial_train_targets = np.concatenate(\n", + " [train_targets[:i * num_val_samples],\n", + " train_targets[(i + 1) * num_val_samples:]],\n", + " axis=0)\n", + "\n", + " # Build the model (already compiled)\n", + " model = build_model()\n", + " # Train the model (in silent mode, verbose=0)\n", + " import time\n", + " dir_name = '3-7 ' + str(time.ctime())\n", + " model.set_tensorboard('./', dir_name)\n", + " history = model.fit(partial_train_data, partial_train_targets,\n", + " validation_data=(val_data, val_targets),\n", + " nb_epoch=num_epochs, batch_size=16)\n", + " \n", + " #mae_history = history.history['val_mean_absolute_error']\n", + " mae_history = model.get_validation_summary(\"Loss\")\n", + " all_mae_histories.append(mae_history)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can then compute the average of the per-epoch MAE scores for all folds:" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[1.90000000e+01, 4.05375427e+02, 1.55307042e+09],\n", + " [3.80000000e+01, 2.64351837e+02, 1.55307042e+09],\n", + " [5.70000000e+01, 1.50977859e+02, 1.55307042e+09],\n", + " ...,\n", + " [9.46200000e+03, 2.07635689e+01, 1.55307053e+09],\n", + " [9.48100000e+03, 2.02473850e+01, 1.55307053e+09],\n", + " [9.50000000e+03, 2.02105141e+01, 1.55307053e+09]],\n", + "\n", + " [[1.90000000e+01, 4.76980957e+02, 1.55307053e+09],\n", + " [3.80000000e+01, 3.29584198e+02, 1.55307053e+09],\n", + " [5.70000000e+01, 1.80655548e+02, 1.55307053e+09],\n", + " ...,\n", + " [9.46200000e+03, 1.73588219e+01, 1.55307064e+09],\n", + " [9.48100000e+03, 1.78555279e+01, 1.55307064e+09],\n", + " [9.50000000e+03, 1.73744106e+01, 1.55307064e+09]],\n", + "\n", + " [[1.90000000e+01, 4.62182434e+02, 1.55307064e+09],\n", + " [3.80000000e+01, 3.34037567e+02, 1.55307064e+09],\n", + " 
[5.70000000e+01, 2.06141006e+02, 1.55307064e+09],\n", + " ...,\n", + " [9.46200000e+03, 1.72124062e+01, 1.55307075e+09],\n", + " [9.48100000e+03, 1.75751667e+01, 1.55307075e+09],\n", + " [9.50000000e+03, 1.74055386e+01, 1.55307075e+09]],\n", + "\n", + " [[1.90000000e+01, 5.21177673e+02, 1.55307075e+09],\n", + " [3.80000000e+01, 3.99685974e+02, 1.55307075e+09],\n", + " [5.70000000e+01, 2.67611786e+02, 1.55307075e+09],\n", + " ...,\n", + " [9.46200000e+03, 1.75390892e+01, 1.55307085e+09],\n", + " [9.48100000e+03, 1.76337471e+01, 1.55307085e+09],\n", + " [9.50000000e+03, 1.91227703e+01, 1.55307085e+09]]])" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "all_mae_histories = np.array(all_mae_histories)\n", + "all_mae_histories" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, `all_mae_histories` is a 3-d array whose last dimension holds 3-element records. It is built from four 2-d arrays, one per fold, and the first element of each record is the same across all four. The first element of a record is the training step and the third element is a timestamp. You do not need to worry about them; let's just average over the first axis of this 3-d array. What we actually want is the second element of each record, which holds the recorded validation value (the code above requests the `Loss` summary). 
" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[1.90000000e+01, 4.66429123e+02, 1.55307058e+09],\n", + " [3.80000000e+01, 3.31914894e+02, 1.55307058e+09],\n", + " [5.70000000e+01, 2.01346550e+02, 1.55307058e+09],\n", + " ...,\n", + " [9.46200000e+03, 1.82184715e+01, 1.55307069e+09],\n", + " [9.48100000e+03, 1.83279567e+01, 1.55307069e+09],\n", + " [9.50000000e+03, 1.85283084e+01, 1.55307069e+09]])" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "average_mae_history = np.mean(all_mae_histories, axis=0)\n", + "average_mae_history" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, this operation does not mess up the first elements since they are all equal through the first axis. And we do not need to care about the third element because it is useless at this time.\n", + "\n", + "Let's plot this:" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "plt.plot(average_mae_history[:,0],average_mae_history[:,1])\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Validation MAE')\n", + "plt.ylim((14, 20))\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's plot this:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "According to this plot, it seems that validation MAE stops improving significantly after 150 epochs. Past that point, we start overfitting.\n", + "\n", + "Once we are done tuning other parameters of our model (besides the number of epochs, we could also adjust the size of the hidden layers), we \n", + "can train a final \"production\" model on all of the training data, with the best parameters, then look at its performance on the test data:" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasMeanSquaredError\n", + "creating: createZooKerasMAE\n" + ] + } + ], + "source": [ + "# Get a fresh, compiled model.\n", + "model = build_model()\n", + "# Train it on the entirety of the data.\n", + "model.fit(train_data, train_targets,\n", + " nb_epoch=150, batch_size=16)\n", + "test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.7991065979003906" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_mae_score" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, 
+ "source": [ + "We are still off by about \\$1,800." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Wrapping up\n", + "\n", + "\n", + "Here's what you should take away from this example:\n", + "\n", + "* Regression is done using different loss functions from classification; Mean Squared Error (MSE) is a commonly used loss function for \n", + "regression.\n", + "* Similarly, evaluation metrics to be used for regression differ from those used for classification; naturally the concept of \"accuracy\" \n", + "does not apply for regression. A common regression metric is Mean Absolute Error (MAE).\n", + "* When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.\n", + "* When there is little data available, using K-Fold validation is a great way to reliably evaluate a model.\n", + "* When little training data is available, it is preferable to use a small network with very few hidden layers (typically only one or two), \n", + "in order to avoid severe overfitting.\n", + "\n", + "This example concludes our series of three introductory practical examples. You are now able to handle common types of problems with vector data input:\n", + "\n", + "* Binary (2-class) classification.\n", + "* Multi-class, single-label classification.\n", + "* Scalar regression.\n", + "\n", + "In the next chapter, you will acquire a more formal understanding of some of the concepts you have encountered in these first examples, \n", + "such as data preprocessing, model evaluation, and overfitting." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/keras/4.4-overfitting-and-underfitting.ipynb b/keras/4.4-overfitting-and-underfitting.ipynb new file mode 100644 index 0000000..3e9c188 --- /dev/null +++ b/keras/4.4-overfitting-and-underfitting.ipynb @@ -0,0 +1,711 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First of all, set environment variables and initialize spark context:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: SPARK_DRIVER_MEMORY=32g\n", + "env: PYSPARK_PYTHON=/usr/bin/python3.5\n", + "env: PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n" + ] + } + ], + "source": [ + "%env SPARK_DRIVER_MEMORY=32g\n", + "%env PYSPARK_PYTHON=/usr/bin/python3.5\n", + "%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n", + "\n", + "from zoo.common.nncontext import *\n", + "sc = init_nncontext(init_spark_conf().setMaster(\"local[4]\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that you need to set `SPARK_DRIVER_MEMORY` to 32g if you want to run this notebook to completion. 
If your machine does not have that much memory to spare, see the memory-saving approach in [Chapter 3.5](https://github.com/intel-analytics/zoo-tutorials/blob/master/keras/3.7-predicting-house-prices.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Overfitting and underfitting\n", + "\n", + "\n", + "----\n", + "\n", + "\n", + "In all the examples we saw in the previous chapter -- movie review sentiment prediction, topic classification, and house price regression -- \n", + "we could notice that the performance of our model on the held-out validation data would always peak after a few epochs and would then start \n", + "degrading, i.e. our model would quickly start to _overfit_ to the training data. Overfitting happens in every single machine learning \n", + "problem. Learning how to deal with overfitting is essential to mastering machine learning.\n", + "\n", + "The fundamental issue in machine learning is the tension between optimization and generalization. \"Optimization\" refers to the process of \n", + "adjusting a model to get the best performance possible on the training data (the \"learning\" in \"machine learning\"), while \"generalization\" \n", + "refers to how well the trained model would perform on data it has never seen before. The goal of the game is to get good generalization, of \n", + "course, but you do not control generalization; you can only adjust the model based on its training data.\n", + "\n", + "At the beginning of training, optimization and generalization are correlated: the lower your loss on training data, the lower your loss on \n", + "test data. While this is happening, your model is said to be _under-fit_: there is still progress to be made; the network hasn't yet \n", + "modeled all relevant patterns in the training data. 
But after a certain number of iterations on the training data, generalization stops \n", + "improving, validation metrics stall and then start degrading: the model is then starting to over-fit, i.e. it is starting to learn patterns \n", + "that are specific to the training data but that are misleading or irrelevant when it comes to new data.\n", + "\n", + "To prevent a model from learning misleading or irrelevant patterns found in the training data, _the best solution is of course to get \n", + "more training data_. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution \n", + "is to modulate the quantity of information that your model is allowed to store, or to add constraints on what information it is allowed to \n", + "store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most \n", + "prominent patterns, which have a better chance of generalizing well.\n", + "\n", + "The process of fighting overfitting in this way is called _regularization_. Let's review some of the most common regularization \n", + "techniques, and let's apply them in practice to improve our movie classification model from the previous chapter." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: in this notebook we will be using the IMDB test set as our validation set. 
It doesn't matter in this context.\n", + "\n", + "Let's prepare the data using the code from Chapter 3, Section 5:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras.datasets import imdb\n", + "import numpy as np\n", + "(train_data, train_labels), (test_data, test_labels) = imdb.load_data(nb_words=10000)\n", + "\n", + "def vectorize_sequences(sequences, dimension=10000):\n", + "    # Create an all-zero matrix of shape (len(sequences), dimension)\n", + "    results = np.zeros((len(sequences), dimension))\n", + "    for i, sequence in enumerate(sequences):\n", + "        results[i, sequence] = 1.  # set specific indices of results[i] to 1s\n", + "    return results\n", + "\n", + "x_train = vectorize_sequences(train_data)\n", + "x_test = vectorize_sequences(test_data)\n", + "\n", + "y_train = np.asarray(train_labels).astype('float32')\n", + "y_test = np.asarray(test_labels).astype('float32')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fighting overfitting\n", + "\n", + "## Reducing the network's size\n", + "\n", + "\n", + "The simplest way to prevent overfitting is to reduce the size of the model, i.e. the number of learnable parameters in the model (which is \n", + "determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is \n", + "often referred to as the model's \"capacity\". Intuitively, a model with more parameters will have more \"memorization capacity\" and therefore \n", + "will be able to easily learn a perfect dictionary-like mapping between training samples and their targets, a mapping without any \n", + "generalization power. For instance, a model with 500,000 binary parameters could easily be made to learn the class of every digit in the \n", + "MNIST training set: we would only need 10 binary parameters for each of the 50,000 digits. 
Such a model would be useless for classifying \n", + "new digit samples. Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge \n", + "is generalization, not fitting.\n", + "\n", + "On the other hand, if the network has limited memorization resources, it will not be able to learn this mapping as easily, and thus, in \n", + "order to minimize its loss, it will have to resort to learning compressed representations that have predictive power regarding the targets \n", + "-- precisely the type of representations that we are interested in. At the same time, keep in mind that you should be using models that have \n", + "enough parameters that they won't be underfitting: your model shouldn't be starved for memorization resources. There is a compromise to be \n", + "found between \"too much capacity\" and \"not enough capacity\".\n", + "\n", + "Unfortunately, there is no magical formula to determine what the right number of layers is, or what the right size for each layer is. You \n", + "will have to evaluate an array of different architectures (on your validation set, not on your test set, of course) in order to find the \n", + "right model size for your data. The general workflow to find an appropriate model size is to start with relatively few layers and \n", + "parameters, and start increasing the size of the layers or adding new layers until you see diminishing returns with regard to the \n", + "validation loss.\n", + "\n", + "Let's try this on our movie review classification network. 
Our original network looked like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "from zoo.pipeline.api.keras import models\n", + "from zoo.pipeline.api.keras import layers\n", + "\n", + "original_model = models.Sequential()\n", + "original_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))\n", + "original_model.add(layers.Dense(16, activation='relu'))\n", + "original_model.add(layers.Dense(1, activation='sigmoid'))\n", + "\n", + "original_model.compile(optimizer='rmsprop',\n", + " loss='binary_crossentropy',\n", + " metrics=['acc'])\n", + "\n", + "import time\n", + "dir_name = '4-4 ' + str(time.ctime())\n", + "original_model.set_tensorboard('./', dir_name)\n", + "original_model.fit(x_train, y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_test, y_test))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 512 records in 0.024455326 seconds. Throughput is 20936.135 records/second. 
Loss is 0.01585226.\n", + "Top1Accuracy is Accuracy(correct: 21341, count: 25000, accuracy: 0.85364)_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's try to replace it with this smaller network:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "smaller_model = models.Sequential()\n", + "smaller_model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))\n", + "smaller_model.add(layers.Dense(4, activation='relu'))\n", + "smaller_model.add(layers.Dense(1, activation='sigmoid'))\n", + "\n", + "smaller_model.compile(optimizer='rmsprop',\n", + " loss='binary_crossentropy',\n", + " metrics=['acc'])\n", + "\n", + "dir_name = '4-4 ' + str(time.ctime())\n", + "smaller_model.set_tensorboard('./', dir_name)\n", + "smaller_model.fit(x_train, y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_test, y_test))" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "original_val_loss = np.array(original_model.get_validation_summary(\"Loss\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's a comparison of the validation losses of the original network and the smaller network (remember: a lower validation loss \n", + "signals a better model):" + ] + }, + { + "cell_type": "code", + "execution_count": 
8, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "smaller_val_loss = np.array(smaller_model.get_validation_summary(\"Loss\"))\n", + "\n", + "plt.plot(original_val_loss[:, 0], original_val_loss[:, 1], label='original model')\n", + "plt.plot(smaller_val_loss[:, 0], smaller_val_loss[:, 1], label='smaller model', color='green')\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, the smaller network starts overfitting later than the reference one (the original one starts overfitting at about 150 to 200 steps, which is about 3 or 4 epochs) and its performance \n", + "degrades much more slowly once it starts overfitting.\n", + "\n", + "Now, for kicks, let's add to this benchmark a network that has much more capacity, far more than the problem would warrant:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "bigger_model = models.Sequential()\n", + "bigger_model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))\n", + "bigger_model.add(layers.Dense(512, activation='relu'))\n", + "bigger_model.add(layers.Dense(1, activation='sigmoid'))\n", + "\n", + "bigger_model.compile(optimizer='rmsprop',\n", + " loss='binary_crossentropy',\n", + " metrics=['acc'])" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "dir_name = '4-4 ' + str(time.ctime())\n", + "bigger_model.set_tensorboard('./', dir_name)\n", + 
"bigger_model.fit(x_train, y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_test, y_test))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's how the bigger network fares compared to the reference one. The green line shows the validation loss of the bigger network, and the \n", + "other line shows the original network." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "bigger_val_loss = np.array(bigger_model.get_validation_summary(\"Loss\"))\n", + "\n", + "plt.plot(original_val_loss[:, 0], original_val_loss[:, 1], label='original model')\n", + "plt.plot(bigger_val_loss[:, 0], bigger_val_loss[:, 1], label='bigger model', color='green')\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The bigger network starts overfitting almost right away, after just one epoch, and overfits much more severely. Its validation loss is also \n", + "noisier.\n", + "\n", + "Meanwhile, consider the training losses of our two networks." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The bigger network gets its training loss near zero very quickly. The more capacity the network has, the quicker it will be \n", + "able to model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large \n", + "difference between the training and validation loss)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adding weight regularization\n", + "\n", + "\n", + "You may be familiar with the principle of _Occam's Razor_: given two explanations for something, the explanation most likely to be correct is the \n", + "\"simplest\" one, the one that makes the fewest assumptions. 
This also applies to the models learned by neural networks: given some \n", + "training data and a network architecture, there are multiple sets of weight values (multiple _models_) that could explain the data, and \n", + "simpler models are less likely to overfit than complex ones.\n", + "\n", + "A \"simple model\" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer \n", + "parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity \n", + "of a network by forcing its weights to only take small values, which makes the distribution of weight values more \"regular\". This is called \n", + "\"weight regularization\", and it is done by adding to the loss function of the network a _cost_ associated with having large weights. This \n", + "cost comes in two flavors:\n", + "\n", + "* L1 regularization, where the cost added is proportional to the _absolute value of the weight coefficients_ (i.e. to what is called the \n", + "\"L1 norm\" of the weights).\n", + "* L2 regularization, where the cost added is proportional to the _square of the value of the weight coefficients_ (i.e. to what is called \n", + "the squared \"L2 norm\" of the weights). L2 regularization is also called _weight decay_ in the context of neural networks. Don't let the different \n", + "name confuse you: weight decay is mathematically the same as L2 regularization.\n", + "\n", + "In the Keras API of Analytics Zoo, weight regularization is added by passing _weight regularizer instances_ to layers as keyword arguments. 
Let's add L2 weight \n", + "regularization to our movie review classification network:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createL2Regularizer\n", + "creating: createZooKerasDense\n", + "creating: createL2Regularizer\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "from zoo.pipeline.api.keras import regularizers\n", + "\n", + "l2_model = models.Sequential()\n", + "l2_model.add(layers.Dense(16, W_regularizer=regularizers.l2(0.001),\n", + " activation='relu', input_shape=(10000,)))\n", + "l2_model.add(layers.Dense(16, W_regularizer=regularizers.l2(0.001),\n", + " activation='relu'))\n", + "l2_model.add(layers.Dense(1, activation='sigmoid'))\n", + "\n", + "l2_model.compile(optimizer='rmsprop',\n", + " loss='binary_crossentropy',\n", + " metrics=['acc'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`l2(0.001)` means that every coefficient in the weight matrix of the layer will add a term of the form `0.001 * weight_coefficient_value ** 2` to the total loss of \n", + "the network. 
Note that because this penalty is _only added at training time_, the loss for this network will be much higher at training \n", + "than at test time.\n", + "\n", + "Here's the impact of our L2 regularization penalty:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "dir_name = '4-4 ' + str(time.ctime())\n", + "l2_model.set_tensorboard('./', dir_name)\n", + "l2_model.fit(x_train, y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_test, y_test))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 512 records in 0.024366594 seconds. Throughput is 21012.373 records/second. Loss is 0.13651785.\n", + "Top1Accuracy is Accuracy(correct: 21684, count: 25000, accuracy: 0.86736)_" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEKCAYAAAD9xUlFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3XdYFOf6//H3QxdFJIoIIoINQUQpii0ajT1qYonfmKLGGDXR9GN+OSmmnZyTk5NjTEzTGI2mauztJJYk9qgoYBfsgNhQmkjbfX5/7LpBRUVl2QXu13Xtxe7M7Oy9s8N+duaZeUZprRFCCCEAHGxdgBBCCPshoSCEEMJCQkEIIYSFhIIQQggLCQUhhBAWEgpCCCEsJBSEEEJYSCgIIYSwkFAQQghh4WTrAm5VnTp1dGBgoK3LEEKICmXHjh3ntNbeN5uuwoVCYGAgsbGxti5DCCEqFKXU8dJMJ7uPhBBCWEgoCCGEsJBQEEIIYVHh2hRKUlhYSEpKCnl5ebYuRVQQbm5u+Pv74+zsbOtShLArlSIUUlJS8PDwIDAwEKWUrcsRdk5rTXp6OikpKQQFBdm6HCHsSqXYfZSXl0ft2rUlEESpKKWoXbu2bFkKUYJKEQqABIK4JbK+CFGyShMKQghRWRmMmvdW7CM145LVX0tCoZz17duXjIyMG04zadIk1qxZc1vz/+OPP+jXr99tPbe0jh07RlhY2B1PI4S4Oa01f1+4i682HOWPg2es/nqVoqG5ItBao7Vm5cqVN532nXfeKYeKhBD2TmvNP1bsZ15sCs92a8IjMQ2t/pqypVBGJk+eTFhYGGFhYUyZMgUw/VoODg5m+PDhhIWFkZycTGBgIOfOnQPg3XffJTg4mE6dOjFs2DA+/PBDAEaOHMn8+fMBU7ceb775JpGRkbRs2ZIDBw4AsG3bNtq3b09ERAQdOnTg4MGDN6zvm2++4YEHHqBHjx4EBgby6aefMnnyZCIiImjXrh3nz58HID4+nnbt2hEeHs7AgQO5cOECADt27KBVq1a0atWKz
z77zDJfg8HAxIkTadOmDeHh4UybNq0Ml6oQVdvU3w7x9cajjOwQyAs9mpXLa1a6LYW3l+1l38msMp1nqF9N3uzf4rrjd+zYwaxZs9i6dStaa2JiYujSpQteXl4kJSUxe/Zs2rVrd8Vztm/fzoIFC0hISKCwsJDIyEiioqJKnH+dOnXYuXMnn3/+OR9++CEzZsygefPmbNiwAScnJ9asWcOrr77KggULbvg+9uzZQ1xcHHl5eTRp0oR///vfxMXF8cILLzBnzhyef/55hg8fztSpU+nSpQuTJk3i7bffZsqUKTz++ON8+umndO7cmYkTJ1rm+fXXX+Pp6cn27dvJz8+nY8eO9OzZUxpyhbhDszYdZfLqRAZF1mdSv9By+5+SLYUysHHjRgYOHEj16tWpUaMGgwYNYsOGDQA0bNjwmkAA2LRpE/fffz9ubm54eHjQv3//685/0KBBAERFRXHs2DEAMjMzefDBBwkLC+OFF15g7969N62za9eueHh44O3tjaenp+U1W7ZsybFjx8jMzCQjI4MuXboAMGLECNavX09GRgYZGRl07twZgMcee8wyz1WrVjFnzhxat25NTEwM6enpJCUllWKpCSGuZ8GOFN5eto+eoT58MDgcB4fy+5FV6bYUbvSL3haqV69+x/NwdXUFwNHRkaKiIgDeeOMNunbtyqJFizh27Bj33HNPqecD4ODgYHns4OBgme+t0lozdepUevXqdcXwy+ElhLg1v+w5xcT5CXRqUoepD0fg5Fi+v91lS6EM3H333SxevJjc3FwuXrzIokWLuPvuu2/4nI4dO7Js2TLy8vLIyclh+fLlt/SamZmZ1K9fHzC1F5QFT09PvLy8LFs53377LV26dKFWrVrUqlWLjRs3AvD9999bntOrVy+++OILCgsLAUhMTOTixYtlUo8QVc3GpHM8+2McrRrUYtpjUbg6OZZ7DZVuS8EWIiMjGTlyJG3btgVg9OjRRERE3PDXcps2bRgwYADh4eH4+PjQsmVLPD09S/2aL7/8MiNGjOAf//gH9913352+BYvZs2czbtw4cnNzadSoEbNmzQJg1qxZjBo1CqUUPXv2tEw/evRojh07RmRkJFprvL29Wbx4cZnVI0RVseP4BcZ8G0sj7+rMGtmG6q62+XpWWmubvPDtio6O1ldfZGf//v2EhITYqKLbl5OTQ40aNcjNzaVz585Mnz6dyMhIW5dVZVTU9UZUPvvTsvi/aVvwqu7Cz+PaU9fDrcxfQym1Q2sdfbPpZEvBhsaMGcO+ffvIy8tjxIgREghCVEHHzl3ksa+34e7ixHdPxFglEG6FhIIN/fDDD7YuQQhhQ2mZl3hkxlaMWvPd6Bga3OVu65KkoVkIIWwhPSefR2dsJetSIXNGtaVJXQ9blwTIloIQQpS7rLxCRszaRsqFS8wZ1Zaw+qU/yMTaZEtBCCHK0aUCA6O/ieVAWjZfPhpFTKPati7pCrKlIIQQ5aSgyMhT3+9g+/HzfPJQBF2b17V1SdeQLYUyUqNGjWuGTZ48mdDQUMLDw7n33ns5fvx4udf11ltvWTraK62lS5fy/vvv3/Fr33PPPVx9+HBZK9554J1MI4S1GYyaF+fF88fBs/xzYEv6t/KzdUklklCwooiICGJjY9m1axdDhgzh5Zdfvulzbre7ibJSVFTEgAEDeOWVV2xahxCVidaa1xfvZvmuNF7t25xhbQNsXdJ1SShYUdeuXXF3Nx1i1q5dO1JSUkqcbuTIkYwbN46YmBhefvllLl68yKhRo2jbti0REREsWbIEgNzcXIYOHUpoaCgDBw4kJibG8ku8+JbK/PnzGTly5DWv89VXX9GmTRtatWrF4MGDyc3NLfH1v/nmGyZMmABA69atLbdq1aqxbt2669Z36dIlHnroIUJCQhg4cCCXLpV8lajAwED+/ve/07p1a6Kjo9m5cye9evWicePGfPnll4Dpn2jixImEhYXRsmVL5s6daxk+YcIEgoOD6d69O
2fO/HXRkR07dtClSxeioqLo1asXaWlppfughLAirTWTluzlx23JjO/amDGdG9u6pBuqdG0Kz//yPPGn4st0nq3rtWZK7yl3NI+vv/6aPn36XHd8SkoKmzdvxtHRkVdffZVu3boxc+ZMMjIyaNu2Ld27d+eLL77Ay8uLffv2sWfPHlq3bn1LNQwaNIgnn3wSgNdff52vv/6aZ5555prXL96XUny8aVkuW7aMDz74gA4dOvDmm2+WWN+0adNwd3dn//797Nq164Yn4wUEBBAfH88LL7zAyJEj2bRpE3l5eYSFhTFu3DgWLlxIfHw8CQkJnDt3jjZt2tC5c2e2bNnCwYMH2bdvH6dPnyY0NJRRo0ZRWFjIM888w5IlS/D29mbu3Lm89tprzJw585aWkRBl6XIgfPvnccZ2bsTfegbbuqSbqnShYI++++47YmNjWbdu3XWnefDBB3F0NHV+tWrVKpYuXWppC8jLy+PEiRNs3LiR5557DoCwsDDCw8NvqY49e/bw+uuvk5GRQU5OzhU9mxZ//aslJSUxceJEfv/9d5ydna9b3/r163n22WcBCA8Pv2F9AwYMAEzddufk5ODh4YGHhweurq5kZGSwceNGhg0bhqOjIz4+PnTp0oXt27ezfv16y3A/Pz+6desGwMGDB9mzZw89evQATBf/8fX1vaXlI0RZujoQXunTvEJcZ6TShcKd/qIva2vWrOG9995j3bp1lq6qX3vtNVasWAH89Uu8eBfbWmsWLFhAcHDpf1UUX9ny8vJKnGbkyJEsXryYVq1a8c033/DHH39Yxl2vi++cnByGDh3KV199ZfmSvZ36rla82+6ru/S+nXYVrTUtWrRgy5Ytt12TEGWlogYCSJuCVcXFxTF27FiWLl1K3bp/HXr23nvvER8fbwmEq/Xq1YupU6dyubPCuLg4wNTd9rx58wDYt28fu3fvtjzHx8eH/fv3YzQaWbRoUYnzzc7OxtfXl8LCwiu6v76RUaNG8fjjj1/RFfj16uvcubOl6449e/awa9euUr1GSe6++27mzp2LwWDg7NmzrF+/nrZt29K5c2fL8LS0NH7//XcAgoODOXv2rCUUCgsLS3XhISHKWkUOBKiEWwq2kpubi7+/v+Xxiy++yMqVK8nJyeHBBx8ETPvRly5detN5vfHGGzz//POEh4djNBoJCgpi+fLlPP3004wYMYLQ0FCaN29OixYtLN1tv//++/Tr1w9vb2+io6PJycm5Zr7vvvsuMTExeHt7ExMTQ3Z29g3rOH78OPPnzycxMdGyb37GjBnXre+pp57i8ccfJyQkhJCQkOteXrQ0Bg4cyJYtW2jVqhVKKT744APq1avHwIED+e233wgNDSUgIID27dsD4OLiwvz583n22WfJzMykqKiI559/nhYt7OuiS6Jyq+iBAFbuOlsp1Rv4GHAEZmit379q/EdAV/NDd6Cu1rrWjeZZmbrOvlUGg4HCwkLc3Nw4fPgw3bt35+DBg7i4uNi6tAqpqqw3onzYeyDYvOtspZQj8BnQA0gBtiullmqt912eRmv9QrHpnwEirFVPZZCbm0vXrl0pLCxEa83nn38ugSCEHbD3QLgV1tx91BY4pLU+AqCU+gm4H9h3nemHAW9asZ4Kz8PDw+pnCAshbk1lCgSwbkNzfSC52OMU87BrKKUaAkHAb7f7YhXtCnLCtmR9EWWhsgUC2M/RRw8B87XWhpJGKqXGKKVilVKxZ8+evWa8m5sb6enp8o8uSkVrTXp6Om5utr3ClajYKmMggHV3H6UCDYo99jcPK8lDwPjrzUhrPR2YDqaG5qvH+/v7k5KSQkmBIURJ3NzcrjhaTIhbUVkDAawbCtuBpkqpIExh8BDw8NUTKaWaA17AbZ915OzsTFBQ0O0+XQghSq0yBwJYcfeR1roImAD8CuwH5mmt9yql3lFKDSg26UPAT1r2/Qgh7FxlDwSw8slrWuuVwMqrhk266vFb1qxBCCHKQlUIBLCfhmYhh
LBbRmPVCASQbi6EEOKGLuYX8eK8eH7de7rSBwJIKAghxHWdzLjE6NmxHDiVxZv9QxnZIbBSBwJIKAghRIniTlzgyTk7yC80MHNkG+4JrnvzJ1UCEgpCCHGVJfGpTJy/i3o13fjxyRia+njYuqRyI6EghBBmRqNmyppEPvntEG2D7uLLR6O4q3rV6nRSQkEIIYDcgiJempfA//acYmi0P/94oCUuTlXvAE0JBSFElXcqM4/Rc7az92QWr98XwhOdgip9g/L1SCgIIaq0hOQMnpwTy8X8ImYMj+beEB9bl2RTEgpCiCpr+a6TvDQvAW8PV759oiPB9apOg/L1SCgIIaocrTVT1iTx8dokoht68eVjUdSp4WrrsuyChIIQokrJKzTw0s8JrNiVxuBIf/45KAxXJ0dbl2U3JBSEEFXG6aw8xsyJZVdqJq/0ac7Yzo2qbIPy9UgoCCGqhD2pmYyeHUtWXiHTHo2iZ4t6ti7JLkkoCCEqLa018ckZfL/1BEsTTlKnugvzx3Ug1K+mrUuzWxIKQohK52J+EUviT/L91uPsPZmFu4sjgyP9ebFHM7w9pEH5RiQUhBCVxv60LL7fepzFcSfJyS+ieT0P3n0gjAda++Hh5mzr8ioECQUhRIWWV2hg5e40vvvzODtPZODi5EC/cF8eiWlIZEAtaUi+RRIKQogK6cjZHH7YeoL5O1PIyC0kqE51Xr8vhMGR/nhVsU7sypKEghCiwig0GFm97zTf/XmczYfTcXJQ9GpRj0diAmjfuLZsFZQBCQUhhN0rMhj59PdDfL/1BGez86lfqxp/69mMoW0aUNfDzdblVSoSCkIIuzd7y3GmrEmia7A3j7VvSJdmdXF0kK0Ca5BQEELYtTNZeXy0OpEuzbyZObKN7CKysqp3BQkhRIXyr/8doKDIyFsDWkgglAMJBSGE3dp6JJ1FcamM6dyIoDrVbV1OlSChIISwS4UGI5OW7KV+rWqM79rE1uVUGRIKQgi7NGfLcQ6ezmZS/1CquUjX1uVFQkEIYXfOZOUxxdy43DO0al8es7xJKAgh7M6//neAfGlctgkJBSGEXZHGZduSUBBC2I0ig5E3l0rjsi1JKAgh7MacLcc5cCqbN/pJ47KtSCgIIexC8TOXe7WQxmVbsWooKKV6K6UOKqUOKaVeuc40Q5VS+5RSe5VSP1izHiGE/ZLGZftgtb6PlFKOwGdADyAF2K6UWqq13ldsmqbA34GOWusLSqm61qpHCGG/LjcuT+jaRBqXbcyaWwptgUNa6yNa6wLgJ+D+q6Z5EvhMa30BQGt9xor1CCHskDQu2xdrhkJ9ILnY4xTzsOKaAc2UUpuUUn8qpXqXNCOl1BilVKxSKvbs2bNWKlcIYQvSuGxfbN3Q7AQ0Be4BhgFfKaVqXT2R1nq61jpaax3t7e1dziUKIazlTLapcbmzNC7bDWuGQirQoNhjf/Ow4lKApVrrQq31USARU0gIIaqA91eaGpfflsZlu2HNUNgONFVKBSmlXICHgKVXTbMY01YCSqk6mHYnHbFiTUIIO7Ht6HkWxqXyZOcgaVy2I1YLBa11ETAB+BXYD8zTWu9VSr2jlBpgnuxXIF0ptQ/4HZiotU63Vk1CCPtQZDAyackeaVy2Q1a9HKfWeiWw8qphk4rd18CL5psQooq43Lj85aNRuLvIVYHtia0bmoUQVYw0Lts3CQUhRLmSxmX7JqEghCg30rhs/yQUhBDlQhqXKwYJBSFEufjrzOUQaVy2YxIKQgiriz12vljjcj1blyNuQOJaCGE1eYUGJq9O5KsNR/DzrMa790vjsr2TUBBCWEV8cgYvzYvn8NmLDGsbwGv3hVDDVb5y7J18QkKIMlVQZOSTtUl8se4w3jVcmT2qLV2aSUeWFYWEghCizOw9mclL8xI4cCqbIVH+vNEvFM9qzrYuS9wCCQUhxB0rNBj54o/DfLI2Ca/qLswYHk33UDlbuSKSUBBC3JHE09m8NC+B3amZDGjlx
9sDWuBV3cXWZYnbJKEghLgtBqPmqw1HmLwqkRpuTnzxSCR9WvrauixxhyQUhBC37MjZHF76OYG4Exn0blGPfwwMo04NV1uXJcqAhIIQotSMRs2szcf44JcDuDk78vFDrRnQyk/OPahEJBSEEKVyIj2Xv81PYNvR83RrXpd/DWqJT003W5clylipQkEp1RhI0VrnK6XuAcKBOVrrDGsWJ4SwD+sSz/LUdztwVIoPhoTzYJS/bB1UUqXt+2gBYFBKNQGmAw2AH6xWlRDCbvyy5xSjZ28nsHZ1fn2hM0OjG0ggVGKlDQWj+ZrLA4GpWuuJgBxmIEQltyguhfE/7KRlfU9+HNMOv1rVbF2SsLLStikUKqWGASOA/uZhcpqiEJXY91uP8/riPbRvVJuvhkdTXfotqhJKu6XwONAeeE9rfVQpFQR8a72yhBC2NH39YV5btIduwXWZObKNBEIVUqpPWmu9D3gWQCnlBXhorf9tzcKEEOVPa82UNUl8vDaJfuG+fPR/rXF2lMuuVCWl+rSVUn8opWoqpe4CdgJfKaUmW7c0IUR50lrz3or9fLw2iaHR/nz8UIQEQhVU2k/cU2udBQzCdChqDNDdemUJIcqTwah5ddEeZmw8ysgOgbw/KBxHBznCqCoqbSg4KaV8gaHAcivWI4QoZ4UGIy/Oi+fHbSeY0LUJb/YPxUECocoqbevRO8CvwCat9XalVCMgyXplCSHKQ36RgQk/xLF632le7h3M0/c0sXVJwsZK29D8M/BzscdHgMHWKkoIYX25BUWM/XYHG5LO8c79LRjePtDWJQk7UNqGZn+l1CKl1BnzbYFSyt/axQkhrCMrr5ARM7ex6dA5/jMkXAJBWJS2TWEWsBTwM9+WmYcJISqY8xcLeOSrrcSdyGDqsEgejG5g65KEHSltKHhrrWdprYvMt28AuRK3EBXMmaw8Hpq+hcTT2Xw1PJr7wqW3GnGl0oZCulLqUaWUo/n2KJBuzcKEEGUr5UIuD07bQsqFS8x6vA1dm9e1dUnCDpU2FEZhOhz1FJAGDAFGWqkmIUQZMhg1K3al8eCXW7hwsYDvRsfQoXEdW5cl7FSpQkFrfVxrPUBr7a21rqu1foBSHH2klOqtlDqolDqklHqlhPEjlVJnlVLx5tvo23gPQogSFBQZmRebTI/J6xj/w06quTjy05j2RAZ42bo0YcfupJerF4Ep1xuplHIEPgN6ACnAdqXUUnM/SsXN1VpPuIM6hBDFXCowMHf7CaavP8LJzDxa+NXk80ci6dWinpylLG7qTkLhZmtXW+CQ+ZwGlFI/AfcDV4eCEKIMZOUV8u2W48zceJT0iwW0CfTin4Na0qWZt1wUR5TanYSCvsn4+kBysccpQEwJ0w1WSnUGEoEXtNbJJUwjhLiO9Jx8Zm46ypzNx8nOL6JLM2/Gd21C26C7bF2aqIBuGApKqWxK/vJXQFlcgmkZ8KP52s9jgdlAtxLqGAOMAQgICCiDlxWi4juZcYmvNhzhx20nyC8y0iesHk/f04Sw+p62Lk1UYDcMBa21xx3MOxXTtZwv8zcPKz7/4oe1zgA+uE4d0zFdG5ro6OibbaEIUakdPXeRL/84zMK4FLSGByLqM65LY5rUrWHr0kQlYM3LKW0Hmpqv0pYKPAQ8XHwCpZSv1jrN/HAAsN+K9QhRoe07mcXnfxxi5e40nBwdGNY2gDGdG+Hv5W7r0kQlYrVQ0FoXKaUmYOpd1RGYqbXeq5R6B4jVWi8FnlVKDQCKgPPIuQ9CXONsdj7v/+8AC3amUMPViTGdGzOqUyB1PdxsXZqohJTWFWtvTHR0tI6NjbV1GUJYXZHByLd/Hmfy6kTyCg2M6hTE012a4OnubOvSRAWklNqhtY6+2XRyNW4h7NC2o+eZtGQPB05lc3fTOrw1oAWNvaXNQFifhIIQduRMdh7vrzzAwrhU/Dzd+PJR00lncp6BKC8SCkLYgUKDkTlbjjNldSL5RUbGd23M+K5NcHeRf
1FRvmSNE8LG/jySzptL9nLwdDZdmnnz1oAWBNWpbuuyRBUloSCEjZzOyuOfK/ezJP4k9WtVY/pjUfQI9ZFdRcKmJBSEKGeFBiPfbDrGlDWJFBo1z3ZrwlP3NKGai6OtSxNCQkGI8rT58DneXLKXpDM5dGtelzf7h9KwtuwqEvZDQkGIcnDhYgFvLNnD8l1pNLirGjOGR9M91MfWZQlxDQkFIaws6XQ2T8yO5VRmHs93b8q4Lo1xc5ZdRcI+SSgIYUW/HTjNsz/G4+bsyE9j28lVz4Tdk1AQwgq01kxff4T3fzlAqG9NvhoejV+tsuhtXgjrklAQoozlFRp4ddFuFu5M5b6WvvznwXA5CU1UGLKmClGGzmTnMe7bHew8kcEL3Zvx7L1N5LwDUaFIKAhRRvakZjJmTiwXcgv54pFI+rT0tXVJQtwyCQUhysDK3Wm8NC8BL3dn5j/VnhZ+cklMUTFJKAhxB7TWfLL2EB+tSSQyoBbTHovG28PV1mUJcdskFIS4TZcKDPzt5wRW7E5jcKQ//xwUhquTnH8gKjYJBSFuQ1rmJZ6cE8vek1m81jeE0XcHSYOyqBQkFIS4RTtPXGDstzvIKzAwc0Qbujava+uShJ3LKcjhg00fcP7SeZwdnHF2dLb8dXJwKtUwZwdnWvq0JMAzwKq1SigIcQsW7kzhlYW78fV044fRMTT18bB1ScLOncw+Sb8f+pFwOoFabrUoNBRSZCyi0Gj6eyu+uO8LxkWPs1KlJhIKokLIyC3gh20nGNDKD38v93J//UKDkf+uSuTLdYdp36g2nz8SiVd1l3KvQ1Qse87soe/3fTl/6TzLhi2jb9O+V4zXWlsCotBQaAmKy/ev/tvQs6HVa5ZQEBXC5NWJzNlynI9WJzKsbQATujahbk03q79uocHIop2pTP09ieTzl3isXUMm9Q/F2dHB6q8tKra1R9YyaN4gqjtXZ/3j64n0jbxmGqWUadeQozM426DIEkgoCLuXfD6XH7edoH8rP2q4OvH91hPMi01mRIdAxnVubJVf7EUGI4viUpn62yFOnM8l3N+TdwaESfuBKJXZ8bMZvWw0wbWDWfnISqu3A5QlCQVh96b+loRSilf7NsfXsxpjOzdiyppEpq8/wg9/nuCJu4N4olMQHm53/lOryGBkcfxJpv6WxPH0XFrW9+TrEdF0a15Xji4SN6W15p117/DWure4N+heFgxdgKdbxTqRUUJB2LUjZ3NYsDOVEe0D8fU09TIaWKc6Ux6K4Kl7mjB59UGmrEli9uZjjOvSmOHtA2/rspZFBiNLzGFwLD2XFn41mTE8mntDJAxE6RQYChi7fCzfxH/DiFYjmN5/Oi6OFa/dSUJB2LUpa5JwcXTg6a6NrxkXXM+DaY9Fsyslgw9XJfKv/x3g641HmdCtCQ+1CcDF6eb7/Q1GzdKEVD5Ze4ij5y4S6luT6Y9F0SPUR8JAlFpmXiaD5w1m7dG1vNXlLSZ1mVRh1x8JBWG3DpzKYtmukzzVpTF1aly/64hw/1rMGdWWrUfS+XDVQSYt2cv09Ud47t6mDIyoj1MJjcIGo2ZZwkk+WZvEkXMXCfGtybTHougpYSBu0YnME/T9vi8H0w/yzf3fMKL1CFuXdEeU1trWNdyS6OhoHRsba+syRDkYMyeWLUfS2fhyNzzdS9deoLVmXeJZ/rsqkd2pmTTyrs6LPZrRN8wXBweFwahZvssUBofPXqR5PQ+e796MnqE+ODhIGIhbszNtJ/1+6MfFwossHLqQexvda+uSrksptUNrHX2z6WRLQdilhOQMVu07zYs9mpU6EMB0iN89wXXp0sybX/ee4r+rEpnwQxyhvocZGFGfubHJHDqTQ7CPB188EkmvFvUkDMRtWZm0kqE/D6W2e202PbaJsLphti6pTEgoCLv04aqDeLk7M6pT0G09XylF7zBfeoTWY2lCKh+tTuK9lftp5lODzx+JpLeEgbgDX8Z+yfiV42ldrzXLhy3H16PyX
DtDQkHYna1H0tmQdI5X+zanhuudraKODoqBEf70C/ezbCFIGIjbZdRG/r7IHITuAAAalUlEQVTm73yw+QP6Nu3L3CFzqeFSw9ZllSkJBWFXtNb8d1UidT1ceaxdYJnN19nRgRDfmmU2P1H15BXlMXLxSObuncu4qHFM7TsVJ4fK9xVa+d6RqNA2JJ1j27HzvHN/i9s630CIkmTlZ7EzbSfZ+dkUGAquuBUaC68ZVtIt4XQC8afi+Xf3fzOxw8RKe5SaVUNBKdUb+BhwBGZord+/znSDgflAG621HFpURWmt+XDVQerXqsZDbSpOtwDC/pzKOcWG4xvYcGIDG09sJOF0AkZtLPXzXRxdrrlVd67O3CFzGdpiqBUrtz2rhYJSyhH4DOgBpADblVJLtdb7rprOA3gO2GqtWkTFsHrfaXalZPLBkPBSnXgmBJh+TCSdT2LjiY1sOLGBDcc3cPjCYQDcnd1p59+ONzq/QXv/9tRxr3PFF72zo/M1X/6OyrHSbgWUhjW3FNoCh7TWRwCUUj8B9wP7rpruXeDfwEQr1iLsnNGombw6kUZ1qjMoor6tyxF2rMhYRMKpBFMAmLcEzlw8A0Ad9zp0CujEU9FPcXfDu4moF2HqgVSUmjVDoT6QXOxxChBTfAKlVCTQQGu9Qil13VBQSo0BxgAEBMhuhcpo+e40DpzK5pNhESWegSxsLyMvg5quNXFQ5fv55BTksDVlK5uTN7PhxAa2pGwhpyAHgMBagfRq3Iu7A+6mU0AnmtdpXqV/5ZcFmzU0K6UcgMnAyJtNq7WeDkwH0xnN1q1MlLcig5EpqxNpXs+Dfi0rz/HelUV6bjoTV09kVvwsPFw8iPKLoo1fG9r4tSHaL5rAWoFl9kWsteZE5gk2J29mU/ImNidvtrQHKBQtfVoyPHw4dzc0hYB/Tf8yeV3xF2uGQirQoNhjf/OwyzyAMOAP8wpVD1iqlBogjc1Vy8K4VI6cu8j0x6LkHAI7orXm+93f88KvL3Dh0gUmtJmAURuJTYvl460fU2AoAEy7bKL9oon2jaZNfVNYlPZkrkJDIXGn4ticvNkSBCezTwJQ3bk67fzb8drdr9GxQUdi/GOo5VbLau9XmFgzFLYDTZVSQZjC4CHg4csjtdaZQJ3Lj5VSfwB/k0CoWvKLDHy8JolW/p70CPWxdTnC7ND5Qzy14inWHFlDTP0Ypg+fTrhPuGV8gaGA3ad3s/3kdmJPxrL95Hb+dfhfGLQBAD8PP8uWxOW/td1rk56bzpaULWw6sYnNKZvZnrqdS0WXAGjo2ZAuDbvQsUFHOjToQEuflpXyPAB7Z7UlrrUuUkpNAH7FdEjqTK31XqXUO0Cs1nqptV5bVBzztieTmnGJfw1qKfuC7UCBoYAPN3/Iu+vfxcXRhc/6fsbYqLE4Olx5zoiLowtRflFE+UVZhuUW5hKXFmcJie0nt7Pk4BLL+LrV61oahJ0cnIj0jWRs1Fg6NOhAhwYdqF9TDjCwB9JLqrCZSwUGuvzndwJrV2fu2HYSCja26cQmxi4fy96zexkSOoSPe3+Mn4ffHc0zMy+THWk72J66nf3n9hNcO5iOAR2J9ovG3dm9jCoXpSG9pAq7992fxzmTnc/UYRESCDaUkZfBK2teYdqOaTSo2YBlw5bRr1m/Mpm3p5sn3YK60S2oW5nMT1ifhIKwiZz8Ir5Yd5i7m9YhplFtW5dTJWmtmbd3Hs/98hxnc8/yQrsXeKfrO5WugzdxayQUhE3M3HiU8xcL+FvPYFuXUiUdyzjG+JXjWZm0kijfKFY+spJI30hblyXsgISCKHcZuQV8tf4IPUN9aNVADjEsT0XGIqb8OYU3/3gTheKjXh8xoe0EOcpHWMiaIMrd9PVHyCko4sWezWxdSpWyLXUbY5aNIeF0AgOCB/Bpn09p4Nng5k8UVYqEgihX53LymbXpGP3D/Wher+pc3yAzL5NtqdvwqeFDUK0gPFw9rPZaBYYC9p/dz67Tu0y3M
6a/p3JO4efhx4KhCxjYfKA07osSSSiIcvX574cpMBh5vntTW5didem56Sw5uIQF+xew+vBqCo2FlnF13OvQyKuR6Var0V/3vRrhX9P/mvMCSqK1Ji0n7a8vf/Nt/7n9FBmLAHB1dKVF3Rb0btKbiHoRjGg1Ak83T6u9Z1HxSSiIcpOWeYnvth5ncGR9GnlXziNcTuWcYvGBxczfN58/jv2BQRto6NmQZ9o+Q68mvcjMy+TIhSMcuXCEoxlH2Z66nfn75lu+xMF0YldgrUCCagVdERY+1X04dP4QCacTLAGQfind8rwGNRsQ7hNO/2b9CfcJJ9wnnKa1m0p7gbglsraIcjP1t0NorXmmW+XaSkjOTGbh/oUs2L+AjSc2otE0q92Mlzu+zOCQwUT6Rt5wV02RsYiUrBRLWBS/zd83/4ovfjBdI6Bl3ZYMChlk+fJvWbclXtW8rP1WRRUgoSDKxY7jF5i3PZmHYwJocFf5n8m67OAyXv3tVTxcPPCv6U+Dmg1Mfz0bWB7Xq1GvVLttAI5cOMKCfQtYsH8BW1NN14cKqxvGpC6TGBI6hBbeLUq9z/7ylkFgrcAST/LKzMvkaMZR0rLTaHJXExp5NSp1nULcKunmQlhVZm4h/1l1gO+3nsDHw42lEzpSt6ZbudYwJ2EOo5aMolntZvh6+JKcmUxKVoqlI7bLHJUjvh6+fwXGVcHh6ujKiqQVLNi/gPhT8QBE+UYxOGQwg0MH06y2HE0l7Jd0c3GV7PxsVh9ZzaCQQbYupUrQWrMoLpV/rtzP+YsFjOwQyIs9muHhVr5Xwfpoy0e8uOpF7g26l0X/t8hy1I/WmvOXzpOSlUJKVgrJWclX3E84ncDyxOXXBAdAe//2fNjjQwaFDCLIK6hc348Q1lZlQuGDTR/w3ob32P7k9it6dhRlL+l0Nq8v3sPWo+eJCKjF7FFtaeFXvke8aK154/c3eG/DewwKGcQPg37A1cnVMl4pRW332tR2r02req2uO48LeRdMQZGZTEZeBvcE3iO9eYpKrcrsPsrMyyT402ACawWy+YnN5X5Jwaogt6CIT9YeYsaGI1R3deKVPs35v+gG5X7hHIPRwPiV45m2YxqjI0bzZb8vZR+8qPJKu/uoynwzerp58p8e/2Fr6lZmxc2ydTmVitaaX/eeosfk9Xy57jCDIuvz20tdGNY2oNwDocBQwMMLH2bajmm80vEVpvefLoEgxC2oMlsKYPry6vxNZw6cO8DBCQe5q9pdZVxd1ZN8Ppe3lu5l7YEzBPt48I+BYbQJtM1yzSnIYfC8waw6vIr/9PgPf+vwN5vUIYQ9ki2FEiil+KzvZ1y4dIE3fnvD1uVUaPlFBj77/RA9PlrHliPpvNY3hOXPdrJZIKTnptN9TnfWHFnDzAEzJRCEuE1VpqH5snCfcMa3Gc/UbVN5IvIJ6S74Nmw6dI43luzhyNmL9G1Zjzf6heLrWc1m9aRmpdLzu54cPn+YBUMX8EDzB2xWixAVXZXaUrjs7a5v413dm/Erx2PURluXU2Gcycrj2R/jeGTGVooMmlmPt+HzR6JsGgiJ6Yl0nNmR5Mxk/vfI/yQQhLhDVW5LAaCWWy0+6P4BI5eMZHb8bB6PeNzWJVldWuYl5semkFtooMhgpNCgKTQYKTL/LTRqCouMFBmvGmc0Wu4nn8+l0KB59t6mPH1PY9ycbduAG5cWR6/veqHR/D7idznUWIgyUKUamoszaiOdZ3XmYPpBEickVup+YzYfOseEH+M4f7EAZ0eFk4MDzo4KZ0cHnMx/Tbfrj3NyUHi5uzDunsYE1alu67fEumPr6P9jf7yqebHq0VUE15EruAlxI3JG8004KAc+6/sZkdMjeeP3N/i076e2LqnMaa2Ztv4IH/xygEbeNZg3tj1N6lb83kmXHlzK0J+H0sirEb8++qtcKEaIMlRl2hT2p2Xx3Z/HOZWZZxnWql4rno5+mi9iv7D0ZVNZZOcVMu67Hbz/vwP0CfNl8
fiOlSIQZsfPZtBcU++g6x9fL4EgRBmrMruPPlmbxOTViQCE1a9J9xAfuof4UP8uI8GfBtO0dlM2PL6hUpzpnHg6m3Hf7uD4+Vz+3qc5T3QKqvBX2dJa89GfH/HSqpfo3qg7C4cutOrVy4SobEq7+6jKhILWmkNncli9/zRr9p0mLjkDrcHX0406PltYnvwG0/t9zZNRo6xQdflZlnCS/7dgF+4ujnz6cCTtGtW2dUm3pdBQSNypODad2MTG5I1sOrGJ0xdPMyR0CN8N/O6KfoyEEDcnoXAT53Ly+e3AGdbuP826xDMcUy9R5JDGY4GL6RvWmG7N61K7RsX54ik0GHn/fwf4euNRohp68dnDkdTzLN8uqu9EZl4mf6b8ycYTG9mUvImtqVvJLcwFIKhWEJ0COtE1sCvDWw2XbiuEuA0SCrcgr9DAnNg/GLeqJ3UdBuB2cTRKQVSAF/eG+NAjtC6NvWvY7S6YM9l5TPg+jm3HzjOyQyCv9g3Bxcm+d4MlZyZbAmDjiY3sOr0LjcZBORBRL4KODTrSKaATHQM64ufhZ+tyhajwJBRuw/gV4/lyx5f8dP8fpJzxYc3+0+w9mQVAw9ru9G3py1P3NKZmOV8T4Ea2HzvP+O93kpVXyPuDwnkgwv66dc4tzGX36d3Enoy1hEByVjIANVxq0M6/HZ0adKJTQCdi/GOo4VLxG8SFsDcSCrfhwqULNPu0GcG1g9nw+AaUUpzMuMTaA2dYs+80G5LO4lPTjX8OaknX4LpWqaG0tNZ8s/kY763Yj79XNb54NIoQ35o2rQng7MWzxJ+KJ+5UnOVvYnqi5cxxPw8/OgV0olMD01ZAuE+4XFheiHIgoXCbZsbN5ImlTzD7gdkMbzX8inHxyRlM/DmBpDM5DI70Z1K/UDzdy3+rIbegiFcW7GZpwkm6h/jw36Gt8KxWvnVorTly4Qjxp+KvCIHU7FTLNAGeAbSu15rWPq2J8I0gol4EAZ4BdrsbTojKTELhNhm1kQ5fd+BoxlEOTjhILbdaV4zPLzIwde0hvlh3mLuqu/DeA2H0bFHPavVc7ei5i4z7dgeJZ7L5W89gnurSuFyuWXAi8wS/Hf2NuLQ44k7FkXA6gax80641R+VI8zrNifCNsARAK59W1HavmEc+CVEZSSjcgZ1pO4meHs0zbZ/h4z4flzjNntRMJs7fxf60LPq38uPtAS24q7qLVetatfcUL81LwMlR8fFDEXRu5m211yoyFrEleQsrklawImkFe87sAcDd2Z1WPq1oXa81EfUiaF2vNWF1w6jmbLtO8YQQN2cXoaCU6g18DDgCM7TW7181fhwwHjAAOcAYrfW+G82zPEIB4OkVTzNtxzTixsYR7hNe4jQFRUa+XHeYqb8lUdPNmbfvb8F9LX3LdPdIXqGBX/eeYsHOVNYnniXc35PPH4nE38u9zF7jsvTcdH459Asrklbwy6FfuJB3AScHJzoFdOK+pvfRu0lvQuqEyCGhQlRANg8FpZQjkAj0AFKA7cCw4l/6SqmaWuss8/0BwNNa6943mm95hcL5S+dpNrUZId4hrB+5/oZf9AdOZfHy/F3sSsmkd4t6vPNAC+p63P45AlprYo9fYMGOFFbsSiM7v4j6tarxYLQ/47qUXe+kWmt2n9nNisQVLE9azp8pf2LURrzdvenTtA/9mvajZ+OeeLp5lsnrCSFsxx46xGsLHNJaHzEX9BNwP2AJhcuBYFYdsJt9WXdVu4v3u7/Pk8ue5Ltd3/FYq8euO23zejVZ+FQHvtpwlI/WJPLnR+m82T+UB1rXv6WthuTzuSzcmcrCuBSOp+fi7uJInzBfBkfVJybwLi7kn+dSURYaN1ydXG+rS47cwlx+O/obyxOXszJppeXQ0EjfSF67+zXua3ofbeq3qRTdfQghbp01txSGAL211qPNjx8DYrTWE66abjzwIuACdNNaJ91ovuW1pQCmRuf2X7fneMZxDk44WKpfzIfO5PDy/AR2nsjg3
uZ1eW9gyxueWZyTX8TK3Wks2JHC1qPnAejQuDaDI/3p1cKHg+d3MX/ffObvn8+h84eueK6TgxNuTm64Orri6uSKq6Or6fF17mfmZ7L++HryivKo4VKDHo16cF/T++jTtI+cICZEJWcPu49KFQrFpn8Y6KW1HlHCuDHAGICAgICo48ePW6XmksSejKXtV215LuY5Pur9UameYzCaziH4z68HcHZw4PV+IQyNbmDZajAaNVuOpDN/Rwq/7DnFpUIDgbXdGRzpzwMRfpzM3WMJgmMZx3BUjtzb6F56NuqJo4MjeUV55Bflk2/Iv/a+If+645wdnOkW1I37mt5H54adpf8gIaoQewiF9sBbWute5sd/B9Ba/+s60zsAF7TWN/w5Xp5bCpeNWz6OGTtnEDc2jpY+LUv9vGPnLvLygl1sO3qeu5vWYULXJqxPOsuinamczMzDw82JfuF+DIr0JV8dYMH+BSzYv4DkrGScHZzp0bgHQ0KGMCB4gBzeKYS4I/YQCk6YGprvBVIxNTQ/rLXeW2yappd3Fyml+gNv3qxoW4RCem46zT5tRmCtQEa1HkX9mvXxr+lPfY/61K1e94ZH4xiNmu+3Hudf/ztAboEBBwWdm3nzQIQv1WsksTRxEQsPLORk9klcHF3o3aQ3Q0KG0D+4/zXnSAghxO2yeSiYi+gLTMF0SOpMrfV7Sql3gFit9VKl1MdAd6AQuABMKB4aJbFFKAD8tOcnRi4eSb4h/4rhjsoRXw9f6nv8FRT1a9a/5m96Nmw6dBon9wOsPbaUhQcWcubiGdyc3OjTpA9DQofQr1k/arravqsKIUTlYxehYA22CgUwNTyfuXiGlKwUUrNSSc1OtfxNyUqxPM4uyL7muV5uXmg0GXkZuDu7c1/T+xgSOoS+TftKB3BCCKuzh0NSKx0H5UC9GvWoV6Me0X7XX7ZZ+VnXhEZqVir5hnz6Nu1L7ya9cXcu+5PPhBDiTkkoWEFN15rU9K5JiHeIrUsRQohbImcoCSGEsJBQEEIIYSGhIIQQwkJCQQghhIWEghBCCAsJBSGEEBYSCkIIISwkFIQQQlhUuG4ulFJngfLrO9s+1QHO2boIG5NlIMvgMlkOpVsGDbXWN72we4ULBQFKqdjS9GFSmckykGVwmSyHsl0GsvtICCGEhYSCEEIICwmFimm6rQuwA7IMZBlcJsuhDJeBtCkIIYSwkC0FIYQQFhIKdkYp1UAp9btSap9Saq9S6jnz8LuUUquVUknmv17m4Uop9YlS6pBSapdSKtK276DsKKUclVJxSqnl5sdBSqmt5vc6VynlYh7uan58yDw+0JZ1lyWlVC2l1Hyl1AGl1H6lVPuqti4opV4w/y/sUUr9qJRyqwrrglJqplLqjFJqT7Fht/zZK6VGmKdPUkqNuNnrSijYnyLgJa11KNAOGK+UCgVeAdZqrZsCa82PAfoATc23McAX5V+y1TwH7C/2+N/AR1rrJpiu6f2EefgTwAXz8I/M01UWHwO/aK2bA60wLY8qsy4opeoDzwLRWuswTNd7f4iqsS58A/S+atgtffZKqbuAN4EYoC3w5uUguS6ttdzs+AYsAXoABwFf8zBf4KD5/jRgWLHpLdNV5Bvgb17puwHLAYXp5Bwn8/j2wK/m+78C7c33nczTKVu/hzJYBp7A0avfS1VaF4D6QDJwl/mzXQ70qirrAhAI7Lndzx4YBkwrNvyK6Uq6yZaCHTNv+kYAWwEfrXWaedQpwMd8//I/zWUp5mEV3RTgZcBoflwbyNBaF5kfF3+flmVgHp9pnr6iCwLOArPMu9FmKKWqU4XWBa11KvAhcAJIw/TZ7qDqrQuX3epnf8vrhISCnVJK1QAWAM9rrbOKj9OmyK+0h40ppfoBZ7TWO2xdi405AZHAF1rrCOAif+0uAKrEuuAF3I8pIP2A6ly7S6VKstZnL6Fgh5RSzpgC4Xut9ULz4NNKKV/zeF/gjHl4KtCg2NP9zcMqso7AAKXUMeAnT
LuQPgZqKaWczNMUf5+WZWAe7wmkl2fBVpICpGitt5ofz8cUElVpXegOHNVan9VaFwILMa0fVW1duOxWP/tbXickFOyMUkoBXwP7tdaTi41aClw+cmAEpraGy8OHm48+aAdkFtu8rJC01n/XWvtrrQMxNSr+prV+BPgdGGKe7OplcHnZDDFPX+F/PWutTwHJSqlg86B7gX1UoXUB026jdkopd/P/xuVlUKXWhWJu9bP/FeiplPIyb3X1NA+7Pls3pMjtmoalTpg2CXcB8eZbX0z7RdcCScAa4C7z9Ar4DDgM7MZ0lIbN30cZLo97gOXm+42AbcAh4GfA1Tzczfz4kHl8I1vXXYbvvzUQa14fFgNeVW1dAN4GDgB7gG8B16qwLgA/YmpHKcS01fjE7Xz2wCjz8jgEPH6z15UzmoUQQljI7iMhhBAWEgpCCCEsJBSEEEJYSCgIIYSwkFAQQghhIaEgxHUopV4z9865SykVr5SKUUo9r5Ryt3VtQliLHJIqRAmUUu2BycA9Wut8pVQdwAXYjOkY8HM2LVAIK5EtBSFK5guc01rnA5hDYAim/nd+V0r9DqCU6qmU2qKU2qmU+tncZxVKqWNKqQ+UUruVUtuUUk3Mwx80XxcgQSm13jZvTYjrky0FIUpg/nLfCLhjOnN0rtZ6nbk/pmit9Tnz1sNCoI/W+qJS6v9hOrP2HfN0X2mt31NKDQeGaq37KaV2A7211qlKqVpa6wybvEEhrkO2FIQogdY6B4jCdMGSs8BcpdTIqyZrB4QCm5RS8Zj6omlYbPyPxf62N9/fBHyjlHoS0wVjhLArTjefRIiqSWttAP4A/jD/wr/6UoYKWK21Hna9WVx9X2s9TikVA9wH7FBKRWmtK1MvnqKCky0FIUqglApWSjUtNqg1cBzIBjzMw/4EOhZrL6iulGpW7Dn/V+zvFvM0jbXWW7XWkzBtgRTv1lgIm5MtBSFKVgOYqpSqhem62Ycw7UoaBvyilDqpte5q3qX0o1LK1fy814FE830vpdQuIN/8PID/mMNGYertMqFc3o0QpSQNzUJYQfEGaVvXIsStkN1HQgghLGRLQQghhIVsKQghhLCQUBBCCGEhoSCEEMJCQkEIIYSFhIIQQggLCQUhhBAW/x96cXiBYECZ4wAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "l2_val_loss = np.array(l2_model.get_validation_summary(\"Loss\"))\n", + "\n", + "plt.plot(original_val_loss[:,0], original_val_loss[:,1], label='original model')\n", + "plt.plot(l2_val_loss[:,0], l2_val_loss[:,1],label='L2-regularized model',color='green')\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, the model with L2 regularization (dots) has become much more resistant to overfitting than the reference model (crosses), \n", + "even though both models have the same number of parameters.\n", + "\n", + "As alternatives to L2 regularization, you could use one of the following Keras API of Analytics Zoo weight regularizers: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras import regularizers\n", + "\n", + "# L1 regularization\n", + "regularizers.l1(0.001)\n", + "\n", + "# L1 and L2 regularization at the same time\n", + "regularizers.l1_l2(l1=0.001, l2=0.001)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adding dropout\n", + "\n", + "\n", + "Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his \n", + "students at the University of Toronto. Dropout, applied to a layer, consists of randomly \"dropping out\" (i.e. setting to zero) a number of \n", + "output features of the layer during training. Let's say a given layer would normally have returned a vector `[0.2, 0.5, 1.3, 0.8, 1.1]` for a \n", + "given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. `[0, 0.5, \n", + "1.3, 0, 1.1]`. 
The \"dropout rate\" is the fraction of the features that are being zeroed out; it is usually set between 0.2 and 0.5. At test \n", + "time, no units are dropped out; instead, the layer's output values are scaled down by a factor equal to one minus the dropout rate (the fraction of units kept), so as to \n", + "balance for the fact that more units are active than at training time.\n", + "\n", + "Consider a Numpy matrix containing the output of a layer, `layer_output`, of shape `(batch_size, features)`. At training time, we would be \n", + "zeroing out at random a fraction of the values in the matrix." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This technique may seem strange and arbitrary. Why would this help reduce overfitting? Geoff Hinton has said that he was inspired, among \n", + "other things, by a fraud prevention mechanism used by banks -- in his own words: _\"I went to my bank. The tellers kept changing and I asked \n", + "one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation \n", + "between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each \n", + "example would prevent conspiracies and thus reduce overfitting\"_.\n", + "\n", + "The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that are not significant (what \n", + "Hinton refers to as \"conspiracies\"), which the network would start memorizing if no noise was present. 
\n", + "\n", + "In the Keras API of Analytics Zoo, you can introduce dropout in a network via the `Dropout` layer, which gets applied to the output of the layer right before it, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model.add(layers.Dropout(0.5))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's add two `Dropout` layers to our IMDB network to see how well they do at reducing overfitting:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDropout\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDropout\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "dpt_model = models.Sequential()\n", + "dpt_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))\n", + "dpt_model.add(layers.Dropout(0.5))\n", + "dpt_model.add(layers.Dense(16, activation='relu'))\n", + "dpt_model.add(layers.Dropout(0.5))\n", + "dpt_model.add(layers.Dense(1, activation='sigmoid'))\n", + "\n", + "dpt_model.compile(optimizer='rmsprop',\n", + " loss='binary_crossentropy',\n", + " metrics=['acc'])\n", + "\n", + "dir_name = '4-4 ' + str(time.ctime())\n", + "dpt_model.set_tensorboard('./', dir_name)\n", + "dpt_model.fit(x_train, y_train,\n", + " nb_epoch=20,\n", + " batch_size=512,\n", + " validation_data=(x_test, y_test))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 512 records in 0.017992654 seconds. Throughput is 28456.057 records/second. Loss is 0.112769656. 
\n", + "Top1Accuracy is Accuracy(correct: 21871, count: 25000, accuracy: 0.87484)_" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "dpt_val_loss = np.array(dpt_model.get_validation_summary(\"Loss\"))\n", + "\n", + "plt.plot(original_val_loss[:,0], original_val_loss[:,1], label='original model')\n", + "plt.plot(dpt_val_loss[:,0], dpt_val_loss[:,1],label='Dropout-regularized model',color='green')\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, a clear improvement over the reference network.\n", + "\n", + "To recap: here the most common ways to prevent overfitting in neural networks:\n", + "\n", + "* Getting more training data.\n", + "* Reducing the capacity of the network.\n", + "* Adding weight regularization.\n", + "* Adding dropout." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/keras/5.1-introduction-to-convnets.ipynb b/keras/5.1-introduction-to-convnets.ipynb new file mode 100644 index 0000000..0fbedc1 --- /dev/null +++ b/keras/5.1-introduction-to-convnets.ipynb @@ -0,0 +1,300 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First of all, set environment variables and initialize spark context:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: SPARK_DRIVER_MEMORY=8g\n", + "env: PYSPARK_PYTHON=/usr/bin/python3.5\n", + "env: PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n" + ] + } + ], + "source": [ + "%env 
SPARK_DRIVER_MEMORY=8g\n", + "%env PYSPARK_PYTHON=/usr/bin/python3.5\n", + "%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n", + "\n", + "from zoo.common.nncontext import *\n", + "sc = init_nncontext(init_spark_conf().setMaster(\"local[4]\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 5.1 - Introduction to convnets\n", + "\n", + "\n", + "----\n", + "\n", + "First, let's take a practical look at a very simple convnet example. We will use our convnet to classify MNIST digits, a task that you've already been \n", + "through in Chapter 2, using a densely-connected network (our test accuracy then was 97.8%). Even though our convnet will be very basic, its \n", + "accuracy will still clearly beat that of the densely-connected model from Chapter 2.\n", + "\n", + "The 6 lines of code below show you what a basic convnet looks like. It's a stack of `Conv2D` and `MaxPooling2D` layers. We'll see in a \n", + "minute what they do concretely.\n", + "Importantly, because the Keras API of Analytics Zoo defaults to channels-first ordering, our convnet takes as input tensors of shape `(image_channels, image_height, image_width)` (not including the batch dimension). \n", + "In our case, we will configure our convnet to process inputs of size `(1, 28, 28)`, which is the shape we will give the MNIST images. We do this by \n", + "passing the argument `input_shape=(1, 28, 28)` to our first layer."
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasConvolution2D\n", + "creating: createZooKerasMaxPooling2D\n", + "creating: createZooKerasConvolution2D\n", + "creating: createZooKerasMaxPooling2D\n", + "creating: createZooKerasConvolution2D\n" + ] + } + ], + "source": [ + "from zoo.pipeline.api.keras import layers\n", + "from zoo.pipeline.api.keras import models\n", + "\n", + "model = models.Sequential()\n", + "model.add(layers.Conv2D(32, nb_col=3, nb_row=3, activation='relu', input_shape=(1,28,28)))\n", + "model.add(layers.MaxPooling2D((2, 2)))\n", + "model.add(layers.Conv2D(64, nb_col=3, nb_row=3, activation='relu'))\n", + "model.add(layers.MaxPooling2D((2, 2)))\n", + "model.add(layers.Conv2D(64, nb_col=3, nb_row=3, activation='relu'))\n", + "\n", + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_In Keras, the model summary appears directly in the cell output; in the Keras API of Analytics Zoo, the summary is printed to the console, alongside the INFO logs._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the summary you can see that the output of every `Conv2D` and `MaxPooling2D` layer is a 3D tensor of shape `(channels, height, width)`. The width \n", + "and height dimensions tend to shrink as we go deeper in the network. The number of channels is controlled by the first argument passed to \n", + "the `Conv2D` layers (e.g. 32 or 64).\n", + "\n", + "The next step would be to feed our last output tensor (of shape `(64, 3, 3)`) into a densely-connected classifier network like those you are \n", + "already familiar with: a stack of `Dense` layers. These classifiers process vectors, which are 1D, whereas our current output is a 3D tensor. 
\n", + "So first, we will have to flatten our 3D outputs to 1D, and then add a few `Dense` layers on top:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasFlatten\n", + "creating: createZooKerasDense\n", + "creating: createZooKerasDense\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.add(layers.Flatten())\n", + "model.add(layers.Dense(64, activation='relu'))\n", + "model.add(layers.Dense(10, activation='softmax'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are going to do 10-way classification, so we use a final layer with 10 outputs and a softmax activation. Now here's what our network \n", + "looks like:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, our `(64, 3, 3)` outputs were flattened into vectors of shape `(576,)`, before going through two `Dense` layers.\n", + "\n", + "Now, let's train our convnet on the MNIST digits. We will reuse a lot of the code we have already covered in the MNIST example from Chapter \n", + "2." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### CNN input shape\n", + "_Once we get the dataset, we need to reshape the images. 
In Keras the shape of the dataset is `(sample_size, height, width, channel)`, like the Keras code below:_\n", + " \n", + "    train_images = train_images.reshape((60000, 28, 28, 1))\n", + "\n", + "In the Keras API of Analytics Zoo, the default order is Theano-style NCHW `(sample_size, channel, height, width)`, so we reshape the images to `(60000, 1, 28, 28)` instead.\n", + "\n", + "Alternatively, you can use TensorFlow-style NHWC, as in Keras, by setting `Convolution2D(dim_ordering=\"tf\")`." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Using TensorFlow backend.\n" + ] + } + ], + "source": [ + "from keras.datasets import mnist\n", + "(train_images, train_labels), (test_images, test_labels) = mnist.load_data()\n", + "\n", + "train_images = train_images.reshape((60000, 1, 28, 28))\n", + "train_images = train_images.astype('float32') / 255\n", + "\n", + "test_images = test_images.reshape((10000, 1, 28, 28))\n", + "test_images = test_images.astype('float32') / 255" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createRMSprop\n", + "creating: createZooKerasSparseCategoricalCrossEntropy\n", + "creating: createZooKerasSparseCategoricalAccuracy\n" + ] + } + ], + "source": [ + "model.compile(optimizer='rmsprop',\n", + " loss='sparse_categorical_crossentropy',\n", + " metrics=['acc'])\n", + "\n", + "model.fit(train_images, train_labels, nb_epoch=5, batch_size=64)" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "Trained 64 records in 0.03212866 seconds. Throughput is 1991.9911 records/second. Loss is 0.0023578003." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "test_loss, test_acc = model.evaluate(test_images, test_labels)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9912999868392944" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_acc" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While our densely-connected network from Chapter 2 had a test accuracy of 97.8%, our basic convnet has a test accuracy of 99.1%: we \n", + "decreased our error rate by over 50% (relative). Not bad! " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/keras/6.2-understanding-recurrent-neural-networks.ipynb b/keras/6.2-understanding-recurrent-neural-networks.ipynb new file mode 100644 index 0000000..5b19bf2 --- /dev/null +++ b/keras/6.2-understanding-recurrent-neural-networks.ipynb @@ -0,0 +1,441 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First of all, set environment variables and initialize spark context:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: SPARK_DRIVER_MEMORY=16g\n", + "env: PYSPARK_PYTHON=/usr/bin/python3.5\n", + "env: 
PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n", + "Prepending /home/litchy/.local/lib/python3.5/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path\n" + ] + } + ], + "source": [ + "%env SPARK_DRIVER_MEMORY=16g\n", + "%env PYSPARK_PYTHON=/usr/bin/python3.5\n", + "%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5\n", + "\n", + "from zoo.common.nncontext import *\n", + "sc = init_nncontext(init_spark_conf().setMaster(\"local[4]\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Understanding recurrent neural networks\n", + "\n", + "----\n", + "\n", + "In this section we will build recurrent neural networks to finish the same task as we did in chapter 3." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A first recurrent layer in Keras API of Analytics Zoo\n", + "\n", + "The process we just naively implemented in Numpy corresponds to an actual layer: the `SimpleRNN` layer:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras.layers import SimpleRNN" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is just one minor difference: `SimpleRNN` processes batches of sequences, like all other Keras API of Analytics Zoo layers, not just a single sequence like \n", + "in our Numpy example. This means that it takes inputs of shape `(batch_size, timesteps, input_features)`, rather than `(timesteps, \n", + "input_features)`.\n", + "\n", + "Like all recurrent layers in Keras API of Analytics Zoo, `SimpleRNN` can be run in two different modes: it can return either the full sequences of successive \n", + "outputs for each timestep (a 3D tensor of shape `(batch_size, timesteps, output_features)`), or it can return only the last output for each \n", + "input sequence (a 2D tensor of shape `(batch_size, output_features)`). These two modes are controlled by the `return_sequences` constructor \n", + "argument. 
Let's take a look at an example:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "from zoo.pipeline.api.keras.models import Sequential\n", + "from zoo.pipeline.api.keras.layers import Embedding, SimpleRNN" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The preprocessing helper follows. You do not need to care about the details of its implementation; basically, `pad_sequences` pads or truncates all the sequences to the same length." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "def pad_sequences(sequences, maxlen=None, dtype='int32',\n", + "                  padding='pre', truncating='pre', value=0.):\n", + "    lengths = [len(s) for s in sequences]\n", + "\n", + "    nb_samples = len(sequences)\n", + "    if maxlen is None:\n", + "        maxlen = np.max(lengths)\n", + "\n", + "    # take the sample shape from the first non empty sequence\n", + "    # checking for consistency in the main loop below.\n", + "    sample_shape = tuple()\n", + "    for s in sequences:\n", + "        if len(s) > 0:\n", + "            sample_shape = np.asarray(s).shape[1:]\n", + "            break\n", + "\n", + "    x = (np.ones((nb_samples, maxlen) + sample_shape) * value).astype(dtype)\n", + "    for idx, s in enumerate(sequences):\n", + "        if not len(s):\n", + "            continue  # empty list/array was found\n", + "        if truncating == 'pre':\n", + "            trunc = s[-maxlen:]\n", + "        elif truncating == 'post':\n", + "            trunc = s[:maxlen]\n", + "        else:\n", + "            raise ValueError('Truncating type \"%s\" not understood' % truncating)\n", + "\n", + "        # check `trunc` has expected shape\n", + "        trunc = np.asarray(trunc, dtype=dtype)\n", + "        if trunc.shape[1:] != sample_shape:\n", + "            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %\n", + "                             (trunc.shape[1:], idx, sample_shape))\n", + "\n", + "        if padding == 'post':\n", + "            x[idx, :len(trunc)] = trunc\n", + "        elif padding == 'pre':\n", 
+ "            x[idx, -len(trunc):] = trunc\n", + "        else:\n", + "            raise ValueError('Padding type \"%s\" not understood' % padding)\n", + "    return x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's try to use such a model on the IMDB movie review classification problem. First, let's preprocess the data. " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "input_train shape: (25000, 500)\n", + "input_test shape: (25000, 500)\n" + ] + } + ], + "source": [ + "from zoo.pipeline.api.keras.datasets import imdb\n", + "\n", + "max_features = 10000 # number of words to consider as features\n", + "maxlen = 500 # cut texts after this number of words (among top max_features most common words)\n", + "batch_size = 32\n", + "\n", + "(input_train, y_train), (input_test, y_test) = imdb.load_data(nb_words=max_features)\n", + "input_train = pad_sequences(input_train, maxlen=maxlen)\n", + "input_test = pad_sequences(input_test, maxlen=maxlen)\n", + "print('input_train shape:', input_train.shape)\n", + "print('input_test shape:', input_test.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is sometimes useful to stack several recurrent layers one after the other in order to increase the representational power of a network. 
\n", + "In such a setup, you have to get all intermediate layers to return full sequences:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Specify input shape\n", + "_In Keras, one could add an embedding layer as the first layer as follows:_\n", + " \n", + "    model = Sequential()\n", + "    model.add(Embedding(10000, 32))\n", + "\n", + "_In the Keras API of Analytics Zoo, you need to specify the input shape of the first layer. In this example the sequence length is 500, as shown above, so we can build our model as follows:_" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasEmbedding\n", + "creating: createZooKerasSimpleRNN\n", + "creating: createZooKerasSimpleRNN\n", + "creating: createZooKerasSimpleRNN\n", + "creating: createZooKerasSimpleRNN\n" + ] + } + ], + "source": [ + "model = Sequential()\n", + "model.add(Embedding(10000, 32, input_shape=(500,)))\n", + "model.add(SimpleRNN(32, return_sequences=True))\n", + "model.add(SimpleRNN(32, return_sequences=True))\n", + "model.add(SimpleRNN(32, return_sequences=True))\n", + "model.add(SimpleRNN(32)) # This last layer only returns the last outputs.\n", + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's train a simple recurrent network using an `Embedding` layer and a `SimpleRNN` layer:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasEmbedding\n", + "creating: createZooKerasSimpleRNN\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "from 
zoo.pipeline.api.keras.layers import Dense\n", + "\n", + "model = Sequential()\n", + "model.add(Embedding(max_features, 32, input_shape=(500,)))\n", + "model.add(SimpleRNN(32))\n", + "model.add(Dense(1, activation='sigmoid'))\n", + "\n", + "model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])\n", + "\n", + "import time\n", + "dir_name = '6-2 ' + str(time.ctime())\n", + "model.set_tensorboard('./', dir_name)\n", + "model.fit(input_train, y_train,\n", + " nb_epoch=10,\n", + " batch_size=128,\n", + " validation_split=0.2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 128 records in 0.046239497 seconds. Throughput is 2768.1963 records/second. Loss is 0.16970885.\n", + "\n", + "Top1Accuracy is Accuracy(correct: 4167, count: 5000, accuracy: 0.8334)_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's display the training and validation loss:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "train_loss = np.array(model.get_train_summary('Loss'))\n", + "val_loss = np.array(model.get_validation_summary('Loss'))\n", + "\n", + "import matplotlib.pyplot as plt\n", + "plt.plot(train_loss[:,0],train_loss[:,1],label='train loss')\n", + "plt.plot(val_loss[:,0],val_loss[:,1],label='validation loss',color='green')\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a reminder, in chapter 3, our very first naive approach to this very dataset got us to 88% test accuracy. Unfortunately, our small \n", + "recurrent network doesn't perform very well at all compared to this baseline (only up to 85% validation accuracy). Part of the problem is \n", + "that our inputs only consider the first 500 words rather the full sequences -- \n", + "hence our RNN has access to less information than our earlier baseline model. The remainder of the problem is simply that `SimpleRNN` isn't very good at processing long sequences, like text. Other types of recurrent layers perform much better. Let's take a look at some \n", + "more advanced layers." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A concrete LSTM example in Keras API of Analytics Zoo\n", + "\n", + "Now let's switch to more practical concerns: we will set up a model using a LSTM layer and train it on the IMDB data. Here's the network, \n", + "similar to the one with `SimpleRNN` that we just presented. We only specify the output dimensionality of the LSTM layer, and leave every \n", + "other argument (there are lots) to the Keras API of Analytics Zoo defaults, which has good defaults, and things will almost always \"just work\" without you \n", + "having to spend time tuning parameters by hand." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "creating: createZooKerasSequential\n", + "creating: createZooKerasEmbedding\n", + "creating: createZooKerasLSTM\n", + "creating: createZooKerasDense\n", + "creating: createRMSprop\n", + "creating: createZooKerasBinaryCrossEntropy\n", + "creating: createZooKerasBinaryAccuracy\n" + ] + } + ], + "source": [ + "from zoo.pipeline.api.keras.layers import LSTM\n", + "\n", + "model = Sequential()\n", + "model.add(Embedding(max_features, 32, input_shape=(500,)))\n", + "model.add(LSTM(32))\n", + "model.add(Dense(1, activation='sigmoid'))\n", + "\n", + "model.compile(optimizer='rmsprop',\n", + " loss='binary_crossentropy',\n", + " metrics=['acc'])\n", + "\n", + "dir_name = '6-2 ' + str(time.ctime())\n", + "model.set_tensorboard('./', dir_name)\n", + "model.fit(input_train, y_train,\n", + " nb_epoch=10,\n", + " batch_size=128,\n", + " validation_split=0.2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_INFO - Trained 128 records in 0.335889472 seconds. Throughput is 381.07776 records/second. Loss is 0.14791179.\n", + "\n", + "Top1Accuracy is Accuracy(correct: 4358, count: 5000, accuracy: 0.8716)_" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "train_loss = np.array(model.get_train_summary('Loss'))\n", + "val_loss = np.array(model.get_validation_summary('Loss'))\n", + "\n", + "plt.plot(train_loss[:,0],train_loss[:,1],label='train loss')\n", + "plt.plot(val_loss[:,0],val_loss[:,1],label='validation loss',color='green')\n", + "plt.xlabel('Steps')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "plt.show()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/tensorflow/bert/run_classifier.py b/tensorflow/bert/run_classifier.py new file mode 100644 index 0000000..817b147 --- /dev/null +++ b/tensorflow/bert/run_classifier.py @@ -0,0 +1,981 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""BERT finetuning runner.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import collections +import csv +import os +import modeling +import optimization +import tokenization +import tensorflow as tf + +flags = tf.flags + +FLAGS = flags.FLAGS + +## Required parameters +flags.DEFINE_string( + "data_dir", None, + "The input data dir. Should contain the .tsv files (or other data files) " + "for the task.") + +flags.DEFINE_string( + "bert_config_file", None, + "The config json file corresponding to the pre-trained BERT model. " + "This specifies the model architecture.") + +flags.DEFINE_string("task_name", None, "The name of the task to train.") + +flags.DEFINE_string("vocab_file", None, + "The vocabulary file that the BERT model was trained on.") + +flags.DEFINE_string( + "output_dir", None, + "The output directory where the model checkpoints will be written.") + +## Other parameters + +flags.DEFINE_string( + "init_checkpoint", None, + "Initial checkpoint (usually from a pre-trained BERT model).") + +flags.DEFINE_bool( + "do_lower_case", True, + "Whether to lower case the input text. Should be True for uncased " + "models and False for cased models.") + +flags.DEFINE_integer( + "max_seq_length", 128, + "The maximum total input sequence length after WordPiece tokenization. 
" + "Sequences longer than this will be truncated, and sequences shorter " + "than this will be padded.") + +flags.DEFINE_bool("do_train", False, "Whether to run training.") + +flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") + +flags.DEFINE_bool( + "do_predict", False, + "Whether to run the model in inference mode on the test set.") + +flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") + +flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") + +flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") + +flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") + +flags.DEFINE_float("num_train_epochs", 3.0, + "Total number of training epochs to perform.") + +flags.DEFINE_float( + "warmup_proportion", 0.1, + "Proportion of training to perform linear learning rate warmup for. " + "E.g., 0.1 = 10% of training.") + +flags.DEFINE_integer("save_checkpoints_steps", 1000, + "How often to save the model checkpoint.") + +flags.DEFINE_integer("iterations_per_loop", 1000, + "How many steps to make in each estimator call.") + +flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") + +tf.flags.DEFINE_string( + "tpu_name", None, + "The Cloud TPU to use for training. This should be either the name " + "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " + "url.") + +tf.flags.DEFINE_string( + "tpu_zone", None, + "[Optional] GCE zone where the Cloud TPU is located in. If not " + "specified, we will attempt to automatically detect the GCE project from " + "metadata.") + +tf.flags.DEFINE_string( + "gcp_project", None, + "[Optional] Project name for the Cloud TPU-enabled project. If not " + "specified, we will attempt to automatically detect the GCE project from " + "metadata.") + +tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") + +flags.DEFINE_integer( + "num_tpu_cores", 8, + "Only used if `use_tpu` is True. 
Total number of TPU cores to use.") + + +class InputExample(object): + """A single training/test example for simple sequence classification.""" + + def __init__(self, guid, text_a, text_b=None, label=None): + """Constructs an InputExample. + + Args: + guid: Unique id for the example. + text_a: string. The untokenized text of the first sequence. For single + sequence tasks, only this sequence must be specified. + text_b: (Optional) string. The untokenized text of the second sequence. + Must be specified only for sequence pair tasks. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. + """ + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +class PaddingInputExample(object): + """Fake example so the num input examples is a multiple of the batch size. + + When running eval/predict on the TPU, we need to pad the number of examples + to be a multiple of the batch size, because the TPU requires a fixed batch + size. The alternative is to drop the last batch, which is bad because it means + the entire output data won't be generated. + + We use this class instead of `None` because treating `None` as padding + batches could cause silent errors. 
+ """ + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, + input_ids, + input_mask, + segment_ids, + label_id, + is_real_example=True): + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_id = label_id + self.is_real_example = is_real_example + + +class DataProcessor(object): + """Base class for data converters for sequence classification data sets.""" + + def get_train_examples(self, data_dir): + """Gets a collection of `InputExample`s for the train set.""" + raise NotImplementedError() + + def get_dev_examples(self, data_dir): + """Gets a collection of `InputExample`s for the dev set.""" + raise NotImplementedError() + + def get_test_examples(self, data_dir): + """Gets a collection of `InputExample`s for prediction.""" + raise NotImplementedError() + + def get_labels(self): + """Gets the list of labels for this data set.""" + raise NotImplementedError() + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with tf.gfile.Open(input_file, "r") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + lines.append(line) + return lines + + +class XnliProcessor(DataProcessor): + """Processor for the XNLI data set.""" + + def __init__(self): + self.language = "zh" + + def get_train_examples(self, data_dir): + """See base class.""" + lines = self._read_tsv( + os.path.join(data_dir, "multinli", + "multinli.train.%s.tsv" % self.language)) + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "train-%d" % (i) + text_a = tokenization.convert_to_unicode(line[0]) + text_b = tokenization.convert_to_unicode(line[1]) + label = tokenization.convert_to_unicode(line[2]) + if label == tokenization.convert_to_unicode("contradictory"): + label = tokenization.convert_to_unicode("contradiction") + examples.append( + InputExample(guid=guid, 
text_a=text_a, text_b=text_b, label=label)) + return examples + + def get_dev_examples(self, data_dir): + """See base class.""" + lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")) + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "dev-%d" % (i) + language = tokenization.convert_to_unicode(line[0]) + if language != tokenization.convert_to_unicode(self.language): + continue + text_a = tokenization.convert_to_unicode(line[6]) + text_b = tokenization.convert_to_unicode(line[7]) + label = tokenization.convert_to_unicode(line[1]) + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + def get_labels(self): + """See base class.""" + return ["contradiction", "entailment", "neutral"] + + +class MnliProcessor(DataProcessor): + """Processor for the MultiNLI data set (GLUE version).""" + + def get_train_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), + "dev_matched") + + def get_test_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test") + + def get_labels(self): + """See base class.""" + return ["contradiction", "entailment", "neutral"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0])) + text_a = tokenization.convert_to_unicode(line[8]) + text_b = tokenization.convert_to_unicode(line[9]) + if set_type == "test": + label = "contradiction" + else: + label = tokenization.convert_to_unicode(line[-1]) + examples.append( + 
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class MrpcProcessor(DataProcessor): + """Processor for the MRPC data set (GLUE version).""" + + def get_train_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_test_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + text_a = tokenization.convert_to_unicode(line[3]) + text_b = tokenization.convert_to_unicode(line[4]) + if set_type == "test": + label = "0" + else: + label = tokenization.convert_to_unicode(line[0]) + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class ColaProcessor(DataProcessor): + """Processor for the CoLA data set (GLUE version).""" + + def get_train_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_test_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training 
and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + # Only the test set has a header + if set_type == "test" and i == 0: + continue + guid = "%s-%s" % (set_type, i) + if set_type == "test": + text_a = tokenization.convert_to_unicode(line[1]) + label = "0" + else: + text_a = tokenization.convert_to_unicode(line[3]) + label = tokenization.convert_to_unicode(line[1]) + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) + return examples + + +def convert_single_example(ex_index, example, label_list, max_seq_length, + tokenizer): + """Converts a single `InputExample` into a single `InputFeatures`.""" + + if isinstance(example, PaddingInputExample): + return InputFeatures( + input_ids=[0] * max_seq_length, + input_mask=[0] * max_seq_length, + segment_ids=[0] * max_seq_length, + label_id=0, + is_real_example=False) + + label_map = {} + for (i, label) in enumerate(label_list): + label_map[label] = i + + tokens_a = tokenizer.tokenize(example.text_a) + tokens_b = None + if example.text_b: + tokens_b = tokenizer.tokenize(example.text_b) + + if tokens_b: + # Modifies `tokens_a` and `tokens_b` in place so that the total + # length is less than the specified length. + # Account for [CLS], [SEP], [SEP] with "- 3" + _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) + else: + # Account for [CLS] and [SEP] with "- 2" + if len(tokens_a) > max_seq_length - 2: + tokens_a = tokens_a[0:(max_seq_length - 2)] + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. 
The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens = [] + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in tokens_a: + tokens.append(token) + segment_ids.append(0) + tokens.append("[SEP]") + segment_ids.append(0) + + if tokens_b: + for token in tokens_b: + tokens.append(token) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. 
+ while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + label_id = label_map[example.label] + if ex_index < 5: + tf.logging.info("*** Example ***") + tf.logging.info("guid: %s" % (example.guid)) + tf.logging.info("tokens: %s" % " ".join( + [tokenization.printable_text(x) for x in tokens])) + tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) + tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) + tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) + tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) + + feature = InputFeatures( + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_id=label_id, + is_real_example=True) + return feature + + +def file_based_convert_examples_to_features( + examples, label_list, max_seq_length, tokenizer, output_file): + """Convert a set of `InputExample`s to a TFRecord file.""" + + writer = tf.python_io.TFRecordWriter(output_file) + + for (ex_index, example) in enumerate(examples): + if ex_index % 10000 == 0: + tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) + + feature = convert_single_example(ex_index, example, label_list, + max_seq_length, tokenizer) + + def create_int_feature(values): + f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) + return f + + features = collections.OrderedDict() + features["input_ids"] = create_int_feature(feature.input_ids) + features["input_mask"] = create_int_feature(feature.input_mask) + features["segment_ids"] = create_int_feature(feature.segment_ids) + features["label_ids"] = create_int_feature([feature.label_id]) + features["is_real_example"] = create_int_feature( + [int(feature.is_real_example)]) + + tf_example = 
tf.train.Example(features=tf.train.Features(feature=features)) + writer.write(tf_example.SerializeToString()) + writer.close() + + +def file_based_input_fn_builder(input_file, seq_length, is_training, + drop_remainder): + """Creates an `input_fn` closure to be passed to TPUEstimator.""" + + name_to_features = { + "input_ids": tf.FixedLenFeature([seq_length], tf.int64), + "input_mask": tf.FixedLenFeature([seq_length], tf.int64), + "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), + "label_ids": tf.FixedLenFeature([], tf.int64), + "is_real_example": tf.FixedLenFeature([], tf.int64), + } + + def _decode_record(record, name_to_features): + """Decodes a record to a TensorFlow example.""" + example = tf.parse_single_example(record, name_to_features) + + # tf.Example only supports tf.int64, but the TPU only supports tf.int32. + # So cast all int64 to int32. + for name in list(example.keys()): + t = example[name] + if t.dtype == tf.int64: + t = tf.to_int32(t) + example[name] = t + + return example + + def input_fn(params): + """The actual input function.""" + batch_size = params["batch_size"] + + # For training, we want a lot of parallel reading and shuffling. + # For eval, we want no shuffling and parallel reading doesn't matter. + d = tf.data.TFRecordDataset(input_file) + if is_training: + d = d.repeat() + d = d.shuffle(buffer_size=100) + + d = d.apply( + tf.contrib.data.map_and_batch( + lambda record: _decode_record(record, name_to_features), + batch_size=batch_size, + drop_remainder=drop_remainder)) + + return d + + return input_fn + + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. 
This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + +def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, + labels, num_labels, use_one_hot_embeddings): + """Creates a classification model.""" + model = modeling.BertModel( + config=bert_config, + is_training=is_training, + input_ids=input_ids, + input_mask=input_mask, + token_type_ids=segment_ids, + use_one_hot_embeddings=use_one_hot_embeddings) + + # In the demo, we are doing a simple classification task on the entire + # segment. + # + # If you want to use the token-level output, use model.get_sequence_output() + # instead. + output_layer = model.get_pooled_output() + + hidden_size = output_layer.shape[-1].value + + output_weights = tf.get_variable( + "output_weights", [num_labels, hidden_size], + initializer=tf.truncated_normal_initializer(stddev=0.02)) + + output_bias = tf.get_variable( + "output_bias", [num_labels], initializer=tf.zeros_initializer()) + + with tf.variable_scope("loss"): + if is_training: + # I.e., 0.1 dropout + output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) + + logits = tf.matmul(output_layer, output_weights, transpose_b=True) + logits = tf.nn.bias_add(logits, output_bias) + probabilities = tf.nn.softmax(logits, axis=-1) + log_probs = tf.nn.log_softmax(logits, axis=-1) + + one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) + + per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) + loss = tf.reduce_mean(per_example_loss) + + return (loss, per_example_loss, logits, probabilities) + + +def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, + num_train_steps, 
num_warmup_steps, use_tpu, + use_one_hot_embeddings): + """Returns `model_fn` closure for TPUEstimator.""" + + def model_fn(features, labels, mode, params): # pylint: disable=unused-argument + """The `model_fn` for TPUEstimator.""" + + tf.logging.info("*** Features ***") + for name in sorted(features.keys()): + tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) + + input_ids = features["input_ids"] + input_mask = features["input_mask"] + segment_ids = features["segment_ids"] + label_ids = features["label_ids"] + is_real_example = None + if "is_real_example" in features: + is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) + else: + is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) + + is_training = (mode == tf.estimator.ModeKeys.TRAIN) + + (total_loss, per_example_loss, logits, probabilities) = create_model( + bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, + num_labels, use_one_hot_embeddings) + + tvars = tf.trainable_variables() + initialized_variable_names = {} + scaffold_fn = None + if init_checkpoint: + (assignment_map, initialized_variable_names + ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) + if use_tpu: + + def tpu_scaffold(): + tf.train.init_from_checkpoint(init_checkpoint, assignment_map) + return tf.train.Scaffold() + + scaffold_fn = tpu_scaffold + else: + tf.train.init_from_checkpoint(init_checkpoint, assignment_map) + + tf.logging.info("**** Trainable Variables ****") + for var in tvars: + init_string = "" + if var.name in initialized_variable_names: + init_string = ", *INIT_FROM_CKPT*" + tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, + init_string) + + output_spec = None + if mode == tf.estimator.ModeKeys.TRAIN: + + train_op = optimization.create_optimizer( + total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) + + output_spec = tf.contrib.tpu.TPUEstimatorSpec( + mode=mode, + loss=total_loss, + 
train_op=train_op, + scaffold_fn=scaffold_fn) + elif mode == tf.estimator.ModeKeys.EVAL: + + def metric_fn(per_example_loss, label_ids, logits, is_real_example): + predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) + accuracy = tf.metrics.accuracy( + labels=label_ids, predictions=predictions, weights=is_real_example) + loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) + return { + "eval_accuracy": accuracy, + "eval_loss": loss, + } + + eval_metrics = (metric_fn, + [per_example_loss, label_ids, logits, is_real_example]) + output_spec = tf.contrib.tpu.TPUEstimatorSpec( + mode=mode, + loss=total_loss, + eval_metrics=eval_metrics, + scaffold_fn=scaffold_fn) + else: + output_spec = tf.contrib.tpu.TPUEstimatorSpec( + mode=mode, + predictions={"probabilities": probabilities}, + scaffold_fn=scaffold_fn) + return output_spec + + return model_fn + + +# This function is not used by this file but is still used by the Colab and +# people who depend on it. +def input_fn_builder(features, seq_length, is_training, drop_remainder): + """Creates an `input_fn` closure to be passed to TPUEstimator.""" + + all_input_ids = [] + all_input_mask = [] + all_segment_ids = [] + all_label_ids = [] + + for feature in features: + all_input_ids.append(feature.input_ids) + all_input_mask.append(feature.input_mask) + all_segment_ids.append(feature.segment_ids) + all_label_ids.append(feature.label_id) + + def input_fn(params): + """The actual input function.""" + batch_size = params["batch_size"] + + num_examples = len(features) + + # This is for demo purposes and does NOT scale to large data sets. We do + # not use Dataset.from_generator() because that uses tf.py_func which is + # not TPU compatible. The right way to load data is with TFRecordReader. 
+ d = tf.data.Dataset.from_tensor_slices({ + "input_ids": + tf.constant( + all_input_ids, shape=[num_examples, seq_length], + dtype=tf.int32), + "input_mask": + tf.constant( + all_input_mask, + shape=[num_examples, seq_length], + dtype=tf.int32), + "segment_ids": + tf.constant( + all_segment_ids, + shape=[num_examples, seq_length], + dtype=tf.int32), + "label_ids": + tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), + }) + + if is_training: + d = d.repeat() + d = d.shuffle(buffer_size=100) + + d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) + return d + + return input_fn + + +# This function is not used by this file but is still used by the Colab and +# people who depend on it. +def convert_examples_to_features(examples, label_list, max_seq_length, + tokenizer): + """Convert a set of `InputExample`s to a list of `InputFeatures`.""" + + features = [] + for (ex_index, example) in enumerate(examples): + if ex_index % 10000 == 0: + tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) + + feature = convert_single_example(ex_index, example, label_list, + max_seq_length, tokenizer) + + features.append(feature) + return features + + +def main(_): + tf.logging.set_verbosity(tf.logging.INFO) + + processors = { + "cola": ColaProcessor, + "mnli": MnliProcessor, + "mrpc": MrpcProcessor, + "xnli": XnliProcessor, + } + + tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case, + FLAGS.init_checkpoint) + + if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: + raise ValueError( + "At least one of `do_train`, `do_eval` or `do_predict` must be True.") + + bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) + + if FLAGS.max_seq_length > bert_config.max_position_embeddings: + raise ValueError( + "Cannot use sequence length %d because the BERT model " + "was only trained up to sequence length %d" % + (FLAGS.max_seq_length, bert_config.max_position_embeddings)) + + 
tf.gfile.MakeDirs(FLAGS.output_dir) + + task_name = FLAGS.task_name.lower() + + if task_name not in processors: + raise ValueError("Task not found: %s" % (task_name)) + + processor = processors[task_name]() + + label_list = processor.get_labels() + + tokenizer = tokenization.FullTokenizer( + vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) + + tpu_cluster_resolver = None + if FLAGS.use_tpu and FLAGS.tpu_name: + tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( + FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) + + is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 + run_config = tf.contrib.tpu.RunConfig( + cluster=tpu_cluster_resolver, + master=FLAGS.master, + model_dir=FLAGS.output_dir, + save_checkpoints_steps=FLAGS.save_checkpoints_steps, + tpu_config=tf.contrib.tpu.TPUConfig( + iterations_per_loop=FLAGS.iterations_per_loop, + num_shards=FLAGS.num_tpu_cores, + per_host_input_for_training=is_per_host)) + + train_examples = None + num_train_steps = None + num_warmup_steps = None + if FLAGS.do_train: + train_examples = processor.get_train_examples(FLAGS.data_dir) + num_train_steps = int( + len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) + num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) + + model_fn = model_fn_builder( + bert_config=bert_config, + num_labels=len(label_list), + init_checkpoint=FLAGS.init_checkpoint, + learning_rate=FLAGS.learning_rate, + num_train_steps=num_train_steps, + num_warmup_steps=num_warmup_steps, + use_tpu=FLAGS.use_tpu, + use_one_hot_embeddings=FLAGS.use_tpu) + + # If TPU is not available, this will fall back to normal Estimator on CPU + # or GPU. 
+ estimator = tf.contrib.tpu.TPUEstimator( + use_tpu=FLAGS.use_tpu, + model_fn=model_fn, + config=run_config, + train_batch_size=FLAGS.train_batch_size, + eval_batch_size=FLAGS.eval_batch_size, + predict_batch_size=FLAGS.predict_batch_size) + + if FLAGS.do_train: + train_file = os.path.join(FLAGS.output_dir, "train.tf_record") + file_based_convert_examples_to_features( + train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) + tf.logging.info("***** Running training *****") + tf.logging.info(" Num examples = %d", len(train_examples)) + tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) + tf.logging.info(" Num steps = %d", num_train_steps) + train_input_fn = file_based_input_fn_builder( + input_file=train_file, + seq_length=FLAGS.max_seq_length, + is_training=True, + drop_remainder=True) + estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) + + if FLAGS.do_eval: + eval_examples = processor.get_dev_examples(FLAGS.data_dir) + num_actual_eval_examples = len(eval_examples) + if FLAGS.use_tpu: + # TPU requires a fixed batch size for all batches, therefore the number + # of examples must be a multiple of the batch size, or else examples + # will get dropped. So we pad with fake examples which are ignored + # later on. These do NOT count towards the metric (all tf.metrics + # support a per-instance weight, and these get a weight of 0.0). 
+ while len(eval_examples) % FLAGS.eval_batch_size != 0: + eval_examples.append(PaddingInputExample()) + + eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") + file_based_convert_examples_to_features( + eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) + + tf.logging.info("***** Running evaluation *****") + tf.logging.info(" Num examples = %d (%d actual, %d padding)", + len(eval_examples), num_actual_eval_examples, + len(eval_examples) - num_actual_eval_examples) + tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) + + # This tells the estimator to run through the entire set. + eval_steps = None + # However, if running eval on the TPU, you will need to specify the + # number of steps. + if FLAGS.use_tpu: + assert len(eval_examples) % FLAGS.eval_batch_size == 0 + eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size) + + eval_drop_remainder = True if FLAGS.use_tpu else False + eval_input_fn = file_based_input_fn_builder( + input_file=eval_file, + seq_length=FLAGS.max_seq_length, + is_training=False, + drop_remainder=eval_drop_remainder) + + result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) + + output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") + with tf.gfile.GFile(output_eval_file, "w") as writer: + tf.logging.info("***** Eval results *****") + for key in sorted(result.keys()): + tf.logging.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + if FLAGS.do_predict: + predict_examples = processor.get_test_examples(FLAGS.data_dir) + num_actual_predict_examples = len(predict_examples) + if FLAGS.use_tpu: + # TPU requires a fixed batch size for all batches, therefore the number + # of examples must be a multiple of the batch size, or else examples + # will get dropped. So we pad with fake examples which are ignored + # later on. 
+ while len(predict_examples) % FLAGS.predict_batch_size != 0: + predict_examples.append(PaddingInputExample()) + + predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") + file_based_convert_examples_to_features(predict_examples, label_list, + FLAGS.max_seq_length, tokenizer, + predict_file) + + tf.logging.info("***** Running prediction*****") + tf.logging.info(" Num examples = %d (%d actual, %d padding)", + len(predict_examples), num_actual_predict_examples, + len(predict_examples) - num_actual_predict_examples) + tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) + + predict_drop_remainder = True if FLAGS.use_tpu else False + predict_input_fn = file_based_input_fn_builder( + input_file=predict_file, + seq_length=FLAGS.max_seq_length, + is_training=False, + drop_remainder=predict_drop_remainder) + + result = estimator.predict(input_fn=predict_input_fn) + + output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") + with tf.gfile.GFile(output_predict_file, "w") as writer: + num_written_lines = 0 + tf.logging.info("***** Predict results *****") + for (i, prediction) in enumerate(result): + probabilities = prediction["probabilities"] + if i >= num_actual_predict_examples: + break + output_line = "\t".join( + str(class_probability) + for class_probability in probabilities) + "\n" + writer.write(output_line) + num_written_lines += 1 + assert num_written_lines == num_actual_predict_examples + + +if __name__ == "__main__": + flags.mark_flag_as_required("data_dir") + flags.mark_flag_as_required("task_name") + flags.mark_flag_as_required("vocab_file") + flags.mark_flag_as_required("bert_config_file") + flags.mark_flag_as_required("output_dir") + tf.app.run()