Merge pull request #20 from josephch405/master

Version updates and reformatting

CS230 Deep Learning authored Oct 8, 2019
2 parents 159df10 + 192216e commit 96ac6fd
Showing 13 changed files with 222 additions and 137 deletions.
45 changes: 29 additions & 16 deletions pytorch/nlp/README.md
@@ -1,6 +1,6 @@
# Named Entity Recognition with PyTorch

- *Authors: Surag Nair, Guillaume Genthial and Olivier Moindrot*
+ _Authors: Surag Nair, Guillaume Genthial and Olivier Moindrot_

Take the time to read the [tutorials](https://cs230-stanford.github.io/project-starter-code.html).

@@ -31,75 +31,88 @@ B-PER O O B-LOC I-LOC

We provide a small subset of the Kaggle dataset (30 sentences) for testing in `data/small`, but you are encouraged to download the original version from the [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data) website.

- 1. __Download the dataset__ `ner_dataset.csv` on [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data) and save it under the `nlp/data/kaggle` directory. Make sure you download the simple version `ner_dataset.csv` and NOT the full version `ner.csv`.
+ 1. **Download the dataset** `ner_dataset.csv` on [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data) and save it under the `nlp/data/kaggle` directory. Make sure you download the simple version `ner_dataset.csv` and NOT the full version `ner.csv`.

- 2. __Build the dataset__ Run the following script
+ 2. **Build the dataset** Run the following script

```
python build_kaggle_dataset.py
```

It will extract the sentences and labels from the dataset, split them into train/val/test sets, and save them in a convenient format for our model (see the layout sketch after this list).

- *Debug* If you get some errors, check that you downloaded the right file and saved it in the right directory. If you have issues with encoding, try running the script with python 2.7.
+ _Debug_ If you get some errors, check that you downloaded the right file and saved it in the right directory. If you have issues with encoding, try running the script with python 2.7.

3. In the next section, replace `data/small` with `data/kaggle`
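If the build script follows the starter code's usual convention of one `sentences.txt`/`labels.txt` pair per split, the processed data should end up organized along these lines (a hypothetical listing, not guaranteed output):

```
data/kaggle/
    train/
        sentences.txt   # one tokenized sentence per line
        labels.txt      # one tag sequence per line, aligned with sentences.txt
    val/
        ...
    test/
        ...
```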


## Quickstart (~10 min)

- 1. __Build__ vocabularies and parameters for your dataset by running
+ 1. **Build** vocabularies and parameters for your dataset by running

```
python build_vocab.py --data_dir data/small
```

It will write vocabulary files `words.txt` and `tags.txt` containing the words and tags in the dataset. It will also save a `dataset_params.json` with some extra information.
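Vocabulary building amounts to counting the distinct tokens in the training files. A minimal sketch of the idea (not the actual `build_vocab.py`, and assuming whitespace-tokenized, one-sentence-per-line files):

```python
from collections import Counter

def build_vocab(txt_path):
    """Count distinct whitespace-separated tokens in a text file."""
    counter = Counter()
    with open(txt_path) as f:
        for line in f:
            counter.update(line.strip().split())
    return counter

# hypothetical path following the data/small layout
words = build_vocab("data/small/train/sentences.txt")
with open("words.txt", "w") as f:
    for token, _ in words.most_common():
        f.write(token + "\n")
```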

- 2. __Your first experiment__ We created a `base_model` directory for you under the `experiments` directory. It contains a file `params.json` which sets the hyperparameters for the experiment. It looks like
+ 2. **Your first experiment** We created a `base_model` directory for you under the `experiments` directory. It contains a file `params.json` which sets the hyperparameters for the experiment. It looks like

```json
{
"learning_rate": 1e-3,
"batch_size": 5,
"num_epochs": 2
"learning_rate": 1e-3,
"batch_size": 5,
"num_epochs": 2
}
```

For every new experiment, you will need to create a new directory under `experiments` with a `params.json` file.
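Loading those hyperparameters takes only a few lines; here is a minimal sketch of what a `Params` helper plausibly does (illustrative, not the repo's exact `utils.Params`):

```python
import json

class Params:
    """Load hyperparameters from a json file as object attributes."""
    def __init__(self, json_path):
        with open(json_path) as f:
            self.__dict__.update(json.load(f))

params = Params("experiments/base_model/params.json")
print(params.learning_rate, params.batch_size, params.num_epochs)
```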

- 3. __Train__ your experiment. Simply run
+ 3. **Train** your experiment. Simply run

```
python train.py --data_dir data/small --model_dir experiments/base_model
```

It will instantiate a model and train it on the training set following the hyperparameters specified in `params.json`. It will also evaluate some metrics on the development set.

- 4. __Your first hyperparameters search__ We created a new directory `learning_rate` in `experiments` for you. Now, run
+ 4. **Your first hyperparameters search** We created a new directory `learning_rate` in `experiments` for you. Now, run

```
python search_hyperparams.py --data_dir data/small --parent_dir experiments/learning_rate
```

It will train and evaluate a model with different values of learning rate defined in `search_hyperparams.py` and create a new directory for each experiment under `experiments/learning_rate/`.
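Under the hood, a search script like this typically just loops over values, writes a `params.json` per job, and launches `train.py`; a hedged sketch (the grid and directory names are illustrative, not the values in `search_hyperparams.py`):

```python
import json
import os
import subprocess
import sys

parent_dir = "experiments/learning_rate"
with open(os.path.join(parent_dir, "params.json")) as f:
    base_params = json.load(f)

for lr in [1e-4, 1e-3, 1e-2]:  # illustrative grid
    job_dir = os.path.join(parent_dir, "learning_rate_{}".format(lr))
    os.makedirs(job_dir, exist_ok=True)
    job_params = dict(base_params, learning_rate=lr)
    with open(os.path.join(job_dir, "params.json"), "w") as f:
        json.dump(job_params, f, indent=4)
    # launch one training run per hyperparameter setting
    subprocess.run([sys.executable, "train.py", "--data_dir", "data/small",
                    "--model_dir", job_dir], check=True)
```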

- 5. __Display the results__ of the hyperparameters search in a nice format
+ 5. **Display the results** of the hyperparameters search in a nice format

```
python synthesize_results.py --parent_dir experiments/learning_rate
```

- 6. __Evaluation on the test set__ Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set. Run
+ 6. **Evaluation on the test set** Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set. Run

```
python evaluate.py --data_dir data/small --model_dir experiments/base_model
```


## Guidelines for more advanced use

We recommend reading through `train.py` to get a high-level overview of the training loop steps (a condensed outline follows the list):

- loading the hyperparameters for the experiment (the `params.json`)
- loading the training and validation data
- creating the model, loss_fn and metrics
- training the model for a given number of epochs by calling `train_and_evaluate(...)`
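Condensed, the main block of `train.py` wires those steps together roughly as follows (an illustrative outline, not the verbatim source):

```python
import torch.optim as optim

import model.net as net
import utils
from model.data_loader import DataLoader

# load the hyperparameters, then the data
params = utils.Params("experiments/base_model/params.json")
data_loader = DataLoader("data/small", params)
data = data_loader.load_data(["train", "val"], "data/small")

# create the model, optimizer, loss function and metrics
model = net.Net(params)
optimizer = optim.Adam(model.parameters(), lr=params.learning_rate)

# train_and_evaluate is defined in train.py itself
train_and_evaluate(model, data["train"], data["val"], optimizer,
                   net.loss_fn, net.metrics, params, "experiments/base_model")
```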

You can then go through `model/data_loader.py` to understand the following steps (see the padding sketch after this list):

- loading the vocabularies from the `words.txt` and `tags.txt` files
- creating the sentences/labels datasets from the text files
- how the vocabulary is used to map tokens to their indices
- how the `data_iterator` creates a batch of data and labels and pads sentences
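The padding step is the part that trips people up most often; here is a standalone illustration of batching sentences of unequal length (a minimal sketch, not the repo's `data_iterator`):

```python
import torch

def pad_batch(sentences, pad_ind):
    """Pad a list of token-index lists to the length of the longest one."""
    batch_max_len = max(len(s) for s in sentences)
    # initialise every position to the PAD index, then copy real tokens in
    batch = [[pad_ind] * batch_max_len for _ in sentences]
    for i, s in enumerate(sentences):
        batch[i][:len(s)] = s
    return torch.LongTensor(batch)  # dim: batch_size x batch_max_len

# toy batch: token indices for two sentences of different lengths
print(pad_batch([[4, 2, 9], [7, 1]], pad_ind=0))
```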

Once you get the high-level idea, depending on your dataset, you might want to modify

- `model/model.py` to change the neural network, loss function and metrics
- `model/data_loader.py` to suit the data loader to your specific needs
- `train.py` for changing the optimizer
@@ -109,6 +122,6 @@ Once you get something working for your dataset, feel free to edit any part of t

## Resources

- - [PyTorch documentation](http://pytorch.org/docs/0.3.0/)
+ - [PyTorch documentation](http://pytorch.org/docs/1.2.0/)
- [Tutorials](http://pytorch.org/tutorials/)
- [PyTorch warm-up](https://github.com/jcjohnson/pytorch-examples)
2 changes: 1 addition & 1 deletion pytorch/nlp/evaluate.py
@@ -51,7 +51,7 @@ def evaluate(model, loss_fn, data_iterator, metrics, params, num_steps):
# compute all metrics on this batch
summary_batch = {metric: metrics[metric](output_batch, labels_batch)
for metric in metrics}
- summary_batch['loss'] = loss.data[0]
+ summary_batch['loss'] = loss.item()
summ.append(summary_batch)

# compute mean of all metrics in summary
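The `loss.data[0]` to `loss.item()` change seen throughout this commit tracks the PyTorch 0.4+ API: indexing a 0-dimensional tensor is no longer supported, and `.item()` is the documented way to extract a Python scalar. A quick illustration:

```python
import torch

loss = torch.tensor(0.25)  # a 0-dim tensor, like the value a loss function returns
print(loss.item())         # 0.25, the supported scalar accessor
# loss.data[0] raises "invalid index of a 0-dim tensor" on recent PyTorch
```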
22 changes: 13 additions & 9 deletions pytorch/nlp/model/net.py
@@ -38,11 +38,12 @@ def __init__(self, params):

# the LSTM takes as input the size of its input (embedding_dim), its hidden size
# for more details on how to use it, check out the documentation
- self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True)
+ self.lstm = nn.LSTM(params.embedding_dim,
+                     params.lstm_hidden_dim, batch_first=True)

# the fully connected layer transforms the output to give the final output layer
self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)

def forward(self, s):
"""
This function defines how we use the components of our network to operate on an input batch.
@@ -61,16 +62,19 @@ def forward(self, s):
"""
# -> batch_size x seq_len
# apply the embedding layer that maps each token to its embedding
- s = self.embedding(s)  # dim: batch_size x seq_len x embedding_dim
+ # dim: batch_size x seq_len x embedding_dim
+ s = self.embedding(s)

# run the LSTM along the sentences of length seq_len
- s, _ = self.lstm(s)  # dim: batch_size x seq_len x lstm_hidden_dim
+ # dim: batch_size x seq_len x lstm_hidden_dim
+ s, _ = self.lstm(s)

# make the Variable contiguous in memory (a PyTorch artefact)
s = s.contiguous()

# reshape the Variable so that each row contains one token
- s = s.view(-1, s.shape[2])  # dim: batch_size*seq_len x lstm_hidden_dim
+ # dim: batch_size*seq_len x lstm_hidden_dim
+ s = s.view(-1, s.shape[2])

# apply the fully connected layer and obtain the output (before softmax) for each token
s = self.fc(s) # dim: batch_size*seq_len x num_tags
@@ -107,12 +111,12 @@ def loss_fn(outputs, labels):
# number. This does not affect training, since we ignore the PADded tokens with the mask.
labels = labels % outputs.shape[1]

- num_tokens = int(torch.sum(mask).data[0])
+ num_tokens = int(torch.sum(mask))

# compute cross entropy loss for all tokens (except PADding tokens), by multiplying with mask.
return -torch.sum(outputs[range(outputs.shape[0]), labels]*mask)/num_tokens


def accuracy(outputs, labels):
"""
Compute the accuracy, given the outputs and labels for all tokens. Exclude PADding terms.
@@ -135,7 +139,7 @@
outputs = np.argmax(outputs, axis=1)

# compare outputs with labels and divide by number of tokens (excluding PADding tokens)
- return np.sum(outputs==labels)/float(np.sum(mask))
+ return np.sum(outputs == labels)/float(np.sum(mask))


# maintain all metrics required in this dictionary- these are used in the training and evaluation loops
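The masking logic in `loss_fn` and `accuracy` above is easy to verify by hand; here is a toy numpy check (illustrative values) showing how PAD positions, marked with label -1, are excluded from both the token count and the loss:

```python
import numpy as np

# log-softmax outputs for 3 tokens over 2 tags; the third token is PAD (label -1)
outputs = np.log([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
labels = np.array([0, 1, -1])

mask = (labels >= 0).astype(float)  # 1.0 for real tokens, 0.0 for PAD
labels = labels % outputs.shape[1]  # maps -1 to a valid index; masked out below anyway
num_tokens = int(np.sum(mask))      # 2, not 3

loss = -np.sum(outputs[range(outputs.shape[0]), labels] * mask) / num_tokens
print(loss)  # mean negative log-likelihood over the 2 real tokens only
```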
2 changes: 1 addition & 1 deletion pytorch/nlp/requirements.txt
@@ -1,5 +1,5 @@
numpy
Pillow
- torch>=0.3
+ torch>=1.2
tabulate
tqdm
79 changes: 46 additions & 33 deletions pytorch/nlp/train.py
@@ -16,8 +16,10 @@


parser = argparse.ArgumentParser()
- parser.add_argument('--data_dir', default='data/small', help="Directory containing the dataset")
- parser.add_argument('--model_dir', default='experiments/base_model', help="Directory containing params.json")
+ parser.add_argument('--data_dir', default='data/small',
+                     help="Directory containing the dataset")
+ parser.add_argument('--model_dir', default='experiments/base_model',
+                     help="Directory containing params.json")
parser.add_argument('--restore_file', default=None,
help="Optional, name of the file in --model_dir containing weights to reload before \
training") # 'best' or 'train'
@@ -42,9 +44,9 @@ def train(model, optimizer, loss_fn, data_iterator, metrics, params, num_steps):
# summary for current training loop and a running average object for loss
summ = []
loss_avg = utils.RunningAverage()

# Use tqdm for progress bar
t = trange(num_steps)
for i in t:
# fetch the next training batch
train_batch, labels_batch = next(data_iterator)
@@ -67,20 +69,22 @@ def train(model, optimizer, loss_fn, data_iterator, metrics, params, num_steps):
labels_batch = labels_batch.data.cpu().numpy()

# compute all metrics on this batch
- summary_batch = {metric:metrics[metric](output_batch, labels_batch)
+ summary_batch = {metric: metrics[metric](output_batch, labels_batch)
for metric in metrics}
- summary_batch['loss'] = loss.data[0]
+ summary_batch['loss'] = loss.item()
summ.append(summary_batch)

# update the average loss
- loss_avg.update(loss.data[0])
+ loss_avg.update(loss.item())
t.set_postfix(loss='{:05.3f}'.format(loss_avg()))

# compute mean of all metrics in summary
- metrics_mean = {metric:np.mean([x[metric] for x in summ]) for metric in summ[0]}
- metrics_string = " ; ".join("{}: {:05.3f}".format(k, v) for k, v in metrics_mean.items())
+ metrics_mean = {metric: np.mean([x[metric]
+                                  for x in summ]) for metric in summ[0]}
+ metrics_string = " ; ".join("{}: {:05.3f}".format(k, v)
+                             for k, v in metrics_mean.items())
logging.info("- Train metrics: " + metrics_string)


def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics, params, model_dir, restore_file=None):
"""Train the model and evaluate every epoch.
@@ -98,10 +102,11 @@ def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics,
"""
# reload weights from restore_file if specified
if restore_file is not None:
- restore_path = os.path.join(args.model_dir, args.restore_file + '.pth.tar')
+ restore_path = os.path.join(
+     args.model_dir, args.restore_file + '.pth.tar')
logging.info("Restoring parameters from {}".format(restore_path))
utils.load_checkpoint(restore_path, model, optimizer)

best_val_acc = 0.0

for epoch in range(params.num_epochs):
@@ -110,59 +115,67 @@ def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics,

# compute number of batches in one epoch (one full pass over the training set)
num_steps = (params.train_size + 1) // params.batch_size
- train_data_iterator = data_loader.data_iterator(train_data, params, shuffle=True)
- train(model, optimizer, loss_fn, train_data_iterator, metrics, params, num_steps)
+ train_data_iterator = data_loader.data_iterator(
+     train_data, params, shuffle=True)
+ train(model, optimizer, loss_fn, train_data_iterator,
+       metrics, params, num_steps)

# Evaluate for one epoch on validation set
num_steps = (params.val_size + 1) // params.batch_size
- val_data_iterator = data_loader.data_iterator(val_data, params, shuffle=False)
- val_metrics = evaluate(model, loss_fn, val_data_iterator, metrics, params, num_steps)
+ val_data_iterator = data_loader.data_iterator(
+     val_data, params, shuffle=False)
+ val_metrics = evaluate(
+     model, loss_fn, val_data_iterator, metrics, params, num_steps)

val_acc = val_metrics['accuracy']
is_best = val_acc >= best_val_acc

# Save weights
utils.save_checkpoint({'epoch': epoch + 1,
'state_dict': model.state_dict(),
- 'optim_dict' : optimizer.state_dict()},
- is_best=is_best,
- checkpoint=model_dir)
- # If best_eval, best_save_path
+ 'optim_dict': optimizer.state_dict()},
+ is_best=is_best,
+ checkpoint=model_dir)
+
+ # If best_eval, best_save_path
if is_best:
logging.info("- Found new best accuracy")
best_val_acc = val_acc

# Save best val metrics in a json file in the model directory
- best_json_path = os.path.join(model_dir, "metrics_val_best_weights.json")
+ best_json_path = os.path.join(
+     model_dir, "metrics_val_best_weights.json")
utils.save_dict_to_json(val_metrics, best_json_path)

# Save latest val metrics in a json file in the model directory
- last_json_path = os.path.join(model_dir, "metrics_val_last_weights.json")
+ last_json_path = os.path.join(
+     model_dir, "metrics_val_last_weights.json")
utils.save_dict_to_json(val_metrics, last_json_path)


if __name__ == '__main__':

# Load the parameters from json file
args = parser.parse_args()
json_path = os.path.join(args.model_dir, 'params.json')
- assert os.path.isfile(json_path), "No json configuration file found at {}".format(json_path)
+ assert os.path.isfile(
+     json_path), "No json configuration file found at {}".format(json_path)
params = utils.Params(json_path)

# use GPU if available
params.cuda = torch.cuda.is_available()

# Set the random seed for reproducible experiments
torch.manual_seed(230)
- if params.cuda: torch.cuda.manual_seed(230)
+ if params.cuda:
+     torch.cuda.manual_seed(230)

# Set the logger
utils.set_logger(os.path.join(args.model_dir, 'train.log'))

# Create the input data pipeline
logging.info("Loading the datasets...")

# load data
data_loader = DataLoader(args.data_dir, params)
data = data_loader.load_data(['train', 'val'], args.data_dir)
@@ -178,7 +191,7 @@ def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics,
# Define the model and optimizer
model = net.Net(params).cuda() if params.cuda else net.Net(params)
optimizer = optim.Adam(model.parameters(), lr=params.learning_rate)

# fetch loss function and metrics
loss_fn = net.loss_fn
metrics = net.metrics
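For reference, the checkpoint helpers called above (`utils.save_checkpoint`, `utils.load_checkpoint`) reduce to `torch.save`/`torch.load` plus a copy of the best weights; a minimal sketch, assuming hypothetical `last.pth.tar`/`best.pth.tar` file names rather than the repo's exact implementation:

```python
import os
import shutil
import torch

def save_checkpoint(state, is_best, checkpoint):
    """Save `state` to checkpoint/last.pth.tar; copy it to best.pth.tar if best."""
    os.makedirs(checkpoint, exist_ok=True)
    filepath = os.path.join(checkpoint, "last.pth.tar")
    torch.save(state, filepath)
    if is_best:
        shutil.copyfile(filepath, os.path.join(checkpoint, "best.pth.tar"))
```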
