Merge pull request #20 from josephch405/master

Version updates and reformatting

CS230 Deep Learning authored Oct 8, 2019
2 parents 159df10 + 192216e commit 96ac6fd
Showing 13 changed files with 222 additions and 137 deletions.
45 changes: 29 additions & 16 deletions pytorch/nlp/README.md
@@ -1,6 +1,6 @@
# Named Entity Recognition with PyTorch

- *Authors: Surag Nair, Guillaume Genthial and Olivier Moindrot*
+ _Authors: Surag Nair, Guillaume Genthial and Olivier Moindrot_

Take the time to read the [tutorials](https://cs230-stanford.github.io/project-starter-code.html).

@@ -31,75 +31,88 @@ B-PER O O B-LOC I-LOC

We provide a small subset of the Kaggle dataset (30 sentences) for testing in `data/small`, but you are encouraged to download the original version from the [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data) website.

- 1. __Download the dataset__ `ner_dataset.csv` on [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data) and save it under the `nlp/data/kaggle` directory. Make sure you download the simple version `ner_dataset.csv` and NOT the full version `ner.csv`.
+ 1. **Download the dataset** `ner_dataset.csv` on [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data) and save it under the `nlp/data/kaggle` directory. Make sure you download the simple version `ner_dataset.csv` and NOT the full version `ner.csv`.

- 2. __Build the dataset__ Run the following script
+ 2. **Build the dataset** Run the following script

```
python build_kaggle_dataset.py
```

It will extract the sentences and labels from the dataset, split them into train/val/test sets, and save them in a convenient format for our model (see the layout sketch after this list).

- *Debug* If you get some errors, check that you downloaded the right file and saved it in the right directory. If you have issues with encoding, try running the script with python 2.7.
+ _Debug_ If you get some errors, check that you downloaded the right file and saved it in the right directory. If you have issues with encoding, try running the script with python 2.7.

3. In the next section, replace `data/small` with `data/kaggle`
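If the build script follows the starter code's usual convention of one `sentences.txt`/`labels.txt` pair per split, the processed data should end up organized along these lines (a hypothetical listing, not guaranteed output):

```
data/kaggle/
    train/
        sentences.txt   # one tokenized sentence per line
        labels.txt      # one tag sequence per line, aligned with sentences.txt
    val/
        ...
    test/
        ...
```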


## Quickstart (~10 min)

- 1. __Build__ vocabularies and parameters for your dataset by running
+ 1. **Build** vocabularies and parameters for your dataset by running

```
python build_vocab.py --data_dir data/small
```

It will write vocabulary files `words.txt` and `tags.txt` containing the words and tags in the dataset. It will also save a `dataset_params.json` with some extra information.
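Vocabulary building amounts to counting the distinct tokens in the training files. A minimal sketch of the idea (not the actual `build_vocab.py`, and assuming whitespace-tokenized, one-sentence-per-line files):

```python
from collections import Counter

def build_vocab(txt_path):
    """Count distinct whitespace-separated tokens in a text file."""
    counter = Counter()
    with open(txt_path) as f:
        for line in f:
            counter.update(line.strip().split())
    return counter

# hypothetical path following the data/small layout
words = build_vocab("data/small/train/sentences.txt")
with open("words.txt", "w") as f:
    for token, _ in words.most_common():
        f.write(token + "\n")
```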

- 2. __Your first experiment__ We created a `base_model` directory for you under the `experiments` directory. It contains a file `params.json` which sets the hyperparameters for the experiment. It looks like
+ 2. **Your first experiment** We created a `base_model` directory for you under the `experiments` directory. It contains a file `params.json` which sets the hyperparameters for the experiment. It looks like

```json
{
"learning_rate": 1e-3,
"batch_size": 5,
"num_epochs": 2
"learning_rate": 1e-3,
"batch_size": 5,
"num_epochs": 2
}
```

For every new experiment, you will need to create a new directory under `experiments` with a `params.json` file.
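Loading those hyperparameters takes only a few lines; here is a minimal sketch of what a `Params` helper plausibly does (illustrative, not the repo's exact `utils.Params`):

```python
import json

class Params:
    """Load hyperparameters from a json file as object attributes."""
    def __init__(self, json_path):
        with open(json_path) as f:
            self.__dict__.update(json.load(f))

params = Params("experiments/base_model/params.json")
print(params.learning_rate, params.batch_size, params.num_epochs)
```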

- 3. __Train__ your experiment. Simply run
+ 3. **Train** your experiment. Simply run

```
python train.py --data_dir data/small --model_dir experiments/base_model
```

It will instantiate a model and train it on the training set following the hyperparameters specified in `params.json`. It will also evaluate some metrics on the development set.

- 4. __Your first hyperparameters search__ We created a new directory `learning_rate` in `experiments` for you. Now, run
+ 4. **Your first hyperparameters search** We created a new directory `learning_rate` in `experiments` for you. Now, run

```
python search_hyperparams.py --data_dir data/small --parent_dir experiments/learning_rate
```

It will train and evaluate a model with different values of learning rate defined in `search_hyperparams.py` and create a new directory for each experiment under `experiments/learning_rate/`.
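Under the hood, a search script like this typically just loops over values, writes a `params.json` per job, and launches `train.py`; a hedged sketch (the grid and directory names are illustrative, not the values in `search_hyperparams.py`):

```python
import json
import os
import subprocess
import sys

parent_dir = "experiments/learning_rate"
with open(os.path.join(parent_dir, "params.json")) as f:
    base_params = json.load(f)

for lr in [1e-4, 1e-3, 1e-2]:  # illustrative grid
    job_dir = os.path.join(parent_dir, "learning_rate_{}".format(lr))
    os.makedirs(job_dir, exist_ok=True)
    job_params = dict(base_params, learning_rate=lr)
    with open(os.path.join(job_dir, "params.json"), "w") as f:
        json.dump(job_params, f, indent=4)
    # launch one training run per hyperparameter setting
    subprocess.run([sys.executable, "train.py", "--data_dir", "data/small",
                    "--model_dir", job_dir], check=True)
```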

- 5. __Display the results__ of the hyperparameters search in a nice format
+ 5. **Display the results** of the hyperparameters search in a nice format

```
python synthesize_results.py --parent_dir experiments/learning_rate
```

- 6. __Evaluation on the test set__ Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set. Run
+ 6. **Evaluation on the test set** Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set. Run

```
python evaluate.py --data_dir data/small --model_dir experiments/base_model
```


## Guidelines for more advanced use

We recommend reading through `train.py` to get a high-level overview of the training loop steps (a condensed outline follows the list):

- loading the hyperparameters for the experiment (the `params.json`)
- loading the training and validation data
- creating the model, loss_fn and metrics
- training the model for a given number of epochs by calling `train_and_evaluate(...)`
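Condensed, the main block of `train.py` wires those steps together roughly as follows (an illustrative outline, not the verbatim source):

```python
import torch.optim as optim

import model.net as net
import utils
from model.data_loader import DataLoader

# load the hyperparameters, then the data
params = utils.Params("experiments/base_model/params.json")
data_loader = DataLoader("data/small", params)
data = data_loader.load_data(["train", "val"], "data/small")

# create the model, optimizer, loss function and metrics
model = net.Net(params)
optimizer = optim.Adam(model.parameters(), lr=params.learning_rate)

# train_and_evaluate is defined in train.py itself
train_and_evaluate(model, data["train"], data["val"], optimizer,
                   net.loss_fn, net.metrics, params, "experiments/base_model")
```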

You can then go through `model/data_loader.py` to understand the following steps (see the padding sketch after this list):

- loading the vocabularies from the `words.txt` and `tags.txt` files
- creating the sentences/labels datasets from the text files
- how the vocabulary is used to map tokens to their indices
- how the `data_iterator` creates a batch of data and labels and pads sentences
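The padding step is the part that trips people up most often; here is a standalone illustration of batching sentences of unequal length (a minimal sketch, not the repo's `data_iterator`):

```python
import torch

def pad_batch(sentences, pad_ind):
    """Pad a list of token-index lists to the length of the longest one."""
    batch_max_len = max(len(s) for s in sentences)
    # initialise every position to the PAD index, then copy real tokens in
    batch = [[pad_ind] * batch_max_len for _ in sentences]
    for i, s in enumerate(sentences):
        batch[i][:len(s)] = s
    return torch.LongTensor(batch)  # dim: batch_size x batch_max_len

# toy batch: token indices for two sentences of different lengths
print(pad_batch([[4, 2, 9], [7, 1]], pad_ind=0))
```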

Once you get the high-level idea, depending on your dataset, you might want to modify

- `model/model.py` to change the neural network, loss function and metrics
- `model/data_loader.py` to suit the data loader to your specific needs
- `train.py` for changing the optimizer
@@ -109,6 +122,6 @@ Once you get something working for your dataset, feel free to edit any part of t

## Resources

- - [PyTorch documentation](http://pytorch.org/docs/0.3.0/)
+ - [PyTorch documentation](http://pytorch.org/docs/1.2.0/)
- [Tutorials](http://pytorch.org/tutorials/)
- [PyTorch warm-up](https://github.com/jcjohnson/pytorch-examples)
2 changes: 1 addition & 1 deletion pytorch/nlp/evaluate.py
@@ -51,7 +51,7 @@ def evaluate(model, loss_fn, data_iterator, metrics, params, num_steps):
# compute all metrics on this batch
summary_batch = {metric: metrics[metric](output_batch, labels_batch)
for metric in metrics}
- summary_batch['loss'] = loss.data[0]
+ summary_batch['loss'] = loss.item()
summ.append(summary_batch)

# compute mean of all metrics in summary
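The `loss.data[0]` to `loss.item()` change seen throughout this commit tracks the PyTorch 0.4+ API: indexing a 0-dimensional tensor is no longer supported, and `.item()` is the documented way to extract a Python scalar. A quick illustration:

```python
import torch

loss = torch.tensor(0.25)  # a 0-dim tensor, like the value a loss function returns
print(loss.item())         # 0.25, the supported scalar accessor
# loss.data[0] raises "invalid index of a 0-dim tensor" on recent PyTorch
```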
22 changes: 13 additions & 9 deletions pytorch/nlp/model/net.py
@@ -38,11 +38,12 @@ def __init__(self, params):

# the LSTM takes as input the size of its input (embedding_dim), its hidden size
# for more details on how to use it, check out the documentation
- self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True)
+ self.lstm = nn.LSTM(params.embedding_dim,
+                     params.lstm_hidden_dim, batch_first=True)

# the fully connected layer transforms the output to give the final output layer
self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)

def forward(self, s):
"""
This function defines how we use the components of our network to operate on an input batch.
@@ -61,16 +62,19 @@ def forward(self, s):
"""
# -> batch_size x seq_len
# apply the embedding layer that maps each token to its embedding
- s = self.embedding(s)  # dim: batch_size x seq_len x embedding_dim
+ # dim: batch_size x seq_len x embedding_dim
+ s = self.embedding(s)

# run the LSTM along the sentences of length seq_len
- s, _ = self.lstm(s)  # dim: batch_size x seq_len x lstm_hidden_dim
+ # dim: batch_size x seq_len x lstm_hidden_dim
+ s, _ = self.lstm(s)

# make the Variable contiguous in memory (a PyTorch artefact)
s = s.contiguous()

# reshape the Variable so that each row contains one token
- s = s.view(-1, s.shape[2])  # dim: batch_size*seq_len x lstm_hidden_dim
+ # dim: batch_size*seq_len x lstm_hidden_dim
+ s = s.view(-1, s.shape[2])

# apply the fully connected layer and obtain the output (before softmax) for each token
s = self.fc(s) # dim: batch_size*seq_len x num_tags
@@ -107,12 +111,12 @@ def loss_fn(outputs, labels):
# number. This does not affect training, since we ignore the PADded tokens with the mask.
labels = labels % outputs.shape[1]

- num_tokens = int(torch.sum(mask).data[0])
+ num_tokens = int(torch.sum(mask))

# compute cross entropy loss for all tokens (except PADding tokens), by multiplying with mask.
return -torch.sum(outputs[range(outputs.shape[0]), labels]*mask)/num_tokens


def accuracy(outputs, labels):
"""
Compute the accuracy, given the outputs and labels for all tokens. Exclude PADding terms.
@@ -135,7 +139,7 @@
outputs = np.argmax(outputs, axis=1)

# compare outputs with labels and divide by number of tokens (excluding PADding tokens)
- return np.sum(outputs==labels)/float(np.sum(mask))
+ return np.sum(outputs == labels)/float(np.sum(mask))


# maintain all metrics required in this dictionary- these are used in the training and evaluation loops
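The masking logic in `loss_fn` and `accuracy` above is easy to verify by hand; here is a toy numpy check (illustrative values) showing how PAD positions, marked with label -1, are excluded from both the token count and the loss:

```python
import numpy as np

# log-softmax outputs for 3 tokens over 2 tags; the third token is PAD (label -1)
outputs = np.log([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
labels = np.array([0, 1, -1])

mask = (labels >= 0).astype(float)  # 1.0 for real tokens, 0.0 for PAD
labels = labels % outputs.shape[1]  # maps -1 to a valid index; masked out below anyway
num_tokens = int(np.sum(mask))      # 2, not 3

loss = -np.sum(outputs[range(outputs.shape[0]), labels] * mask) / num_tokens
print(loss)  # mean negative log-likelihood over the 2 real tokens only
```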
2 changes: 1 addition & 1 deletion pytorch/nlp/requirements.txt
@@ -1,5 +1,5 @@
numpy
Pillow
- torch>=0.3
+ torch>=1.2
tabulate
tqdm
79 changes: 46 additions & 33 deletions pytorch/nlp/train.py
@@ -16,8 +16,10 @@


parser = argparse.ArgumentParser()
- parser.add_argument('--data_dir', default='data/small', help="Directory containing the dataset")
- parser.add_argument('--model_dir', default='experiments/base_model', help="Directory containing params.json")
+ parser.add_argument('--data_dir', default='data/small',
+                     help="Directory containing the dataset")
+ parser.add_argument('--model_dir', default='experiments/base_model',
+                     help="Directory containing params.json")
parser.add_argument('--restore_file', default=None,
help="Optional, name of the file in --model_dir containing weights to reload before \
training") # 'best' or 'train'
@@ -42,9 +44,9 @@ def train(model, optimizer, loss_fn, data_iterator, metrics, params, num_steps):
# summary for current training loop and a running average object for loss
summ = []
loss_avg = utils.RunningAverage()

# Use tqdm for progress bar
t = trange(num_steps)
for i in t:
# fetch the next training batch
train_batch, labels_batch = next(data_iterator)
@@ -67,20 +69,22 @@ def train(model, optimizer, loss_fn, data_iterator, metrics, params, num_steps):
labels_batch = labels_batch.data.cpu().numpy()

# compute all metrics on this batch
- summary_batch = {metric:metrics[metric](output_batch, labels_batch)
+ summary_batch = {metric: metrics[metric](output_batch, labels_batch)
for metric in metrics}
- summary_batch['loss'] = loss.data[0]
+ summary_batch['loss'] = loss.item()
summ.append(summary_batch)

# update the average loss
- loss_avg.update(loss.data[0])
+ loss_avg.update(loss.item())
t.set_postfix(loss='{:05.3f}'.format(loss_avg()))

# compute mean of all metrics in summary
- metrics_mean = {metric:np.mean([x[metric] for x in summ]) for metric in summ[0]}
- metrics_string = " ; ".join("{}: {:05.3f}".format(k, v) for k, v in metrics_mean.items())
+ metrics_mean = {metric: np.mean([x[metric]
+                                  for x in summ]) for metric in summ[0]}
+ metrics_string = " ; ".join("{}: {:05.3f}".format(k, v)
+                             for k, v in metrics_mean.items())
logging.info("- Train metrics: " + metrics_string)


def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics, params, model_dir, restore_file=None):
"""Train the model and evaluate every epoch.
@@ -98,10 +102,11 @@ def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics,
"""
# reload weights from restore_file if specified
if restore_file is not None:
- restore_path = os.path.join(args.model_dir, args.restore_file + '.pth.tar')
+ restore_path = os.path.join(
+     args.model_dir, args.restore_file + '.pth.tar')
logging.info("Restoring parameters from {}".format(restore_path))
utils.load_checkpoint(restore_path, model, optimizer)

best_val_acc = 0.0

for epoch in range(params.num_epochs):
@@ -110,59 +115,67 @@ def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics,

# compute number of batches in one epoch (one full pass over the training set)
num_steps = (params.train_size + 1) // params.batch_size
- train_data_iterator = data_loader.data_iterator(train_data, params, shuffle=True)
- train(model, optimizer, loss_fn, train_data_iterator, metrics, params, num_steps)
+ train_data_iterator = data_loader.data_iterator(
+     train_data, params, shuffle=True)
+ train(model, optimizer, loss_fn, train_data_iterator,
+       metrics, params, num_steps)

# Evaluate for one epoch on validation set
num_steps = (params.val_size + 1) // params.batch_size
- val_data_iterator = data_loader.data_iterator(val_data, params, shuffle=False)
- val_metrics = evaluate(model, loss_fn, val_data_iterator, metrics, params, num_steps)
+ val_data_iterator = data_loader.data_iterator(
+     val_data, params, shuffle=False)
+ val_metrics = evaluate(
+     model, loss_fn, val_data_iterator, metrics, params, num_steps)

val_acc = val_metrics['accuracy']
is_best = val_acc >= best_val_acc

# Save weights
utils.save_checkpoint({'epoch': epoch + 1,
'state_dict': model.state_dict(),
- 'optim_dict' : optimizer.state_dict()},
- is_best=is_best,
- checkpoint=model_dir)
- # If best_eval, best_save_path
+ 'optim_dict': optimizer.state_dict()},
+ is_best=is_best,
+ checkpoint=model_dir)
+
+ # If best_eval, best_save_path
if is_best:
logging.info("- Found new best accuracy")
best_val_acc = val_acc

# Save best val metrics in a json file in the model directory
- best_json_path = os.path.join(model_dir, "metrics_val_best_weights.json")
+ best_json_path = os.path.join(
+     model_dir, "metrics_val_best_weights.json")
utils.save_dict_to_json(val_metrics, best_json_path)

# Save latest val metrics in a json file in the model directory
- last_json_path = os.path.join(model_dir, "metrics_val_last_weights.json")
+ last_json_path = os.path.join(
+     model_dir, "metrics_val_last_weights.json")
utils.save_dict_to_json(val_metrics, last_json_path)


if __name__ == '__main__':

# Load the parameters from json file
args = parser.parse_args()
json_path = os.path.join(args.model_dir, 'params.json')
- assert os.path.isfile(json_path), "No json configuration file found at {}".format(json_path)
+ assert os.path.isfile(
+     json_path), "No json configuration file found at {}".format(json_path)
params = utils.Params(json_path)

# use GPU if available
params.cuda = torch.cuda.is_available()

# Set the random seed for reproducible experiments
torch.manual_seed(230)
- if params.cuda: torch.cuda.manual_seed(230)
+ if params.cuda:
+     torch.cuda.manual_seed(230)

# Set the logger
utils.set_logger(os.path.join(args.model_dir, 'train.log'))

# Create the input data pipeline
logging.info("Loading the datasets...")

# load data
data_loader = DataLoader(args.data_dir, params)
data = data_loader.load_data(['train', 'val'], args.data_dir)
@@ -178,7 +191,7 @@ def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics,
# Define the model and optimizer
model = net.Net(params).cuda() if params.cuda else net.Net(params)
optimizer = optim.Adam(model.parameters(), lr=params.learning_rate)

# fetch loss function and metrics
loss_fn = net.loss_fn
metrics = net.metrics
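For reference, the checkpoint helpers called above (`utils.save_checkpoint`, `utils.load_checkpoint`) reduce to `torch.save`/`torch.load` plus a copy of the best weights; a minimal sketch, assuming hypothetical `last.pth.tar`/`best.pth.tar` file names rather than the repo's exact implementation:

```python
import os
import shutil
import torch

def save_checkpoint(state, is_best, checkpoint):
    """Save `state` to checkpoint/last.pth.tar; copy it to best.pth.tar if best."""
    os.makedirs(checkpoint, exist_ok=True)
    filepath = os.path.join(checkpoint, "last.pth.tar")
    torch.save(state, filepath)
    if is_best:
        shutil.copyfile(filepath, os.path.join(checkpoint, "best.pth.tar"))
```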
