
BERT_Pytorch_fastNLP

A PyTorch & fastNLP implementation of Google AI's BERT model.

  • Stable Version: The bert_pytorch folder contains the stable version of BERT, where we organized the code based on Pytorch-pretrained-BERT within the same code framework as fastNLP.
  • Developing Version: The bert_fastNLP folder contains our developing version of BERT, where we implemented the BERT model on fastNLP. The code is concise, and a conversion script gives access to pre-trained parameters for these implementations. In this version, we realized three specific BERT models for different tasks.

Environment:

python >= 3.5

pytorch == 1.0

Dataset:

GLUE Datasets

The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks. Most of the GLUE datasets have already existed for a number of years, but the purpose of GLUE is to:

  • Distribute these datasets with canonical Train, Dev and Test splits.
  • Set up an evaluation server to mitigate issues with evaluation inconsistencies and Test set overfitting.

MRPC: Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources with human annotations for whether the sentences in the pair are semantically equivalent.

CoLA: The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not.

SWAG Datasets

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded common-sense inference.

SQuAD v1.1 Datasets

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs. Given a question and a paragraph from Wikipedia containing the answer, the task is to predict the answer text span in the paragraph.

BERT-PyTorch

This version is based on Pytorch-pretrained-BERT, but we organized the code in the same framework as fastNLP.

Quick Use:

  1. Download the GLUE dataset to tasks/SequenceClassification/

  2. Download the pre-trained parameters of BERT (see "How to Get Pre-trained Parameters" below)

  3. Run this command:

    export GLUE_DIR=tasks/SequenceClassification/glue_data
    python run_classifier.py \
      --task_name MRPC \
      --do_train 1 \
      --do_eval 1 \
      --do_lower_case \
      --data_dir $GLUE_DIR/MRPC/ \
      --bert_model pretrained/bert-base-uncased \
      --max_seq_length 128 \
      --train_batch_size 32 \
      --learning_rate 2e-5 \
      --num_train_epochs 3.0 \
      --output_dir tasks/SequenceClassification/mrpc_output

How to Get Pre-trained Parameters:

Parameters from Pytorch-pretrained-BERT:

| MODEL | LINK |
| --- | --- |
| bert-base-uncased | https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz |
| bert-large-uncased | https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz |
| bert-base-cased | https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz |
| bert-large-cased | https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz |
| bert-base-multilingual-uncased | https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz |
| bert-base-chinese | https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz |
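
As an example, the following minimal Python sketch downloads and unpacks one of the checkpoints above (here bert-base-uncased; the target directory matches the --bert_model path used in the Quick Use command, and the archive is assumed to contain bert_config.json and pytorch_model.bin at its root):

    import os
    import tarfile
    import urllib.request

    # URL taken from the table above; the target path matches the Quick Use command
    URL = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz"
    TARGET_DIR = "pretrained/bert-base-uncased"

    os.makedirs(TARGET_DIR, exist_ok=True)
    archive_path, _ = urllib.request.urlretrieve(URL)    # download to a temporary file
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(TARGET_DIR)                       # unpack the checkpoint files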

BERT-fastNLP

Quick Use:

  1. Download the GLUE dataset to tasks/SequenceClassification/

  2. Download the pre-trained parameters of BERT

  3. Convert these parameters into our format in converted/ (see "How to Convert Pre-trained Parameters" below)

  4. Run this command:

    export GLUE_DIR=../bert_pytorch/tasks/SequenceClassification/glue_data
    python run_classifier_fastNLP.py \
      --task_name MRPC \
      --do_train 1 \
      --do_eval 1 \
      --do_lower_case \
      --data_dir $GLUE_DIR/MRPC/ \
      --bert_model pretrained/bert-base-uncased \
      --max_seq_length 128 \
      --train_batch_size 32 \
      --learning_rate 2e-5 \
      --num_train_epochs 3.0 \
      --output_dir tasks/SequenceClassification/mrpc_output

How to Convert Pre-trained Parameters:

  1. Use our converted/convert.py to convert the parameters in bert_pytorch for our model implementation. For example, if we want to convert the BERT-LARGE pytorch_model.bin, open this script and set:

    ORGINAL_PATH = "../../bert_pytorch/pretrained/bert-large-uncased/pytorch_model.bin"
    OUTPUT_PATH = "large-uncased/"
    LAYERS = 24
  2. For BERT-BASE, add bert_config.json as:

    { "hidden": 768,
      "n_layers": 12,
      "attn_heads": 12,
      "dropout": 0.1  }

    For BERT-LARGE, add bert_config.json as:

    { "hidden": 1024,
      "n_layers": 24,
      "attn_heads": 16,
      "dropout": 0.1  }
  3. Copy the vocab.txt from the original folder to this folder.
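
After these steps, the converted folder is assumed to look roughly like the following (file names are taken from the snippets above; the exact layout may differ):

    converted/
      large-uncased/          # OUTPUT_PATH set in convert.py (base-uncased is analogous)
        pytorch_model.bin     # converted parameters
        bert_config.json      # added by hand as shown above
        vocab.txt             # copied from the original folder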

How to Use fastNLP in BERT Training:

Taking run_classifier_fastNLP.py as an example, we fine-tune BERT for classification on the MRPC dataset.

  1. Load the dataset with fastNLP:

    from preprocessing.sequence_classification import load_dataset
    
    ###### fastNLP.DataSet loading ######
    train_data, dev_data = load_dataset(args)

    where load_dataset returns the training data and the development data as fastNLP.DataSet objects; you can find the details in preprocessing/sequence_classification:

    # training dataset
    train_features = convert_examples_to_features(
        train_examples, label_list, args.max_seq_length, tokenizer)
    train_data = DataSet(
        {
            "x": [f.input_ids for f in train_features],
            "segment_info": [f.segment_ids for f in train_features],
            "mask": [f.input_mask for f in train_features],
            "target": [f.label_id for f in train_features]
        }
    )
    train_data.set_input('x', 'segment_info', 'mask')
    train_data.set_target('target')
  2. Build the BERT-encoder model for different tasks. These task-specific models are defined in bert.py; so far we have implemented four models:

    class BertMLM(backbone.Bert):
        """
        BERT Mask Language Model: Bert based model for novel task of mask language model.
        """
        
    class BertMC(backbone.Bert):
        """
        BERT Multiple Choice Model: Bert based classification model for multiple choice
        """
        
    class BertQA(backbone.Bert):
        """
        BERT Question Answering Model: Bert based model for question answering
        """
        
    class BertSC(backbone.Bert):
        """
        BERT Sequence Classification Model: Bert based classification model for sequence
        """

    In the main() function, we can build our model as:

    from bert import BertMLM, BertSC, BertQA, BertMC
    
    model = BertSC(args.vocab_size, num_labels=args.num_labels)

    and load the converted pre-trained parameters:

    MODEL_NAME = "pytorch_model.bin"
    args.bert_dir = "converted/base-uncased"
    
    model.load(os.path.join(args.bert_dir, MODEL_NAME))
  3. Build your optimizer, where we reuse BertAdam (with warmup):

    from optimization import BertAdam
    
    ###### optimizer initializing ######
    optimizer = BertAdam(
        optimizer_grouped_parameters,
        lr=args.learning_rate,
        warmup=args.warmup_proportion,
        t_total=t_total
    )
  4. Use fastNLP.Trainer to fine-tune the specific BERT model:

    from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric
    
    ###### fastNLP.Trainer initializing ######
    trainer = Trainer(model=model,
                      train_data=train_data,
                      dev_data=dev_data,
                      loss=CrossEntropyLoss(pred="pred", target="target"),
                      metrics=AccuracyMetric(),
                      print_every=1,
                      optimizer=optimizer,
                      batch_size=args.train_batch_size,
                      n_epochs=args.num_train_epochs)
    
    # train our model
    trainer.train()
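
After training, the development set can be evaluated in the same style. The sketch below assumes that fastNLP's Tester takes data, model, and metrics arguments analogous to the Trainer above:

    from fastNLP import Tester, AccuracyMetric

    ###### fastNLP.Tester initializing ######
    tester = Tester(data=dev_data,
                    model=model,
                    metrics=AccuracyMetric(),
                    batch_size=args.train_batch_size)

    # evaluate our fine-tuned model on the dev set
    tester.test()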

NOTICE: Due to the API of fastNLP, some training tricks are difficult to implement directly with fastNLP.Trainer. In this project, to reproduce the training and evaluation protocol, we kept the original training code for the SWAG and SQuAD v1.1 tasks in run_swag_fastNLP.py and run_squad_fastNLP.py, for these reasons:

  • For SWAG, fastNLP.Batch raises errors when building batches, which is related to the shape of the training data.
  • In SWAG training, we set args.gradient_accumulation_steps = 4, and it does not seem easy to realize gradient accumulation with fastNLP.Trainer.
  • For SQuAD, we set the loss function to (CE(start, start_) + CE(end, end_)) / 2. In fastNLP, it might be hard to use multiple loss functions in one training epoch.

Though we keep the original training code for these two tasks, we replace the original BERT model with our version. Please check the details in run_swag_fastNLP.py and run_squad_fastNLP.py.
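
For reference, both tricks are straightforward in plain PyTorch; the sketch below is illustrative only (the function and variable names are placeholders, not the actual names used in run_squad_fastNLP.py or run_swag_fastNLP.py):

    import torch.nn.functional as F

    def squad_loss(pred_start, pred_end, start_positions, end_positions):
        # SQuAD objective: average of the start- and end-position cross-entropy
        # losses, i.e. (CE(start, start_) + CE(end, end_)) / 2
        return (F.cross_entropy(pred_start, start_positions) +
                F.cross_entropy(pred_end, end_positions)) / 2

    def accumulate_and_step(loss, step, optimizer, accumulation_steps=4):
        # Manual gradient accumulation: scale the loss, backpropagate every batch,
        # but only update the weights every `accumulation_steps` mini-batches
        # (args.gradient_accumulation_steps = 4 for SWAG).
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()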

The Implementation of BERT

  1. We re-implemented the multi-head attention and transformer classes based on the Pytorch-pretrained-BERT, BERT-pytorch and Google-bert projects.

    1. In fastNLP.module.aggregator.attention, our multi-head attention version is shown below. It is worth noting that all of the implementations above concatenate the attention heads at the weight level (a single large projection covering all heads), which is more user-friendly and efficient. Therefore we do not use the existing basic Attention class in fastNLP to implement MultiHeadAtte.

      class MultiHeadAtte(nn.Module):
          def __init__(self, input_size, output_size, hidden_size, num_atte, dropout):
              super(MultiHeadAtte, self).__init__()
              self.num_attention_heads = num_atte
              self.attention_head_size = int(hidden_size / self.num_attention_heads)
              self.all_head_size = self.num_attention_heads * self.attention_head_size
      
              self.query = nn.Linear(hidden_size, self.all_head_size)
              self.key = nn.Linear(hidden_size, self.all_head_size)
              self.value = nn.Linear(hidden_size, self.all_head_size)
      
              self.dropout = nn.Dropout(dropout)
      
              self.dense = nn.Linear(hidden_size, hidden_size)
              self.LayerNorm = LayerNormalization(hidden_size, eps=1e-12)
    2. In fastNLP.module.encoder.transformer, we implemented our TransformerEncoder based on SubLayer with MultiHeadAtte. Note that:

      class TransformerEncoder(nn.Module):
         def __init__(self, num_layers, **kargs):
             super(TransformerEncoder, self).__init__()
             self.layers = nn.ModuleList([self.SubLayer(**kargs) for _ in range(num_layers)])

    For self.layers, we use nn.ModuleList instead of nn.Sequential because in some tasks the outputs of all layers are valuable. In those cases, we set the flag all_output=True.

  2. We implemented the Bert class in backbone.py, where we regard Bert as a backbone model (e.g. ResNet50 in computer vision). The backbone model is implemented as:

    class Bert(nn.Module):
        """
        BERT model : Bidirectional Encoder Representations from Transformers.
        """
        def __init__(self, vocab_size, hidden=768, n_layers=12, attn_heads=12, dropout=0.1):
            super().__init__()
            self.hidden = hidden
            self.n_layers = n_layers
            self.attn_heads = attn_heads
            # paper noted they used 4*hidden_size for ff_network_hidden_size
            self.feed_forward_hidden = hidden * 4
            # embedding for BERT, sum of positional, segment, token embeddings
            self.embedding = BERTEmbedding(vocab_size=vocab_size, embed_size=hidden, dropout=dropout)
            # multi-layers transformer blocks, deep network
            self.transformer = Transformer(
                num_layers=n_layers, 
                num_atte=attn_heads,
                input_size=hidden,
                intermediate_size=self.feed_forward_hidden,
                key_size=hidden,
                output_size=hidden,
                activate=GeLU,
                dropout=dropout,
            )
            # Pooling layer
            self.pooler = nn.Linear(hidden, hidden)
            self.activation = nn.Tanh()
  3. Task-specific BERT models are all defined in bert.py; we inherit from backbone.Bert and simply add the decoder part. Besides this, it is easy to load pre-trained parameters with model.load(), which is implemented in backbone.Bert. Take BertQA as an example, where the BERT encoder part is handled by self.bert_forward (a short usage sketch follows the code):

    class BertQA(backbone.Bert):
        """
        BERT Question Answering Model: Bert based classification model for question answering
        """
        def __init__(self, vocab_size, hidden=768, n_layers=12, attn_heads=12, dropout=0.1):
            """
            :param vocab_size: vocab_size of total words
            :param hidden: BERT model hidden size
            :param n_layers: numbers of Transformer blocks(layers)
            :param attn_heads: number of attention heads
            :param dropout: dropout rate
            """
            super(BertQA, self).__init__(vocab_size, hidden, n_layers, attn_heads, dropout)
            self.qa_classifier = nn.Linear(hidden, 2)
    
        def forward(self, x, segment_info=None, mask=None):
            output_layer, _ = self.bert_forward(x, segment_info, mask=mask, all_output=False)
            start, end = self.qa_classifier(output_layer).split(1, dim=-1)
            return {'pred_start': start.squeeze(-1), 'pred_end': end.squeeze(-1)}
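
    As a usage sketch (the vocabulary size and paths below are placeholders; bert_dir is assumed to hold a converted BERT-BASE checkpoint as described earlier):

    import os
    import torch
    from bert import BertQA

    vocab_size = 30522                                        # hypothetical vocabulary size
    bert_dir = "converted/base-uncased"                       # converted checkpoint directory

    model = BertQA(vocab_size)                                # defaults: hidden=768, n_layers=12, attn_heads=12
    model.load(os.path.join(bert_dir, "pytorch_model.bin"))   # load converted pre-trained parameters

    x = torch.randint(0, vocab_size, (1, 128))                # a dummy batch of token ids
    out = model(x)                                            # {'pred_start': ..., 'pred_end': ...}
    start = out['pred_start'].argmax(dim=-1)                  # predicted start index of the answer span
    end = out['pred_end'].argmax(dim=-1)                      # predicted end index of the answer span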

Contributors:

Shihan Ran (RshCaroline)

Zhankui He (AaronHeee)

Related Repo: