Original model: https://github.com/xhw205/Efficient-GlobalPointer-torch

Based on GlobalPointer (originally a Keras implementation), the key of the solution is to represent and score entity spans as token pairs. The base model here comes from a PyTorch re-implementation (a minimal scoring sketch follows).
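Below is a minimal sketch of the Efficient GlobalPointer scoring idea: a shared query/key projection scores every (start, end) token pair, plus a per-type bias. RoPE and the lower-triangular mask are omitted, and the class and variable names are illustrative, not this repo's API:

```python
import torch
import torch.nn as nn

class EfficientGlobalPointerSketch(nn.Module):
    def __init__(self, hidden_size: int, num_types: int, head_size: int = 64):
        super().__init__()
        self.head_size = head_size
        # One shared query/key projection for all entity types.
        self.qk_proj = nn.Linear(hidden_size, head_size * 2)
        # Per-type start/end bias computed from the same q/k features.
        self.type_proj = nn.Linear(head_size * 2, num_types * 2)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_size) from a BERT-style encoder
        qk = self.qk_proj(hidden)                                  # (b, L, 2h)
        q, k = qk[..., : self.head_size], qk[..., self.head_size :]
        # Pair score: dot product of the start token's q and the end token's k.
        span = torch.einsum("bmh,bnh->bmn", q, k) / self.head_size ** 0.5
        b, L = hidden.size(0), hidden.size(1)
        bias = self.type_proj(qk).view(b, L, -1, 2)                # (b, L, t, 2)
        start_bias = bias[..., 0].transpose(1, 2)                  # (b, t, L)
        end_bias = bias[..., 1].transpose(1, 2)                    # (b, t, L)
        # logits[b, t, i, j] = span(i, j) + start_bias(t, i) + end_bias(t, j)
        return span[:, None] + start_bias[..., None] + end_bias[:, :, None]
```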
- 2022/04/23 Initial build
- 2022/06/23 Added the boundary-smoothing function
python==3.6
torch==1.8.1
transformers==4.4.1
I use a server for training; its setup is:
python==3.7
torch==1.8.1
transformers==4.10.0
For more details, please refer to `server_requirements.txt`.
EfficientGlobalPointer4KeyExtraction
├── ensemble.sh # model ensemble script
├── GP_runner.sh # finetune script
├── README.md
├── server_requirements.txt # server dependencies
├── result.txt
├── err.log
├── checkpoints # model save dir
│ ├── bad_cases_GP.txt # badcase output
│ └── experiments_notes.md # experiment record
├── datasets
│ └── split_data
│ ├── biaffine_labels.txt # Entity tag file (entity type, without BIO)
│ ├── dev.json
│ ├── test.json
│ ├── train.json
│ │ ├── features # handcrafted feature dir
│ │ │ ├── dic # keyword dict
│ │ │ │ ├── all_dic.json # all the keywords
│ │ │ │ ├── get_train_dic.py
│ │ │ │ ├── thu_caijing_dic.json # keyword dict from Tsinghua University
│ │ │ │ └── train_dic.json
│ │ └── word_feature
│ │ ├── dev_flag_features.json
│ │ ├── dev_word_emb_features.json
│ │ ├── dev_word_features.json
│ │ ├── flag2id.json
│ │ └── ...
│ ├── get_mlm_data.py
│ ├── mlm # pretrain the PLM
│ │ ├── mlm_dev.txt
│ │ └── mlm_train.txt
│ └── enhanced_train.json # the augmented samples file
├── enhancement
│ └── replace.py # replace keywords
├── mlm
│ ├── pretrain.sh
│ └── run_mlm.py
├── models
│ ├── GlobalPointer.py
│ ├── __init__.py
│ └── __pycache__
│ ├── GlobalPointer.cpython-36.pyc
│ └── __init__.cpython-36.pyc
├── src
│ ├── __pycache__
│ │ └── predict_CME.cpython-36.pyc
│ ├── ensemble.py
│ ├── predict_CME.py
│ └── train_CME.py
├── utils
│ ├── __init__.py
│ ├── __pycache__
│ │ ├── __init__.cpython-36.pyc
│ │ ├── bert_optimization.cpython-36.pyc
│ │ ├── data_loader.cpython-36.pyc
│ │ ├── finetuning_argparse.cpython-36.pyc
│ │ ├── logger.cpython-36.pyc
│ │ └── ths_data_utills.cpython-36.pyc
│ ├── bert_optimization.py
│ ├── data_loader.py
│ ├── finetuning_argparse.py # args setting file
│ ├── logger.py
│ └── ths_data_utills.py
└── word2vec
├── data
│ └── word2vec_dic.json # w2v location
├── gensim_ttl.py
└── save_w2v_features.py
- Hierarchical (layer-wise) learning rate
- Output threshold tuning
- Add an LSTM layer
- Add three handcrafted features
- Continued pre-training (MLM)
- R-Drop
- FGM: adversarial training (see the sketch after this list)
- Data augmentation: keyword replacement
- Model ensemble
- SWA
- Boundary smoothing
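FGM (Fast Gradient Method) perturbs the embedding matrix along the gradient direction and backpropagates a second, adversarial loss. A minimal sketch of the common recipe, assuming the encoder's embedding parameters contain `word_embeddings` in their names (not copied from this repo's code):

```python
import torch

class FGM:
    def __init__(self, model, emb_name: str = "word_embeddings", epsilon: float = 1.0):
        self.model = model
        self.emb_name = emb_name
        self.epsilon = epsilon
        self.backup = {}

    def attack(self):
        # Perturb embedding weights along the normalized gradient direction.
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        # Undo the perturbation after the adversarial backward pass.
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical training-loop usage:
#   loss.backward()          # normal gradients
#   fgm.attack()             # add perturbation to the embeddings
#   loss_adv = compute_loss(model, batch)
#   loss_adv.backward()      # accumulate adversarial gradients
#   fgm.restore()            # remove the perturbation
#   optimizer.step(); optimizer.zero_grad()
```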
The data files are JSON lists of `[text, [keywords]]` pairs, e.g.:

```json
[
    [
        "和成都海光微相关的上市公司有哪些",
        [
            "成都海光微",
            "上市公司"
        ]
    ],
    [
        "股价异常波动停牌的股票",
        [
            "股价异常波动",
            "股票"
        ]
    ],
    ...
]
```
- Download the PyTorch pre-trained model and pass its path via the `--bert_model_path` argument.
- To start training, refer to `GP_runner.sh` for example scripts and to `finetuning_argparse.py` for the tunable arguments.
- `in_dic`: co-occurrence feature; the subscript is set to 1 if a labeled keyword appears in the input text.
- `w2v_emb`: splice `word_emb` onto `token_emb`.
- `flag_id`: add part-of-speech one-hot features.
Prerequisite for `in_dic`: generate the keyword list `all_dic.json` with `datasets/split_data/features/dic/get_train_dic.py`; once it exists, the feature can be computed (a minimal sketch follows).
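A minimal sketch of the `in_dic` feature, assuming character-level inputs and a JSON keyword list such as `all_dic.json`; the helper name is hypothetical:

```python
import json

def build_in_dic_feature(text: str, keywords: list) -> list:
    """0/1 vector over characters: 1 where a dictionary keyword occurs."""
    feature = [0] * len(text)
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            for i in range(start, start + len(kw)):
                feature[i] = 1                       # position covered by a keyword
            start = text.find(kw, start + 1)
    return feature

# keywords = json.load(open("datasets/split_data/features/dic/all_dic.json"))
# build_in_dic_feature("股价异常波动停牌的股票", ["股价异常波动", "股票"])
# -> [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
```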
Prerequisites for `w2v_emb`:
- Train w2v features yourself or download pre-trained ones, process them into the dictionary form `{ token: emb_list }`, and save them to `word2vec/data/word2vec_dic.json`.
- Use `word2vec/save_w2v_features.py` to save the w2v features of each word to `datasets/split_data/features/word_feature/..._word_emb_features.json`.
- Add the corresponding args and run (a splicing sketch follows this list).
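A minimal sketch of splicing the w2v vectors onto token embeddings, assuming `word2vec_dic.json` maps token → embedding list and that all vectors share one dimension; the function names are hypothetical:

```python
import json
import torch

def load_w2v(path: str = "word2vec/data/word2vec_dic.json") -> dict:
    with open(path, encoding="utf-8") as f:
        return {tok: torch.tensor(vec) for tok, vec in json.load(f).items()}

def splice_w2v(token_emb: torch.Tensor, tokens: list, w2v: dict, dim: int = 100) -> torch.Tensor:
    # token_emb: (seq_len, hidden); zero vector for out-of-vocabulary tokens
    word_emb = torch.stack([w2v.get(t, torch.zeros(dim)) for t in tokens])
    return torch.cat([token_emb, word_emb], dim=-1)   # (seq_len, hidden + dim)
```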
Prerequisites for `flag_id`: the same as for `in_dic` and `w2v_emb` (a POS-feature sketch follows).
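A minimal sketch of the `flag_id` feature, assuming jieba as the POS tagger and a `flag2id.json` mapping from POS tags to ids (both are assumptions; the helper name is hypothetical):

```python
import jieba.posseg as pseg

def build_flag_ids(text: str, flag2id: dict) -> list:
    """Assign each character the id of the POS tag of the word covering it."""
    flag_ids = []
    for word, flag in pseg.cut(text):
        flag_ids.extend([flag2id.get(flag, 0)] * len(word))   # 0 = unknown tag
    return flag_ids
```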
- Download or build a keyword list; here the THU-caijing corpus is used, placed at `datasets/split_data/features/dic/thu_caijing_dic.json`.
- Run `enhancement/replace.py` to get the augmented samples in `datasets/split_data/enhanced_train.json` (a replacement sketch follows this list).
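A minimal sketch of the replacement idea, in the spirit of `enhancement/replace.py` (the actual script may differ; the function name is hypothetical):

```python
import json
import random

def replace_keywords(sample: list, dic: list) -> list:
    """Swap each labeled keyword in a [text, [keywords]] pair for a random one."""
    text, keywords = sample
    new_keywords = []
    for kw in keywords:
        new_kw = random.choice(dic)
        text = text.replace(kw, new_kw)   # substitute the keyword in the text
        new_keywords.append(new_kw)
    return [text, new_keywords]

# dic = list(json.load(open("datasets/split_data/features/dic/thu_caijing_dic.json")))
# replace_keywords(["股价异常波动停牌的股票", ["股价异常波动", "股票"]], dic)
```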
Add `self.get_boundary_smoothing()` in `data_loader.py` to generate soft labels. Compared with the reference implementation, the following changes are made (see the sketch after this list):

- Adjust the dimension order to match this project.
- `[CLS]` and `[SEP]` must never fall inside a soft label.
- Allow `start index == end index`, i.e., a single token can be an entity.
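A minimal sketch of boundary smoothing under the three rules above; `get_boundary_smoothing()` in the repo may differ in detail:

```python
import torch

def boundary_smooth(spans, seq_len, num_types, eps=0.1, D=1):
    """spans: list of (type_id, start, end) gold entities over token indices.
    Returns a soft target of shape (num_types, seq_len, seq_len)."""
    target = torch.zeros(num_types, seq_len, seq_len)
    for t, s, e in spans:
        # Neighbouring spans: start/end shifted by at most D. Positions 0 and
        # seq_len - 1 ([CLS]/[SEP]) never receive mass; start == end is allowed.
        neigh = [
            (s + ds, e + de)
            for ds in range(-D, D + 1)
            for de in range(-D, D + 1)
            if (ds, de) != (0, 0) and 1 <= s + ds <= e + de < seq_len - 1
        ]
        target[t, s, e] += 1.0 - (eps if neigh else 0.0)
        for ns, ne in neigh:
            target[t, ns, ne] += eps / len(neigh)   # spread eps over neighbours
    return target
```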
In `train_CME.py`, adjust the loss computation in `multilabel_categorical_crossentropy()` to work with the soft labels (the standard hard-label form is sketched below).
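For reference, the hard-label form of the loss as popularized with GlobalPointer; this repo's soft-label adjustment builds on it:

```python
import torch

def multilabel_categorical_crossentropy(y_true, y_pred):
    # y_true: 0/1 targets, y_pred: raw logits, same shape (..., num_spans)
    y_pred = (1 - 2 * y_true) * y_pred            # flip the sign of positive logits
    y_pred_neg = y_pred - y_true * 1e12           # mask positives out of the neg term
    y_pred_pos = y_pred - (1 - y_true) * 1e12     # mask negatives out of the pos term
    zeros = torch.zeros_like(y_pred[..., :1])     # fixed threshold logit at 0
    neg_loss = torch.logsumexp(torch.cat([y_pred_neg, zeros], dim=-1), dim=-1)
    pos_loss = torch.logsumexp(torch.cat([y_pred_pos, zeros], dim=-1), dim=-1)
    return (neg_loss + pos_loss).mean()
```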
The `checkpoints` argument in the ensemble script is a space-separated list of checkpoint paths (a logit-averaging sketch follows).
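A minimal logit-averaging sketch in the spirit of `src/ensemble.py` (the actual script may differ; assumes the model returns raw span logits):

```python
import torch

def ensemble_logits(model, checkpoint_arg: str, batch: dict) -> torch.Tensor:
    paths = checkpoint_arg.split()                # "a.pt b.pt ..." -> list of paths
    logits = None
    for path in paths:
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        with torch.no_grad():
            out = model(**batch)                  # raw span logits for this checkpoint
        logits = out if logits is None else logits + out
    return logits / len(paths)                    # average over checkpoints
```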