Original model: https://github.com/xhw205/Efficient-GlobalPointer-torch

Based on GlobalPointer (originally a Keras implementation), the key of the solution is to represent and score entity spans as token pairs. The base model here comes from a PyTorch re-implementation (a minimal scoring sketch follows).
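Below is a minimal sketch of the Efficient GlobalPointer scoring idea: a shared query/key projection scores every (start, end) token pair, plus a per-type bias. RoPE and the lower-triangular mask are omitted, and the class and variable names are illustrative, not this repo's API:

```python
import torch
import torch.nn as nn

class EfficientGlobalPointerSketch(nn.Module):
    def __init__(self, hidden_size: int, num_types: int, head_size: int = 64):
        super().__init__()
        self.head_size = head_size
        # One shared query/key projection for all entity types.
        self.qk_proj = nn.Linear(hidden_size, head_size * 2)
        # Per-type start/end bias computed from the same q/k features.
        self.type_proj = nn.Linear(head_size * 2, num_types * 2)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_size) from a BERT-style encoder
        qk = self.qk_proj(hidden)                                  # (b, L, 2h)
        q, k = qk[..., : self.head_size], qk[..., self.head_size :]
        # Pair score: dot product of the start token's q and the end token's k.
        span = torch.einsum("bmh,bnh->bmn", q, k) / self.head_size ** 0.5
        b, L = hidden.size(0), hidden.size(1)
        bias = self.type_proj(qk).view(b, L, -1, 2)                # (b, L, t, 2)
        start_bias = bias[..., 0].transpose(1, 2)                  # (b, t, L)
        end_bias = bias[..., 1].transpose(1, 2)                    # (b, t, L)
        # logits[b, t, i, j] = span(i, j) + start_bias(t, i) + end_bias(t, j)
        return span[:, None] + start_bias[..., None] + end_bias[:, :, None]
```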
- 2022/04/23 Initial build
- 2022/06/23 Added the boundary-smoothing function
python==3.6
torch==1.8.1
transformers==4.4.1
I use a server for training; its setup is:
python==3.7
torch==1.8.1
transformers==4.10.0
For more details, please refer to `server_requirements.txt`.
EfficientGlobalPointer4KeyExtraction
├── ensemble.sh # model ensemble script
├── GP_runner.sh # finetune script
├── README.md
├── server_requirements.txt # server dependencies
├── result.txt
├── err.log
├── checkpoints # model save dir
│ ├── bad_cases_GP.txt # badcase output
│ └── experiments_notes.md # experiment record
├── datasets
│ └── split_data
│ ├── biaffine_labels.txt # Entity tag file (entity type, without BIO)
│ ├── dev.json
│ ├── test.json
│ ├── train.json
│ │ ├── features # handcrafted feature dir
│ │ │ ├── dic # keyword dict
│ │ │ │ ├── all_dic.json # all the keywords
│ │ │ │ ├── get_train_dic.py
│ │ │ │ ├── thu_caijing_dic.json # keyword dict from Tsinghua University
│ │ │ │ └── train_dic.json
│ │ └── word_feature
│ │ ├── dev_flag_features.json
│ │ ├── dev_word_emb_features.json
│ │ ├── dev_word_features.json
│ │ ├── flag2id.json
│ │ └── ...
│ ├── get_mlm_data.py
│ ├── mlm # pretrain the PLM
│ │ ├── mlm_dev.txt
│ │ └── mlm_train.txt
│ └── enhanced_train.json # the augmented samples file
├── enhancement
│ └── replace.py # replace keywords
├── mlm
│ ├── pretrain.sh
│ └── run_mlm.py
├── models
│ ├── GlobalPointer.py
│ ├── __init__.py
│ └── __pycache__
│ ├── GlobalPointer.cpython-36.pyc
│ └── __init__.cpython-36.pyc
├── src
│ ├── __pycache__
│ │ └── predict_CME.cpython-36.pyc
│ ├── ensemble.py
│ ├── predict_CME.py
│ └── train_CME.py
├── utils
│ ├── __init__.py
│ ├── __pycache__
│ │ ├── __init__.cpython-36.pyc
│ │ ├── bert_optimization.cpython-36.pyc
│ │ ├── data_loader.cpython-36.pyc
│ │ ├── finetuning_argparse.cpython-36.pyc
│ │ ├── logger.cpython-36.pyc
│ │ └── ths_data_utills.cpython-36.pyc
│ ├── bert_optimization.py
│ ├── data_loader.py
│ ├── finetuning_argparse.py # args setting file
│ ├── logger.py
│ └── ths_data_utills.py
└── word2vec
├── data
│ └── word2vec_dic.json # w2v location
├── gensim_ttl.py
└── save_w2v_features.py
- Hierarchical (layer-wise) learning rate
- Output threshold tuning
- Add an LSTM layer
- Add three handcrafted features
- Continued pre-training (MLM)
- R-Drop
- FGM: adversarial training (see the sketch after this list)
- Data augmentation: keyword replacement
- Model ensemble
- SWA
- Boundary smoothing
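FGM (Fast Gradient Method) perturbs the embedding matrix along the gradient direction and backpropagates a second, adversarial loss. A minimal sketch of the common recipe, assuming the encoder's embedding parameters contain `word_embeddings` in their names (not copied from this repo's code):

```python
import torch

class FGM:
    def __init__(self, model, emb_name: str = "word_embeddings", epsilon: float = 1.0):
        self.model = model
        self.emb_name = emb_name
        self.epsilon = epsilon
        self.backup = {}

    def attack(self):
        # Perturb embedding weights along the normalized gradient direction.
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        # Undo the perturbation after the adversarial backward pass.
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical training-loop usage:
#   loss.backward()          # normal gradients
#   fgm.attack()             # add perturbation to the embeddings
#   loss_adv = compute_loss(model, batch)
#   loss_adv.backward()      # accumulate adversarial gradients
#   fgm.restore()            # remove the perturbation
#   optimizer.step(); optimizer.zero_grad()
```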
The data files are JSON lists of `[text, [keywords]]` pairs, e.g.:

```json
[
    [
        "和成都海光微相关的上市公司有哪些",
        [
            "成都海光微",
            "上市公司"
        ]
    ],
    [
        "股价异常波动停牌的股票",
        [
            "股价异常波动",
            "股票"
        ]
    ],
    ...
]
```
- Download the PyTorch pre-trained model and pass its path via the `--bert_model_path` argument.
- To start training, refer to `GP_runner.sh` for example scripts and to `finetuning_argparse.py` for the tunable arguments.
- `in_dic`: co-occurrence feature; the subscript is set to 1 if a labeled keyword appears in the input text.
- `w2v_emb`: splice `word_emb` onto `token_emb`.
- `flag_id`: add part-of-speech one-hot features.
Prerequisite for `in_dic`: generate the keyword list `all_dic.json` with `datasets/split_data/features/dic/get_train_dic.py`; once it exists, the feature can be computed (a minimal sketch follows).
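A minimal sketch of the `in_dic` feature, assuming character-level inputs and a JSON keyword list such as `all_dic.json`; the helper name is hypothetical:

```python
import json

def build_in_dic_feature(text: str, keywords: list) -> list:
    """0/1 vector over characters: 1 where a dictionary keyword occurs."""
    feature = [0] * len(text)
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            for i in range(start, start + len(kw)):
                feature[i] = 1                       # position covered by a keyword
            start = text.find(kw, start + 1)
    return feature

# keywords = json.load(open("datasets/split_data/features/dic/all_dic.json"))
# build_in_dic_feature("股价异常波动停牌的股票", ["股价异常波动", "股票"])
# -> [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
```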
Prerequisites for `w2v_emb`:
- Train w2v features yourself or download pre-trained ones, process them into the dictionary form `{ token: emb_list }`, and save them to `word2vec/data/word2vec_dic.json`.
- Use `word2vec/save_w2v_features.py` to save the w2v features of each word to `datasets/split_data/features/word_feature/..._word_emb_features.json`.
- Add the corresponding args and run (a splicing sketch follows this list).
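A minimal sketch of splicing the w2v vectors onto token embeddings, assuming `word2vec_dic.json` maps token → embedding list and that all vectors share one dimension; the function names are hypothetical:

```python
import json
import torch

def load_w2v(path: str = "word2vec/data/word2vec_dic.json") -> dict:
    with open(path, encoding="utf-8") as f:
        return {tok: torch.tensor(vec) for tok, vec in json.load(f).items()}

def splice_w2v(token_emb: torch.Tensor, tokens: list, w2v: dict, dim: int = 100) -> torch.Tensor:
    # token_emb: (seq_len, hidden); zero vector for out-of-vocabulary tokens
    word_emb = torch.stack([w2v.get(t, torch.zeros(dim)) for t in tokens])
    return torch.cat([token_emb, word_emb], dim=-1)   # (seq_len, hidden + dim)
```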
Prerequisites for `flag_id`: the same as for `in_dic` and `w2v_emb` (a POS-feature sketch follows).
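A minimal sketch of the `flag_id` feature, assuming jieba as the POS tagger and a `flag2id.json` mapping from POS tags to ids (both are assumptions; the helper name is hypothetical):

```python
import jieba.posseg as pseg

def build_flag_ids(text: str, flag2id: dict) -> list:
    """Assign each character the id of the POS tag of the word covering it."""
    flag_ids = []
    for word, flag in pseg.cut(text):
        flag_ids.extend([flag2id.get(flag, 0)] * len(word))   # 0 = unknown tag
    return flag_ids
```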
- Download or build a keyword list; here the THU-caijing corpus is used, placed at `datasets/split_data/features/dic/thu_caijing_dic.json`.
- Run `enhancement/replace.py` to get the augmented samples in `datasets/split_data/enhanced_train.json` (a replacement sketch follows this list).
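A minimal sketch of the replacement idea, in the spirit of `enhancement/replace.py` (the actual script may differ; the function name is hypothetical):

```python
import json
import random

def replace_keywords(sample: list, dic: list) -> list:
    """Swap each labeled keyword in a [text, [keywords]] pair for a random one."""
    text, keywords = sample
    new_keywords = []
    for kw in keywords:
        new_kw = random.choice(dic)
        text = text.replace(kw, new_kw)   # substitute the keyword in the text
        new_keywords.append(new_kw)
    return [text, new_keywords]

# dic = list(json.load(open("datasets/split_data/features/dic/thu_caijing_dic.json")))
# replace_keywords(["股价异常波动停牌的股票", ["股价异常波动", "股票"]], dic)
```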
Add `self.get_boundary_smoothing()` in `data_loader.py` to generate soft labels. Compared with the reference implementation, the following changes are made (see the sketch after this list):

- Adjust the dimension order to match this project.
- `[CLS]` and `[SEP]` must never fall inside a soft label.
- Allow `start index == end index`, i.e., a single token can be an entity.
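A minimal sketch of boundary smoothing under the three rules above; `get_boundary_smoothing()` in the repo may differ in detail:

```python
import torch

def boundary_smooth(spans, seq_len, num_types, eps=0.1, D=1):
    """spans: list of (type_id, start, end) gold entities over token indices.
    Returns a soft target of shape (num_types, seq_len, seq_len)."""
    target = torch.zeros(num_types, seq_len, seq_len)
    for t, s, e in spans:
        # Neighbouring spans: start/end shifted by at most D. Positions 0 and
        # seq_len - 1 ([CLS]/[SEP]) never receive mass; start == end is allowed.
        neigh = [
            (s + ds, e + de)
            for ds in range(-D, D + 1)
            for de in range(-D, D + 1)
            if (ds, de) != (0, 0) and 1 <= s + ds <= e + de < seq_len - 1
        ]
        target[t, s, e] += 1.0 - (eps if neigh else 0.0)
        for ns, ne in neigh:
            target[t, ns, ne] += eps / len(neigh)   # spread eps over neighbours
    return target
```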
In `train_CME.py`, adjust the loss computation in `multilabel_categorical_crossentropy()` to work with the soft labels (the standard hard-label form is sketched below).
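For reference, the hard-label form of the loss as popularized with GlobalPointer; this repo's soft-label adjustment builds on it:

```python
import torch

def multilabel_categorical_crossentropy(y_true, y_pred):
    # y_true: 0/1 targets, y_pred: raw logits, same shape (..., num_spans)
    y_pred = (1 - 2 * y_true) * y_pred            # flip the sign of positive logits
    y_pred_neg = y_pred - y_true * 1e12           # mask positives out of the neg term
    y_pred_pos = y_pred - (1 - y_true) * 1e12     # mask negatives out of the pos term
    zeros = torch.zeros_like(y_pred[..., :1])     # fixed threshold logit at 0
    neg_loss = torch.logsumexp(torch.cat([y_pred_neg, zeros], dim=-1), dim=-1)
    pos_loss = torch.logsumexp(torch.cat([y_pred_pos, zeros], dim=-1), dim=-1)
    return (neg_loss + pos_loss).mean()
```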
The `checkpoints` argument in the ensemble script is a space-separated list of checkpoint paths (a logit-averaging sketch follows).
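A minimal logit-averaging sketch in the spirit of `src/ensemble.py` (the actual script may differ; assumes the model returns raw span logits):

```python
import torch

def ensemble_logits(model, checkpoint_arg: str, batch: dict) -> torch.Tensor:
    paths = checkpoint_arg.split()                # "a.pt b.pt ..." -> list of paths
    logits = None
    for path in paths:
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        with torch.no_grad():
            out = model(**batch)                  # raw span logits for this checkpoint
        logits = out if logits is None else logits + out
    return logits / len(paths)                    # average over checkpoints
```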