Preprocess pre-training data
zzy14 committed Oct 25, 2019
1 parent 8e2d9b7 commit 8759047
Showing 7 changed files with 3,907 additions and 6 deletions.
21 changes: 15 additions & 6 deletions README.md
@@ -15,17 +15,26 @@ Source code and dataset for "ERNIE: Enhanced Language Representation with Inform
Run the following command to create training instances.

```shell
cd pretrain_data
# Download Wikidump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# Download alias file
wget -c https://cloud.tsinghua.edu.cn/f/a519318708df4dc8a853/?dl=1 -O alias_entity.txt
# WikiExtractor
python3 WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
python3 pretrain_data/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
# Modify anchor with 4 processes
python3 extract.py 4
python3 pretrain_data/extract.py 4
# Preprocess with 4 processes
python3 create_ids.py 4
# create instances for part 0
python3 ../code/create_instances.py --input_file_prefix raw/0 --output_file pretrain_data/0 --vocab_file ernie_base/vocab.txt --dupe_factor 1 --max_seq_length 256 --max_predictions_per_seq 40
python3 pretrain_data/create_ids.py 4
# create instances
python3 pretrain_data/create_insts.py 4
# merge
python3 code/merge.py
```
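
The preprocessing steps can also be driven from a single script. The sketch below is not part of this commit: `preprocessing_commands` and `run_all` are hypothetical helper names, and it assumes the commands prefixed with `pretrain_data/` or `code/` are the post-commit versions (the diff markers are not visible in this view), that the trailing `4` is a process count for every script, and that it is run from the repository root after both downloads have finished.

```python
import subprocess


def preprocessing_commands(processes=4):
    """Build the preprocessing commands in the order the README lists them.

    Assumption: the numeric argument of each script is the number of
    worker processes, as the README comments suggest.
    """
    n = str(processes)
    return [
        ["python3", "pretrain_data/WikiExtractor.py",
         "enwiki-latest-pages-articles.xml.bz2",
         "-o", "pretrain_data/output", "-l",
         "--min_text_length", "100", "--filter_disambig_pages",
         "-it", "abbr,b,big", "--processes", n],
        ["python3", "pretrain_data/extract.py", n],      # modify anchors
        ["python3", "pretrain_data/create_ids.py", n],   # preprocess
        ["python3", "pretrain_data/create_insts.py", n], # create instances
        ["python3", "code/merge.py"],                    # merge shards
    ]


def run_all(processes=4):
    """Run each step in turn and stop at the first failure."""
    for cmd in preprocessing_commands(processes):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_all(4)` would mirror the shell block above; `check=True` makes a failing step raise instead of letting later steps run on incomplete output.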

Run the following command to pretrain:

```
python3 code/run_pretrain.py --do_train --data_dir pretrain_data/merge --bert_model ernie_base --output_dir pretrain_out/ --task_name pretrain --fp16 --max_seq_length 256
```

#### Pre-trained Model
8 changes: 8 additions & 0 deletions code/merge.py
@@ -0,0 +1,8 @@
import indexed_dataset
import os

builder = indexed_dataset.IndexedDatasetBuilder('pretrain_data/merge.bin')
for filename in os.listdir("pretrain_data/data"):
    if filename[-4:] == '.bin':
        builder.merge_file_("pretrain_data/data/"+filename[:-4])
builder.finalize("pretrain_data/merge.idx")
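
The file-selection step in `merge.py` can be looked at in isolation. The sketch below uses a hypothetical helper, `shard_prefixes`, which is not part of the repository; it collects the same `.bin` prefixes but sorts them first, on the assumption that a reproducible merge order is wanted (`os.listdir` returns entries in arbitrary order).

```python
import os
import tempfile


def shard_prefixes(data_dir):
    """Return sorted path prefixes for every '<name>.bin' file in data_dir."""
    prefixes = []
    for filename in sorted(os.listdir(data_dir)):
        stem, ext = os.path.splitext(filename)
        if ext == ".bin":
            # Drop the extension; the builder is given the shared prefix
            # of the .bin/.idx pair, as merge.py does with filename[:-4].
            prefixes.append(os.path.join(data_dir, stem))
    return prefixes


if __name__ == "__main__":
    # Demonstrate on a throwaway directory with fake shard files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("1.bin", "1.idx", "0.bin", "0.idx", "notes.txt"):
            open(os.path.join(tmp, name), "w").close()
        print([os.path.basename(p) for p in shard_prefixes(tmp)])
```

Each returned prefix could then be passed to `builder.merge_file_` in place of the inline string concatenation.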