A PyTorch implementation of the assessment of word-level neural LMs for sentence completion. This repository is built upon Link.

## Requirements
- numpy
- pandas
- tqdm
- pytorch == 1.1.0
- pytorch-transformers == 1.0.0
- sentencepiece (for tokenization of BERT models)
- nltk == 3.3 (download the `punkt` package for tokenization when experimenting with WordRNNs; see the sketch below)
## Datasets

- Microsoft Research Sentence Completion Challenge
  - Training and test datasets can be downloaded from Link. Store the downloaded test data in `data/completion/`.
- Scholastic Aptitude Test sentence completion questions
  - Collected questions are provided in link. Store the downloaded test data in `data/completion/`.
- TOPIK cloze questions
  - 10 samples are contained in `data/completion/topik_sample.csv` (see the loading sketch after this list)
  - Metadata for all questions is provided in `data/completion/topik_sample.csv`
  - You may request the full set via e-mail
- Nineteenth century novels (19C novels)
  - A preprocessed dataset can be downloaded from link.
- Sejong corpus can be downloaded through link
- Pre-trained LM1B can be downloaded from Link
- Pre-trained transformer models from pytorch-transformers
  - automatically downloaded when running `eval_pretrained.py` with the corresponding options
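To take a quick look at the bundled TOPIK samples, the CSV can be loaded with pandas (already listed in the requirements); a minimal sketch that makes no assumption about the column layout:

```python
# Peek at the bundled TOPIK cloze samples without assuming
# particular column names: print the header and the first rows.
import pandas as pd

samples = pd.read_csv('data/completion/topik_sample.csv')
print(samples.columns.tolist())
print(samples.head())
```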
## Settings

Create `./settings.json` containing:
```json
{
  "prob_set_dir": "data/completion/",
  "prepro_dir": "path_to_prepro_dir",
  "lm1b_dir": "path_to_dir_containing_lm1b_model",
  "pretrans_dir": "path_to_dir_containing_pytorch_transformers",
  "sejong_dir": "path_to_dir_containing_sejong_corpus"
}
```
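Before training, it can help to verify that the file parses and that every configured directory exists; a minimal sketch (the key names come from the JSON above, the check itself is illustrative):

```python
# Sanity-check ./settings.json: parse it and flag any configured
# path that does not exist yet.
import json
import os

with open('./settings.json') as f:
    settings = json.load(f)

for key, path in settings.items():
    status = 'ok' if os.path.exists(path) else 'MISSING'
    print(f'{key}: {path} [{status}]')
```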
## Training

- WordRNN

  ```
  python3 train.py --save_dir mynet
  ```
## Evaluation

```
python3 eval_trained.py --dir mynet
```
## Fine-tuning

```
python3 finetune.py --model {bert,gpt,gpt2} --pretrained saved_name --update-embeddings
```

where `--model` takes one of `bert`, `gpt`, or `gpt2`.
## Acknowledgement

Thanks to Sukhyun Cho, who manually collected and annotated the TOPIK questions.