This work is part of the course L-645 (CSCI B-659) Advanced Natural Language Processing at Indiana University, Bloomington.
Paper link: https://arxiv.org/abs/2212.04595
Sentence simplification aims to make the structure and wording of text easier to read and understand while preserving its original meaning. This can help people with disabilities, new language learners, and readers with low literacy. Simplification often involves removing difficult words and rephrasing the sentence. This repo contains the code for fine-tuning transformer models for sentence simplification.
- Train models for sentence simplification in PyTorch.
- Models include GPT-2, BERT, a GPT-2 encoder with a BERT decoder, and a BERT encoder with a GPT-2 decoder (a rough sketch of the hybrid pairing follows this list).
- Evaluate results on metrics like SARI, FKGL, and BLEU.
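As a rough illustration of how the hybrid variants can be wired together, the sketch below pairs a BERT encoder with a GPT-2 decoder via Hugging Face's EncoderDecoderModel. The checkpoint names and token settings are assumptions for illustration and may differ from what train.py actually does.

```python
from transformers import BertTokenizerFast, GPT2TokenizerFast, EncoderDecoderModel

# Hedged sketch (not the exact wiring in train.py): pair a pretrained BERT
# encoder with a pretrained GPT-2 decoder, i.e. the "bert_gpt2" configuration.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")

enc_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
dec_tok = GPT2TokenizerFast.from_pretrained("gpt2")
dec_tok.pad_token = dec_tok.eos_token  # GPT-2 defines no pad token by default

# generate() needs to know which ids start, end, and pad decoder sequences.
model.config.decoder_start_token_id = dec_tok.bos_token_id
model.config.eos_token_id = dec_tok.eos_token_id
model.config.pad_token_id = dec_tok.pad_token_id

src = "The incumbent legislation was promulgated in 1998."
enc = enc_tok(src, return_tensors="pt")
# The cross-attention weights are randomly initialized here, so the output is
# only meaningful after fine-tuning on the simplification data.
out = model.generate(enc.input_ids, attention_mask=enc.attention_mask, max_length=32)
print(dec_tok.decode(out[0], skip_special_tokens=True))
```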
The models were trained on the WikiLarge dataset, which you can download from David Kauchak's webpage or from XingxingZhang/dress.
For ease of access, a small, cleaned dataset from Aakash12980/Sentence-Simplification-using-BERT-GPT2 is provided; it can also be used for training and evaluation purposes.
.
│
├── dataset
│ ├── src_train.txt
│ ├── src_valid.txt
│ ├── src_test.txt
│ ├── tgt_train.txt
│ ├── tgt_valid.txt
│ ├── tgt_test.txt
│ ├── ref_test.pkl
│ ├── ref_valid.pkl
│
├── src
│ ├── datagen.py
│ ├── evaluate.py
│ ├── sari.py
│ ├── train.py
│ ├── train_from_scratch.py
│ ├── utils.py
│
├── requirements.txt
├── README.md
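The src_*.txt and tgt_*.txt files are assumed to be line-aligned, with each src line holding a complex sentence and the matching tgt line its simplified version (the usual WikiLarge layout; see datagen.py for the loader actually used). A minimal reading sketch:

```python
from pathlib import Path

def load_pairs(src_path, tgt_path):
    """Read line-aligned complex/simple sentence pairs.

    Assumes each line of the src file corresponds to the same line of the
    tgt file; datagen.py implements the repository's own loading logic.
    """
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    assert len(src) == len(tgt), "src/tgt files must have the same number of lines"
    return list(zip(src, tgt))

pairs = load_pairs("dataset/src_train.txt", "dataset/tgt_train.txt")
print(len(pairs), pairs[0])
```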
The code uses Python 3.9.8 and PyTorch 1.10.2. The other requirements are:
nltk==3.6.7
numpy==1.22.2
tokenizers==0.13.2
torch==1.10.2
tqdm==4.62.3
transformers==4.24.0
Download the code
git clone https://github.com/amanbasu/sentence-simplification.git
Install requirements
pip install -r requirements.txt
train.py usage
$ python train.py -h
usage: train.py [-h] [--model {gpt2,bert,bert_gpt2,gpt2_bert}] [--max_length MAX_LENGTH] [--epochs EPOCHS] [--init_epoch INIT_EPOCH] [--batch_size BATCH_SIZE] [--lr LR] [--save_path SAVE_PATH]
Arguments for training.
optional arguments:
-h, --help show this help message and exit
--model {gpt2,bert,bert_gpt2,gpt2_bert}
model type
--max_length MAX_LENGTH
maximum length for encoder
--epochs EPOCHS number of training epochs
--init_epoch INIT_EPOCH
epoch to resume the training from
--batch_size BATCH_SIZE
batch size for training
--lr LR learning rate for training
--save_path SAVE_PATH
model save path
evaluate.py usage
$ python evaluate.py -h
usage: evaluate.py [-h] [--model {gpt2,bert,bert_gpt2,gpt2_bert}] [--max_length MAX_LENGTH] [--batch_size BATCH_SIZE] [--model_path MODEL_PATH] [--save_predictions {True,False}] [--pred_path PRED_PATH]
Arguments for evaluation.
optional arguments:
-h, --help show this help message and exit
--model {gpt2,bert,bert_gpt2,gpt2_bert}
model type
--max_length MAX_LENGTH
maximum length for encoder
--batch_size BATCH_SIZE
batch size for evaluation
--model_path MODEL_PATH
model save path
--save_predictions {True,False}
saves predictions in a txt file
--pred_path PRED_PATH
path to save the predictions
To train a model
python train.py --model bert --epochs 5 --batch_size 20 --save_path '../checkpoint/model_bert.pt'
To evaluate a model
python evaluate.py --model bert --model_path '../checkpoint/model_bert.pt' --save_predictions True --pred_path '../bert_predictions.txt'
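The scores can also be recomputed from a saved predictions file. The sketch below uses the EASSE toolkit linked in the acknowledgements with a single reference file; the paths are illustrative, and evaluate.py / sari.py implement the repository's own scoring:

```python
from pathlib import Path
from easse.sari import corpus_sari
from easse.fkgl import corpus_fkgl
from easse.bleu import corpus_bleu

# Hedged sketch: score a predictions file with EASSE. File paths are
# illustrative; the test set may come with multiple references (ref_test.pkl).
orig = Path("dataset/src_test.txt").read_text(encoding="utf-8").splitlines()
refs = Path("dataset/tgt_test.txt").read_text(encoding="utf-8").splitlines()
sys_out = Path("bert_predictions.txt").read_text(encoding="utf-8").splitlines()

print("SARI:", corpus_sari(orig_sents=orig, sys_sents=sys_out, refs_sents=[refs]))
print("FKGL:", corpus_fkgl(sys_out))
print("BLEU:", corpus_bleu(sys_sents=sys_out, refs_sents=[refs]))
```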
Figure 1. A comparison of our model's performance against previous studies.
@inproceedings{Agarwal2022ExplainTM,
  title  = {Explain to me like I am five -- Sentence Simplification Using Transformers},
  author = {Aman Agarwal},
  year   = {2022},
  doi    = {10.48550/ARXIV.2212.04595},
}
- Sample data: https://github.com/Aakash12980/Sentence-Simplification-using-BERT-GPT2
- WikiLarge: https://cs.pomona.edu/~dkauchak/simplification/
- Train models from scratch: https://huggingface.co/blog/how-to-train
- SARI implementation: https://github.com/cocoxu/simplification
- Other metrics (EASSE): https://github.com/feralvam/easse