This work is part of the course L-645 (CSCI B-659) Advanced Natural Language Processing at Indiana University, Bloomington.
Paper link: https://arxiv.org/abs/2212.04595
Sentence simplification aims to make the structure and wording of text easier to read and understand while preserving its original meaning. This can help people with disabilities, new language learners, and readers with low literacy. Simplification often involves removing difficult words and rephrasing the sentence. This repo contains the code for fine-tuning transformer models for sentence simplification.
- Train models for sentence simplification in PyTorch.
- Models include GPT-2, BERT, a GPT-2 encoder with a BERT decoder, and a BERT encoder with a GPT-2 decoder (a rough sketch of the hybrid pairing follows this list).
- Evaluate results on metrics like SARI, FKGL, and BLEU.
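As a rough illustration of how the hybrid variants can be wired together, the sketch below pairs a BERT encoder with a GPT-2 decoder via Hugging Face's EncoderDecoderModel. The checkpoint names and token settings are assumptions for illustration and may differ from what train.py actually does.

```python
from transformers import BertTokenizerFast, GPT2TokenizerFast, EncoderDecoderModel

# Hedged sketch (not the exact wiring in train.py): pair a pretrained BERT
# encoder with a pretrained GPT-2 decoder, i.e. the "bert_gpt2" configuration.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")

enc_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
dec_tok = GPT2TokenizerFast.from_pretrained("gpt2")
dec_tok.pad_token = dec_tok.eos_token  # GPT-2 defines no pad token by default

# generate() needs to know which ids start, end, and pad decoder sequences.
model.config.decoder_start_token_id = dec_tok.bos_token_id
model.config.eos_token_id = dec_tok.eos_token_id
model.config.pad_token_id = dec_tok.pad_token_id

src = "The incumbent legislation was promulgated in 1998."
enc = enc_tok(src, return_tensors="pt")
# The cross-attention weights are randomly initialized here, so the output is
# only meaningful after fine-tuning on the simplification data.
out = model.generate(enc.input_ids, attention_mask=enc.attention_mask, max_length=32)
print(dec_tok.decode(out[0], skip_special_tokens=True))
```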
The models were trained on the WikiLarge dataset, which you can download from David Kauchak's webpage or from XingxingZhang/dress.
For ease of access, a small, cleaned dataset from Aakash12980/Sentence-Simplification-using-BERT-GPT2 is provided; it can also be used for training and evaluation purposes.
.
│
├── dataset
│ ├── src_train.txt
│ ├── src_valid.txt
│ ├── src_test.txt
│ ├── tgt_train.txt
│ ├── tgt_valid.txt
│ ├── tgt_test.txt
│ ├── ref_test.pkl
│ ├── ref_valid.pkl
│
├── src
│ ├── datagen.py
│ ├── evaluate.py
│ ├── sari.py
│ ├── train.py
│ ├── train_from_scratch.py
│ ├── utils.py
│
├── requirements.txt
├── README.md
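The src_*.txt and tgt_*.txt files are assumed to be line-aligned, with each src line holding a complex sentence and the matching tgt line its simplified version (the usual WikiLarge layout; see datagen.py for the loader actually used). A minimal reading sketch:

```python
from pathlib import Path

def load_pairs(src_path, tgt_path):
    """Read line-aligned complex/simple sentence pairs.

    Assumes each line of the src file corresponds to the same line of the
    tgt file; datagen.py implements the repository's own loading logic.
    """
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    assert len(src) == len(tgt), "src/tgt files must have the same number of lines"
    return list(zip(src, tgt))

pairs = load_pairs("dataset/src_train.txt", "dataset/tgt_train.txt")
print(len(pairs), pairs[0])
```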
The code uses Python 3.9.8 and PyTorch 1.10.2. The other requirements are:
nltk==3.6.7
numpy==1.22.2
tokenizers==0.13.2
torch==1.10.2
tqdm==4.62.3
transformers==4.24.0
Download the code
git clone https://github.com/amanbasu/sentence-simplification.git
Install requirements
pip install -r requirements.txt
train.py usage
$ python train.py -h
usage: train.py [-h] [--model {gpt2,bert,bert_gpt2,gpt2_bert}] [--max_length MAX_LENGTH] [--epochs EPOCHS] [--init_epoch INIT_EPOCH] [--batch_size BATCH_SIZE] [--lr LR] [--save_path SAVE_PATH]
Arguments for training.
optional arguments:
-h, --help show this help message and exit
--model {gpt2,bert,bert_gpt2,gpt2_bert}
model type
--max_length MAX_LENGTH
maximum length for encoder
--epochs EPOCHS number of training epochs
--init_epoch INIT_EPOCH
epoch to resume the training from
--batch_size BATCH_SIZE
batch size for training
--lr LR learning rate for training
--save_path SAVE_PATH
model save path
evaluate.py usage
$ python evaluate.py -h
usage: evaluate.py [-h] [--model {gpt2,bert,bert_gpt2,gpt2_bert}] [--max_length MAX_LENGTH] [--batch_size BATCH_SIZE] [--model_path MODEL_PATH] [--save_predictions {True,False}] [--pred_path PRED_PATH]
Arguments for evaluation.
optional arguments:
-h, --help show this help message and exit
--model {gpt2,bert,bert_gpt2,gpt2_bert}
model type
--max_length MAX_LENGTH
maximum length for encoder
--batch_size BATCH_SIZE
batch size for evaluation
--model_path MODEL_PATH
model save path
--save_predictions {True,False}
saves predictions in a txt file
--pred_path PRED_PATH
path to save the predictions
To train a model
python train.py --model bert --epochs 5 --batch_size 20 --save_path '../checkpoint/model_bert.pt'
To evaluate a model
python evaluate.py --model bert --model_path '../checkpoint/model_bert.pt' --save_predictions True --pred_path '../bert_predictions.txt'
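The scores can also be recomputed from a saved predictions file. The sketch below uses the EASSE toolkit linked in the acknowledgements with a single reference file; the paths are illustrative, and evaluate.py / sari.py implement the repository's own scoring:

```python
from pathlib import Path
from easse.sari import corpus_sari
from easse.fkgl import corpus_fkgl
from easse.bleu import corpus_bleu

# Hedged sketch: score a predictions file with EASSE. File paths are
# illustrative; the test set may come with multiple references (ref_test.pkl).
orig = Path("dataset/src_test.txt").read_text(encoding="utf-8").splitlines()
refs = Path("dataset/tgt_test.txt").read_text(encoding="utf-8").splitlines()
sys_out = Path("bert_predictions.txt").read_text(encoding="utf-8").splitlines()

print("SARI:", corpus_sari(orig_sents=orig, sys_sents=sys_out, refs_sents=[refs]))
print("FKGL:", corpus_fkgl(sys_out))
print("BLEU:", corpus_bleu(sys_sents=sys_out, refs_sents=[refs]))
```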
Figure 1. A comparison of our model's performance against previous studies.
@inproceedings{Agarwal2022ExplainTM,
  title  = {Explain to me like I am five -- Sentence Simplification Using Transformers},
  author = {Aman Agarwal},
  year   = {2022},
  doi    = {10.48550/ARXIV.2212.04595},
}
- Sample data: https://github.com/Aakash12980/Sentence-Simplification-using-BERT-GPT2
- WikiLarge: https://cs.pomona.edu/~dkauchak/simplification/
- Train models from scratch: https://huggingface.co/blog/how-to-train
- SARI implementation: https://github.com/cocoxu/simplification
- Other metrics (EASSE): https://github.com/feralvam/easse