CTRL-SIMP

This repo contains the Med-EASi (Medical dataset for Elaborative and Abstractive Simplification) dataset and models for controllable medical text simplification.

Overview

Med-EASi is a uniquely crowdsourced and finely annotated dataset for supervised simplification of short medical texts. Its expert-layman-AI collaborative annotations enable control over text simplification by marking four kinds of textual transformations: elaboration, replacement, deletion, and insertion. Med-EASi contains diverse training pairs, each with a different set of textual transformations. We also introduce two T5-large-based models for controllable simplification. The models are trained with heterogeneous task descriptions, an approach called multi-angle training, in which each input and output is written as a combination of slots, similar to MACAW.
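To make the slot idea concrete, here is a minimal sketch of how one "angle" could be serialized into a source/target pair in a MACAW-like style. The slot names (`expert`, `simple`), the `$slot$ = value` format, the `build_angle` helper, and the example sentences are all illustrative assumptions; the slot format actually used in this repo may differ.

```python
def build_angle(slots, input_slots, output_slots):
    """Format one example for a given angle (a choice of input and output slots).

    Hypothetical format: requested output slots are named first, followed by
    the filled-in input slots; the target fills in the requested slots.
    """
    requested = " ; ".join(f"${name}$" for name in output_slots)
    provided = " ; ".join(f"${name}$ = {slots[name]}" for name in input_slots)
    source = f"{requested} ; {provided}"
    target = " ; ".join(f"${name}$ = {slots[name]}" for name in output_slots)
    return source, target


# Invented example text, not taken from Med-EASi.
example = {
    "expert": "The patient presented with dyspnea.",
    "simple": "The patient had trouble breathing.",
}

source, target = build_angle(example, input_slots=["expert"], output_slots=["simple"])
print(source)  # $simple$ ; $expert$ = The patient presented with dyspnea.
print(target)  # $simple$ = The patient had trouble breathing.
```

Training on several such angles per example (different choices of input and output slots) is what makes the task descriptions heterogeneous.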

Models

We fine-tune T5-large with a combination of prompting and infilling to add two types of controllability to text simplification:

$CTRL-SIM_{ip}$ : position-aware, where label=annotated data
$CTRL-SIM$ : position-agnostic, where label=content of the expert text that must be edited, type of edit and the unannotated simple text

The models were trained with Python 3.8.10 and PyTorch 1.10 (CUDA 11.3). The T5-large model was trained on two CUDA devices (GeForce RTX 3080 each).

To train the model:

Make appropriate changes to the data path, model path, and CUDA devices.
Make sure the NLTK and pandas installations are working.
Run training.py.
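Before launching training, a short pre-flight check can catch a missing package or a wrong device selection early. The sketch below is not part of this repo: the helper names are hypothetical, and it assumes devices are selected through the standard `CUDA_VISIBLE_DEVICES` environment variable, whereas the training script may set its devices differently.

```python
import importlib.util
import os


def select_cuda_devices(device_ids):
    """Restrict this process to the given GPU indices.

    Must be called before any library initializes CUDA.
    """
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    select_cuda_devices([0, 1])  # two devices, matching the setup described above
    missing = missing_packages(["nltk", "pandas", "torch"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found.")
```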

Model arguments:

--ip_ann: bool, default = True, include the in-place annotated data Sa and Ea
--one_slot: bool, default = False, train only on the in-place annotated data (Ea->Sa)
--shuffle: bool, default = True, shuffle the training data so that the same examples do not end up in one batch
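As an illustration of how boolean options like these can be parsed, the following sketch declares the three flags with `argparse` and an explicit string-to-boolean converter (plain `type=bool` would treat any non-empty string, including "False", as true). This is a hypothetical stand-in; the parser in training.py may be written differently and may expect a different command-line syntax.

```python
import argparse


def str2bool(value):
    """Convert common textual spellings of a boolean into True/False."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the training options")
    parser.add_argument("--ip_ann", type=str2bool, default=True,
                        help="include in-place annotated data Sa and Ea")
    parser.add_argument("--one_slot", type=str2bool, default=False,
                        help="train only on in-place annotated data (Ea->Sa)")
    parser.add_argument("--shuffle", type=str2bool, default=True,
                        help="shuffle the training data")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--one_slot", "true", "--shuffle", "false"])
print(args.ip_ann, args.one_slot, args.shuffle)  # True True False
```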

The evaluation functions are in model.py.

Dataset

The dataset is available at https://huggingface.co/datasets/cbasu/Med-EASi

To load the dataset with the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("cbasu/Med-EASi")

Alternatively, clone the dataset repository with Git LFS:

git lfs install
git clone https://huggingface.co/datasets/cbasu/Med-EASi
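Once the dataset is loaded with `load_dataset`, it helps to check which splits and columns are available before wiring it into training. The sketch below makes no assumption about the split or column names; it only reads the `num_rows` and `column_names` attributes that `datasets` objects expose. The helper names are hypothetical.

```python
def describe_splits(dataset):
    """Return {split_name: (num_rows, column_names)} for a DatasetDict-like mapping."""
    return {
        name: (split.num_rows, list(split.column_names))
        for name, split in dataset.items()
    }


def load_and_describe(repo_id="cbasu/Med-EASi"):
    """Download the dataset and print a one-line summary per split.

    Requires the `datasets` package and network access.
    """
    from datasets import load_dataset

    dataset = load_dataset(repo_id)
    for name, (rows, columns) in describe_splits(dataset).items():
        print(f"{name}: {rows} rows, columns = {columns}")
    return dataset
```

Calling `load_and_describe()` downloads the data on first use and prints the summary.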

Citation

@article{basu2023med,
  title={Med-EASi: Finely Annotated Dataset and Models for Controllable Simplification of Medical Texts},
  author={Basu, Chandrayee and Vasu, Rosni and Yasunaga, Michihiro and Yang, Qian},
  journal={arXiv preprint arXiv:2302.09155},
  year={2023}
}

