Artifact "Grammar-constrained decoding for structured information extraction with fine-tuned generative models applied to clinical trial abstracts"

ag-sc/Clinical-Trial-IE-GCD

DOI: http://doi.org/10.3389/frai.2024.1406857

Zenodo: https://doi.org/10.5281/zenodo.10419785

Setup

We suggest using Python 3.10.12 or newer to run this artifact; older versions have not been tested. The following instructions should work for most major Linux distributions.

Start by setting up a virtual environment and installing the required packages. For this, run the following at the top level of a cloned version of this repository:

python -m venv venv
source venv/bin/activate
pip install <your torch_scatter version here, two are in the repo> # if automatic installation via pip fails
pip install -r requirements.txt

All scripts in this artifact automatically activate the virtual environment, assuming it is named and located as shown above, and set the PYTHONPATH accordingly. If you want to execute a single Python file manually, you have to activate the virtual environment yourself and set PYTHONPATH to the src/ subdirectory:

source venv/bin/activate
export PYTHONPATH=./src/
python some/python/file.py
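
If a manually started file fails with import errors, a small check like the following may help to narrow down the cause. It is only a sketch and not part of the artifact: the function name check_environment is hypothetical, and it merely tests the three conditions described above (Python version, active virtual environment, src/ on the module search path).

```python
import sys
from pathlib import Path

# Version suggested in the Setup section; older versions are untested.
MIN_VERSION = (3, 10, 12)


def check_environment(repo_root: Path) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []

    if sys.version_info[:3] < MIN_VERSION:
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(f"Python {found} is older than the suggested 3.10.12")

    # Inside a venv, sys.prefix points to the venv and differs from base_prefix.
    if sys.prefix == sys.base_prefix:
        problems.append("no virtual environment appears to be active")

    # PYTHONPATH entries end up in sys.path, so look for src/ there.
    src_dir = (repo_root / "src").resolve()
    if not any(Path(p or ".").resolve() == src_dir for p in sys.path):
        problems.append(f"{src_dir} is not on PYTHONPATH")

    return problems


if __name__ == "__main__":
    for problem in check_environment(Path.cwd()):
        print("WARNING:", problem)
```

Run it from the top level of the cloned repository so that Path.cwd() / "src" points to the source directory.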

Artifact Structure

  • data/ - annotated datasets for type 2 diabetes and glaucoma RCT abstracts used in the paper
  • scripts-generative/ - scripts to execute training and evaluation of the generative approach
    • ptr/ - scripts for training the pointer generator models
      • ptr_runs.txt - list of commands necessary to start full hyperparameter search training run for pointer models
      • ptr_eval_dm2_led.sh - train LED pointer model variant on type 2 diabetes dataset
      • ptr_eval_dm2_t5.sh - train T5 pointer model variant on type 2 diabetes dataset
      • ptr_eval_gl_led.sh - train LED pointer model variant on glaucoma dataset
      • ptr_eval_gl_t5.sh - train T5 pointer model variant on glaucoma dataset
    • allRuns.txt - list of commands necessary to start full hyperparameter search training run
    • generative-dm2.sh - (deprecated) given a model name, executes a generative hyperparameter optimization training run with 30 trials for the type 2 diabetes dataset
    • generative-gl.sh - (deprecated) given a model name, executes a generative hyperparameter optimization training run with 30 trials for the glaucoma dataset
    • generative-best.sh - given path to a best_params.pkl file generated by src/eval_summary.py, executes 10 training runs with the best parameters found during hyperparameter optimization
    • eval-gen.sh - executes evaluation for generative part of directory of trained models, CHANGE PATH IN FILE TO ACTUAL LOCATION OF RESULTS!
    • eval-gen-nogcd.sh - executes evaluation for generative part of directory of trained models without constrained decoding, CHANGE PATH IN FILE TO ACTUAL LOCATION OF RESULTS!
  • src/ - source code of both approaches used in the paper
    • generative_approach - source code of the generative approach (training file is training.py)
    • template_lib - source code of general classes and functions to load and use the dataset
    • full_eval.py - runs evaluation for whole given training results directory
    • eval_summary.py - generates summary of evaluated training results of hyperparameter search
    • eval_summary_best.py - generates summary of evaluated training results of the 10 training runs executed separately with the best hyperparameters
    • main.py - can be used to play around with loaded datasets, contains code to list and count slot fillers of "Journal"
  • requirements.txt - Python requirements of this project
  • sort_results.sh - expecting training to have been executed in top directory of project, sorts models etc. into folders grouped by approach, disease and model, CHANGE PATH IN FILE TO ACTUAL LOCATION OF RESULTS!
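
The internal structure of best_params.pkl is not documented here. To get a first impression of what src/eval_summary.py wrote before passing the file to generative-best.sh, a generic inspection along these lines may be useful. This is a sketch under assumptions: describe_pickle is a hypothetical helper that is not part of the artifact, and it assumes nothing about the contents beyond the file being a regular pickle.

```python
import pickle
from pathlib import Path
from typing import Any


def describe_pickle(path: Path) -> str:
    """Load a pickle file and return a short description of its top-level object."""
    with path.open("rb") as fh:
        # Only unpickle files you generated yourself; pickle can execute code.
        obj: Any = pickle.load(fh)

    if isinstance(obj, dict):
        keys = ", ".join(sorted(map(str, obj)))
        return f"dict with keys: {keys}"
    return f"object of type {type(obj).__name__}"
```

If the pickle contains instances of classes from src/, loading it requires the PYTHONPATH to be set as described in the Setup section.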

Replication Steps

  1. Go to the top directory of this project.

  2. Execute all hyperparameter optimization trainings, i.e.:

bash scripts-generative/ptr/ptr_eval_gl_led.sh &> gl_led_$(date '+%Y_%m_%d').txt
bash scripts-generative/ptr/ptr_eval_gl_t5.sh &> gl_t5_$(date '+%Y_%m_%d').txt
bash scripts-generative/ptr/ptr_eval_dm2_led.sh &> dm2_led_$(date '+%Y_%m_%d').txt
bash scripts-generative/ptr/ptr_eval_dm2_t5.sh &> dm2_t5_$(date '+%Y_%m_%d').txt

bash scripts-generative/ptr/ptr_eval_gl_led.sh ptr &> ptr_gl_led_$(date '+%Y_%m_%d').txt
bash scripts-generative/ptr/ptr_eval_gl_t5.sh ptr &> ptr_gl_t5_$(date '+%Y_%m_%d').txt
bash scripts-generative/ptr/ptr_eval_dm2_led.sh ptr &> ptr_dm2_led_$(date '+%Y_%m_%d').txt
bash scripts-generative/ptr/ptr_eval_dm2_t5.sh ptr &> ptr_dm2_t5_$(date '+%Y_%m_%d').txt

  3. Sort results into folders: sort_results.sh (change paths in file!)

  4. Run evaluation for generative models (change paths in file!):

scripts-generative/eval-gen.sh
scripts-generative/eval-gen-nogcd.sh

  5. Generate the evaluation summary for the hyperparameter optimization (first activate the virtual environment and set PYTHONPATH as shown above!). Some tables are first only saved to pickle files, so they are printed and generated only when you run the command a second time.

python src/eval_summary.py --results /path/to/results/folder/

  6. Run training again with the best parameters found:

scripts-generative/generative-best.sh /path/to/best_params.pkl

  7. Sort the new results into folders: sort_results.sh. Make sure the files are sorted into a different directory so that you can tell the hyperparameter optimization apart from the training with the best parameters. Do not forget to also copy the config_*.json files from the original results to the directories of the new results (e.g. using cp --parents), as they are necessary for running the evaluation.

  8. Run evaluation for the new generative models (change paths in file accordingly!):

scripts-generative/eval-gen.sh
scripts-generative/eval-gen-nogcd.sh

  9. Generate the evaluation summary with means and standard deviations for the training with the best parameters (first activate the virtual environment and set PYTHONPATH as shown above!). Some tables are first only saved to pickle files, so they are printed and generated only when you run the command a second time.

python src/eval_summary_best.py --results /path/to/results/folder/
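
The copying of the config_*.json files into the directories of the new results can also be done from Python. The following is a minimal sketch, not part of the artifact: copy_configs is a hypothetical helper, and it assumes that the new results directory mirrors the folder layout of the original one, which is the same assumption cp --parents makes.

```python
import shutil
from pathlib import Path


def copy_configs(old_results: Path, new_results: Path) -> list[Path]:
    """Copy every config_*.json below old_results to the same relative
    location below new_results and return the destination paths."""
    copied = []
    # sorted() materializes the matches before any file is written.
    for src in sorted(old_results.rglob("config_*.json")):
        dest = new_results / src.relative_to(old_results)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied
```

Calling copy_configs(Path("/path/to/old/results"), Path("/path/to/new/results")) before the evaluation of the new models should leave each new model directory with the configuration file the evaluation expects.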

Citation

Please consider citing our work if you find the provided resources useful:

@ARTICLE{schmidt-cimiano-2024-ct-ie-gcd,
AUTHOR={Schmidt, David M. and Cimiano, Philipp},
TITLE={Grammar-constrained decoding for structured information extraction with fine-tuned generative models applied to clinical trial abstracts},
JOURNAL={Frontiers in Artificial Intelligence},
VOLUME={7},
YEAR={2025},
URL={https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2024.1406857},
DOI={10.3389/frai.2024.1406857},
ISSN={2624-8212},
}
