PEDL is a method for predicting protein-protein associations from text. The paper describing it will be presented at ISMB 2020.
python >= 3.6
pip install -r requirements.txt
pytorch >= 1.3.1
(has to be installed manually, due to different CUDA versions)
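The two version requirements above can be checked before running anything else. The following is only a sketch: the helper names `parse_version` and `check_environment` are hypothetical and not part of PEDL. Versions are compared as integer tuples because a plain string comparison would rank "1.10" below "1.3.1".

```python
import re
import sys

MIN_PYTHON = (3, 6)    # from the requirements above
MIN_TORCH = (1, 3, 1)  # from the requirements above


def parse_version(text):
    """Turn a version string such as '1.3.1+cu92' into (1, 3, 1)."""
    parts = []
    # Drop a local build suffix such as '+cu92' before splitting.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment():
    """Return a list of human-readable problems (empty if all is fine)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("python >= 3.6 is required")
    try:
        import torch  # installed manually, so it may be missing
    except ImportError:
        problems.append("pytorch is not installed")
    else:
        if parse_version(str(torch.__version__)) < MIN_TORCH:
            problems.append("pytorch >= 1.3.1 is required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```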
We use two types of data sets: data generated from the BioNLP-ST event extraction data sets and the distantly supervised PID data set.
generates the BioNLP data sets for both PEDL and comb-dist
All experiments in the paper have been performed with the masked version of the data, e.g. distant_supervision/data/BioNLP-ST_2011/train_masked.json
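To confirm that a generated file is readable before training, it can be loaded with the standard `json` module. This sketch makes no assumption about the file's schema, which is not documented here. It only reports the top-level container type and its size. The function name `summarize_json` is hypothetical.

```python
import json
import os


def summarize_json(path):
    """Return (top-level type name, number of entries or None)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size


if __name__ == "__main__":
    # Path taken from the text above; it only exists after data generation.
    path = "distant_supervision/data/BioNLP-ST_2011/train_masked.json"
    if os.path.exists(path):
        print(summarize_json(path))
```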
Generating the PID data is a bit more involved:
- First, we have to download the raw PubMed Central texts:
  CAUTION: This produces over 200 GB of files and spawns multiple processes.
- Then, we have to download the PubTator Central file and place it into the root directory. This file consumes another 80 GB when decompressed.
- Generate the raw PID data:
- Generate the final PID data:
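Since the steps above need more than 280 GB in total (over 200 GB for the PubMed Central texts plus another 80 GB for the decompressed PubTator Central file), it may be worth checking free disk space before starting. The sketch below assumes that everything is stored on the file system of the current directory and that 1 GB means 10^9 bytes. The function name `missing_space_gb` is hypothetical, and the figures are lower bounds taken from the text, not measured values.

```python
import shutil

GB = 10 ** 9
PMC_GB = 200       # "over 200 GB" of raw PubMed Central texts
PUBTATOR_GB = 80   # PubTator Central file when decompressed


def missing_space_gb(free_bytes, required_gb=PMC_GB + PUBTATOR_GB):
    """Return the shortfall in whole GB, rounded up (0 if there is enough)."""
    return max(0, required_gb - free_bytes // GB)


if __name__ == "__main__":
    free = shutil.disk_usage(".").free
    shortfall = missing_space_gb(free)
    if shortfall:
        print("At least %d GB more free space is needed." % shortfall)
    else:
        print("Free space meets the stated lower bound.")
```

Intermediate and final PID files need additional space that is not quantified here, so passing this check is necessary rather than sufficient.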
Before training, SciBERT has to be downloaded and placed in some directory (called $bert_dir from now on).
The vocabulary of SciBERT has to be adapted to include the entity markers and protein masks: cp distant_supervision/vocab.txt $bert_dir
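The `cp` command overwrites any `vocab.txt` that is already in `$bert_dir`. As a more cautious alternative, the sketch below keeps a backup of an existing file and verifies the copy byte by byte. The function name `install_vocab` and the backup name `vocab.txt.orig` are hypothetical choices, not part of PEDL, and reading the directory from an environment variable named `bert_dir` is an assumption.

```python
import filecmp
import os
import shutil


def install_vocab(bert_dir, source="distant_supervision/vocab.txt"):
    """Copy the adapted vocabulary into bert_dir, backing up any old one."""
    target = os.path.join(bert_dir, "vocab.txt")
    backup = target + ".orig"
    # Keep the first original only; do not overwrite an earlier backup.
    if os.path.exists(target) and not os.path.exists(backup):
        shutil.copyfile(target, backup)
    shutil.copyfile(source, target)
    # shallow=False compares file contents rather than just metadata.
    return filecmp.cmp(source, target, shallow=False)


if __name__ == "__main__":
    bert_dir = os.environ.get("bert_dir")  # assumed variable name
    if bert_dir:
        print("copy verified:", install_vocab(bert_dir))
```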
PEDL can be trained with python -m distant_supervision.train_pedl (see
for the exact arguments).
If you just want to reproduce the experiments from the paper, this can be achieved with ./
As an alternative to training your own model, you can use this version of PEDL that was trained on PID and used for the experiments in the paper.
The trained PEDL model can be used to predict PPAs for a new data set. See
for details.
Note that this is highly experimental research code that is not suitable for production use. We do not provide a warranty of any kind. Use at your own risk.