Deep Binding Detection (DBD) is a tool that builds a recurrent neural network (RNN) with a Long Short-Term Memory (LSTM) architecture to discriminate transcription factor (TF) binding sites from non-binding sites. It is intended for deep learning practice only.
DBD is based on the "LSTM Networks for Sentiment Analysis" tutorial (http://deeplearning.net/tutorial/lstm.html) and applies that LSTM architecture to the problem of TF binding site (TFBS) prediction.
Representative binding site sequences for each TF were obtained from the JASPAR and TFBSShape databases. To identify non-regulatory genomic regions, PeakSeq-processed peak files in UCSC BED format for more than 400 human ChIP-seq experiments were downloaded from the ENCODE project and combined into a single file. This file was used to mask the noncoding regions, yielding a noncoding, non-TFBS (NCNT) region. Non-binding sequences were then sampled from this NCNT region and used as negative cases when training and testing the model.
For each TF, the binding sequences were compiled into a corresponding position weight matrix (PWM), which was then applied to non-binding sequences sampled from the NCNT region. The 0.9998 quantile of the resulting PWM score distribution was used as the PWM score threshold.
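The PWM thresholding step can be sketched as follows. This is a minimal illustration, not the code in "evaluatePWM.py": the log-odds scoring, the pseudocount, the uniform background, the best-window score per sequence, and the simple empirical quantile are all assumptions, and the function names are hypothetical. It also assumes equal-length sites, uppercase A/C/G/T only, and sequences at least as long as the PWM.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length binding sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = dict((base, pseudocount) for base in BASES)
        for site in sites:
            counts[site[pos]] += 1.0
        total = sum(counts.values())
        column = dict(
            (base, math.log(counts[base] / total / background, 2))
            for base in BASES
        )
        pwm.append(column)
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_score(pwm, seq):
    """Return the highest window score found anywhere in seq."""
    width = len(pwm)
    return max(
        score_window(pwm, seq[i:i + width])
        for i in range(len(seq) - width + 1)
    )


def quantile(values, q):
    """Simple empirical quantile: the value at index floor(q * n), capped at n - 1."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(math.floor(q * len(ordered))))
    return ordered[index]


# Toy data for illustration only (not real binding sites).
sites = ["ACGT", "ACGT", "ACGA"]
pwm = build_pwm(sites)
negatives = ["TTTTACGTTT", "GGGGGGGG", "CATCATCAT"]
scores = [best_score(pwm, seq) for seq in negatives]
threshold = quantile(scores, 0.9998)
```

With a large sample of NCNT sequences, a sequence would be called a PWM match when its score reaches `threshold`, so only about 0.02% of non-binding sequences pass by chance.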
DBD stores classification results as well as some statistics (training error, testing error, true positives, false positives, true negatives, false negatives, PPV, and accuracy) at each save point.
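For reference, PPV and accuracy follow the usual definitions: PPV = TP / (TP + FP) and accuracy = (TP + TN) / (TP + FP + TN + FN). The sketch below shows how they can be derived from labels and predictions. The function names and the 1/0 label encoding are assumptions for illustration and may differ from what DBD uses internally.

```python
def confusion_counts(labels, predictions):
    """Count TP, FP, TN, FN for binary labels (1 = binding, 0 = non-binding)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(labels, predictions):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ppv(tp, fp):
    """Positive predictive value; returns 0.0 when nothing was called positive."""
    called = tp + fp
    return tp / float(called) if called else 0.0


def accuracy(tp, fp, tn, fn):
    """Fraction of all cases that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / float(total) if total else 0.0


# Toy example for illustration only.
labels = [1, 1, 0, 0, 0]
predictions = [1, 0, 1, 0, 0]
tp, fp, tn, fn = confusion_counts(labels, predictions)
```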
Testing sequences comprised representative binding sites for the TF (positive cases) and noncoding non-binding sequences (negative cases), of which 3/4 were randomly sampled and the remaining 1/4 were selected to have "chance occurrences of PWM matches".
For each TF, the LSTM prediction was compared with the PWM method. The generated figures include the trends of test error, PPV, and accuracy. The following figures show TF STAT1 as an example.
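The 3/4 to 1/4 composition of the negative test cases can be sketched as below. This is only an illustration of the mixing ratio: the function name, the two input pools, and the fixed seed are hypothetical and not taken from the DBD scripts.

```python
import random


def build_negative_test_set(random_pool, high_score_pool, n_total, seed=0):
    """Mix negatives: about 3/4 random NCNT sequences, 1/4 high PWM score ones."""
    rng = random.Random(seed)
    n_high = n_total // 4
    n_random = n_total - n_high
    negatives = rng.sample(random_pool, n_random)
    negatives += rng.sample(high_score_pool, n_high)
    rng.shuffle(negatives)
    return negatives


# Placeholder identifiers stand in for real sequences.
random_pool = ["R%d" % i for i in range(20)]
high_score_pool = ["H%d" % i for i in range(10)]
negatives = build_negative_test_set(random_pool, high_score_pool, 8)
```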
DBD requires Python 2.7 (Python 3.x is not currently supported). NumPy, Theano, and matplotlib are required. The folder structure should be left unchanged:
- "bindingsites" folder stores TF binding site sequences;
- "bindingsitespkl" folder stores prepared binding site sequences;
- "bindingsitespkl_highscorenegative" folder includes prepared high PWM score negative cases;
- "BSPHighPWMScoreNonBinding" folder includes high PWM score noncoding non-binding sequences, and PWM information;
- "plots" includes generated images;
- "saves" and "saves_highPWMscore" are folders to save trained models and generated results;
- "saves_PWMresult" includes results generated by PWM method;
- "Xmers" includes noncoding non-binding sequences.
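Because the scripts rely on this layout, it can help to check it before running anything. The helper below is a hypothetical convenience and not part of DBD. It only tests that the folders listed above exist under a given root directory.

```python
import os

# Folder names taken from the list above.
EXPECTED_FOLDERS = [
    "bindingsites",
    "bindingsitespkl",
    "bindingsitespkl_highscorenegative",
    "BSPHighPWMScoreNonBinding",
    "plots",
    "saves",
    "saves_highPWMscore",
    "saves_PWMresult",
    "Xmers",
]


def missing_folders(root="."):
    """Return the expected folders that do not exist under root."""
    return [
        name for name in EXPECTED_FOLDERS
        if not os.path.isdir(os.path.join(root, name))
    ]
```

Calling `missing_folders()` from the repository root should return an empty list when the layout is intact.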
Running the script "loadBindingSite.py" loads both the positive cases and the negative sequences for the TFs included in the "bindingsites" folder. The generated .pkl files are stored in "bindingsitespkl".
Similarly, running "loadBindingSite_highScoreNegativeSeqs.py" loads the positive cases, the randomly sampled negative sequences, and the high PWM score negative sequences.
Running "evaluatePWM.py" evaluates the performance of the PWM method. Modify it to evaluate your own testing sequences.
Running "lstm.py" trains the LSTM model and reports its performance on the testing set.
Running "loadsaves.py" generates the performance comparison report.
Any comments and suggestions are highly appreciated.