
yangyangjuanjuan/DeepBindingDetection


test


Practice project: applying deep learning to predict transcription factor (TF) binding sites

Deep Binding Detection (DBD) is a tool that builds a Recurrent Neural Network (with a Long Short-Term Memory (LSTM) architecture) to discriminate transcription factor (TF) binding sites from non-binding sites. It is intended for deep learning practice only.

DBD is based on "LSTM Networks for Sentiment Analysis" (as introduced by http://deeplearning.net/tutorial/lstm.html) and applies that LSTM architecture to the TF binding site (TFBS) prediction problem.

Comparison

TFBS and noncoding sequences

Representative binding site sequences for each TF were obtained from the JASPAR and TFBSShape databases. To identify non-regulatory genomic regions, PeakSeq-processed peak files in UCSC BED format from more than 400 human ChIP-seq experiments were downloaded from the ENCODE project and combined into a single file. This file was used to mask the noncoding regions, yielding a noncoding, non-TFBS (NCNT) region. Non-binding sequences were then sampled from the NCNT region and used as negative cases when training and testing the model.
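The masking and sampling step can be sketched as follows. This is a minimal illustration, not DBD's actual code: the function names, the half-open `(start, end)` interval convention, and the uniform sampling over valid start positions are all assumptions made for the example.

```python
import random


def complement_intervals(masked, region_start, region_end):
    """Return the parts of [region_start, region_end) not covered by masked intervals.

    `masked` is a list of half-open (start, end) peak intervals (hypothetical input).
    """
    gaps = []
    cursor = region_start
    for start, end in sorted(masked):
        if start > cursor:
            gaps.append((cursor, min(start, region_end)))
        cursor = max(cursor, end)
        if cursor >= region_end:
            break
    if cursor < region_end:
        gaps.append((cursor, region_end))
    return gaps


def sample_windows(gaps, length, n, seed=0):
    """Sample n windows of a fixed length, uniformly over all valid start positions."""
    rng = random.Random(seed)
    usable = [(s, e) for s, e in gaps if e - s >= length]
    if not usable:
        raise ValueError("no unmasked gap is long enough")
    # Each gap contributes one weight unit per valid window start.
    weights = [e - s - length + 1 for s, e in usable]
    total = sum(weights)
    windows = []
    for _ in range(n):
        pick = rng.randrange(total)
        for (s, _e), w in zip(usable, weights):
            if pick < w:
                windows.append((s + pick, s + pick + length))
                break
            pick -= w
    return windows


# Toy example: one noncoding region of 100 bp with three (overlapping) peaks.
peaks = [(10, 20), (15, 30), (50, 60)]
ncnt = complement_intervals(peaks, 0, 100)
negatives = sample_windows(ncnt, length=8, n=5)
```

The sampled windows are coordinates only; extracting the actual sequences from a genome file is left out of the sketch.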

Traditional PWM scoring and threshold

For each TF, multiple binding sequences were compiled into a corresponding position weight matrix (PWM), which was then applied to non-binding sequences sampled from the NCNT region. The 0.9998 quantile of the resulting PWM score distribution was used as the PWM score threshold.
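A rough sketch of this thresholding procedure is shown below. The log-odds scoring against a uniform background, the pseudocount, the nearest-rank quantile, and the assumption that every sequence has the same length as the motif are illustrative choices; the source does not state which PWM variant DBD uses, and all function names here are hypothetical.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length binding site sequences."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log(counts[b] / total / background, 2) for b in BASES})
    return pwm


def pwm_score(pwm, seq):
    """Sum the per-position log-odds for a sequence of motif length."""
    return sum(column[base] for column, base in zip(pwm, seq))


def quantile(values, q):
    """Nearest-rank quantile (one of several common definitions)."""
    ordered = sorted(values)
    rank = int(math.ceil(q * len(ordered)))
    return ordered[max(rank, 1) - 1]


# Toy data: a few binding sites and a small pool of non-binding sequences.
sites = ["ACGT", "ACGT", "ACGA", "TCGT"]
non_binding = ["TTTT", "GGCA", "ACGT", "CATG", "AAAA"]

pwm = build_pwm(sites)
null_scores = [pwm_score(pwm, s) for s in non_binding]
threshold = quantile(null_scores, 0.9998)

# A sequence is called a PWM match if it scores at or above the threshold.
is_match = pwm_score(pwm, "ACGT") >= threshold
```

With a realistic negative pool of many thousands of sequences, the 0.9998 quantile leaves roughly 2 in 10,000 non-binding sequences above the threshold.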

LSTM prediction

DBD stores classification results as well as some statistics (training error, testing error, true positives, false positives, true negatives, false negatives, PPV, and accuracy) at each save point.
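For reference, the listed statistics follow from the four confusion counts. The sketch below assumes binary labels with 1 for binding and 0 for non-binding; the function names and the returned dictionary layout are hypothetical and do not describe DBD's save format.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Derive error rate, accuracy, and PPV from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    # PPV is undefined when nothing is predicted positive.
    ppv = float(tp) / (tp + fp) if (tp + fp) else float("nan")
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "error": float(fp + fn) / total,
        "accuracy": float(tp + tn) / total,
        "ppv": ppv,
    }


stats = summarize([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

PPV (positive predictive value) is the fraction of predicted binding sites that are true binding sites, which matters here because negatives vastly outnumber positives in a genome.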

High PWM score negative cases

Testing sequences comprised representative binding sites for the TF (positive cases) and noncoding non-binding sequences (negative cases). Of the negative cases, 3/4 were randomly sampled and the remaining 1/4 were selected to have "chance occurrences of PWM matches", i.e. high PWM scores despite not being binding sites.
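One way to assemble such a negative set is sketched below. It assumes two pre-built pools, one of randomly sampled non-binding sequences and one of non-binding sequences that score above the PWM threshold; the pool names and the function are hypothetical, and how DBD actually selects the high-scoring quarter is not specified in the source.

```python
import random


def build_negative_set(random_pool, high_score_pool, n_total, seed=0):
    """Draw 3/4 of the negatives at random and 1/4 from high-PWM-score sequences."""
    rng = random.Random(seed)
    n_high = n_total // 4
    n_random = n_total - n_high
    negatives = rng.sample(random_pool, n_random) + rng.sample(high_score_pool, n_high)
    rng.shuffle(negatives)  # avoid grouping the hard cases at the end
    return negatives


# Toy pools with placeholder identifiers instead of real sequences.
random_pool = ["R%d" % i for i in range(40)]
high_score_pool = ["H%d" % i for i in range(20)]
negatives = build_negative_set(random_pool, high_score_pool, n_total=20)
```

Mixing in high-scoring negatives makes the test harder for the PWM method by construction, so it probes whether the LSTM has learned more than the PWM captures.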

Results

For each TF, the LSTM prediction was compared with the PWM method. The generated figures show the trends of test error, PPV, and accuracy. The following figures use TF STAT1 as an example:

testing errors

testing PPVs

testing ACCs

How to use it

DBD requires Python 2.7 (Python 3.x is not currently supported), along with NumPy, Theano, and matplotlib. The folder structure should be left unchanged:

  • "bindingsites" folder stores TF binding site sequences;
  • "bindingsitespkl" folder stores prepared binding site sequences;
  • "bindingsitespkl_highscorenegative" folder includes prepared high PWM score negative cases;
  • "BSPHighPWMScoreNonBinding" folder includes high PWM score noncoding non-binding sequences, and PWM information;
  • "plots" includes generated images;
  • "saves" and "saves_highPWMscore" are folders to save trained models and generated results;
  • "saves_PWMresult" includes results generated by PWM method;
  • "Xmers" includes noncoding non-binding sequences.

Running the script "loadBindingSite.py" loads both positive cases and negative sequences for the TFs included in the "bindingsites" folder. The generated .pkl files are then stored in "bindingsitespkl".
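The sketch below illustrates what preparing such a .pkl file could look like. It is not the format written by "loadBindingSite.py": the integer encoding (with 0 left free for padding, as in the sentiment tutorial's word indices), the `(x, y)` tuple layout, and all function names are assumptions made for illustration.

```python
import pickle

# Hypothetical base-to-index mapping; 0 is left unused so it can serve as padding.
BASE_TO_INDEX = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq):
    """Turn a DNA string into a list of integer indices."""
    return [BASE_TO_INDEX[base] for base in seq.upper()]


def make_dataset(positives, negatives):
    """Return (x, y): encoded sequences and 1/0 labels for binding/non-binding."""
    x = [encode(s) for s in positives] + [encode(s) for s in negatives]
    y = [1] * len(positives) + [0] * len(negatives)
    return x, y


def save_dataset(dataset, path):
    """Pickle the dataset; protocol 2 is readable from Python 2.7."""
    with open(path, "wb") as handle:
        pickle.dump(dataset, handle, protocol=2)


def load_dataset(path):
    """Load a dataset written by save_dataset."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


dataset = make_dataset(["ACGT", "acga"], ["TTTT"])
```

Encoding each base as an index mirrors how the tutorial LSTM consumes word indices, with nucleotides taking the place of words.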

Similarly, running the script "loadBindingSite_highScoreNegativeSeqs.py" loads positive cases, randomly sampled negative sequences, and high PWM score negative sequences.

Running the script "evaluatePWM.py" evaluates the performance of the PWM method. Modify it to evaluate your own testing sequences.

Running the script "lstm.py" trains the LSTM model and reports its performance on the testing set.

Running the script "loadsaves.py" generates the performance comparison report.

Troubleshooting

Any comments and suggestions are highly appreciated.
