GitHub - yangyangjuanjuan/DeepBindingDetection: Apply deep learning on predicting TF binding sites

Practice project: Applying deep learning to predict transcription factor (TF) binding sites

Deep Binding Detection (DBD) is a tool that builds a recurrent neural network with a Long Short-Term Memory (LSTM) architecture to discriminate transcription factor (TF) binding sites from non-binding sites. It is intended for deep learning practice only.

DBD is based on the "LSTM Networks for Sentiment Analysis" tutorial (http://deeplearning.net/tutorial/lstm.html) and applies that LSTM architecture to the TF binding site (TFBS) prediction problem.

Comparison

TFBS and noncoding sequences

Representative binding site sequences for each TF were obtained from the JASPAR and TFBSShape databases. To identify non-regulatory genomic regions, PeakSeq-processed peak files in UCSC BED format for more than 400 human ChIP-seq experiments were downloaded from the ENCODE project and combined into a single file. This file was used to mask the noncoding regions, yielding a noncoding, non-TFBS (NCNT) region. Non-binding sequences were then sampled from this NCNT region and used as negative cases when training and testing the model.
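The masking step can be pictured as interval subtraction. The following is only a minimal sketch, not the code used by DBD: the function names `merge_intervals` and `subtract` are hypothetical, coordinates are assumed to be half-open `(start, end)` pairs as in BED files, and the example coordinates are made up for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def subtract(region, masks):
    """Return the parts of `region` not covered by any interval in `masks`."""
    start, end = region
    free = []
    cursor = start
    for m_start, m_end in merge_intervals(masks):
        if m_end <= cursor:
            continue  # mask lies entirely before the current position
        if m_start >= end:
            break  # mask lies entirely after the region
        if m_start > cursor:
            free.append((cursor, m_start))
        cursor = max(cursor, m_end)
    if cursor < end:
        free.append((cursor, end))
    return free


# Hypothetical example: one noncoding region and peaks pooled from several experiments.
noncoding_region = (0, 100)
pooled_peaks = [(10, 20), (15, 30), (50, 60)]
ncnt = subtract(noncoding_region, pooled_peaks)
```

Negative sequences could then be drawn from the intervals in `ncnt`; how DBD actually samples them is not specified here.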

Traditional PWM scoring and threshold

For each TF, multiple binding sequences were compiled into a corresponding position weight matrix (PWM), which was then applied to non-binding sequences sampled from the NCNT region. The 0.9998 quantile of the resulting PWM score distribution was used as the PWM score threshold.
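As an illustration of this thresholding idea, here is a minimal sketch in plain Python. It is not the implementation in "evaluatePWM.py". The log-odds scoring, the pseudocount of 1, the uniform background of 0.25, the function names, and the toy binding sites are all assumptions made for the example.

```python
import math
import random

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per position) from equal-length sites.

    Assumption: uniform background and a fixed pseudocount.
    """
    pwm = []
    for pos in range(len(sites[0])):
        counts = dict((b, pseudocount) for b in BASES)
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(dict((b, math.log(counts[b] / total / background, 2))
                        for b in BASES))
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds for a sequence of the PWM's length."""
    return sum(col[base] for col, base in zip(pwm, seq))


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Toy binding sites, invented for illustration only.
sites = ["TTCCGGGAA", "TTCCAGGAA", "TTCTGGGAA", "TTCCCGGAA"]
pwm = build_pwm(sites)

# Random sequences stand in for sequences sampled from the NCNT region.
rng = random.Random(0)
negatives = ["".join(rng.choice(BASES) for _ in range(9)) for _ in range(10000)]
negative_scores = [score(pwm, s) for s in negatives]
threshold = quantile(negative_scores, 0.9998)
```

With this choice, only about 0.02% of the background sequences score above `threshold`, so a sequence is called a binding site by the PWM method only if its score exceeds that of nearly all non-binding sequences.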

LSTM prediction

DBD stores classification results as well as some statistics (training error, testing error, true positives, false positives, true negatives, false negatives, PPV, and accuracy) at each save point.
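The count-based statistics in this list follow from the predicted and true labels. The sketch below shows one way to compute them. It assumes labels are encoded as 1 (binding) and 0 (non-binding), and the function name `confusion_stats` is hypothetical rather than taken from DBD.

```python
def confusion_stats(y_true, y_pred):
    """Compute confusion-matrix counts, PPV, and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # PPV is undefined when nothing is predicted positive.
    ppv = tp / float(tp + fp) if (tp + fp) else float("nan")
    accuracy = (tp + tn) / float(len(pairs))
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "ppv": ppv, "accuracy": accuracy}


# Made-up labels for illustration.
stats = confusion_stats([1, 1, 0, 0, 0, 1], [1, 0, 0, 1, 0, 1])
```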

High PWM score negative cases

Testing sequences comprised representative binding sites for the TF (positive cases) and noncoding non-binding sequences (negative cases). Of the negative cases, 3/4 were randomly sampled and the remaining 1/4 were selected to have "chance occurrences of PWM matches".
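The 3/4 to 1/4 split of negative cases could be assembled as in the following sketch. The function name `compose_negatives`, the two input pools, and the fixed seed are assumptions for illustration; "loadBindingSite_highScoreNegativeSeqs.py" may build the set differently.

```python
import random


def compose_negatives(random_pool, high_score_pool, n_total, seed=0):
    """Draw n_total negatives: 1/4 high-PWM-score, the rest randomly sampled."""
    rng = random.Random(seed)
    n_high = n_total // 4
    n_random = n_total - n_high
    negatives = (rng.sample(random_pool, n_random)
                 + rng.sample(high_score_pool, n_high))
    rng.shuffle(negatives)
    return negatives


# Placeholder identifiers stand in for real sequences.
random_pool = ["rand_%d" % i for i in range(100)]
high_score_pool = ["high_%d" % i for i in range(100)]
negatives = compose_negatives(random_pool, high_score_pool, n_total=40)
```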

Results

For each TF, the LSTM prediction was compared with the PWM method. The generated figures include the trends of test error, PPV, and accuracy. The following figures show TF STAT1 as an example.

How to use it

DBD requires Python 2.7 (Python 3.x is not currently supported). NumPy, Theano, and matplotlib are required. The folder structure should be left unchanged:

"bindingsites" folder stores TF binding site sequences;
"bindingsitespkl" folder stores prepared binding site sequences;
"bindingsitespkl_highscorenegative" folder includes prepared high PWM score negative cases;
"BSPHighPWMScoreNonBinding" folder includes high PWM score noncoding non-binding sequences, and PWM information;
"plots" includes generated images;
"saves" and "saves_highPWMscore" are folders to save trained models and generated results;
"saves_PWMresult" includes results generated by PWM method;
"Xmers" includes noncoding non-binding sequences.

Running the script "loadBindingSite.py" loads both positive cases and negative sequences for the TFs included in the "bindingsites" folder. The generated .pkl files are stored in "bindingsitespkl".

Similarly, running the script "loadBindingSite_highScoreNegativeSeqs.py" loads positive cases, randomly sampled negative sequences, and high PWM score negative sequences.

Running the script "evaluatePWM.py" evaluates the performance of the PWM method. Modify it to evaluate your own testing sequences.

Running the script "lstm.py" trains the LSTM model and reports its performance on the testing set.

Running the script "loadsaves.py" generates the performance comparison report.

Troubleshooting

Any comments and suggestions are highly appreciated.
