PeakSegPipeline: an R package for genome-wide supervised ChIP-seq peak prediction, for a single experiment type (e.g. broad H3K36me3 or sharp H3K4me3 data), jointly using multiple samples and cell types.
- Labeling. The first step of any supervised analysis is to label a few genomic regions with and without peaks in your data. The PeakSegPipeline::create_track_hub function creates a track hub from bigWig files, which makes it easy to visualize data on the UCSC genome browser and then create labels (specific genomic regions where you have observed presence or absence of peaks in specific samples). For more details about labeling see Input Files -> Label Data below.
- Single-sample peak calling using PeakSegFPOP. We use the efficient PeakSegDisk implementation of Generalized Functional Pruning Optimal Partitioning (arXiv:1810.00117) to predict preliminary peaks for each sample independently.
- Multiple-sample peak calling using PeakSegJoint. After single-sample analysis, peaks on different samples are clustered into genomic regions with overlapping peaks. In each cluster we run PeakSegJoint to approximately compute the most likely common peak positions across multiple samples – this includes a prediction of which samples are up and down for each genomic region.
There are two major differences between PeakSegPipeline and all of the other peak detection algorithms for ChIP-seq data analysis:
- Supervised machine learning: PeakSegFPOP and PeakSegJoint are trained by providing labels that indicate regions with and without peaks. So if you see false positives (a peak called where there is clearly only background noise) or false negatives (no peak called where there should be one) you can add labels to correct the models, and they learn and get more accurate as more labels are added. In contrast, other peak detection methods are unsupervised, meaning that they usually have 10-20 parameters, and no obvious way to train them, yielding arbitrarily inaccurate peaks that can NOT be corrected using labels.
- Joint peak detection in any number of samples or cell types so the model can be easily interpreted to find similarities or differences between samples (PeakSegPipeline outputs a binary matrix, peaks x samples). In contrast, it is not easy to find similarities and differences between samples using single-sample peak detection methods (e.g. MACS), and other multi-sample peak detection methods are limited to one (e.g. JAMM) or two (e.g. PePr) cell types (assuming all samples of the same cell type are replicates with the same peak pattern).
Install UCSC command line programs, as explained on our FAQ. These are necessary because the coverage data must be stored in bigWig files for efficient indexed access.
Then, install the PeakSegPipeline R package.
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/PeakSegPipeline")
Once everything has been installed, you can test your installation by executing the test script tests/testthat/test-pipeline-input.R – in the shell, run:
R -e 'source("https://raw.githubusercontent.com/tdhock/PeakSegPipeline/master/tests/testthat/test-pipeline-input.R")'
This is the TEST_SUITE=pipeline-input test that is run by Travis-CI
after every push to this repo – it should take about 20 minutes to run,
depending on the speed of your computer (for debugging on your
computer, it may be useful to look at the expected terminal
output). The test script will first download some bigWigs and labels
to the ~/PeakSegPipeline-test/input
directory, then run the
PeakSegPipeline on them. If everything worked, you can view the
results by opening ~/PeakSegPipeline-test/input/index.html
in a web
browser, and it should be the same as the results shown on
http://members.cbio.mines-paristech.fr/~thocking/hubs/test/input/
PeakSegPipeline uses PeakSegFPOP + PeakSegJoint to predict common and different peaks in multiple samples. It requires three kinds of input data:
- coverage data under
project_dir/samples
, - labels in
project_dir/labels
, - genomic segmentation problems in
project_dir/problems.bed
.
To give a concrete example, let us consider the data set that is used when you run tests/testthat/test-pipeline-demo.R – you can do this via the shell command below: (note that this is not systematically run as a test since its runtime is usually over one hour).
R -e 'source("https://raw.githubusercontent.com/tdhock/PeakSegPipeline/master/tests/testthat/test-pipeline-demo.R")'
Each coverage data file should contain counts of aligned sequence reads at every genomic position, for one sample. These files must be in bigWig format, since it is indexed for fast lookup of coverage in arbitrary genomic regions. For example this test downloads 8 files:
~/PeakSegPipeline-test/demo/samples/bcell/MS026601/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/bcell_/MS010302/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/Input/MS002201/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/Input/MS026601/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/Input_/MS002202/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/Input_/MS010302/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/kidney/MS002201/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/kidney_/MS002202/coverage.bigWig
In the example above we have the ~/PeakSegPipeline-test/demo
directory which will
contain all data sets, labels, and peak calls for this particular
project. The samples
directory contains a sub-directory for each
sample group (experimental conditions or cell types, e.g. bcell
or
kidney
). Each sample group directory should contain a sub-directory
for each sample (e.g. MS002201
or MS010302
). Each sample
sub-directory should contain a coverage.bigWig
file with counts of
aligned sequence reads (non-negative integers).
Note that in this demonstration project, the groups with underscores
are un-labeled samples (e.g. bcell_
), and the groups without
underscores are labeled samples (e.g. bcell
). In real projects
typically you would combine those two groups into a single labeled
group, but in this project we keep them separate in order to
demonstrate the prediction accuracy of the learning algorithm.
The project_dir/labels/*.txt
files contain genomic regions with or without
peaks. These labels will be used to train the peak prediction models
(automatically select model parameters that yield optimal peak
prediction accuracy). A quick and easy way to create labels is by
visual inspection as in the McGill ChIP-seq peak detection benchmark
(for details please read Hocking et al, Bioinformatics 2016).
To visually label your data first create a project directory on a
webserver. For example if your project directory is in your
~/public_html
directory, your directory structure should be
~/public_html/project_dir/samples/groupID/sampleID/coverage.bigWig
.
To create a track hub, put the following in
~/public_html/project_dir/hub.sh
:
#!/bin/bash
Rscript -e 'PeakSegPipeline::create_track_hub("/home/tdhock/PeakSegPipeline-test/input", "http://members.cbio.mines-paristech.fr/~thocking/hubs/test/input", "hg19", "[email protected]")'
The arguments of the create_track_hub
function are as follows:
- The first argument
path/to/project_dir
is an absolute path to your data/project directory. - The second argument
http://your.server.com/~user/project_dir
is the URL where that directory will be made available on the web.. - The third argument
hg19
is the UCSC genome ID for the genomes.txt file. - The fourth argument
[email protected]
is your email address, which will be written to the hub.txt file.
If that command worked, then you should see a message Created
http://your.server.com/~user/project_dir/hub.txt
and then you can
paste that URL into My Data -> Track Hubs -> My Hubs then click Add
Hub to tell the UCSC genome browser to display your data. Navigate
around the genome until you have found some peaks, then add positive
and negative labels in project_dir/labels/*.txt
files.
For example the test data set contains only one labels file,
~/PeakSegPipeline-test/demo/labels/some_labels.txt
which contains lines such as the following
chr10:33,061,897-33,162,814 noPeaks chr10:33,456,000-33,484,755 peakStart kidney chr10:33,597,317-33,635,209 peakEnd kidney chr10:33,662,034-33,974,942 noPeaks chr10:35,182,820-35,261,001 noPeaks chr10:35,261,418-35,314,654 peakStart bcell kidney
A chunk is a group of nearby labels. In the example above there are two chunks (far apart genomic regions, separated by an empty line). The first chunk has two regions with noPeaks labels in all samples, and two regions with positive labels in kidney samples and noPeaks labels in bcell samples. The second chunk has one region with noPeaks in bcell and kidney samples, and one region with a peakStart label in bcell and kidney samples.
In general, the labels file is divided into separate chunks by empty lines. Each chunk should contain lines for several nearby genomic regions, the corresponding label (noPeaks, peakStart, peakEnd, peaks), and the sample groups to which that label should be assigned (all other groups mentioned in the labels file will receive the noPeaks label). Ideally, each chunk should contain
- At least one label with a peak in all samples.
- At least one label with no peaks in any samples.
- At least one label with a peak in some samples but not others (these labels are crucial for the model to be able to learn what is a significant difference between up and down).
Visualizing labels. After having added some labels in
project_dir/labels/*.txt
files, run the R command
PeakSegPipeline::convert_labels("project_dir")
to create project_dir/all_labels.bed
. Then when you re-create the
track hub, it will include a new track “Manually labeled
regions with and without peaks” that displays the labels you have
created.
The last input file that you need to provide is a list of separate segmentation problems for your reference genome (regions without gaps, i.e. contigs). This file should be in BED format (e.g. hg19_problems.bed).
If you don’t use hg19, but you do use another standard genome that is hosted on UCSC, then you can use PeakSegPipeline::downloadProblems.
PeakSegPipeline::downloadProblems("hg38", "hg38_problems.bed")
If your reference genome does not exist on UCSC, you can use
PeakSegPipeline::gap2problems to make a problems.bed
file.
PeakSegPipeline::gap2problems("yourGenome_gap.bed", "yourGenome_chromInfo.txt", "yourGenome_problems.bed")
where the chromInfo file contains one line for every chromosome, and
the gap file contains one line for every gap in the reference (unknown
/ NNN sequence). If there are no gaps in your genome, then you can use
yourGenome_chromInfo.txt
as a problems.bed
file.
Since the human genome is so large, we recommend to do model training
and peak prediction in parallel using batchtools. We test to make sure
that PeakSegPipeline runs to completion using the SLURM cluster
system. The PeakSegPipeline SLURM template which adds support for
dependencies between jobs must be used, by specifying the following
batchtools configuration (I put it in ~/.batchtools.conf.R
):
cluster.functions = makeClusterFunctionsSlurm(system.file(
file.path("templates", "slurm-afterok.tmpl"),
package="PeakSegPipeline",
mustWork=TRUE))
Some operations in each step/job can be run in parallel on each
node. This usually results in speedups when there are few jobs per
compute node (so there is little risk of maxing out disk i/o bandwidth
on each node). If you want to do these operations in parallel in each
job, you can declare a future plan in your ~/.Rprofile
via e.g.
future::plan("multiprocess")
PeakSegPipeline does a lot of writing to temporary files on disk
during calls to PeakSegDisk::PeakSegFPOP_dir
in Step 1 of the
pipeline. These disk writes are parallelized across all labeled
samples and contigs. To decrease the chance of maxing out disk i/o
bandwidth, PeakSegPipeline tries to avoid writing at the same time on
the same disk, by doing these writes to tempdir()
, which
is usually on a local disk on the compute node where the job is
running (i.e. not available to the other compute nodes, not on a
network file system). R puts its temp dir in the first of TMPDIR
,
TMP
, TEMP
environment variables, so you can set one of these if
you don’t want to use the default, /tmp
.
To run the pipeline using batchtools (via SLURM or whatever other cluster functions you have registered), use
jobs <- PeakSegPipeline::jobs_create("~/PeakSegPipeline-test/demo")
PeakSegPipeline::jobs_submit_batchtools(jobs, resources=list(
walltime = 24*60*60,#seconds
memory = 2000,#megabytes per cpu
ncpus=2,
ntasks=1,
chunks.as.arrayjobs=TRUE))
You can edit the time/memory required for each job via the resources
argument (these resources are used for each job/step). Details of the
jobs/steps are explained on the wiki. To monitor the progress of
jobs/steps you can use:
reg.dirs <- Sys.glob("~/PeakSegPipeline-test/demo/registry/*")
reg.list <- lapply(reg.dirs, batchtools::loadRegistry)
lapply(reg.list, function(reg){
batchtools::getStatus(reg=reg)
})
The last step includes creation of the summary web page (links to some examples here)
~/PeakSegPipeline-test/demo/index.html
which has links to peak
prediction files, plots, and a track hub
~/PeakSegPipeline-test/demo/hub.txt
which can be used on the UCSC
genome browser. It shows
~/PeakSegPipeline-test/demo/samples/*/*/coverage.bigWig
and
~/PeakSegPipeline-test/demo/samples/*/*/joint_peaks.bigWig
files
together in multiWig containers (for each sample, a colored coverage
profile with superimposed peak calls as horizontal black line
segments). To use the track hub, make sure the
~/PeakSegPipeline-test/demo/
directory is publicly accessible on the
web.
The PeakSegPipeline::plot_all function creates
index.html
a web page which summarizes the results,peaks_matrix_sample.tsv.gz
a binary matrix (peaks x samples) in which 1 means peak and 0 means no peak.peaks_matrix_group.tsv.gz
a binary matrix (peaks x groups) in which 1 means peak and 0 means no peak.peaks_matrix_likelihood.tsv.gz
a numeric matrix (peaks x samples) of likelihood values, larger means a more likely peak.peaks_matrix_meanCoverage.tsv.gz
a numeric matrix (peaks x samples) of mean coverage in peaks.peaks_summary.tsv
is a table with a row for each genomic region that has a peak in at least one sample. The columns arechrom
,peakStart
,peakEnd
genomic region of peak.Input.up
if there is an Input group, then this is TRUE for rows where the Input group is predicted to have a peak. This can be useful for filtering/removing peaks which are non-specific.group.loss.diff
andsample.loss.diff
the likelihood of the peak (larger values mean taller and wider peaks in more samples).
PeakSegPipeline uses
- PeakSegDisk for predicting multiple peaks per contig. (for each sample independently)
- PeakError to compute the number of incorrect labels for each peak model.
- penaltyLearning for supervised penalty learning algorithms (interval regression) which are used to predict model complexity (log penalty = number of peaks).
- PeakSegJoint for defining joint peak boundaries across any number of samples and cell types. (independently for each genomic region with a peak predicted by PeakSegDisk)