GitHub - tdhock/PeakSegPipeline: Supervised machine learning pipeline for peak calling in ChIP-seq data

PeakSegPipeline: an R package for genome-wide supervised ChIP-seq peak prediction, for a single experiment type (e.g. broad H3K36me3 or sharp H3K4me3 data), jointly using multiple samples and cell types.

Labeling. The first step of any supervised analysis is to label a few genomic regions with and without peaks in your data. The PeakSegPipeline::create_track_hub function creates a track hub from bigWig files, which makes it easy to visualize data on the UCSC genome browser and then create labels (specific genomic regions where you have observed presence or absence of peaks in specific samples). For more details about labeling see Input Files -> Label Data below.
Single-sample peak calling using PeakSegFPOP. We use the efficient PeakSegDisk implementation of Generalized Functional Pruning Optimal Partitioning (arXiv:1810.00117) to predict preliminary peaks for each sample independently.
Multiple-sample peak calling using PeakSegJoint. After single-sample analysis, peaks on different samples are clustered into genomic regions with overlapping peaks. In each cluster we run PeakSegJoint to approximately compute the most likely common peak positions across multiple samples – this includes a prediction of which samples are up and down for each genomic region.

There are two major differences between PeakSegPipeline and all of the other peak detection algorithms for ChIP-seq data analysis:

Supervised machine learning: PeakSegFPOP and PeakSegJoint are trained by providing labels that indicate regions with and without peaks. So if you see false positives (a peak called where there is clearly only background noise) or false negatives (no peak called where there should be one) you can add labels to correct the models, and they learn and get more accurate as more labels are added. In contrast, other peak detection methods are unsupervised, meaning that they usually have 10-20 parameters, and no obvious way to train them, yielding arbitrarily inaccurate peaks that can NOT be corrected using labels.
Joint peak detection in any number of samples or cell types so the model can be easily interpreted to find similarities or differences between samples (PeakSegPipeline outputs a binary matrix, peaks x samples). In contrast, it is not easy to find similarities and differences between samples using single-sample peak detection methods (e.g. MACS), and other multi-sample peak detection methods are limited to one (e.g. JAMM) or two (e.g. PePr) cell types (assuming all samples of the same cell type are replicates with the same peak pattern).

Installation and testing

Install UCSC command line programs, as explained on our FAQ. These are necessary because the coverage data must be stored in bigWig files for efficient indexed access.

Then, install the PeakSegPipeline R package.

if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/PeakSegPipeline")

Once everything has been installed, you can test your installation by executing the test script tests/testthat/test-pipeline-input.R – in the shell, run:

R -e 'source("https://raw.githubusercontent.com/tdhock/PeakSegPipeline/master/tests/testthat/test-pipeline-input.R")'

This is the TEST_SUITE=pipeline-input test that is run by Travis-CI after every push to this repo – it should take about 20 minutes to run, depending on the speed of your computer (for debugging on your computer, it may be useful to look at the expected terminal output). The test script will first download some bigWigs and labels to the ~/PeakSegPipeline-test/input directory, then run the PeakSegPipeline on them. If everything worked, you can view the results by opening ~/PeakSegPipeline-test/input/index.html in a web browser, and it should be the same as the results shown on http://members.cbio.mines-paristech.fr/~thocking/hubs/test/input/

Pipeline Input Files

PeakSegPipeline uses PeakSegFPOP + PeakSegJoint to predict common and different peaks in multiple samples. It requires three kinds of input data:

coverage data under project_dir/samples,
labels in project_dir/labels,
genomic segmentation problems in project_dir/problems.bed.

To give a concrete example, let us consider the data set that is used when you run tests/testthat/test-pipeline-demo.R – you can do this via the shell command below: (note that this is not systematically run as a test since its runtime is usually over one hour).

R -e 'source("https://raw.githubusercontent.com/tdhock/PeakSegPipeline/master/tests/testthat/test-pipeline-demo.R")'

Coverage data

Each coverage data file should contain counts of aligned sequence reads at every genomic position, for one sample. These files must be in bigWig format, since it is indexed for fast lookup of coverage in arbitrary genomic regions. For example this test downloads 8 files:

~/PeakSegPipeline-test/demo/samples/bcell/MS026601/coverage.bigWig
~/PeakSegPipeline-test/demo/samples/bcell_/MS010302/coverage.bigWig
~/PeakSegPipeline-test/demo/samples/Input/MS002201/coverage.bigWig
~/PeakSegPipeline-test/demo/samples/Input/MS026601/coverage.bigWig
~/PeakSegPipeline-test/demo/samples/Input_/MS002202/coverage.bigWig
~/PeakSegPipeline-test/demo/samples/Input_/MS010302/coverage.bigWig
~/PeakSegPipeline-test/demo/samples/kidney/MS002201/coverage.bigWig
~/PeakSegPipeline-test/demo/samples/kidney_/MS002202/coverage.bigWig

In the example above we have the ~/PeakSegPipeline-test/demo directory which will contain all data sets, labels, and peak calls for this particular project. The samples directory contains a sub-directory for each sample group (experimental conditions or cell types, e.g. bcell or kidney). Each sample group directory should contain a sub-directory for each sample (e.g. MS002201 or MS010302). Each sample sub-directory should contain a coverage.bigWig file with counts of aligned sequence reads (non-negative integers).

Note that in this demonstration project, the groups with underscores are un-labeled samples (e.g. bcell_), and the groups without underscores are labeled samples (e.g. bcell). In real projects typically you would combine those two groups into a single labeled group, but in this project we keep them separate in order to demonstrate the prediction accuracy of the learning algorithm.

Label Data

The project_dir/labels/*.txt files contain genomic regions with or without peaks. These labels will be used to train the peak prediction models (automatically select model parameters that yield optimal peak prediction accuracy). A quick and easy way to create labels is by visual inspection as in the McGill ChIP-seq peak detection benchmark (for details please read Hocking et al, Bioinformatics 2016).

To visually label your data first create a project directory on a webserver. For example if your project directory is in your ~/public_html directory, your directory structure should be ~/public_html/project_dir/samples/groupID/sampleID/coverage.bigWig. To create a track hub, put the following in ~/public_html/project_dir/hub.sh:

#!/bin/bash
Rscript -e 'PeakSegPipeline::create_track_hub("/home/tdhock/PeakSegPipeline-test/input", "http://members.cbio.mines-paristech.fr/~thocking/hubs/test/input", "hg19", "[email protected]")'

The arguments of the create_track_hub function are as follows:

The first argument path/to/project_dir is an absolute path to your data/project directory.
The second argument http://your.server.com/~user/project_dir is the URL where that directory will be made available on the web..
The third argument hg19 is the UCSC genome ID for the genomes.txt file.
The fourth argument [email protected] is your email address, which will be written to the hub.txt file.

If that command worked, then you should see a message Created http://your.server.com/~user/project_dir/hub.txt and then you can paste that URL into My Data -> Track Hubs -> My Hubs then click Add Hub to tell the UCSC genome browser to display your data. Navigate around the genome until you have found some peaks, then add positive and negative labels in project_dir/labels/*.txt files.

For example the test data set contains only one labels file,

~/PeakSegPipeline-test/demo/labels/some_labels.txt

which contains lines such as the following

chr10:33,061,897-33,162,814 noPeaks
chr10:33,456,000-33,484,755 peakStart kidney
chr10:33,597,317-33,635,209 peakEnd kidney
chr10:33,662,034-33,974,942 noPeaks

chr10:35,182,820-35,261,001 noPeaks
chr10:35,261,418-35,314,654 peakStart bcell kidney

A chunk is a group of nearby labels. In the example above there are two chunks (far apart genomic regions, separated by an empty line). The first chunk has two regions with noPeaks labels in all samples, and two regions with positive labels in kidney samples and noPeaks labels in bcell samples. The second chunk has one region with noPeaks in bcell and kidney samples, and one region with a peakStart label in bcell and kidney samples.

In general, the labels file is divided into separate chunks by empty lines. Each chunk should contain lines for several nearby genomic regions, the corresponding label (noPeaks, peakStart, peakEnd, peaks), and the sample groups to which that label should be assigned (all other groups mentioned in the labels file will receive the noPeaks label). Ideally, each chunk should contain

At least one label with a peak in all samples.
At least one label with no peaks in any samples.
At least one label with a peak in some samples but not others (these labels are crucial for the model to be able to learn what is a significant difference between up and down).

Visualizing labels. After having added some labels in project_dir/labels/*.txt files, run the R command

PeakSegPipeline::convert_labels("project_dir")

to create project_dir/all_labels.bed. Then when you re-create the track hub, it will include a new track “Manually labeled regions with and without peaks” that displays the labels you have created.

Genomic segmentation problems

The last input file that you need to provide is a list of separate segmentation problems for your reference genome (regions without gaps, i.e. contigs). This file should be in BED format (e.g. hg19_problems.bed).

If you don’t use hg19, but you do use another standard genome that is hosted on UCSC, then you can use PeakSegPipeline::downloadProblems.

PeakSegPipeline::downloadProblems("hg38", "hg38_problems.bed")

If your reference genome does not exist on UCSC, you can use PeakSegPipeline::gap2problems to make a problems.bed file.

PeakSegPipeline::gap2problems("yourGenome_gap.bed", "yourGenome_chromInfo.txt", "yourGenome_problems.bed")

where the chromInfo file contains one line for every chromosome, and the gap file contains one line for every gap in the reference (unknown / NNN sequence). If there are no gaps in your genome, then you can use yourGenome_chromInfo.txt as a problems.bed file.

Running PeakSegPipeline on a cluster via batchtools

Since the human genome is so large, we recommend to do model training and peak prediction in parallel using batchtools. We test to make sure that PeakSegPipeline runs to completion using the SLURM cluster system. The PeakSegPipeline SLURM template which adds support for dependencies between jobs must be used, by specifying the following batchtools configuration (I put it in ~/.batchtools.conf.R):

cluster.functions = makeClusterFunctionsSlurm(system.file(
  file.path("templates", "slurm-afterok.tmpl"),
  package="PeakSegPipeline",
  mustWork=TRUE))

Some operations in each step/job can be run in parallel on each node. This usually results in speedups when there are few jobs per compute node (so there is little risk of maxing out disk i/o bandwidth on each node). If you want to do these operations in parallel in each job, you can declare a future plan in your ~/.Rprofile via e.g.

future::plan("multiprocess")

PeakSegPipeline does a lot of writing to temporary files on disk during calls to PeakSegDisk::PeakSegFPOP_dir in Step 1 of the pipeline. These disk writes are parallelized across all labeled samples and contigs. To decrease the chance of maxing out disk i/o bandwidth, PeakSegPipeline tries to avoid writing at the same time on the same disk, by doing these writes to tempdir(), which is usually on a local disk on the compute node where the job is running (i.e. not available to the other compute nodes, not on a network file system). R puts its temp dir in the first of TMPDIR, TMP, TEMP environment variables, so you can set one of these if you don’t want to use the default, /tmp.

To run the pipeline using batchtools (via SLURM or whatever other cluster functions you have registered), use

jobs <- PeakSegPipeline::jobs_create("~/PeakSegPipeline-test/demo")
PeakSegPipeline::jobs_submit_batchtools(jobs, resources=list(
  walltime = 24*60*60,#seconds
  memory = 2000,#megabytes per cpu
  ncpus=2,
  ntasks=1,
  chunks.as.arrayjobs=TRUE))

You can edit the time/memory required for each job via the resources argument (these resources are used for each job/step). Details of the jobs/steps are explained on the wiki. To monitor the progress of jobs/steps you can use:

reg.dirs <- Sys.glob("~/PeakSegPipeline-test/demo/registry/*")
reg.list <- lapply(reg.dirs, batchtools::loadRegistry)
lapply(reg.list, function(reg){
  batchtools::getStatus(reg=reg)
})

The last step includes creation of the summary web page (links to some examples here) ~/PeakSegPipeline-test/demo/index.html which has links to peak prediction files, plots, and a track hub ~/PeakSegPipeline-test/demo/hub.txt which can be used on the UCSC genome browser. It shows ~/PeakSegPipeline-test/demo/samples/*/*/coverage.bigWig and ~/PeakSegPipeline-test/demo/samples/*/*/joint_peaks.bigWig files together in multiWig containers (for each sample, a colored coverage profile with superimposed peak calls as horizontal black line segments). To use the track hub, make sure the ~/PeakSegPipeline-test/demo/ directory is publicly accessible on the web.

Output Files

The PeakSegPipeline::plot_all function creates

index.html a web page which summarizes the results,
peaks_matrix_sample.tsv.gz a binary matrix (peaks x samples) in which 1 means peak and 0 means no peak.
peaks_matrix_group.tsv.gz a binary matrix (peaks x groups) in which 1 means peak and 0 means no peak.
peaks_matrix_likelihood.tsv.gz a numeric matrix (peaks x samples) of likelihood values, larger means a more likely peak.
peaks_matrix_meanCoverage.tsv.gz a numeric matrix (peaks x samples) of mean coverage in peaks.
peaks_summary.tsv is a table with a row for each genomic region that has a peak in at least one sample. The columns are
- chrom, peakStart, peakEnd genomic region of peak.
- Input.up if there is an Input group, then this is TRUE for rows where the Input group is predicted to have a peak. This can be useful for filtering/removing peaks which are non-specific.
- group.loss.diff and sample.loss.diff the likelihood of the peak (larger values mean taller and wider peaks in more samples).

Related work

PeakSegPipeline uses

PeakSegDisk for predicting multiple peaks per contig. (for each sample independently)
PeakError to compute the number of incorrect labels for each peak model.
penaltyLearning for supervised penalty learning algorithms (interval regression) which are used to predict model complexity (log penalty = number of peaks).
PeakSegJoint for defining joint peak boundaries across any number of samples and cell types. (independently for each genomic region with a peak predicted by PeakSegDisk)

Name		Name	Last commit message	Last commit date
Latest commit History 421 Commits
R		R
data		data
inst		inst
man		man
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
.travis.yml		.travis.yml
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
NEWS		NEWS
README.org		README.org
appveyor.yml		appveyor.yml
batchtools.conf.R		batchtools.conf.R
build.sh		build.sh
slurm.conf		slurm.conf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Installation and testing

Pipeline Input Files

Coverage data

Label Data

Genomic segmentation problems

Running PeakSegPipeline on a cluster via batchtools

Output Files

Related work

About

Releases

Packages

Contributors 3

Languages

tdhock/PeakSegPipeline

Folders and files

Latest commit

History

Repository files navigation

Installation and testing

Pipeline Input Files

Coverage data

Label Data

Genomic segmentation problems

Running PeakSegPipeline on a cluster via batchtools

Output Files

Related work

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages