Calculate Targeted Coverage

Overview

This pipeline extracts read depth from a BAM file and generates outputs that support the interpretation and downstream variant calling of a targeted sequencing experiment. Relevant datasets include targeted gene panels and whole exome sequencing (WXS) experiments. For in-depth downstream coverage QC, the pipeline can output per-base read depth at all targeted loci specified by a target BED file, as well as read depth at genome-wide "off-target", well-characterized polymorphic sites known to dbSNP. For a more general overview of targeted sequencing quality, the pipeline can output QC metrics produced by picard CollectHsMetrics. As a direct contribution to a DNA processing workflow, the pipeline can output a coordinate BED file containing target intervals merged with intervals encompassing off-target dbSNP sites enriched in coverage (as determined by a user-defined read-depth threshold). This new coordinate file can be used to indicate base quality recalibration and variant calling intervals to pipeline-recalibrate-BAM and to gatk HaplotypeCaller in pipeline-call-gSNP, either directly or through metapipeline-DNA.

This pipeline performs coverage calculations from a BAM file at intervals specified by a target BED file and reports some basic coverage metrics. The SAMtools depth tool is used to calculate per-base coverage in specified regions. This intermediate output is converted into BED format using an awk script. Then the BEDtools merge tool is used to collapse consecutive coordinates into intervals, with a final output reporting a comma-separated list of per-base read depths, one for each coordinate in an interval. Picard's CollectHsMetrics is used to report various interval-related metrics on the input BAM.

How To Run

Update the params section of the .config file
Update the input yaml
See the submission script, here, to submit your pipeline

Requirements

Currently supported Nextflow versions: v23.04.2

Flow Diagram

A directed acyclic graph of your pipeline.

Pipeline Steps

1. Depth Calculation

Per-base depth is calculated from the input BAM file at coordinates specified by the input target BED file using samtools depth. If off_target_depth is set to true, per-base read depth is also calculated genome-wide at dbSNP loci with a dbSNP reference VCF used as the coordinate file to samtools depth.
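As a rough illustration of this step, the sketch below assembles a `samtools depth` command line in Python. It is not the pipeline's actual process definition: the function name, file names, and the way extra arguments are appended are assumptions made for the example. Only the `-b` (coordinate file), `-q` (minimum base quality) and `-Q` (minimum mapping quality) options of `samtools depth` are used.

```python
import shlex


def samtools_depth_command(bam, coordinates, min_base_quality=20,
                           min_mapping_quality=20, extra_args=""):
    """Build an argument list for `samtools depth` (hypothetical helper).

    `coordinates` is the target BED file, or the dbSNP reference VCF when
    computing off-target depth. The defaults of 20 mirror the documented
    defaults of min_base_quality and min_mapping_quality.
    """
    cmd = [
        "samtools", "depth",
        "-b", coordinates,               # restrict output to these positions
        "-q", str(min_base_quality),     # minimum base quality
        "-Q", str(min_mapping_quality),  # minimum mapping quality
    ]
    cmd += shlex.split(extra_args)       # e.g. samtools_depth_extra_args
    cmd.append(bam)
    return cmd


if __name__ == "__main__":
    # File names are placeholders, not pipeline outputs.
    print(shlex.join(samtools_depth_command("sample.bam", "targets.bed")))
```

Returning a list rather than a single string keeps the sketch safe to pass to `subprocess.run` without shell quoting concerns.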

2. BED Formatting

TSV output from samtools depth is converted into BED format using awk with read depth reported in the fourth column. Per-base read depth across multiple-base-pair target intervals is collapsed into a comma-separated list of read depth values, one for each base pair encompassed by the interval (bedtools merge).
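The conversion and collapsing described above can be sketched in plain Python. This is an illustrative re-implementation, not the pipeline's awk and `bedtools merge` code; the function names are hypothetical. It assumes `samtools depth` rows of the form chromosome, 1-based position, depth, and input that is already sorted by chromosome and position.

```python
def depth_tsv_to_bed(lines):
    """Turn `chrom<TAB>pos<TAB>depth` rows into BED-style tuples.

    samtools depth positions are 1-based; BED uses 0-based, half-open
    intervals, so position p becomes the interval [p - 1, p).
    """
    records = []
    for line in lines:
        if not line.strip():
            continue
        chrom, pos, depth = line.rstrip("\n").split("\t")[:3]
        pos = int(pos)
        records.append((chrom, pos - 1, pos, depth))
    return records


def collapse_adjacent(records):
    """Merge overlapping or book-ended records, joining depths with commas.

    This mimics a merge with the 'collapse' operation on column 4.
    """
    merged = []
    for chrom, start, end, depth in records:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end), prev[3] + "," + depth)
        else:
            merged.append((chrom, start, end, depth))
    return merged


if __name__ == "__main__":
    rows = ["chr1\t101\t30", "chr1\t102\t32", "chr1\t103\t31", "chr1\t200\t5"]
    for record in collapse_adjacent(depth_tsv_to_bed(rows)):
        print("\t".join(str(field) for field in record))
```

Three consecutive positions become one interval whose fourth column lists one depth value per base pair, while the isolated position stays a single-base interval.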

3. dbSNP off-target site filtering

dbSNP coordinates are filtered to keep only off-target regions. This is done by excluding coordinates specified in the target BED file from the dbSNP read depth BED using bedtools intersect. Near-target regions (+/- 500bp by default) are also excluded by first adding near-target buffers to the specified target intervals using bedtools slop.
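To make the exclusion logic concrete, here is a minimal Python sketch of padding target intervals and dropping dbSNP sites that overlap them. It stands in for the `bedtools slop` and `bedtools intersect` calls and is written under stated assumptions: the function names are hypothetical, intervals are 0-based half-open BED coordinates, and padded intervals are clipped to the chromosome bounds given in a sizes mapping.

```python
def slop(intervals, pad, chrom_sizes):
    """Extend each (chrom, start, end) by `pad` bp on both sides,
    clipping at 0 and at the chromosome length."""
    return [
        (chrom, max(0, start - pad), min(chrom_sizes[chrom], end + pad))
        for chrom, start, end in intervals
    ]


def off_target_sites(sites, targets, chrom_sizes, off_target_slop=500):
    """Keep dbSNP depth records that overlap no padded target interval.

    `sites` holds (chrom, start, end, depth) tuples. The linear scan is
    for clarity only; a real implementation would use sorted input or
    an interval index.
    """
    padded = slop(targets, off_target_slop, chrom_sizes)
    kept = []
    for chrom, start, end, depth in sites:
        near_target = any(
            chrom == t_chrom and start < t_end and t_start < end
            for t_chrom, t_start, t_end in padded
        )
        if not near_target:
            kept.append((chrom, start, end, depth))
    return kept


if __name__ == "__main__":
    sizes = {"chr1": 10000}
    targets = [("chr1", 1000, 1100)]
    sites = [("chr1", 499, 500, "40"), ("chr1", 500, 501, "35"),
             ("chr1", 1600, 1601, "50")]
    print(off_target_sites(sites, targets, sizes))
```

With the default 500 bp buffer the target `chr1:1000-1100` is treated as `chr1:500-1600`, so only sites falling entirely outside that padded interval are retained.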

4. dbSNP enriched read depth filtering

dbSNP coordinates from step 2 are filtered to keep sites exceeding a minimum read depth threshold (30x by default) using awk.
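The depth filter amounts to a single comparison on the fourth BED column. The Python sketch below illustrates it with a hypothetical function name. Whether the pipeline's awk expression uses a strict or an inclusive comparison is not stated here, so the strict `>` (matching the word "exceeding") is an assumption.

```python
def enriched_sites(sites, min_read_depth=30):
    """Keep (chrom, start, end, depth) records whose depth exceeds the threshold.

    Assumption: the comparison is strict. Use `>=` instead if sites
    exactly at the threshold should also count as enriched.
    """
    return [site for site in sites if int(site[3]) > min_read_depth]


if __name__ == "__main__":
    sites = [("chr1", 499, 500, "29"), ("chr1", 1600, 1601, "30"),
             ("chr1", 5000, 5001, "31")]
    print(enriched_sites(sites))
```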

5. dbSNP coverage-enriched interval expansion

Filtered dbSNP coordinates from step 4 are expanded to include nearby base pairs, so that sites that are close together can subsequently be merged into one interval (bedtools slop).

6. On-target and enriched off-target interval merging

Coverage enriched dbSNP intervals are merged with the original target intervals into one BED file using a series of bash commands that concatenate and sort the two files, then merge with bedtools.
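Steps 5 and 6 together can be sketched as expand, concatenate, sort, merge. The Python below is an illustrative stand-in for the `bedtools slop` and bash/`bedtools merge` commands, with hypothetical function names. It assumes 0-based half-open intervals and sorts chromosomes lexicographically, which may differ from the ordering used by the actual pipeline.

```python
def expand(sites, pad, chrom_sizes):
    """Pad each (chrom, start, end) site by `pad` bp, clipped to the chromosome."""
    return [
        (chrom, max(0, start - pad), min(chrom_sizes[chrom], end + pad))
        for chrom, start, end in sites
    ]


def merge_intervals(intervals):
    """Sort intervals, then merge any that overlap or are book-ended."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def targets_with_enriched_off_target(targets, enriched, chrom_sizes, dbsnp_slop=150):
    """Combine target intervals with padded, coverage-enriched dbSNP sites."""
    return merge_intervals(list(targets) + expand(enriched, dbsnp_slop, chrom_sizes))


if __name__ == "__main__":
    sizes = {"chr1": 10000}
    targets = [("chr1", 1000, 1100)]
    enriched = [("chr1", 5000, 5001), ("chr1", 5200, 5201), ("chr1", 1200, 1201)]
    print(targets_with_enriched_off_target(targets, enriched, sizes))
```

In the example, the two sites 200 bp apart fuse into one interval once each is padded by 150 bp, and the padded site at position 1200 is absorbed into the neighboring target interval.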

7. Metrics Reporting

The target BED file and the optional bait file are converted to INTERVAL_LIST format using picard BedToIntervalList, then used to report metrics on the input BAM with picard CollectHsMetrics.
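The two Picard invocations in this step might be assembled roughly as follows. This is a sketch under assumptions, not the pipeline's process code: the helper name, the `picard.jar` path, the `java -jar` launcher, and the output file names are all hypothetical. The Picard arguments used (`--INPUT`, `--OUTPUT`, `--SEQUENCE_DICTIONARY`, `--TARGET_INTERVALS`, `--BAIT_INTERVALS`, `--NEAR_DISTANCE`, `--COVERAGE_CAP`) are standard options of BedToIntervalList and CollectHsMetrics.

```python
def picard_metrics_commands(bam, target_bed, reference_dict, bait_bed=None,
                            coverage_cap=None, near_distance=250,
                            picard_jar="picard.jar"):
    """Return argument lists for BedToIntervalList and CollectHsMetrics.

    When no bait BED is given, the target BED doubles as the bait file.
    """
    base = ["java", "-jar", picard_jar]
    bait_bed = bait_bed or target_bed
    commands = []
    interval_lists = {}
    # dict.fromkeys de-duplicates while preserving order, so a shared
    # target/bait BED is converted only once.
    for bed in dict.fromkeys([target_bed, bait_bed]):
        interval_lists[bed] = bed + ".interval_list"  # hypothetical output name
        commands.append(base + [
            "BedToIntervalList",
            "--INPUT", bed,
            "--OUTPUT", interval_lists[bed],
            "--SEQUENCE_DICTIONARY", reference_dict,
        ])
    hs_metrics = base + [
        "CollectHsMetrics",
        "--INPUT", bam,
        "--OUTPUT", "HsMetrics.txt",                  # hypothetical output name
        "--TARGET_INTERVALS", interval_lists[target_bed],
        "--BAIT_INTERVALS", interval_lists[bait_bed],
        "--NEAR_DISTANCE", str(near_distance),
    ]
    if coverage_cap is not None:
        hs_metrics += ["--COVERAGE_CAP", str(coverage_cap)]
    commands.append(hs_metrics)
    return commands


if __name__ == "__main__":
    for command in picard_metrics_commands("sample.bam", "targets.bed", "genome.dict"):
        print(" ".join(command))
```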

Inputs and Configuration

Input and Input Parameter/Flag	Required	Type	Description
`input.BAM`	yes	path	BAM file for which to calculate coverage, path provided in input yaml.
`target_BED`	yes	path	BED file specifying target intervals (defines regions for target and off-target coverage operations).
`save_intermediate_files`	yes	boolean	Whether to save intermediate files.
`reference_dict`	yes	path	Human genome reference dictionary file for use in BED to INTERVAL_LIST conversion. Required if collecting metrics.
`reference_dbSNP`	yes	path	dbSNP reference VCF file, with proper chromosome encoding and compression. See discussion. Required if performing off-target read depth calculation.
`genome_sizes`	yes	path	Reference file consisting of chromosomes and their lengths used by `bedtools slop`. Required for off-target read depth workflows. `.fai` files accepted.
`target_depth`	no	bool	Whether to calculate per-base read depth in targeted regions. Default false.
`off_target_depth`	no	bool	Whether to perform off-target read depth calculation at dbSNP loci. Default true.
`output_enriched_target_file`	no	bool	Whether to output a new target file containing coverage-enriched off-target dbSNP loci. Default true.
`min_read_depth`	no	integer	Minimum read depth threshold for an off-target locus to be considered enriched and be included in the new target file. Default 30.
`min_base_quality`	no	integer	Minimum base quality for a read to be counted in depth calculation by `samtools depth`. Applies to both off- and on-target calculations. Defaults to 20.
`min_mapping_quality`	no	integer	Minimum mapping quality for a read to be counted in depth calculation by `samtools depth`. Applies to both off- and on-target calculations. Defaults to 20.
`collect_metrics`	no	bool	Whether to run `CollectHsMetrics`. Default true.
`target_interval_list`	no	path	Interval list file specifying target intervals used to calculate coverage by `CollectHsMetrics`. If not provided, the target BED will be used to calculate the intervals.
`bait_BED`	no	path	BED file with bait locations that can be used to generate a bait interval list used by `CollectHsMetrics`. If not provided, the target BED will be used.
`bait_interval_list`	no	path	Interval list file specifying bait intervals used by `CollectHsMetrics`. If not provided, the bait BED will be used to calculate the intervals.
`save_interval_list`	yes	boolean	Whether to save a copy of any generated interval lists. Saves to the `output_dir`.
`save_all_dbSNP`	no	boolean	Whether to save a copy of the read depth BED file for all dbSNP loci generated by the off-target workflows. Default false.
`save_raw_target_bed`	no	boolean	Whether to save a copy of the per-base, target read depth BED with uncollapsed intervals. Default false.
`off_target_slop`	no	integer	Number of base pairs to add to either side of target file coordinates so that they may be excluded from off-target read depth calculation. Default is 500.
`dbSNP_slop`	no	integer	Number of base pairs to add to either side of off-target dbSNP loci to generate off-target intervals. The purpose is to merge adjacent dbSNP loci into single intervals prior to merging with target intervals. Default is 150.
`coverage_cap`	no	integer	`COVERAGE_CAP` parameter for `CollectHsMetrics`, determines the coverage threshold at which to stop calculating coverage.
`near_distance`	no	integer	`NEAR_DISTANCE` parameter for `CollectHsMetrics`, determines the maximum distance in bp of a read from the nearest probe (bait) for it to be counted as "near probe" in metrics calculations. Default 250.
`samtools_depth_extra_args`	no	string	Extra arguments for `samtools depth`.
`picard_CollectHsMetrics_extra_args`	no	string	Extra arguments for `picard CollectHsMetrics`.
`merge_operation`	no	string	Operation performed on read depth column values when intervals are collapsed during `bedtools merge`. Defaults to 'collapse'. See bedtools documentation for other options.
`work_dir`	no	path	Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively.
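Several rows in this table describe conditional requirements and fallbacks. The Python sketch below restates them as code to make the rules easier to check. It is only an illustration of the table, with hypothetical function names and a plain dictionary standing in for the pipeline's params. It assumes the fallback chain bait interval list, then bait BED, then target BED, and the documented defaults of `true` for `collect_metrics` and `off_target_depth`.

```python
def resolve_interval_sources(params):
    """Return (target_source, bait_source) following the documented fallbacks."""
    target = params.get("target_interval_list") or params["target_BED"]
    bait = (params.get("bait_interval_list")
            or params.get("bait_BED")
            or params["target_BED"])
    return target, bait


def missing_conditional_inputs(params):
    """List reference inputs that the enabled workflows need but lack."""
    missing = []
    if params.get("collect_metrics", True) and not params.get("reference_dict"):
        missing.append("reference_dict")
    if params.get("off_target_depth", True):
        for key in ("reference_dbSNP", "genome_sizes"):
            if not params.get(key):
                missing.append(key)
    return missing


if __name__ == "__main__":
    params = {"target_BED": "targets.bed", "off_target_depth": False}
    print(resolve_interval_sources(params))
    print(missing_conditional_inputs(params))
```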

Outputs

Output and Output Parameter/Flag	Description
`output_dir`	Location where generated output should be saved.
`*target-with-enriched-off-target-intervals.bed`	New target file including original target intervals and intervals encompassing coverage-enriched off-target dbSNP sites.
`*target-with-enriched-off-target-intervals.bed.gz`	New compressed target file including original target intervals and intervals encompassing coverage-enriched off-target dbSNP sites.
`*off-target-dbSNP-depth-per-base.bed`	Per-base read depth at dbSNP loci outside of targeted regions.
`*collapsed_coverage.bed`	Per-base read depth at specified target intervals, collapsed by interval. (OPTIONAL) Set `target_depth` in config file.
`*target-depth-per-base.bed`	Per-base read depth at target intervals (not collapsed). (OPTIONAL) set `save_raw_target_bed` in config file.
`*genome-wide-dbSNP-depth-per-base.bed`	Per-base read depth at all dbSNP loci. (OPTIONAL) Set `save_all_dbSNP` in config file.
`*HsMetrics.txt`	QC output from `CollectHsMetrics`.
`.tsv`,`.bed`	Intermediate outputs of unformatted and unmerged depth files. (OPTIONAL) Set `save_intermediate_files` in config file.
`.interval_list`	Intermediate output of target bed file converted to picard's interval list format. (OPTIONAL) Set `save_interval_list` in config file.
`report.html`, `timeline.html` and `trace.txt`	Nextflow report, timeline and trace files.
`log.command.*`	Process-specific logging files created by Nextflow.

Performance Validation

Testing was performed in the Boutros Lab SLURM Development cluster. The pipeline version used was v1.0.0-rc.1.

Targeted Panels

General estimates, with wide variations, are that smaller gene panel experiments require 16 CPUs and 32 GB of memory to run all processes efficiently in parallel. However, each individual process requires far fewer resources, and 1 CPU and 1 GB of memory are frequently sufficient for most component tools. Larger numbers of targets may increase memory requirements, particularly for interval merging steps.

Whole Exomes

General estimates, with wide variations, are that whole exome experiments require 16 CPUs and 32 GB of memory to run all processes efficiently in parallel. However, each individual process requires far fewer resources, and 1 CPU and 1 GB of memory are frequently sufficient for most component tools.

References

Discussions

Issue tracker to report errors and enhancement ideas.
Discussions can take place in pipeline-calculate-targeted-coverage Discussions
pipeline-calculate-targeted-coverage pull requests are also open for discussion.

Contributors

Please see list of Contributors at GitHub.

License

pipeline-calculate-targeted-coverage is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.

pipeline-calculate-targeted-coverage performs read-depth related calculations on BAMs from targeted sequencing experiments.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
