- Calculate Targeted Coverage
This pipeline extracts read depth calculations from a BAM file and generates outputs that are useful to the interpretation and downstream variant calling of a targeted sequencing experiment. Relevant datasets include targeted gene panels and whole exome sequencing (WXS) experiments. For in-depth downstream coverage QC, the pipeline can output per-base read depth at all targeted loci specified by a target BED file and read depth at genome-wide "off-target" well-characterized polymorphic sites known to dbSNP. For a more general overview of targeted sequencing quality, the pipeline can output QC metrics produced by picard CollectHsMetrics. As a direct contribution to a DNA processing workflow, the pipeline can output a coordinate BED file containing target intervals merged with intervals encompassing off-target dbSNP sites enriched in coverage (as determined by a user-defined read-depth threshold). This new coordinate file can be used to indicate base quality recalibration and variant calling intervals to pipeline-recalibrate-BAM and gatk HaplotypeCaller in pipeline-call-gSNP, directly or through metapipeline-DNA.
The pipeline calculates per-base read depth in a BAM file at "target" intervals specified by a target BED file and at "off-target" well-characterized polymorphic loci, and reports some basic coverage metrics. The SAMtools depth tool is used to calculate per-base coverage in the specified regions. This intermediate output is converted into BED format using an awk script. The BEDtools merge tool is then used to collapse consecutive coordinates into intervals, with the final output reporting a comma-separated list of per-base read depths for each coordinate in an interval. Picard's CollectHsMetrics is used to report various interval-related metrics on the input BAM.
- Update the params section of the .config file
- Update the input yaml
- See the submission script, here, to submit your pipeline
Currently supported Nextflow versions: v23.04.2
A directed acyclic graph of your pipeline.
Per-base depth is calculated from the input BAM file at coordinates specified by the input target BED file using samtools depth. If off_target_depth is set to true, per-base read depth is also calculated genome-wide at dbSNP loci, with a dbSNP reference VCF used as the coordinate file to samtools depth.
TSV output from samtools depth is converted into BED format using awk, with read depth reported in the fourth column. Per-base read depth across multiple-base-pair target intervals is collapsed into a comma-separated list of read depth values, one for each base pair encompassed by the interval (bedtools merge).
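The conversion and collapse described above can be sketched in plain Python. This is only an illustration of the logic, not the pipeline's awk or bedtools code. The function names are hypothetical, and the sketch assumes that samtools depth reports 1-based positions, that BED uses 0-based half-open coordinates, and that book-ended rows are merged.

```python
from typing import Iterable, List, Tuple

DepthRow = Tuple[str, int, int]  # (chrom, 1-based position, depth)


def depth_to_bed(rows: Iterable[DepthRow]) -> List[Tuple[str, int, int, int]]:
    """Convert 1-based per-base depth rows into 0-based, half-open BED rows."""
    return [(chrom, pos - 1, pos, depth) for chrom, pos, depth in rows]


def collapse_adjacent(bed_rows) -> List[Tuple[str, int, int, str]]:
    """Merge overlapping or book-ended rows, joining depths with commas."""
    merged: List[Tuple[str, int, int, str]] = []
    for chrom, start, end, depth in sorted(bed_rows):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end), f"{prev[3]},{depth}")
        else:
            merged.append((chrom, start, end, str(depth)))
    return merged


# Hypothetical depth rows: three consecutive bases and one isolated base.
rows = [("chr1", 101, 30), ("chr1", 102, 32), ("chr1", 103, 31), ("chr1", 200, 12)]
for record in collapse_adjacent(depth_to_bed(rows)):
    print("\t".join(map(str, record)))
```

With the default merge_operation of 'collapse', each output interval carries one depth value per base, so the length of the comma-separated list equals the interval width.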
dbSNP coordinates are filtered to keep only off-target regions. This is done by excluding coordinates specified in the target BED file from the dbSNP read depth BED using bedtools intersect. Near-target regions (+/- 500bp by default) are also excluded by first adding near-target buffers to the specified target intervals using bedtools slop.
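As a rough illustration of this filtering step, the sketch below pads each target interval and then drops any dbSNP site that overlaps a padded target. It is a simplified stand-in for bedtools slop and bedtools intersect, not the pipeline's implementation. The function names, example coordinates, and chromosome length are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]   # (chrom, 0-based start, end)
Site = Tuple[str, int, int, int]  # (chrom, start, end, depth)


def slop(intervals: List[Interval], chrom_sizes: Dict[str, int], bases: int = 500) -> List[Interval]:
    """Extend each interval by `bases` on both sides, clamped to the chromosome."""
    return [(c, max(0, s - bases), min(chrom_sizes[c], e + bases)) for c, s, e in intervals]


def exclude_overlapping(sites: List[Site], excluded: List[Interval]) -> List[Site]:
    """Keep only sites that do not overlap any excluded interval."""
    kept = []
    for chrom, start, end, depth in sites:
        overlaps = any(c == chrom and start < e and s < end for c, s, e in excluded)
        if not overlaps:
            kept.append((chrom, start, end, depth))
    return kept


# Hypothetical inputs: one target and three dbSNP sites with read depth.
chrom_sizes = {"chr1": 10_000}
targets = [("chr1", 1000, 1200)]
dbsnp_sites = [
    ("chr1", 1100, 1101, 80),  # on target
    ("chr1", 1600, 1601, 45),  # within the 500 bp near-target buffer
    ("chr1", 5000, 5001, 40),  # off target
]
off_target = exclude_overlapping(dbsnp_sites, slop(targets, chrom_sizes, bases=500))
print(off_target)
```

The linear scan over excluded intervals keeps the sketch short. Real BED files with many targets need a sorted sweep or interval tree, which is what dedicated tools provide.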
dbSNP coordinates from step 2 are filtered to keep sites exceeding a minimum read depth threshold (30x by default) using awk.
Filtered dbSNP coordinates from step 4 are expanded to include nearby base pairs, so that sites that are close together can subsequently be merged into one interval (bedtools slop).
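The depth filter and the expansion of enriched sites can be sketched together as follows. This is a minimal illustration, assuming a site is kept when its depth is greater than or equal to min_read_depth. Whether the pipeline's awk comparison is strict is not stated here, so treat that choice as an assumption. The function names and example values are hypothetical.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int, int, int]  # (chrom, 0-based start, end, depth)
Interval = Tuple[str, int, int]


def filter_by_depth(sites: List[Site], min_read_depth: int = 30) -> List[Site]:
    """Keep sites whose depth meets the threshold (>= is an assumption)."""
    return [site for site in sites if site[3] >= min_read_depth]


def pad_sites(sites: List[Site], chrom_sizes: Dict[str, int], dbsnp_slop: int = 150) -> List[Interval]:
    """Turn single-base sites into intervals padded by `dbsnp_slop` on each side."""
    return [
        (chrom, max(0, start - dbsnp_slop), min(chrom_sizes[chrom], end + dbsnp_slop))
        for chrom, start, end, _depth in sites
    ]


# Hypothetical off-target dbSNP sites with read depth.
chrom_sizes = {"chr2": 100_000}
sites = [("chr2", 1000, 1001, 45), ("chr2", 1100, 1101, 12), ("chr2", 1200, 1201, 30)]
enriched = filter_by_depth(sites, min_read_depth=30)
print(pad_sites(enriched, chrom_sizes, dbsnp_slop=150))
```

In this example the two retained sites are 200 bp apart, so their padded intervals overlap and a subsequent merge would join them into one interval.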
Coverage-enriched dbSNP intervals are merged with the original target intervals into one BED file using a series of bash commands that concatenate and sort the two files, then merge with bedtools.
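The concatenate, sort, and merge sequence can be illustrated with a short Python sketch. It mirrors the logic only and is not the pipeline's bash code. The function name and coordinates are hypothetical, and chromosomes are sorted lexicographically here, which may differ from the ordering the pipeline uses.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, 0-based start, end)


def merge_interval_sets(*interval_sets: List[Interval]) -> List[Interval]:
    """Concatenate interval lists, sort them, and merge overlapping or book-ended ones."""
    combined = sorted(iv for intervals in interval_sets for iv in intervals)
    merged: List[Interval] = []
    for chrom, start, end in combined:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical target intervals and padded, coverage-enriched dbSNP intervals.
targets = [("chr2", 5000, 5200), ("chr1", 100, 300)]
enriched = [("chr2", 850, 1151), ("chr2", 1050, 1351), ("chr2", 5150, 5400)]
for chrom, start, end in merge_interval_sets(targets, enriched):
    print(f"{chrom}\t{start}\t{end}")
```

Sorting before merging is what allows a single pass: each interval only needs to be compared with the most recently emitted one.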
Target BED file and optional bait file are converted to INTERVAL_LIST format using picard BedToIntervalList, then used to report metrics on input BAM with picard CollectHsMetrics.
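The coordinate change involved in the BED to INTERVAL_LIST conversion can be sketched as below. This covers only the interval rows, not the sequence-dictionary header that a real interval list also carries. It assumes the usual conventions: BED is 0-based half-open, and interval list rows are 1-based inclusive with a strand and a name column. The function name, the default "+" strand, and the "." placeholder name are assumptions for illustration, not a description of picard's exact output.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, 0-based start, end), as in BED


def bed_to_interval_rows(bed_rows: List[Interval], strand: str = "+", name: str = ".") -> List[str]:
    """Format BED intervals as tab-separated, 1-based inclusive interval rows."""
    rows = []
    for chrom, start, end in bed_rows:
        # 0-based half-open [start, end) becomes 1-based closed [start + 1, end].
        rows.append("\t".join([chrom, str(start + 1), str(end), strand, name]))
    return rows


# Hypothetical target intervals.
for row in bed_to_interval_rows([("chr1", 100, 300), ("chr2", 5000, 5200)]):
    print(row)
```

Only the start coordinate shifts. The end coordinate is numerically unchanged because a half-open end and an inclusive 1-based end refer to the same base.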
| Input and Input Parameter/Flag | Required | Type | Description |
|---|---|---|---|
| input.BAM | yes | path | BAM file for which to calculate coverage, path provided in input yaml. |
| target_BED | yes | path | BED file specifying target intervals (defines regions for target and off-target coverage operations). |
| save_intermediate_files | yes | boolean | Whether to save intermediate files. |
| reference_dict | yes | path | Human genome reference dictionary file for use in BED to INTERVAL_LIST conversion. Required if collecting metrics. |
| reference_dbSNP | yes | path | dbSNP reference VCF file, with proper chromosome encoding and compression. See discussion. Required if performing off-target read depth calculation. |
| genome_sizes | yes | path | Reference file consisting of chromosomes and their lengths, used by bedtools slop. Required for off-target read depth workflows. .fai files accepted. |
| target_depth | no | bool | Whether to calculate per-base read depth in targeted regions. Default false. |
| off_target_depth | no | bool | Whether to perform off-target read depth calculation at dbSNP loci. Default true. |
| output_enriched_target_file | no | bool | Whether to output a new target file containing coverage-enriched off-target dbSNP loci. Default true. |
| min_read_depth | no | integer | Minimum read depth threshold for an off-target locus to be considered enriched and be included in the new target file. Default 30. |
| min_base_quality | no | integer | Minimum base quality for a read to be counted in depth calculation by samtools depth. Applies to both off- and on-target calculations. Defaults to 20. |
| min_mapping_quality | no | integer | Minimum mapping quality for a read to be counted in depth calculation by samtools depth. Applies to both off- and on-target calculations. Defaults to 20. |
| collect_metrics | no | bool | Whether to run CollectHsMetrics. Default true. |
| target_interval_list | no | path | Interval list file specifying target intervals used to calculate coverage by CollectHsMetrics. If not provided, the target BED will be used to calculate the intervals. |
| bait_BED | no | path | BED file with bait locations that can be used to generate a bait interval list used by CollectHsMetrics. If not provided, the target BED will be used. |
| bait_interval_list | no | path | Interval list file specifying bait intervals used by CollectHsMetrics. If not provided, the bait BED will be used to calculate the intervals. |
| save_interval_list | yes | boolean | Whether to save a copy of any generated interval lists. Saves to the output_dir. |
| save_all_dbSNP | no | boolean | Whether to save a copy of the read depth BED file for all dbSNP loci generated by the off-target workflows. Default false. |
| save_raw_target_bed | no | boolean | Whether to save a copy of the per-base, target read depth BED with uncollapsed intervals. Default false. |
| off_target_slop | no | integer | Number of base pairs to add to either side of target file coordinates so that they may be excluded from off-target read depth calculation. Default is 500. |
| dbSNP_slop | no | integer | Number of base pairs to add to either side of off-target dbSNP loci to generate off-target intervals. The purpose is to merge adjacent dbSNP loci into single intervals prior to merging with target intervals. Default is 150. |
| coverage_cap | no | integer | COVERAGE_CAP parameter for CollectHsMetrics, determines the coverage threshold at which to stop calculating coverage. |
| near_distance | no | integer | NEAR_DISTANCE parameter for CollectHsMetrics, determines the maximum distance in bp of a read from the nearest probe (bait) for it to be counted as "near probe" in metrics calculations. Default 250. |
| samtools_depth_extra_args | no | string | Extra arguments for samtools depth. |
| picard_CollectHsMetrics_extra_args | no | string | Extra arguments for picard CollectHsMetrics. |
| merge_operation | no | string | Operation performed on read depth column values when intervals are collapsed during bedtools merge. Defaults to 'collapse'. See bedtools documentation for other options. |
| work_dir | no | path | Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is /scratch and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively. |
| Output and Output Parameter/Flag | Description |
|---|---|
| output_dir | Location where generated output should be saved. |
| *target-with-enriched-off-target-intervals.bed | New target file including original target intervals and intervals encompassing coverage-enriched off-target dbSNP sites. |
| *target-with-enriched-off-target-intervals.bed.gz | New compressed target file including original target intervals and intervals encompassing coverage-enriched off-target dbSNP sites. |
| *off-target-dbSNP-depth-per-base.bed | Per-base read depth at dbSNP loci outside of targeted regions. |
| *collapsed_coverage.bed | Per-base read depth at specified target intervals, collapsed by interval. (OPTIONAL) Set target_depth in config file. |
| *target-depth-per-base.bed | Per-base read depth at target intervals (not collapsed). (OPTIONAL) Set save_raw_target_bed in config file. |
| *genome-wide-dbSNP-depth-per-base.bed | Per-base read depth at all dbSNP loci. (OPTIONAL) Set save_all_dbSNP in config file. |
| *HsMetrics.txt | QC output from CollectHsMetrics. |
| .tsv, .bed | Intermediate outputs of unformatted and unmerged depth files. (OPTIONAL) Set save_intermediate_files in config file. |
| .interval_list | Intermediate output of target BED file converted to picard's interval list format. (OPTIONAL) Set save_interval_list in config file. |
| report.html, timeline.html and trace.txt | Nextflow report, timeline and trace files. |
| log.command.* | Process-specific logging files created by Nextflow. |
Testing was performed in the Boutros Lab SLURM Development cluster. Pipeline version used here is v1.0.0-rc.1
General estimates, with wide variations, are that smaller gene panel experiments require 16 CPUs and 32 GB of memory to run all processes efficiently in parallel. However, each individual process requires far fewer resources, and 1 CPU and 1 GB is frequently sufficient for most component tools. Larger numbers of targets may increase memory requirements, particularly for interval merging steps.
General estimates, with wide variations, are that whole exome experiments require 16 CPUs and 32 GB of memory to run all processes efficiently in parallel. However, each individual process requires far fewer resources, and 1 CPU and 1 GB is frequently sufficient for most component tools.
- Issue tracker to report errors and enhancement ideas.
- Discussions can take place in pipeline-calculate-targeted-coverage Discussions
- pipeline-calculate-targeted-coverage pull requests are also open for discussion.
Please see list of Contributors at GitHub.
pipeline-calculate-targeted-coverage is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.
pipeline-calculate-targeted-coverage performs read-depth related calculations on BAMs from targeted sequencing experiments.
Copyright (C) 2022-2024 University of California Los Angeles ("Boutros Lab") All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.