metapipeline-DNA

Overview

Metapipeline-DNA is a DNA sequencing processing pipeline that accepts sequencing data as input. The data may be in FASTQ format or in aligned format (BAM/CRAM - BETA FEATURE), with options for re-alignment with back-conversion to FASTQ format. The FASTQs are aligned to the reference genome and recalibrated with insertion-deletion (INDEL) realignment and base quality score recalibration, followed by quality control steps including targeted coverage calculation and whole genome sequencing (WGS) metrics. Various calling steps are performed to identify germline single-nucleotide polymorphisms (SNPs), somatic single-nucleotide variants (SNVs), mitochondrial SNVs, germline structural variants (SVs), somatic SVs, and somatic copy-number aberrations (CNAs). The processing culminates with subclonal reconstruction (SRC).

How To Run

1. Create a config file from the template. The config takes the input samples and general parameters, plus a section that defines the parameters, reference files, and resource configuration for each run and each pipeline. See Configuration for details on the available options.
2. Create an input.csv or an input.yaml file (following the descriptions in the Inputs section) to provide the input files for each sample. If using an input CSV, add the path to the CSV to the config created in step 1 (the `input_csv` parameter).
3. Launch the pipeline using:

# YAML input
nextflow run \
    /path/to/metapipeline-dna/main.nf \
    -c /path/to/generated/config \
    -params-file /path/to/generated/input.yaml

# CSV input
nextflow run \
    /path/to/metapipeline-dna/main.nf \
    -c /path/to/generated/config

Note: UCLAHS-CDS users should submit the pipeline using the submission script.

Warning: A low-resource partition (e.g., F2 with 2 CPUs and 4 GB of memory) is sufficient for the leading job.

Flow Diagram

The general execution of metapipeline-DNA follows these steps, with job submission managed with Slurm:

Each worker node performs the following steps:

Pipeline Steps

1. convert-BAM2FASTQ

Optional: Only run when BAM or CRAM input is provided and realignment is not overridden.

Aligned data is back-converted to FASTQ using pipeline-convert-BAM2FASTQ.

2. align-DNA

FASTQ data is (re)aligned to the genome on a per-sample basis using pipeline-align-DNA.

3. calculate-targeted-coverage

For targeted or exome sequencing, depth for the target regions is assessed along with off-target coverage enrichment using pipeline-calculate-targeted-coverage.

4. recalibrate-BAM

The aligned BAM undergoes INDEL realignment and base quality score recalibration using pipeline-recalibrate-BAM.

5. generate-SQC-BAM

Quality control is performed on the recalibrated BAM using pipeline-generate-SQC-BAM.

6. call-gSNP

Germline SNPs are called using pipeline-call-gSNP.

7. call-sSNV

Somatic SNVs are called using pipeline-call-sSNV.

8. call-mtSNV

Mitochondrial SNVs are called using pipeline-call-mtSNV.

9. call-gSV

Germline SVs are called using pipeline-call-gSV.

10. call-sSV

Somatic SVs are called using pipeline-call-sSV.

11. call-sCNA

Somatic CNAs are called using pipeline-call-sCNA.

12. call-SRC

Subclonal reconstruction is performed using pipeline-call-SRC.

Configuration

The following parameters are available at the metapipeline level:

Parameter	Type	Required	Description
`output_dir`	path	yes	Absolute path to directory where output files will be saved
`leading_work_dir`	path	yes	Absolute path to common working directory (under `/hot` for example for access across all nodes). Cannot be `/scratch` or any node-specific directory.
`pipeline_work_dir`	path	yes	Absolute path to outputs from each individual pipeline before copying to `output_dir`. Suggested: `/scratch`
`project_id`	string	yes	Project identifier used to name the main output directory of the run
`save_intermediate_files`	boolean	yes	Whether to save intermediate files. Default: `false`
`partition`	string	yes	Partition type for submitting each processing job
`clusterOptions`	string	yes	Additional `slurm` submission options
`max_parallel_jobs`	integer	yes	Number of jobs to submit at once. Default: 5
`cluster_submission_interval`	integer	yes	Time in minutes to wait between job submissions. Default: 90
`sample_mode`	string	yes	Mode for sample calling. Options: `paired`, `single`, `multi`. Default: `paired`
`requested_pipelines`	list	yes	List of pipelines requested.
`use_original_intervals`	boolean	yes	Whether original intervals should be used with pipelines rather than expanded intervals generated by calculate-targeted-coverage
`pipeline_params`	namespace	yes	Namespace containing parameters for each individual pipeline. Parameters for the requested pipelines must be given.
`override_realignment`	boolean	yes	Whether to override conversion to FASTQ and realignment when given BAM input. Default: `false`
`override_recalibrate_bam`	boolean	yes	Whether to override recalibrate-BAM pipeline when given BAM input. Default: `false`
`src_snv_tool`	string	yes	Which SNV tool's output to use for SRC. Default: `BCFtools-Intersect`
`src_cna_tool`	string	yes	Which CNA tool's output to use for SRC. Default: `Battenberg`
`override_src_precursor_disable`	boolean	yes	Whether to override the automatic disabling of either call-sSNV or call-sCNA when the respective outputs are provided in the input. Default: `false`
`src_run_all_combinations`	boolean	yes	TO-DO: Whether to run SRC using all combinations of SNV tool and CNA tool. Default: `false`
`run_downstream_pipelines_serially`	boolean	no	Whether to run pipelines downstream of recalibrate-BAM sequentially. Note: if this option is used in conjunction with `downstream_pipeline_order`, any pipelines with a given ordering will be run sequentially regardless of the value of this parameter. Default: `false`
`downstream_pipeline_order`	list	no	List indicating specific order in which to run pipelines downstream of recalibrate-BAM. Default: no order
`input_csv`	path	no	Absolute path to input CSV when using CSV input
`status_email_address`	string	no	Email address to notify when child pipelines start and complete. Default: empty string
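
As an illustration, the metapipeline-level portion of a config might look like the following minimal sketch. All paths, the project ID, the partition name, and the email address are hypothetical placeholders; the remaining values are the defaults listed in the table. Required parameters that are covered in later sections (`requested_pipelines`, `pipeline_params`, and the override and SRC options) are left out, so this fragment is not a complete config on its own.

```groovy
params {
    // Hypothetical locations; replace with paths valid on your cluster
    output_dir        = '/hot/path/to/output'
    leading_work_dir  = '/hot/path/to/work'   // must be reachable from every node
    pipeline_work_dir = '/scratch'            // suggested location from the table

    project_id = 'example-project'            // hypothetical identifier
    save_intermediate_files = false

    // Slurm submission settings; 'F72' is a made-up partition name
    partition      = 'F72'
    clusterOptions = ''
    max_parallel_jobs           = 5
    cluster_submission_interval = 90          // minutes

    sample_mode = 'paired'

    // Optional: notify when child pipelines start and complete (placeholder address)
    status_email_address = 'user@example.org'
}
```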

UCLAHS-CDS WGS global sample job submission parameters

The following parameters control the global number and rate of WGS job submissions. They are enabled by default; for non-WGS samples or a non-UCLAHS-CDS environment, set uclahs_cds_wgs to false in the params section of the config file.

Input Parameter	Type	Required	Description
`uclahs_cds_wgs`	boolean	yes	Whether global job number and submission limits should be applied. Default: `true`
`global_rate_limit`	integer	yes	Time in minutes between submission of any WGS jobs. Default: 90
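
For example, a run on exome samples or outside the UCLAHS-CDS environment could switch the global limits off with a fragment like this (a sketch that uses only the parameter named in the table above):

```groovy
params {
    // Disable the global WGS job-number and submission-rate limits
    uclahs_cds_wgs = false
}
```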

Pipeline selection

Pipeline selection is controlled by the requested_pipelines parameter. Given the list of requested pipelines, metapipeline-DNA will automatically identify any necessary dependencies and enable them for the run.

Pipeline selection follows some default behaviors:

When given BAM input, the default pipeline selector converts the BAMs to FASTQ, re-aligns the FASTQs, and performs recalibration. This default behavior can be disabled with the override_realignment and override_recalibrate_bam parameters: override_realignment disables the back-conversion to FASTQ and the re-alignment, and override_recalibrate_bam disables recalibration of the BAM with recalibrate-BAM.
When SNV or CNA calls are given as inputs, metapipeline-DNA automatically disables the call-sSNV or call-sCNA pipeline, respectively, and uses the given calls for call-SRC. Setting override_src_precursor_disable lets metapipeline-DNA run the call-sSNV and call-sCNA pipelines anyway and generate the inputs for call-SRC from the BAM or FASTQ inputs. Note: This option only has an effect when mixed inputs are provided, because the call-sSNV and call-sCNA pipelines require sequencing data as input.
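
A sketch of how these parameters might be combined for BAM input that is already aligned: realignment is skipped, recalibration is kept, and only germline SNP and somatic SNV calling are requested. It assumes that entries in `requested_pipelines` use the pipeline names as written in the Pipeline Steps section; check the template config for the exact spelling expected.

```groovy
params {
    // Assumed naming: pipeline names as they appear under Pipeline Steps
    requested_pipelines = ['call-gSNP', 'call-sSNV']

    // Skip back-conversion to FASTQ and re-alignment of the input BAMs
    override_realignment = true

    // Keep recalibrate-BAM enabled (the default)
    override_recalibrate_bam = false
}
```

Any pipelines that the requested ones depend on are enabled automatically, so they do not need to be listed.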

Pipeline-specific params

Each pipeline has a set of parameters that must be provided. The available parameters for each pipeline are documented in that pipeline's own documentation, linked under Pipeline Steps. Additionally, the default template.config contains the default set of parameters that need to be defined for each pipeline. Any additional supported parameters can be added as needed. The following keys are used as the pipeline names in the pipeline_params namespace:

Pipeline	Key
`convert-BAM2FASTQ`	`convert_BAM2FASTQ`
`align-DNA`	`align_DNA`
`recalibrate-BAM`	`recalibrate_BAM`
`calculate-targeted-coverage`	`calculate_targeted_coverage`
`generate-SQC-BAM`	`generate_SQC_BAM`
`call-gSNP`	`call_gSNP`
`call-sSNV`	`call_sSNV`
`call-mtSNV`	`call_mtSNV`
`call-gSV`	`call_gSV`
`call-sSV`	`call_sSV`
`call-sCNA`	`call_sCNA`
`call-SRC`	`call_SRC`

Each pipeline also defines a set of resources per process to run. These resources can be modified if necessary on a per-process per-pipeline basis by using the base_resource_update functionality for the specific pipeline (this functionality is defined in each pipeline's README). For example, to double the base memory of all processes in the call-sSNV pipeline:

params {
    ...
    pipeline_params {
        ...
        call_sSNV {
            ...
            base_resource_update {
                memory = [
                    [[], 2]
                ]
            }
        }
        ...
    }
}

Intervals

For targeted or exome sequencing, target intervals can be provided in BED format to some of the steps to control processing. The following steps accept intervals:

Step/pipeline	Parameter name
`call-sSNV`	`intersect_regions`
`call-gSNP`	`intervals`
`recalibrate-BAM`	`intervals`
`calculate-targeted-coverage`	`target_bed`

For the respective pipeline params, provide the full path to the intervals file in the generated config to make use of the targets. For example:

params {
    ...
    pipeline_params {
        ...
        call_sSNV {
            ...
            intersect_regions = "/full/path/to/intervals"
            ...
        }
        ...
    }
}
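
The same targets can be supplied to every step in the table. The following is a sketch under the assumption that a single BED file is appropriate for all four steps; the path is a placeholder, each pipeline's other required parameters are omitted, and `use_original_intervals` is included only to show where that metapipeline-level switch lives.

```groovy
params {
    // Use the provided intervals rather than the expanded ones
    // generated by calculate-targeted-coverage
    use_original_intervals = true

    pipeline_params {
        calculate_targeted_coverage {
            target_bed = '/full/path/to/targets.bed'
        }
        recalibrate_BAM {
            intervals = '/full/path/to/targets.bed'
        }
        call_gSNP {
            intervals = '/full/path/to/targets.bed'
        }
        call_sSNV {
            intersect_regions = '/full/path/to/targets.bed'
        }
    }
}
```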

Sample modes

The metapipeline supports running samples in three modes: single, paired, and multi. This is controlled by the sample_mode parameter. In paired or multi sample modes, each patient is expected to have exactly one normal sample and one or more tumor samples.

Given the set of input patients and samples, grouping of samples is controlled based on the run mode as follows:

Single sample mode

All samples are processed individually, regardless of patient, as separate jobs.

Normal samples will go through germline calling (call-gSNP, call-gSV) and somatic SNV calling with Mutect2's normal-only mode.
Tumor samples will go through germline calling (call-gSNP) and somatic SNV calling with Mutect2's tumor-only mode.

Paired sample mode

All samples from the same patient are submitted as a single job, with each normal-tumor pair processed separately in the same job.

Individual samples will go through the convert-BAM2FASTQ and align-DNA pipelines.
The normal sample will then be paired with each tumor sample and each pair will go through recalibration and the somatic calling pipelines.
The normal sample will go through call-gSV.

Multi sample mode

All samples from the same patient are processed as a single job.

Individual samples will go through the convert-BAM2FASTQ and align-DNA pipelines.
The recalibration and germline SNP calling will then proceed on the entire set of samples together.
Somatic SNV calling will proceed in two ways:
1. The normal sample will be paired with each tumor sample and run through the call-sSNV pipeline
2. If Mutect2 was requested, the entire set of samples will go through multi-sample calling with just Mutect2 in call-sSNV.
The normal sample will be paired with each tumor sample and each pair will go through call-mtSNV, call-sSV, and call-sCNA.
The normal sample will go through call-gSV.

Inputs

Inputs can be provided in either CSV or YAML format.

For CSV inputs, identify the fields needed for each input type below and include the respective fields. For mixed inputs, use empty cell values - see template CSVs for examples.

For YAML inputs, see the template YAMLs. In each template YAML, any key or value enclosed in <> must be filled in and the <> removed; all other keys must be kept as they are. For example, with an actual patient ID of PRAD0001:

---
input:
    <patient1>:
...

should be filled in to become:

---
input:
    PRAD0001:
...

Input BAM

Field	Type	Required	Description
patient	string	yes	Identifier for the patient
sample	string	yes	Identifier for the sample
state	string	yes	Must be either "tumor" or "normal"
path	path	yes	Absolute path to the sample BAM file

See this template for CSV format and this template for YAML format.

Input FASTQ

Field	Type	Required	Description
patient	string	yes	Identifier for the patient
sample	string	yes	Identifier for the sample
state	string	yes	Must be either "tumor" or "normal"
read_group_identifier	string	yes	Read group ID
sequencing_center	string	yes	Center where sequencing was performed
library_identifier	string	yes	Library used for sample
platform_technology	string	yes	Technology used for sequencing
platform_unit	string	yes	Name of specific platform unit
bam_header_sm	string	yes	Sample name tag for BAM
lane	string	yes	Lane identifier for sample
read1_fastq	path	yes	Absolute path to R1 FASTQ
read2_fastq	path	yes	Absolute path to R2 FASTQ

See this template for CSV format and this template for YAML format.

Input SRC

For SRC input, only call-SRC can be run. In this case, for each tumor sample, SNV calls and CNA calls must be provided.

Field	Type	Required	Description
patient	string	yes	Identifier for the patient
sample	string	yes	Identifier for the sample
state	string	yes	Must be either "tumor" or "normal"
src_input_type	string	yes	The type of input, must be either "CNA" or "SNV"
src_input_algorithm	string	yes	Algorithm used to generate the input
src_path	string	yes	Full path to the file

See this template for CSV format and this template for YAML format.

Mixed input

A mix of SRC and sequencing inputs can also be provided, in cases where for example CNA calling has already been done and SNV calling needs to be performed.

CNA calls available

If CNA calls are already available, provide the CNA calls as SRC input and provide the sequencing data as FASTQ or BAM/CRAM input. With CSV input, keep all columns and leave fields blank per row as needed. See template CSV for the CSV format and template YAML for the YAML format.

SNV calls available

If SNV calls are already available, provide the SNV calls as SRC input and provide the sequencing data as FASTQ or BAM/CRAM input. With CSV input, keep all columns and leave fields blank per row as needed. See template CSV for the CSV format and template YAML for the YAML format.
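
For either mixed-input scenario, the SRC-related parameters from the Configuration section determine which calls reach call-SRC. The fragment below sketches the case where CNA calls are already available. It assumes, for illustration only, that the provided CNA calls came from Battenberg, that `src_cna_tool` should name that same tool, and that pipeline names in `requested_pipelines` are spelled as under Pipeline Steps.

```groovy
params {
    // Run somatic SNV calling on the sequencing data, then SRC
    requested_pipelines = ['call-sSNV', 'call-SRC']

    // Which tool's output to pass to call-SRC (defaults from the table)
    src_snv_tool = 'BCFtools-Intersect'
    src_cna_tool = 'Battenberg'

    // Leave the automatic disabling of call-sCNA in place, since the
    // CNA calls are already supplied as SRC input
    override_src_precursor_disable = false
}
```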

Outputs

Outputs will be placed under <params.output_dir>/metapipeline-DNA-<version>/<params.project_id> and organized by individual pipeline. See individual pipeline documentation for specific outputs generated per pipeline.

Discussions

Use the issue tracker to report errors and suggest enhancements.
Discussions can take place in metapipeline-DNA Discussions.
metapipeline-DNA pull requests are also open for discussion.

Contributors

Please see list of Contributors at GitHub.

References

Yash Patel, Arpi Beshlikyan, Madison Jordan, Gina Kim, Aaron Holmes, Takafumi N Yamaguchi, Paul C Boutros, PipeVal: light-weight extensible tool for file validation, Bioinformatics, Volume 40, Issue 2, February 2024, btae079, https://doi.org/10.1093/bioinformatics/btae079
Yash Patel, Chenghao Zhu, Takafumi N Yamaguchi, Yuan Zhe Bugh, Mao Tian, Aaron Holmes, Sorel T Fitz-Gibbon, Paul C Boutros, NFTest: automated testing of Nextflow pipelines, Bioinformatics, Volume 40, Issue 2, February 2024, btae081, https://doi.org/10.1093/bioinformatics/btae081

License

metapipeline-DNA is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.

Metapipeline-DNA is a Nextflow pipeline to convert BAM to FASTQ, align, perform QC, assess targeted coverage, call gSNP, call sSNV, call mtSNV, call SVs, call sCNA, and perform subclonal reconstruction.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
