BfArM-MVH/GRZ_QC_Workflow performs extended quality control of GRZ submissions according to the defined thresholds.
- Read QC (FastQC and FASTP)
- Alignment (BWAMEM2)
- Coverage calculation (Mosdepth)
- Present QC for raw reads (MultiQC)
- Install Nextflow (and its dependencies).
- Make sure that conda, Docker or Singularity is available.
- Clone the GitHub repository:
git clone https://github.com/BfArM-MVH/GRZ_QC_Workflow.git
output_basepath="path/to/analysis/dir"
This pipeline automatically downloads the necessary reference genomes and creates a BWA index from them. However, when the pipeline is run multiple times on different submissions, repeating the download and indexing steps adds unnecessary overhead.
To avoid repeated downloads, you can download the reference genome FASTA files once to a shared location:
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
mv hg19.fa.gz $shared_directory/references
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
mv hg38.fa.gz $shared_directory/references
Then you can update the file paths in conf/grzqc.conf:
params {
[...]
fasta_37 = "$shared_directory/references/hg19.fa.gz"
fasta_38 = "$shared_directory/references/hg38.fa.gz"
}
by replacing $shared_directory with the absolute path to the shared directory.
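The substitution can also be scripted. The following is a minimal Python sketch and not part of the pipeline: the helper name `render_reference_params` and the example directory `/data/shared` are hypothetical. It only prints the two `fasta_*` lines with an absolute path so they can be pasted into conf/grzqc.conf.

```python
import os


def render_reference_params(shared_directory: str) -> str:
    """Return the fasta_37/fasta_38 lines with an absolute shared directory."""
    # os.path.abspath makes a relative path absolute without touching the disk.
    references = os.path.join(os.path.abspath(shared_directory), "references")
    return "\n".join(
        [
            f'fasta_37 = "{references}/hg19.fa.gz"',
            f'fasta_38 = "{references}/hg38.fa.gz"',
        ]
    )


if __name__ == "__main__":
    # "/data/shared" is a placeholder; use your own shared directory.
    print(render_reference_params("/data/shared"))
```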
After the first run, you can also copy the BWAMEM2 index to the shared directory:
cp -r "${output_basepath}/grzqc_output/references/" "$shared_directory/references/"
and configure it in conf/grzqc.conf:
params {
[...]
bwa_index_37 = "$shared_directory/references/GRCh37/bwamem2"
bwa_index_38 = "$shared_directory/references/GRCh38/bwamem2"
}
by replacing $shared_directory with the absolute path to the shared directory.
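Before pointing the config at the shared directory, it may help to check that the referenced paths exist. The sketch below is a hypothetical pre-flight check (the function name `missing_shared_references` is not part of the pipeline). It only tests for the four paths named in the config snippets above and assumes nothing about the files inside the index directories.

```python
from pathlib import Path


def missing_shared_references(shared_directory: str) -> list[str]:
    """List the configured reference paths that do not exist yet."""
    references = Path(shared_directory) / "references"
    # Paths taken from the fasta_* and bwa_index_* settings shown above.
    expected = [
        references / "hg19.fa.gz",
        references / "hg38.fa.gz",
        references / "GRCh37" / "bwamem2",
        references / "GRCh38" / "bwamem2",
    ]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    # "/data/shared" is a placeholder; use your own shared directory.
    for path in missing_shared_references("/data/shared"):
        print(f"missing: {path}")
```

An empty result means all four configured paths are present.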
This pipeline needs a samplesheet, which is generated automatically from the metadata.json file included in the submission base directory. Please make sure that the submission base directory has the required folder structure. The script run_grzqc.sh parses the metadata.json file to create a Nextflow samplesheet:
python3 bin/metadata_to_samplesheet.py \
"${submission_basepath}" \
"${output_basepath}/grzqc_output/grzqc_samplesheet.csv"
Now, you can run the pipeline using:
nextflow run main.nf \
-profile grzqc,conda \
--outdir "${output_basepath}/grzqc_output/" \
-work-dir "${output_basepath}/work/" \
--input "${output_basepath}/grzqc_output/grzqc_samplesheet.csv" \
-resume
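The two steps above (samplesheet creation and the pipeline run) can be driven from a single script. The following Python sketch is one possible wrapper and not the project's run_grzqc.sh: the function names are hypothetical, it assumes it is started from the root of the cloned repository, and creating the output directory beforehand is an assumption that may not be required.

```python
import subprocess
from pathlib import Path


def build_commands(submission_basepath: str, output_basepath: str) -> list[list[str]]:
    """Build the argument lists for the two commands shown above."""
    out_dir = Path(output_basepath) / "grzqc_output"
    work_dir = Path(output_basepath) / "work"
    samplesheet = out_dir / "grzqc_samplesheet.csv"
    return [
        ["python3", "bin/metadata_to_samplesheet.py", submission_basepath, str(samplesheet)],
        [
            "nextflow", "run", "main.nf",
            "-profile", "grzqc,conda",
            "--outdir", f"{out_dir}/",
            "-work-dir", f"{work_dir}/",
            "--input", str(samplesheet),
            "-resume",
        ],
    ]


def run_all(submission_basepath: str, output_basepath: str) -> None:
    """Run both steps in order and stop at the first failure."""
    # Assumption: the output directory may need to exist before the samplesheet is written.
    (Path(output_basepath) / "grzqc_output").mkdir(parents=True, exist_ok=True)
    for command in build_commands(submission_basepath, output_basepath):
        subprocess.run(command, check=True)
```

Passing the commands as argument lists avoids shell quoting problems when paths contain spaces.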
For subsequent runs, you can use prebuilt references. To do so, prepare your own config file.
Output:
Column | Description |
---|---|
sampleId | Sample ID |
labDataName | Lab data name |
libraryType | Library type, e.g., wes for whole-exome sequencing |
sequenceSubtype | Sequence subtype, e.g., somatic or germline |
genomicStudySubtype | Genomic study subtype, e.g., tumor+germline |
meanDepthOfCoverage | Mean depth of coverage |
meanDepthOfCoverageRequired | Mean depth of coverage required to pass QC |
fractionBasesAboveQualityThreshold | Fraction of bases passing the quality threshold |
qualityThreshold | The quality threshold to pass |
fractionBasesAboveQualityThresholdRequired | Fraction of bases above the quality threshold required to pass QC |
targetedRegionsAboveMinCoverage | Fraction of targeted regions above minimum coverage |
minCoverage | Minimum coverage for target regions |
targetedRegionsAboveMinCoverageRequired | Fraction of targeted regions above minimum coverage required to pass QC |
passedQC | true when QC passed, otherwise false |
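The columns above lend themselves to a short post-run summary. The sketch below is a hypothetical example, not part of the pipeline. It assumes the report is a CSV file with a header row using these column names, and it assumes that a metric counts as insufficient when its value is strictly below the corresponding required value. The pipeline itself may compare differently, so `passedQC` remains the authoritative result.

```python
import csv

# (observed column, required column) pairs taken from the table above.
METRICS = [
    ("meanDepthOfCoverage", "meanDepthOfCoverageRequired"),
    ("fractionBasesAboveQualityThreshold", "fractionBasesAboveQualityThresholdRequired"),
    ("targetedRegionsAboveMinCoverage", "targetedRegionsAboveMinCoverageRequired"),
]


def metrics_below_required(row: dict[str, str]) -> list[str]:
    """Names of the metrics whose value is below the required value."""
    # Assumption: "below" means observed < required.
    return [obs for obs, req in METRICS if float(row[obs]) < float(row[req])]


def failing_samples(lines) -> dict[str, list[str]]:
    """Map sampleId to its insufficient metrics for rows where passedQC is not true.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    report = {}
    for row in csv.DictReader(lines):
        if row["passedQC"].strip().lower() != "true":
            report[row["sampleId"]] = metrics_below_required(row)
    return report
```

To use it, open the report file (its path depends on your output directory) with `open(path, newline="")` and pass the handle to `failing_samples`.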
BfArM-MVH/GRZ_QC_Workflow was originally written by Shounak Chakraborty, Yun Wang, Kübra Narci and Florian R. Hölzlwimmer.