Merge pull request #100 from broadinstitute/staging

Staging->Master

sophiacrennan authored Oct 7, 2020
2 parents 3b84df4 + 45ff267 commit 3e08cd2

Showing 48 changed files with 16,742 additions and 27 deletions.
9 changes: 9 additions & 0 deletions .dockstore.yml
Original file line number Diff line number Diff line change
@@ -15,3 +15,12 @@ workflows:
- name: scATAC
subclass: WDL
primaryDescriptorPath: /pipelines/skylab/scATAC/scATAC.wdl
- name: JointGenotyping
subclass: WDL
primaryDescriptorPath: /pipelines/broad/dna_seq/germline/joint_genotyping/JointGenotyping.wdl
- name: ExomeGermlineSingleSample
subclass: WDL
primaryDescriptorPath: /pipelines/broad/dna_seq/germline/single_sample/exome/ExomeGermlineSingleSample.wdl
- name: WholeGenomeGermlineSingleSample
subclass: WDL
primaryDescriptorPath: /pipelines/broad/dna_seq/germline/single_sample/wgs/WholeGenomeGermlineSingleSample.wdl
4 changes: 3 additions & 1 deletion dockers/skylab/optimus-test-matrix/Dockerfile
@@ -1,10 +1,12 @@
# Start with an ubuntu system and update it
FROM ubuntu:16.04
FROM ubuntu:18.04

# Image label
LABEL maintainer="Lantern Team <[email protected]>" \
software="Optimus Matrix Tester"

ENV DEBIAN_FRONTEND=noninteractive

# Enable source repositories to install deps for R and update the apt-get list
RUN sed -Ei 's/^# deb-src /deb-src /' /etc/apt/sources.list && apt-get update

5 changes: 5 additions & 0 deletions pipelines/broad/arrays/single_sample/Arrays.changelog.md
@@ -1,3 +1,8 @@
# 2.2.0
2020-10-01

* Updated task definitions to include a new tool not currently used in the Arrays WDL

# 2.1.0
2020-08-18

2 changes: 1 addition & 1 deletion pipelines/broad/arrays/single_sample/Arrays.wdl
@@ -21,7 +21,7 @@ import "../../../../tasks/broad/InternalArraysTasks.wdl" as InternalTasks
workflow Arrays {

String pipeline_version = "2.1.0"
String pipeline_version = "2.2.0"

input {

@@ -1,3 +1,8 @@
# 1.11.0
2020-10-01

* Updated task definitions to include a new tool not currently used in the ValidateChip WDL

# 1.10.0
2020-08-18

2 changes: 1 addition & 1 deletion pipelines/broad/arrays/validate_chip/ValidateChip.wdl
@@ -21,7 +21,7 @@ import "../../../../tasks/broad/InternalArraysTasks.wdl" as InternalTasks
workflow ValidateChip {

String pipeline_version = "1.10.0"
String pipeline_version = "1.11.0"

input {
String sample_alias
@@ -1,3 +1,8 @@
# 1.11.0
2020-10-01

* Added use of BafRegress to the pipeline. BafRegress detects and estimates sample contamination using B allele frequency data from Illumina genotyping arrays using a regression model.

# 1.10.0
2020-08-18

@@ -1,6 +1,6 @@
| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| :----: | :---: | :----: | :--------------: |
| [Version 1.9](IlluminaGenotypingArray.wdl) | July 31, 2020 | [Elizabeth Kiernan](mailto:[email protected]) | Please file GitHub issues in warp or contact [Kylee Degatano](mailto:[email protected]) |
| [Version 1.11.0](IlluminaGenotypingArray.wdl) | Oct 1, 2020 | [Elizabeth Kiernan](mailto:[email protected]) | Please file GitHub issues in warp or contact [Kylee Degatano](mailto:[email protected]) |

# Table of Contents
- [Illumina Genotyping Array Pipeline Overview](#illumina-genotyping-array-pipeline-overview)
@@ -97,6 +97,7 @@ The workflow requires that each input is specified in a JSON file. All sample an
| Input name | Description | Required or Optional | Input format |
| --- | --- | --- | --- |
| call_rate_threshold | Minimal numeric value for a sample to have a passing call rate | Required | Value |
| minor_allele_frequency_file | Cloud path to a chip-specific text file mapping locus IDs to minor allele frequencies | Optional | String |
| contamination_controls_vcf | Cloud path to a VCF of samples run on this chip type to be used to supplement contamination calling | Optional | String |
| subsampled_metrics_interval_list | Cloud path to a file containing a subset of sites for which the workflow generates metrics and outputs a VCF | Optional | String |
| disk_size | Default disk (in GiB) for this workflow's cloud VMs | Required | Value |
@@ -115,6 +116,7 @@ The following table provides a summary of the tasks and tools called by the Illu
| --- | --- | --- |
| Autocall | [iaap-cli gencall](https://support.illumina.com/downloads/iaap-genotyping-cli.html) | Illumina |
| GtcToVcf | [GtcToVcf](https://gatk.broadinstitute.org/hc/en-us/articles/360037595031-GtcToVcf-Picard-) | Picard |
| BafRegress | [BafRegress](https://genome.sph.umich.edu/wiki/BAFRegress) | https://genome.sph.umich.edu/wiki/File:BafRegress.tar.gz |
| VcfToAdpc | [VcfToAdpc](https://gatk.broadinstitute.org/hc/en-us/articles/360036484712-VcfToAdpc-Picard-) | Picard |
| VerifyIDIntensity | [VerifyIDIntensity](https://github.com/gjun/verifyIDintensity) | https://github.com/gjun/verifyIDintensity |
| CreateVerifyIDIntensityContaminationMetricsFile | [CreateVerifyIDIntensityContaminationMetricsFile](https://gatk.broadinstitute.org/hc/en-us/articles/360036805271) | Picard |
@@ -150,7 +152,13 @@ Illumina BeadChip Genotyping technology demarcates small-nucleotide variants (an
After genotyping, the workflow calls the GtcToVcf task, which runs the Picard tool [GtcToVcf](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_arrays_GtcToVcf.php) to convert the GTC into a VCF.

### 2. Contamination Detection
Intra-species DNA contamination is a common problem for genotyping samples. To detect contamination, the Illumina Array workflow uses the software [VerifyIDIntensity](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3487130/), which requires an 'adpc.bin' file (a binary file containing array intensity data that can be used with Illumina software) as input. The workflow first calls the [VcfToAdpc](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_arrays_VcfToAdpc.php) task to convert the VCF output from genotype calling into an 'adpc.bin' file. Next, the [VerifyIDIntensity](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3487130/) task uses this input file to measure contamination. The [CreateVerifyIDIntensityContaminationMetricsFile](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_arrays_CreateVerifyIDIntensityContaminationMetricsFile.php) task then converts the VerifyIDIntensity output into a Picard-standard metrics file (chip_well_barcode.verifyidintensity_metrics), suitable for uploading to a metrics database.
Intra-species DNA contamination is a common problem for genotyping samples.

The Illumina Array workflow uses two tools to detect contamination: BafRegress and VerifyIDIntensity. VerifyIDIntensity is deprecated because it can overestimate contamination when run in single-sample mode, as it typically is.

[BafRegress](https://genome.sph.umich.edu/wiki/BAFRegress) detects and estimates sample contamination using B allele frequency data from Illumina genotyping arrays with a regression model. It requires a file formatted as an Illumina Final Report. The workflow handles this in the BafRegress task, which both creates the Illumina Final Report from the VCF generated by GtcToVcf and runs the BafRegress tool itself. The output of the BafRegress task is a text file containing the estimated contamination along with associated metrics.
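The regression idea can be sketched in a few lines: at sites where a sample is called homozygous, contamination shifts the measured B allele frequency (BAF) away from 0 or 1 in proportion to the population frequency of the other allele, so the slope of a regression of observed BAF on population allele frequency estimates the contamination fraction. The following is a toy illustration on simulated data, not the BafRegress implementation; the function name and simulation parameters are invented for the example.

```python
import random

def estimate_contamination(baf, pop_freq):
    # Slope of a through-the-origin least-squares fit: sum(x*y) / sum(x*x).
    # At homozygous-reference sites, expected BAF ~ contamination * pop_freq.
    sxy = sum(x * y for x, y in zip(pop_freq, baf))
    sxx = sum(x * x for x in pop_freq)
    return sxy / sxx

random.seed(0)
true_c = 0.03                                    # simulate 3% contamination
freqs = [random.uniform(0.05, 0.5) for _ in range(5000)]   # population B-allele freqs
bafs = [true_c * f + random.gauss(0.0, 0.002) for f in freqs]  # simulated AA-site BAFs
print(round(estimate_contamination(bafs, freqs), 3))  # close to 0.03
```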

[VerifyIDIntensity](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3487130/) requires an 'adpc.bin' file (a binary file containing array intensity data that can be used with Illumina software) as input. The workflow first calls the [VcfToAdpc](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_arrays_VcfToAdpc.php) task to convert the VCF output from genotype calling into an 'adpc.bin' file. Next, the [VerifyIDIntensity](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3487130/) task uses this input file to measure contamination. The [CreateVerifyIDIntensityContaminationMetricsFile](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_arrays_CreateVerifyIDIntensityContaminationMetricsFile.php) task then converts the VerifyIDIntensity output into a Picard-standard metrics file (chip_well_barcode.verifyidintensity_metrics), suitable for uploading to a metrics database.
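For downstream use, a Picard-standard metrics file such as chip_well_barcode.verifyidintensity_metrics is plain text: `##` header lines followed by a tab-separated column header and data rows. A minimal reader can be sketched as below; this is not Picard's own parser, and the sample column names are illustrative rather than a guaranteed schema.

```python
import csv
import io

def parse_picard_metrics(text):
    # Drop '##' header/comment lines and blanks, then read the remaining
    # tab-separated header row and data rows into dicts.
    lines = [l for l in text.splitlines() if l.strip() and not l.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    return list(reader)

sample = """## htsjdk.samtools.metrics.StringHeader
## METRICS CLASS\tpicard.arrays.VerifyIDIntensityContaminationMetrics
ID\tPCT_MIX\tLLK\tLLK0
0\t0.0123\t-1234.5\t-1300.2
"""
rows = parse_picard_metrics(sample)
print(rows[0]["PCT_MIX"])  # -> 0.0123
```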

### 3. Rare Variant Calling (Optional)
After running default genotype processing with Autocall, the Illumina Genotyping Array workflow optionally uses the [zCall](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3463112/) task to improve calls on rare variants. To run this task, the workflow requires a zCall threshold file. If the workflow identifies the file, it will output a PLINK .ped and .map file. The [MergePedIntoVcf](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_arrays_MergePedIntoVcf.php) task then merges these outputs into the VCF generated during genotype calling.
@@ -182,6 +190,7 @@ The tables below summarize all of the workflow's output according to task. Outpu
| chip_well_barcode.vcf.gz | VCF generated by the pipeline | Required | Compressed VCF (vcf.gz) |
| chip_well_barcode.vcf.gz.tbi | Index file of the VCF generated by the pipeline | Required | tabix index (vcf.gz.tbi) |
| chip_well_barcode.gtc | GTC file generated by Autocall | Required | GTC |
| chip_well_barcode.bafregress_metrics | Text output file generated by BafRegress | Optional | txt |
| chip_well_barcode.verifyidintensity_metrics | File containing metrics generated by VerifyIDIntensity | Required | txt |
| chip_well_barcode.arrays_variant_calling_detail_metrics | Detailed metrics file for the output VCF generated by CollectArraysVariantCallingMetrics.detail_metrics | Required | txt |
| chip_well_barcode.arrays_variant_calling_summary_metrics | Summary metrics file for the output VCF as generated by CollectArraysVariantCallingMetrics | Required | txt |
18 changes: 17 additions & 1 deletion pipelines/broad/genotyping/illumina/IlluminaGenotypingArray.wdl
@@ -20,7 +20,7 @@ import "../../../../tasks/broad/IlluminaGenotypingArrayTasks.wdl" as GenotypingT
workflow IlluminaGenotypingArray {

String pipeline_version = "1.10.0"
String pipeline_version = "1.11.0"

input {

@@ -64,6 +64,9 @@ workflow IlluminaGenotypingArray {
# For Contamination Checking
File? contamination_controls_vcf

# For BAFRegress
File? minor_allele_frequency_file

# For HapMap GenotypeConcordance Check:
File? control_sample_vcf_file
File? control_sample_vcf_index_file
@@ -134,6 +137,18 @@
pipeline_version = "IlluminaGenotypingArray_v" + pipeline_version
}

if (defined(minor_allele_frequency_file)) {
call GenotypingTasks.BafRegress {
input:
input_vcf = GtcToVcf.output_vcf,
input_vcf_index = GtcToVcf.output_vcf_index,
maf_file = minor_allele_frequency_file,
output_results_filename = chip_well_barcode + ".results.txt",
disk_size = disk_size,
preemptible_tries = preemptible_tries,
}
}

call GenotypingTasks.VcfToAdpc {
input:
input_vcf = GtcToVcf.output_vcf,
@@ -326,6 +341,7 @@ workflow IlluminaGenotypingArray {
File? output_vcf_md5_cloud_path = VcfMd5Sum.md5_cloud_path
File? output_vcf = final_output_vcf
File? output_vcf_index = final_output_vcf_index
File? bafregress_results_file = BafRegress.results_file
File? contamination_metrics = CreateVerifyIDIntensityContaminationMetricsFile.output_metrics_file
File? output_fingerprint_vcf = SelectFingerprintVariants.output_vcf
File? output_fingerprint_vcf_index = SelectFingerprintVariants.output_vcf_index
@@ -26,5 +26,6 @@
"IlluminaGenotypingArray.dbSNP_vcf_index": "gs://gcp-public-data--broad-references/hg19/v0/dbsnp_138.b37.vcf.gz.tbi",
"IlluminaGenotypingArray.haplotype_database_file": "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.haplotype_database.txt",
"IlluminaGenotypingArray.variant_rsids_file": "gs://broad-references-private/hg19/v0/Homo_sapiens_assembly19.haplotype_database.snps.list",
"IlluminaGenotypingArray.minor_allele_frequency_file": "gs://broad-gotc-test-storage/arrays/metadata/GDA-8v1-0_A5/GDA-8v1-0_A5.MAF.txt",
"IlluminaGenotypingArray.preemptible_tries": 3
}
@@ -1,3 +1,8 @@
# 2.1.1
2020-10-01

* Removed extra trailing slash in output directory from cloud to cloud copy job

# 2.1.0
2020-08-18

@@ -5,7 +5,7 @@ import "../../../../../tasks/broad/CopyFilesFromCloudToCloud.wdl" as Copy

workflow ExternalExomeReprocessing {

String pipeline_version = "2.1.0"
String pipeline_version = "2.1.1"

input {
File? input_cram
@@ -1,3 +1,8 @@
# 1.1.1
2020-10-01

* Removed extra trailing slash in output directory from cloud to cloud copy job

# 1.1.0
2020-08-18

@@ -5,7 +5,7 @@ import "../../../../../tasks/broad/CopyFilesFromCloudToCloud.wdl" as Copy

workflow ExternalWholeGenomeReprocessing {

String pipeline_version = "1.1.0"
String pipeline_version = "1.1.1"

input {
File? input_cram
8 changes: 8 additions & 0 deletions pipelines/skylab/optimus/Optimus.changelog.md
@@ -1,9 +1,17 @@
# 4.1.0

2020-10-05 (Date of Last Commit)

* Updated sctools dockers and made them consistent across the Optimus pipeline


# 4.0.2

2020-09-30 (Date of Last Commit)

* Corrected the path to the FastqProcessing WDL


# 4.0.1

2020-09-28 (Date of Last Commit)
2 changes: 1 addition & 1 deletion pipelines/skylab/optimus/Optimus.wdl
@@ -58,7 +58,7 @@ workflow Optimus {
}

# version of this pipeline
String pipeline_version = "4.0.2"
String pipeline_version = "4.1.0"

# this is used to scatter matched [r1_fastq, r2_fastq, i1_fastq] arrays
Array[Int] indices = range(length(r1_fastq))
2 changes: 1 addition & 1 deletion tasks/broad/CopyFilesFromCloudToCloud.wdl
@@ -50,7 +50,7 @@ task CopyFilesFromCloudToCloud {
((count++)) && ((count >= $RETRY_LIMIT)) && break
done
if ! grep -q no_contamination contamination; then
/usr/local/google-cloud-sdk/bin/gsutil -m cp -L cp.log contamination ~{destination_cloud_path}/~{base_file_name}.contamination
/usr/local/google-cloud-sdk/bin/gsutil -m cp -L cp.log contamination ~{destination_cloud_path}~{base_file_name}.contamination
fi
if [ "$count" -ge "$RETRY_LIMIT" ]; then
echo 'Could not copy all the files to the cloud destination' && exit 1
30 changes: 30 additions & 0 deletions tasks/broad/IlluminaGenotypingArrayTasks.wdl
@@ -143,6 +143,36 @@ task GtcToVcf {
}
}

task BafRegress {
input {
File input_vcf
File input_vcf_index
File? maf_file
String output_results_filename

Int disk_size
Int preemptible_tries
}

command {
set -eo pipefail

/root/tools/bcftools/bin/bcftools view -f 'PASS,.' ~{input_vcf} 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 | python /root/tools/parseVcfToBAFRegress.py > temp.final_report.txt

python /root/tools/bafRegress.py estimate --freqfile ~{maf_file} temp.final_report.txt > ~{output_results_filename}
}
runtime {
docker: "us.gcr.io/broad-gotc-prod/bafregress:1.0"
disks: "local-disk " + disk_size + " HDD"
memory: "3.5 GiB"
preemptible: preemptible_tries
}

output {
File results_file = output_results_filename
}
}

task VcfToAdpc {
input {
File input_vcf
2 changes: 1 addition & 1 deletion tasks/skylab/Attach10xBarcodes.wdl
@@ -9,7 +9,7 @@ task Attach10xBarcodes {
String chemistry

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.4"
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 48000
Int cpu = 2
# estimate that bam is approximately the size of all inputs plus 50%
4 changes: 2 additions & 2 deletions tasks/skylab/CreateCountMatrix.wdl
@@ -6,7 +6,7 @@ task CreateSparseCountMatrix {
File gtf_file

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.7"
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 8250
Int cpu = 1
Int disk = ceil(size(bam_input, "Gi") + size(gtf_file, "Gi")) * 4 + 10
@@ -62,7 +62,7 @@ task MergeCountFiles {
Array[File] col_indices

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.7"
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 8250
Int cpu = 1
Int disk = 20 # todo find out how to make this adaptive with Array[file] input
2 changes: 1 addition & 1 deletion tasks/skylab/FastqProcessing.wdl
@@ -10,7 +10,7 @@ task FastqProcessing {
String sample_id

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.10"
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 3850
Int cpu = 16
#TODO decided cpu
16 changes: 8 additions & 8 deletions tasks/skylab/SequenceDataWithMoleculeTagMetrics.wdl
@@ -5,8 +5,8 @@ task CalculateGeneMetrics {
File bam_input

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.7"
Int machine_mem_mb = 8000
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 22000
Int cpu = 1
Int disk = ceil(size(bam_input, "Gi") * 4)
Int preemptible = 3
@@ -50,8 +50,8 @@ task CalculateCellMetrics {
File original_gtf

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.8"
Int machine_mem_mb = 8000
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 45000
Int cpu = 1
Int disk = ceil(size(bam_input, "Gi") * 2)
Int preemptible = 3
@@ -95,8 +95,8 @@ task MergeGeneMetrics {
Array[File] metric_files

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.7"
Int machine_mem_mb = 8000
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 3850
Int cpu = 1
Int disk = 20
Int preemptible = 3
@@ -139,8 +139,8 @@ task MergeCellMetrics {
Array[File] metric_files

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.7"
Int machine_mem_mb = 8000
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 3850
Int cpu = 1
Int disk = 20
Int preemptible = 3
2 changes: 1 addition & 1 deletion tasks/skylab/SplitBamByCellBarcode.wdl
@@ -6,7 +6,7 @@ task SplitBamByCellBarcode {
Float size_in_mb = 1024.0

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.5"
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"

Int machine_mem_mb = 15258
Int cpu = 16
4 changes: 2 additions & 2 deletions tasks/skylab/TagSortBam.wdl
@@ -5,7 +5,7 @@ task CellSortBam {
File bam_input

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.2"
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 100000
Int cpu = 2
Int disk = ceil(size(bam_input, "Gi") * 8)
@@ -49,7 +49,7 @@ task GeneSortBam {
File bam_input

# runtime values
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.2"
String docker = "quay.io/humancellatlas/secondary-analysis-sctools:v0.3.11"
Int machine_mem_mb = 100000
Int cpu = 2
Int disk = ceil(size(bam_input, "Gi") * 4)
2 changes: 1 addition & 1 deletion tests/skylab/optimus/pr/ValidateOptimus.wdl
@@ -173,7 +173,7 @@ task ValidateMatrix {
>>>

runtime {
docker: "quay.io/humancellatlas/optimus-matrix-test:0.0.4"
docker: "quay.io/humancellatlas/optimus-matrix-test:0.0.7"
cpu: 1
memory: "16 GB"
disks: "local-disk ${required_disk} HDD"