feat: merge vardict and tnscope #1475
Looks good 👍 💯 🥇
As we discussed:
Please update the docs to include the filters used and a separate section for FLT3.
Please remove the intermediate research.filter.pass files
#### Added
- UMI extraction and deduplication to TGA workflow
- Adapter trimming of fastqs to UMI workflow
- Cap base quality in bam for Manta input

#### Changed
- Refactored multi workflow rule-files to separate files to decrease complexity
- Refactored output files to in general comply with format {sample_type}.{sample_name}
- Replaced Picard QC tools with matching Sentieon QC tools

#### Removed
- UMI specific rules for UMI-extraction and alignment (using new TGA-rules instead)
- Fastq and UMI trimming command-line options

Merged this PR into this one: #1465

#### Added
- Added extension of target bed regions to a minimum size of 100 for CNV analysis
- PON for: Exome comprehensive 10.2
- PON for: GMSsolid 15.2
- PON for: GMCKsolid 4.2

#### Changed
- updated PON for GMCKSolid v4.1
- updated PON for GMSMyeloid v5.3
- updated PON for GMSlymphoid v7.3

Merged this PR into this one: #1448

#### Added
- Script to post-process CNVkit output to GENS-format
- DNAscope gnomad calling to TGA for GENS

#### Changed
- Parsing of GENS arguments changed to account for TGA

Merged this PR: #1475 into this one

#### Changed
- Refactored rules for bcftools filters
- Renamed final UMI bamfile to ensure hsmetrics are collected in multiqc json
- Changed ranked VCF from research to clinical
- Lowered min AF for TGA from 0.007 to 0.005
- Lowered maximal SOR for TNscope in TGA tumor only cases from 3 to 2.7
- Changed filter settings for research TNscope vcf, now either PASS or triallelic_site (fixing this issue: #1293)

#### Added
- TNscope for TGA workflows, merged with VarDict results
- New filter for VarDict for tumor in normal contamination
- Export TMP environment variables to rules that lack them
- Added genmod ranked VCFs to be delivered
- Added family-id to genmod in order to get ranked variants to Scout (solved this: #1045)
- Added DP and AF to INFO-field of TNscope vcfs for ranking model
- Raw TNscope calls and unfiltered research-annotated SNVs to delivery

#### Removed
- ML-model for TNscope is removed due to license issue with new version of Sentieon
- All code associated with TNhaplotyper
- Removed research.filtered.pass VCFs from delivery and storage list
Description
This PR is a branch of:
cnvkit_to_gens: #1448
Which is a branch of:
update_cnvkit_pons: #1465
Which is a branch of:
deduplicate_with_umi #1358
Part of the development work in this PR was originally done in #1429, which was partially reviewed by Vadym.
All upstream branches affect the quality of the analysis, and the full extent of these effects will be assessed in this PR in a sort of mini-validation.
The original plan for release 16.0.0 of BALSAMIC was to replace VarDict with TNscope. All tests passed and the evaluation showed the changes to be an overall improvement in the analysis, but the testing also revealed some confusing and unstable behaviour of TNscope, so we decided to keep VarDict and merge in the TNscope results.
This PR in particular
This PR merges TNscope and VarDict results for the TGA workflows and cleans up the Snakemake rules in general, for example by removing the rules for TNhaplotyper, which has not been in use for a long time.
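As a rough illustration of what merging the two call sets means (this is not the pipeline's actual code, which runs on VCF files inside Snakemake rules), the sketch below takes the union of calls keyed on chromosome, position, ref and alt, and records which caller(s) reported each variant. The `Call` type, the `merge_calls` function and the example variants are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class Call(NamedTuple):
    """Hypothetical minimal variant key: one ALT allele per record."""
    chrom: str
    pos: int
    ref: str
    alt: str


def merge_calls(calls_by_caller: Dict[str, List[Call]]) -> Dict[Call, Set[str]]:
    """Union the calls from all callers and track which callers found each one."""
    merged: Dict[Call, Set[str]] = {}
    for caller, calls in calls_by_caller.items():
        for call in calls:
            merged.setdefault(call, set()).add(caller)
    return merged


# Made-up example calls, for illustration only.
vardict = [Call("1", 100, "A", "T"), Call("2", 200, "G", "C")]
tnscope = [Call("1", 100, "A", "T"), Call("3", 300, "C", "CA")]

merged = merge_calls({"vardict": vardict, "tnscope": tnscope})
for call, callers in sorted(merged.items()):
    print(call, sorted(callers))
```

Keeping the caller names per variant makes it possible to see afterwards whether a call was supported by one caller or both.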
For now it also removes the ML model in TNscope, since no ML model is currently available for the new version of Sentieon. Sentieon is working on a new model.
I also added a new filter to VarDict that allows up to 30% tumor-in-normal contamination, to align with what we already do for TNscope.
Beyond that, I lowered the minimum VAF for TGA from 0.007 to 0.005.
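A minimal sketch of how these two thresholds could be applied to a single variant is shown below. It assumes that the 30% limit is applied to the ratio of normal AF to tumor AF and that the minimum VAF is inclusive; both are assumptions made for illustration, and the function and constant names are hypothetical. The actual filters in this PR are implemented as bcftools filter rules.

```python
from typing import Optional

MIN_TUMOR_AF = 0.005         # lowered from 0.007 in this PR
MAX_TUMOR_IN_NORMAL = 0.3    # assumed to mean normal AF / tumor AF


def passes_tga_filters(tumor_af: float, normal_af: Optional[float] = None) -> bool:
    """Return True if the variant survives the min-VAF and tumor-in-normal checks."""
    if tumor_af < MIN_TUMOR_AF:
        return False
    # The tumor-in-normal check only applies when a matched normal exists.
    if normal_af is not None and normal_af / tumor_af > MAX_TUMOR_IN_NORMAL:
        return False
    return True


print(passes_tga_filters(0.006))        # tumor only, above the new minimum
print(passes_tga_filters(0.004))        # below the minimum VAF
print(passes_tga_filters(0.10, 0.02))   # normal/tumor ratio 0.2
print(passes_tga_filters(0.10, 0.05))   # normal/tumor ratio 0.5
```

With the previous minimum of 0.007, the first example (AF 0.006) would have been filtered out.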
Ranking model:
Up to date graph of a TGA UMI T+N workflow below:
Changed
Added
Removed
Documentation
Tests
Feature Tests
Google sheet with results here: https://docs.google.com/spreadsheets/d/13qjetgWKu9rD3hxfTfL6NNv9R_KkXsIOuqntBft6JvM/edit?usp=sharing
Summary of results in google sheet
coverage
number of variants
There are some differences in the number of variants between this workflow and the previous one.
Despite merging the VarDict and TNscope variant calls, the tumor-normal cases have fewer variants in this release than in the previous one. The variants that are filtered out will be assessed in more detail in the validation, but for now it can be noted that in the TWIST pancancer reference samples (which are tumor-normal cases) the sensitivity improved compared to the last version, suggesting that the filtered-out variants are artefacts. One possible explanation is that the UMI consensus collapse performs some degree of error correction, but the primary reason is probably the new tumor-normal filter added to the TGA analysis to match the one already implemented in WGS (removing variants with 30% tumor presence in the normal).
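To make the reasoning concrete, here is a small sketch with made-up variant sets (not data from the Google sheet) showing how a release can report fewer variants in total while still having higher sensitivity against a truth set. Sensitivity is computed here as the fraction of truth variants that were called; the `sensitivity` helper and all variant labels are hypothetical.

```python
from typing import Set


def sensitivity(called: Set[str], truth: Set[str]) -> float:
    """Fraction of truth-set variants that are present among the calls."""
    if not truth:
        raise ValueError("truth set must not be empty")
    return len(called & truth) / len(truth)


# Made-up example: four known variants in a reference sample.
truth = {"v1", "v2", "v3", "v4"}

# Previous release: five calls, two true variants and three artefacts.
previous = {"v1", "v2", "a1", "a2", "a3"}

# This release: four calls, three true variants and one artefact.
current = {"v1", "v2", "v3", "a1"}

print(len(previous), sensitivity(previous, truth))
print(len(current), sensitivity(current, truth))
```

In this toy case the call count drops from five to four while sensitivity rises from 0.5 to 0.75, which is the pattern that would be expected if the removed calls are mostly artefacts.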
From an earlier version of testing:
After adding new TNscope option: --trim-soft-clip
I will inform the customers who commonly order myeloid analysis and ask for their opinion on the increase in the number of variants.
horizon FLT3 variants
horizon SNV and InDels
TWIST pancancer reference samples
SeraCare variants
Myeloid variants
CNV profile in GENS
Is the CNV profile visualisation in GENS working?
CNV profile links here
CNV profile in GENS with and without using PON
Five PONs were created in this upstream PR, which has not been fully evaluated:
update_cnvkit_pons: #1465
However, we do not possess a set of cases with known CNVs, and we have not validated the CNV analysis in the past. Even if we had such cases (which would be very nice), the best way to evaluate the PONs would still be to study the CNV profile by eye and determine whether it is easier to interpret with the PON or without.
To save time, I would suggest that we do this evaluation in the validation itself, which is possible for this feature specifically because adding a PON does not impact the code itself. If a PON is deemed to produce noisy results, we can simply not add it to the reference directory.
Feature Tests
Pipeline Integrity Tests
.hk file)
Clinical Genomics Stockholm
Documentation
Panel of Normal specific criteria
HOWEVER! We still need approval for 2 of the PoNs #1465
User Changes
Infrastructure Changes
Checklist
Important
Ensure that all checkboxes below are ticked before merging.
For Developers
For Reviewers
conditions where applicable, with satisfactory results.