feat: merge vardict and tnscope #1475

mathiasbio · 2024-08-23T09:16:33Z

Description

This PR is a branch of:
cnvkit_to_gens: #1448
-->
Which is a branch of:
update_cnvkit_pons: #1465
-->
Which is a branch of:
deduplicate_with_umi #1358

Part of the development work in this PR was originally done in #1429, which was partially reviewed by Vadym.

All upstream branches affect the quality of the analysis, and the full extent of these effects will be assessed in this PR in a sort of mini-validation.

The original plan for release 16.0.0 of balsamic was to replace VarDict with TNscope. All the tests were passing and the evaluation showed the changes to be an overall improvement in the analysis, but the testing also revealed some confusing and unstable behaviour of TNscope, so we decided to keep VarDict and merge the TNscope results with it.

This PR in particular

This PR merges TNscope and VarDict results for the TGA workflows, and cleans up the snakemake rules a bit in general, such as removing the rules for TNhaplotyper which has not been in use for a long time.
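The idea of merging two call sets can be sketched as a union keyed on position and alleles, with a record of which caller reported each variant. This is only an illustration and not the pipeline's implementation: the function `merge_calls`, the `(chrom, pos, ref, alt)` key, and the `FOUND_IN` label are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical minimal variant key: (chrom, pos, ref, alt).
VariantKey = Tuple[str, int, str, str]


def merge_calls(
    callsets: Dict[str, Iterable[VariantKey]],
) -> Dict[VariantKey, List[str]]:
    """Union variant calls and record which callers reported each variant."""
    merged: Dict[VariantKey, List[str]] = {}
    for caller, variants in callsets.items():
        for key in variants:
            callers = merged.setdefault(key, [])
            if caller not in callers:
                callers.append(caller)
    return merged


# Toy call sets, invented for the example.
vardict = [("1", 100, "A", "T"), ("2", 200, "G", "C")]
tnscope = [("1", 100, "A", "T"), ("3", 300, "C", "CT")]

merged = merge_calls({"tnscope": tnscope, "vardict": vardict})
for key in sorted(merged):
    # FOUND_IN is an illustrative label, not a confirmed INFO tag name.
    print(key, "FOUND_IN=" + "|".join(merged[key]))
```

Variants reported by both callers end up with both names listed, while caller-unique variants keep a single name, which makes it easy to count what each caller adds.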

For now it also removes the ML model from TNscope, since no ML model is currently available for the new version of Sentieon. Sentieon is working on a new model.

I also added a new filter to VarDict that allows up to 30% tumor-in-normal contamination, to align with what we're doing for TNscope.
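A minimal sketch of how such a tumor-in-normal check could look, assuming the 30% threshold is applied as a ratio between the allele frequency in the normal and in the tumor. That interpretation, the function name `passes_tumor_in_normal`, and the parameter `max_ratio` are assumptions for illustration; the actual filter expression used in the pipeline is not shown here.

```python
def passes_tumor_in_normal(
    tumor_af: float, normal_af: float, max_ratio: float = 0.3
) -> bool:
    """Keep a variant if the normal AF is at most max_ratio times the tumor AF.

    Assumption: "30% of tumor in normal" is read as normal_af / tumor_af <= 0.3.
    """
    if tumor_af <= 0:
        # Without tumor support there is nothing to compare against.
        return False
    return normal_af <= max_ratio * tumor_af


# Normal AF is 20% of the tumor AF: kept.
print(passes_tumor_in_normal(tumor_af=0.10, normal_af=0.02))
# Normal AF is 50% of the tumor AF: removed.
print(passes_tumor_in_normal(tumor_af=0.10, normal_af=0.05))
```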

Beyond that I lowered the minimum VAF for TGA from 0.007 to 0.005.

Ranking model:

I also realized that the ranking of the VCFs was only done for TGA samples, not for WGS or UMI cases. Since testing the ranking of VCFs for all workflows requires it, I added the ranking to these workflows as well. However, the ranking model requires the DP and AF fields to be present in the INFO field of the VCFs, which led to me adding a script that adds them for TNscope.
The ranked VCFs were also not saved in housekeeper, which I am now adding.
Finally, the ranked VCF was not annotated with clinical databases, which could be important, so I moved the ranking from research.filtered.pass to clinical.
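Copying DP and AF into the INFO field can be sketched on a single VCF record as below. This is a rough illustration under stated assumptions, not the script added in this PR: it assumes the FORMAT column contains `AD` and `AF`, that the tumor is the first sample column, and that depth can be taken as the sum of `AD`. The function name `add_dp_af_to_info` is hypothetical, and the matching `##INFO` header lines are not handled.

```python
def add_dp_af_to_info(vcf_line: str, sample_index: int = 0) -> str:
    """Append DP (sum of AD) and AF from one sample column to the INFO field."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    sample = dict(zip(format_keys, sample_values))

    # Assumption: depth is approximated by summing the allelic depths.
    depth = sum(int(count) for count in sample["AD"].split(","))
    extra = f"DP={depth};AF={sample['AF']}"

    # An empty INFO field is written as "." in VCF.
    fields[7] = extra if fields[7] in (".", "") else f"{fields[7]};{extra}"
    return "\t".join(fields)


# Toy record, invented for the example.
record = "1\t100\t.\tA\tT\t.\tPASS\tSOR=1.2\tGT:AD:AF\t0/1:90,10:0.1"
print(add_dp_af_to_info(record))
```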

Up to date graph of a TGA UMI T+N workflow below:

Changed

Refactored rules for bcftools filters
Renamed final UMI bamfile to ensure hsmetrics are collected in multiqc json
Changed ranked VCF from research to clinical
Lowered min AF for TGA from 0.007 to 0.005
Lowered maximal SOR for TNscope in TGA tumor only cases from 3 to 2.7
Changed filter settings for research TNscope vcf, now either PASS or triallelic_site (fixing this issue: correct bcftools filter with triallelic sites for research vcf #1293)
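The "either PASS or triallelic_site" rule for the research TNscope VCF can be illustrated on the FILTER column alone. This is a sketch, assuming a record is kept only when every value in its FILTER field is one of those two; the real filtering is done with bcftools and its exact expression is not reproduced here. The function name `keep_research_record` is hypothetical.

```python
ALLOWED_FILTERS = {"PASS", "triallelic_site"}


def keep_research_record(filter_field: str) -> bool:
    """Return True if every FILTER value is PASS or triallelic_site."""
    filters = set(filter_field.split(";"))
    return filters <= ALLOWED_FILTERS


for value in ["PASS", "triallelic_site", "triallelic_site;other_filter", "."]:
    print(value, keep_research_record(value))
```

`other_filter` is a made-up filter name used only to show a rejected record.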

Added

TNscope for TGA workflows, merged with VarDict results
New filter for VarDict for tumor in normal contamination
Export TMP environment variables to rules that lack them
Added genmod ranked VCFs to be delivered
Added family-id to genmod in order to get ranked variants to Scout (solved this: GenMod family missing for ranked files #1045)
Added DP and AF to INFO-field of TNscope vcfs for ranking model
Raw TNscope calls and unfiltered research-annotated SNVs to delivery

Removed

ML-model for TNscope is removed due to license issue with new version of Sentieon
All code associated with TNhaplotyper
Removed research.filtered.pass VCFs from delivery and storage list

Documentation

N/A
Updated Balsamic documentation to reflect the changes as needed for this PR.
- balsamic_filters
- balsamic_methods
- balsamic_pon

Tests

Feature Tests

Google sheet with results here: https://docs.google.com/spreadsheets/d/13qjetgWKu9rD3hxfTfL6NNv9R_KkXsIOuqntBft6JvM/edit?usp=sharing

Quality of TGA analysis should be the same or improved for samples (compared to v15.0.0 validation)

Summary of results in google sheet

coverage

Coverage is improved across all samples tested after the update to the BAM creation

number of variants

Number of variants has been assessed in all workflows

There are some differences in the number of variants between this workflow and the previous.

Fewer SNVs in tumor normal TGA analysis

Despite merging the VarDict and TNscope variant calls, the tumor normal cases have fewer variants in this release than in the previous one. The filtered-out variants will be assessed in more detail in the validation, but for now it can be noted that in the TWIST pancancer reference samples (which are tumor normal cases) the sensitivity improved compared to the last version, suggesting that the filtered-out variants are artefacts. One possible explanation is that the UMI consensus collapse performs some degree of error correction, but the primary reason is probably the new tumor-in-normal filter added to the TGA analysis to match the one already implemented in WGS (removing variants if they had 30% of tumor presence in normal).

More SNVs in tumor only TGA analysis

From earlier version of testing:

In the example case run for TGA tumor only there was a 125% increase in the number of SNVs in the new version after merging the results. This is very different from the tumor-only WES case, where the number of variants actually decreased by 3.76%. The difference likely comes from the sequencing depth of these cases: the more highly covered TGA case makes it possible to detect more low-frequency variants, where the callers' algorithms are more likely to diverge, leading to more unique variants added by TNscope than in the WES case.

I looked at a subset of the extra variants called in this TGA case in IGV to determine whether they seemed like valuable additions, and saw an issue with TNscope calling variants in soft-clipped ends. TNscope has an option for trimming soft-clipped ends, which I will try in order to fix this.

After adding new TNscope option: --trim-soft-clip

The number of SNVs in the same tumor-only case increased by 98%. I verified in IGV that the variants previously called in soft-clipped ends were now gone. Still, the number of variants in this case is almost double what it was previously.

I looked in IGV at the extra variants called, and many of them looked like true SNVs. Others were InDels that were maybe more dubious, as they were called in quite repetitive regions. These should be less impactful for the clinicians, though, as they occur more frequently in introns and intergenic regions and should be filtered out by most filters in Scout.

My guess is that, given the clear increase in sensitivity in the tumor normal TWIST pancancer cases, TNscope likely adds many true positive variants in the tumor-only case as well. However, the effect on sensitivity and precision in the tumor-only analysis is unclear, as we do not have enough true positive variants to measure it.

I will inform the customers who commonly order myeloid analysis to get their opinion on the increase in the number of variants.

horizon FLT3 variants

FLT3 variants are detected in all cases (Yes, but they are missing in all TNscope calls and in one Manta call)

horizon SNV and InDels

All variants are still detected at reasonable VAFs

TWIST pancancer reference samples

Sensitivity and precision are of equal or better quality (better quality across all 3 samples)
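For reference, sensitivity and precision against a truth set can be computed as in the sketch below. It assumes variants are compared as exact `(chrom, pos, ref, alt)` matches and that all calls fall within regions where the truth is known; the function name and the toy data are made up for illustration and are not the evaluation code behind the google sheet.

```python
from typing import Set, Tuple

VariantKey = Tuple[str, int, str, str]


def sensitivity_precision(
    called: Set[VariantKey], truth: Set[VariantKey]
) -> Tuple[float, float]:
    """Return (sensitivity, precision) for a call set against a truth set."""
    true_pos = len(called & truth)
    false_neg = len(truth - called)
    false_pos = len(called - truth)
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    precision = true_pos / (true_pos + false_pos) if called else float("nan")
    return sensitivity, precision


# Toy data: three of four truth variants are found, plus one extra call.
truth = {("1", 10, "A", "T"), ("1", 20, "G", "C"), ("2", 30, "C", "T"), ("3", 40, "T", "A")}
called = {("1", 10, "A", "T"), ("1", 20, "G", "C"), ("2", 30, "C", "T"), ("4", 50, "G", "A")}
print(sensitivity_precision(called, truth))
```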

SeraCare variants

All seracare variants are still detected at reasonable VAFs

Myeloid variants

All variants are still detected

CNV profile in GENS

Is the CNV profile visualisation in GENS working?

CNV profile links here

gmcksolid_4.1_hg19_design.bed with known EGFR duplication observed in CNV profile
twistexomecomprehensive 10.2 case with clean CNV profile

CNV profile in GENS with and without using PON

5 PONs have been created in this upstream PR, which has not been fully evaluated:
update_cnvkit_pons: #1465

gmcksolid 4.1
gmcksolid 4.2
gmslymphoid 7.3
gmsmyeloid 5.3
gmssolid 15.2
twistexomecomprehensive 10.2

However, we do not possess a set of cases with known CNVs, and have not validated the CNV analysis in the past. And even if we had such cases (which would be very nice), the best way to evaluate the PONs is still to study the CNV profile by eye and determine whether the profile is easier to interpret with the PON or without.

To save time I would suggest that we do this evaluation in the validation itself, which is uniquely possible for this feature specifically as adding a PON does not impact the code itself. If a PON would be deemed to produce noisy results we can simply not add it to the reference directory.

PONs are beneficial to the CNV analysis (To be evaluated in validation step)

Feature Tests

N/A
Test [Description]
- [Screenshot]

Pipeline Integrity Tests

Report deliver (generation of the .hk file)
- N/A
- Verified
TGA T/O Workflow
- N/A
- Verified
TGA T/N Workflow
- N/A
- Verified
UMI T/O Workflow
- N/A
- Verified
UMI T/N Workflow
- N/A
- Verified
WGS T/O Workflow
- N/A
- Verified
WGS T/N Workflow
- N/A
- Verified
QC Workflow
- N/A
- Verified
PON Workflow
- N/A
- Verified

Clinical Genomics Stockholm

Documentation

Atlas documentation
- N/A
- Updated sentieon version and lists of PONs in production (though liable to change if PON proves noisy in validation): https://github.com/Clinical-Genomics/atlas/pull/2859
Web portal for Clinical Genomics
- N/A
- Updated lists of TGA delivered files: https://github.com/Clinical-Genomics/clinical-genomics-ui/issues/538

Panel of Normal specific criteria

The PR includes the addition of a new Panel of Normals
The samples have been verified to adhere to the sample selection criteria on Atlas PoN creation instructions for Balsamic

HOWEVER! We still need approval for 2 of the PoNs #1465

User Changes

N/A
This PR affects the output files or results.
- User feedback is considered unnecessary because [Justification].
- Affected users have been included in the development process and given a chance to provide feedback.
- Feedback from Myeloid customers regarding number of SNVs received and taken into consideration

Infrastructure Changes

Stored files in Housekeeper
- N/A
- Updated: Updating files and tags for release 16.0.0 of balsamic hermes#116
CG (CLI and delivered/uploaded files)
- N/A
- Updated: Balsamic v16 release cg#3408
Servers (configuration files on Hasta)
- N/A
- Updated: [Link]
Scout interface
- N/A
- Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

PR Description
- Provided a comprehensive description of the PR.
- Linked relevant user stories or issues to the PR.
Documentation
- Verified and updated documentation if necessary.
Tests
- Described and tested the functionality addressed in the PR.
- Ensured integration of the new code with existing workflows. (Tested with PRs linked in Balsamic v16 release cg#3408)
- Confirmed that meaningful unit tests were added for the changes introduced.
- Checked that the PR has successfully passed all relevant code smells and coverage checks.
Review
- Addressed and resolved all the feedback provided during the code review process.
- Obtained final approval from designated reviewers.

For Reviewers

Code
- Code implements the intended features or fixes the reported issue.
- Code follows the project's coding standards and style guide.
Documentation
- Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
Tests
- The author provided a description of their manual testing, including consideration of edge cases and boundary
  conditions where applicable, with satisfactory results.
Review
- Confirmed that the developer has addressed all the comments during the code review.

khurrammaqbool

Looks good 👍 💯 🥇
As we discussed:
Please update the docs to include the filters used and a separate section for FLT3.
Please remove the intermediate research.filter.pass files

sonarcloud · 2024-10-14T14:19:01Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

#### Added
- UMI extraction and deduplication to TGA workflow
- Adapter trimming of fastqs to UMI workflow
- Cap base quality in bam for Manta input
#### Changed
- Refactored multi workflow rule-files to separate files to decrease complexity
- Refactored output files to in general comply with format {sample_type}.{sample_name}
- Replaced Picard QC tools with matching Sentieon QC tools
#### Removed
- UMI specific rules for UMI-extraction and alignment (using new TGA-rules instead)
- Fastq and UMI trimming command-line options

Merged this PR into this one: #1465
#### Added
- Added extension of target bed regions to a minimum size of 100 for CNV analysis
- PON for: Exome comprehensive 10.2
- PON for: GMSsolid 15.2
- PON for: GMCKsolid 4.2
#### Changed
- updated PON for GMCKSolid v4.1
- updated PON for GMSMyeloid v5.3
- updated PON for GMSlymphoid v7.3

Merged this PR into this one: #1448
#### Added
- Script to post-process CNVkit output to GENS-format
- DNAscope gnomad calling to TGA for GENS
#### Changed
- Parsing of GENS arguments changed to account for TGA

Merged this PR: #1475 into this one
#### Changed
- Refactored rules for bcftools filters
- Renamed final UMI bamfile to ensure hsmetrics are collected in multiqc json
- Changed ranked VCF from research to clinical
- Lowered min AF for TGA from 0.007 to 0.005
- Lowered maximal SOR for TNscope in TGA tumor only cases from 3 to 2.7
- Changed filter settings for research TNscope vcf, now either PASS or triallelic_site (fixing this issue: #1293)
#### Added
- TNscope for TGA workflows, merged with VarDict results
- New filter for VarDict for tumor in normal contamination
- Export TMP environment variables to rules that lack them
- Added genmod ranked VCFs to be delivered
- Added family-id to genmod in order to get ranked variants to Scout (solved this: #1045)
- Added DP and AF to INFO-field of TNscope vcfs for ranking model
- Raw TNscope calls and unfiltered research-annotated SNVs to delivery
#### Removed
- ML-model for TNscope is removed due to license issue with new version of Sentieon
- All code associated with TNhaplotyper
- Removed research.filtered.pass VCFs from delivery and storage list

mathiasbio added 30 commits June 11, 2024 11:11

remove dbsnp from TNscope test

5997d25

add gens functionality to tga

e6516d6

add gens inputs

7f6e521

add gens gnomad af to tga

24e41a9

black

cdbe43f

fix

f262919

fix

c2aafb9

fix

cc9e506

fix

d471395

fix

05bd3ae

doc strings and named args

6868556

black

b03a9f0

changelog

582cfb0

bug fix

e9c4c7f

fix

8f88da7

lower padding

4a4671e

add tumor purity adjustment to gens cov file

5bcf768

typehints

acde38f

black

ed4c018

Merge branch 'deduplicate_with_umi' of github.com:Clinical-Genomics/BALSAMIC into deduplicate_with_umi

8cd772a

fix merge conflicts

b7f13ad

fix pytests

a108a0e

update purity adjustment formula

99d3d57

Merge branch 'develop' of github.com:Clinical-Genomics/BALSAMIC into develop

bf57f92

merge develop

d69c3c7

m conflict

44f25e1

m conflict

9335478

Merge branch 'deduplicate_with_umi' of github.com:Clinical-Genomics/BALSAMIC into deduplicate_with_umi

a01dfed

fix merge conflicts

1f79849

fix bugs

af7c4d5

mathiasbio linked an issue Sep 24, 2024 that may be closed by this pull request

[User Story] Merge VarDict and TNscope results #1480

Open

3 tasks

mathiasbio mentioned this pull request Sep 25, 2024

[User Story] Artefact databases for SNVs and InDels #1377

Open

3 tasks

mathiasbio added 2 commits October 1, 2024 11:14

add found in to vcfs

2ae9348

Merge branch 'merge_vardict_tnscope' of github.com:Clinical-Genomics/BALSAMIC into merge_vardict_tnscope

88820b5

mathiasbio mentioned this pull request Oct 1, 2024

Missing caller info for SNVs in cancer-cases (except for VarDict) Clinical-Genomics/scout#4901

Closed

add rankscore familyid

79b084a

mathiasbio linked an issue Oct 2, 2024 that may be closed by this pull request

GenMod family missing for ranked files #1045

Open

mathiasbio mentioned this pull request Oct 2, 2024

GenMod family missing for ranked files #1045

Open

mathiasbio added 2 commits October 9, 2024 10:40

switch orders of tnscope and vardict for merging

bfd9f1a

bug fix

1c2ee4a

mathiasbio linked an issue Oct 9, 2024 that may be closed by this pull request

correct bcftools filter with triallelic sites for research vcf #1293

Open

Base automatically changed from cnvkit_to_gens to update_cnvkit_pons October 11, 2024 11:09

Base automatically changed from update_cnvkit_pons to deduplicate_with_umi October 11, 2024 11:10

mathiasbio mentioned this pull request Oct 11, 2024

feat: deduplicate with UMIs #1358

Merged

57 tasks

mathiasbio added 6 commits October 11, 2024 14:42

add back tiddit

8e594a0

fix conflicts

8de74df

black

dfe8958

fix

1fbdc23

fix umi

5e25757

remove accidentally added testfile

5a11a97

khurrammaqbool approved these changes Oct 14, 2024

mathiasbio added 3 commits October 14, 2024 15:45

remove research filtered pass from deliver

bc29b8e

update docs

00c6e14

changelog

54c2d4b

mathiasbio merged commit 57bdbe0 into deduplicate_with_umi Oct 14, 2024
7 checks passed

mathiasbio deleted the merge_vardict_tnscope branch October 14, 2024 14:39

mathiasbio mentioned this pull request Oct 17, 2024

feat: Release v16.0.0 #1491

Merged

15 tasks

mathiasbio linked an issue Oct 22, 2024 that may be closed by this pull request

[User Story] Tumor in normal contamination filter in VarDict #1494

Closed

3 tasks
