feat: merge vardict and tnscope #1475

Merged on Oct 14, 2024 (322 commits)

@mathiasbio mathiasbio commented Aug 23, 2024

Description

This PR is a branch of cnvkit_to_gens (#1448),
which is a branch of update_cnvkit_pons (#1465),
which is a branch of deduplicate_with_umi (#1358).

Part of the development work in this PR was originally done in #1429, which was partially reviewed by Vadym.

All upstream branches affect the quality of the analysis, and the full extent of these effects will be assessed in this PR in a sort of mini-validation.

The original plan for release 16.0.0 of Balsamic was to replace VarDict with TNscope. All tests passed and the evaluation showed an overall improvement in the analysis, but the testing also revealed some confusing and unstable behaviour of TNscope, so we decided to keep VarDict and merge in the TNscope results.

This PR in particular

This PR merges TNscope and VarDict results for the TGA workflows, and cleans up the snakemake rules a bit in general, such as removing the rules for TNhaplotyper which has not been in use for a long time.
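As a rough illustration of what merging the two callers' results means, here is a minimal sketch. It assumes a plain union in which each variant is keyed by chromosome, position, REF and ALT, and it records which caller reported each variant. The function name, the key and the toy calls are hypothetical; the pipeline itself does this on VCF files in Snakemake rules, which are not reproduced here.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: a variant is identified by (chrom, pos, ref, alt).
VariantKey = Tuple[str, int, str, str]


def merge_callsets(callsets: Dict[str, Iterable[VariantKey]]) -> Dict[VariantKey, Set[str]]:
    """Return the union of all call sets, mapping each variant to its callers."""
    merged: Dict[VariantKey, Set[str]] = {}
    for caller, variants in callsets.items():
        for key in variants:
            merged.setdefault(key, set()).add(caller)
    return merged


# Hypothetical toy calls, not taken from the validation data.
vardict_calls = [("1", 100, "A", "T"), ("2", 200, "G", "C")]
tnscope_calls = [("1", 100, "A", "T"), ("3", 300, "C", "CA")]

merged = merge_callsets({"vardict": vardict_calls, "tnscope": tnscope_calls})
for key in sorted(merged):
    print(key, sorted(merged[key]))
```

Keeping the caller names per variant makes it easy to see which calls are unique to TNscope, which is the group inspected in IGV further down.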

For now it also removes the ML-model in TNscope since there is currently no available ML model supported for the new version of Sentieon. They are working on making a new model.

I also added a new filter to VarDict for allowing 30% of tumor in normal contamination to align with what we're doing for TNscope.

Beyond that, I lowered the minimum VAF for TGA from 0.007 to 0.005.

Ranking model:

  • I also realized that the ranking of the VCFs was only done for TGA samples, not for WGS or UMI cases. As testing the ranking of VCFs for all workflows requires it, I added the ranking to these workflows as well. However, the ranking model requires that the DP and AF fields are saved in the INFO field of the VCFs, which led me to add a script that adds them for TNscope.
  • The ranked VCFs were also not saved in housekeeper which I am now adding.
  • Finally, the ranked VCF was not annotated with clinical databases, which could be important. So I moved the ranking from research.filtered.pass to clinical.
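The idea of copying DP and AF into the INFO field can be sketched as follows. This is not the script added in the PR; it is a minimal illustration that assumes the tumor sample is the first sample column, that FORMAT carries `AD` and `AF`, and that depth can be taken as the sum of `AD`. The function name and the example record are hypothetical.

```python
def add_dp_af_to_info(vcf_line: str, tumor_index: int = 0) -> str:
    """Copy tumor depth (sum of AD) and AF from FORMAT into INFO for one VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]
    format_keys = fields[8].split(":")
    # Pair each FORMAT key with the tumor sample's value.
    sample = dict(zip(format_keys, fields[9 + tumor_index].split(":")))
    depth = sum(int(count) for count in sample["AD"].split(","))
    new_tags = f"DP={depth};AF={sample['AF']}"
    # An empty INFO field is written as "." in VCF.
    fields[7] = new_tags if info in (".", "") else f"{info};{new_tags}"
    return "\t".join(fields)


# Hypothetical record: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR
record = "1\t100\t.\tA\tT\t.\tPASS\tSOR=1.2\tGT:AD:AF\t0/1:90,10:0.1"
annotated = add_dp_af_to_info(record)
print(annotated)
```

A complete script would also need to add matching `##INFO` header lines and handle records that already carry `DP` or `AF` in INFO, both of which are left out here.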

Up to date graph of a TGA UMI T+N workflow below:

[Image: TGA UMI T+N workflow graph]

Changed

  • Refactored rules for bcftools filters
  • Renamed final UMI bamfile to ensure hsmetrics are collected in multiqc json
  • Changed ranked VCF from research to clinical
  • Lowered min AF for TGA from 0.007 to 0.005
  • Lowered maximal SOR for TNscope in TGA tumor only cases from 3 to 2.7
  • Changed filter settings for research TNscope vcf, now either PASS or triallelic_site (fixing this issue: correct bcftools filter with triallelic sites for research vcf #1293)
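The changed thresholds can be read as simple predicates. The sketch below only restates the values listed above in Python; the pipeline applies them through bcftools filter rules, whose expressions are not shown here. The function names are hypothetical, and whether the boundaries are inclusive and whether FILTER must match exactly are assumptions.

```python
MIN_AF_TGA = 0.005         # lowered from 0.007 in this PR
MAX_SOR_TNSCOPE_TO = 2.7   # lowered from 3 for TGA tumor-only cases
RESEARCH_FILTERS = {"PASS", "triallelic_site"}


def keep_in_research_vcf(filter_field: str) -> bool:
    """Keep a TNscope record if FILTER is exactly PASS or triallelic_site (assumed exact match)."""
    return filter_field in RESEARCH_FILTERS


def passes_tga_tumor_only(af: float, sor: float) -> bool:
    """Apply the AF floor and the SOR ceiling (inclusive bounds are an assumption)."""
    return af >= MIN_AF_TGA and sor <= MAX_SOR_TNSCOPE_TO


print(keep_in_research_vcf("triallelic_site"))
print(passes_tga_tumor_only(af=0.006, sor=2.0))
```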

Added

  • TNscope for TGA workflows, merged with VarDict results
  • New filter for VarDict for tumor in normal contamination
  • Export TMP environment variables to rules that lack them
  • Added genmod ranked VCFs to be delivered
  • Added family-id to genmod in order to get ranked variants to Scout (solved this: GenMod family missing for ranked files #1045)
  • Added DP and AF to INFO-field of TNscope vcfs for ranking model
  • Raw TNscope calls and unfiltered research-annotated SNVs to delivery

Removed

  • ML-model for TNscope is removed due to license issue with new version of Sentieon
  • All code associated with TNhaplotyper
  • Removed research.filtered.pass VCFs from delivery and storage list

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • balsamic_filters
    • balsamic_methods
    • balsamic_pon

Tests

Feature Tests

Google sheet with results here: https://docs.google.com/spreadsheets/d/13qjetgWKu9rD3hxfTfL6NNv9R_KkXsIOuqntBft6JvM/edit?usp=sharing

  • Quality of TGA analysis should be the same or improved for samples (compared to v15.0.0 validation)

Summary of results in google sheet

coverage

  • Coverage is improved across all samples tested after the update to the BAM creation

number of variants

  • Number of variants has been assessed in all workflows

There are some differences in the number of variants between this workflow and the previous.

  • Fewer SNVs in tumor normal TGA analysis

Despite merging the VarDict and TNscope variant calls, the tumor-normal cases have fewer variants in this release than in the previous one. The filtered-out variants will be assessed in more detail in the validation, but for now it can be noted that in the TWIST pancancer reference samples (which are tumor-normal cases) the sensitivity improved compared to the last version, suggesting that the filtered-out variants are artefacts. One possible explanation is that the UMI consensus collapse performs some degree of error correction, but the primary reason is probably the new tumor-normal filter added to the TGA analysis to match the one already implemented in WGS (removing variants if they had 30% of tumor presence in normal).
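One way to read the tumor-in-normal filter is as a ratio test. The sketch below assumes that "30% of tumor presence in normal" means the allele frequency in the normal sample exceeds 30% of the allele frequency in the tumor sample; that interpretation, the function name and the example values are assumptions, not the filter expression used in the pipeline.

```python
MAX_TUMOR_IN_NORMAL = 0.30  # threshold named in this PR


def exceeds_tumor_in_normal(tumor_af: float, normal_af: float,
                            max_ratio: float = MAX_TUMOR_IN_NORMAL) -> bool:
    """Return True if normal AF is more than max_ratio of tumor AF (assumed definition)."""
    if tumor_af <= 0:
        raise ValueError("tumor_af must be positive")
    return normal_af / tumor_af > max_ratio


# Hypothetical values: 2% in normal vs 20% in tumor is kept, 10% vs 20% is removed.
print(exceeds_tumor_in_normal(tumor_af=0.20, normal_af=0.02))
print(exceeds_tumor_in_normal(tumor_af=0.20, normal_af=0.10))
```

Under this reading, a variant with clear support in the normal sample is treated as a likely germline variant or artefact and removed, which would explain the lower variant counts in tumor-normal cases.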

  • More SNVs in tumor only TGA analysis

From an earlier round of testing:

In the example case run for TGA tumor only, there was a 125% increase in the number of SNVs in the new version after merging the results. This is very different from the comparison of the tumor-only WES case, where the number of variants actually decreased by 3.76%. This likely comes from the difference in sequencing depth between these cases: the more deeply covered TGA case opens up the possibility of detecting more low-frequency variants, where the callers' algorithms are more likely to diverge, leading to more unique variants added by TNscope than in the WES case.

I looked at a subset of the extra variants called in this TGA case in IGV to determine whether they seemed like valuable additions, and saw an issue with TNscope calling variants in soft-clipped ends. TNscope has an option for trimming soft-clipped ends, which I will try in order to fix this.

After adding new TNscope option: --trim-soft-clip

The number of SNVs in the same tumor-only case now increased by 98%. I verified in IGV that the variants previously called in soft-clipped ends were now gone. Still, the number of variants in this case is almost double what it was previously.

I looked in IGV at the extra variants called. Many of them looked like true SNVs; others were InDels that were more dubious, as they were called in quite repetitive regions. These should be less impactful for the clinicians, though, as they occur more frequently in introns and intergenic regions and should be filtered out by most filters in Scout.

My guess is that, given the clear increase in sensitivity in the tumor-normal TWIST pancancer cases, TNscope likely also adds many true positive variants in the tumor-only case. However, the effect on sensitivity and precision in the tumor-only analysis is unclear, as we do not have enough true positive variants to measure it.

I will inform the customers that commonly order myeloid analysis to get their opinion on the increase in number of variants.

horizon FLT3 variants

  • FLT3 variants are detected in all cases (Yes, but they are missing from all TNscope calls and from one Manta call)

horizon SNV and InDels

  • All variants are still detected at reasonable VAFs

TWIST pancancer reference samples

  • Sensitivity and precision are of equal or better quality (better quality across all 3 samples)
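For reference, sensitivity and precision against a truth set can be computed as in the sketch below. It assumes the standard definitions (TP / (TP + FN) and TP / (TP + FP)) and that called and truth variants are compared as exact matches; the function name and the toy variants are hypothetical and do not come from the TWIST reference data.

```python
from typing import Set, Tuple


def sensitivity_precision(called: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Return (sensitivity, precision) for a call set against a truth set."""
    true_positives = len(called & truth)
    false_positives = len(called - truth)
    false_negatives = len(truth - called)
    sensitivity = true_positives / (true_positives + false_negatives) if truth else float("nan")
    precision = true_positives / (true_positives + false_positives) if called else float("nan")
    return sensitivity, precision


# Hypothetical toy example: 3 of 4 truth variants found, plus 1 extra call.
truth_variants = {"1:100:A>T", "2:200:G>C", "3:300:C>CA", "4:400:T>G"}
called_variants = {"1:100:A>T", "2:200:G>C", "3:300:C>CA", "5:500:A>G"}
print(sensitivity_precision(called_variants, truth_variants))
```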

SeraCare variants

  • All seracare variants are still detected at reasonable VAFs

Myeloid variants

  • All variants are still detected

CNV profile in GENS

Is the CNV profile visualisation in GENS working?

CNV profile links here

  • gmcksolid_4.1_hg19_design.bed with known EGFR duplication observed in CNV profile
  • twistexomecomprehensive 10.2 case with clean CNV profile

CNV profile in GENS with and without using PON

Six PONs were created or updated in this upstream PR and have not been fully evaluated:
update_cnvkit_pons: #1465

  • gmcksolid 4.1
  • gmcksolid 4.2
  • gmslymphoid 7.3
  • gmsmyeloid 5.3
  • gmssolid 15.2
  • twistexomecomprehensive 10.2

However, we do not possess a set of cases with known CNVs and have not validated the CNV analysis in the past. And even with such cases (which would be very nice to have), the best way to evaluate the PONs is still to study the CNV profile by eye and determine whether the profile is easier to interpret with the PON or without.

To save time I would suggest that we do this evaluation in the validation itself, which is uniquely possible for this feature specifically as adding a PON does not impact the code itself. If a PON would be deemed to produce noisy results we can simply not add it to the reference directory.

  • PONs are beneficial to the CNV analysis (To be evaluated in validation step)

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

Panel of Normal specific criteria

HOWEVER! We still need approval for 2 of the PoNs #1465

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.
    • Feedback from Myeloid customers regarding number of SNVs received and taken into consideration

Infrastructure Changes

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows. (Tested with PRs linked in Balsamic v16 release cg#3408)
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.

@mathiasbio mathiasbio linked an issue Sep 24, 2024 that may be closed by this pull request
3 tasks
@mathiasbio mathiasbio linked an issue Oct 2, 2024 that may be closed by this pull request
@mathiasbio mathiasbio linked an issue Oct 9, 2024 that may be closed by this pull request
Base automatically changed from cnvkit_to_gens to update_cnvkit_pons October 11, 2024 11:09
Base automatically changed from update_cnvkit_pons to deduplicate_with_umi October 11, 2024 11:10
@mathiasbio mathiasbio mentioned this pull request Oct 11, 2024
57 tasks

@khurrammaqbool khurrammaqbool left a comment

Looks good 👍 💯 🥇
As we discussed:
Please update the docs to include the filters used and a separate section for FLT3.
Please remove the intermediate research.filter.pass files

sonarcloud bot commented Oct 14, 2024

@mathiasbio mathiasbio merged commit 57bdbe0 into deduplicate_with_umi Oct 14, 2024
7 checks passed
@mathiasbio mathiasbio deleted the merge_vardict_tnscope branch October 14, 2024 14:39
mathiasbio added a commit that referenced this pull request Oct 16, 2024
#### Added

- UMI extraction and deduplication to TGA workflow
- Adapter trimming of fastqs to UMI workflow
- Cap base quality in bam for Manta input

#### Changed

- Refactored multi workflow rule-files to separate files to decrease complexity
- Refactored output files to in general comply with format {sample_type}.{sample_name}
- Replaced Picard QC tools with matching Sentieon QC tools

#### Removed

- UMI specific rules for UMI-extraction and alignment (using new TGA-rules instead) 
- Fastq and UMI trimming command-line options


Merged this PR into this one: #1465

#### Added

- Added extension of target bed regions to a minimum size of 100 for CNV analysis
- PON for: Exome comprehensive 10.2 
- PON for: GMSsolid 15.2 
- PON for: GMCKsolid 4.2
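
The extension of target regions to a minimum size of 100 could look roughly like the sketch below. It assumes 0-based, half-open BED coordinates and symmetric padding around the region, clamped at position 0; the function name is hypothetical and the actual implementation in the pipeline may differ.

```python
from typing import Tuple


def extend_region(start: int, end: int, min_size: int = 100) -> Tuple[int, int]:
    """Pad a BED interval so that it spans at least min_size bases."""
    size = end - start
    if size >= min_size:
        return start, end
    padding = min_size - size
    # Pad roughly half on the left, never going below coordinate 0.
    new_start = max(0, start - padding // 2)
    return new_start, new_start + min_size


print(extend_region(1000, 1040))
print(extend_region(10, 20))
print(extend_region(0, 150))
```

This sketch does not clamp at the chromosome end, which a real implementation would need to handle.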

#### Changed

- updated PON for GMCKSolid v4.1 
- updated PON for GMSMyeloid v5.3 
- updated PON for GMSlymphoid v7.3

Merged this PR into this one: #1448

#### Added

- Script to post-process CNVkit output to GENS-format
- DNAscope gnomad calling to TGA for GENS

#### Changed

- Parsing of GENS arguments changed to account for TGA

Merged this PR: #1475 into this one

#### Changed

- Refactored rules for bcftools filters
- Renamed final UMI bamfile to ensure hsmetrics are collected in multiqc json
- Changed ranked VCF from research to clinical
- Lowered min AF for TGA from 0.007 to 0.005
- Lowered maximal SOR for TNscope in TGA tumor only cases from 3 to 2.7
- Changed filter settings for research TNscope vcf, now either PASS or triallelic_site (fixing this issue: #1293)

#### Added

- TNscope for TGA workflows, merged with VarDict results
- New filter for VarDict for tumor in normal contamination
- Export TMP environment variables to rules that lack them
- Added genmod ranked VCFs to be delivered
- Added family-id to genmod in order to get ranked variants to Scout (solved this: #1045)
- Added DP and AF to INFO-field of TNscope vcfs for ranking model
- Raw TNscope calls and unfiltered research-annotated SNVs to delivery

#### Removed

- ML-model for TNscope is removed due to license issue with new version of Sentieon
- All code associated with TNhaplotyper
- Removed research.filtered.pass VCFs from delivery and storage list
@mathiasbio mathiasbio mentioned this pull request Oct 17, 2024
15 tasks
@mathiasbio mathiasbio linked an issue Oct 22, 2024 that may be closed by this pull request
3 tasks