fix: merge snv variants script #1499

Merged
merged 175 commits into from
Feb 14, 2025

Conversation

mathiasbio
Collaborator

@mathiasbio mathiasbio commented Nov 12, 2024

Description

The following PRs were merged into this one so that all new features could be tested together:

  1. Disable normal hard filter: feat: add option to disable normal hardfilter #1509
  2. Create TNscope MNVs: fix: create tnscope mnvs #1524

This PR replaces the current bcftools concat method with a custom Python script.

The bcftools concat method has two issues:

  1. Most importantly, it does not merge INFO fields when a variant is present in both VCFs. The most significant consequence is that we lose the information that the variant was called by two callers. [User Story] Keep both FOUND_IN variantcaller tags in merged variants #1518
  2. It does not require variants to match exactly in the ALT column. For instance, if a variant has been called as an MNV by VarDict and as separate SNVs by TNscope, only the first variant is merged. [Bug] Merging of different variants VarDict and TNscope #1519
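
A minimal sketch of the merging behaviour described above, not the actual merge_snv_variantcallers.py code: variants are keyed on the exact CHROM, POS, REF and ALT values, so an MNV and its component SNVs stay separate, and the caller tags of shared variants are combined. The dict-based records and the function names `variant_key` and `merge_callers` are assumptions made for illustration.

```python
def variant_key(record):
    """Exact-match key: an MNV and its component SNVs get different keys."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def merge_callers(first, second):
    """Merge two lists of variant dicts (hypothetical record layout).

    Variants present in both inputs become one record whose "found_in"
    list names every caller, in input order.
    """
    merged = {}
    for record in first + second:
        key = variant_key(record)
        if key not in merged:
            # Copy the record so the input lists are left untouched.
            merged[key] = {**record, "found_in": list(record["found_in"])}
        else:
            for caller in record["found_in"]:
                if caller not in merged[key]["found_in"]:
                    merged[key]["found_in"].append(caller)
    return list(merged.values())


tnscope = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "T", "found_in": ["tnscope"]},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "G", "found_in": ["tnscope"]},
]
vardict = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "T", "found_in": ["vardict"]},
    # Same position, but called as an MNV: must not be merged with the SNV.
    {"chrom": "1", "pos": 200, "ref": "CA", "alt": "GT", "found_in": ["vardict"]},
]
merged = merge_callers(tnscope, vardict)
```

In this toy input the shared SNV at position 100 ends up with both caller tags, while the SNV and the MNV at position 200 remain two separate records.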

The following lines are also added to the header of the merged VCF:

##merge_snv_variantcallers=merge_snv_variantcallers.py SNV.somatic.setamoeba.tnscope.research.normalised.vcf.gz SNV.somatic.setamoeba.vardict.research.normalised.vcf.gz --output output_merged.vcf
##merge_snv_variantcallers_processing_time=2025-01-20T11:18:24
##INFO_MERGE_SNV_VARIANTCALLERS=Values in merged INFO fields are listed in the order of the input files: first from SNV.somatic.setamoeba.tnscope.research.normalised.vcf.gz, then from SNV.somatic.setamoeba.vardict.research.normalised.vcf.gz
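
As an illustration of how header lines of this shape could be generated, the sketch below rebuilds the three lines from the script name, the input file names, the output name and a timestamp. The function `build_merge_header_lines` and its parameters are hypothetical and are not taken from the actual script.

```python
from datetime import datetime


def build_merge_header_lines(script_name, input_vcfs, output_vcf, timestamp):
    """Return the three provenance header lines for a merged VCF.

    Hypothetical helper: the real script may assemble these differently.
    """
    command = " ".join([script_name, *input_vcfs, "--output", output_vcf])
    order = "first from " + ", then from ".join(input_vcfs)
    return [
        f"##merge_snv_variantcallers={command}",
        "##merge_snv_variantcallers_processing_time="
        f"{timestamp.isoformat(timespec='seconds')}",
        "##INFO_MERGE_SNV_VARIANTCALLERS=Values in merged INFO fields are "
        f"listed in the order of the input files: {order}",
    ]


header_lines = build_merge_header_lines(
    "merge_snv_variantcallers.py",
    [
        "SNV.somatic.setamoeba.tnscope.research.normalised.vcf.gz",
        "SNV.somatic.setamoeba.vardict.research.normalised.vcf.gz",
    ],
    "output_merged.vcf",
    datetime(2025, 1, 20, 11, 18, 24),
)
```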

The file path is also removed from the FOUND_IN value during pre-processing by edit_vcf_info.py.

Based on @khurrammaqbool's suggestion, the AF and DP fields in INFO keep their single values from the first VCF, and new list fields for AF and DP hold the values from both VCFs. Previously, the AF and DP fields themselves were turned into lists.
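
A small sketch of this AF and DP handling, under stated assumptions: INFO fields are modelled as plain dicts, and the list field names `AF_LIST` and `DP_LIST` as well as the function `merge_af_dp` are hypothetical, since the PR text only refers to "AF and DP LIST fields". Other INFO fields are ignored here.

```python
def merge_af_dp(first_info, second_info, fields=("AF", "DP")):
    """Keep single AF/DP values from the first VCF and add list fields.

    The "<FIELD>_LIST" naming is an assumption for illustration only.
    """
    merged = dict(first_info)  # single values come from the first VCF
    for field in fields:
        # Collect the values in the order of the input files.
        values = [info[field] for info in (first_info, second_info) if field in info]
        if values:
            merged[f"{field}_LIST"] = values
    return merged


tnscope_info = {"AF": 0.12, "DP": 250}
vardict_info = {"AF": 0.10, "DP": 240}
merged_info = merge_af_dp(tnscope_info, vardict_info)
```

Keeping AF and DP as single values means downstream tools that expect one number per field can keep reading them unchanged, while the list fields preserve what each caller reported.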

Changed

  • Replaced bcftools concat with a custom Python script for merging VCFs from VarDict and TNscope

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • [balsamic_filters.rst]
    • [balsamic_methods.rst]

Tests

Feature Tests

Verify that both TNscope and VarDict show up as callers for merged variants in Scout

image

Verify that the INFO fields of merged VarDict and TNscope variants keep AF and DP as single-value fields, and that separate AF and DP LIST fields are created with the values from each caller. See sheet: https://docs.google.com/spreadsheets/d/1kB2vNaEBmol0tX3HUR3UY1LPQCtWZLutixGwnvbkmhY/edit?gid=1578036768#gid=1578036768

  • Successful

Verify that running vcf-validator (https://github.com/EBIvariation/vcf-validator) does not report any new errors that were not present in the original merged VCF from 16.0.0.

Errors in the clinical.filtered.pass VCF from uphippo with v16.0.0:

According to the VCF specification, the input file is not valid
Error: Error in meta-data section. This occurs 1 time(s), first time in line 265.
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 113 time(s), first time in line 279.
Warning: Reference and alternate alleles do not share the first nucleotide. This occurs 2 time(s), first time in line 442.

Errors in the clinical.filtered.pass VCF from uphippo with this PR:

According to the VCF specification, the input file is not valid
Error: Error in meta-data section. This occurs 1 time(s), first time in line 280.
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 115 time(s), first time in line 293.
Warning: Reference and alternate alleles do not share the first nucleotide. This occurs 2 time(s), first time in line 460.
  • Successful. The "Format is not a colon-separated list of alphanumeric strings" error comes from TNscope and affects the majority of TNscope variants. If this were an issue for uploading to Scout, we would have seen it.

Pipeline Integrity Tests

  • Report delivery (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

  • Atlas documentation
    • N/A
    • Updated: [Link]
  • Web portal for Clinical Genomics
    • N/A
    • Updated: [Link]

Panel of Normal specific criteria

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.


codecov bot commented Nov 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.45%. Comparing base (7d529e6) to head (fe0d227).
Report is 47 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1499      +/-   ##
===========================================
- Coverage    99.48%   99.45%   -0.03%     
===========================================
  Files           40       40              
  Lines         1932     2020      +88     
===========================================
+ Hits          1922     2009      +87     
- Misses          10       11       +1     
Flag Coverage Δ
unittests 99.45% <ø> (-0.03%) ⬇️


Base automatically changed from release_v16.0.0 to master November 19, 2024 14:45
@mathiasbio mathiasbio changed the base branch from master to develop November 21, 2024 11:20
Base automatically changed from create_tnscope_mnvs to develop February 7, 2025 16:00
@mathiasbio mathiasbio requested review from fevac and a team February 13, 2025 12:20
Contributor

@fevac fevac left a comment

🚀

docs/balsamic_filters.rst (outdated, resolved)
docs/balsamic_filters.rst (outdated, resolved)
@mathiasbio mathiasbio merged commit e0459b8 into develop Feb 14, 2025
9 checks passed
@mathiasbio mathiasbio deleted the merge_snv_variants_script branch February 14, 2025 09:39
Labels
None yet
Projects
Status: Completed

3 participants