feat: add SOR filter to WGS TN #1506

Merged
merged 9 commits into from
Feb 18, 2025

Conversation

mathiasbio
Collaborator

@mathiasbio mathiasbio commented Nov 27, 2024

Description

As observed in deviation https://github.com/Clinical-Genomics/Deviations/issues/719, we sometimes have an issue with strand bias in the WGS TN workflow. The cause of these variants with excessive strand bias is unknown, but variants with 100% strand bias, as these had, should never have been kept in the final clinical VCF.

Issue in balsamic: #1505

Added

  • max SOR 4 filter to WGS TN SNV quality filter
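
As a rough illustration of what a max-SOR filter does to a single VCF record, here is a minimal sketch. It is not the Balsamic implementation: the helper names are hypothetical, the strict `>` comparison is an assumption, and the filter name `balsamic_high_strand_oddsratio` is borrowed from the WGS TO rule quoted later in this conversation on the assumption that TN would reuse it.

```python
from typing import Optional

MAX_SOR = 4.0  # threshold added in this PR for WGS TN
FILTER_NAME = "balsamic_high_strand_oddsratio"  # assumed to match the WGS TO filter name


def parse_sor(info: str) -> Optional[float]:
    """Return the SOR value from a VCF INFO string, or None if it is absent."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "SOR" and value:
            return float(value)
    return None


def apply_max_sor(filter_col: str, info: str, max_sor: float = MAX_SOR) -> str:
    """Soft-filter a record: append FILTER_NAME when SOR exceeds max_sor.

    Records without an SOR annotation are left untouched (an assumption).
    """
    sor = parse_sor(info)
    if sor is None or sor <= max_sor:
        return filter_col
    if filter_col in (".", "PASS"):
        return FILTER_NAME
    return f"{filter_col};{FILTER_NAME}"


if __name__ == "__main__":
    print(apply_max_sor("PASS", "DP=50;SOR=5.2"))
    print(apply_max_sor("PASS", "DP=50;SOR=1.1"))
```

A soft filter like this only annotates the FILTER column, so the record can still be inspected before it is dropped from the final clinical VCF.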

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • [balsamic_filters.rst]

Tests

Feature Tests

  • Rerun the case from the deviation and check that the problematic strand-bias variants are removed. The variants shared between the two problematic cases are almost entirely removed.
  • Investigate the change in the number of variants for at least 4 TN cases before and after this filter. In general, more than half of the final variants are removed with SOR 4.
  • WGS TN case is run with SOR 4

See sheet: https://docs.google.com/spreadsheets/d/1FaROC-tS8gfJp7pcC9lx-Om_nvqJ8DWfhnI5STMAG4c/edit?gid=0#gid=0

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

  • Atlas documentation
    • N/A
    • Updated: [Link]
  • Web portal for Clinical Genomics
    • N/A
    • Updated: [Link]

Panel of Normal specific criteria

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.


codecov bot commented Nov 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.45%. Comparing base (7d529e6) to head (b682a9b).
Report is 51 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1506      +/-   ##
===========================================
- Coverage    99.48%   99.45%   -0.03%     
===========================================
  Files           40       40              
  Lines         1932     2020      +88     
===========================================
+ Hits          1922     2009      +87     
- Misses          10       11       +1     
Flag Coverage Δ
unittests 99.45% <ø> (-0.03%) ⬇️


@mathiasbio mathiasbio added this to the Release 17 milestone Jan 20, 2025
@mathiasbio mathiasbio self-assigned this Jan 20, 2025
@mathiasbio mathiasbio linked an issue Feb 13, 2025 that may be closed by this pull request
3 tasks
@mathiasbio mathiasbio marked this pull request as ready for review February 13, 2025 12:09
@mathiasbio mathiasbio requested a review from a team as a code owner February 13, 2025 12:09
Contributor

@fevac fevac left a comment

Code looks good to me. I would add some info about why we are using 3 different SOR values to filter in the different subworkflows (2.7, 3, 4).

If possible, standardising would be best, but if not, an explanation would be good.

@mathiasbio
Collaborator Author

mathiasbio commented Feb 14, 2025

It's a good question! 🙏

Do you mean to add it in this PR or in the docs somewhere? At the moment the reasoning is mostly based on variant numbers, and ideally I would want to start from scratch with a good truthset and optimise the parameters based on that.

But I set 2.7 for TGA based on the Twist pancancer samples that I tested earlier, where no true variants were removed, and where I knew that we already had VarDict and the goal was to bring the numbers down. If we only had TNscope as a caller I would probably relax this parameter a bit, but in this case we were only adding more variants, so I weighed a bit more towards adding likely high-quality variants (but it turns out that I probably didn't filter hard enough, or at least not on the right parameter - let's hope this fixes it: #1526)

I don't know if SOR 3 is good for WGS TO, but it's what Gothenburg is doing in their pipeline, and what Sentieon is using in their example scripts. When adding an SOR filter to the WGS TN workflow in this PR, I was aware that we haven't had as many issues with the number of variants in this workflow; the only issue we have seen was those clear strand-bias artefacts mentioned in the deviation issue. I also saw that even an SOR of 4, which is less stringent than in the other workflows, already removed a lot of variants. So I figured it could be nice to not move too drastically, especially when the customers haven't reported any problems with variants here.

Basically I view it as a balancing act between precision and sensitivity, and for the workflows that don't seem to have an issue with precision I want to be more cautious in applying filters.
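
To make the three thresholds discussed above concrete, here is a small sketch that puts them side by side. The values (2.7, 3, 4) come from this conversation; the dictionary keys, the function name, and the "pass when SOR is at most the threshold" comparison are hypothetical and not taken from the Balsamic code.

```python
# Max SOR per workflow as mentioned in this thread; key names are illustrative only.
SOR_MAX_BY_WORKFLOW = {
    "TGA": 2.7,
    "WGS_TO": 3.0,
    "WGS_TN": 4.0,
}


def passes_sor(workflow: str, sor: float) -> bool:
    """Return True if a variant with this SOR survives the workflow's max-SOR filter."""
    return sor <= SOR_MAX_BY_WORKFLOW[workflow]


if __name__ == "__main__":
    # A variant with SOR 3.5 is kept only by the most lenient threshold.
    for workflow in SOR_MAX_BY_WORKFLOW:
        print(workflow, passes_sor(workflow, 3.5))
```

Keeping the thresholds in one table like this would also make it easier to document the reason for each value next to the number.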

@mathiasbio mathiasbio requested a review from fevac February 18, 2025 09:54
@fevac
Contributor

fevac commented Feb 18, 2025

I'm not sure I'm fully on board with having these different thresholds without a clear explanation and numbers. But I do see the urgency of removing the strand-bias problem. So my suggestion would be to merge this but create another user story right away to fix SOR filtering and add proper documentation. Would you be ok with that?

@mathiasbio
Collaborator Author

I don't think I see the issue in the same way. I think we should always look to improve our filters, and in that sense the SOR defined here is also liable to change. But I would rather push for a truthset that we can run on WGS and then investigate all filters together at that point. Quite likely the SOR would change then, and maybe even end up the same for WGS TO and WGS TN. But I don't think that having different SOR values for TN and TO is necessarily wrong. It could be that the extra normal filtering in the TN case allows us to widen the search for variants without overloading the clinicians.

I'll make an example just to illustrate a scenario, which I think could be quite realistic. Let's say we run the same case as TO and TN:
On TO we miss 6 variants and have 500 false positives.
On TN we miss 3 variants and have 20 false positives. We reduce FPs because of the matched normal, and we find 3 extra true variants because some true variants by chance had an SOR higher than 3 and were filtered out in the TO case. If we changed SOR to 3 for TN as well, we'd instead miss 6 variants and maybe filter out 15 more false positives.
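
The scenario above can be turned into sensitivity and precision figures with a short sketch. The missed and false-positive counts are the ones from the example; the total number of true variants (`N_TRUE = 50`) is NOT given anywhere in this thread and is a purely hypothetical value chosen so that the ratios can be computed.

```python
from dataclasses import dataclass

N_TRUE = 50  # HYPOTHETICAL total of true variants; not stated in the discussion


@dataclass
class Outcome:
    missed: int
    false_positives: int
    n_true: int = N_TRUE

    @property
    def true_positives(self) -> int:
        return self.n_true - self.missed

    @property
    def sensitivity(self) -> float:
        # Fraction of true variants that survive filtering.
        return self.true_positives / self.n_true

    @property
    def precision(self) -> float:
        # Fraction of reported variants that are true.
        return self.true_positives / (self.true_positives + self.false_positives)


scenarios = {
    "TO, SOR 3": Outcome(missed=6, false_positives=500),
    "TN, SOR 4": Outcome(missed=3, false_positives=20),
    "TN, SOR 3": Outcome(missed=6, false_positives=20 - 15),
}

if __name__ == "__main__":
    for name, outcome in scenarios.items():
        print(f"{name}: sensitivity={outcome.sensitivity:.2f}, precision={outcome.precision:.2f}")
```

Under this assumed total, tightening TN from SOR 4 to SOR 3 trades a few recovered true variants for a small reduction in false positives, which is the balancing act described in the example. The absolute ratios depend entirely on the assumed `N_TRUE`.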

I have tried to investigate how this SOR filter behaves in this issue: #1505, but based on conversations with Sentieon it seems that the parameter depends on multiple factors, which will be difficult to understand completely in a plot. So I can't really make sense of this parameter as intuitively as I would like. We're in a bit of a black-box scenario here. I know that Sentieon and Gothenburg are using SOR 3 for WGS, like we're doing for WGS TO, but we don't really know the effects of this filter... and so I think it's nice to exercise some caution when applying the filter to a workflow that hasn't really had any customer feedback on artefacts.

What if we make an issue instead for investigating WGS SNV filters in general, which would include optimisation compared to some truthset? Even now there are already differences in the filtering of WGS TN and TO in production.

These are only for WGS TO for instance:

        strand_reads = get_tag_and_filtername(snv_quality_filters, "balsamic_low_strand_read_counts"),
        qss = get_tag_and_filtername(snv_quality_filters, "balsamic_low_quality_scores"),
        sor = get_tag_and_filtername(snv_quality_filters, "balsamic_high_strand_oddsratio"),

So I think a more general issue about optimising the filters based on a truthset would make more sense than one about SOR specifically.

@fevac
Contributor

fevac commented Feb 18, 2025

Just to be clear, I am not against having different filtering settings for different workflows if there is a need for it. What I would like is to be able to trust those settings, which would require documentation on why we are using these filters and trusted test cases with clear criteria (vs. specific cases that were run at the time to make a decision but for which we don't really know how they behave or what to expect). So in that sense, I fully agree that we need a reliable truthset and to revise the filtering criteria in general. So go ahead and make a more general issue, but add a specific bullet point to revise the SOR too so this doesn't slip under the radar.

@mathiasbio
Collaborator Author

Okay! I still don't know why the SOR needs to be specifically pointed out to be revised. It could be just as likely that the QSS in the tumor-only workflow needs to be changed, or that some completely unused quality metric in the INFO field, which we haven't looked at at all, would add the greatest benefit to the filtering. But I'll write a list and include the SOR, and add a link to this PR so it can be followed up.

@mathiasbio
Collaborator Author

Added assessment issue here: #1540

Contributor

@fevac fevac left a comment

🚀

@mathiasbio
Collaborator Author

Thanks for the review and the discussion! ❤️

@mathiasbio mathiasbio merged commit 0ec90f1 into develop Feb 18, 2025
9 checks passed
@mathiasbio mathiasbio deleted the add_strandbias_filter_to_tn_wgs_workflow branch February 18, 2025 13:15
Labels
None yet
Projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

[User Story] Apply strandbias filter for WGS TN workflow
2 participants