feat: add SOR filter to WGS TN #1506

mathiasbio · 2024-11-27T13:47:28Z

Description

As observed in deviation https://github.com/Clinical-Genomics/Deviations/issues/719, we sometimes see strand-bias artefacts in the WGS TN workflow. The cause of these variants is unknown, but variants with 100% strand bias, as these had, should never have been retained in the final clinical VCF.

Issue in balsamic: #1505

Added

Max SOR 4 filter in the WGS TN SNV quality filters
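
To illustrate what a maximum-SOR filter does, here is a minimal sketch in plain Python. It is not the Balsamic implementation: it assumes the strand odds ratio is available as an `SOR` key in the VCF INFO column, that values strictly above 4 are flagged, and that failing records are soft-filtered. The function names are hypothetical; the filter name is the one quoted later in this conversation.

```python
from typing import Optional

# Filter name as quoted in the discussion; the threshold of 4 is the one added in this PR.
SOR_FILTER_NAME = "balsamic_high_strand_oddsratio"
MAX_SOR = 4.0


def get_sor(info: str) -> Optional[float]:
    """Return the SOR value from a VCF INFO string, or None if absent or unparsable."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SOR" and value:
            try:
                return float(value)
            except ValueError:
                return None
    return None


def apply_sor_filter(vcf_line: str, max_sor: float = MAX_SOR) -> str:
    """Soft-filter a VCF record whose SOR exceeds max_sor (assumed boundary: strictly greater)."""
    line = vcf_line.rstrip("\n")
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.split("\t")
    sor = get_sor(fields[7])  # column 8 of a VCF record is INFO
    if sor is not None and sor > max_sor:
        current = fields[6]  # column 7 is FILTER
        if current in (".", "PASS"):
            fields[6] = SOR_FILTER_NAME
        else:
            fields[6] = f"{current};{SOR_FILTER_NAME}"
    return "\t".join(fields)
```

Records without an `SOR` annotation are left untouched in this sketch; whether that matches the pipeline's behaviour is not stated in this PR.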

Documentation

N/A
Updated Balsamic documentation to reflect the changes as needed for this PR.
- [balsamic_filters.rst]

Tests

Feature Tests

Rerun the case from the deviation and check that the problematic strand-bias variants are removed. The variants shared between the two problematic cases are almost entirely removed.
Investigate the change in the number of variants for at least 4 TN cases before and after this filter. In general, more than half of the final variants are removed with SOR 4.
WGS TN case is run with SOR 4

See sheet: https://docs.google.com/spreadsheets/d/1FaROC-tS8gfJp7pcC9lx-Om_nvqJ8DWfhnI5STMAG4c/edit?gid=0#gid=0

Pipeline Integrity Tests

Report deliver (generation of the .hk file)
- N/A
- Verified
TGA T/O Workflow
- N/A
- Verified
TGA T/N Workflow
- N/A
- Verified
UMI T/O Workflow
- N/A
- Verified
UMI T/N Workflow
- N/A
- Verified
WGS T/O Workflow
- N/A
- Verified
WGS T/N Workflow
- N/A
- Verified
QC Workflow
- N/A
- Verified
PON Workflow
- N/A
- Verified

Clinical Genomics Stockholm

Documentation

Atlas documentation
- N/A
- Updated: [Link]
Web portal for Clinical Genomics
- N/A
- Updated: [Link]

Panel of Normal specific criteria

The PR includes the addition of a new Panel of Normals
The samples have been verified to adhere to the sample selection criteria on Atlas PoN creation instructions for Balsamic

User Changes

N/A
This PR affects the output files or results.
- User feedback is considered unnecessary because [Justification].
- Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

Stored files in Housekeeper
- N/A
- Updated: [Link]
CG (CLI and delivered/uploaded files)
- N/A
- Updated: [Link]
Servers (configuration files on Hasta)
- N/A
- Updated: [Link]
Scout interface
- N/A
- Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

PR Description
- Provided a comprehensive description of the PR.
- Linked relevant user stories or issues to the PR.
Documentation
- Verified and updated documentation if necessary.
Tests
- Described and tested the functionality addressed in the PR.
- Ensured integration of the new code with existing workflows.
- Confirmed that meaningful unit tests were added for the changes introduced.
- Checked that the PR has successfully passed all relevant code smells and coverage checks.
Review
- Addressed and resolved all the feedback provided during the code review process.
- Obtained final approval from designated reviewers.

For Reviewers

Code
- Code implements the intended features or fixes the reported issue.
- Code follows the project's coding standards and style guide.
Documentation
- Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
Tests
- The author provided a description of their manual testing, including consideration of edge cases and boundary
  conditions where applicable, with satisfactory results.
Review
- Confirmed that the developer has addressed all the comments during the code review.

sonarqubecloud · 2024-11-27T13:48:06Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

codecov · 2024-11-27T13:50:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.45%. Comparing base (7d529e6) to head (b682a9b).
Report is 51 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1506      +/-   ##
===========================================
- Coverage    99.48%   99.45%   -0.03%     
===========================================
  Files           40       40              
  Lines         1932     2020      +88     
===========================================
+ Hits          1922     2009      +87     
- Misses          10       11       +1

Flag	Coverage Δ
unittests	`99.45% <ø> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

fevac

Code looks good to me. I would add some info about why we are using 3 different SOR values to filter in the different subworkflows (2.7, 3, 4).

If possible, standardising would be best, but if not, an explanation would be good.

mathiasbio · 2025-02-14T15:33:38Z

It's a good question! 🙏

Do you mean to add it in this PR or in the docs somewhere? At the moment the reasoning is mostly based on variant numbers, and ideally I would want to start from scratch with a good truthset and optimise the parameters based on that.

But I set 2.7 on TGA based on the Twist pancancer samples that I tested earlier, where no true variants were removed, and where I knew that we already had VarDict and the goal was to bring the numbers down. If we only had TNscope as a caller I would probably relax this parameter a bit, but here we were only adding more variants, so I leaned towards adding mostly likely high-quality variants (though it turns out that I probably didn't filter hard enough, or at least not on the right parameter; let's hope this fixes it: #1526)

I don't know if SOR 3 is good for WGS TO, but it's what Gothenburg is doing in their pipeline, and what Sentieon is using in their example scripts. When adding an SOR filter to the WGS TN workflow in this PR, I was aware that we haven't had as many issues with the number of variants in this workflow; the only issue we have seen is the clear strand-bias artefacts mentioned in the deviation issue. I also saw that even an SOR of 4, which is less stringent than in the other workflows, already removed a lot of variants. So I figured it would be good not to move too drastically, especially when the customers haven't expressed any problems with variants here.

Basically I view it as a balancing act between precision and sensitivity, and for the workflows that don't seem to have an issue with precision I want to be more cautious in applying filters.
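
As a rough illustration of how the three thresholds mentioned above (2.7 for TGA, 3 for WGS TO, 4 for WGS TN) treat the same variant differently, here is a small sketch. The dictionary keys and the function name are hypothetical and do not come from the Balsamic code base, and the inclusive comparison is an assumption.

```python
from typing import List

# Maximum SOR per workflow, as described in this conversation.
# The key names are made up for this example.
MAX_SOR_BY_WORKFLOW = {
    "TGA": 2.7,
    "WGS_TO": 3.0,
    "WGS_TN": 4.0,
}


def workflows_retaining(sor: float) -> List[str]:
    """List the workflows whose SOR threshold would keep a variant with this SOR."""
    return [
        workflow
        for workflow, max_sor in MAX_SOR_BY_WORKFLOW.items()
        if sor <= max_sor  # assumed: a variant at exactly the threshold is kept
    ]


# A variant with SOR 3.5 would only survive in the more permissive WGS TN workflow.
print(workflows_retaining(3.5))
```

A stricter threshold trades sensitivity for precision, which is the balancing act described above.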

fevac · 2025-02-18T11:01:51Z

I'm not sure I'm fully on board with having these different thresholds without a clear explanation and numbers. But I do see the urgency of removing the strand-bias problem. So my suggestion would be to merge this, but already now create another user story to fix SOR filtering and add proper documentation. Would you be ok with that?

mathiasbio · 2025-02-18T11:57:44Z

I don't think I see the issue in the same way. I think we should always look to improve our filters, and in that sense the SOR defined here is also liable to change. But I would rather push for a truthset that we can run on WGS and then investigate all filters together at that point. Quite likely the SOR would then change, and maybe even end up the same for WGS TO and WGS TN. But I don't think that having different SOR values for TN and TO is necessarily wrong. It could be that the extra normal filtering in the TN case allows us to widen the field when searching for variants without overloading the clinicians.

I'll make an example just to illustrate a scenario, which I think could be quite realistic. Let's say we run the same case as TO and TN:
On TO we miss 6 variants and have 500 false positives.
On TN we miss 3 variants and have 20 false positives. We reduce FPs because of the matched normal, and we find 3 extra true variants because some true variants by chance had an SOR higher than 3 and were filtered out in the TO case. If we changed the SOR to 3 for TN as well, we would instead miss 6 variants and maybe filter out 15 more false positives.
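
The arithmetic of this scenario can be sketched as sensitivity and precision. The missed and false-positive counts are the ones given above; the total number of true variants is not stated, so the value of 100 below is purely a hypothetical assumption to make the ratios computable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    missed: int           # true variants that were filtered out
    false_positives: int  # artefacts that passed the filters

    def sensitivity(self, total_true: int) -> float:
        return (total_true - self.missed) / total_true

    def precision(self, total_true: int) -> float:
        true_positives = total_true - self.missed
        return true_positives / (true_positives + self.false_positives)


TOTAL_TRUE = 100  # hypothetical: the scenario does not say how many true variants exist

SCENARIOS = {
    "TO, SOR 3": Outcome(missed=6, false_positives=500),
    "TN, SOR 4": Outcome(missed=3, false_positives=20),
    "TN, SOR 3": Outcome(missed=6, false_positives=20 - 15),
}

for name, outcome in SCENARIOS.items():
    print(
        f"{name}: sensitivity={outcome.sensitivity(TOTAL_TRUE):.2f}, "
        f"precision={outcome.precision(TOTAL_TRUE):.2f}"
    )
```

Under that assumption, tightening TN from SOR 4 to SOR 3 gains some precision at the cost of the three extra true variants, which is the trade-off being weighed here.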

I have tried to investigate how this SOR filter behaves in this issue: #1505, but based on conversations with Sentieon it seems that the parameter depends on multiple factors, which will be difficult to understand completely in a plot. So I can't really make sense of this parameter as intuitively as I would like. We're in a bit of a black-box scenario here. I know that Sentieon and Gothenburg are using SOR 3 for WGS, like we're doing for WGS TO, but we don't really know the effects of this filter, and so I think it's good to exercise some caution when applying it to a workflow that hasn't really had any customer feedback on artefacts.

What if we make an issue instead for investigating WGS SNV filters in general, which would include optimisation compared to some truthset? Even now there are already differences in the filtering of WGS TN and TO in production.

These are only for WGS TO for instance:

        strand_reads = get_tag_and_filtername(snv_quality_filters, "balsamic_low_strand_read_counts"),
        qss = get_tag_and_filtername(snv_quality_filters, "balsamic_low_quality_scores"),
        sor = get_tag_and_filtername(snv_quality_filters, "balsamic_high_strand_oddsratio"),

So I think a more general issue about optimising the filters based on a truthset would make more sense than one about SOR specifically.

fevac · 2025-02-18T12:07:02Z

Just to be clear, I am not against having different filtering settings for different workflows if there is a need for it. What I would like is to be able to trust those settings, which would require documentation on why we are using these filters and trusted test cases with clear criteria (vs. specific cases that were run at the time to take a decision, but for which we don't really know how they behave or what to expect). So in that sense, I fully agree that we need a reliable truthset and need to revise the filtering criteria in general. So go ahead and make a more general issue, but add a specific bullet point to revise the SOR too so this doesn't slip under the radar.

mathiasbio · 2025-02-18T12:38:36Z

Okay! I still don't know why the SOR needs to be specifically pointed out for revision. I mean, it could be just as likely that the QSS in the tumor-only workflow needs to be changed, or that some completely unused quality metric in the INFO field, which we haven't looked at at all, would add the greatest benefit to the filtering. But I'll write a list and include the SOR, and add a link to this PR so it can be followed up.

mathiasbio · 2025-02-18T12:48:39Z

Added assessment issue here: #1540

sonarqubecloud · 2025-02-18T12:52:45Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

fevac

🚀

mathiasbio · 2025-02-18T13:14:55Z

Thanks for the review and the discussion! ❤️

add sor filter

6763f28

mathiasbio added this to the Release 17 milestone Jan 20, 2025

mathiasbio self-assigned this Jan 20, 2025

mathiasbio added 2 commits February 13, 2025 11:02

bla

ce9a78b

Merge branch 'develop' of github.com:Clinical-Genomics/BALSAMIC into …

ff0d357

…develop

mathiasbio linked an issue Feb 13, 2025 that may be closed by this pull request

[User Story] Apply strandbias filter for WGS TN workflow #1505

Closed

3 tasks

mathiasbio added 2 commits February 13, 2025 11:10

merge develop

506aebb

update docs

6807ceb

mathiasbio marked this pull request as ready for review February 13, 2025 12:09

mathiasbio requested a review from a team as a code owner February 13, 2025 12:09

fevac reviewed Feb 14, 2025

View reviewed changes

mathiasbio added 2 commits February 14, 2025 16:35

Merge branch 'develop' of github.com:Clinical-Genomics/BALSAMIC into …

07c46d0

…develop

merge develop

b9a20f8

mathiasbio requested a review from fevac February 18, 2025 09:54

mathiasbio mentioned this pull request Feb 18, 2025

[Assessment] Run WGS truthset and evaluate current SNV filters #1540

Open

mathiasbio added 2 commits February 18, 2025 13:50

Merge branch 'develop' of github.com:Clinical-Genomics/BALSAMIC into …

dc0866f

…develop

merge develop

b682a9b

fevac approved these changes Feb 18, 2025

View reviewed changes

mathiasbio merged commit 0ec90f1 into develop Feb 18, 2025
9 checks passed

mathiasbio deleted the add_strandbias_filter_to_tn_wgs_workflow branch February 18, 2025 13:15
