feat: add SOR filter to WGS TN #1506
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## develop #1506 +/- ##
===========================================
- Coverage 99.48% 99.45% -0.03%
===========================================
Files 40 40
Lines 1932 2020 +88
===========================================
+ Hits 1922 2009 +87
- Misses 10 11 +1
Code looks good to me. I would add some info about why we are using 3 different SOR values to filter in the different subworkflows (2.7, 3, 4).
If possible, standardising would be best, but if not, an explanation would be good.
It's a good question! 🙏 Do you mean to add it in this PR or in the docs somewhere?

At the moment the reason is mostly based on variant numbers, and ideally I would want to start from scratch with a good truthset and optimise the parameters based on that. But I set 2.7 for TGA based on the Twist pancancer samples that I tested earlier, where no true variants were removed, and where I knew that we already had VarDict and the goal was to bring the numbers down. If we only had TNscope as a caller I would probably relax this parameter a bit, but since we were only adding more variants I leaned more towards adding only the more likely high-quality variants (it turns out that I probably didn't filter hard enough, or at least not on the right parameter; let's hope this fixes it: #1526).

I don't know if SOR 3 is good for WGS TO, but it's what Gothenburg is doing in their pipeline, and what Sentieon is using in their example scripts.

Then, when adding an SOR filter to the WGS TN workflow in this PR, I was aware that we haven't had as many issues with the number of variants in this workflow; the only issue we have seen is the clear strand-bias artefacts mentioned in the deviation issue. I also saw that even an SOR of 4, which is less stringent than in the other workflows, already removed a lot of variants. So I figured it could be nice not to move too drastically, especially when the customers haven't expressed any problems with variants here.

Basically I view it as a balancing act between precision and sensitivity, and for the workflows that don't seem to have an issue with precision I want to be more cautious in applying filters.
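To make the three cutoffs in this thread concrete, here is a minimal Python sketch of a per-workflow SOR filter. Only the values 2.7 (TGA), 3 (WGS TO) and 4 (WGS TN) come from the discussion; the dictionary name, the workflow keys, the function name, and the assumption that a variant is removed when its SOR is strictly above the cutoff are hypothetical and not taken from the BALSAMIC code.

```python
# Hypothetical sketch: names and the exact comparison are assumptions.
# Only the three threshold values come from the discussion in this PR.
SOR_MAX = {
    "TGA": 2.7,
    "WGS_TO": 3.0,
    "WGS_TN": 4.0,
}


def passes_sor(workflow: str, sor: float) -> bool:
    """Return True if a variant with this SOR is kept in the given workflow."""
    return sor <= SOR_MAX[workflow]


if __name__ == "__main__":
    # Show how the same variant fares under each workflow's cutoff.
    for workflow in SOR_MAX:
        print(workflow, passes_sor(workflow, 3.5))
```

Under these assumptions a variant with SOR 3.5 would be removed in TGA and WGS TO but kept in WGS TN, which matches the "less stringent" behaviour described for this PR.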
I'm not sure I'm fully on board with having these different thresholds without a clear explanation and numbers. But I do see the urgency of removing the strand-bias problem. So my suggestion would be to merge this, but to create another user story right away to fix SOR filtering and add proper documentation. Would you be ok with that?
I don't think I see the issue in the same way. I think we should always look to improve our filters, and in that sense the SOR that's defined here is also liable to change. But I would rather push for a truthset that we can run on WGS and then investigate all filters together at that point. Quite likely the SOR would then change, and maybe even end up the same for WGS TO and WGS TN.

But I don't think that having different SOR values for TN and TO is necessarily wrong. It could be that the extra normal filtering in the TN case allows us to widen the field when searching for variants without overloading the clinicians. I'll make an example just to illustrate a scenario, which I think could be quite realistic. Let's say we run the same case as TO and TN:

I have tried to investigate how this SOR filter behaves in this issue: #1505, but based on conversations with Sentieon it seems that the parameter depends on multiple factors, which will be difficult to understand completely in a plot. So I can't really make sense of this parameter as intuitively as I would like. We're in a bit of a black-box scenario here. I know that Sentieon and Gothenburg are using SOR 3 for WGS, like we're doing for WGS TO, but we don't really know the effects of this filter, and so I think it's sensible to exercise some caution when applying the filter to a workflow that hasn't really had any customer feedback on artefacts.

What if we instead make an issue for investigating WGS SNV filters in general, which would include optimisation against some truthset? Even now there are already differences in the filtering of WGS TN and TO in production. These are only for WGS TO, for instance:

So I think a more general issue about optimising the filters based on a truthset would make more sense than one about SOR specifically.
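The truthset-based optimisation proposed here could be sketched roughly as follows: given a set of known true variants and a set of calls with SOR values, compute precision and sensitivity at a few candidate cutoffs. The function name, the variant tuple layout, and the toy data are all made up for illustration; no real results are implied.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def precision_sensitivity(
    calls: Dict[Variant, float], truth: Set[Variant], sor_max: float
) -> Tuple[float, float]:
    """Precision and sensitivity of the calls kept at a given SOR cutoff."""
    # Assumption: a call is kept when its SOR is at or below the cutoff.
    kept = {variant for variant, sor in calls.items() if sor <= sor_max}
    true_positives = len(kept & truth)
    precision = true_positives / len(kept) if kept else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


if __name__ == "__main__":
    # Toy data, invented purely to exercise the function.
    truth = {("1", 100, "A", "T"), ("1", 200, "G", "C")}
    calls = {
        ("1", 100, "A", "T"): 1.2,
        ("1", 200, "G", "C"): 3.4,
        ("2", 300, "C", "A"): 5.0,
    }
    for cutoff in (2.7, 3.0, 4.0):
        print(cutoff, precision_sensitivity(calls, truth, cutoff))
```

Running the same comparison per workflow would show whether a shared cutoff is feasible or whether TN and TO genuinely need different values.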
Just to be clear, I am not against having different filtering settings for different workflows if there is a need for it. What I would like is to be able to trust those settings, which would require documentation on why we are using these filters and trusted test cases with clear criteria (as opposed to specific cases that were run at the time to make a decision, but for which we don't really know how they behave or what to expect). So in that sense, I fully agree that we need a reliable truthset and should revise the filtering criteria in general. So go ahead and make a more general issue, but add a specific bullet point to revise the SOR too, so this doesn't fall under the radar.
Okay! I still don't see why the SOR needs to be specifically pointed out for revision. It could be just as likely that the QSS in the tumor-only workflow needs to be changed, or that some completely unused quality metric in the INFO field, which we haven't looked at at all, would add the greatest benefit to the filtering. But I'll write a list and include the SOR, and add a link to this PR so it can be followed up.
Added assessment issue here: #1540
🚀
Thanks for the review and the discussion! ❤️
Description
As observed in deviation https://github.com/Clinical-Genomics/Deviations/issues/719, we sometimes see variants with excessive strand bias in the WGS TN workflow. The cause of these variants is unknown, but variants with 100% strand bias, as these had, should never have been retained in the final clinical VCF.
Issue in balsamic: #1505
Added
Documentation
Tests
Feature Tests
See sheet: https://docs.google.com/spreadsheets/d/1FaROC-tS8gfJp7pcC9lx-Om_nvqJ8DWfhnI5STMAG4c/edit?gid=0#gid=0
Pipeline Integrity Tests
.hk file)
Clinical Genomics Stockholm
Documentation
Panel of Normal specific criteria
User Changes
Infrastructure Changes
Checklist
Important
Ensure that all checkboxes below are ticked before merging.
For Developers
For Reviewers
conditions where applicable, with satisfactory results.