-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly handling Ns in make_prg #60
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
of each column. If the whole column is N and/or gaps, we replace it by a random base, with the seed initialised with the alignment sequences, i.e. this replacement is reproducible if the same alignment is given.
Codecov Report
@@ Coverage Diff @@
## dev #60 +/- ##
=======================================
Coverage 99.73% 99.73%
=======================================
Files 18 18
Lines 1494 1525 +31
=======================================
+ Hits 1490 1521 +31
Misses 4 4
|
mbhall88
reviewed
Jul 12, 2023
Hey @mbhall88 , thanks for the comments. I addressed both your comments, added unit tests covering the two changes, and further refactored the |
mbhall88
approved these changes
Jul 17, 2023
iqbal-lab
approved these changes
Jul 17, 2023
Merged
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Our handling of
N
bases inmake_prg
has not been appropriate I guess. An extreme example comes from @Danderson123 . He has an MSA with 789 alleles, and just a single allele has two runs of 10N
s. At some point during the recursive cluster and collapse algorithm, we get this part of the allele isolated, and we skip this gene:I think this might be worst than erroring out, specially in cases where we build PRGs from thousands of genes, this just gets buried in the log files, even though it is a
WARNING
log message...The rationale to refuse to build the PRG from MSAs where we have full columns with
N
might be good, but the recursive nature of this problem makes it hard to generalise (e.g. we should refuse building a graph from a 10-allele MSA if some columns are allN
s, but if this was a 1000-allele MSA and just these had 10 had theN
s, it would be ok...). Anyway, we chatted and it seemed to us the best solution would be to replaceN
with the consensus of each column. If the whole column is N and/or gaps, we replace it by a random base, with the seed initialised with the alignment sequences, i.e. this replacement is reproducible if the same alignment is given. This is done upfront when loading the MSA so that we take the whole MSA as the context to find the consensus for each column.This is a small PR, the huge number of additions comes from the integration tests I added (I used the 3 genes @Danderson123 pointed out in his minimum working example).
This should fix iqbal-lab-org/pandora#333