Properly handling Ns in make_prg #60

leoisl · 2023-07-11T14:49:35Z

Our handling of N bases in make_prg has not been appropriate I guess. An extreme example comes from @Danderson123 . He has an MSA with 789 alleles, and just a single allele has two runs of 10 Ns. At some point during the recursive cluster and collapse algorithm, we get this part of the allele isolated, and we skip this gene:

2023-07-11 10:09:41.175 | WARNING  | make_prg.subcommands.from_msa:process_MSA:148 - Skipping building PRG for alsB. Error: All sequences in this slice contained N. Redo sequence curation.
Sequences: ['GAANNNNNNNNNN']

I think this might be worst than erroring out, specially in cases where we build PRGs from thousands of genes, this just gets buried in the log files, even though it is a WARNING log message...

The rationale to refuse to build the PRG from MSAs where we have full columns with N might be good, but the recursive nature of this problem makes it hard to generalise (e.g. we should refuse building a graph from a 10-allele MSA if some columns are all Ns, but if this was a 1000-allele MSA and just these had 10 had the Ns, it would be ok...). Anyway, we chatted and it seemed to us the best solution would be to replace N with the consensus of each column. If the whole column is N and/or gaps, we replace it by a random base, with the seed initialised with the alignment sequences, i.e. this replacement is reproducible if the same alignment is given. This is done upfront when loading the MSA so that we take the whole MSA as the context to find the consensus for each column.

This is a small PR, the huge number of additions comes from the integration tests I added (I used the 3 genes @Danderson123 pointed out in his minimum working example).

This should fix iqbal-lab-org/pandora#333

of each column. If the whole column is N and/or gaps, we replace it by a random base, with the seed initialised with the alignment sequences, i.e. this replacement is reproducible if the same alignment is given.

codecov · 2023-07-11T14:57:31Z

Codecov Report

Merging #60 (7454a89) into dev (80aca37) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##              dev      #60   +/-   ##
=======================================
  Coverage   99.73%   99.73%           
=======================================
  Files          18       18           
  Lines        1494     1525   +31     
=======================================
+ Hits         1490     1521   +31     
  Misses          4        4

Impacted Files	Coverage Δ
make_prg/utils/io_utils.py	`98.41% <100.00%> (+0.23%)`	⬆️
make_prg/utils/seq_utils.py	`100.00% <100.00%> (ø)`

make_prg/utils/seq_utils.py

refactoring

leoisl · 2023-07-13T14:47:58Z

Hey @mbhall88 , thanks for the comments. I addressed both your comments, added unit tests covering the two changes, and further refactored the get_majority_consensus_from_MSA() function.

leoisl added 3 commits July 11, 2023 15:22

Instead of ignoring Ns in the MSA, we now replace N with the consensus

fbc325d

of each column. If the whole column is N and/or gaps, we replace it by a random base, with the seed initialised with the alignment sequences, i.e. this replacement is reproducible if the same alignment is given.

Adding Amira test cases

0b8380e

just precommit

dd75a6c

leoisl requested review from mbhall88, Danderson123 and iqbal-lab July 11, 2023 14:49

Adding another test to get_majority_consensus_from_MSA()

ae9628e

mbhall88 reviewed Jul 12, 2023

View reviewed changes

make_prg/utils/seq_utils.py Outdated Show resolved Hide resolved

make_prg/utils/seq_utils.py Outdated Show resolved Hide resolved

leoisl added 2 commits July 13, 2023 15:17

Randomly select highest-count bases, do not require upper-cased MSA and

2089324

refactoring

linting

7454a89

leoisl requested a review from mbhall88 July 13, 2023 14:46

mbhall88 approved these changes Jul 17, 2023

View reviewed changes

iqbal-lab approved these changes Jul 17, 2023

View reviewed changes

leoisl merged commit 7fbbc8c into iqbal-lab-org:dev Jul 17, 2023

leoisl mentioned this pull request Jul 20, 2023

v0.5.0 release #63

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Properly handling Ns in make_prg #60

Properly handling Ns in make_prg #60

leoisl commented Jul 11, 2023

codecov bot commented Jul 11, 2023 •

edited

Loading

leoisl commented Jul 13, 2023

Properly handling Ns in make_prg #60

Properly handling Ns in make_prg #60

Conversation

leoisl commented Jul 11, 2023

codecov bot commented Jul 11, 2023 • edited Loading

Codecov Report

leoisl commented Jul 13, 2023

codecov bot commented Jul 11, 2023 •

edited

Loading