Check if PHMM-based alignment necessarily outputs an MSA #35

jackscanlan · 2024-10-31T04:47:30Z

I don't think this is true if the aligned sequence has an insertion relative to the PHMM

If this is false, FILTER_PHMM sequences will need to be dealigned, and aligned output (#34) will need to be produced by an independent process (make sure if model-training à la #4 requires unaligned input, both processes can run in parallel).


alexpiper · 2024-10-31T08:36:38Z

This should be handled by the 'extra' argument in map_to_model

@param extra How to handle insertions which were not part of the PHMM model. 'drop' will truncate all sequences to the shortest alignment length, while 'fill' will use gaps to pad all sequences out to the longest alignment length.

But definitely worth validating, as it's been a while since this was tested.

jackscanlan · 2024-10-31T09:04:56Z

Thinking about it further, the problem is that the PHMM alignment is done in chunks (by default 1000 sequences at a time) for efficiency reasons, so different insertions could produce different coordinate systems for the resulting MSAs, and they won’t be compatible when concatenated. So unfortunately I don't think that function argument helps.

I’ve already started implementing an optional alignment step at the end of the pipeline, and will soon add dealignment after the PHMM alignment step so that the default output is unaligned.
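To make the chunking problem concrete, here is a toy sketch in Python (not the pipeline's actual code). It assumes a made-up representation where each sequence is a `(match, inserts)` pair, and `align_chunk` is a hypothetical stand-in for a 'fill'-style alignment that pads insertions with gaps within one chunk only:

```python
def align_chunk(chunk, n_match):
    """Render one chunk as a gap-padded alignment ('fill'-like behaviour).

    Each sequence is a hypothetical (match, inserts) pair: `match` holds one
    character per model position, and `inserts` maps a model position to the
    residues inserted after it.
    """
    # Widest insertion after each model position, looking at this chunk only.
    widths = [
        max(len(ins.get(i, "")) for _, ins in chunk) for i in range(n_match)
    ]
    rows = []
    for match, ins in chunk:
        # Match character, then this sequence's insertion padded with gaps.
        row = "".join(
            match[i] + ins.get(i, "").ljust(widths[i], "-")
            for i in range(n_match)
        )
        rows.append(row)
    return rows


# Two chunks aligned independently against the same 4-position model.
chunk_a = [("ACGT", {}), ("ACGT", {1: "TT"})]  # insertion after position 1
chunk_b = [("ACGT", {}), ("AC-T", {2: "G"})]   # insertion after position 2

aln_a = align_chunk(chunk_a, 4)
aln_b = align_chunk(chunk_b, 4)
print(aln_a)  # ['AC--GT', 'ACTTGT']  (width 6)
print(aln_b)  # ['ACG-T', 'AC-GT']    (width 5)
```

Each chunk is a valid MSA on its own, but the two have different widths and their columns mean different things, so simply concatenating the files would not give a valid MSA.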

alexpiper · 2024-10-31T22:31:29Z

Ahh yes, I get the issue, good point.

I was initially thinking it might make more sense to merge the chunks before going into FILTER_PHMM (if coding) or a traditional MSA aligner (if non-coding), before going into REMOVE_CONTAM. Starting with an aligned FASTA, calculating a distance matrix, and hierarchically clustering it into OTUs may be faster/more robust than kmer::otu, and it would save realigning everything again at the end. But this would lose the efficiency of being able to align some chunks while others are still downloading, which is probably quite bad for runtime.

Instead, there may be something we can output from FILTER_PHMM that keeps track of the alignment coordinates for each alignment relative to the reference PHMM, which we can then use to stitch the separately aligned chunks into a single alignment and pad out with gaps.

Conceptually I prefer the idea of outputting an aligned FASTA at the end and just unaligning for IDTAXA (can be just a regex removing any - characters), as it's pretty useful for additional analyses (finding good primers, checking whether species are monophyletic, etc.).
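As a rough sketch of the unaligning step (in Python for illustration; `dealign_fasta` is a hypothetical helper name, not something in the pipeline), the one thing to watch is that the regex should only touch sequence lines, since FASTA headers can contain `-` too:

```python
import re

GAP = re.compile(r"-")


def dealign_fasta(text):
    """Strip gap characters from the sequence lines of FASTA-formatted text."""
    lines = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Header line: keep verbatim, it may legitimately contain '-'.
            lines.append(line)
        else:
            seq = GAP.sub("", line)
            if seq:  # skip lines that were nothing but gaps
                lines.append(seq)
    return "\n".join(lines) + "\n"


aligned = ">seq-1\nAC--GT\n>seq2\nA-C\n"
print(dealign_fasta(aligned), end="")
```

If the aligner also emits `.` as padding, the pattern would need to be widened to `[-.]`; that is an assumption about the output format rather than something checked here.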

jackscanlan · 2024-10-31T23:38:05Z

> I was initially thinking it might make more sense to merge the chunks before going into FILTER_PHMM (if coding) or a traditional MSA aligner (if non-coding), before going into REMOVE_CONTAM. Starting with an aligned FASTA, calculating a distance matrix, and hierarchically clustering it into OTUs may be faster/more robust than kmer::otu, and it would save realigning everything again at the end. But this would lose the efficiency of being able to align some chunks while others are still downloading, which is probably quite bad for runtime.

Yes, the inefficiency of waiting for all sequences to download before aligning would be bad, but an additional problem is that FILTER_PHMM is used as a "does this sequence actually contain the marker" filter (and a subsetting/extraction tool for mitogenomes), which still needs to be done before any alignment can take place (even a single bad sequence going into an MSA step will really ruin things).

> Instead, there may be something we can output from FILTER_PHMM that keeps track of the alignment coordinates for each alignment relative to the reference PHMM, which we can then use to stitch the separately aligned chunks into a single alignment and pad out with gaps.

I think this is possible but I imagine it would require writing a lot of code outside of existing tools, which could be risky/slow. But something to keep in mind for the future if bulk alignment proves to be an unreasonable bottleneck for the entire pipeline
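For future reference, a minimal Python sketch of what that stitching could look like. It assumes a hypothetical sidecar output per chunk, `col_pos`, which labels every alignment column with the model position it belongs to (first column at a position is the match column, further columns are insertions after it). Nothing in FILTER_PHMM produces this today, and insertions before the first model position are ignored here:

```python
from collections import Counter


def stitch_chunks(chunks, n_match):
    """Merge separately aligned chunks into one gap-padded alignment.

    `chunks` is a list of (rows, col_pos) pairs: `rows` are equal-length
    aligned strings, and `col_pos[j]` is the model position of column j.
    Columns are assumed to be ordered by model position.
    """
    # Number of columns each chunk spends on each model position.
    per_chunk = [Counter(col_pos) for _, col_pos in chunks]
    # Global width per model position = widest block seen in any chunk.
    width = [max(c[i] for c in per_chunk) for i in range(n_match)]

    merged = []
    for (rows, _), counts in zip(chunks, per_chunk):
        for row in rows:
            out, j = [], 0
            for i in range(n_match):
                block = row[j:j + counts[i]]
                out.append(block.ljust(width[i], "-"))  # pad to global width
                j += counts[i]
            merged.append("".join(out))
    return merged


chunk_a = (["AC--GT", "ACTTGT"], [0, 1, 1, 1, 2, 3])
chunk_b = (["ACG-T", "AC-GT"], [0, 1, 2, 2, 3])
merged = stitch_chunks([chunk_a, chunk_b], 4)
print(merged)  # ['AC--G-T', 'ACTTG-T', 'AC--G-T', 'AC---GT']
```

Insertion columns are only padded, not aligned against each other, which is the usual compromise for PHMM insert states. The real work would be getting reliable column-to-model coordinates out of the alignment step in the first place.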

> Conceptually I prefer the idea of outputting an aligned FASTA at the end and just unaligning for IDTAXA (can be just a regex removing any - characters), as it's pretty useful for additional analyses (finding good primers, checking whether species are monophyletic, etc.).

I agree aligned output is very useful -- easy to make that the default if we like in the future.

jackscanlan added the validation label Oct 31, 2024

jackscanlan changed the title ~~Check PHMM-based alignment necessarily outputs an MSA~~ Check if PHMM-based alignment necessarily outputs an MSA Oct 31, 2024
