Check if PHMM-based alignment necessarily outputs an MSA #35

Open
jackscanlan opened this issue Oct 31, 2024 · 4 comments

jackscanlan commented Oct 31, 2024

I don't think this is true if an aligned sequence has an insertion relative to the PHMM.

If this is false, FILTER_PHMM sequences will need to be dealigned, and aligned output (#34) will need to be produced by an independent process (if model training à la #4 requires unaligned input, make sure both processes can run in parallel).

jackscanlan changed the title from "Check PHMM-based alignment necessarily outputs an MSA" to "Check if PHMM-based alignment necessarily outputs an MSA" on Oct 31, 2024
@alexpiper

This should be handled by the 'extra' argument in map_to_model:

@param extra How to handle insertions which were not part of the PHMM model. 'drop' will truncate all sequences to the shortest alignment length, while 'fill' will use gaps to pad all sequences out to the longest alignment length.

But it's definitely worth validating, as it's been a while since this was tested.
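As a rough illustration of what the quoted documentation string describes, here is a toy Python sketch of the two behaviours. It is only a reading of the `@param extra` text above, not the actual implementation of map_to_model, and the function name `equalise_lengths` is hypothetical.

```python
def equalise_lengths(seqs, extra="fill"):
    """Toy model of the documented 'extra' behaviours (hypothetical helper).

    'fill' pads every sequence with gaps to the longest length;
    'drop' truncates every sequence to the shortest length.
    """
    if not seqs:
        return []
    lengths = [len(s) for s in seqs]
    if extra == "fill":
        width = max(lengths)
        return [s.ljust(width, "-") for s in seqs]
    if extra == "drop":
        width = min(lengths)
        return [s[:width] for s in seqs]
    raise ValueError("extra must be 'fill' or 'drop'")


def is_rectangular(seqs):
    """True if all sequences share one length (necessary for an MSA)."""
    return len({len(s) for s in seqs}) <= 1
```

Either option yields sequences of equal length, but equal length alone does not show that the columns are homologous, which is the part worth validating.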


jackscanlan commented Oct 31, 2024 via email

@alexpiper

Ahh yes, I get the issue, good point.

I was initially thinking it might make more sense to merge the chunks before going into FILTER_PHMM (if coding) or a traditional MSA aligner (if non-coding), before going into REMOVE_CONTAM. Starting with an aligned FASTA, calculating a distance matrix, and hierarchically clustering it into OTUs may be faster/more robust than kmer::otu, and it would save realigning everything again at the end. But this would lose the efficiency of being able to align some chunks while others are still downloading, which is probably quite bad for runtime.

Instead, there may be something we can output from FILTER_PHMM that keeps track of the alignment coordinates for each alignment relative to the reference PHMM, which we can then use to stitch the separately aligned chunks into a single alignment and pad out with gaps.

Conceptually I prefer the idea of outputting an aligned FASTA at the end and just unaligning for IDTAXA (this can be just a regex removing any - characters), as it's pretty useful for additional analyses (finding good primers, checking whether species are monophyletic, etc.).

@jackscanlan

> I was initially thinking it might make more sense to merge the chunks before going into FILTER_PHMM (if coding) or a traditional MSA aligner (if non-coding), before going into REMOVE_CONTAM. Starting with an aligned FASTA, calculating a distance matrix, and hierarchically clustering it into OTUs may be faster/more robust than kmer::otu, and it would save realigning everything again at the end. But this would lose the efficiency of being able to align some chunks while others are still downloading, which is probably quite bad for runtime.

Yes, the inefficiency of waiting for all sequences to download before aligning would be bad. An additional problem is that FILTER_PHMM is used as a "does this sequence actually contain the marker" filter (and as a subsetting/extraction tool for mitogenomes), and that filtering still needs to be done before any alignment can take place (even a single bad sequence going into an MSA step will really ruin things).

> Instead, there may be something we can output from FILTER_PHMM that keeps track of the alignment coordinates for each alignment relative to the reference PHMM, which we can then use to stitch the separately aligned chunks into a single alignment and pad out with gaps.

I think this is possible, but I imagine it would require writing a lot of code outside of existing tools, which could be risky/slow. It's something to keep in mind for the future if bulk alignment proves to be an unreasonable bottleneck for the entire pipeline.
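For what the stitching idea could look like, here is a minimal Python sketch. It assumes a hypothetical per-sequence record that FILTER_PHMM does not currently produce: a `match` string with one character per model column (`-` for a deletion) plus an `inserts` dict mapping an insert slot (0 to model length) to the residues inserted before that column. Nothing here reflects an existing tool's output format.

```python
def stitch(records, model_len):
    """Merge separately aligned records into one rectangular alignment.

    Each record is a dict with hypothetical keys:
      'id'      : sequence identifier
      'match'   : string of length model_len, '-' where a column is deleted
      'inserts' : {slot: residues}, slot k sits just before model column k
    """
    # Widest insertion seen at each of the model_len + 1 insert slots.
    widths = [0] * (model_len + 1)
    for rec in records:
        if len(rec["match"]) != model_len:
            raise ValueError(f"{rec['id']}: match length != model length")
        for slot, ins in rec["inserts"].items():
            widths[slot] = max(widths[slot], len(ins))

    stitched = {}
    for rec in records:
        parts = []
        for slot in range(model_len + 1):
            # Pad this sequence's insertion (possibly empty) to the slot width.
            parts.append(rec["inserts"].get(slot, "").ljust(widths[slot], "-"))
            if slot < model_len:
                parts.append(rec["match"][slot])
        stitched[rec["id"]] = "".join(parts)
    return stitched


# Records from two chunks that were aligned independently.
chunk_a = [{"id": "s1", "match": "ACGT", "inserts": {}}]
chunk_b = [{"id": "s2", "match": "A-GT", "inserts": {2: "TT"}}]
merged = stitch(chunk_a + chunk_b, model_len=4)
```

Because every chunk is expressed in the same model coordinates, chunks can be aligned as they arrive and only this cheap merge has to wait for all of them. Residues within an insert slot are simply left-justified here, so they are not aligned to each other.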

> Conceptually I prefer the idea of outputting an aligned FASTA at the end and just unaligning for IDTAXA (this can be just a regex removing any - characters), as it's pretty useful for additional analyses (finding good primers, checking whether species are monophyletic, etc.).

I agree aligned output is very useful, and it would be easy to make that the default in the future if we like.
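The "regex removing any - characters" step could be sketched as follows. This is an illustrative Python version with hypothetical function names, assuming `-` is the only gap character. Header lines are passed through untouched, since sequence identifiers can legitimately contain hyphens.

```python
import re

GAP = re.compile(r"-")


def dealign(seq):
    """Remove gap characters from one aligned sequence string."""
    return GAP.sub("", seq)


def dealign_fasta(lines):
    """Yield FASTA lines with gaps stripped from sequence lines only."""
    for line in lines:
        line = line.rstrip("\n")
        # Headers start with '>' and may contain '-' in identifiers.
        yield line if line.startswith(">") else dealign(line)


aligned = [">sp-1\n", "AC--GT\n", ">sp-2\n", "A-TTGT\n"]
unaligned = list(dealign_fasta(aligned))
```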
