I recently noticed that the placements of all query sequences can be affected by the presence of a small number of outlier sequences when placing onto a large tree. This problem appears to especially affect the pendant length estimates and less so the edge placements. I noticed this issue when running different subsets of a dataset with PICRUSt2, which wraps EPA-NG.
I've reproduced this for query datasets of 323 sequences placed into a tree of 20,000 sequences. In this example only one query sequence differs between the test datasets (essentially, in one case the query sequence doesn't align to the reference).
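In the original case the focal query sequence alignment looks like this:
For the second test file I swapped in random DNA (with the same header); in that alignment only a single base aligned.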
You can see in the jplace outputs of running EPA-NG that there are many differences in the placements, particularly in the pendant distances.
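For anyone who wants to quantify these differences rather than eyeball the two files, here is a minimal sketch in Python. It assumes the standard jplace layout (a top-level `fields` list that includes `edge_num`, `like_weight_ratio` and `pendant_length`, and a `placements` list whose entries carry `p` rows plus `n` or `nm` names). The function names and the tolerance are hypothetical and not part of EPA-NG or PICRUSt2.

```python
import json


def best_placements(jplace):
    """Map each query name to its highest-LWR placement as a field -> value dict."""
    fields = jplace["fields"]
    lwr = fields.index("like_weight_ratio")
    best = {}
    for entry in jplace["placements"]:
        # Keep only the placement row with the largest likelihood weight ratio.
        top = max(entry["p"], key=lambda row: row[lwr])
        record = dict(zip(fields, top))
        # "n" holds plain names; "nm" holds [name, multiplicity] pairs.
        names = list(entry.get("n", [])) + [nm[0] for nm in entry.get("nm", [])]
        for name in names:
            best[name] = record
    return best


def compare(jplace_a, jplace_b, tol=1e-6):
    """List queries whose best edge or pendant length differs between two runs."""
    a, b = best_placements(jplace_a), best_placements(jplace_b)
    diffs = []
    for name in sorted(a.keys() & b.keys()):
        edge_changed = a[name]["edge_num"] != b[name]["edge_num"]
        delta = b[name]["pendant_length"] - a[name]["pendant_length"]
        if edge_changed or abs(delta) > tol:
            diffs.append((name, edge_changed, delta))
    return diffs


def load(path):
    """Read a jplace file (plain JSON) into a dict."""
    with open(path) as handle:
        return json.load(handle)
```

Something like `compare(load("original.jplace"), load("funky.jplace"))` (file names here are placeholders) would return one `(name, edge_changed, pendant_delta)` tuple per query that differs between the two runs. Only the best placement per query is compared, which is a simplification.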
I think this issue might be related to issue #29. I didn't expect all placements to be affected by a single weird sequence. I wasn't able to reproduce this issue with the example datasets used in the EPA-NG paper and I'm thinking that maybe this issue only arises with large trees. Any insight would be greatly appreciated!
You can see the input and output files attached in this zipfile: placement_test.zip.
In the EPA-NG command I ran, STUDY_SEQS corresponds to study_seqs_hmmalign_original.fasta and study_seqs_hmmalign_funky.fasta for the original dataset and the dataset with the outlier sequence, respectively. The output jplace files are named the same way.
Thanks,
Gavin
Thank you for finding this! It could very well be related to #29 as you say... though even then it seems strange. My initial thought is that there might be some side effects of the optimization functions, though I thought I had properly isolated the relevant code, even to the (slight) detriment of performance.
Early in the new year, when my current paper is finally submitted, I will most likely have another go at epa to solve some issues and include a bunch of useful functionality, such as the issues with memory handling... I'm afraid you will have to stay tuned until then.
I am confused... My impression was that the placement of each query sequence by EPA-ng was completely independent from all the other query sequences. Is this just a bug which becomes visible for large trees and/or weird query sequences, or is there actually some kind of intentional information sharing between queries? This is particularly relevant for me because I have been assuming that it makes no difference to the results if I split up my queries (of which I have millions) into smaller batches and run epa-ng on them independently.