I've been using the SAMSA2 pipeline and it works great for my application.
One thing I've noticed is that the genus/species names reported for Step 5 outputs are parsed using the final two space-separated names in the taxonomy. Most of the time this works well enough (e.g., the output is something like Bacillus subtilis, a proper genus and species pair).
But I seem to have quite a few cases where the output is something like "sp. Root239" or "sp. NRRL", the latter of which is particularly uninformative because NRRL is a type collection and so could really be pointing to anything.
I'm wondering if there is a way to modify the output of the script so that the user can get the full taxonomy? I see that the DIAMOND_general_RefSeq_analysis_counter.py Python script handles this parsing (around line 132, if I'm reading it correctly?). Even an option to add a taxid column to the output here would be useful.
Thanks for any input you have on this!
Matt
This is a good suggestion, and I'll tag this as enhancement - it shouldn't be too difficult to add another parameter to capture the full name when extracting from the RefSeq database. Feel free to submit a PR if you want to tackle this, or I'll work on it when I have a chance and will update this ticket.
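For anyone who wants to pick this up, here is a rough sketch of the idea. It is not the actual code in DIAMOND_general_RefSeq_analysis_counter.py; it assumes the RefSeq title carries the organism name in trailing square brackets, and the function name `extract_organism`, the `full_name` flag, and the example accession are all made up for illustration.

```python
def extract_organism(title, full_name=False):
    """Pull the organism name out of a RefSeq-style title line.

    Assumption: the organism sits in the last [...] pair of the title.
    Names that contain brackets themselves (e.g. "[[Clostridium] scindens]")
    are not handled by this simple version.
    """
    start = title.rfind("[")
    end = title.rfind("]")
    if start == -1 or end < start:
        return "unknown"

    organism = title[start + 1:end].strip()
    if full_name:
        # Hypothetical new behavior: keep the whole name
        return organism
    # Behavior described in this issue: keep only the final two words
    return " ".join(organism.split()[-2:])


# Illustrative title; the accession is a placeholder, not a real record
title = "WP_000000000.1 hypothetical protein [Bacillus sp. Root239]"

short_name = extract_organism(title)
long_name = extract_organism(title, full_name=True)
print(short_name)
print(long_name)
```

With the flag off, the example reproduces the uninformative "sp. Root239" style of output; with it on, the genus is kept as well. The flag could be wired to a command-line option so the default output stays unchanged.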