Parsing of genus/species names #37

mattjmeier · 2019-10-07T13:59:14Z

Hello,

I've been using the SAMSA2 pipeline and it works great for my application.

One thing I've noticed is that the genus/species names reported in the Step 5 outputs are parsed by taking the final two space-separated words of the taxonomy string. Most of the time this works well enough (e.g., the output is something like Bacillus subtilis, a proper genus and species pair).

But I have quite a few cases where the output is something like "sp. Root239" or "sp. NRRL". The latter is particularly uninformative because NRRL is a culture collection, so the name could really be pointing to anything.
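To illustrate the behavior described above, here is a minimal sketch of "keep the final two space-separated words". It is not the actual SAMSA2 code; the function name `last_two_words` and the organism name "Somegenus sp. Root239" are made up for the example.

```python
def last_two_words(name: str) -> str:
    """Keep only the final two space-separated words of an organism name."""
    return " ".join(name.split()[-2:])


# Hypothetical inputs: one ordinary binomial name and one strain-style name.
examples = ["Bacillus subtilis", "Somegenus sp. Root239"]

for name in examples:
    # The binomial survives intact; the strain-style name loses its genus.
    print(f"{name!r} -> {last_two_words(name)!r}")
```

With a strain-style name the genus is the third word from the end, so it is the part that gets dropped.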

I'm wondering whether there is a way to modify the script's output so that the user can get the full taxonomy. I see that the DIAMOND_general_RefSeq_analysis_counter.py Python script handles this parsing (around line 132, if I'm reading this correctly). Having an option to add a taxid column to the output would also be useful.

Thanks for any input you have on this!
Matt

transcript · 2019-10-11T22:46:35Z

Hi Matt,

This is a good suggestion, and I'll tag this as enhancement - it shouldn't be too difficult to add another parameter to capture the full name when extracting from the RefSeq database. Feel free to submit a PR if you want to tackle this, or I'll work on it when I have a chance and will update this ticket.

Best,
Sam
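A rough sketch of what such a parameter might look like, not the actual change to DIAMOND_general_RefSeq_analysis_counter.py. It assumes each hit title ends with the organism name in square brackets, as in NCBI RefSeq protein deflines. The names `organism_name`, `count_organisms`, and `full_name`, along with the accessions and titles, are hypothetical.

```python
import re
from collections import Counter

# Assumption: the organism name is the bracketed text at the end of the title.
# Names containing nested brackets are not handled and fall back to "unknown".
ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def organism_name(title: str, full_name: bool = False) -> str:
    """Extract the organism from a hit title.

    full_name=False mimics the reported behavior (final two words only);
    full_name=True returns the whole bracketed name.
    """
    match = ORGANISM_RE.search(title)
    if match is None:
        return "unknown"
    name = match.group(1).strip()
    if full_name:
        return name
    return " ".join(name.split()[-2:])


def count_organisms(titles, full_name: bool = False) -> Counter:
    """Tally hits per organism name."""
    return Counter(organism_name(t, full_name) for t in titles)


# Made-up hit titles for illustration only.
titles = [
    "WP_000000001.1 hypothetical protein [Bacillus subtilis]",
    "WP_000000002.1 ABC transporter permease [Somegenus sp. Root239]",
    "WP_000000003.1 DNA gyrase subunit A [Bacillus subtilis]",
]

for name, hits in count_organisms(titles, full_name=True).most_common():
    print(f"{hits}\t{name}")
```

Keeping the short form as the default would leave existing outputs unchanged, while the full name stays available on request.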

transcript added the enhancement and Python (Bug or fix related to the Python scripts) labels on Oct 11, 2019
