When using the standardised diamond analysis counter I noticed that a FASTA file of the input database was required.
Since this was a rather large database I was wondering why this was needed.
I realised that the Python script parses the FASTA headers to look up the function and species for each accession number.
DIAMOND can provide this itself by adding the FASTA headers as an additional column in the tabular output. Only a few modifications are needed to use the DIAMOND result file instead of parsing the database FASTA file (unless that file is used for other statistics?).
When executing DIAMOND, make sure to include --salltitles.
When converting the DIAMOND DAA file to TSV, use --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle to get the default format plus the subject title.
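As a rough illustration of the two steps above, the sketch below assembles both DIAMOND invocations as argument lists without running them. The `blastx` subcommand, the file names, and the helper name `build_diamond_commands` are assumptions for the example; only `--salltitles` and the `--outfmt 6 ... stitle` field list come from this issue.

```python
# Column layout from the issue: the 12 default tabular fields plus stitle.
OUTFMT_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle",
]


def build_diamond_commands(db, query, daa, tsv):
    """Return the alignment and view commands as argument lists (nothing is run)."""
    # Assumption: blastx is the alignment mode; swap in blastp if appropriate.
    align = ["diamond", "blastx", "--db", db, "--query", query,
             "--daa", daa, "--salltitles"]
    view = ["diamond", "view", "--daa", daa, "--out", tsv,
            "--outfmt", "6", *OUTFMT_FIELDS]
    return align, view


# Hypothetical file names, for illustration only.
align_cmd, view_cmd = build_diamond_commands(
    "refseq_db", "reads.fastq", "hits.daa", "hits.tsv")
print(" ".join(align_cmd))
print(" ".join(view_cmd))
```

The lists can be passed to `subprocess.run` as-is once the paths point at real files.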
Hi Jasper, I was not aware that DIAMOND is able to pull the headers out and include those in the outfile! I definitely want to explore this a bit more (and apologies for the delayed response to this).
If you could submit a cleaned up version, that would be great - I'll probably have to do some testing on my own but a cleaned up version could really be useful for making sure that I can make this update (and credit you, of course!).
With the output format above (the default columns plus the subject title), some minor code modifications are needed; see https://pastebin.com/5eq6GPaB for the whole script.
First, reuse the input TSV file in place of the database file:
# Reuse the db parsing code for the DIAMOND file with the additional subject title (stitle) column
db = open(infile_name, "r")
Change the split so that only the last column (the subject title) is kept:
line = line.strip().split("\t")[-1]
Drop the [1:] slice when extracting the ID, as it is no longer needed:
db_id = str(splitline[0].split()[0]) # [1:] # Not needed anymore
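To make the two changes above concrete, here is a minimal, self-contained sketch of pulling the accession, function, and species out of the last column of one result line. It assumes titles of the form `accession description [species]`; the function name `parse_hit_title` and the sample line are made up for illustration and are not taken from the pastebin script.

```python
def parse_hit_title(tsv_line):
    """Split one DIAMOND tabular line into (accession, function, species)."""
    # The subject title is the last tab-separated column (stitle).
    title = tsv_line.rstrip("\n").split("\t")[-1]
    # Assumption: the species name sits in square brackets after the description.
    head, bracket, tail = title.partition("[")
    accession, _, function = head.strip().partition(" ")
    species = tail.rsplit("]", 1)[0].strip() if bracket else ""
    return accession, function.strip(), species


# Hypothetical result line with the 12 default columns plus the title.
sample = ("read1\tWP_000000001.1\t98.5\t33\t0\t0\t1\t99\t10\t42\t1e-10\t70.1\t"
          "WP_000000001.1 hypothetical protein [Escherichia coli]")
print(parse_hit_title(sample))
```

Because the title column carries no leading `>`, the accession can be used directly as the lookup key.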
In addition, I added a check for multispecies entries, which are currently counted as errors.
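The check itself is not reproduced in this post. As a guess at what such a check could look like, the sketch below assumes the entries in question are titles whose description starts with a `MULTISPECIES:` prefix; the constant and the function name `strip_multispecies` are hypothetical and may differ from the pastebin version.

```python
# Assumption: multispecies entries are marked by this prefix in the description.
MULTISPECIES_PREFIX = "MULTISPECIES:"


def strip_multispecies(function):
    """Return (clean_function, is_multispecies) for a function description."""
    text = function.strip()
    if text.startswith(MULTISPECIES_PREFIX):
        # Remove the prefix so the remaining text is counted as a normal function.
        return text[len(MULTISPECIES_PREFIX):].strip(), True
    return text, False


print(strip_multispecies("MULTISPECIES: DNA-binding protein"))
print(strip_multispecies("hypothetical protein"))
```

Returning the flag alongside the cleaned name leaves it up to the caller whether to count such hits separately.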
I have not yet removed the database argument and the related code, as I am not sure whether this method is preferred.
I can submit a cleaned up version later if needed.