Header parsing using blast format 6 with salltitles #60

jjkoehorst · 2021-03-24T14:21:07Z

When using the standardised DIAMOND analysis counter, I noticed that it requires a FASTA file of the input database.
Since this is a rather large database, I wondered why it was needed.

I realised that the Python script parses the FASTA headers to look up the accession number, function, and species.

DIAMOND can include the FASTA headers as an additional column in its tabular output. Only a few modifications are needed to use the DIAMOND result file instead of parsing the database FASTA file (unless the FASTA file is used for other statistics?).

When executing DIAMOND, make sure to include --salltitles.
When converting the DIAMOND DAA file to TSV, use --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle to get the default format plus the subject title.
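
As a rough illustration of the resulting layout, the sketch below reads such a 13-column TSV and keys each field by the names from the --outfmt list above. The helper name `read_hits` and the sample line are made up for this example and are not part of the counter script.

```python
import csv
import io

# Column order copied from the --outfmt 6 field list above.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle",
]


def read_hits(handle):
    """Yield one dict per alignment line, keyed by column name (hypothetical helper)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip empty or malformed lines
        yield dict(zip(COLUMNS, row))


# Made-up example line, for illustration only.
sample = (
    "read_1\tWP_000000001.1\t98.5\t50\t1\t0\t1\t150\t10\t59\t1e-20\t95.1\t"
    "WP_000000001.1 hypothetical protein [Escherichia coli]\n"
)
hits = list(read_hits(io.StringIO(sample)))
```

With this layout the subject title is always the last column, which is what the modified split further down relies on.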

Then some minor code modifications are needed; see https://pastebin.com/5eq6GPaB for the whole script.

First, reuse the input TSV file:

# Reusing db code for parsing diamond file with additional salltitles column
db = open(infile_name, "r")

Change the split so that only the last column (the subject title) is kept:
line = line.strip().split("\t")[-1]

Drop the trailing [1:] slice, which is no longer needed:
db_id = str(splitline[0].split()[0]) # [1:] # Not needed anymore
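
Put together, the two changes above amount to roughly the following. This is a minimal sketch rather than the code from the pastebin script; `count_db_ids` and the sample lines are hypothetical, and it assumes the subject title is the last tab-separated column and starts with the database ID.

```python
from collections import Counter


def count_db_ids(lines):
    """Tally hits per database ID, taken from the last (stitle) column."""
    counts = Counter()
    for line in lines:
        title = line.strip().split("\t")[-1]
        if not title:
            continue  # nothing to parse on an empty line
        # The ID is the first whitespace-separated token of the title.
        # No [1:] slice here: the column carries no leading ">".
        db_id = title.split()[0]
        counts[db_id] += 1
    return counts


# Made-up, shortened lines for illustration (real lines have 13 columns).
example = [
    "read_1\tWP_1\tWP_1 hypothetical protein [Escherichia coli]\n",
    "read_2\tWP_1\tWP_1 hypothetical protein [Escherichia coli]\n",
    "read_3\tWP_2\tWP_2 ABC transporter [Bacillus subtilis]\n",
]
db_counts = count_db_ids(example)
```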

In addition, I added a check for MULTISPECIES entries, as they are currently counted as errors:

			if "MULTISPECIES" not in line:
				db_error_counter += 1
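
To show the idea in isolation, here is a small sketch of how such a check might sit next to the header parsing. It assumes titles of the form "ID function [organism]"; `parse_title`, `count_errors`, and the example titles are invented for illustration, and the actual script may detect errors differently.

```python
def parse_title(title):
    """Split 'ID function [organism]' into its parts, or return None if it does not fit."""
    db_id, _, rest = title.partition(" ")
    start = rest.rfind("[")
    end = rest.rfind("]")
    if not db_id or start == -1 or end < start:
        return None
    return db_id, rest[:start].strip(), rest[start + 1:end]


def count_errors(titles):
    """Count unparseable titles, ignoring those flagged as MULTISPECIES."""
    errors = 0
    for title in titles:
        if parse_title(title) is None and "MULTISPECIES" not in title:
            errors += 1
    return errors


# Made-up titles for illustration only.
titles = [
    "WP_1 hypothetical protein [Escherichia coli]",
    "WP_2 MULTISPECIES: ABC transporter",
    "WP_3 broken header",
]
db_error_counter = count_errors(titles)
```

Only the third title counts as an error here: the first parses cleanly and the second is exempted by the MULTISPECIES check.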

I have not yet removed the database argument and related code, as I am not sure whether this method is preferred.

I can submit a cleaned-up version later if needed.

transcript · 2021-06-09T16:57:45Z

Hi Jasper, I was not aware that DIAMOND is able to pull the headers out and include those in the outfile! I definitely want to explore this a bit more (and apologies for the delayed response to this).

If you could submit a cleaned-up version, that would be great. I'll probably have to do some testing on my own, but a cleaned-up version would be really useful for making sure that I can make this update (and credit you, of course!).

Best,
Sam

bartns mentioned this issue Jun 16, 2021

counter scripts using blast format 6 with salltitles #64

Open
