Error while running DIAMOND_subsystems_analysis_counter.py #86

adrianestrada1405 · 2024-07-26T00:54:27Z

I tried running the DIAMOND_subsystems_analysis_counter.py script with the diamond .m8 output I generated by annotating against the subsystems database but I keep getting this message:

File "/work/ae180/Metatranscriptomics_files/DIAMOND_subsystems_analysis_counter.py", line 166, in
org = db_hier_dictionary[entry]
~~~~~~~~~~~~~~~~~~^^^^^^^

transcript · 2024-07-26T16:48:31Z

Hi @adrianestrada1405, can you share the command you're running?

It looks like there's an issue, based on your error, with one of the outputs in the m8 file not having a match in the Subsystems database being used as the reference. Is the set of tildes the only error message you see?

adrianestrada1405 · 2024-07-26T17:14:28Z

Hi! Thank you so much for your quick reply, the command line I am running in a linux terminal (miniconda) is as follows:

python DIAMOND_subsystems_analysis_counter.py -I S1_diamond_subsys.m8 -D subsys_db.fa -O -P

The full error message reads as follows:

1M lines processed so far in 1.4564619064331055 seconds.
2M lines processed so far in 2.7852015495300293 seconds.
3M lines processed so far in 4.12341570854187 seconds.
4M lines processed so far in 5.438323736190796 seconds.
5M lines processed so far in 6.647214412689209 seconds.
6M lines processed so far in 7.874895811080933 seconds.
7M lines processed so far in 9.095749378204346 seconds.
8M lines processed so far in 10.322613954544067 seconds.
9M lines processed so far in 11.49899435043335 seconds.
10M lines processed so far in 12.687752723693848 seconds.
11M lines processed so far in 13.862505912780762 seconds.
12M lines processed so far in 15.037629842758179 seconds.
13M lines processed so far in 16.223573923110962 seconds.
14M lines processed so far in 17.431682109832764 seconds.
15M lines processed so far in 18.673274993896484 seconds.

Analysis of S1_diamond_subsys.m8 complete.
Number of total lines: 15011767
Number of unique sequences: 720690
Time elapsed: 18.68695044517517 seconds.

Starting database analysis now.
1000000 lines processed so far in 5.659058094024658 seconds.
2000000 lines processed so far in 14.029933214187622 seconds.
3000000 lines processed so far in 24.053369522094727 seconds.
4000000 lines processed so far in 35.13755011558533 seconds.
5000000 lines processed so far in 41.43318462371826 seconds.
6000000 lines processed so far in 44.62882852554321 seconds.
7000000 lines processed so far in 47.885552167892456 seconds.

Success!
Time elapsed: 51.02638864517212 seconds.
Number of lines: 7939855
Number of errors: 0
Traceback (most recent call last):
File "/work/ae180/Metatranscriptomics_files/DIAMOND_subsystems_analysis_counter.py", line 160, in
KeyError: 'sseqid'

adrianestrada1405 · 2024-07-26T17:18:36Z

The first error message came up when I ran the same command line (python DIAMOND_subsystems_analysis_counter.py -I S1_diamond_subsys.m8 -D subsys_db.fa -O) without the -P, I originally had run it with -P because I noticed some of the following post-processing R scripts needed the receipt generated by this python script when running it with the -P flag.

transcript · 2024-07-29T19:15:46Z

Hi, thanks for posting the additional info! My guess is that there may be a header in the Subsystems database file that has 'sseqid' in it, which doesn't match to any actual row of data.

Could you do a grep search, like:

grep -A2 -B2 "sseqid" subsys_db.fa

and see what comes up?

adrianestrada1405 · 2024-07-29T19:47:21Z

I did the grep search you suggested but could not find a match in the database.... What else could we try?

transcript · 2024-07-29T20:17:28Z

Is the key in the S1_diamond_subsys.m8 file? Do a grep search on there?

It looks like that phrase is a part of the blast m8 format headers, so I wonder if it's a header line in the m8 file.

adrianestrada1405 · 2024-07-29T20:21:16Z

I just did the grep search and this came up:

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
_0 fig|331869.3.peg.5487\tNucleoside-diphosphate 49.0 627 317 13421 1541 12 635 3.53e-177 540
_0 fig|1085.1.peg.1260\tNucleoside-diphosphate 48.1 607 313 23415 1598 17 622 1.39e-170 523

adrianestrada1405 · 2024-07-29T20:21:46Z

It is one of the headers of the .m8 file

transcript · 2024-07-29T21:13:17Z

Got it, that makes sense!

I may look into updating the scripts to have them intelligently check to see if there's a header line, and skip it if so... but to immediately unblock you, you could just trim the header line off the m8 file and then re-run.

adrianestrada1405 · 2024-07-29T21:28:41Z

I ran it again and got the same error message, just with the first protein sequence this time:

Success!
Time elapsed: 26.074662923812866 seconds.
Number of lines: 7939855
Number of errors: 0
Traceback (most recent call last):
File "/work/ae180/Metatranscriptomics_files/DIAMOND_subsystems_analysis_counter.py", line 166, in
org = db_hier_dictionary[entry]
~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: 'fig|331869.3.peg.5487\tNucleoside-diphosphate'

transcript added bug Python Bug or fix related to the Python scripts. labels Jul 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error while running DIAMOND_subsystems_analysis_counter.py #86

Error while running DIAMOND_subsystems_analysis_counter.py #86

adrianestrada1405 commented Jul 26, 2024

transcript commented Jul 26, 2024

adrianestrada1405 commented Jul 26, 2024

adrianestrada1405 commented Jul 26, 2024

transcript commented Jul 29, 2024

adrianestrada1405 commented Jul 29, 2024

transcript commented Jul 29, 2024

adrianestrada1405 commented Jul 29, 2024

adrianestrada1405 commented Jul 29, 2024

transcript commented Jul 29, 2024

adrianestrada1405 commented Jul 29, 2024

Error while running DIAMOND_subsystems_analysis_counter.py #86

Error while running DIAMOND_subsystems_analysis_counter.py #86

Comments

adrianestrada1405 commented Jul 26, 2024

transcript commented Jul 26, 2024

adrianestrada1405 commented Jul 26, 2024

adrianestrada1405 commented Jul 26, 2024

transcript commented Jul 29, 2024

adrianestrada1405 commented Jul 29, 2024

transcript commented Jul 29, 2024

adrianestrada1405 commented Jul 29, 2024

adrianestrada1405 commented Jul 29, 2024

transcript commented Jul 29, 2024

adrianestrada1405 commented Jul 29, 2024