Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error while running DIAMOND_subsystems_analysis_counter.py #86

Open
adrianestrada1405 opened this issue Jul 26, 2024 · 10 comments
Open
Labels
bug Python Bug or fix related to the Python scripts.

Comments

@adrianestrada1405
Copy link

I tried running the DIAMOND_subsystems_analysis_counter.py script with the diamond .m8 output I generated by annotating against the subsystems database but I keep getting this message:

File "/work/ae180/Metatranscriptomics_files/DIAMOND_subsystems_analysis_counter.py", line 166, in
org = db_hier_dictionary[entry]
~~~~~~~~~~~~~~~~~~^^^^^^^

@transcript
Copy link
Owner

Hi @adrianestrada1405, can you share the command you're running?

It looks like there's an issue, based on your error, with one of the outputs in the m8 file not having a match in the Subsystems database being used as the reference. Is the set of tildes the only error message you see?

@transcript transcript added bug Python Bug or fix related to the Python scripts. labels Jul 26, 2024
@adrianestrada1405
Copy link
Author

Hi! Thank you so much for your quick reply, the command line I am running in a linux terminal (miniconda) is as follows:

python DIAMOND_subsystems_analysis_counter.py -I S1_diamond_subsys.m8 -D subsys_db.fa -O -P

The full error message reads as follows:

1M lines processed so far in 1.4564619064331055 seconds.
2M lines processed so far in 2.7852015495300293 seconds.
3M lines processed so far in 4.12341570854187 seconds.
4M lines processed so far in 5.438323736190796 seconds.
5M lines processed so far in 6.647214412689209 seconds.
6M lines processed so far in 7.874895811080933 seconds.
7M lines processed so far in 9.095749378204346 seconds.
8M lines processed so far in 10.322613954544067 seconds.
9M lines processed so far in 11.49899435043335 seconds.
10M lines processed so far in 12.687752723693848 seconds.
11M lines processed so far in 13.862505912780762 seconds.
12M lines processed so far in 15.037629842758179 seconds.
13M lines processed so far in 16.223573923110962 seconds.
14M lines processed so far in 17.431682109832764 seconds.
15M lines processed so far in 18.673274993896484 seconds.

Analysis of S1_diamond_subsys.m8 complete.
Number of total lines: 15011767
Number of unique sequences: 720690
Time elapsed: 18.68695044517517 seconds.

Starting database analysis now.
1000000 lines processed so far in 5.659058094024658 seconds.
2000000 lines processed so far in 14.029933214187622 seconds.
3000000 lines processed so far in 24.053369522094727 seconds.
4000000 lines processed so far in 35.13755011558533 seconds.
5000000 lines processed so far in 41.43318462371826 seconds.
6000000 lines processed so far in 44.62882852554321 seconds.
7000000 lines processed so far in 47.885552167892456 seconds.

Success!
Time elapsed: 51.02638864517212 seconds.
Number of lines: 7939855
Number of errors: 0
Traceback (most recent call last):
File "/work/ae180/Metatranscriptomics_files/DIAMOND_subsystems_analysis_counter.py", line 160, in
KeyError: 'sseqid'

@adrianestrada1405
Copy link
Author

The first error message came up when I ran the same command line (python DIAMOND_subsystems_analysis_counter.py -I S1_diamond_subsys.m8 -D subsys_db.fa -O) without the -P, I originally had run it with -P because I noticed some of the following post-processing R scripts needed the receipt generated by this python script when running it with the -P flag.

@transcript
Copy link
Owner

Hi, thanks for posting the additional info! My guess is that there may be a header in the Subsystems database file that has 'sseqid' in it, which doesn't match to any actual row of data.

Could you do a grep search, like:

grep -A2 -B2 "sseqid" subsys_db.fa

and see what comes up?

@adrianestrada1405
Copy link
Author

I did the grep search you suggested but could not find a match in the database.... What else could we try?

@transcript
Copy link
Owner

Is the key in the S1_diamond_subsys.m8 file? Do a grep search on there?

It looks like that phrase is a part of the blast m8 format headers, so I wonder if it's a header line in the m8 file.

@adrianestrada1405
Copy link
Author

I just did the grep search and this came up:

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
_0 fig|331869.3.peg.5487\tNucleoside-diphosphate 49.0 627 317 13421 1541 12 635 3.53e-177 540
_0 fig|1085.1.peg.1260\tNucleoside-diphosphate 48.1 607 313 23415 1598 17 622 1.39e-170 523

@adrianestrada1405
Copy link
Author

It is one of the headers of the .m8 file

@transcript
Copy link
Owner

Got it, that makes sense!

I may look into updating the scripts to have them intelligently check to see if there's a header line, and skip it if so... but to immediately unblock you, you could just trim the header line off the m8 file and then re-run.

@adrianestrada1405
Copy link
Author

I ran it again and got the same error message, just with the first protein sequence this time:

Success!
Time elapsed: 26.074662923812866 seconds.
Number of lines: 7939855
Number of errors: 0
Traceback (most recent call last):
File "/work/ae180/Metatranscriptomics_files/DIAMOND_subsystems_analysis_counter.py", line 166, in
org = db_hier_dictionary[entry]
~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: 'fig|331869.3.peg.5487\tNucleoside-diphosphate'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Python Bug or fix related to the Python scripts.
Projects
None yet
Development

No branches or pull requests

2 participants