Unable to download kraken2 database #904

Open
mohitsharma-123 opened this issue Jan 15, 2025 · 7 comments

Comments

@mohitsharma-123

[INFO - 2025-01-15 12:09:25,528]: Assigning taxonomic IDs to sequences
concurrent.futures.process._RemoteTraceback: 5/52169 project(s), 6 sequence(s), 12.30 Mbp
"""
Traceback (most recent call last):
File "/home/mohitsharma/miniforge-pypy3/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/data/mohitsharma/kraken2/kraken2/k2", line 800, in assign_taxids
with open(out_filepath, "w") as out_file:
FileNotFoundError: [Errno 2] No such file or directory: 'genomes/all/GCF/000/007/145/GCF_000007145.1_ASM714v1/GCF_000007145.1_ASM714v1_genomic.fna'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/mohitsharma/kraken2/kraken2/k2", line 3706, in <module>
k2_main()
File "/data/mohitsharma/kraken2/kraken2/k2", line 3675, in k2_main
build_standard_database(args)
File "/data/mohitsharma/kraken2/kraken2/k2", line 2505, in build_standard_database
download_genomic_library(args)
File "/data/mohitsharma/kraken2/kraken2/k2", line 1899, in download_genomic_library
sequence_to_url = assign_taxid_to_sequences(
File "/data/mohitsharma/kraken2/kraken2/k2", line 1184, in assign_taxid_to_sequences
result = future.result()
File "/home/mohitsharma/miniforge-pypy3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/home/mohitsharma/miniforge-pypy3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: 'genomes/all/GCF/000/007/145/GCF_000007145.1_ASM714v1/GCF_000007145.1_ASM714v1_genomic.fna'
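
For context on the error itself: in Python, `open(path, "w")` creates the file but not any missing parent directories, so it raises `FileNotFoundError` when the directory part of the path does not exist. The path in the traceback is also relative, so it is resolved against the current working directory. The sketch below reproduces that behavior in a scratch directory. `open_for_write` and `demo` are hypothetical helpers, not part of k2, and this is not a claim about what caused the failure above.

```python
import os
import tempfile


def open_for_write(path):
    """Hypothetical helper: create missing parent directories, then open for writing."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return open(path, "w")


def demo():
    """Return (raised, created): did plain open() fail, and did the helper succeed?"""
    with tempfile.TemporaryDirectory() as root:
        # Same shape as the path in the traceback, but under a scratch directory.
        target = os.path.join(root, "genomes", "all", "GCF", "example_genomic.fna")
        try:
            with open(target, "w"):
                pass
            raised = False
        except FileNotFoundError:
            # The parent directories do not exist yet, so open() cannot create the file.
            raised = True
        with open_for_write(target) as out_file:
            out_file.write(">seq\nACGT\n")
        created = os.path.isfile(target)
    return raised, created


if __name__ == "__main__":
    print(demo())
```

Running the build from a different working directory, or with the expected directory tree missing, would produce the same kind of error for a relative path like the one shown.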

@mohitsharma-123
Author

Also during download

[INFO - 2025-01-15 12:05:11,062]: Calculating MD5 sum for /data/mohitsharma/kraken2_db/library/bacteria/genomes/all/GCF/031/477/395/GCF_031477395.1_ASM3147739v1/GCF_031477395.1_ASM3147739v1_genomic.fna.gz
[INFO - 2025-01-15 12:05:11,063]: MD5 sum of /data/mohitsharma/kraken2_db/library/bacteria/genomes/all/GCF/031/477/395/GCF_031477395.1_ASM3147739v1/GCF_031477395.1_ASM3147739v1_genomic.fna.gz is d41d8cd98f00b204e9800998ecf8427e
[INFO - 2025-01-15 12:05:11,063]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/477/395/GCF_031477395.1_ASM3147739v1/GCF_031477395.1_ASM3147739v1_genomic.fna.gz
[WARNING - 2025-01-15 12:05:11,107]: Unable to download genomes/all/GCF/000/740/375/GCF_000740375.1_ASM74037v1/GCF_000740375.1_ASM74037v1_genomic.fna.gz, will try again
[INFO - 2025-01-15 12:05:11,121]: Saved GCF_001712955.1_ASM171295v1_genomic.fna.gz to /data/mohitsharma/kraken2_db/library/bacteria/genomes/all/GCF/001/712/955/GCF_001712955.1_ASM171295v1
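
One detail in the log above: `d41d8cd98f00b204e9800998ecf8427e` is the MD5 digest of empty input, so a file with that checksum contains zero bytes. A minimal sketch for testing a file against that digest follows. The function names are illustrative and are not taken from k2.

```python
import hashlib

# MD5 digest of zero bytes of input.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_empty_by_md5(path):
    """True when the file's checksum equals the checksum of empty input."""
    return md5_of_file(path) == EMPTY_MD5
```

For example, `is_empty_by_md5(...)` could be called on the `.fna.gz` path from the log line above to confirm whether the file on disk is a zero-byte placeholder.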

@mohitsharma-123
Author

The commands I used:
git clone https://github.com/DerrickWood/kraken2
cd kraken2/
./install_kraken2.sh kraken2
cp kraken2/kraken2{,-build,-inspect} /home/mohitsharma/bin/
/data/mohitsharma/kraken2/kraken2/k2 build --standard --no-masking --db /data/mohitsharma/kraken2_db --threads 96

@ch4rr0
Collaborator

ch4rr0 commented Jan 15, 2025

Hello,

This is a strange occurrence. Is the gzipped file present in the stated location?

@mohitsharma-123
Author

After installation I have these files (listed below).

Which gzipped file are you referring to, and where is it supposed to be? Also, why do some files fail to download while the database is being fetched?
[WARNING - 2025-01-15 12:05:11,107]: Unable to download genomes/all/GCF/000/740/375/GCF_000740375.1_ASM74037v1/GCF_000740375.1_ASM74037v1_genomic.fna.gz, will try again

(base) mohitsharma@deep:/data/mohitsharma/kraken2$ !835
cp kraken2/kraken2{,-build,-inspect} /home/mohitsharma/bin/
(base) mohitsharma@deep:/data/mohitsharma/kraken2$ ll /home/mohitsharma/bin/
total 11764
-rwxrwxr-x 1 mohitsharma mohitsharma 12015744 Sep 21 10:36 broot
-rwxrwxr-x 1 mohitsharma mohitsharma 8157 Jan 15 21:30 kraken2
-rwxrwxr-x 1 mohitsharma mohitsharma 10951 Jan 15 21:30 kraken2-build
-rwxrwxr-x 1 mohitsharma mohitsharma 2719 Jan 15 21:30 kraken2-inspect
-rwxrwxr-x 1 mohitsharma mohitsharma 3636 Oct 14 16:18 run_corason
(base) mohitsharma@deep:/data/mohitsharma/kraken2$ ll
total 48
-rw-rw-r-- 1 mohitsharma mohitsharma 6209 Jan 15 21:28 CHANGELOG.md
-rw-rw-r-- 1 mohitsharma mohitsharma 720 Jan 15 21:28 CMakeLists.txt
drwxrwxr-x 2 mohitsharma mohitsharma 4096 Jan 15 21:28 data
drwxrwxr-x 2 mohitsharma mohitsharma 4096 Jan 15 21:28 docs
-rwxrwxr-x 1 mohitsharma mohitsharma 1265 Jan 15 21:28 install_kraken2.sh
drwxrwxr-x 2 mohitsharma mohitsharma 4096 Jan 15 21:30 kraken2
-rw-rw-r-- 1 mohitsharma mohitsharma 1084 Jan 15 21:28 LICENSE
-rw-rw-r-- 1 mohitsharma mohitsharma 358 Jan 15 21:28 Makefile
-rw-rw-r-- 1 mohitsharma mohitsharma 2970 Jan 15 21:28 README.md
drwxrwxr-x 2 mohitsharma mohitsharma 4096 Jan 15 21:28 scripts
drwxrwxr-x 2 mohitsharma mohitsharma 4096 Jan 15 21:30 src
(base) mohitsharma@deep:/data/mohitsharma/kraken2$ cd kraken2/
(base) mohitsharma@deep:/data/mohitsharma/kraken2/kraken2$ ll
total 1016
-rwxrwxr-x 1 mohitsharma mohitsharma 959 Jan 15 21:30 16S_gg_installation.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 956 Jan 15 21:30 16S_rdp_installation.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 1193 Jan 15 21:30 16S_silva_installation.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 947 Jan 15 21:30 add_to_library.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 135016 Jan 15 21:30 build_db
-rwxrwxr-x 1 mohitsharma mohitsharma 2438 Jan 15 21:30 build_gg_taxonomy.pl
-rwxrwxr-x 1 mohitsharma mohitsharma 4654 Jan 15 21:30 build_kraken2_db.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 2191 Jan 15 21:30 build_rdp_taxonomy.pl
-rwxrwxr-x 1 mohitsharma mohitsharma 1269 Jan 15 21:30 build_silva_taxonomy.pl
-rwxrwxr-x 1 mohitsharma mohitsharma 256448 Jan 15 21:30 classify
-rwxrwxr-x 1 mohitsharma mohitsharma 583 Jan 15 21:30 clean_db.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 1277 Jan 15 21:30 cp_into_tempfile.pl
-rwxrwxr-x 1 mohitsharma mohitsharma 4497 Jan 15 21:30 download_genomic_library.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 1760 Jan 15 21:30 download_taxonomy.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 189984 Jan 15 21:30 dump_table
-rwxrwxr-x 1 mohitsharma mohitsharma 50240 Jan 15 21:30 estimate_capacity
-rwxrwxr-x 1 mohitsharma mohitsharma 133505 Jan 15 21:30 k2
-rwxrwxr-x 1 mohitsharma mohitsharma 109248 Jan 15 21:30 k2mask
-rwxrwxr-x 1 mohitsharma mohitsharma 8157 Jan 15 21:30 kraken2
-rwxrwxr-x 1 mohitsharma mohitsharma 10951 Jan 15 21:30 kraken2-build
-rwxrwxr-x 1 mohitsharma mohitsharma 2719 Jan 15 21:30 kraken2-inspect
-rw-rw-r-- 1 mohitsharma mohitsharma 2803 Jan 15 21:30 kraken2lib.pm
-rwxrwxr-x 1 mohitsharma mohitsharma 40568 Jan 15 21:30 lookup_accession_numbers
-rwxrwxr-x 1 mohitsharma mohitsharma 1780 Jan 15 21:30 lookup_accession_numbers.pl
-rwxrwxr-x 1 mohitsharma mohitsharma 3883 Jan 15 21:30 make_seqid2taxid_map.pl
-rwxrwxr-x 1 mohitsharma mohitsharma 1209 Jan 15 21:30 mask_low_complexity.sh
-rwxrwxr-x 1 mohitsharma mohitsharma 6410 Jan 15 21:30 rsync_from_ncbi.pl
-rwxrwxr-x 1 mohitsharma mohitsharma 1631 Jan 15 21:30 scan_fasta_file.pl
-rwxrwxr-x 1 mohitsharma mohitsharma 1596 Jan 15 21:30 standard_installation.sh

@ch4rr0
Collaborator

ch4rr0 commented Jan 16, 2025

Which gzipped file are you talking about and where this is supposed to
be present and during downloading of database why some are not
downloading? [WARNING - 2025-01-15 12:05:11,107]: Unable to download
genomes/all/GCF/000/740/375/GCF_000740375.1_ASM74037v1/GCF_000740375.1_ASM74037v1_genomic.fna.gz,
will try again

NCBI will sometimes drop connections at random, either when the number of concurrent connections (threads) is high or when the client is trying to retrieve a large amount of data. To make sure all files are downloaded, k2 appends any file that failed to download to the list of files yet to be retrieved and tries again later. The log for a file that failed to download should look like this:

bacteria.log:5299:[INFO - 2024-12-27 14:39:43,846]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/238/275/GCF_002238275.1_ASM223827v1/GCF_002238275.1_ASM223827v1_genomic.fna.gz
bacteria.log:5300:[WARNING - 2024-12-27 14:39:43,876]: Unable to download genomes/all/GCF/002/238/275/GCF_002238275.1_ASM223827v1/GCF_002238275.1_ASM223827v1_genomic.fna.gz, will try again
bacteria.log:99516:[INFO - 2024-12-27 15:06:28,558]: Calculating MD5 sum for /scratch/kraken2/library/bacteria/genomes/all/GCF/002/238/275/GCF_002238275.1_ASM223827v1/GCF_002238275.1_ASM223827v1_genomic.fna.gz
bacteria.log:99517:[INFO - 2024-12-27 15:06:28,563]: MD5 sum of /scratch/kraken2/library/bacteria/genomes/all/GCF/002/238/275/GCF_002238275.1_ASM223827v1/GCF_002238275.1_ASM223827v1_genomic.fna.gz is d41d8cd98f00b204e9800998ecf8427e
bacteria.log:99518:[INFO - 2024-12-27 15:06:28,563]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/238/275/GCF_002238275.1_ASM223827v1/GCF_002238275.1_ASM223827v1_genomic.fna.gz
bacteria.log:99525:[INFO - 2024-12-27 15:06:28,721]: Saved GCF_002238275.1_ASM223827v1_genomic.fna.gz to /scratch/kraken2/library/bacteria/genomes/all/GCF/002/238/275/GCF_002238275.1_ASM223827v1

If that file is not retrieved later, then it is a bug in k2.
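
The retry behavior described above can be pictured as a work queue in which a failed item is pushed back to the end and picked up again later. The following is only a sketch of that idea, not k2's actual implementation. `download_all`, `fetch`, and `max_attempts` are hypothetical names.

```python
from collections import deque


def download_all(paths, fetch, max_attempts=3):
    """Try every path; re-queue failures at the end until max_attempts is reached.

    `fetch` is any callable that raises OSError (for example a dropped
    connection) on failure. Returns (saved, failed) lists of paths.
    """
    queue = deque((path, 1) for path in paths)
    saved, failed = [], []
    while queue:
        path, attempt = queue.popleft()
        try:
            fetch(path)
        except OSError:
            if attempt < max_attempts:
                # "will try again": move the file to the back of the queue.
                queue.append((path, attempt + 1))
            else:
                failed.append(path)
        else:
            saved.append(path)
    return saved, failed
```

With this shape, a file that fails once shows up in the log long after its first attempt, which matches the gap between the warning and the later "Saved" line in the excerpt.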

I hope that makes sense.
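
To check for the case described above, one could scan the log for files that were reported as failed and never reported as saved afterwards. The sketch below assumes only the two message shapes visible in the excerpts in this thread ("Unable to download <path>, will try again" and "Saved <name> to <dir>"). The function name is hypothetical, and other k2 versions may log differently.

```python
import re

WARN_RE = re.compile(r"Unable to download (\S+), will try again")
SAVED_RE = re.compile(r"Saved (\S+) to ")


def unretrieved_files(log_lines):
    """Return remote paths that were warned about and never saved afterwards."""
    pending = {}
    for line in log_lines:
        warn = WARN_RE.search(line)
        if warn:
            remote_path = warn.group(1)
            # Key by file name, since the "Saved" message reports only the name.
            pending[remote_path.rsplit("/", 1)[-1]] = remote_path
            continue
        saved = SAVED_RE.search(line)
        if saved:
            pending.pop(saved.group(1), None)
    return sorted(pending.values())
```

A possible use is `unretrieved_files(open("bacteria.log"))`, where the log file name is taken from the excerpt above. An empty result would mean every failed file was eventually saved.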

@mohitsharma-123
Author

Thanks for the response, but my download was terminated midway.
Please help me complete the database download.

@ch4rr0
Collaborator

ch4rr0 commented Jan 17, 2025

This seems to have happened while downloading the bacteria library. If all the other libraries that make up the standard database downloaded successfully, you can download the bacteria library separately by running:
./k2 download-library --library bacteria --db /data/mohitsharma/kraken2_db --no-masking --threads 96
then run:
k2 build --db /data/mohitsharma/kraken2_db --threads 96

If there are other libraries that still need downloading, try rerunning:
k2 build --standard --no-masking --db /data/mohitsharma/kraken2_db --threads 96

k2 will skip all files that are already downloaded and then fetch the remaining libraries.

I hope this helps.
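
To decide which of the two options applies, it may help to see which libraries already exist on disk. The sketch below walks `<db>/library` and reports, per sub-directory, how many files are present and how many of them are zero bytes. The `library/<name>` layout is inferred from the paths in the logs in this thread, and `library_summary` is a hypothetical helper, not a k2 command.

```python
import os


def library_summary(db_dir):
    """Map each sub-directory of <db_dir>/library to (file_count, empty_file_count)."""
    library_root = os.path.join(db_dir, "library")
    summary = {}
    if not os.path.isdir(library_root):
        return summary
    for name in sorted(os.listdir(library_root)):
        lib_path = os.path.join(library_root, name)
        if not os.path.isdir(lib_path):
            continue  # skip stray files next to the library directories
        total = empty = 0
        for dirpath, _dirnames, filenames in os.walk(lib_path):
            for filename in filenames:
                total += 1
                if os.path.getsize(os.path.join(dirpath, filename)) == 0:
                    empty += 1
        summary[name] = (total, empty)
    return summary


if __name__ == "__main__":
    # Path taken from the commands in this thread.
    for library, (total, empty) in library_summary("/data/mohitsharma/kraken2_db").items():
        print(f"{library}: {total} files, {empty} empty")
```

A library that is missing from the output, or that has many empty files, is a candidate for the separate `download-library` step.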
