Failing during checking assembly? #224

Open
Thomieh73 opened this issue Jun 13, 2024 · 1 comment

Thomieh73 commented Jun 13, 2024

Hi,
I am trying to download viral genomes from GenBank. I have put the command inside a Nextflow script, which runs on the login node of our HPC cluster.

This is the command I use inside the Nextflow script:

ncbi-genome-download --formats fasta --section genbank viral --parallel 4 --flat-output -r 5 -P -o genbank_genomes

What I see is that the tool first checks the assemblies. That step finishes when I run it for the RefSeq genomes, but for GenBank there are far more genomes to check:

Checking assemblies:  23%|██▎       | 43268/187089 [1:34:09<4:41:54,  8.50entries/s](ncbidown33)
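For a sense of scale, the figures in the progress bar above can be turned into a rough time estimate. This is a minimal sketch that uses only the numbers shown (43268 of 187089 entries at 8.50 entries/s). The function name is made up for illustration, and the rate is assumed to stay constant, which it probably will not.

```python
def remaining_seconds(done: int, total: int, rate: float) -> float:
    """Estimate the seconds left, assuming the rate (entries/s) stays constant."""
    return (total - done) / rate


# Numbers copied from the progress bar in this post.
left = remaining_seconds(done=43268, total=187089, rate=8.50)
whole_run = 187089 / 8.50  # seconds for the full check at this rate

print(f"remaining: {left / 3600:.1f} h, full check: {whole_run / 3600:.1f} h")
```

At this rate the checking step alone would need roughly 4.7 more hours, and about 6 hours in total, on the login node.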

There are two things I notice.

  1. The connection to NCBI is failing. This is a problem you already know about and have addressed by making the number of retries configurable. I currently have that set to 5.
  2. The whole job is killed on my login node before the number of retries reaches 5. I notice that the process is restarted several times because it somehow stalls. I get exit code 143, which usually means the process was killed externally. The job stops at different points: sometimes early, sometimes after 99% of the assemblies have been checked.
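Exit code 143 follows the common shell convention of 128 plus the signal number, so it corresponds to signal 15 (SIGTERM). Below is a small sketch of that decoding, using only Python's standard `signal` module. The helper name `describe_exit_code` is hypothetical, and the sketch says nothing about who sent the signal.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    if code > 128:
        # Shells report "terminated by signal N" as 128 + N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return f"exit code {code} (unknown signal {signum})"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"


print(describe_exit_code(143))
```

SIGTERM is a polite termination request, which would fit a login-node policy or watchdog ending long-running processes, but that is an assumption the cluster admins would have to confirm.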

So I wonder what to do here?
I have contacted the admin of the HPC I am using to see if they have an idea.

Would it help to have more parallel processes? Are those used in the checking step? That said, more processes might eat up more CPUs on the login node of our cluster. I cannot use the compute nodes, since they have no internet access.

Or would there be another way of breaking up the checking of assemblies, so that I can make smaller batches that will finish?
I know the cache file that is created contains the FTP location of each genome. By grabbing that I could download all the genomes myself, but then why would I use this tool?
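One way the batching idea could be sketched is to split a list of assembly accessions into chunks and build one command per chunk with the tool's `--assembly-accessions` option, which takes a comma-separated list. This rests on several assumptions: the accession list has to come from somewhere else (for example an NCBI assembly summary file), the accessions below are made-up placeholders, the helper names are hypothetical, and whether a filtered run really shortens the checking step has not been confirmed in this thread. The sketch only builds the commands and does not run them.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_commands(accessions: List[str], batch_size: int,
                   out_dir: str = "genbank_genomes") -> List[List[str]]:
    """Build one ncbi-genome-download argument list per batch of accessions."""
    commands = []
    for batch in chunked(accessions, batch_size):
        commands.append([
            "ncbi-genome-download",
            "--formats", "fasta",
            "--section", "genbank",
            "--assembly-accessions", ",".join(batch),
            "--flat-output",
            "-r", "5",
            "-o", out_dir,
            "viral",
        ])
    return commands


# Placeholder accessions for illustration only, not real assemblies.
example = [f"GCA_00000000{i}.1" for i in range(1, 6)]
commands = build_commands(example, batch_size=2)
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could then be passed to `subprocess.run` one at a time, so that a batch that gets killed can be retried without repeating the others.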

Any suggestions you might have are welcome.

Owner

kblin commented Jul 3, 2024

Phew, good question, I've never had things fail this way. The NCBI connection falling over happens more than I'd like, but the script just getting stuck I don't think I've seen before.

Would it make sense to try and run the download on a "not the HPC machine" and then copy the files over locally?
