Hi,
I am trying to download viral genomes from GenBank. I have put the commands inside a Nextflow script, which runs on the login node of our HPC cluster.
The command I use inside the Nextflow script is this:
What I see is that your tool first checks the assemblies. That step finishes when I run it for the RefSeq genomes, but for GenBank there are many more genomes to check, and I notice two things.
The connection to NCBI fails. This is a problem you already know about and have addressed by making the number of retries configurable. I currently have that set to 5.
The whole job is killed on my login node before the number of retries reaches 5. I notice that the process has been restarted several times because it somehow stalls. I get exit code 143, which usually means the process was killed externally. The job stops at different points: it can be early, but it can also be after 99% of the assemblies have been checked.
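For reference, exit code 143 follows the common shell convention of 128 plus the signal number, so it corresponds to signal 15 (SIGTERM). That is consistent with something outside the tool, for example a login-node watchdog or a scheduler, asking the process to stop. A minimal Python sketch of that decoding, where `describe_exit_code` is a hypothetical helper name and not part of any tool mentioned here:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a readable description.

    Assumes the common convention that codes above 128 mean
    "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Signal number not known on this platform
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


print(describe_exit_code(143))  # prints "killed by SIGTERM"
```

This only identifies which signal was sent; it cannot tell who sent it, so the cluster admins would still need to confirm whether a resource or runtime limit on the login node is responsible.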
So I wonder what to do here?
I have contacted the admin of the HPC I am using to see if they have an idea.
Would it help to have more parallel processes? Are those used in the checking step? I might eat up more CPUs on the login node of our cluster, though, and I cannot use the compute nodes, since they have no access to the internet.
Or is there another way to break up the checking of assemblies, so that I can run smaller batches that will finish?
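To illustrate the batching idea, here is a minimal Python sketch that splits a list of assembly accessions into fixed-size chunks, so that each chunk could be run as a separate, shorter job. It assumes the tool can be restricted to an explicit list of accessions, which should be checked against its documentation. The function name `batched` and the accession strings are hypothetical placeholders, not real records.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical placeholder accessions, for illustration only
accessions = ["ACC_0001", "ACC_0002", "ACC_0003", "ACC_0004", "ACC_0005"]

for index, batch in enumerate(batched(accessions, 2)):
    # Each line could become the accession list of one smaller run
    print(index, ",".join(batch))
```

With smaller batches, a run that gets killed only loses the work for that batch, and the remaining batches can be retried independently.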
I know the cache file that is created contains the FTP location of each genome. By grabbing that I could download all the genomes myself, but then why would I use this tool?
Any suggestions you might have are welcome.
Phew, good question, I've never had things fail this way. The NCBI connection falling over happens more than I'd like, but the script just getting stuck I don't think I've seen before.
Would it make sense to try and run the download on a "not the HPC machine" and then copy the files over locally?