enaGroupGet doesn't throw an exception when it fails to download one accession #42

Open
tobsecret opened this issue Oct 23, 2018 · 11 comments
Comments

@tobsecret
tobsecret commented Oct 23, 2018

I have been running into trouble in a Nextflow pipeline I wrote. After some investigation, I found out that for some of the samples only one of the fastq files was downloaded, but the process still finished without an error. The relevant part of my bash script is below:

enaDataGet -f fastq $accession || enaGroupGet -f fastq $accession

I am using both because some accessions have only one sub-accession and some have plenty.

@nicsilvester
Contributor

Sample and project accessions will work with enaGroupGet; run and experiment accessions will work with enaDataGet.

As for the FASTQ file issue: are you using Python 2 or Python 3? If the former, please try switching to Python 3. There is a known issue with Python 2 that can intermittently cause this problem.

@nicsilvester
Contributor

I will, however, look into the error catching and throwing. It might be bypassing something.

@tobsecret
Author

I have some more info: I got a notification from the cluster admin that I am out of disk space. So maybe the tool tries to write the file but cannot, because there is no disk space available.

@tobsecret
Author

Also I am using Python 3. Am pretty sure now that it's related to availability of space on the cluster. We have a SLURM cluster and I am pretty sure what is happening is that enaGroupGet thinks it's writing to file, when the cluster just silently denies any actual writing. Also the dual call to enaDataGet and enaGroupGet is because I don't know in advance if an accession features a single run or whether it is an experiment accession that features multiple runs.

@nicsilvester
Contributor

You don't need enaGroupGet for an experiment accession. If you tried it, it wouldn't work. An experiment is still classed as one object and therefore will get all of the runs associated with it. The group option is for samples and projects that can have multiple experiments.

Thanks for the update on the problem. I will see if there is a way to check for disk space, or for the type of error that comes back when the disk is full.
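One way such a check could look is sketched below. This is an illustration only, not code from enaBrowserTools: the function names `has_free_space` and `written_size_matches` are hypothetical. It assumes the expected file size is known in advance, and that a silently refused write shows up as a missing or short file.

```python
import os
import shutil


def has_free_space(directory, needed_bytes):
    """Return True if the filesystem holding `directory` reports at least
    `needed_bytes` free. (Hypothetical helper, not part of enaBrowserTools.)"""
    return shutil.disk_usage(directory).free >= needed_bytes


def written_size_matches(path, expected_bytes):
    """Return True if `path` exists and is exactly `expected_bytes` long.
    A write that was silently refused would leave a missing or short file."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes
```

Note that `shutil.disk_usage` reports filesystem-level free space and may not reflect a per-user quota on a shared cluster, so the size check after the write is probably the more reliable of the two.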

@tobsecret
Author

Yes, that's exactly why I am using both! Consider the following two accessions:
ERS010614
ERS347598

@nicsilvester
Contributor

Both of those are sample accessions and therefore can only be used with enaGroupGet regardless of how many runs or experiments they contain.

enaDataGet is for experiment (ERX, DRX, SRX) and run (ERR, DRR, SRR) accessions

enaGroupGet is for sample (ERS, DRS, SRS, SAME, SAMN, SAMD) and project (ERP, DRP, SRP, PRJE, PRJD, PRJN) accessions

enaDataGet knows what data type you want to download based on the accession.

enaGroupGet needs you to tell it which type of data (read data, sequences, assemblies) should be downloaded for that sample/project.
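For a pipeline that does not know the accession type in advance, the prefix rules above can be turned into a small dispatcher instead of chaining the two tools with `||`. The following is a minimal sketch: `pick_tool` is a hypothetical helper, and it only covers the prefixes named in this thread.

```python
# Prefixes as listed in the comment above; anything else is rejected.
DATA_PREFIXES = ("ERX", "DRX", "SRX", "ERR", "DRR", "SRR")
GROUP_PREFIXES = (
    "ERS", "DRS", "SRS", "SAME", "SAMN", "SAMD",
    "ERP", "DRP", "SRP", "PRJE", "PRJD", "PRJN",
)


def pick_tool(accession):
    """Return the name of the tool to call for `accession`.
    (Hypothetical helper, not part of enaBrowserTools.)"""
    acc = accession.strip().upper()
    if acc.startswith(DATA_PREFIXES):
        return "enaDataGet"
    if acc.startswith(GROUP_PREFIXES):
        return "enaGroupGet"
    raise ValueError("Unrecognised accession prefix: {!r}".format(accession))
```

With this, `pick_tool("ERS010614")` selects enaGroupGet and `pick_tool("SRR515218")` selects enaDataGet, so a failure of the chosen tool is no longer masked by a fallback call to the other one.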

@tobsecret
Author

Thanks for the explanation! For some reason I thought I had gotten an error previously from enaGroupGet for the first sample in the provided list but now that I tested it again, it works just fine.

@tobsecret
Author

I am still having issues with this. In my hands, enaGroupGet will sometimes not download both .fastq.gz files from a paired-end run. Since some accessions, e.g. ERS010304, are legitimately represented by a single read file, I cannot tell a failed download apart from a single-file accession, so it's impossible to accommodate both cases. Is there a dry-run option? With that I could do the error checking myself. Or does enaGroupGet do any checksum testing? I wouldn't mind fronting a pull request for that, but as someone unfamiliar with the code, this would ofc take a hot minute.

@nicsilvester
Contributor

Yes, it does checksum checking. I'll look into whether any more checking can be added that might help. If a file fails to download, it should inform you of that; I'll look into this further too. I'll also consider what use a dry run would be, though download problems would usually be in the part that can't be dry run. Unfortunately I haven't had time to look at this code for a few months. I should be able to pick it up again in Feb.
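Until the tool reports failures itself, a pipeline could run its own check after the download step. The sketch below is an assumption-laden illustration, not the tool's internal md5 check: `md5_of` and `missing_or_corrupt` are hypothetical helpers, and the mapping of file paths to expected MD5 sums has to come from somewhere outside this snippet.

```python
import hashlib
import os


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_or_corrupt(expected):
    """Given {path: expected_md5_hex}, return the paths that are missing
    or whose checksum does not match. (Hypothetical helper.)"""
    bad = []
    for path, expected_md5 in expected.items():
        if not os.path.isfile(path) or md5_of(path) != expected_md5.lower():
            bad.append(path)
    return bad
```

A wrapper script could exit non-zero whenever `missing_or_corrupt` returns a non-empty list, which would make the workflow manager mark the task as failed.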

@SichongP
SichongP commented Apr 8, 2021

This seems like a related problem, so I'll add to this thread. Sometimes it only downloads one of the 2 files in a run but still exits with an exit code of 0. See the message below from an enaDataGet run:

Error with FTP transfer: <urlopen error ftp error: URLError("ftp error: error_perm('500 OOPS: socket')")>
Error with FTP transfer: <urlopen error retrieval incomplete: got only 239906688 out of 20600901726 bytes>
Checking availability of https://www.ebi.ac.uk/ena/browser/api/xml/SRR515218
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/SRR515/SRR515218/SRR515218_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/SRR515/SRR515218/SRR515218_1.fastq.gz
Failed to download file after two attempts
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/SRR515/SRR515218/SRR515218_2.fastq.gz
Completed

It seems that this is because the down_file() function does not return any error code.

I think a potentially better solution is to have down_file() return its success flag and in download_files(), abort and return an error if any download_file() call returns False -- it makes little sense to keep downloading if one read has already failed.
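The proposal could look roughly like the sketch below. The function names echo the ones mentioned in this comment, but the bodies are hypothetical and are not the actual enaBrowserTools code. The per-file downloader is passed in as a callable that is assumed to return True on success and False on failure.

```python
import sys


def download_files(urls, download_file):
    """Download each URL in order; stop at the first failure.
    `download_file` is any callable returning True on success.
    (Sketch only; not the real enaBrowserTools implementation.)"""
    for url in urls:
        if not download_file(url):
            print("Failed to download {}; aborting".format(url), file=sys.stderr)
            return False
    return True


def exit_code_for(urls, download_file):
    """Map the overall result to a process exit code: 0 on success, 1 otherwise."""
    return 0 if download_files(urls, download_file) else 1
```

A command-line entry point would then end with something like `sys.exit(exit_code_for(urls, download_file))`, so a failed read causes a non-zero exit status that a shell `||` or a workflow manager can detect.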

4 participants