snakemake hangs when it doesn't recognize SLURM-killed jobs (timeout or too much memory) #5
Comments
…iles/slurm use --cluster-status slurm-status.py to recognize slurm-killed jobs as dead
… allocations

- Add --cluster-status $0/slurm-status.py
- Modify slurm-status.py to use argv[4] instead of argv[1] as they had in https://github.com/Snakemake-Profiles/slurm
- Add slurm-status.py to the install script, and have snakemakeslurm check for it; issue a warning if it's missing.
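For anyone landing here, a minimal sketch of the kind of status script --cluster-status expects (this is an illustration, not the exact slurm-status.py committed above; the sacct query and the argv index are assumptions that depend on how snakemakeslurm hands the job id to the script):

```python
#!/usr/bin/env python3
"""Minimal sketch of a --cluster-status script for SLURM (illustrative only).

Snakemake calls the status script with the cluster job id and expects exactly
one of "success", "running", or "failed" on stdout.
"""
import subprocess
import sys

# Assumption: the job id arrives as argv[1]; switch to argv[4] if the
# submission wrapper passes extra fields ahead of it, as described above.
jobid = sys.argv[1]

# Ask SLURM's accounting database for the job's state.
state = subprocess.run(
    ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
    capture_output=True, text=True, check=False,
).stdout.split("\n")[0].strip()

# Map SLURM states onto the three answers snakemake understands.
running_states = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"}
if state.startswith("COMPLETED"):
    print("success")
elif any(state.startswith(s) for s in running_states):
    print("running")
else:
    # TIMEOUT, OUT_OF_MEMORY, CANCELLED, FAILED, NODE_FAIL, ... all count as dead.
    print("failed")
```

Mapping TIMEOUT and OUT_OF_MEMORY to "failed" is what keeps the master snakemake from waiting forever on a job SLURM has already killed.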
partial fix

My test cases are to use the RNA-Seq pipeline and to edit the allowable time or memory for "align" to be 1 min or 1M, which induces death-by-slurm rather quickly (a sketch of that tweak follows below). However, in the 1 min case, this leaves a 0-length bam file.

good
problem
sample session, after fix
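For reference, the throw-away resource tweak mentioned above might look roughly like this (a sketch only; the rule body and the resource keys are hypothetical and depend on how the wrapper maps resources onto sbatch):

```python
# Hypothetical, trimmed-down "align" rule used only to provoke SLURM kills quickly.
rule align:
    input:
        "fastq/{sample}.fq.gz"
    output:
        "bam/{sample}.bam"
    resources:
        mem_mb=1,      # 1 MB: SLURM kills the job with OUT_OF_MEMORY almost immediately
        time_min=1,    # 1 minute: SLURM kills the job with TIMEOUT, leaving a 0-length bam
    shell:
        "bwa mem ref.fa {input} | samtools sort -o {output} -"
```

Without a working status script, the master never learns that SLURM killed this job, so it waits indefinitely and the truncated bam stays on disk.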
@rusalkaguy As you figured out, the absence of output-file cleanup after timeout or sigkill errors is a major issue. So far I have resorted to manual cleanup after identifying the cause of job failure in the above special cases. Not surprisingly, this is a rather not-so-great solution, as I sometimes forget the manual cleanup step. I still use
@ManavalanG Shocking. Thanks for confirming. I haven't found much discussion of this, or an open snakemake bug... have you? Should we file one? It seems like a serious oversight, and likely an easy fix, since the output-cleanup machinery already exists.
From my memory of the last time I looked into this issue (a few months ago), it has already been reported and is currently thought to be a rather difficult problem. Here is why: when a snakemake pipeline (let's call it the master) is run on the cluster, each of the jobs (let's call them children) that the master submits is itself a snakemake invocation. When a child job fails with a typical non-zero exit code, the child removes its output files and then exits. That is, the child job, not the master, is responsible for cleaning up failed output files. However, when a cluster job fails due to insufficient memory or a timeout, Cheaha kills the child job, so the child's snakemake never gets the opportunity to clean up its output files. Let me know if this doesn't make sense.
Figured I'd share some relevant stuff:
@ManavalanG thanks. Did find the upstream by following your fork. We decided to defer moving to profiles for the moment, and also to stay on 5.9.1, as we have limited time just now for investigations.
See Slack comments for at least 2 solutions: https://uab-rc.slack.com/archives/CL8LHFFD0/p1576105967058800?thread_ts=1576105319.058600
In general, we should move this wrapper to use profiles. As a stopgap, we should add --cluster-status with an appropriate slurm-status.py script.
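Concretely, the stopgap invocation inside the wrapper would look roughly like this (a sketch; the sbatch options, partition name, and path handling are placeholders, not the wrapper's actual flags):

```bash
#!/bin/bash
# Hypothetical stopgap: submit through sbatch and poll job state via
# slurm-status.py so that jobs SLURM kills (TIMEOUT, OUT_OF_MEMORY) are
# reported back to the master as failed instead of hanging it.
snakemake \
    --jobs 100 \
    --cluster "sbatch --partition=short --time=02:00:00 --mem=8G" \
    --cluster-status "$(dirname "$0")/slurm-status.py" \
    "$@"
```

Longer term, the Snakemake-Profiles/slurm profile linked above bundles an equivalent status script, which is what "move this wrapper to use profiles" would give us for free.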