
snakemake hangs when it doesn't recognize SLURM-killed jobs (timeout or too much memory) #5

Open
rusalkaguy opened this issue May 8, 2020 · 6 comments

rusalkaguy commented May 8, 2020

See Slack comments for at least 2 solutions: https://uab-rc.slack.com/archives/CL8LHFFD0/p1576105967058800?thread_ts=1576105319.058600

In general, we should move this wrapper to use profiles. As a stopgap, we should add --cluster-status with an appropriate slurm-status.py script.
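
For context, the --cluster-status contract is simple: snakemake calls the script with the external jobid as its argument(s) and expects exactly one of running, success, or failed on stdout. Below is a minimal sketch in the spirit of the Snakemake-Profiles/slurm script, using sacct; the sacct columns and the list of states handled are assumptions, not copied from the final script in this repo.

```python
#!/usr/bin/env python3
"""Minimal --cluster-status sketch: map a SLURM job's state to running/success/failed."""
import subprocess
import sys

jobid = sys.argv[-1]  # take the last argument; see the argv[4] note in the commits below

# Ask sacct for the job's state (e.g. RUNNING, COMPLETED, TIMEOUT, OUT_OF_MEMORY, FAILED).
sacct = subprocess.run(
    ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
    capture_output=True, text=True,
)
state = sacct.stdout.split("\n")[0].strip()

if state == "COMPLETED":
    print("success")
elif state in ("", "PENDING", "RUNNING", "COMPLETING", "CONFIGURING", "SUSPENDED"):
    print("running")  # no accounting record yet, or still going
else:
    # TIMEOUT, OUT_OF_MEMORY, FAILED, CANCELLED, NODE_FAIL, ... -> report as dead
    print("failed")
```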

@rusalkaguy rusalkaguy self-assigned this May 8, 2020
rusalkaguy added a commit that referenced this issue May 12, 2020
…iles/slurm

use --cluster-status slurm-status.py to recognize slurm-killed jobs as dead
rusalkaguy added a commit that referenced this issue May 12, 2020
… allocations

add --cluster-status $0/slurm-status.py

Modified slurm-status.py to use argv[4] instead of argv[1]
as they had in https://github.com/Snakemake-Profiles/slurm

Add slurm-status.py to the install script, and have snakemakeslurm check for it.
Issue a warning if it's missing.
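
My best reading of the argv[4] change above: the external jobid recorded here is the full sbatch output string ("Submitted batch job 4771502", as seen in the session log below), so when snakemake hands it to the status command it splits into several arguments and the numeric id lands at argv[4]. A tiny, hypothetical illustration:

```python
import sys

# If the recorded external jobid is the raw sbatch output, the status command runs as
#   slurm-status.py Submitted batch job 4771502
# so sys.argv == ['slurm-status.py', 'Submitted', 'batch', 'job', '4771502']
# and the numeric id sits at argv[4] rather than argv[1].
jobid = sys.argv[4]

# A more forgiving variant: take whichever argument is all digits.
jobid = next(arg for arg in sys.argv[1:] if arg.isdigit())
```
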
rusalkaguy added a commit that referenced this issue May 12, 2020
Also load dvctools first, to avoid GCC conflicts caused when it's loaded second.

rusalkaguy commented May 12, 2020

partial fix

My test cases use the RNA-Seq pipeline and edit the allowable time or memory for “align” down to 1 min or 1M, which induces death-by-SLURM rather quickly. However, in the 1 min case, this leaves behind a 0-length BAM file.
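
For reproducibility, this is roughly the kind of edit meant, assuming a --cluster-config style YAML with per-rule resource keys; the key names time and mem are illustrative, not taken from this repo's actual cluster config.

```yaml
# cluster.yaml (illustrative): shrink the align rule's limits so SLURM kills it quickly
__default__:
  time: "02:00:00"
  mem: "8G"
align:
  time: "00:01:00"   # 1 min -> killed for exceeding the time limit
  mem: "1M"          # or starve memory -> killed for exceeding the memory limit
```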

good

  • The current implementation (--cluster-status slurm-status.py) allows snakemake to SEE the failure, throw an error, and stop the workflow.

problem

  • The outputs of the failed rule are not deleted!
    • Test case: set a 1 min time limit for STAR. A 0-length BAM file is produced, which snakemake later interprets as successful completion.
  • The error doesn't tell you that SLURM killed the job, nor whether it was for memory or time.
    • Perhaps we could add some print() calls to stderr in slurm-status.py that would explain? (See the sketch after this list.)
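
A sketch of the stderr idea from the last bullet, assuming the script already holds the sacct state string; the state names are standard SLURM, but the function name and wording are made up here. Stderr should pass through to the snakemake console, while stdout stays reserved for the running/success/failed answer.

```python
import sys

def explain_kill(jobid: str, state: str) -> None:
    """Write a human-readable reason to stderr; stdout is reserved for
    the running/success/failed answer that snakemake parses."""
    if state.startswith("TIMEOUT"):
        print(f"slurm-status.py: job {jobid} killed by SLURM: time limit exceeded",
              file=sys.stderr)
    elif state.startswith("OUT_OF_MEMORY"):
        print(f"slurm-status.py: job {jobid} killed by SLURM: out of memory",
              file=sys.stderr)
    elif state.startswith("CANCELLED"):
        print(f"slurm-status.py: job {jobid} was cancelled", file=sys.stderr)
```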

sample session, after fix

[last: 17s][~/scratch/zajac/rna.sm591]$ snakemakeslurm --use-conda star/DN_M30-r1/Aligned.out.bam
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 999
Job counts:
        count   jobs
        1       align
        1

[Tue May 12 14:38:27 2020]
rule align:
    input: trimmed/DN_M30-r1.1.fastq.gz, trimmed/DN_M30-r1.2.fastq.gz
    output: star/DN_M30-r1/Aligned.out.bam, star/DN_M30-r1/ReadsPerGene.out.tab
    log: logs/star/DN_M30-r1.log
    jobid: 0
    wildcards: sample=DN_M30, unit=r1
    threads: 8

Submitted job 0 with external jobid 'Submitted batch job 4771502'.
[Tue May 12 14:39:38 2020]
Error in rule align:
    jobid: 0
    output: star/DN_M30-r1/Aligned.out.bam, star/DN_M30-r1/ReadsPerGene.out.tab
    log: logs/star/DN_M30-r1.log (check log file(s) for error message)
    conda-env: /scratch/curtish/zajac/rna.sm591/.snakemake/conda/8a33389f
    cluster_jobid: Submitted batch job 4771502

Error executing rule align on cluster (jobid: 0, external: Submitted batch job 4771502, jobscript: /scratch/curtish/zajac/rna.sm591/.snakemake/tmp.ot4x6adn/snakejob.align.0.sh). For error details see the cluster log and the log files of the involved rule(s).
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /scratch/curtish/zajac/rna.sm591/.snakemake/log/2020-05-12T143826.461041.snakemake.log


ManavalanG commented May 13, 2020

@rusalkaguy As you figured out, the absence of output-file cleanup after timeout or SIGKILL errors is a major issue. So far I have resorted to manual cleanup after identifying the cause of job failure in these special cases. Not surprisingly, this is a not-so-great solution, as I sometimes forget the manual cleanup step. I still use the slurm-status.py script, as a frozen pipeline would drive me mad :)

@rusalkaguy
Owner Author

@ManavalanG Shocking. Thanks for confirming. I haven't found much discussion of this, or an open snakemake bug... have you? Should we file one? It seems like a serious oversight, and likely an easy fix, since the output-cleanup machinery already exists.


ManavalanG commented May 13, 2020

From my memory of the last time I looked into this issue (a few months ago), it has already been reported and is currently thought to be a rather difficult problem. Here is why:

When a snakemake pipeline (let's call it the master) is run on the cluster, each of the jobs (let's call them children) that the master submits to the cluster is itself a snakemake job. So, when a child job fails with a typical non-zero exit code, the child removes its output files and then exits. That is, the child job, not the master, is responsible for cleaning up a failed job's output files.

However, when a cluster job fails due to insufficient memory or a timeout, cheaha kills the child job, and therefore the child job's snakemake never has the opportunity to clean up the failed output files. slurm-status.py helps identify the child job's failure in these atypical cases but doesn't do much beyond that.
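
To make the master/child picture concrete: the jobscript in the log above (snakejob.align.0.sh) is the child, and it is essentially a re-invocation of snakemake for that single rule. A rough, illustrative sketch (flags and paths are placeholders, not copied from a real tmp directory):

```bash
#!/bin/sh
# properties = {"rule": "align", "jobid": 0, "output": ["star/DN_M30-r1/Aligned.out.bam"]}
cd /scratch/curtish/zajac/rna.sm591 && \
snakemake star/DN_M30-r1/Aligned.out.bam \
    --snakefile Snakefile --force --cores 8 \
    --allowed-rules align --nolock --use-conda
# If this inner snakemake exits non-zero in the normal way, it deletes the rule's
# outputs itself. If SLURM SIGKILLs it (timeout / out of memory), it never gets
# the chance, which is why the 0-length BAM is left behind for the master to find.
```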

Let me know if this doesn't make sense.

@ManavalanG

Figured I would share some relevant stuff:

  • Snakemake is deprecating --cluster in favor of profiles. If I remember correctly, they added this note in v5.10.x; I'm not sure why, though. Also, the snakemake docs don't have much info on how to completely port a --cluster setup to profiles; the best I have seen so far is from a blog post. (See the sketch after this list.)
  • Looks like the slurm profile for snakemake, which I forked from, has seen several improvements since I set up mine. At a quick glance, it looks like they have since fixed their job-submit scripts to play better with --cluster-config. For the time being, I'm sticking with snakemake v5.9.1 and the --cluster setup, but I plan to revisit this in a few months once they have matured.
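
For whenever we do move to profiles: a profile is a directory containing a config.yaml whose keys are snakemake's long option names, so porting the current --cluster setup might look roughly like the untested sketch below (the sbatch string and paths are placeholders).

```yaml
# ~/.config/snakemake/slurm/config.yaml (illustrative)
jobs: 999
use-conda: true
latency-wait: 60
cluster-config: "cluster.yaml"
cluster: "sbatch --time={cluster.time} --mem={cluster.mem} --cpus-per-task={threads}"
cluster-status: "slurm-status.py"
```

It would then be invoked as snakemake --profile slurm <target>; each key is just the long form of an option we already pass on the command line today.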

@rusalkaguy
Owner Author

@ManavalanG thanks. I did find the upstream by following your fork. We decided to defer moving to profiles for the moment, and also to stay on 5.9.1, as we have limited time just now for investigations.
