
snakemake hangs when it doesn't recognize SLURM-killed jobs (timeout or too much memory) #5

Open
rusalkaguy opened this issue May 8, 2020 · 6 comments

rusalkaguy commented May 8, 2020

See Slack comments for at least 2 solutions: https://uab-rc.slack.com/archives/CL8LHFFD0/p1576105967058800?thread_ts=1576105319.058600

In general, we should move this wrapper to use profiles. As a stopgap, we should add --cluster-status with an appropriate slurm-status.py script.
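
For context, the --cluster-status contract is simple: snakemake calls the script with the external jobid as its argument(s) and expects exactly one of running, success, or failed on stdout. Below is a minimal sketch in the spirit of the Snakemake-Profiles/slurm script, using sacct; the sacct columns and the list of states handled are assumptions, not copied from the final script in this repo.

```python
#!/usr/bin/env python3
"""Minimal --cluster-status sketch: map a SLURM job's state to running/success/failed."""
import subprocess
import sys

jobid = sys.argv[-1]  # take the last argument; see the argv[4] note in the commits below

# Ask sacct for the job's state (e.g. RUNNING, COMPLETED, TIMEOUT, OUT_OF_MEMORY, FAILED).
sacct = subprocess.run(
    ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
    capture_output=True, text=True,
)
state = sacct.stdout.split("\n")[0].strip()

if state == "COMPLETED":
    print("success")
elif state in ("", "PENDING", "RUNNING", "COMPLETING", "CONFIGURING", "SUSPENDED"):
    print("running")  # no accounting record yet, or still going
else:
    # TIMEOUT, OUT_OF_MEMORY, FAILED, CANCELLED, NODE_FAIL, ... -> report as dead
    print("failed")
```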

@rusalkaguy rusalkaguy self-assigned this May 8, 2020
rusalkaguy added a commit that referenced this issue May 12, 2020
…iles/slurm

use --cluster-status slurm-status.py to recognize slurm-killed jobs as dead
rusalkaguy added a commit that referenced this issue May 12, 2020
… allocations

add --cluster-status $0/slurm-status.py

Modified slurm-status.py to use argv[4] instead of argv[1]
as they had in https://github.com/Snakemake-Profiles/slurm

Add slurm-status.py to the install script, and have snakemakeslurm check for it.
Issue a warning if it's missing.
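
My best reading of the argv[4] change above: the external jobid recorded here is the full sbatch output string ("Submitted batch job 4771502", as seen in the session log below), so when snakemake hands it to the status command it splits into several arguments and the numeric id lands at argv[4]. A tiny, hypothetical illustration:

```python
import sys

# If the recorded external jobid is the raw sbatch output, the status command runs as
#   slurm-status.py Submitted batch job 4771502
# so sys.argv == ['slurm-status.py', 'Submitted', 'batch', 'job', '4771502']
# and the numeric id sits at argv[4] rather than argv[1].
jobid = sys.argv[4]

# A more forgiving variant: take whichever argument is all digits.
jobid = next(arg for arg in sys.argv[1:] if arg.isdigit())
```
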
rusalkaguy added a commit that referenced this issue May 12, 2020
Also load dvctools first, to avoid GCC conflicts caused when it's loaded second.

rusalkaguy commented May 12, 2020

partial fix

My test cases use the RNA-Seq pipeline and edit the allowable time or memory for “align” down to 1 min or 1M, which induces death-by-SLURM rather quickly. However, in the 1 min case, this leaves behind a 0-length BAM file.
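
For reproducibility, this is roughly the kind of edit meant, assuming a --cluster-config style YAML with per-rule resource keys; the key names time and mem are illustrative, not taken from this repo's actual cluster config.

```yaml
# cluster.yaml (illustrative): shrink the align rule's limits so SLURM kills it quickly
__default__:
  time: "02:00:00"
  mem: "8G"
align:
  time: "00:01:00"   # 1 min -> killed for exceeding the time limit
  mem: "1M"          # or starve memory -> killed for exceeding the memory limit
```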

good

  • The current implementation (--cluster-status slurm-status.py) allows snakemake to SEE the failure, throw an error, and stop the workflow.

problem

  • The outputs of the failed rule are not deleted!
    • Test case: set a 1 min time limit for STAR. A 0-length BAM file is produced, which snakemake later interprets as successful completion.
  • The error doesn't tell you that SLURM killed the job, nor whether it was for memory or time.
    • Perhaps we could add some print() calls to stderr in slurm-status.py that would explain? (See the sketch after this list.)
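
A sketch of the stderr idea from the last bullet, assuming the script already holds the sacct state string; the state names are standard SLURM, but the function name and wording are made up here. Stderr should pass through to the snakemake console, while stdout stays reserved for the running/success/failed answer.

```python
import sys

def explain_kill(jobid: str, state: str) -> None:
    """Write a human-readable reason to stderr; stdout is reserved for
    the running/success/failed answer that snakemake parses."""
    if state.startswith("TIMEOUT"):
        print(f"slurm-status.py: job {jobid} killed by SLURM: time limit exceeded",
              file=sys.stderr)
    elif state.startswith("OUT_OF_MEMORY"):
        print(f"slurm-status.py: job {jobid} killed by SLURM: out of memory",
              file=sys.stderr)
    elif state.startswith("CANCELLED"):
        print(f"slurm-status.py: job {jobid} was cancelled", file=sys.stderr)
```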

sample session, after fix

[last: 17s][~/scratch/zajac/rna.sm591]$ snakemakeslurm --use-conda star/DN_M30-r1/Aligned.out.bam
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 999
Job counts:
        count   jobs
        1       align
        1

[Tue May 12 14:38:27 2020]
rule align:
    input: trimmed/DN_M30-r1.1.fastq.gz, trimmed/DN_M30-r1.2.fastq.gz
    output: star/DN_M30-r1/Aligned.out.bam, star/DN_M30-r1/ReadsPerGene.out.tab
    log: logs/star/DN_M30-r1.log
    jobid: 0
    wildcards: sample=DN_M30, unit=r1
    threads: 8

Submitted job 0 with external jobid 'Submitted batch job 4771502'.
[Tue May 12 14:39:38 2020]
Error in rule align:
    jobid: 0
    output: star/DN_M30-r1/Aligned.out.bam, star/DN_M30-r1/ReadsPerGene.out.tab
    log: logs/star/DN_M30-r1.log (check log file(s) for error message)
    conda-env: /scratch/curtish/zajac/rna.sm591/.snakemake/conda/8a33389f
    cluster_jobid: Submitted batch job 4771502

Error executing rule align on cluster (jobid: 0, external: Submitted batch job 4771502, jobscript: /scratch/curtish/zajac/rna.sm591/.snakemake/tmp.ot4x6adn/snakejob.align.0.sh). For error details see the cluster log and the log files of the involved rule(s).
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /scratch/curtish/zajac/rna.sm591/.snakemake/log/2020-05-12T143826.461041.snakemake.log


ManavalanG commented May 13, 2020

@rusalkaguy As you figured out, the absence of output-file cleanup after timeout or SIGKILL errors is a major issue. So far I have resorted to manual cleanup after identifying the cause of job failure in these special cases. Not surprisingly, this is a not-so-great solution, as I sometimes forget the manual cleanup step. I still use the slurm-status.py script, as a frozen pipeline would drive me mad :)

@rusalkaguy
Owner Author

@ManavalanG Shocking. Thanks for confirming. I haven't found much discussion of this, or an open snakemake bug... have you? Should we file one? It seems like a serious oversight, and likely an easy fix, since the output-cleanup machinery already exists.


ManavalanG commented May 13, 2020

From my memory of the last time I looked into this issue (a few months ago), it has already been reported and is currently thought to be a rather difficult problem. Here is why:

When a snakemake pipeline (let's call it the master) is run on the cluster, each of the jobs (let's call them children) that the master submits to the cluster is itself a snakemake job. So, when a child job fails with a typical non-zero exit code, the child removes its output files and then exits. That is, the child job, not the master, is responsible for cleaning up a failed job's output files.

However, when a cluster job fails due to insufficient memory or a timeout, cheaha kills the child job, and therefore the child job's snakemake never has the opportunity to clean up the failed output files. slurm-status.py helps identify the child job's failure in these atypical cases but doesn't do much beyond that.
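
To make the master/child picture concrete: the jobscript in the log above (snakejob.align.0.sh) is the child, and it is essentially a re-invocation of snakemake for that single rule. A rough, illustrative sketch (flags and paths are placeholders, not copied from a real tmp directory):

```bash
#!/bin/sh
# properties = {"rule": "align", "jobid": 0, "output": ["star/DN_M30-r1/Aligned.out.bam"]}
cd /scratch/curtish/zajac/rna.sm591 && \
snakemake star/DN_M30-r1/Aligned.out.bam \
    --snakefile Snakefile --force --cores 8 \
    --allowed-rules align --nolock --use-conda
# If this inner snakemake exits non-zero in the normal way, it deletes the rule's
# outputs itself. If SLURM SIGKILLs it (timeout / out of memory), it never gets
# the chance, which is why the 0-length BAM is left behind for the master to find.
```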

Let me know if this doesn't make sense.

@ManavalanG

Figured I would share some relevant stuff:

  • Snakemake is deprecating --cluster in favor of profiles. If I remember correctly, they added this note in v5.10.x; I'm not sure why, though. Also, the snakemake docs don't have much info on how to completely port a --cluster setup to profiles; the best I have seen so far is from a blog post. (See the sketch after this list.)
  • Looks like the slurm profile for snakemake, which I forked from, has seen several improvements since I set up mine. At a quick glance, it looks like they have since fixed their job-submit scripts to play better with --cluster-config. For the time being, I'm sticking with snakemake v5.9.1 and the --cluster setup, but I plan to revisit this in a few months once they have matured.
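
For whenever we do move to profiles: a profile is a directory containing a config.yaml whose keys are snakemake's long option names, so porting the current --cluster setup might look roughly like the untested sketch below (the sbatch string and paths are placeholders).

```yaml
# ~/.config/snakemake/slurm/config.yaml (illustrative)
jobs: 999
use-conda: true
latency-wait: 60
cluster-config: "cluster.yaml"
cluster: "sbatch --time={cluster.time} --mem={cluster.mem} --cpus-per-task={threads}"
cluster-status: "slurm-status.py"
```

It would then be invoked as snakemake --profile slurm <target>; each key is just the long form of an option we already pass on the command line today.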

@rusalkaguy
Owner Author

@ManavalanG thanks. I did find the upstream by following your fork. We decided to defer moving to profiles for the moment, and also to stay on 5.9.1, as we have limited time just now for investigations.
