SLURM resources

Here you can find a minimal set of resources to use SLURM job scheduler on an HPC cluster.

What is SLURM? See this quickstart: https://slurm.schedmd.com/quickstart.html

Commands

  • sinfo: get cluster status (e.g., number of free nodes at the moment).
  • squeue -u USERNAME: show the job queue of user USERNAME.
  • sbatch JOBSCRIPT: submit a job script to the SLURM queue.
  • scontrol show job JOBID: get detailed information on the job with id JOBID.
  • scancel JOBID: cancel the job with id JOBID.
  • scancel -u USERNAME: cancel all jobs of user USERNAME.
  • srun: launch a command as a task within a SLURM allocation, typically from a job script. Example: srun python train.py.
  • sacct -j JOBID: get accounting statistics for a job, while it is running or after it has completed.
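
A minimal example session tying a few of these commands together (job.sh and the job id 123456 are placeholders):

# Check which partitions/nodes are currently available
sinfo

# Submit a batch job script; SLURM replies with something like "Submitted batch job 123456"
sbatch job.sh

# Show your own queue, including the state of the new job
squeue -u USERNAME

# Inspect the job in detail, then cancel it if no longer needed
scontrol show job 123456
scancel 123456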

More commands here: https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/

SLURM commands on JSC: https://apps.fz-juelich.de/jsc/hps/juwels/batchsystem.html#slurm-commands

Job scripts for batch jobs

SLURM job scripts are regular shell scripts enriched with #SBATCH directives at the top, which tell the scheduler what resources the job needs.

To know more, see this: https://www.osc.edu/book/export/html/2861
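
As a minimal sketch, a batch job script could look like the following (the account, partition, module, and script names are placeholders to adapt to your cluster):

#!/bin/bash
#SBATCH --job-name=my_job          # name shown by squeue
#SBATCH --account=ACCOUNT          # placeholder: your compute budget/account
#SBATCH --partition=batch          # placeholder: target partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00            # walltime limit (HH:MM:SS)
#SBATCH --output=job-%j.out        # output file; %j expands to the job id

# Placeholder environment setup
# module load Python
# source /path/to/venv/bin/activate

srun python train.py

Save it, e.g., as jobscript.sh and submit it with sbatch jobscript.sh.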

Check job status

Once a job is submitted to the SLURM queue, it goes through a number of states before finishing. You can check which state a job of interest is in using the following command:

scontrol show job JOBID

To interpret the state code, use this guide: https://confluence.cscs.ch/display/KB/Meaning+of+Slurm+job+state+codes
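
For example, to look only at the state field (the job id is a placeholder):

# JobState holds the state code (PENDING, RUNNING, COMPLETED, ...)
scontrol show job 123456 | grep JobState

# Accounting view of the same information, also available after the job finishes
sacct -j 123456 --format=JobID,JobName,State,Elapsed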

Interactive shell on a compute node

Allocate a compute node with 4 GPUs on the JSC supercomputer:

salloc --account=intertwin --partition=batch --nodes=1 --ntasks-per-node=1 --cpus-per-task=4 --gpus-per-node=4 --time=01:00:00

Once resources are available, the command returns a JOBID. Use it to open an interactive shell on the compute node with the 4 GPUs:

srun --jobid JOBID --overlap --pty /bin/bash

# Check that you are in the compute node and you have 4 GPUs
nvidia-smi

Finally, remember to load the correct environment modules before activating the Python virtual environment.
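
As a sketch, the sequence could look like this (the module name and the virtual environment path are placeholders; check what is actually available with module avail):

# Placeholder module: adapt to your system's software stack
module load Python

# Activate a previously created virtual environment (placeholder path)
source /path/to/venv/bin/activate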

Environment variables

When a job starts, SLURM sets a number of environment variables in the job's environment.

See here a table of them: https://www.glue.umd.edu/hpcc/help/slurmenv.html
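
For example, a job script can print some of the most common ones:

# Inspect a few SLURM-provided variables from inside a job
echo "Job id:        $SLURM_JOB_ID"
echo "Node list:     $SLURM_JOB_NODELIST"
echo "Tasks:         $SLURM_NTASKS"
echo "CPUs per task: $SLURM_CPUS_PER_TASK"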

Job arrays

Job arrays allow you to conveniently submit a collection of similar and independent jobs.

To know more, see this: https://slurm.schedmd.com/job_array.html

Job array example: https://guiesbibtic.upf.edu/recerca/hpc/array-jobs
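
As a sketch, a job array script running the same program on 10 different inputs could look like this (account, partition, and file names are placeholders):

#!/bin/bash
#SBATCH --job-name=my_array
#SBATCH --account=ACCOUNT          # placeholder: your compute budget/account
#SBATCH --partition=batch          # placeholder: target partition
#SBATCH --array=0-9                # 10 array tasks, with indices 0..9
#SBATCH --time=00:30:00
#SBATCH --output=array-%A_%a.out   # %A = array job id, %a = array task index

# Each array task receives its own SLURM_ARRAY_TASK_ID
srun python train.py --config "config_${SLURM_ARRAY_TASK_ID}.yaml"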