SLURM resources
Here you can find a minimal set of resources to use the SLURM job scheduler on an HPC cluster.
What is SLURM? See this quickstart: https://slurm.schedmd.com/quickstart.html
SLURM cheatsheets:
- https://slurm.schedmd.com/pdfs/summary.pdf
- https://www.carc.usc.edu/user-information/user-guides/hpc-basics/slurm-cheatsheet
Basic SLURM commands:
- `sinfo`: get cluster status (e.g., number of free nodes at the moment).
- `squeue -u USERNAME`: visualize the queue of jobs of the USERNAME user.
- `sbatch JOBSCRIPT`: submit a job script to the SLURM queue.
- `scontrol show job JOBID`: get detailed info on the job with JOBID id.
- `scancel JOBID`: cancel the job with JOBID id.
- `scancel -u USERNAME`: cancel all jobs of the USERNAME user.
- `srun`: execute a command in a SLURM job script. Example: `srun python train.py`.
- `sacct -j JOBID`: get job stats after completion or while running.
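A typical interaction with the queue could look like the following sketch (the script name and JOBID are placeholders):

```bash
# Submit a job script; sbatch prints the assigned job ID.
sbatch jobscript.sh

# Check the state of your jobs in the queue.
squeue -u USERNAME

# Inspect one job in detail, or cancel it if needed.
scontrol show job JOBID
scancel JOBID
```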
More commands here: https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/
SLURM commands on JSC: https://apps.fz-juelich.de/jsc/hps/juwels/batchsystem.html#slurm-commands
SLURM job scripts are regular shell files enriched with some `#SBATCH` directives at the top.
To know more, see this: https://www.osc.edu/book/export/html/2861
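For example, a minimal job script requesting the same 4-GPU allocation used later on this page could look like the sketch below (account, partition, and file names are placeholders to adapt to your cluster):

```bash
#!/bin/bash
# Minimal SLURM job script sketch; adapt account/partition/resources to your cluster.
#SBATCH --job-name=train-example
#SBATCH --account=intertwin
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-node=4
#SBATCH --time=01:00:00
#SBATCH --output=job-%j.out
#SBATCH --error=job-%j.err

# Run the actual work on the allocated resources.
srun python train.py
```

Submit it with `sbatch jobscript.sh` (the file name is just an example).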
Once a job is submitted to the SLURM queue, it goes through a number of states before finishing. You can check which state a job of interest is in using the following command:
scontrol show job JOBID
To interpret the state code, use this guide: https://confluence.cscs.ch/display/KB/Meaning+of+Slurm+job+state+codes
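For instance, to print only the state of a job you can use standard SLURM output options (a small sketch; JOBID is a placeholder):

```bash
# Show only the state of a queued/running job (PENDING, RUNNING, ...).
squeue -j JOBID -o "%T"

# For jobs that already finished, query the accounting database instead.
sacct -j JOBID --format=JobID,State,Elapsed
```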
Allocate a compute node with 4 GPUs on the JSC supercomputer:
salloc --account=intertwin --partition=batch --nodes=1 --ntasks-per-node=1 --cpus-per-task=4 --gpus-per-node=4 --time=01:00:00
Once resources are available, the command will return a JOBID. Use it to jump into the compute node with the 4 GPUs in this way:
srun --jobid JOBID --overlap --pty /bin/bash
# Check that you are in the compute node and you have 4 GPUs
nvidia-smi
Remember to load the correct environment modules before activating the Python virtual environment.
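For example, inside the interactive shell on the compute node (the module names and the virtual environment path below are placeholders; check `module avail` for what your cluster provides):

```bash
# Load the environment modules required by your software stack
# (placeholder names; adapt to your cluster).
module load Python CUDA

# Then activate your Python virtual environment.
source /path/to/venv/bin/activate
```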
Before your job starts, SLURM sets a number of environment variables in the job environment.
See a table of them here: https://www.glue.umd.edu/hpcc/help/slurmenv.html
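For example, a job script can read some of these variables at run time (the variables below are standard SLURM ones; the `--workers` argument of the training script is just a hypothetical example):

```bash
# Print a few of the SLURM-provided variables from inside a job script.
echo "Job ID:          ${SLURM_JOB_ID}"
echo "Job name:        ${SLURM_JOB_NAME}"
echo "Number of nodes: ${SLURM_JOB_NUM_NODES}"
echo "CPUs per task:   ${SLURM_CPUS_PER_TASK}"

# They are often used to configure the application, e.g. (hypothetical flag):
srun python train.py --workers "${SLURM_CPUS_PER_TASK}"
```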
Job arrays allow you to conveniently submit a collection of similar, independent jobs.
To know more, see this: https://slurm.schedmd.com/job_array.html
Job array example: https://guiesbibtic.upf.edu/recerca/hpc/array-jobs
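A minimal job array sketch, where each task picks its own input file based on its array index (script and file names are placeholders):

```bash
#!/bin/bash
# Job array sketch: 10 independent tasks indexed 0..9; adapt names and resources.
#SBATCH --job-name=array-example
#SBATCH --array=0-9
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --output=array-%A_%a.out

# Each array task receives its own SLURM_ARRAY_TASK_ID (0..9 here)
# and uses it to select the input it should process.
srun python process.py "input_${SLURM_ARRAY_TASK_ID}.txt"
```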