Managing jobs using Slurm Workload Manager


On HPC clusters, computations should be performed on the compute nodes. Special programs, called resource managers, workload managers or job schedulers, are used to allocate processors and memory on the compute nodes to users' jobs. On Condo the Slurm Workload Manager is used for this purpose. Jobs can be run in interactive or batch mode. When executing or debugging short-running jobs that use a small number of MPI processes, interactive execution may speed up program development compared to batch execution. To start an interactive session for one hour, issue:

salloc -N1 -t 1:00:00
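
Once the allocation is granted, salloc opens a shell within the allocation (depending on the cluster's configuration, either on the login node or directly on the compute node); commands prefixed with srun run on the allocated compute node. For example, hostname below is used purely as an illustration:

srun hostname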

Your environment, such as loaded environment modules, will be copied to the interactive session. It's important to issue

exit

when you're done, so that the resources assigned to your interactive job are freed and can be used by other users.

However, when running longer jobs, batch mode should be used instead. In this case a job script should be created and submitted to the queue by issuing:

sbatch <job_script>


The job script contains Slurm settings, such as the number of cores and the time requested, as well as the commands to be executed during the batch session. Use the Slurm Job Script Generator for Condo to create job scripts.
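
As an illustration, a minimal job script might look like the sketch below; the resource requests, module name, and program name my_program are placeholders, and the Script Generator should be used to produce the exact directives appropriate for Condo:

#!/bin/bash
#SBATCH --job-name=my_job          # name shown in the queue
#SBATCH --nodes=1                  # number of nodes requested
#SBATCH --ntasks-per-node=16       # MPI processes per node
#SBATCH --time=2:00:00             # walltime limit (HH:MM:SS)
#SBATCH --output=my_job.%j.out     # output file (%j expands to the job ID)

module load <module_name>          # load the software your job needs
srun ./my_program                  # launch the program on the allocated cores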

In Slurm, queues are called partitions. Partitions only need to be specified explicitly when submitting jobs to special nodes (such as fat, huge, or GPU nodes). Otherwise Slurm will place the job in a partition based on the number of nodes and the time requested. The Script Generator will automatically add the appropriate partition to the generated job script if accelerator nodes are selected.
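
For example, to request a GPU node explicitly, lines like the following could be added to the job script; the partition name gpu and the GRES specification are only assumptions and may differ on Condo (check sinfo):

#SBATCH --partition=gpu
#SBATCH --gres=gpu:1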


To see the list of available partitions, issue:

sinfo
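
To restrict the listing to a single partition, pass its name with the -p option (gpu here is just an example partition name):

sinfo -p gpu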

For more details on partition limits, issue:

scontrol show partitions


To see the job queue, issue:

squeue
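
To list only your own jobs, pass your username with the -u option:

squeue -u <username>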

To cancel job <job_id>, issue:

scancel <job_id>
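
All of your pending and running jobs can be cancelled at once with:

scancel -u <username>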