Preparing Job Scripts (sbatch)

#SBATCH Directives

Applications have to be run in batch mode on the supercomputer. This means that a job script must be prepared for each run, containing a description of the required resources and the commands to execute. The script is submitted to the Slurm scheduler with the sbatch command. Parameters for the scheduler (resource requirements) can be provided at the top of the script using #SBATCH directives. (Note that sbatch stops processing #SBATCH directives once the first non-comment, non-whitespace line of the script is reached.)

Basic Options

The most basic options that can be provided using sbatch directives are demonstrated in the following sbatch script:

#!/bin/bash
#SBATCH --account=ACCOUNT
#SBATCH --job-name=NAME
#SBATCH --partition=PARTITION
#SBATCH --time=TIME
#SBATCH --cpus-per-task=NCPUS
#SBATCH --mem-per-cpu=SIZE

EXECUTABLE COMMANDS
...
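
A filled-in version might look like the following minimal sketch; the account name, job name and executable are placeholders, so substitute your own values:

#!/bin/bash
#SBATCH --account=my_project        # placeholder project account (see sbalance)
#SBATCH --job-name=test_run
#SBATCH --partition=cpu
#SBATCH --time=01:00:00             # 1 hour walltime
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000          # 1000 MB per CPU core

./my_program                        # placeholder executable

If the script is saved as, say, job.sh, it can be submitted with:

sbatch job.sh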

Important

If an option is not set, Slurm applies its default value for that option. Always make sure that either you set the options correctly or that the defaults are suitable for your job, otherwise your job may not run as expected (e.g. it may get a low priority, run out of memory, fail to scale up properly, or even remain pending forever).

Description of the options above (and their defaults):

--account (-A): ACCOUNT is the name of the project account to be debited (accessible accounts can be listed using the sbalance command). Default: The account (project) associated with the owner of the job (on Komondor, each user is associated with exactly one account for each project in which the user is participating).

--job-name (-J): NAME is the short name of the job. Default: The name of the batch script or of the submitting application.

--partition (-p): PARTITION is the partition that is requested for the resource allocation. Default: cpu

--time (-t): TIME is the maximum running time (walltime) allowed for the job. The following time formats can be used: “MINUTES”, “MINUTES:SECONDS”, “HOURS:MINUTES:SECONDS”, “DAYS-HOURS”, “DAYS-HOURS:MINUTES” and “DAYS-HOURS:MINUTES:SECONDS”. Currently the default is 2 days and the maximum is 7 days on all partitions of Komondor.

--cpus-per-task (-c): NCPUS is the number of processors that is to be allocated per task within the job. The default is one CPU core per task.

--mem-per-cpu: SIZE is the minimum memory required per usable allocated CPU. Default units are megabytes. On Komondor, the default is 1000 MB / CPU core. The maximum memory that can be allocated per CPU core varies per partition: “CPU” - 2000 MB, “GPU” - 4000 MB, “AI” - 4000 MB, “BigData” - 42000 MB. (If more memory is requested, the job will be billed for the number of CPUs that can provide the required amount of memory.)
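
As an illustration of the billing rule above (using the per-core limit listed for the “CPU” partition): a job that requests 4000 MB per core while allocating a single core would be billed for 2 CPU cores, because two cores are needed to cover 4000 MB at 2000 MB / core.

#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4000    # billed as 2 cores on the "CPU" partition (2 x 2000 MB)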

For a full list of sbatch options, see the description of sbatch in the Slurm documentation or type man sbatch in the terminal.

Time Limit

Jobs are only allowed to run for a limited time; when the time limit expires, the job is canceled by the scheduler. The default time limit on Komondor is 2 days. You can set the time limit for your job explicitly with the --time option. The maximum time limit that can be requested is 7 days (if you request more, your job will be left in a PENDING state, possibly indefinitely).

#SBATCH --time=TIME

TIME is the maximum running time (walltime) allowed for the job. The following time formats can be used: “MINUTES”, “MINUTES:SECONDS”, “HOURS:MINUTES:SECONDS”, “DAYS-HOURS”, “DAYS-HOURS:MINUTES” and “DAYS-HOURS:MINUTES:SECONDS”.
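
For example (these are alternative, independent settings, so use only one --time directive per script):

#SBATCH --time=30             # 30 minutes
#SBATCH --time=02:30:00       # 2 hours and 30 minutes
#SBATCH --time=1-12:00:00     # 1 day and 12 hours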

The current time limits set for the partitions can be queried with the following command:

sinfo --Format partition,defaulttime,time

Note

Shorter jobs can get higher priority, so it’s a good idea to set the job’s time limit as accurately as you can estimate it. Properly setting the time limit also helps Slurm schedule jobs more efficiently.

CPU Allocation

By default, one CPU core is allocated per task within your job. You can ask for more processors with the --cpus-per-task option:

#SBATCH --cpus-per-task=NCPUS

NCPUS is the number of processors that is to be allocated per task within the job.
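
For example, a multithreaded (e.g. OpenMP) application could be given 8 cores per task as sketched below; the executable name is a placeholder, and the OMP_NUM_THREADS line only applies to programs that read their thread count from that variable:

#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the thread count to the allocation
./my_threaded_program                         # placeholder executable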

Memory Allocation

By default, 1000 MB of memory is assigned to 1 allocated CPU core. You can request more with the --mem-per-cpu option:

#SBATCH --mem-per-cpu=SIZE

SIZE is given in MB (megabytes) by default. You can specify the unit explicitly: K for KB (kilobytes), M for MB (megabytes), G for GB (gigabytes), T for TB (terabytes).
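
For example, the following two directives both set the per-core memory, once in the default megabytes and once with an explicit unit (use only one of them in a script):

#SBATCH --mem-per-cpu=2000    # 2000 MB per allocated CPU core
#SBATCH --mem-per-cpu=2G      # 2 GB per allocated CPU core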

The maximum memory that can be allocated per CPU core varies per partition:

  • “CPU”: 2000 MB / core

  • “GPU”: 4000 MB / core

  • “AI”: 4000 MB / core

  • “BigData”: 42000 MB / core

You can use the --mem-per-gpu option to specify the amount of memory required per allocated GPU:

#SBATCH --mem-per-gpu=SIZE

The --mem option sets the required memory per allocated node:

#SBATCH --mem=SIZE

Here, setting SIZE to 0 requests all of the memory on the allocated nodes.

Note: The --mem-per-cpu, --mem-per-gpu and --mem options are mutually exclusive.

GPU Allocation

GPUs can be allocated using the --gres option:

#SBATCH --gres=gpu:N

N sets the number of required GPUs per node: 1-4 on the “GPU” partition or 1-8 on the “AI” partition (nodes of the “CPU” and “BigData” partitions have no GPUs).
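
For example, a sketch of a job requesting two GPUs on a node of the “GPU” partition (the lowercase partition name below is an assumption, so check sinfo for the exact name):

#SBATCH --partition=gpu       # assumed partition name
#SBATCH --gres=gpu:2          # 2 GPUs per allocated node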

Node Allocation

Slurm will allocate enough nodes to satisfy the requested resources. However, you can explicitly specify the number of nodes you want to assign to your job with the --nodes option:

#SBATCH --nodes=N

N sets the number of required nodes for the job. You can also give a range in the form “MINIMUM-MAXIMUM”, or a comma-separated list of node counts; see the examples below.
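
For example (use only one --nodes directive per script):

#SBATCH --nodes=4             # exactly 4 nodes
#SBATCH --nodes=2-4           # at least 2 and at most 4 nodes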

Setting the number of required nodes does not mean you will get all resources of the allocated nodes; it just means that the tasks of your job can be distributed over that many nodes. If you want to allocate all the CPUs and GRES (generic resources, e.g. GPUs) of the nodes for your job, you can use the --exclusive option. Note that this does not mean you also get all the memory on the allocated nodes; request the per-node memory explicitly with the --mem option, as in the example below.

#SBATCH --nodes=N
#SBATCH --exclusive
#SBATCH --mem=SIZE_PER_NODE
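
A concrete instance of this pattern: two whole nodes with all of their CPUs and GRES, plus all of the memory on each node (recall that --mem=0 means all memory of the allocated nodes):

#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --mem=0               # all memory on each allocated node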

Non-restartable Jobs

For jobs that are not restartable or should not be restarted, the --no-requeue option can be set to prevent requeueing (e.g. after a node failure):

#SBATCH --no-requeue

This setting is needed because the default behaviour on Komondor is to requeue jobs.

Quality of Service (QOS)

Each job submitted to Slurm is assigned a Quality of Service (QOS), which affects how the job is executed (e.g. priority, preemption, interruption, resource billing). The default QOS is “normal”: the job cannot be interrupted, and exactly the CPU time used is billed.

For more detailed description and information about the available QOSs, see the “Basic Usage / Quality of Service (QOS)” chapter.

You can set a QOS other than the default for your job using the --qos option. For example, this is how you set the QOS to “lowpri”:

#SBATCH --qos=lowpri

Low-priority jobs may be interrupted and later resumed at any time. It is advised to test how your software behaves by artificially terminating it (more explanation will follow…).
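
One simple way to do such a test, as a sketch: submit a short low-priority job, then send its processes a termination signal with scancel and check whether the application shuts down (and can later resume) cleanly. The job ID below is a placeholder:

scancel --signal=TERM 123456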

Email Notification

You can instruct Slurm to send an email when the state of your job changes (e.g. starts, ends or gives an error):

#SBATCH --mail-type=ALL
#SBATCH --mail-user=EMAIL

You can set the triggering events using the --mail-type option (you can find the full list in the sbatch description in the Slurm documentation or type man sbatch on the terminal). The --mail-user option sets the email address where the notification should be sent.
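
For example, to be notified only when the job finishes or fails (the address below is a placeholder):

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.com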

Slurm Environment Variables

Many of the settings specified (explicitly or implicitly) in the sbatch script are accessible through environment variables after submission. Some examples:

$SLURM_JOB_ID

Unique ID for the job.

$SLURM_JOB_NAME

Name of your job (set with --job-name).

$SLURM_CPUS_PER_TASK

Number of CPU cores per task (set with --cpus-per-task).

$SLURM_MEM_PER_CPU

Amount of memory per CPU core (set with --mem-per-cpu).

$SLURM_MEM_PER_NODE

Amount of memory per node (set with --mem).

$SLURM_NTASKS

Number of tasks (set with --ntasks).

$SLURM_NTASKS_PER_NODE

Number of tasks per node (set with --ntasks-per-node).

You can find a full list of Slurm output environment variables in the Slurm documentation.
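
For example, a job script could log its own allocation at the start of the run; note that some of these variables (e.g. $SLURM_CPUS_PER_TASK, $SLURM_MEM_PER_CPU) are only set when the corresponding option was given:

echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) started"
echo "Tasks: $SLURM_NTASKS, CPU cores per task: $SLURM_CPUS_PER_TASK"
echo "Memory per CPU core: $SLURM_MEM_PER_CPU MB"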