Slurm workload manager
What is Slurm and why do we need it?
To achieve high utilization of the supercomputer and to distribute resources properly, user tasks are run through a scheduler. The scheduler on Komondor is Slurm. It receives tasks (jobs) from users and keeps them in a waiting queue until the required computational resources become available. When the resources are free, the scheduler starts the job on the compute node(s); once the job completes, the resources are released. Every job has a time limit and a resource limit, and jobs that exceed their limits are stopped by the scheduler. Several waiting queues (called partitions) are available, and they may be configured differently (e.g. maximum allowed run time). When submitting a job, users must decide which partition to use, how long the job will run and what resources it requires.
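Before choosing a partition, it helps to see which ones exist and what their limits are. The following is a minimal sketch that wraps a sinfo call in a shell function; the function name list_partitions is made up for illustration, and the selected columns are only one reasonable choice.

```shell
# List the partitions with their time limit, node count and availability.
# sinfo format fields: %P partition name, %l maximum run time,
# %D number of nodes, %a availability (up/down).
list_partitions() {
    sinfo --format="%P %l %D %a"
}
```

Calling list_partitions on a login node prints the partitions, with the default one marked by an asterisk after its name.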
Useful commands
sacct
- Displays accounting information about Slurm jobs, including finished ones (e.g. state, run time, allocated resources).
salloc
- Allocates a set of resources on the system. It lets you start an interactive session and run commands directly on the compute nodes.
sattach
- Attaches to a running Slurm job step. It connects your terminal to the standard input, output and error of the step, so you can follow the output while the job is running.
sbatch
(recommended) - Submits a batch script to Slurm for scheduling. The job will be launched with the requested resources.
sbcast
- Transmits a file to the compute nodes allocated to a Slurm job. This can be useful when the tasks of a job need the same data on every node.
scancel
- Cancels running or queued Slurm jobs. It can also send signals to running jobs.
scontrol
- Views and modifies the Slurm configuration and state (e.g. showing job, node and partition details, holding or updating jobs). Many of its operations are reserved for administrators.
sinfo
- Displays information about Slurm nodes and partitions. It gives an overview of the available resources and helps you decide where to submit a job.
sprio
- Displays the factors that make up a job’s scheduling priority.
squeue
- Displays the list of running and pending Slurm jobs. It lets you track the state and priority of jobs.
srun
- Runs parallel Slurm jobs with the requested resources. It can be used to run tasks directly on the compute nodes.
sshare
- Displays the fair-share information of Slurm accounts and users. It shows how much of the system each of them has used relative to its assigned share.
sstat
- Displays status information about a running job or job step, such as its CPU, memory and I/O usage.
strigger
- Used to set, get or clear Slurm triggers. Triggers can activate actions or commands in response to certain events.
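As an illustration of how a few of these commands are typically called, the sketch below wraps sacct, scontrol and scancel in small shell functions. The function names (job_summary, job_details, cancel_job) are hypothetical helpers rather than part of Slurm, and the sacct columns are just one possible selection.

```shell
# Accounting summary of one job; works for running and finished jobs.
job_summary() {
    sacct --jobs="$1" --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode
}

# Full scheduler record of a pending or running job.
job_details() {
    scontrol show job "$1"
}

# Cancel a job by its ID.
cancel_job() {
    scancel "$1"
}
```

Each function takes a job ID as its only argument, for example job_summary <job_id>.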
How to use the Slurm scheduler?
- Preparing a Slurm script:
Create a Slurm script (usually with a “.sbatch” extension) that defines the job you want to run. In the script you can specify the required resources (e.g. number of CPU cores, amount of memory, run time limit), as well as the name of the job, the output files and other Slurm parameters.
- Submitting a Slurm script:
The Slurm script then has to be submitted to the Slurm scheduler. To do this, run the following command in the terminal:
sbatch <script_name>.sbatch
- Monitoring jobs:
After you submit the script, the Slurm scheduler handles the job. You can use the squeue command to list the jobs and check their state. For instance:
squeue
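The three steps above can be sketched as a small set of shell functions. This is only an illustration under stated assumptions: the function names are hypothetical, the partition name cpu is a placeholder (list the real partitions with sinfo), and the requested resources are arbitrary small values. Without extra options, sbatch reports "Submitted batch job <id>"; the sketch uses --parsable so that only the job ID (optionally followed by a semicolon and the cluster name) is printed.

```shell
# Step 1: write a minimal batch script to the given file.
# Lines starting with #SBATCH are read by sbatch as options.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
# Name shown by squeue and sacct.
#SBATCH --job-name=example
# Placeholder partition name (assumption); check sinfo for the real ones.
#SBATCH --partition=cpu
# Run time limit (HH:MM:SS); the job is stopped if it runs longer.
#SBATCH --time=00:10:00
# One task with four CPU cores and 4 GB of memory.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
# Output file; %j is replaced by the job ID.
#SBATCH --output=example-%j.out

srun hostname
EOF
}

# Step 2: submit the script and print only the numeric job ID.
submit_job() {
    local out
    out=$(sbatch --parsable "$1") || return 1
    echo "${out%%;*}"
}

# Step 3: list your own jobs with ID, partition, name, state,
# elapsed time, node count and reason or node list.
my_jobs() {
    squeue --user="$USER" --format="%i %P %j %T %M %D %R"
}
```

A typical session would call write_job_script example.sbatch, then job_id=$(submit_job example.sbatch), and finally my_jobs to watch the job move from pending to running.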
You will find a detailed user guide in the “Basic usage” chapter.