Submitting and Managing Jobs

Resource Usage Estimation

Before production runs, it is worth making a preliminary estimate of resource consumption. The sestimate command can be used for this:

sestimate -N NODES -t WALLTIME

where NODES is the number of nodes you want to allocate and WALLTIME is the maximum running time of the job.
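For example, a preliminary estimate for a hypothetical 2-node job with a 12-hour wall time limit could be requested as follows (the values are illustrative only):

sestimate -N 2 -t 12:00:00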

It is important to specify the required resources and time limit of the jobs as accurately as possible, because the scheduler takes these into account when prioritizing jobs. Shorter jobs usually start sooner. It is worth checking the running time and resource usage of each job with the sacct command after execution is complete.

Checking Resource Availability

The overall status of the cluster can be viewed with the sinfo command; a specific partition can be selected with the -p flag.

sinfo -p ai
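A more detailed, node-oriented view of the same partition can also be requested (the partition name ai is taken from the example above):

sinfo -p ai -N -l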

Submitting Batch Jobs

Jobs (sbatch scripts) can be submitted using the sbatch command:

sbatch <my_sbatch_script.sh>

On successful submission, you get the following output:

Submitted batch job <JOBID>

where JOBID is the unique ID of the job.
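As an illustration, a minimal sbatch script could look like the sketch below; the partition name, resource values, and the srun command are placeholders and should be adapted to your own project:

#!/bin/bash
#SBATCH --job-name=test_job        # job name shown by squeue
#SBATCH --partition=ai             # target partition (placeholder)
#SBATCH --nodes=1                  # number of nodes
#SBATCH --ntasks=1                 # number of tasks
#SBATCH --cpus-per-task=4          # CPU cores per task
#SBATCH --time=01:00:00            # wall time limit (HH:MM:SS)
#SBATCH --output=slurm-%j.out      # output file, %j is replaced by the JOBID

srun hostname                      # replace with your actual workload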

Status Information

The squeue command lists your scheduled jobs and their status.
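For example, to list only your own jobs (the user name is taken from the $USER environment variable):

squeue -u $USER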

Each job has a unique ID (JOBID) assigned to it. You can get more information by referring to the JOBID. The scontrol command displays the attributes of a submitted or running job:

scontrol show job <JOBID>

Every job is registered in an accounting database. Attributes and resource consumption statistics of submitted jobs can be retrieved from this database. You can view brief statistics using the sacct command:

sacct -j <JOBID>

Detailed statistics:

sacct -l -j <JOBID>
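The output columns can also be selected explicitly with the --format option; the fields below are an illustrative, commonly useful selection:

sacct -j <JOBID> --format=JobID,JobName,Partition,Elapsed,MaxRSS,State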

The smemory command provides information about memory consumption:

smemory <JOBID>

Resource utilization efficiency of a completed job can be checked using the seff command:

seff <JOBID>

Cancelling Jobs

A job can be cancelled with the scancel command:

scancel <JOBID>
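scancel can also select jobs by other criteria; for example, all of your own pending jobs can be cancelled at once (illustrative usage):

scancel -u $USER -t PENDING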

Pending Jobs

Slurm provides information about the reason why certain jobs have not yet started. Using the squeue command, the explanation is shown in the “NODELIST(REASON)” column for pending jobs (for running jobs it shows the nodes associated with the job).

Some examples:

Resources - The job is waiting for resources to become available.

AssociationResourceLimit - The job has reached some resource limit.

AssociationJobLimit / QOSJobLimit - The job has reached the maximum job count.

AssocGrpCPULimit - The project has reached its aggregate CPU limit. For a given project, jobs can run on a maximum of 5120 CPUs at a given time. The current limits of a project can be listed with:

sacctmgr list assoc account=$project format=Account,User,GrpTRES

AssocGrpGRES - The project has reached its aggregate GPU (or other generic resource) limit.

AssocGrpCPUMinutesLimit - The project has reached the maximum number of minutes of CPU time usage.

AssocGrpGRESMinutes - The project has reached the maximum number of minutes of GPU (or other generic resource) time usage.

Priority - The job is waiting due to low priority (one or more higher priority jobs exist for this partition). Lowering the time limit of the job may result in higher priority.

For a full list of job reason codes, see the “Resource Limits” page of the Slurm documentation or type man squeue in the terminal.
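The reason codes of your own pending jobs can also be listed directly with an explicit output format (the format string below is illustrative; %R prints the reason for pending jobs):

squeue -u $USER -t PENDING -o "%.18i %.9P %.8j %.8T %R"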