Submitting and Managing Jobs
Resource Usage Estimation
Before production runs, it is worth making a preliminary estimate of resource consumption. The sestimate
command can be used for this:
sestimate -N NODES -t WALLTIME
where NODES is the number of nodes you want to allocate and WALLTIME is the maximum running time of the job.
It is important to specify the required resources and time limit of a job as accurately as possible, because the scheduler takes these into account when prioritizing jobs.
Shorter jobs usually start sooner. It is worth checking the running time and resource usage of each job using the sacct
command after execution has completed.
Checking Resource Availability
The overall status of the cluster can be viewed using the sinfo
command; a specific partition can be selected with the -p
flag.
sinfo -p ai
Submitting Batch Jobs
Jobs (sbatch scripts) can be submitted using the sbatch
command:
sbatch <my_sbatch_script.sh>
On successful submission, you get the following output:
Submitted batch job <JOBID>
where JOBID is the unique ID of the job.
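A minimal batch script might look like the following sketch; the job name, resource values, and program name are placeholders to adapt to your project (the ai partition is the example partition used above):

```shell
#!/bin/bash
#SBATCH --job-name=myjob          # name shown in squeue (placeholder)
#SBATCH --partition=ai            # target partition
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --cpus-per-task=4         # CPU cores per task
#SBATCH --time=01:00:00           # wall-time limit; a tight limit can improve priority
#SBATCH --output=slurm-%j.out     # output file, %j expands to the JOBID

srun ./my_program                 # placeholder application
```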
Status Information
The squeue
command lists your scheduled jobs and their status.
Each job is assigned a unique ID (JOBID). You can retrieve more information about a job by referring to its JOBID.
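For example, assuming standard squeue options, the listing can be restricted to your own jobs or to a single job:

```shell
# Jobs of the current user only
squeue -u $USER

# A single job, selected by its ID
squeue -j <JOBID>
```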
The scontrol
command displays the attributes of a submitted or running job:
scontrol show job <JOBID>
All jobs are registered in an accounting database, from which the attributes and resource consumption statistics of submitted jobs can be retrieved.
You can view the brief statistics using the sacct
command:
sacct -j <JOBID>
Detailed statistics:
sacct -l -j <JOBID>
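If only a few attributes are of interest, sacct can print a custom field selection; the fields below are standard sacct field names (see sacct --helpformat for the full list):

```shell
# Run time, peak memory and exit status of a finished job
sacct -j <JOBID> --format=JobID,JobName,Partition,Elapsed,MaxRSS,State,ExitCode
```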
The smemory
command provides information about memory consumption:
smemory <JOBID>
Resource utilization efficiency of a completed job can be checked using the seff
command:
seff <JOBID>
Cancelling Jobs
A job can be cancelled with the scancel
command:
scancel <JOBID>
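scancel can also cancel jobs in bulk; assuming standard scancel options:

```shell
# Cancel all of your own jobs
scancel -u $USER

# Cancel only your pending jobs
scancel -u $USER --state=PENDING
```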
Pending Jobs
Slurm provides information about the reason why certain jobs have not yet started.
Using the squeue
command, the explanation is shown in the “NODELIST(REASON)” column for pending jobs
(for running jobs it shows the nodes associated with the job).
Some examples:
Resources - The job is waiting for resources to become available.
AssociationResourceLimit - The job has reached some resource limit.
AssociationJobLimit / QOSJobLimit - The job has reached the maximum job count.
AssocGrpCPULimit - The project has reached its aggregate CPU limit. For a given project, jobs can run on a maximum of 5120 CPUs at any given time. The current project limits can be listed with:
sacctmgr list assoc account=$project format=Account,User,GrpTRES
AssocGrpGRES - The project has reached its aggregate GPU (or other generic resource) limit.
AssocGrpCPUMinutesLimit - The project has reached the maximum number of minutes of CPU time usage.
AssocGrpGRESMinutes - The project has reached the maximum number of minutes of GPU (or other generic resource) time usage.
Priority - The job is waiting due to low priority (one or more higher-priority jobs exist for this partition). Lowering the job's time limit may result in a higher priority.
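To check the reason code of a single pending job without listing the whole queue, the squeue output format can be restricted (standard squeue format specifiers):

```shell
# Job ID, state and reason only
squeue -j <JOBID> -o "%.18i %.8T %r"
```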
For a full list of job reason codes, see the “Resource Limits” page of the Slurm documentation
or type man squeue
in the terminal.