Checking Job Efficiency
When Slurm grants the resources (CPU, memory, GPU) that you requested for a job, it reserves them (and bills for them) for as long as the job is running, whether or not the job actually uses them. To avoid wasting resources, check the resource utilization efficiency of your completed jobs and adjust your requests if the efficiency is low. Reviewing resource usage can also help to identify bottlenecks in the job.
CPU and memory efficiency
$ jobstats <JOBID>
$ reportseff <JOBID>
https://jobstats.komondor.hpc.einfra.hu/
Warning
This section is under construction
The easiest way to check the CPU and memory utilization efficiency of a job is to use the seff script:
$ seff <JOBID>
where JOBID is the unique ID of the job (or job step), or a comma-separated list of such IDs.
An example:
$ seff 1234567
Job ID: 1234567
Cluster: komondor
User/Group: alice/alice
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 01:08:43
CPU Efficiency: 88.10% of 01:18:00 core-walltime
Job Wall-clock time: 00:09:45
Memory Utilized: 2.74 GB
Memory Efficiency: 68.56% of 4.00 GB
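The CPU efficiency in this report can be checked by hand: it is the CPU time used divided by the core-walltime, that is, the wall-clock time multiplied by the total number of cores. Here 01:08:43 is 4123 seconds and 8 × 00:09:45 is 4680 seconds (01:18:00), which gives 88.10%. The following is a minimal sketch of that arithmetic, not part of seff itself; the helper names to_seconds and cpu_efficiency are made up for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: recompute the "CPU Efficiency" figure that seff reports.
# The function names are hypothetical helpers, not Slurm commands.

# Convert a Slurm time string ([DD-]HH:MM:SS) to seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    days = 0
    if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
    n = split(t, part, ":")
    secs = 0
    for (i = 1; i <= n; i++) secs = secs * 60 + part[i]
    print days * 86400 + secs
  }'
}

# cpu_efficiency <CPU Utilized> <Job Wall-clock time> <total cores>
cpu_efficiency() {
  awk -v used="$(to_seconds "$1")" -v wall="$(to_seconds "$2")" -v cores="$3" \
    'BEGIN { printf "%.2f\n", 100 * used / (wall * cores) }'
}

# Values taken from the seff example above: 1 node with 8 cores.
cpu_efficiency 01:08:43 00:09:45 8   # prints 88.10
```

The total number of cores is Nodes multiplied by Cores per node, so for multi-node jobs pass the product as the third argument.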
A more detailed and highly customizable report can be generated using the sacct command:
$ sacct -j <JOBID> -o <FIELD1,FIELD2,...>
where FIELD1,FIELD2,… is a comma-separated list of the required fields.
For example, here is how to get statistics similar to the output of the seff command using sacct:
$ sacct -j 1234567 -o JobID,Cluster,User,Group,State,ExitCode,NNodes,NCPUs,TotalCPU,Elapsed,MaxRSS,ReqMem
JobID           Cluster      User     Group      State ExitCode   NNodes      NCPUS   TotalCPU    Elapsed     MaxRSS     ReqMem
------------ ---------- --------- --------- ---------- -------- -------- ---------- ---------- ---------- ---------- ----------
1234567        komondor     alice     alice  COMPLETED      0:0        1          8   01:08:43   00:09:45                    4G
1234567.bat+   komondor                      COMPLETED      0:0        1          8   01:08:43   00:09:45   2875752K
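The memory efficiency that seff prints can be derived from the MaxRSS and ReqMem fields shown above: 2875752K divided by 4G (4 × 1024 × 1024 K) is 68.56%. The sketch below does this conversion. The helper names are hypothetical, and it assumes that both values carry a plain K, M, G or T suffix; some Slurm versions append an extra letter to ReqMem, which this sketch does not handle.

```shell
#!/usr/bin/env bash
# Sketch: memory efficiency from the MaxRSS and ReqMem fields of sacct.
# to_kib and mem_efficiency are hypothetical helpers, not Slurm commands.

# Convert a size such as 2875752K or 4G to KiB (units are powers of 1024).
to_kib() {
  awk -v v="$1" 'BEGIN {
    unit = toupper(substr(v, length(v), 1))
    num = v + 0                      # numeric prefix of the value
    if (unit == "M") num *= 1024
    else if (unit == "G") num *= 1024 * 1024
    else if (unit == "T") num *= 1024 * 1024 * 1024
    printf "%.0f\n", num
  }'
}

# mem_efficiency <MaxRSS> <ReqMem>
mem_efficiency() {
  awk -v used="$(to_kib "$1")" -v req="$(to_kib "$2")" \
    'BEGIN { printf "%.2f\n", 100 * used / req }'
}

# Values taken from the sacct example above.
mem_efficiency 2875752K 4G   # prints 68.56
```

To feed real values into such a helper, the -n (--noheader) and -P (--parsable2) options of sacct produce output that is easier to parse than the default table.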
Use the --helpformat option for a list of available fields:
$ sacct --helpformat
For detailed information, see the manual page for sacct (type man sacct in the terminal).
Note
If the job is still running, some fields of the sacct output (e.g. those on memory usage) are not available, and the seff output can be incomplete and misleading.
Note
Statistics of interrupted jobs (e.g. jobs cancelled with the scancel command or terminated for exceeding the time limit) may not be accurate, because the total CPU time calculated for an interrupted job may not include that of its descendant processes.
Resource usage of running jobs
You can check the state of a running job (including memory usage) with the sstat command:
$ sstat -j <JOB(.STEP)> -o <FIELD1,FIELD2>
where JOB(.STEP) is the unique ID of a job or job step (or a comma-separated list of IDs) and FIELD1,FIELD2 is a comma-separated list of the required fields.
To display information for all steps of a job, use the --allsteps option:
$ sstat -j <JOB> -o <FIELD1,FIELD2> --allsteps
You can list the available fields using the --helpformat option:
$ sstat --helpformat
In the following example, we list our currently running jobs (using the squeue and sacct commands), and then query some statistics about one of the jobs with sstat:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6342410 cpu bash alice R 0:08 1 x1000c0s1b1n1
6342406 cpu testjob alice R 0:47 1 x1000c1s5b1n0
$ squeue --steps
STEPID NAME PARTITION USER TIME NODELIST
6342406.batch batch cpu alice 0:47 x1000c1s5b1n0
6342410.0 bash cpu alice 0:08 x1000c0s1b1n1
$ sacct -o JobID%15,Partition,Account,State -j 6342406
JobID            Partition    Account      State
--------------- ---------- ---------- ----------
6342406                cpu   research    RUNNING
6342406.batch                research    RUNNING
$ sstat -o JobID,MaxRSS,MaxVMSize,AveCPU,MaxDiskRead,MaxDiskWrite --allsteps -j 6342406
JobID            MaxRSS  MaxVMSize     AveCPU  MaxDiskRead MaxDiskWrite
------------ ---------- ---------- ---------- ------------ ------------
6342406.bat+   2850032K   8558440K   00:03:23   2020065913      9356795
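For repeated checks it can be handy to turn the raw MaxRSS value into a more readable unit. The sketch below assumes pipe-separated input of the form JobID|MaxRSS, as produced by the -n (--noheader) and -P (--parsable2) options of sstat, and assumes MaxRSS is reported with a K suffix as in the example above. The function name maxrss_in_gib is hypothetical, and the input line is a stand-in built from the example values rather than captured output.

```shell
#!/usr/bin/env bash
# Sketch: print the MaxRSS of each job step in GiB.
# Expects lines of the form JobID|MaxRSS with MaxRSS in KiB (suffix K).
# maxrss_in_gib is a hypothetical helper, not a Slurm command.
maxrss_in_gib() {
  awk -F'|' '{ printf "%s %.2f GiB\n", $1, ($2 + 0) / (1024 * 1024) }'
}

# On the cluster (not run here):
#   sstat -n -P -j 6342406 --allsteps -o JobID,MaxRSS | maxrss_in_gib
# Stand-in line built from the example above:
echo '6342406.batch|2850032K' | maxrss_in_gib   # prints: 6342406.batch 2.72 GiB
```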
For detailed information, see the manual page for sstat (type man sstat in the terminal).
Requested vs allocated resources
In some cases, due to resource allocation constraints, Slurm will allocate more resources to your job than you requested. That will also affect the resource usage statistics.
You can query the ReqTres and AllocTres fields of the sacct output to compare the requested and allocated resources of a job (the %N modifier following the field name sets the width of the output column):
$ sacct -j <JOBID> -o ReqTres%40,AllocTres%40
For example, the following output shows that we requested 1 CPU and 16000 MB of memory, but Slurm allocated 8 CPUs and 16000 MB of memory. This is because the maximum amount of memory that can be requested per CPU core on the Komondor CPU partition is 2000 MB, so Slurm had to allocate 8 CPUs instead of one to provide 16000 MB of memory for the job.
$ sacct -j 6292355 -o JobID,ReqTres%45,AllocTres%45
JobID                                              ReqTRES                                     AllocTRES
------------ --------------------------------------------- ---------------------------------------------
6292355                  billing=1,cpu=1,mem=16000M,node=1             billing=8,cpu=8,mem=16000M,node=1
6292355.0                                                                        cpu=8,mem=16000M,node=1
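The allocation in this example follows from a ceiling division: 16000 MB at no more than 2000 MB per CPU needs at least 8 CPUs. The sketch below models only that one rule as described in the text; the function name cpus_for_request is hypothetical, and the scheduler may apply further constraints that are not modelled here.

```shell
#!/usr/bin/env bash
# Sketch: estimate the number of CPUs allocated under a per-CPU memory cap.
# cpus_for_request is a hypothetical helper, not a Slurm command.
# Usage: cpus_for_request <requested CPUs> <requested memory in MB> <max MB per CPU>
cpus_for_request() {
  local cpus=$1 mem_mb=$2 max_mb_per_cpu=$3
  # Ceiling division: CPUs needed to cover the requested memory.
  local needed=$(( (mem_mb + max_mb_per_cpu - 1) / max_mb_per_cpu ))
  # The allocation is the larger of the CPU request and the memory-driven count.
  echo $(( needed > cpus ? needed : cpus ))
}

# The request from the example above: 1 CPU, 16000 MB, 2000 MB per CPU.
cpus_for_request 1 16000 2000   # prints 8
```

Since the extra CPUs are reserved and billed either way, it can be worth requesting them explicitly and using them, or lowering the memory request if the job does not need it.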
GPU usage statistics
You can check the GPU utilization while the job is running. First, you have to interactively “connect” to the resource allocated by the running job:
$ srun --overlap --pty --jobid=<JOBID> bash
where JOBID is the unique ID of the running job.
You can now view the current GPU utilization:
$ nvidia-smi
The previous command shows all GPUs of the given node. At the bottom of the table there is a list of the processes that your job runs on any GPU. If no process is listed, your application is not using the GPU. In the table, you can view the statistics of the GPU(s) you are using.
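For a more compact view, nvidia-smi can print selected values as CSV through its --query-gpu and --format options. The sketch below post-processes such output into one summary line per GPU. The function name summarize_gpus is hypothetical, and the sample values at the end are made up for illustration, not measured on Komondor.

```shell
#!/usr/bin/env bash
# Sketch: summarize CSV output of nvidia-smi, one line per GPU.
# Expects lines of the form: index, utilization.gpu, memory.used, memory.total
# summarize_gpus is a hypothetical helper, not part of nvidia-smi.
summarize_gpus() {
  awk -F', *' '{
    state = ($2 + 0 > 0) ? "busy" : "idle"
    printf "GPU %s: %s%% utilization, %s/%s MiB memory (%s)\n", $1, $2, $3, $4, state
  }'
}

# On the compute node (not run here):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits | summarize_gpus
# Made-up sample values for illustration:
printf '%s\n' '0, 97, 30512, 40960' '1, 0, 3, 40960' | summarize_gpus
```

A single query is only a snapshot, so a GPU that shows 0% at one moment may still be used in other phases of the job; repeat the query a few times before drawing conclusions.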
For further information, see the manual page for nvidia-smi (type man nvidia-smi in the terminal).