Checking Job Efficiency
When Slurm grants the resources (CPU, memory, GPU) that you requested for a job, it reserves (and bills for) them while the job is running, whether or not the job actually uses them. To avoid wasting resources, check the resource utilization efficiency of your completed jobs and adjust your requests if the efficiency is low. Reviewing resource usage can also help to identify bottlenecks in the job.
CPU and memory efficiency
The easiest way to check the CPU and memory utilization efficiency of a job is to use the seff command:
$ seff <JOBID>
where JOBID is the unique ID of the job (or job step), or a comma-separated list of such IDs.
An example:
$ seff 1234567
Job ID: 1234567
Cluster: komondor
User/Group: alice/alice
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 01:08:43
CPU Efficiency: 88.10% of 01:18:00 core-walltime
Job Wall-clock time: 00:09:45
Memory Utilized: 2.74 GB
Memory Efficiency: 68.56% of 4.00 GB
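The CPU efficiency in this report is the CPU time used divided by the core-walltime, that is, the number of allocated cores multiplied by the wall-clock time. The following Python sketch reproduces the figure from the example above. The helper names to_seconds and cpu_efficiency are illustrative and not part of any Slurm tool, and the sketch assumes time strings in the [DD-]HH:MM:SS form shown by seff.

```python
def to_seconds(timestr: str) -> int:
    """Convert a time string of the form [DD-]HH:MM:SS to seconds."""
    days = 0
    if "-" in timestr:
        day_part, timestr = timestr.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in timestr.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_utilized: str, wall_clock: str, cores: int) -> float:
    """Return CPU efficiency in percent: CPU time / (cores * wall time)."""
    core_walltime = cores * to_seconds(wall_clock)
    return 100.0 * to_seconds(cpu_utilized) / core_walltime


# Values taken from the seff example above:
# 4123 s of CPU time over 8 cores * 585 s = 4680 s of core-walltime.
print(f"{cpu_efficiency('01:08:43', '00:09:45', 8):.2f}%")  # 88.10%
```

A value well below 100% suggests that the job could not keep all of its allocated cores busy, so requesting fewer cores may be appropriate.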
A more detailed and highly customizable report can be generated using the sacct command:
$ sacct -j <JOBID> -o <FIELD1,FIELD2,...>
where FIELD1,FIELD2,… is a comma-separated list of the requested fields.
For example, here is how you can get statistics similar to the output of the seff command using sacct:
$ sacct -j 1234567 -o JobID,Cluster,User,Group,State,ExitCode,NNodes,NCPUs,TotalCPU,Elapsed,MaxRSS,ReqMem
       JobID    Cluster      User     Group      State ExitCode   NNodes      NCPUS   TotalCPU    Elapsed     MaxRSS     ReqMem
------------ ---------- --------- --------- ---------- -------- -------- ---------- ---------- ---------- ---------- ----------
3115439        komondor   hpctefo   hpctefo  COMPLETED      0:0        1          8   01:08:43   00:09:45                    4G
3115439.bat+   komondor                      COMPLETED      0:0        1          8   01:08:43   00:09:45   2875752K
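The memory efficiency reported by seff can be recovered from the MaxRSS and ReqMem fields. The Python sketch below does this for the values above. It assumes that the K/M/G/T suffixes are powers of 1024 and that ReqMem carries no per-node or per-CPU suffix; mem_to_bytes and memory_efficiency are hypothetical helper names. For multi-task jobs MaxRSS is the peak of a single task, so treat the result as an estimate.

```python
# Binary multipliers. Assumption: Slurm's K/M/G/T suffixes are powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def mem_to_bytes(value: str) -> int:
    """Convert a memory string such as '4G' or '2875752K' to bytes."""
    value = value.strip()
    if value and value[-1].upper() in UNITS:
        return int(float(value[:-1]) * UNITS[value[-1].upper()])
    return int(value)  # no suffix: interpret the number as plain bytes


def memory_efficiency(max_rss: str, req_mem: str) -> float:
    """Return peak memory use as a percentage of the requested memory."""
    return 100.0 * mem_to_bytes(max_rss) / mem_to_bytes(req_mem)


# MaxRSS of the batch step and ReqMem of the job from the sacct output above.
print(f"{memory_efficiency('2875752K', '4G'):.2f}%")  # 68.56%
```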
Use the --helpformat option for a list of available fields:
$ sacct --helpformat
For detailed information, see the man page for sacct (type man sacct in the terminal).
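If you want to post-process sacct results in a script, the --parsable2 option prints pipe-delimited output that is easier to parse than the aligned table. The following Python sketch shows one possible approach; parse_parsable2 and sacct_records are hypothetical helper names, and the sample text is made up for illustration rather than captured from a real run.

```python
import subprocess


def parse_parsable2(text: str) -> list[dict[str, str]]:
    """Parse pipe-delimited text (header line first) into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def sacct_records(job_id: str, fields: list[str]) -> list[dict[str, str]]:
    """Run sacct for one job and return one dict per job or job step."""
    cmd = ["sacct", "-j", job_id, "-o", ",".join(fields), "--parsable2"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_parsable2(result.stdout)


# Hypothetical sample of pipe-delimited output, for illustration only.
sample = (
    "JobID|State|Elapsed|MaxRSS\n"
    "3115439|COMPLETED|00:09:45|\n"
    "3115439.batch|COMPLETED|00:09:45|2875752K\n"
)
for record in parse_parsable2(sample):
    # Empty fields (such as MaxRSS on the job line) are shown as "-".
    print(record["JobID"], record["MaxRSS"] or "-")
```

On a login node you would call, for example, sacct_records("1234567", ["JobID", "State", "Elapsed", "MaxRSS"]) instead of parsing the sample text.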
If the job is still running, some fields (e.g. those regarding memory usage) are not available in the sacct output, and some information in the seff output can be incomplete or misleading.
Statistics of interrupted jobs (e.g. jobs that were cancelled with the “scancel” command or terminated for exceeding the time limit) may not be accurate, because the total CPU time calculated for an interrupted job may not include the CPU time of its descendant processes.
Resource usage of running jobs
You can check the state of a running job (including its memory usage) with the sstat command:
$ sstat -j <JOB(.STEP)> -o <FIELD1,FIELD2>
where JOB(.STEP) is the unique ID of a job or job step (or a comma-separated list of IDs) and FIELD1,FIELD2 is a comma-separated list of the requested fields.
To display information for all steps of a job, use the --allsteps option:
$ sstat -j <JOB> -o <FIELD1,FIELD2> --allsteps
You can list the available fields using the --helpformat option:
$ sstat --helpformat
In the following example, we list our currently running jobs (using squeue and sacct) and then query some statistics about one of the jobs with sstat:
$ squeue
6342410 cpu bash alice R 0:08 1 x1000c0s1b1n1
6342406 cpu testjob alice R 0:47 1 x1000c1s5b1n0
$ squeue --steps
6342406.batch batch cpu alice 0:47 x1000c1s5b1n0
6342410.0 bash cpu alice 0:08 x1000c0s1b1n1
$ sacct -o JobID%15,Partition,Account,State -j 6342406
          JobID  Partition    Account      State
--------------- ---------- ---------- ----------
6342406                cpu   research    RUNNING
6342406.batch                research    RUNNING
$ sstat -o JobID,MaxRSS,MaxVMSize,AveCPU,MaxDiskRead,MaxDiskWrite --allsteps -j 6342406
       JobID     MaxRSS  MaxVMSize     AveCPU  MaxDiskRead MaxDiskWrite
------------ ---------- ---------- ---------- ------------ ------------
6342406.bat+   2850032K   8558440K   00:03:23   2020065913      9356795
For detailed information, see the man page for sstat (type man sstat in the terminal).
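The MaxDiskRead and MaxDiskWrite fields are reported as raw byte counts, which are hard to read at a glance. As a small convenience, the Python sketch below converts such values to binary units; human_bytes is a hypothetical helper, not an sstat feature.

```python
def human_bytes(num_bytes: float) -> str:
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit in the list.
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024


# Values from the sstat example above.
print("MaxDiskRead: ", human_bytes(2020065913))      # 1.88 GiB
print("MaxDiskWrite:", human_bytes(9356795))         # 8.92 MiB
print("MaxRSS:      ", human_bytes(2850032 * 1024))  # 2.72 GiB
```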
Requested vs allocated resources
In some cases, due to resource allocation constraints, Slurm will allocate more resources to your job than you requested. That will also affect the resource usage statistics.
You can query the ReqTres and AllocTres info fields of the “sacct” output to check the requested and allocated resources for a job (the %N modifier following the field name sets the width of the output):
$ sacct -j <JOBID> -o ReqTres%40,AllocTres%40
For example, the following output shows that we requested 1 CPU and 16000 MB of memory, but Slurm allocated 8 CPUs and 16000 MB of memory. This is because the maximum amount of memory that can be requested per CPU core on the Komondor CPU partition is 2000 MB, so Slurm had to allocate 8 CPUs instead of one to provide 16000 MB of memory for the job.
$ sacct -j 1234567 -o JobID,ReqTres%45,AllocTres%45
------------ ------------------------------------------------------- ---------------------------------------------
6292355 billing=1,cpu=1,mem=16000M,node=1 billing=8,cpu=8,mem=16000M,node=1
6292355.0 cpu=8,mem=16000M,node=1
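The arithmetic behind this allocation, and a way to spot such differences in a script, can be sketched in Python as follows. The functions parse_tres and cpus_for_memory are illustrative helpers, and the rule implemented in cpus_for_memory is only a simplified model of the per-core memory limit described above, not Slurm's actual scheduling logic.

```python
import math


def parse_tres(tres: str) -> dict[str, str]:
    """Split a TRES string such as 'cpu=1,mem=16000M' into a dict."""
    return dict(item.split("=", 1) for item in tres.split(",") if item)


def cpus_for_memory(req_cpus: int, req_mem_mb: int, max_mem_per_cpu_mb: int) -> int:
    """CPUs needed so that the memory per CPU stays within the limit."""
    return max(req_cpus, math.ceil(req_mem_mb / max_mem_per_cpu_mb))


# ReqTres and AllocTres values from the sacct output above.
requested = parse_tres("billing=1,cpu=1,mem=16000M,node=1")
allocated = parse_tres("billing=8,cpu=8,mem=16000M,node=1")

# Report every resource whose allocated amount differs from the request.
for name in requested:
    if requested[name] != allocated.get(name):
        print(f"{name}: requested {requested[name]}, allocated {allocated.get(name)}")

# 16000 MB at no more than 2000 MB per CPU requires 8 CPUs.
print(cpus_for_memory(1, 16000, 2000))  # 8
```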
GPU usage statistics
You can check the GPU utilization while the job is running. First, you have to interactively “connect” to the resource allocated by the running job:
$ srun --overlap --pty --jobid=<JOBID> bash
where JOBID is the unique ID of the running job.
You can now view the current GPU utilization:
$ nvidia-smi
The previous command will show all GPUs of the given node. At the bottom of the table there will be a list of processes that your job runs on any GPU. If no process is listed, your application is not using the GPU. In the table, you can view the statistics of the GPU(s) you are using.
For further information, see the man page for nvidia-smi (type man nvidia-smi in the terminal).
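To record GPU utilization from a script instead of reading the table by eye, nvidia-smi can print selected metrics as CSV through its --query-gpu and --format options. The Python sketch below parses that output; parse_gpu_csv and query_gpus are hypothetical helper names, the sample text is invented for illustration, and the parser assumes that every queried value is numeric (it does not handle entries such as [N/A]).

```python
import csv
import io
import subprocess

# Metrics to query; the column order must match the unpacking in parse_gpu_csv.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text: str) -> list[dict[str, int]]:
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, util, used, total = (int(value) for value in row)
        gpus.append({"index": index, "util_pct": util,
                     "mem_used_mib": used, "mem_total_mib": total})
    return gpus


def query_gpus() -> list[dict[str, int]]:
    """Run nvidia-smi on the current node and return the parsed metrics."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(output)


# Hypothetical sample output for a node with two GPUs, for illustration only.
sample = "0, 87, 30514, 40960\n1, 0, 3, 40960\n"
for gpu in parse_gpu_csv(sample):
    print(f"GPU {gpu['index']}: {gpu['util_pct']}% busy, "
          f"{gpu['mem_used_mib']} of {gpu['mem_total_mib']} MiB used")
```

Inside the interactive shell started with srun, call query_gpus() to obtain the live values.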
Job monitoring with jobstats
In addition to the built-in Slurm tools, you can also monitor resource usage using the jobstats service developed by Princeton University.
Jobstats provides both a web interface and command-line tools to inspect running and completed jobs in a more intuitive and visual way.
The Komondor cluster integrates the jobstats monitoring service. You can access the web interface at:
This page provides real-time and historical visualizations of CPU, memory and GPU usage per job. You can browse through all your jobs, inspect individual job metrics in detail, and analyze usage over time.
The monitoring data is automatically collected and available for all users.
Example view:
- Real-time CPU usage
- Memory footprint throughout the job duration
- GPU utilization, if available
To inspect a job from the command line:
$ jobstats <JOBID>
Aggregated job efficiency with reportseff
For a command-line summary of job efficiency over a range of jobs, the reportseff tool is recommended.
This utility gives you an aggregated view of the efficiency metrics similar to seff, but works over many jobs and produces structured output ideal for usage analysis.
You can use it as follows:
$ reportseff <JOBID> [<JOBID2> ...]
or run it for all your jobs within a time range (e.g. the last 4 hours):
$ reportseff --since h=4
This will generate a table or CSV-style output showing:
- CPU and memory efficiency
- GPU usage (if applicable)
- Wall time and total core time
- Exit status and job state
For more information, refer to the official documentation:
The reportseff tool uses data collected by jobstats, so it only works for jobs submitted while the jobstats service was active on the cluster.