Checking Job Efficiency
When Slurm grants the resources (CPU, memory, GPU) that you requested for a job, it reserves (and bills for) those resources while the job is running, regardless of whether the job actually uses them. Therefore, to avoid wasting resources, it is important to check the resource utilization efficiency of completed jobs and make improvements if the efficiency is low. Reviewing resource usage can also help you identify bottlenecks in the job.
CPU and memory efficiency
The easiest way to check the CPU and memory utilization efficiency of a job is to use the seff script:
$ seff <JOBID>
where JOBID is the unique ID of the job(step) or a comma-separated list of job(step) IDs.
An example:
$ seff 1234567
Job ID: 1234567
Cluster: komondor
User/Group: alice/alice
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 01:08:43
CPU Efficiency: 88.10% of 01:18:00 core-walltime
Job Wall-clock time: 00:09:45
Memory Utilized: 2.74 GB
Memory Efficiency: 68.56% of 4.00 GB
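In the example above, the job was allocated 8 cores for a wall-clock time of 00:09:45, which corresponds to 8 × 00:09:45 = 01:18:00 of core-walltime; the reported CPU efficiency is the CPU time actually used divided by this value, i.e. 01:08:43 / 01:18:00 ≈ 88%. Likewise, the memory efficiency is the peak memory used relative to the requested memory, 2.74 GB / 4.00 GB ≈ 69%.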
A more detailed and highly customizable report can be generated using the sacct command:
$ sacct -j <JOBID> -o <FIELD1,FIELD2,...>
where FIELD1,FIELD2,… is a comma-separated list of the requested fields.
For example, here is how you can get statistics somewhat similar to the output of the “seff” command using sacct:
$ sacct -j 1234567 -o JobID,Cluster,User,Group,State,ExitCode,NNodes,NCPUs,TotalCPU,Elapsed,MaxRSS,ReqMem
JobID Cluster User Group State ExitCode NNodes NCPUS TotalCPU Elapsed MaxRSS ReqMem
------------ ---------- --------- --------- ---------- -------- -------- ---------- ---------- ---------- ---------- ----------
1234567 komondor alice alice COMPLETED 0:0 1 8 01:08:43 00:09:45 4G
1234567.bat+ komondor COMPLETED 0:0 1 8 01:08:43 00:09:45 2875752K
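Note that MaxRSS is reported here for the batch step with a K suffix (2875752K is about 2.74 GB, matching the seff output above). If you prefer the memory-related fields in other units, sacct also provides a --units option; a minimal sketch:
$ sacct -j <JOBID> -o JobID,MaxRSS,ReqMem --units=G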
Use the --helpformat option for a list of available fields:
$ sacct --helpformat
For detailed information see the manpage for sacct (type man sacct in the terminal).
Note
If the job is still running, some fields of the sacct output (e.g. those regarding memory usage) are not available yet, and some information in the seff output can be incomplete and misleading.
Note
Statistics of interrupted jobs (e.g. jobs that have been cancelled with the “scancel” command or terminated for exceeding the time limit) may not be accurate, because the total CPU time calculated for interrupted jobs may not include that of their descendant processes.
Resource usage of running jobs
You can check the state of a running job (including its memory usage) with the sstat command:
$ sstat -j <JOB(.STEP)> -o <FIELD1,FIELD2>
where JOB(.STEP) is the unique ID of a job or job step (or a comma-separated list of IDs) and FIELD1,FIELD2 is a comma-separated list of the requested fields.
To display information for all steps of a job, use the --allsteps option:
$ sstat -j <JOB> -o <FIELD1,FIELD2> --allsteps
You can list the available fields using the --helpformat option:
$ sstat --helpformat
In the following example, we list our currently running jobs (using the squeue and sacct commands), and then query some statistics about one of the jobs with sstat:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6342410 cpu bash alice R 0:08 1 x1000c0s1b1n1
6342406 cpu testjob alice R 0:47 1 x1000c1s5b1n0
$ squeue --steps
STEPID NAME PARTITION USER TIME NODELIST
6342406.batch batch cpu alice 0:47 x1000c1s5b1n0
6342410.0 bash cpu alice 0:08 x1000c0s1b1n1
$ sacct -o JobID%15,Partition,Account,State -j 6342406
JobID Partition Account State
--------------- ---------- ---------- ----------
6342406 cpu research RUNNING
6342406.batch research RUNNING
$ sstat -o JobID,MaxRSS,MaxVMSize,AveCPU,MaxDiskRead,MaxDiskWrite --allsteps -j 6342406
JobID MaxRSS MaxVMSize AveCPU MaxDiskRead MaxDiskWrite
------------ ---------- ---------- ---------- ------------ ------------
6342406.bat+ 2850032K 8558440K 00:03:23 2020065913 9356795
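If you want to follow how these values change while the job runs, one option is to re-run the query periodically, for example with the standard watch utility (a minimal sketch; the 30-second interval and the field list are only examples):
$ watch -n 30 sstat -o JobID,MaxRSS,AveCPU --allsteps -j <JOB>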
For detailed information see the manpage for sstat (type man sstat in the terminal).
Requested vs allocated resources
In some cases, due to resource allocation constraints, Slurm will allocate more resources to your job than you requested. That will also affect the resource usage statistics.
You can query the ReqTres and AllocTres fields of the “sacct” output to check the requested and allocated resources of a job (a %N modifier following a field name sets the width of that output column):
$ sacct -j <JOBID> -o ReqTres%40,AllocTres%40
For example, the following output shows that we requested 1 CPU and 16000 MB of memory, but Slurm allocated 8 CPUs and 16000 MB of memory. This is because the maximum amount of memory that can be requested per CPU core on the Komondor CPU partition is 2000 MB, so Slurm had to allocate 16000 MB / 2000 MB = 8 CPUs instead of one to provide the 16000 MB of memory for the job.
$ sacct -j 6292355 -o JobID,ReqTres%45,AllocTres%45
JobID ReqTRES AllocTRES
------------ ------------------------------------------------------- ---------------------------------------------
6292355 billing=1,cpu=1,mem=16000M,node=1 billing=8,cpu=8,mem=16000M,node=1
6292355.0 cpu=8,mem=16000M,node=1
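To review the requested and allocated resources of several recent jobs at once, you can, for example, combine the same fields with the --starttime option of sacct (a sketch; replace the date placeholder with the start of the period you are interested in):
$ sacct -S <YYYY-MM-DD> -o JobID,ReqTres%45,AllocTres%45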
GPU usage statistics
You can check the GPU utilization while the job is running. First, you have to interactively “connect” to the resources allocated to the running job:
$ srun --overlap --pty --jobid=<JOBID> bash
where JOBID is the unique ID of the running job.
You can now view the current GPU utilization:
$ nvidia-smi
The previous command shows all GPUs of the given node. At the bottom of the table there is a list of the processes that your job is running on any of the GPUs. If no process is listed, your application is not using the GPU. In the table, you can view the statistics of the GPU(s) you are using.
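If you want to monitor GPU utilization continuously rather than take a single snapshot, nvidia-smi can also refresh its output at a fixed interval (a minimal sketch; the 5-second interval is only an example):
$ nvidia-smi -l 5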
For further information see the manpage for nvidia-smi (type man nvidia-smi in the terminal).