GPU partition

The GPU partition consists of 29 HPE Cray EX235n compute blades, each blade holding 2 nodes with 4 GPUs per node. The GPU partition therefore contains a total of 232 GPUs.
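
The current state of the partition can be checked from the command line with Slurm's sinfo; a minimal sketch, assuming only that the partition is named gpu, as in the job examples below:

sinfo --partition=gpu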

Processor:

  • AMD EPYC 7763 64-Core Processor (2.45GHz)

  • Max. Boost Clock: Up to 3.5GHz

1 CPU per node

2 CPUs per blade

Memory:

  • DDR4 3200MHz 16GB

  • 8 DIMM modules per socket

128 GB RAM per socket/node

256 GB memory per blade

GPU:

  • NVIDIA A100 TENSOR CORE GPU

  • VRAM 40 GB

4 GPUs per node

8 GPUs per blade

Network:

  • HPE Slingshot 200GbE

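These figures can be verified on a node itself; a minimal sketch that runs nvidia-smi and lscpu on a single GPU node (it assumes your default project account may be used for short test runs):

srun --partition=gpu --gres=gpu:1 --cpus-per-gpu=1 --mem-per-cpu=2000 nvidia-smi
srun --partition=gpu --gres=gpu:1 --cpus-per-gpu=1 --mem-per-cpu=2000 lscpu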

GPU blade and node naming convention in the system:

GPU nodes are located in the x1001 cabinet.

For example: x1001c0s0b0n0

  • c - Chassis (0-7)

  • s - Slot (0-7)

  • b - Board (Blade) (0-1)

  • n - Node (0-1)
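
The node names following this convention can also be listed directly from Slurm, together with the GPUs they offer; a sketch, using the example node name above:

sinfo --partition=gpu -N -o "%N %G"
scontrol show node x1001c0s0b0n0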

How to use GPU nodes

Example of a GPU interactive job:

The run_script.sh script file contains the following:

module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py
  • These commands load the singularity module,

  • set up the environment to use the NVIDIA GPUs with the --nv option,

  • and run the env_test.py script inside the ubuntu_CUDA_ai.sif container (a possible env_test.py is sketched after this list).
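
The content of env_test.py is not part of this guide; a minimal, hypothetical version that only checks GPU visibility (it assumes the container provides PyTorch) could be created like this:

cat > env_test.py << 'EOF'
# Hypothetical GPU visibility check; assumes PyTorch is installed in the container
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
EOF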

The run_script.sh script can be run with the following command:

srun --partition=gpu --cpus-per-gpu=32 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh

With this command, we run our run_script.sh job on 1 GPU node with 1 GPU. We also reserve 32 CPU cores for each reserved GPU and 2000 MB of memory for each reserved CPU core for our interactive job.

Hint

Although GPU-heavy jobs usually do not rely much on system memory, it is necessary to reserve enough memory for the container. However, at most 2000 MB of memory can be allocated per CPU core, so the only way to get more memory is to reserve more CPU cores. For example, reserving 32 cores at 2000 MB each gives 64 GB of system memory for the job.
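
A sketch of such a larger reservation, still for a single GPU (48 cores × 2000 MB ≈ 96 GB of memory):

srun --partition=gpu --cpus-per-gpu=48 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh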

How to run a GPU batch job:

The previous interactive job can also be run as a batch job. In this case, the content of the batch_script.sh file will be the following:

#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=gpu
#SBATCH --job-name=jobname
#SBATCH --cpus-per-gpu=32
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:1
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py

This script can be queued with the following command:

sbatch batch_script.sh

According to our batch script, we will run our job on 1 GPU node with 1 GPU, and reserve 32 CPU cores for each reserved GPU and 2000 MB of memory for each reserved CPU core, just as in the interactive job.
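
After submission, the job can be followed with the usual Slurm commands; a sketch, where JOBID is a placeholder for the ID printed by sbatch:

squeue -u $USER
sacct -j JOBID --format=JobID,JobName,State,Elapsed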


The GPU nodes are suitable for:

  • small jobs capable of utilizing only 1-4 GPUs in one node

  • massively parallel jobs utilizing more than 32 GPUs across multiple nodes (see the batch header sketch after this list).
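
A hedged sketch of a batch header for such a multi-node job; ACCOUNT, the job name and the application started by srun are placeholders, and 8 nodes × 4 GPUs give 32 GPUs in total:

#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=gpu
#SBATCH --job-name=multinode_job
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=2000
srun my_parallel_application   # placeholder for an MPI-enabled, GPU-aware application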

Installed single-node software:

Alphafold - https://github.com/google-deepmind/alphafold

Installed software for parallel jobs:

Amber - https://ambermd.org/doc12/Amber22.pdf

GROMACS - https://manual.gromacs.org/current/index.html

Q-Chem - https://manual.q-chem.com/latest/

TeraChem - http://www.petachem.com/doc/userguide.pdf

NAMD - https://www.ks.uiuc.edu/Research/namd/3.0/ug/
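
The exact module names for these packages may differ from the names above; a sketch of looking them up with the same module system used earlier (the gromacs search string is only a guess):

module avail
module avail gromacs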

Software suitable for single-node jobs, currently available in the container environment:

TensorFlow - https://www.tensorflow.org/guide

PyTorch - https://pytorch.org/docs/stable/index.html

Our containers are available on Komondor at:

/opt/software/packages/containers/
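
To see which images are available and to try one of them on a GPU node, a sketch (the tensorflow.sif file name is only an assumed example; use a name listed by ls):

ls /opt/software/packages/containers/
module load singularity
srun --partition=gpu --gres=gpu:1 --cpus-per-gpu=4 --mem-per-cpu=2000 singularity exec --nv /opt/software/packages/containers/tensorflow.sif python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"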

Further information about the hardware:

Cray Exascale Supercomputer

HPE Cray EX Liquid-Cooled Cabinet

AMD CPU

NVIDIA A100