GPU partition
The GPU partition consists of 29 HPE Cray EX235n Compute Blades, each blade holding 2 nodes with 4 GPUs per node. The GPU partition contains a total of 232 GPUs.
Processor:
AMD EPYC 7763 64-Core Processor (2.45GHz)
Max. Boost Clock: Up to 3.5GHz
1 CPU per node
2 CPUs per blade
Memory:
DDR4 3200MHz 16GB
8 DIMM modules per socket
128 GB RAM per socket/node
256 GB memory per blade
GPU:
NVIDIA A100 Tensor Core GPU
40 GB VRAM
4 GPUs per node
8 GPUs per blade
Network:
HPE Slingshot 200GbE
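The resources Slurm reports for these nodes can be checked with sinfo (a sketch assuming the partition is named gpu, as in the srun examples below; the format specifiers print the node name, CPU count, memory and GRES/GPU configuration):
# List GPU partition nodes with CPU count, memory (MB) and GRES (GPU) configuration
sinfo --partition=gpu -N -o "%N %c %m %G"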
GPU blade and node naming convention in the system:
GPU nodes are located in the x1001 cabinet.
For example: x1001c0s0b0n0
c - Chassis (0-7)
s - Slot (0-7)
b - Board (Blade) (0-1)
n - Node (0-1)
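As an illustration of the naming convention (a sketch assuming standard Slurm tooling; the node name is the example given above), the configuration of a single GPU node can be inspected with scontrol:
# Show the CPU, memory and GPU (Gres) configuration of one GPU node
scontrol show node x1001c0s0b0n0 | grep -E 'NodeName|CPUTot|RealMemory|Gres'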
How to use GPU nodes
Example of a GPU interactive job:
The run_script.sh script file contains the following:
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py
These commands load the singularity module and run the env_test.py script in the ubuntu_CUDA_ai.sif container; the --nv option sets up the environment to use the NVIDIA GPUs.
The run_script.sh script can be run with the following command:
srun --partition=gpu --cpus-per-gpu=32 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh
With this command, we run our run_script.sh job interactively on 1 GPU node, using 1 GPU. We also reserve 32 CPU cores for each reserved GPU and 2000 MB of memory for each reserved CPU core.
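Before running the actual workload, it can be useful to verify that a GPU is visible inside the container. A minimal check, reusing the same srun options and container as above together with the standard nvidia-smi utility:
# Request 1 GPU interactively and list the GPUs visible inside the container
srun --partition=gpu --cpus-per-gpu=32 --mem-per-cpu=2000 --gres=gpu:1 \
    singularity exec --nv ubuntu_CUDA_ai.sif nvidia-smi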
Hint
Although GPU-heavy jobs usually do not rely heavily on system memory, it is necessary to reserve enough memory for the container. However, at most 2000 MB of memory can be allocated per CPU core; this limit can only be bypassed by reserving more CPUs.
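As a worked example of the hint: with --cpus-per-gpu=32 and --mem-per-cpu=2000, a single-GPU job gets 32 x 2000 MB = 64000 MB (about 64 GB) of system memory. If more is needed, the CPU count has to be raised, for example (an illustrative variant of the earlier command):
# 48 CPU cores per GPU -> 48 x 2000 MB = 96000 MB of system memory for the job
srun --partition=gpu --cpus-per-gpu=48 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh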
How to run a GPU batch job:
The previous interactive job can also be run as a batch job. In this case, the content of batch_script.sh will be the following:
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=gpu
#SBATCH --job-name=jobname
#SBATCH --cpus-per-gpu=32
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:1
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py
This script can be queued with the following command:
sbatch batch_script.sh
According to our batch script, we will run our job on 1 GPU node, using 1 GPU. We also reserve 32 CPU cores for each reserved GPU and 2000 MB of memory for each reserved CPU core, just as in the interactive job.
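After submission, the job can be followed with the usual Slurm commands (JOBID is a placeholder for the job ID printed by sbatch):
# List our queued and running jobs
squeue -u $USER
# Show the state and accounting information of a job after it has finished
sacct -j JOBID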
The GPU nodes are suitable for:
small jobs capable of utilizing only 1-4 GPUs in one node,
massively parallel jobs utilizing more than 32 GPUs across multiple nodes (a multi-node batch sketch follows below).
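For the multi-node case, a minimal batch sketch could look like the following. The node and task counts are illustrative only (8 nodes x 4 GPUs = 32 GPUs), and my_parallel_app is a hypothetical placeholder for an MPI-enabled application such as the parallel packages listed further below:
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=gpu
#SBATCH --job-name=multinode-jobname
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:4

# 8 nodes x 4 GPUs per node = 32 GPUs in total, one MPI task per GPU.
# my_parallel_app is a placeholder for the actual MPI-enabled executable.
srun my_parallel_app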
Installed single-node software:
Alphafold - https://github.com/google-deepmind/alphafold
Installed software for parallel jobs:
Amber - https://ambermd.org/doc12/Amber22.pdf
GROMACS - https://manual.gromacs.org/current/index.html
Q-chem - https://manual.q-chem.com/latest/
TeraChem - http://www.petachem.com/doc/userguide.pdf
NAMD - https://www.ks.uiuc.edu/Research/namd/3.0/ug/
Software suitable for single-node jobs, currently available in the container environment:
Tensorflow - https://www.tensorflow.org/guide
PyTorch - https://pytorch.org/docs/stable/index.html
Our containers are available on Komondor at:
/opt/software/packages/containers/
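The available container images can be listed directly from this directory:
# List the container images provided on Komondor
ls /opt/software/packages/containers/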
Further information about the hardware: