AI partition
The AI partition consists of 4 HPE Apollo 6500 Gen10 Plus blades. Each blade contains 1 node, with 8 GPUs per node, so the whole partition contains a total of 32 GPUs.
GPU:
NVIDIA A100 Tensor Core GPU
40 GB VRAM
8 GPUs per node
Processor:
AMD EPYC 7763 64-core processor (2.45 GHz base clock)
Max boost clock: 3.5 GHz
2 CPUs per node
Memory:
16 GB DDR4 3200 MHz DIMMs
16 DIMM modules per socket
256 GB per CPU, 512 GB of memory per node
Network:
HPE Slingshot 200GbE
AI blade and node naming convention in the system:
For example: cn01
c - Chassis
n - Node (01-04)
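The current state of these nodes can be listed with the standard Slurm sinfo command (a quick sketch, assuming the partition is named ai, as in the job examples below):
# List the AI partition nodes and their current state (idle, mixed, allocated, ...)
sinfo --partition=ai --Node --long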
How to use AI nodes
Example of an interactive AI job:
The run_script.sh script file contains the following:
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py
These commands load the singularity module and run the env_test.py script inside the ubuntu_CUDA_ai.sif container. The --nv option sets up the environment to use the NVIDIA GPUs.
The run_script.sh script can be run with the following command:
srun --partition=ai --cpus-per-gpu=32 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh
With this command, we run run_script.sh as an interactive job on 1 AI node with 1 GPU. We also reserve 32 CPU cores for each reserved GPU and 2000 MB of memory for each reserved CPU core.
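The same approach can be used to quickly verify that the GPU is actually visible inside the container. A minimal sketch with a 1-core reservation (the --nv flag makes nvidia-smi available in the container):
module load singularity
# Print the GPUs visible inside the container
srun --partition=ai --cpus-per-gpu=1 --mem-per-cpu=2000 --gres=gpu:1 singularity exec --nv ubuntu_CUDA_ai.sif nvidia-smi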
Hint
Although GPU-heavy jobs usually do not rely much on system memory, it is necessary to reserve enough memory for the container. However, only 2000 MB of memory can be allocated for each CPU core; this limit can only be bypassed by reserving more CPUs.
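For example, if the container needs roughly 128 GB of system memory for a single-GPU job, this can be obtained by reserving 64 cores instead of 32 (a sketch based on the limits above):
# 64 cores x 2000 MB per core = 128000 MB of memory for the job
srun --partition=ai --cpus-per-gpu=64 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh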
How to run an AI batch job
The previous interactive job can also be run as a batch job. In this case, the content of batch_script.sh will be the following:
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=ai
#SBATCH --job-name=jobname
#SBATCH --cpus-per-gpu=32
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:1
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py
This script can be queued with the following command:
sbatch batch_script.sh
According to this batch script, the job runs on 1 AI node with 1 GPU. As in the interactive example, 32 CPU cores are reserved for each reserved GPU and 2000 MB of memory for each reserved CPU core.
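The batch script can be scaled up to a full node. The sketch below assumes all 8 GPUs of one node are requested with 16 cores per GPU, so that the node's 2 x 64 cores are fully used (ACCOUNT, jobname and the Python script are placeholders, as above):
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=ai
#SBATCH --job-name=jobname
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-gpu=16
#SBATCH --mem-per-cpu=2000
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py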
GPU nodes are suitable for:
large jobs capable of utilizing up to 8 GPUs in one node
smaller parallel jobs capable of utilizing up to 32 GPUs across multiple nodes (see the sketch below).
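A minimal sketch of such a multi-node reservation (4 nodes x 8 GPUs = 32 GPUs) is shown below; the account, job name and the launched application are placeholders, and the application itself must be able to distribute work across nodes (e.g. via MPI):
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=ai
#SBATCH --job-name=jobname
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-gpu=16
#SBATCH --mem-per-cpu=2000
# Placeholder: replace with the actual multi-node, GPU-aware launch command
srun ./my_parallel_gpu_application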
Installed software for parallel jobs:
Amber - https://ambermd.org/doc12/Amber22.pdf
GROMACS - https://manual.gromacs.org/current/index.html
Q-chem - https://manual.q-chem.com/latest/
TeraChem - http://www.petachem.com/doc/userguide.pdf
NAMD - https://www.ks.uiuc.edu/Research/namd/3.0/ug/
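The exact module names and versions of these packages on the system can be checked with the same module command used in the job scripts above, for example:
# List every available module; append a name to narrow the search,
# e.g. "module avail gromacs" (the exact module name may differ)
module avail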
Software suitable for single-node jobs, currently available in the container environment:
Tensorflow - https://www.tensorflow.org/guide
PyTorch - https://pytorch.org/docs/stable/index.html
Our containers are available on Komondor at:
/opt/software/packages/containers/
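To see which images are currently provided, the directory can simply be listed, and an image can be used via its full path (a sketch, assuming ubuntu_CUDA_ai.sif from the examples above is one of the provided images):
# List the centrally provided container images
ls /opt/software/packages/containers/
# Use an image directly through its full path
singularity exec --nv /opt/software/packages/containers/ubuntu_CUDA_ai.sif python env_test.py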
Further information about the hardware: