AI partition

The AI partition consists of four HPE Apollo 6500 Gen10 Plus blades. Each blade contains one node with 8 GPUs, so the whole partition contains a total of 32 GPUs.

GPU:

  • NVIDIA A100 Tensor Core GPU

  • 40 GB VRAM

8 GPUs per node

Processor:

  • AMD EPYC 7763 64-Core Processor (2.45 GHz base clock)

  • Max boost clock: 3.5 GHz

2 CPUs per node

Memory:

  • 16 GB DDR4 3200 MHz modules

  • 16 DIMM modules per socket

256 GB per CPU, 512 GB of memory per node.

Network:

  • HPE Slingshot 200GbE

AI blade and node naming convention in the system:

For example: cn01

  • c - Chassis

  • n - Node (01-04)
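
The nodes of the partition and their current state can be listed with a standard Slurm command (a minimal example; the exact node list and states shown depend on the current configuration):

sinfo --partition=ai --Node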

How to use AI nodes

Example of an AI interactive job:

The run_script.sh script file contains the following:

module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py
  • the first command loads the singularity module,

  • the --nv option sets up the environment to use the NVIDIA GPUs,

  • the second command runs the env_test.py script in the ubuntu_CUDA_ai.sif container.
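
As an optional quick check (a sketch using the same container as above), the GPUs made visible by the --nv option can also be listed with nvidia-smi inside the container:

module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif nvidia-smi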

The run_script.sh script can be run with the following command:

srun --partition=ai --cpus-per-gpu=32 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh

With this command, we run our run_script.sh job on 1 AI node with 1 GPU. We also reserve 32 CPU cores for each reserved GPU and 2000 MB of memory for each reserved CPU core for our interactive job.
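
If a job can make use of a whole node, all 8 GPUs can be requested in a single allocation. Since the node has 128 CPU cores in total, at most 16 cores can be reserved per GPU in this case (a sketch reusing run_script.sh from above):

srun --partition=ai --cpus-per-gpu=16 --mem-per-cpu=2000 --gres=gpu:8 bash run_script.sh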

Hint

Although GPU-heavy jobs usually do not rely heavily on system memory, it is still necessary to reserve enough memory for the container. However, only 2000 MB of memory can be allocated per CPU core; this limit can only be bypassed by reserving more CPUs.
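
For example, if the container needs roughly 128 GB of system memory for a single-GPU job, it can be obtained by reserving 64 CPU cores (64 x 2000 MB = 128000 MB); a sketch based on the limit described above:

srun --partition=ai --cpus-per-gpu=64 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh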

How to run an AI batch job:

The previous interactive job can also be run as a batch job. In this case, the content of batch_script.sh will be the following:

#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=ai
#SBATCH --job-name=jobname
#SBATCH --cpus-per-gpu=32
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:1
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py

This script can be queued with the following command:

sbatch batch_script.sh

According to our batch script, we will run our job on 1 AI node with 1 GPU. We also reserve 32 CPU cores for each reserved GPU and 2000 MB of memory for each reserved CPU core, just as in the interactive job.
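
After submission, the job can be followed with the usual Slurm commands; by default, sbatch writes the job output to a slurm-<jobid>.out file in the submission directory, where <jobid> is the ID printed by sbatch:

squeue -u $USER
cat slurm-<jobid>.out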


GPU nodes are suitable for:

  • large jobs capable of utilizing up to 8 GPUs in one node

  • smaller parallel jobs capable of utilizing up to 32 GPUs across multiple nodes (see the sketch below).
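
A multi-node batch job spanning all 4 nodes and 32 GPUs might use a header similar to the following (a hedged sketch only; the actual launch line depends on the application, and application_command is just a placeholder):

#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=ai
#SBATCH --job-name=jobname
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:8
srun application_command    # placeholder, replace with the real application command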

Installed software for parallel jobs:

Amber - https://ambermd.org/doc12/Amber22.pdf

GROMACS - https://manual.gromacs.org/current/index.html

Q-chem - https://manual.q-chem.com/latest/

TeraChem - http://www.petachem.com/doc/userguide.pdf

NAMD - https://www.ks.uiuc.edu/Research/namd/3.0/ug/

Software suitable for single-node jobs, currently available in the container environment:

Tensorflow - https://www.tensorflow.org/guide

PyTorch - https://pytorch.org/docs/stable/index.html

Our containers are available on Komondor at:

/opt/software/packages/containers/
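
The images in this directory can be listed directly, and a framework such as PyTorch can be tried with a command similar to the interactive example above (the container file name below is only a placeholder; use one that actually exists in the directory):

ls /opt/software/packages/containers/
module load singularity
singularity exec --nv /opt/software/packages/containers/<pytorch_container>.sif python -c "import torch; print(torch.cuda.is_available())"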

Further information about the hardware:

Cray Exascale Supercomputer

HPE Cray EX Liquid-Cooled Cabinet

AMD CPU

NVIDIA A100