Pytorch
Pytorch is a popular tensor library for deep learning workloads.
Licence
Pytorch is distributed under a BSD-style license: https://github.com/pytorch/pytorch/blob/main/LICENSE
Preinstalled Pytorch environments
JupyterHub, with the GPU AI Lab or Pytorch environment: https://jupyter.hpc.kifu.hu/ (not suitable for multi-node jobs).
Preinstalled Pytorch containers are available through ssh. For example: /opt/software/packages/containers/ubuntu_CUDA_ai_2p0.sif
Pytorch is also available as a module: module load pytorch/x.x.x . If you find that some packages are missing, you can often install them yourself with pip install --user.
Currently installed Pytorch versions: 2.2.2; 2.4.1
We strongly advise against installing a new Conda environment directly into your home folder, as this method consumes a large number of inodes. You can check your inode usage with df -i or with the squota command. If you need a custom environment, please use Singularity container technology.
Usage of Pytorch container
Running interactive jobs in a container:
Note
To build your own AI container, please refer to: https://docs.hpc.kifu.hu/en/software/singularity.html#singularity . Recipes (or .def files) are also available through ssh at: /opt/software/packages/containers/ .
Create a run_script.sh file with the following content:
module load singularity #load Singularity module
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py #run the script in the container
Then run run_script.sh with the following command:
srun --partition=gpu --cpus-per-gpu=32 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh
Hint
Although GPU-heavy jobs usually do not rely heavily on system memory, it is necessary to reserve enough memory for the container. However, only 2000 MB of memory can be allocated per CPU core; this limit can only be circumvented by reserving more CPU cores. For example, --cpus-per-gpu=32 with --mem-per-cpu=2000 reserves 64 000 MB of memory for the job.
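The env_test.py script used in the examples is not shown in this guide. A minimal sketch that only checks whether Pytorch inside the container can see the reserved GPU (the file name is taken from the example above; its content here is an assumption, not the installed script) could be:

# env_test.py -- minimal environment check (illustrative sketch)
import torch

print("Pytorch version:", torch.__version__)          # version shipped in the container
print("CUDA available:", torch.cuda.is_available())   # True if --nv exposed the reserved GPU
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("GPU name:", torch.cuda.get_device_name(0))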
Running a batch job in a container:
The previous interactive job can also be run as a batch job. In this case, the content of batch_script.sh is the following:
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=gpu
#SBATCH --job-name=jobname
#SBATCH --cpus-per-gpu=32
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:1
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py
This script can be queued with the following command:
sbatch batch_script.sh
Usage of Pytorch module
Pytorch is installed in a Miniconda image file. This environment already contains the necessary driver installations, so loading the CUDA module is unnecessary and could possibly be disruptive. The installed Pytorch and Tensorflow modules are mutually exclusive: if Pytorch is already loaded, loading Tensorflow will unload the Pytorch module.
Check the available Pytorch versions:
module avail pytorch
Using the default version of the pytorch module in an interactive job:
srun -p gpu -c 16 --gres=gpu:1 --pty bash
module load pytorch
python your_pytorch_script.py
Parallel jobs with Pytorch (torchrun):
Distributed Data Parallel (DDP)
Note
The following examples have been tested only with the Pytorch modules.
A popular solution for processing a large dataset with a small model is Distributed Data Parallel (DDP). In this example script, we submit a job that uses 8 GPUs on 2 GPU nodes:
#!/bin/bash
#SBATCH --account=<account>
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=0-00:15:00
#SBATCH --gres=gpu:4
#SBATCH --output=2_node_4_gpu.out
#SBATCH --exclusive # To avoid communication errors!
export RDZV_HOST=$(hostname) # The name of the master node (automatically set)
export RDZV_PORT=29400
# Torchrun controls the communication between the GPUs and the nodes.
# --nproc_per_node equals the number of reserved GPUs in one node.
# --rdzv_id is a unique identifier for the processes; the Slurm JobID is safe to use.
srun torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
    your_DDP_job.py
Example code: https://git.einfra.hu/hpc-public/AI_examples.git
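Torchrun sets the RANK, LOCAL_RANK and WORLD_SIZE environment variables for every worker process. The name your_DDP_job.py above is a placeholder; the following is only a minimal sketch of what such a script might contain (the model and the training loop are stand-ins, not the content of the linked example repository):

# your_DDP_job.py -- minimal DDP skeleton (illustrative sketch)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # rendezvous info is provided by torchrun
local_rank = int(os.environ["LOCAL_RANK"])       # GPU index of this worker within its node
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).to(local_rank)   # placeholder model
model = DDP(model, device_ids=[local_rank])

# ... build a DataLoader with a DistributedSampler and run the training loop here ...

dist.destroy_process_group()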
DDP + Model Parallelism
DDP can be combined with model parallelism. Keep in mind that the model layers will be distributed among the GPUs assigned to each worker.
Following the previous example, the model is divided between 2 GPUs, so in an 8 GPU / 2 node setup the total number of workers is 4. Although the number of reserved GPUs per node is still 4, the number of worker processes per node changes:
#!/bin/bash
#SBATCH --account=hpcteszt
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=0-00:15:00
#SBATCH --gres=gpu:4
#SBATCH --output=2_node_4_gpu.out
#SBATCH --exclusive
export RDZV_HOST=$(hostname)
export RDZV_PORT=29400
# The number of workers per node changes here (each worker uses two GPUs).
srun torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=2 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
    your_DDP_MP_job.py
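With --nproc_per_node=2, each worker process controls two of the node's four GPUs. The name your_DDP_MP_job.py is again a placeholder; the following rough sketch shows one way a model could be split across the two GPUs belonging to a worker (the layer sizes and the model itself are arbitrary assumptions):

# your_DDP_MP_job.py -- DDP combined with a two-GPU model split (illustrative sketch)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class SplitModel(torch.nn.Module):
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = torch.nn.Linear(10, 10).to(dev0)  # first half of the layers
        self.part2 = torch.nn.Linear(10, 10).to(dev1)  # second half of the layers

    def forward(self, x):
        x = self.part1(x.to(self.dev0))
        return self.part2(x.to(self.dev1))

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])       # 0 or 1 with --nproc_per_node=2
dev0, dev1 = 2 * local_rank, 2 * local_rank + 1  # the two GPUs belonging to this worker
model = DDP(SplitModel(dev0, dev1))              # no device_ids when the module spans devices

# ... training loop here ...

dist.destroy_process_group()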
Solutions for suboptimal GPU usage
GPU utilization can be suboptimal when you train a small model on many small files, because disk read performance lags behind GPU performance. To increase GPU utilization, wrap your data into the Hierarchical Data Format (HDF5).
An MNIST example with Pytorch and HDF5:
Download and convert the MNIST dataset to HDF5:
import h5py
from torchvision import transforms
from torchvision.datasets import MNIST

train_dataset = MNIST(root='data', train=True, transform=transforms.ToTensor(), download=True)
with h5py.File('mnist_train.hdf5', 'w') as hdf:
    # Create datasets inside the HDF5 file
    images_dataset = hdf.create_dataset('images', (len(train_dataset), 28, 28), dtype='float32')
    labels_dataset = hdf.create_dataset('labels', (len(train_dataset),), dtype='int64')
    for i, (image, label) in enumerate(train_dataset):
        images_dataset[i] = image.squeeze().numpy()  # Save the image
        labels_dataset[i] = label                    # Save the label
Create a Pytorch-compatible dataset from the HDF5 file:
import h5py
from torch.utils import data

class H5Dataset(data.Dataset):
    def __init__(self, hdf5_file='mnist_train.hdf5', transform=None, preload=False):
        # Open the HDF5 file
        self.hdf5_file = hdf5_file
        self.transform = transform
        self.preload = preload
        self.hdf = h5py.File(self.hdf5_file, 'r', swmr=True)
        self.images = self.hdf['images']
        self.labels = self.hdf['labels']
        self.dataset_size = self.images.shape[0]
        if self.preload:
            # Read everything into memory once instead of reading from disk per item
            self.images = self.images[:]
            self.labels = self.labels[:]

    def __len__(self):
        return self.dataset_size

    def __getitem__(self, index):
        # Fetch the image and label belonging to the given index
        image = self.images[index]
        label = self.labels[index]
        if self.transform:
            image = self.transform(image)
        return image, label

    def __del__(self):
        # Ensure the HDF5 file is properly closed
        if hasattr(self, 'hdf') and self.hdf:
            self.hdf.close()
Load the dataset and train the model:
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import transforms

train_dataset = H5Dataset(hdf5_file, transform=transforms.ToTensor(), preload=False)
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size,
                          shuffle=False, num_workers=0, pin_memory=True,
                          sampler=train_sampler)
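When shuffling is delegated to the DistributedSampler, the sampler's epoch is usually advanced at the start of every epoch so that each epoch sees a different ordering. A condensed sketch of the surrounding loop follows; num_epochs, model, optimizer and loss_fn are placeholders, not part of the example above:

# Sketch of the epoch loop around the DataLoader above
for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)              # reshuffle the data differently in every epoch
    for images, labels in train_loader:
        images = images.cuda(non_blocking=True)
        labels = labels.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)   # placeholder model and loss
        loss.backward()
        optimizer.step()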
The complete example code, with HDF5 and DDP: https://git.einfra.hu/hpc-public/AI_examples.git
Please note that the implementation allows loading the full dataset into memory. In that case memory can become a limiting factor, but if your dataset is small, you can achieve a further GPU performance boost.
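If the dataset fits in memory, this behaviour can be enabled through the preload option of the H5Dataset class shown above, for example:

# Load all images and labels into memory at construction time
train_dataset = H5Dataset(hdf5_file, transform=transforms.ToTensor(), preload=True)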
The official H5py documentation: https://docs.h5py.org/en/stable/
The official Pytorch documentation is available here: https://pytorch.org/docs/stable/index.html