PyTorch

PyTorch is a popular tensor library for deep learning.

Licence

PyTorch is released under a BSD-style licence: https://github.com/pytorch/pytorch/blob/main/LICENSE

Preinstalled PyTorch environments

  1. JupyterHub, with the GPU AI Lab or PyTorch environment: https://jupyter.hpc.kifu.hu/ (not suitable for multi-node jobs).

  2. Preinstalled PyTorch containers are available through SSH, for example: /opt/software/packages/containers/ubuntu_CUDA_ai_2p0.sif

  3. PyTorch is also available as a module: module load pytorch/x.x.x . If you find that some packages are missing, you can often install them yourself with pip install --user.

    Currently installed PyTorch versions: 2.2.2; 2.4.1

We strongly advise against installing a new Conda environment directly into your home folder, as this method consumes a large number of inodes. You can check your inode usage with df -i or squota . If you need a custom environment, please use Singularity container technology.

Usage of the PyTorch container

Running an interactive job in the container:

Note

To build your own AI container, please refer to: https://docs.hpc.kifu.hu/en/software/singularity.html#singularity . Recipes (.def files) are also available through SSH at: /opt/software/packages/containers/ .

Create a run_script.sh file with the following content:

module load singularity                                        #load Singularity module
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py    #run the script in the container

Then run the run_script.sh with the following command:

srun --partition=gpu --cpus-per-gpu=32 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh

Hint

Although GPU-heavy jobs usually do not rely much on system memory, it is still necessary to reserve enough memory for the container. However, only 2000 MB of memory can be allocated per CPU core, so the only way to get more memory is to reserve more CPU cores (for example, --cpus-per-gpu=32 with --mem-per-cpu=2000 yields about 64 GB of system memory).
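The env_test.py used in these examples is your own check script and is not reproduced here. As a minimal, hypothetical sketch, it could simply verify that PyTorch inside the container sees the reserved GPU:

# Hypothetical env_test.py: verify that the container sees the reserved GPU.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("GPU name:", torch.cuda.get_device_name(0))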

Running a batch job in the container:

The previous interactive job can also be run as a batch job. In this case, the content of batch_script.sh will be the following:

#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=gpu
#SBATCH --job-name=jobname
#SBATCH --cpus-per-gpu=32
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:1
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py

This script can be queued with the following command:

sbatch batch_script.sh

Usage of the PyTorch module

PyTorch is installed in a Miniconda image file. This environment already contains the necessary driver installations, so loading the CUDA module is unnecessary and may even be disruptive. The installed PyTorch and TensorFlow modules are mutually exclusive: if PyTorch is already loaded, loading TensorFlow will unload the PyTorch module.

Check the available PyTorch versions:

module avail pytorch

Using the default version of the pytorch module in an interactive job:

srun -p gpu -c 16 --gres=gpu:1 --pty bash
module load pytorch
python your_pytorch_script.py
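Here, your_pytorch_script.py stands for your own code. As a minimal, hypothetical sketch (a toy model and random data, not a real training script), a single-GPU run could look like this:

# Hypothetical minimal single-GPU example (toy model, random data).
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(100, 10).to(device)               # toy model on the reserved GPU
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 100, device=device)             # one random batch
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print("loss:", loss.item(), "device:", device)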

Parallel jobs with PyTorch (torchrun):

Distributed Data Parallel (DDP)

Note

The following examples have been tested only with the PyTorch modules.

A popular solution for processing a large dataset with a small model is Distributed Data Parallel (DDP). In this example script, we submit a job that will use 8 GPUs across 2 GPU nodes:

#!/bin/bash
#SBATCH --account=<account>
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=0-00:15:00
#SBATCH --gres=gpu:4
#SBATCH --output=2_node_4_gpu.out
#SBATCH --exclusive                             # To avoid communication errors!

export RDZV_HOST=$(hostname)                    # The name of the master node (automatically set)
export RDZV_PORT=29400

# torchrun controls the communication between the GPUs and the nodes.
# --nproc_per_node equals the number of reserved GPUs in one node.
# --rdzv_id is a unique identifier for the processes; the job ID is safe to use.
srun torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
    your_DDP_job.py

Example code: https://git.einfra.hu/hpc-public/AI_examples.git
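The repository above contains the complete, tested example. For orientation only, a skeletal your_DDP_job.py driven by torchrun (a sketch with a toy model and random data; torchrun provides the RANK, LOCAL_RANK and WORLD_SIZE environment variables) typically follows this pattern:

# Hypothetical skeleton of your_DDP_job.py.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # NCCL backend for GPU communication
    local_rank = int(os.environ["LOCAL_RANK"])       # GPU index within the node, set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Linear(100, 10).to(local_rank)        # toy model on this worker's GPU
    ddp_model = DDP(model, device_ids=[local_rank])  # gradients are synchronised across all workers
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):                           # replace with a real DataLoader + DistributedSampler
        x = torch.randn(32, 100, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        loss_fn(ddp_model(x), y).backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()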

DDP + Model Parallelism

DDP can be combined with model parallelism. Keep in mind that in this case the model layers are distributed among the GPUs of each worker.

Following the previous example, the model is divided between 2 GPUs. In an 8-GPU/2-node setup, this means the total number of workers will be 4. Although the number of reserved GPUs per node is still 4, the number of workers per node changes:

#!/bin/bash
#SBATCH --account=<account>
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=0-00:15:00
#SBATCH --gres=gpu:4
#SBATCH --output=2_node_4_gpu.out
#SBATCH --exclusive

export RDZV_HOST=$(hostname)
export RDZV_PORT=29400

# --nproc_per_node changes here: 2 workers per node, each driving 2 GPUs.
srun torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=2 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
    your_DDP_MP_job.py
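Again, the repository linked above contains the full example. As a hypothetical sketch of your_DDP_MP_job.py: with --nproc_per_node=2 and 4 reserved GPUs per node, each worker drives two GPUs, places half of a toy model on each, and wraps the multi-device module in DDP (in this case device_ids must not be set):

# Hypothetical skeleton of your_DDP_MP_job.py: 2 workers per node, 2 GPUs per worker.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TwoGPUModel(nn.Module):
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Linear(100, 50).to(dev0)     # first half on the worker's first GPU
        self.part2 = nn.Linear(50, 10).to(dev1)      # second half on the worker's second GPU

    def forward(self, x):
        x = torch.relu(self.part1(x.to(self.dev0)))
        return self.part2(x.to(self.dev1))           # activations move between the two GPUs

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])       # 0 or 1 with --nproc_per_node=2
    dev0 = torch.device(f"cuda:{2 * local_rank}")
    dev1 = torch.device(f"cuda:{2 * local_rank + 1}")

    ddp_model = DDP(TwoGPUModel(dev0, dev1))         # no device_ids for a multi-device module
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):                           # replace with a real DataLoader + DistributedSampler
        x = torch.randn(32, 100)
        y = torch.randint(0, 10, (32,), device=dev1) # targets on the output device
        optimizer.zero_grad()
        loss_fn(ddp_model(x), y).backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()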

Solutions for suboptimal GPU usage

GPU usage can be suboptimal when you train a small model on a lot of small files. The reason is that disk read performance lags behind GPU performance. To improve GPU usage, wrap your data into the Hierarchical Data Format (HDF5).

An MNIST example with PyTorch and HDF5:
  • Download and convert the MNIST dataset to HDF5.

import h5py
from torchvision import transforms
from torchvision.datasets import MNIST

train_dataset = MNIST(root='data', train=True, transform=transforms.ToTensor(), download=True)

with h5py.File('mnist_train.hdf5', 'w') as hdf:
    # Create datasets inside the HDF5 file
    images_dataset = hdf.create_dataset('images', (len(train_dataset), 28, 28), dtype='float32')
    labels_dataset = hdf.create_dataset('labels', (len(train_dataset),), dtype='int64')

    for i, (image, label) in enumerate(train_dataset):
        images_dataset[i] = image.squeeze().numpy()  # Save the image
        labels_dataset[i] = label  # Save the label
  • Create a PyTorch-compatible dataset from the HDF5 file:

import h5py
from torch.utils import data

class H5Dataset(data.Dataset):
    def __init__(self, hdf5_file='mnist_train.hdf5', transform=None, preload=False):
        # Open the HDF5 file
        self.hdf5_file = hdf5_file
        self.transform = transform
        self.preload = preload

        self.hdf = h5py.File(self.hdf5_file, 'r', swmr=True)
        self.images = self.hdf['images']
        self.labels = self.hdf['labels']
        self.dataset_size = self.images.shape[0]

        if self.preload:
            # Read the whole dataset into memory as NumPy arrays
            self.images = self.images[:]
            self.labels = self.labels[:]

    def __len__(self):
        return self.dataset_size

    def __getitem__(self, index):
        # Fetch the image and label
        image = self.images[index]
        label = self.labels[index]

        if self.transform:
            image = self.transform(image)

        return image, label

    def __del__(self):
        # Ensure the HDF5 file is properly closed
        if hasattr(self, 'hdf') and self.hdf:
            self.hdf.close()
  • Load the dataset and train the model:

from torch.utils.data import DataLoader, DistributedSampler
from torchvision import transforms

hdf5_file, batch_size = 'mnist_train.hdf5', 64    # example values
train_dataset = H5Dataset(hdf5_file, transform=transforms.ToTensor(), preload=False)
train_sampler = DistributedSampler(train_dataset)  # requires an initialised process group
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size,
                          shuffle=False, num_workers=0, pin_memory=True,
                          sampler=train_sampler)
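The last snippet only builds the loader. Building on it, a minimal, hypothetical training loop could look like the following; in the full DDP example from the repository the model would additionally be wrapped in DistributedDataParallel, and train_sampler.set_epoch() reshuffles the distributed shards each epoch:

# Hypothetical training loop over the HDF5-backed DataLoader (toy MNIST classifier).
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)   # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                               # example epoch count
    train_sampler.set_epoch(epoch)                   # reshuffle across workers each epoch
    for images, labels in train_loader:
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()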

The complete example code, with HDF5 and DDP: https://git.einfra.hu/hpc-public/AI_examples.git

Please note that the implementation can optionally load the full dataset into memory (preload=True). In that case memory can become a limiting factor, but if your dataset is small, you can achieve a further GPU performance boost.

The official h5py documentation: https://docs.h5py.org/en/stable/

The official PyTorch documentation is available here: https://pytorch.org/docs/stable/index.html