Pytorch
Pytorch is a popular tensor library for deep learning jobs.
Licence
Pytorch is BSD-style licenced: https://github.com/pytorch/pytorch/blob/main/LICENSE
Preinstalled Pytorch environments
JupyterHub, GPU AI Lab or Pytorch environment: https://jupyter.hpc.kifu.hu/ (not suitable for multinode jobs).
Preinstalled Pytorch containers are available through ssh. For example: /opt/software/packages/containers/ubuntu_CUDA_ai_2p0.sif
- Pytorch is also available as a module: module load pytorch/x.x.x . If you find that some packages are missing, you can often install it yourself with pip install –user.
Currently installed Pytorch verions: 2.2.2; 2.4.1
We are strongly advise against installing a new Conda environment directly into your home folder, as this method consumes a lot of Inode. You can see the Inode information with: df -i, or squota . If you need a custom environment, please use Singularity container technology.
Usage of Pytorch container
Running interactive jobs in container:
Note
To build your own AI container, please refer to: https://docs.hpc.kifu.hu/en/software/singularity.html#singularity . Recipes (or .def files) are also available through ssh at: /opt/software/packages/containers/ .
Create a run_script.sh file with the following content:
module load singularity #load Singularity module
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py #run the script in the container
Then run the run_script.sh with the following command:
srun --partition=gpu --cpus-per-gpu=32 --mem-per-cpu=2000 --gres=gpu:1 bash run_script.sh
Hint
Although GPU-heavy jobs usually don’t rely on system memory, it is necessary to reserve enough memory for the container. However, only 2000 MB memory can be allocated for each CPU core. This limit can only be bypassed with reserving more CPU cores.
Running a batch job in container:
The previous interactive job can also be run as a batch job. In this case, the content of the batch_script.sh will be the following:
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --partition=gpu
#SBATCH --job-name=jobname
#SBATCH --cpus-per-gpu=32
#SBATCH --mem-per-cpu=2000
#SBATCH --gres=gpu:1
module load singularity
singularity exec --nv ubuntu_CUDA_ai.sif python env_test.py
This script can be queued with the following command:
sbatch batch_script.sh
Usage of Pytorch module
Pytorch is installed in a Miniconda image file. This environment already contains the necessary driver installations, hence loading CUDA is unnecessary and could possibly be disruptive. The installed Pytorch and Tensorflow modules mutually exclude each other. This means, if Pytorch is already loaded, loading Tensorflow or another version of Pytorch will unload the currently loaded Pytorch module.
Check the available Pytorch versions:
module avail pytorch
Using the default version of the pytorch module in an interactive job:
srun -p gpu -c 16 --gres=gpu:1 --pty bash
module load pytorch
python your_pytorch_script.py
Parallel jobs with Pytorch (torchrun):
Note
The following examples are tested only for the Pytorch modules.
Distributed Data Parallel (DDP)
A simplified DDP flowchart for a 1 node job:
DDP is a popular solution for processing a large dataset with a small model in a distributed environment. During the first epoch, exact copies of the model are distributed among the available GPUs. Each copy of the model processes a subset of the input data. Data sharding and distributing data subsets among the workers is automatically handled by the DistributedSampler. Communication between the workers (GPUs and nodes) is handled by nccl and torchrun’s c10d backend. At the end of each epoch, the gradients are aggregated, and model parameters are updated synchronously. Training a model with DDP can greatly reduce model training time.
DDP method can be applied in a multi-node setup as well:
Multi-node setup: Changing the python script is not required. To set up a multi-node DDP job, the only change in the jobscript is in the requested nodes.
Jobscript for a DDP job:
#!/bin/bash
#SBATCH --account=<account>
#SBATCH --partition=gpu
#SBATCH --nodes=1 # MULTI-NODE CHANGE FOR 2 NODES: --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=0-00:15:00
#SBATCH --gres=gpu:4
#SBATCH --output=output.out
#SBATCH --exclusive # To avoid communication errors!
export RDZV_HOST=$(hostname) # The name of the master node (automatically set)
export RDZV_PORT=29400
srun torchrun \ # Torchrun will control the communication between GPUs and nodes.
--nnodes=$SLURM_JOB_NUM_NODES \
--nproc_per_node=4 \ # Equals the number of the reserved GPUs in one node.
--rdzv_id=$SLURM_JOB_ID \ # Unique identifier for the processes, safe to use the JobID.
--rdzv_backend=c10d \
--rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
your_DDP_job.py
A complete example of a DDP job: https://git.einfra.hu/hpc-public/AI_examples.git
runscript: run_DDP_n2_g4.sh
pytorch script: DDP.py
Model parallelism
Model parallelism method is used when model size exceeds the vRAM capacity of a single GPU. Model parallelism involves manually assigning different layers of the model to different GPUs and managing data transfer between these GPUs during forward and backward passes. IMPORTANT: during training, GPUs containing currently inactive parts of the model are idle. Because of this, model parallelism can be inefficient! To submit a model parallel job with SLURM scheduler, use the script bellow:
#!/bin/bash #SBATCH --account=<account>
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=64
#SBATCH --time=0-00:15:00
#SBATCH --gres=gpu:4
#SBATCH --output=model_parallel_4GPU.out
#SBATCH --exclusive
module load pytorch
torchrun your_MP_job.py --epochs=100
(Torchrun is not necessarily required to run model parallelism.)
A complete example of a model parallelism job: https://git.einfra.hu/hpc-public/AI_examples.git
runscript: run_MP_n1_g4.sh
pytorch script: MP_n1_g4.py
DDP + Model Parallelism
DDP can be combined with model parallelism. In this simplified flowchart, the model is distributed on 2 GPUs. This way, each node can hold 2 workers (groups). Local ranking on the node is handled by torchrun, device ordinals are dynamically set among workers. This hybrid method can be used when both the model size and the dataset is too large for a single GPU. Keep in mind: GPUs holding currently inactive parts of the modell will be idle.
Compared to the previous simple DDP jobscript, in this hybrid method, the process number is reduced to 2, since each node can only hold 2 workers.
#!/bin/bash
#SBATCH --account=<account>
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=0-00:15:00
#SBATCH --gres=gpu:4
#SBATCH --output=output.out
#SBATCH --exclusive
export RDZV_HOST=$(hostname)
export RDZV_PORT=29400
srun torchrun \
--nnodes=$SLURM_JOB_NUM_NODES \
--nproc_per_node=2 \ # worker number will change here
--rdzv_id=$SLURM_JOB_ID \
--rdzv_backend=c10d \
--rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
your_DDP_MP_job.py
A complete example of a model parallelism job: https://git.einfra.hu/hpc-public/AI_examples.git
runscript: run_DDP_MP.sh
pytorch script: DDP_MP.py
Solutions for suboptimal GPU usage
GPU performance can be suboptimal, when you train a small model on a lot of small files. The reason for this, is that disk reading performance is lagging behind GPU performance. To increase GPU performance, wrap your data into Hierarchical Data Format (HDF5).
Download and convert the MNIST dataset to HDF5.
train_dataset = MNIST(root='data', train=True, transform=transforms.ToTensor(), download=True)
with h5py.File('mnist_train.hdf5', 'w') as hdf:
# Create datasets inside the HDF5 file
images_dataset = hdf.create_dataset('images', (len(train_dataset), 28, 28), dtype='float32')
labels_dataset = hdf.create_dataset('labels', (len(train_dataset),), dtype='int64')
for i, (image, label) in enumerate(train_dataset):
images_dataset[i] = image.squeeze().numpy() # Save the image
labels_dataset[i] = label # Save the label
Create a Pytorch-compatible dataset from the HDF5 file:
class H5Dataset(data.Dataset):
def __init__(self, hdf5_file='mnist_train.hdf5', transform=None, preload=False):
# Open the HDF5 file
self.hdf5_file = hdf5_file
self.transform = transform
self.preload = preload
self.hdf = h5py.File(self.hdf5_file, 'r', swmr=True)
self.images = self.hdf['images']
self.labels = self.hdf['labels']
self.dataset_size = self.images.shape[0]
if self.preload:
self.images = self.images[:]
self.labels = self.labels[:]
def __len__(self):
return self.dataset_size
def __getitem__(self, index):
# Open the file in read mode, fetch the image and label
image = self.images[index]
label = self.labels[index]
if self.transform:
image = self.transform(image)
return image, label
def __del__(self):
# Ensure the HDF5 file is properly closed
if hasattr(self, 'hdf') and self.hdf:
self.hdf.close()
Load the dataset and train the model:
train_dataset = H5Dataset(hdf5_file, transform=transforms.ToTensor(), preload=False)
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size,
shuffle=False, num_workers=0, pin_memory=True,
sampler=train_sampler)
The complete example code, with HDF5 and DDP: https://git.einfra.hu/hpc-public/AI_examples.git
Please note, the implementation allows to load the full dataset into memory. In this way, memory can be a limiting factor, but if your dataset is small, you can achieve further GPU performance boost.
The official H5py documentation: https://docs.h5py.org/en/stable/
The official Pytorch documentation is available here: https://pytorch.org/docs/stable/index.html