Hybrid MPI

Hybrid MPI parallel applications use both MPI (here Cray MPICH) and OpenMP: MPI handles data transfer between processes, while OpenMP threads work on shared memory within each process. Because a hybrid application needs fewer MPI ranks per node, it can consume less memory than a pure MPI workload.
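
As an illustration of this combination, the sketch below (the function name and data are made up for this illustration) lets the OpenMP threads of each rank reduce an array in shared memory, after which the per-rank results are combined across ranks with MPI_Allreduce, so only one value per rank travels over MPI:

#include <mpi.h>
#include <omp.h>

/* Sketch: per-rank partial sum via OpenMP, global sum via MPI. */
double hybrid_sum(const double *x, long n)
{
  double local = 0.0, global = 0.0;

  /* The threads of this rank share x and reduce into local. */
  #pragma omp parallel for reduction(+:local)
  for (long i = 0; i < n; i++)
    local += x[i];

  /* Each rank contributes a single double to the global sum. */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global;
}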

Hybrid Example

The example below shows the structure of a hybrid MPI program. The application uses functions from both the mpi.h and omp.h header files and combines the two programming models: every MPI process starts an OpenMP parallel region in which each thread prints its thread number, the process rank, and the host name.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int iam = 0, np = 1;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  #pragma omp parallel default(shared) private(iam, np)
  {
    np = omp_get_num_threads();  /* number of threads in this process */
    iam = omp_get_thread_num();  /* this thread's id within the process */
    printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
           iam, np, rank, numprocs, processor_name);
  }

  MPI_Finalize();
  return 0;
}

Hybrid OpenMP presentation: https://www.openmp.org/wp-content/uploads/HybridPP_Slides.pdf
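
The example above calls MPI_Init, which is sufficient here because all MPI calls happen outside the parallel region. As a hedged alternative, a hybrid program can request a thread support level explicitly with MPI_Init_thread, for example MPI_THREAD_FUNNELED when only the main thread communicates:

int provided;
/* MPI_THREAD_FUNNELED: only the main thread makes MPI calls. */
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if (provided < MPI_THREAD_FUNNELED)
  MPI_Abort(MPI_COMM_WORLD, 1);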

Hybrid MPI CPU

Compiling the application for CPU with the CCE compiler:

cc hybrid.c -fopenmp

The executable built with the CCE compiler is dynamically linked as follows:

$ ldd a.out |grep mp
     libmpi_cray.so.12 => /opt/cray/pe/lib64/libmpi_cray.so.12 (0x00007fa7b7e6a000)
     libcraymp.so.1 => /opt/cray/pe/cce/16.0.1/cce/x86_64/lib/libcraymp.so.1 (0x00007fa7b7286000)

Compiling the application for CPU with the GNU compiler:

module swap PrgEnv-cray PrgEnv-gnu
cc hybrid.c -fopenmp

The executable built with the GNU compiler is dynamically linked as follows:

$ ldd a.out | grep mp
     libmpi_gnu_103.so.12 => /opt/cray/pe/lib64/libmpi_gnu_103.so.12 (0x00007f11b0e54000)
     libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f11b0c1c000)

Hybrid MPI GPU

Compiling the application for GPU usage with the CCE compiler:

module load craype-accel-nvidia80
export CRAY_ACCEL_TARGET=nvidia80
cc hybrid.c -fopenmp

The resulting executable is dynamically linked as follows; note the additional libmpi_gtl_cuda (GPU Transport Layer) library:

$ ldd a.out | grep mp
     libmpi_gnu_103.so.12 => /opt/cray/pe/lib64/libmpi_gnu_103.so.12 (0x00007f350e4d5000)
     libmpi_gtl_cuda.so.0 => /opt/cray/pe/lib64/libmpi_gtl_cuda.so.0 (0x00007f350e28f000)
     libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f350e057000)

Compiling the application for GPU usage with the GNU compiler using the Nvidia HPC SDK:

module swap PrgEnv-cray PrgEnv-gnu
module load nvhpc
cc hybrid.c -fopenmp

The executable built with the GNU compiler is dynamically linked as follows; note that libgomp now comes from the Nvidia HPC SDK installation:

$ ldd a.out | grep mp
     libmpi_gnu_103.so.12 => /opt/cray/pe/lib64/libmpi_gnu_103.so.12 (0x00007f4ae6c22000)
     libgomp.so.1 => /opt/software/packages/nvhpc/Linux_x86_64/23.11/compilers/lib/libgomp.so.1 (0x00007f4ae5c21000)

Hybrid MPI Nvidia

Compiling the application for GPU usage with the Nvidia compiler:

module swap PrgEnv-cray PrgEnv-nvhpc
cc hybrid.c -mp=gpu -gpu=cc80

The executable built with the Nvidia compiler is dynamically linked as follows:

$ ldd a.out | grep libmp
     libmpi_nvidia.so.12 => /opt/cray/pe/lib64/libmpi_nvidia.so.12 (0x00007f6cce0c8000)
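
Note that hybrid.c above contains no OpenMP target construct, so the GPU builds shown in this section still execute the parallel region on the CPU. The sketch below (the function name is made up for this illustration) shows the kind of target region that would actually offload work to the GPU; the per-rank result can then be combined with MPI_Allreduce just as on the CPU:

/* Sketch: offload a per-rank reduction loop to the GPU. */
double device_sum(const double *x, long n)
{
  double local = 0.0;

  /* Copy x to the device, run the loop there, bring the sum back. */
  #pragma omp target teams distribute parallel for \
          reduction(+:local) map(to: x[0:n]) map(tofrom: local)
  for (long i = 0; i < n; i++)
    local += x[i];

  return local;
}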

Hybrid MPI CPU Batch Job

The following batch job runs the application with 4 MPI tasks and 16 CPUs per task; OMP_NUM_THREADS is set from SLURM_CPUS_PER_TASK, so each task runs 16 OpenMP threads (64 threads in total).

#!/bin/bash
#SBATCH -A hpcteszt
#SBATCH --partition=cpu
#SBATCH --job-name=hybrid-cpu
#SBATCH --output=hybrid-cpu.out
#SBATCH --time=06:00:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=16
# One OpenMP thread per CPU allocated to each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid-cpu

Hybrid MPI GPU Batch Job

For GPU offload, the OMP_TARGET_OFFLOAD environment variable must be set to MANDATORY, so that the program aborts instead of silently falling back to the CPU when offloading is not possible.

#!/bin/bash
#SBATCH -A hpcteszt
#SBATCH --partition=gpu
#SBATCH --job-name=hybrid-gpu
#SBATCH --output=hybrid-gpu.out
#SBATCH --time=06:00:00
#SBATCH --ntasks=4
#SBATCH --gres=gpu:1
export OMP_TARGET_OFFLOAD=MANDATORY
srun ./hybrid-gpu
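
With OMP_TARGET_OFFLOAD=MANDATORY the program aborts if no usable device is found. A simple way to confirm from within the code which devices the OpenMP runtime sees is to query the device count, as in the sketch below (the helper function is made up for this illustration):

#include <stdio.h>
#include <omp.h>

/* Sketch: let rank 0 report the number of visible OpenMP devices. */
void report_devices(int rank)
{
  if (rank == 0)
    printf("OpenMP devices visible: %d\n", omp_get_num_devices());
}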