Lab: Introduction to OpenMP and MPI.

Part I: OpenMP

Objective: install compilers and OpenMP support libraries on the head node; compile and run OpenMP applications.

Steps:

  • Install gcc and OpenMP support packages.

  • Download a tarball with OpenMP source codes.

  • Compile the applications and run them with different numbers of threads.

  • Examine the performance scalability of OpenMP computations.

Install gcc with OpenMP support (on Rocky)

To compile C source code, we need to install the GNU C compiler, GCC.

On Red Hat 8.7, you can install several versions of the GCC compiler, including 8.5 (the default), 9, 10, 11, and 12. We’ll install the default 8.5 and also GCC 12 for demonstration purposes.

  • Install GCC 8.5, GCC 12, and the OpenMP support packages:

sudo dnf install -y gcc  # to install a default gcc 8.5 compiler on Red Hat 8.7
sudo dnf install -y gcc-toolset-12-gcc # to install the latest gcc 12 on Red Hat 8.7
sudo dnf install -y libomp-devel

Switch between the installed compilers (on Rocky)

Verify the current GCC compiler version:

gcc -v

It should show: gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)

If we want to compile source code with GCC 12, for example, we set up the compiler environment in a new shell:

scl enable gcc-toolset-12 'bash'
gcc -v

It should show: gcc version 12.1.1 20220628 (Red Hat 12.1.1-3) (GCC)

If the gcc command is invoked in this shell, the code will be compiled with GCC 12.

Now we switch back to the default compiler by exiting the shell:

exit
gcc -v

Compilation and run with OpenMP

Download the tarball with the OpenMP source code files and extract it:

wget  https://linuxcourse.rutgers.edu/LCI_2023/OpenMP.tgz
tar -zxvf OpenMP.tgz

Step into directory OpenMP:

cd OpenMP
  • Compile hello.c with OpenMP support:

gcc -fopenmp -o hello.x hello.c

Run command ldd on the generated executable to see the dynamic libraries it is using:

ldd hello.x

It shows the following output:

linux-vdso.so.1 (0x00007ffdc0d58000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f32521a9000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3251f89000)
libc.so.6 => /lib64/libc.so.6 (0x00007f3251bc4000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f32519c0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f32523e1000)

Notice the libgomp.so.1 and libpthread.so.0 shared object libraries. Executables compiled with OpenMP support always load them.

If we were to compile hello.c with GCC 12 on Rocky, we would first run scl enable gcc-toolset-12 'bash', or compile it in one step as follows:

scl enable gcc-toolset-12 'gcc -fopenmp -o hello.x hello.c'

Now run executable hello.x and see what happens:

./hello.x

The output shows two threads. By default, when you run an executable compiled with OpenMP support, the number of threads equals the number of CPU cores available on the system.
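
The source of hello.c ships in the tarball; as a point of reference, a minimal OpenMP "hello" program that behaves this way (a sketch, not necessarily the tarball's exact code) looks like this:

#include <omp.h>
#include <stdio.h>

int main() {
    /* Every thread in the parallel region executes this block once. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();       /* id of this thread    */
        int nthreads = omp_get_num_threads(); /* threads in the team  */
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}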

Define environment variable OMP_NUM_THREADS=4, then run the executable again:

export OMP_NUM_THREADS=4
./hello.x
  • Compile for.c and run executable for.x several times:

gcc -fopenmp -o for.x for.c
./for.x
./for.x

Notice that the order of the thread output changes between runs.
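
The essence of for.c is a loop whose iterations are split among the threads; a minimal sketch of that pattern (an illustration, not the file's exact contents):

#include <omp.h>
#include <stdio.h>

int main() {
    int i;
    /* Iterations are divided among the threads; the print order depends
       on thread scheduling and therefore changes from run to run. */
    #pragma omp parallel for
    for (i = 0; i < 8; i++)
        printf("iteration %d handled by thread %d\n", i, omp_get_thread_num());
    return 0;
}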

  • Compile sections.c:

gcc -fopenmp -o sections.x sections.c

Set the number of threads to 2, and run the executable:

export OMP_NUM_THREADS=2
./sections.x
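
With the omp sections construct, each section is executed once by some thread in the team; on 2 threads the two sections typically run concurrently. A minimal sketch of the construct (an illustration, not necessarily the tarball's sections.c):

#include <omp.h>
#include <stdio.h>

int main() {
    /* Each section runs exactly once, by whichever thread picks it up. */
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("section 1 run by thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("section 2 run by thread %d\n", omp_get_thread_num());
    }
    return 0;
}
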
  • Compile reduction.c, and execute on 4 threads:

gcc -fopenmp -o reduction.x reduction.c
export OMP_NUM_THREADS=4
./reduction.x
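
The reduction clause gives every thread a private partial result and combines the partial results when the loop ends. A minimal sketch of the idea (an illustration, not necessarily the tarball's reduction.c):

#include <omp.h>
#include <stdio.h>

int main() {
    int i;
    long sum = 0;
    /* Each thread accumulates into its own private copy of sum;
       OpenMP adds the copies together at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= 1000; i++)
        sum += i;
    printf("sum = %ld\n", sum);  /* always 500500, regardless of the thread count */
    return 0;
}
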
  • Compile sum.c with -fopenmp and run the executable with 2 threads:

export OMP_NUM_THREADS=2
gcc -fopenmp -o sum.x sum.c
./sum.x

Modify file sum.c and comment out the line with the #pragma omp critical construct:

#include <omp.h>
#include <stdio.h>

int main() {
    double a[1000000];
    int i;
    double sum;

    sum=0;

    #pragma omp parallel for 
    for (i=0; i<1000000; i++) a[i]=i;
    #pragma omp parallel for shared (sum) private (i) 
    for ( i=0; i < 1000000; i++) {
//       #pragma omp critical 
        sum = sum + a[i];
    }
    printf("sum=%lf\n",sum);
}


Recompile it and run it several times. Notice that the results now differ between runs: without the critical section, the threads race while updating the shared variable sum.
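
As a side note, the correct result can be recovered without the serializing critical section by using a reduction clause. A sketch of sum.c rewritten that way (a suggested variation, not part of the tarball):

#include <omp.h>
#include <stdio.h>

int main() {
    double a[1000000];
    int i;
    double sum = 0;

    #pragma omp parallel for
    for (i = 0; i < 1000000; i++) a[i] = i;

    /* reduction(+:sum) gives every thread a private partial sum and adds
       the partial sums together when the loop finishes, so the result is
       correct without the serialization cost of #pragma omp critical. */
    #pragma omp parallel for private(i) reduction(+:sum)
    for (i = 0; i < 1000000; i++) {
        sum = sum + a[i];
    }
    printf("sum=%lf\n", sum);
}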

Performance demonstration with OpenMP

Compile the code for solving the steady state heat equation, heated_plate_openmp.c:

gcc -fopenmp -o heated_plate.x heated_plate_openmp.c -lm

Open another terminal on your laptop and ssh to the cluster. Run the top command on the head node, then press the 1, H, and t keys.

While top is running, in the original terminal, run heated_plate.x with one thread:

export OMP_NUM_THREADS=1
./heated_plate.x

During the run, notice the resource utilization in top: one CPU core is utilized at 100%, essentially by a single process, heated_plate.x:

%Cpu0  :   0.3/0.3     1[                                                     ]
%Cpu1  :  99.7/0.3   100[|||||||||||||||||||||||||||||||||||||||||||||||||||||]
MiB Mem :   3731.1 total,    670.9 free,    353.9 used,   2706.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   3083.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
 224075 instruc+  20   0   18272   6052   2116 R  99.7   0.2   0:21.57 heated_+

After the run completes, it prints a wallclock time of about 49 seconds.

Increase the number of OpenMP threads to 2 and run the code again:

export OMP_NUM_THREADS=2
./heated_plate.x

During the run, top shows both CPU cores utilized at 100% by the two threads:

%Cpu0  :  99.0/1.0   100[|||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu1  :  99.3/0.7   100[|||||||||||||||||||||||||||||||||||||||||||||||||||||]
MiB Mem :   3731.1 total,    627.6 free,    357.3 used,   2746.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   3079.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
 224592 instruc+  20   0   26472   6012   2052 R  99.0   0.2   0:08.00 heated_+
 224591 instruc+  20   0   26472   6012   2052 R  97.7   0.2   0:07.98 heated_+

After the run completes, the wallclock time comes in at about 25 seconds, a speedup of roughly 49/25 ≈ 2 on two threads.

Exit top by pressing the q key.


Part II: MPI

Objective: set up the MPI environment on the cluster, then compile and run MPI applications.

Steps:

  • Install openmpi package on the head and the compute nodes.

  • Create mpiuser account on the NFS shared file system.

  • Compile and run MPI applications.

  • Examine the performance scalability of MPI applications.

Install the openmpi 4.1.5 package on the cluster

First, install the GNU C++ compiler on the head node.

On Rocky:

sudo dnf install gcc-c++

On Ubuntu:

sudo apt install g++

Download the archive with the Ansible scripts from the link below.

On Rocky:

cd
wget http://linuxcourse.rutgers.edu/LCI_2023/Lab_MPI.tgz 
tar -zxvf Lab_MPI.tgz
cd Lab_MPI/Ansible

On Ubuntu:

cd
wget http://linuxcourse.rutgers.edu/LCI_2023/Lab_MPI_ubuntu.tgz 
tar -zxvf Lab_MPI_ubuntu.tgz
cd Lab_MPI_ubuntu/Ansible

Install openmpi on the cluster via Ansible playbook:

ansible-playbook install_mpi.yml

Note: compiling and building the rpm or deb package for MPI on the head node takes a long time, so please don’t attempt it manually. After completing the procedure above, it is already installed on your cluster.


Create user mpiuser on the cluster

ansible-playbook setup_mpiuser.yml

Become mpiuser:

sudo su - mpiuser

All the MPI exercises below should be done as user mpiuser.

Set up the environment variables for MPI by placing the snippet below into mpiuser's .bashrc file:

export MPI_HOME=/opt/openmpi/4.1.5
export PATH=$PATH:$MPI_HOME/bin

Run

source .bashrc

Download applications, set up passwordless ssh, and test mpirun

Download the tarball with the MPI applications and extract it into directory MPI:

wget http://linuxcourse.rutgers.edu/LCI_2023/MPI.tgz
tar -zxvf MPI.tgz

Verify that mpirun works:

mpirun -n 8 -hostfile nodes.txt uname -n

If it works, every compute node should print its hostname twice.

File nodes.txt contains the list of the compute nodes and the number of CPU cores on each node:

compute1 slots=2
compute2 slots=2
compute3 slots=2
compute4 slots=2

Compile and run MPI applications

Compile hello.c and run it on 8 CPU cores:

mpicc -o hello.x hello.c
mpirun -n 8 -hostfile nodes.txt hello.x
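
For reference, a minimal MPI "hello" program of the kind this step runs (a sketch; the tarball's hello.c may differ in detail) looks like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes    */
    MPI_Get_processor_name(name, &len);    /* host this rank is running on */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}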

Compile b_send_receive.c and run it on 2 CPU cores:

mpicc -o b_send_receive.x b_send_receive.c
mpirun -n 2 -hostfile nodes.txt b_send_receive.x
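
b_send_receive.c demonstrates blocking point-to-point communication between two ranks; a minimal sketch of that pattern (an illustration, not the file's exact contents):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* MPI_Send blocks until the send buffer can safely be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* MPI_Recv blocks until the message has arrived. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}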

Same with nonb_send_receive.c:

mpicc -o nonb_send_receive.x nonb_send_receive.c
mpirun -n 2 -hostfile nodes.txt nonb_send_receive.x
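
The non-blocking variant uses MPI_Isend/MPI_Irecv, which return immediately and are completed later with MPI_Wait; a minimal sketch (an illustration, not the file's exact contents):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* MPI_Isend returns immediately; the transfer completes at MPI_Wait. */
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* Independent work could overlap with the communication here. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}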

Compile mpi_reduce.c and run on 8 processors:

mpicc -o mpi_reduce.x mpi_reduce.c 
mpirun -n 8 -hostfile nodes.txt mpi_reduce.x
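
MPI_Reduce is the message-passing counterpart of the OpenMP reduction used earlier: every rank contributes a value, and the chosen operation combines the values on a root rank. A minimal sketch (an illustration, not necessarily the tarball's mpi_reduce.c):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1;  /* each rank contributes its own value */

    /* Combine the local values with MPI_SUM and deliver the result to rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}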

Compile mpi_heat2D.c and run it on 4, 6, and 8 CPU cores. Notice the wallclock time difference between the runs:

mpicc -o mpi_heat2D.x mpi_heat2D.c
mpirun -n 4 -hostfile nodes.txt mpi_heat2D.x
mpirun -n 6 -hostfile nodes.txt mpi_heat2D.x
mpirun -n 8 -hostfile nodes.txt mpi_heat2D.x

There should be a noticeable speedup as the number of CPU cores increases.