Lab: Introduction to OpenMP and MPI.
Part I: OpenMP¶
Objective: install compilers and OpenMP support libraries on the head node, then compile and run OpenMP applications.
Steps:
Install GCC and OpenMP support packages.
Download a tarball with OpenMP source codes.
Compile the applications and run them with different numbers of threads.
Explore the performance scalability of OpenMP computations.
Install gcc with OpenMP support (on Rocky)¶
To compile C source code, we need to install the GNU C compiler, GCC.
On Red Hat 8.7, you can install several versions of the GCC compiler, including 8.5 (the default), 9, 10, 11, and 12. We’ll install the default 8.5 and GCC 12 for demonstration purposes.
Install GCC 8.5 and GCC 12 and openmp support packages:
sudo dnf install -y gcc # to install a default gcc 8.5 compiler on Red Hat 8.7
sudo dnf install -y gcc-toolset-12-gcc # to install the latest gcc 12 on Red Hat 8.7
sudo dnf install -y libomp-devel
Switch between the installed compilers (on Rocky)¶
Verify the current GCC compiler version:
gcc -v
It should show: gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)
If we want to compile source code with GCC 12, for example, we’ll set up the compiler environment in a new shell:
scl enable gcc-toolset-12 'bash'
gcc -v
It should show: gcc version 12.1.1 20220628 (Red Hat 12.1.1-3) (GCC)
If the gcc command is invoked in this shell, the code will be compiled with GCC 12.
Now we switch back to the default compiler by exiting the shell:
exit
gcc -v
Compilation and run with OpenMP¶
Download the tarball with the OpenMP source code files and extract it:
wget https://linuxcourse.rutgers.edu/LCI_2023/OpenMP.tgz
tar -zxvf OpenMP.tgz
Change into the OpenMP directory:
cd OpenMP
Compile hello.c with OpenMP support:
gcc -fopenmp -o hello.x hello.c
Run the ldd command on the generated executable to see the dynamic libraries it uses:
ldd hello.x
It shows the following output:
linux-vdso.so.1 (0x00007ffdc0d58000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f32521a9000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3251f89000)
libc.so.6 => /lib64/libc.so.6 (0x00007f3251bc4000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f32519c0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f32523e1000)
Notice the libgomp.so.1 and libpthread.so.0 shared object libraries. Executables compiled with OpenMP support always load them.
If we were to compile hello.c with GCC 12 on Rocky, we would either run scl enable gcc-toolset-12 'bash' first, or compile it in one step as follows:
scl enable gcc-toolset-12 'gcc -fopenmp -o hello.x hello.c'
Now run the executable hello.x and see what happens:
./hello.x
The output shows two threads. By default, when you run an executable compiled with OpenMP support, the number of threads equals the number of CPU cores available on the system.
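The hello.c from the tarball is not reproduced here, but a minimal OpenMP hello program along these lines (a sketch, not necessarily the lab’s exact source) shows the same behavior:
#include <omp.h>
#include <stdio.h>

int main() {
    /* Each thread in the parallel region prints its own ID and the team size */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}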
Define environment variable OMP_NUM_THREADS=4, then run the executable again:
export OMP_NUM_THREADS=4
./hello.x
Compile for.c and run the executable for.x several times:
gcc -fopenmp -o for.x for.c
./for.x
./for.x
Notice the order of the thread output changing.
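For reference, a minimal parallel-for sketch (an illustration, not necessarily the lab’s for.c) that exhibits the same nondeterministic ordering:
#include <omp.h>
#include <stdio.h>

int main() {
    int i;
    /* Iterations are divided among the threads; the order of the printed
       lines depends on thread scheduling and changes between runs */
    #pragma omp parallel for
    for (i = 0; i < 8; i++)
        printf("iteration %d handled by thread %d\n", i, omp_get_thread_num());
    return 0;
}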
Compile sections.c:
gcc -fopenmp -o sections.x sections.c
Set the number of threads to 2, and run the executable:
export OMP_NUM_THREADS=2
./sections.x
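For reference, a minimal sections sketch (an illustration, not necessarily the lab’s sections.c): each section is executed exactly once, by whichever thread picks it up.
#include <omp.h>
#include <stdio.h>

int main() {
    /* The two sections run concurrently on different threads when available */
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("section 1 run by thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("section 2 run by thread %d\n", omp_get_thread_num());
    }
    return 0;
}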
Compile reduction.c and run it on 4 threads:
gcc -fopenmp -o reduction.x reduction.c
export OMP_NUM_THREADS=4
./reduction.x
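For reference, a minimal reduction sketch (an illustration, not necessarily the lab’s reduction.c): the reduction clause gives each thread a private partial result and combines the partial results at the end of the loop.
#include <omp.h>
#include <stdio.h>

int main() {
    int i, sum = 0;
    /* Each thread accumulates its own partial sum; OpenMP adds them together
       when the loop finishes */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= 100; i++)
        sum += i;
    printf("sum = %d\n", sum);   /* always 5050, regardless of the thread count */
    return 0;
}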
Compile sum.c with -fopenmp and run the executable with 2 threads:
export OMP_NUM_THREADS=2
gcc -fopenmp -o sum.x sum.c
./sum.x
Modify the file sum.c and comment out the line with the #pragma omp critical construct:
#include <omp.h>
#include <stdio.h>

int main() {
    double a[1000000];
    int i;
    double sum;

    sum = 0;
    #pragma omp parallel for
    for (i = 0; i < 1000000; i++) a[i] = i;

    #pragma omp parallel for shared(sum) private(i)
    for (i = 0; i < 1000000; i++) {
        // #pragma omp critical
        sum = sum + a[i];
    }
    printf("sum=%lf\n", sum);
}
Recompile it and run it several times. Notice that the results differ between runs: without the critical section, the threads update sum concurrently and create a race condition.
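One way to remove the race without the critical section is the reduction clause from the previous exercise: changing the second loop’s directive to #pragma omp parallel for reduction(+:sum) gives each thread a private partial sum and combines the partial sums at the end, so the result is correct on any number of threads.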
Performance demonstration with OpenMP¶
Compile the code for solving the steady-state heat equation, heated_plate_openmp.c:
gcc -fopenmp -o heated_plate.x heated_plate_openmp.c -lm
Open another terminal on your laptop and ssh to the cluster.
Run the top command on the head node, then press the 1, H, and t keys (show each CPU on its own line, display threads, and change the CPU summary style).
While top is running, run heated_plate.x with one thread in the original terminal:
export OMP_NUM_THREADS=1
./heated_plate.x
While it runs, notice the resource utilization in top: one CPU core is utilized at 100%, essentially by the single heated_plate.x process:
%Cpu0 : 0.3/0.3 1[ ]
%Cpu1 : 99.7/0.3 100[|||||||||||||||||||||||||||||||||||||||||||||||||||||]
MiB Mem : 3731.1 total, 670.9 free, 353.9 used, 2706.3 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 3083.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
224075 instruc+ 20 0 18272 6052 2116 R 99.7 0.2 0:21.57 heated_+
After the run completes, it prints a wallclock time of about 49 seconds.
Increase the number of OpenMP threads to 2 and run the code again:
export OMP_NUM_THREADS=2
./heated_plate.x
While it runs, top shows both CPU cores utilized at 100% by the two threads:
%Cpu0 : 99.0/1.0 100[|||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu1 : 99.3/0.7 100[|||||||||||||||||||||||||||||||||||||||||||||||||||||]
MiB Mem : 3731.1 total, 627.6 free, 357.3 used, 2746.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 3079.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
224592 instruc+ 20 0 26472 6012 2052 R 99.0 0.2 0:08.00 heated_+
224591 instruc+ 20 0 26472 6012 2052 R 97.7 0.2 0:07.98 heated_+
After the run completes, the wallclock time is about 25 seconds.
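Going from one thread to two therefore gives a speedup of roughly 49 s / 25 s ≈ 1.96, close to the ideal factor of 2 for this computation.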
Exit top by pressing the q key.
Part II: MPI¶
Objective: set up the MPI environment on the cluster, then compile and run MPI applications.
Steps:
Install the openmpi package on the head and compute nodes.
Create an mpiuser account on the NFS-shared file system.
Compile and run MPI applications.
Explore the performance scalability of MPI applications.
Install openmpi 4.1.5 package on the cluster¶
First, install GNU C++ on the head node.
On Rocky:
sudo dnf install gcc-c++
On Ubuntu:
sudo apt install g++
Download the archive with the Ansible scripts and extract it.
On Rocky:
cd
wget http://linuxcourse.rutgers.edu/LCI_2023/Lab_MPI.tgz
tar -zxvf Lab_MPI.tgz
cd Lab_MPI/Ansible
On Ubuntu:
cd
wget http://linuxcourse.rutgers.edu/LCI_2023/Lab_MPI_ubuntu.tgz
tar -zxvf Lab_MPI_ubuntu.tgz
cd Lab_MPI_ubuntu/Ansible
Install openmpi on the cluster via Ansible playbook:
ansible-playbook install_mpi.yml
Note: compiling and building the rpm or deb package for MPI on the head node takes a long time, so please don’t do it yourself. OpenMPI is already installed on your cluster after completing the procedure above; the build procedures below are shown for reference only.
MPI rpm build procedure on Rocky:
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5-1.src.rpm
rpmbuild --define 'configure_options --with-pmi' --define 'install_in_opt 1' --rebuild openmpi-4.1.5-1.src.rpm
cp /home/instructor/rpmbuild/RPMS/x86_64/openmpi-4.1.5-1.el8.x86_64.rpm /home/instructor/Ansible/Files
MPI deb build procedure on Ubuntu:
./configure --prefix=/opt/openmpi/4.1.5 --with-pmi
make
echo 'OpenMPI for the cluster' | sudo checkinstall -D --nodoc
Create user mpiuser on the cluster¶
ansible-playbook setup_mpiuser.yml
Become mpiuser:
sudo su - mpiuser
All the MPI exercises below should be done as user mpiuser.
Set up the environment variables for MPI by placing the snippet below into the .bashrc file of mpiuser:
export MPI_HOME=/opt/openmpi/4.1.5
export PATH=$PATH:$MPI_HOME/bin
Then run:
source .bashrc
Download applications, set up passwordless ssh, and test mpirun¶
Download the tarball with the MPI applications and extract it; this creates the MPI directory:
wget http://linuxcourse.rutgers.edu/LCI_2023/MPI.tgz
tar -zxvf MPI.tgz
Verify that mpirun works:
mpirun -n 8 -hostfile nodes.txt uname -n
If it works, every compute node should print its hostname twice.
The file nodes.txt contains the list of compute nodes and the number of CPU cores (slots) on each node:
compute1 slots=2
compute2 slots=2
compute3 slots=2
compute4 slots=2
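With four compute nodes and two slots each, the hostfile provides 4 × 2 = 8 slots in total, which is why -n 8 starts exactly two processes per node and each hostname appears twice.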
Compile and run MPI applications¶
Compile hello.c and run it on 8 CPU cores:
mpicc -o hello.x hello.c
mpirun -n 8 -hostfile nodes.txt ./hello.x
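For reference, a minimal MPI hello program along these lines (a sketch, not necessarily the lab’s hello.c) prints one line per rank:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(name, &len);     /* node the rank runs on */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}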
Compile b_send_receive.c and run the executable on 2 CPU cores:
mpicc -o b_send_receive.x b_send_receive.c
mpirun -n 2 -hostfile nodes.txt ./b_send_receive.x
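For reference, a sketch of the blocking point-to-point pattern such a program typically uses (an illustration, not necessarily the lab’s b_send_receive.c):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* MPI_Send does not return until the buffer is safe to reuse */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* MPI_Recv blocks until a matching message arrives */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}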
Do the same with nonb_send_receive.c:
mpicc -o nonb_send_receive.x nonb_send_receive.c
mpirun -n 2 -hostfile nodes.txt ./nonb_send_receive.x
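For reference, a sketch of the non-blocking variant (an illustration, not necessarily the lab’s nonb_send_receive.c): the calls return immediately and the exchange completes in MPI_Waitall.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, sendval, recvval, other;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sendval = rank;
    other = 1 - rank;   /* assumes exactly two ranks (mpirun -n 2) */

    /* MPI_Irecv/MPI_Isend return immediately; the exchange completes in
       MPI_Waitall, avoiding the deadlock risk of both ranks blocking in a send */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}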
Compile mpi_reduce.c and run it on 8 CPU cores:
mpicc -o mpi_reduce.x mpi_reduce.c
mpirun -n 8 -hostfile nodes.txt ./mpi_reduce.x
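For reference, a minimal MPI_Reduce sketch (an illustration, not necessarily the lab’s mpi_reduce.c): every rank contributes a value and rank 0 receives the sum.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1;   /* each rank contributes its own value */

    /* MPI_Reduce combines the local values with MPI_SUM onto rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}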
Compile mpi_heat2D.c and run it on 4, 6, and 8 CPU cores. Notice the wallclock time difference between the runs:
mpicc -o mpi_heat2D.x mpi_heat2D.c
mpirun -n 4 -hostfile nodes.txt ./mpi_heat2D.x
mpirun -n 6 -hostfile nodes.txt ./mpi_heat2D.x
mpirun -n 8 -hostfile nodes.txt ./mpi_heat2D.x
There is a noticeable speedup as the number of CPU cores increases.