Lab: Build a Cluster: Run Application via Scheduler

Objective: learn the SLURM commands to submit, monitor, and terminate computational jobs, and to check accounting information for completed jobs.

Steps:

  • Create accounts and users in SLURM

  • Browse the cluster resources with sinfo

  • Resource allocation via salloc for application runs

  • Using srun for interactive runs

  • sbatch to submit job scripts

  • Terminate a job with scancel

  • Concurrently submitted jobs and resources

  • Job monitoring with squeue and sstat commands

  • Job statistics with sacct commands

  • Command scontrol show exercises

1. Create cluster account and users in SLURM

For correct accounting, association, and resource assignments, users and accounts should be created in SLURM.

Accounts in SLURM play a role similar to POSIX groups in Linux.

We create the account (group) lci2023:

sudo sacctmgr -i add account lci2023 Description="LCI 2023 workshop"

We create users mpiuser and instructor and assign them to cluster “cluster” and account (group) lci2023:

sudo sacctmgr -i create user name=mpiuser cluster=cluster account=lci2023
sudo sacctmgr -i create user name=instructor cluster=cluster account=lci2023

Check the accounts and users:

sacctmgr list associations format=user,account
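The listing can also be produced in a machine-readable form: sacctmgr's -n (no header) and -p (pipe-separated) flags make the output easy to script against. A minimal sketch, with a sample record standing in for real cluster output:

```shell
# Parse one record of `sacctmgr -n -p list associations format=user,account`.
# The record below is illustrative only; on the cluster it would come from
# the sacctmgr command itself.
record='mpiuser|lci2023|'
user=$(echo "$record" | cut -d'|' -f1)
account=$(echo "$record" | cut -d'|' -f2)
echo "user=$user account=$account"
```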

2. Cluster resources monitoring

To see which nodes are allocated, used, and idle (available), run command sinfo:

sinfo

To see the details about available and allocated resources on the nodes:

sinfo -N -l

To see running and pending jobs in the queue:

squeue

3. Cluster resource allocation with salloc

OpenMP run:

Copy directory OpenMP into the home directory of mpiuser:

sudo cp -r OpenMP ~mpiuser/OpenMP
sudo chown -R mpiuser:mpiuser ~mpiuser/OpenMP

Become user mpiuser:

sudo su - mpiuser

Run command salloc to allocate 2 CPU cores on any of the nodes in the cluster:

salloc -n 2

See the allocated resources:

sinfo
squeue 
sinfo -N -l

Step into directory OpenMP, set up 2 threads for the run, then run heated_plate.x:

cd OpenMP
export OMP_NUM_THREADS=2
./heated_plate.x 

After the run finishes, exit from salloc, and check the resources:

exit
sinfo

Check the job accounting info for mpiuser:

sacct -u mpiuser

It should show one completed job.

4. Using command srun

Execute srun to allocate two CPUs on the cluster and get a shell on a compute node:

srun -n 2 --pty bash

Notice that you get a shell on one of the compute nodes. Now you can run interactive applications.

export OMP_NUM_THREADS=2
./heated_plate.x 

See how many CPUs the run is utilizing. One CPU is dedicated to bash itself.

Exit from srun:

exit

MPI runs:

Change directory to MPI, and run mpi_heat2D.x through srun with 4, 6, and 8 processes:

cd 
cd MPI
srun -n 4  mpi_heat2D.x
srun -n 6  mpi_heat2D.x
srun -n 8  mpi_heat2D.x

Run command sacct to check out the job accounting:

sacct -u mpiuser

5. Using submit scripts and command sbatch

In directory MPI, check out submit script mpi_batch.sh:

#!/bin/bash

#SBATCH --job-name=MPI_test_case
#SBATCH --ntasks-per-node=2
#SBATCH --nodes=4
#SBATCH --partition=lcilab

mpirun  mpi_heat2D.x

Notice that mpirun is given neither the number of processes nor a hosts file: SLURM takes care of the CPU and node allocation for mpirun through its environment variables.
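These variables can be inspected from inside a job script. A minimal sketch; outside SLURM the variables are unset, so the :- defaults below (mirroring the 4-node x 2-task request above) are illustrative stand-ins:

```shell
# The allocation info SLURM exports to a job; a SLURM-aware mpirun reads
# these instead of -np and a hosts file. The :- defaults are illustrative
# stand-ins used when the script runs outside a SLURM job.
ntasks=${SLURM_NTASKS:-8}                      # --nodes x --ntasks-per-node
nodelist=${SLURM_JOB_NODELIST:-'compute[1-4]'}
submitdir=${SLURM_SUBMIT_DIR:-$PWD}
echo "tasks=$ntasks nodes=$nodelist submitted from $submitdir"
```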

Submit the script to run with command sbatch:

sbatch mpi_batch.sh

Run command squeue to see the running job:

squeue

Copy the submit script, mpi_batch.sh, into mpi_srun.sh:

cp mpi_batch.sh  mpi_srun.sh

Edit the new submit script: replace mpirun with srun, and change --nodes=4 to --nodes=2. The modified submit script, mpi_srun.sh, should look as follows:

#!/bin/bash

#SBATCH --job-name=MPI_test_case
#SBATCH --ntasks-per-node=2
#SBATCH --nodes=2
#SBATCH --partition=lcilab

srun  mpi_heat2D.x

Submit the job to run on the cluster:

sbatch mpi_srun.sh

Check out the stdout file, slurm-<job_id>.out

OpenMP runs:

Step into directory OpenMP:

cd 
cd OpenMP

Check out submit script openmp_batch.sh. It uses SLURM environment variables and a scratch directory. I/O to the node-local scratch directory runs faster than to the NFS-shared file system.

#!/bin/bash

#SBATCH --job-name=OMP_run
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=lcilab
#SBATCH --ntasks-per-node=2


myrun=heated_plate.x                          # executable to run

export OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE  # assign the number of threads
MYHDIR=$SLURM_SUBMIT_DIR            # directory with input/output files 
MYTMP="/tmp/$USER/$SLURM_JOB_ID"    # local scratch directory on the node
mkdir -p $MYTMP                     # create scratch directory on the node  
cp $MYHDIR/$myrun  $MYTMP           # copy the executable and input files into the scratch
cd $MYTMP                           # step into the scratch dir, and run in the scratch 

./$myrun > run.out-$SLURM_JOB_ID

cp $MYTMP/run.out-$SLURM_JOB_ID  $MYHDIR     # copy the output file back into the submit dir
rm -rf  $MYTMP                               # remove the scratch directory

Submit the script to run on the cluster via command sbatch:

sbatch openmp_batch.sh

After the job completes, check out the content of the output file, run.out-<jobid>, and the stdout file slurm.out
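The copy-in / run / copy-back / clean-up pattern from openmp_batch.sh can be tried outside SLURM. In this sketch mktemp stands in for the node-local scratch path, and an echo stands in for the application (both are placeholders):

```shell
# Scratch-directory pattern from openmp_batch.sh, runnable without SLURM.
jobid=${SLURM_JOB_ID:-0}               # placeholder job id outside SLURM
homedir=$PWD                           # stands in for $SLURM_SUBMIT_DIR
scratch=$(mktemp -d)                   # stands in for /tmp/$USER/$SLURM_JOB_ID
cd "$scratch"
echo "result of run $jobid" > "run.out-$jobid"   # stands in for ./heated_plate.x
cd "$homedir"
cp "$scratch/run.out-$jobid" "$homedir"  # copy the output back
rm -rf "$scratch"                        # remove the scratch directory
```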

6. Terminate a job with command scancel

Submit the OpenMP job with sbatch to run on node compute2. Check out its status with command squeue. Terminate the job with command scancel:

sbatch -w compute2 openmp_batch.sh
squeue
scancel <jobid>

7. Concurrently submitted jobs

If resources are unavailable, jobs will stay in the queue.

Launch srun with bash, requesting 2 tasks:

srun -n 2 --pty bash

Submit mpi_batch.sh:

sbatch  mpi_batch.sh

Run command squeue. It should show that the last job is waiting in the queue because resources, namely CPU cores, are unavailable:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                31    lcilab MPI_test  mpiuser PD       0:00      4 (Resources)
                30    lcilab     bash  mpiuser  R       1:38      1 compute1

Exit from srun, and run squeue again. The MPI job should begin running.
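Instead of rerunning squeue by hand, a small polling loop can watch for the PD -> R transition. In this sketch a stub function plays the role of `squeue -h -j <jobid> -o %t` (reporting PD twice, then R), so the loop itself runs anywhere:

```shell
# Poll a job's state until it leaves PD (pending). job_state is a stub for
# querying squeue; it imitates a job that starts once CPU cores free up.
polls=0
job_state() {
    polls=$((polls + 1))
    if [ "$polls" -le 2 ]; then
        state=PD                      # still waiting for resources
    else
        state=R                       # resources freed, job running
    fi
}
state=PD
while [ "$state" = "PD" ]; do
    job_state
    echo "poll $polls: state=$state"
done
echo "job started after $polls polls"
```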

8. Command sstat

Compile poisson_mpi.c:

mpicc -o poisson_mpi.x poisson_mpi.c

To see the resource utilization of running jobs, use command sstat.

Submit poisson_batch.sh:

sbatch poisson_batch.sh

Run sstat to see the resource utilization of the running job:

sstat -j <jobid>

Selected and formatted output for a job with, for example, JobID 30:

sstat -p --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 30
AveCPU|AvePages|AveRSS|AveVMSize|JobID|
00:02:30|7|57000K|65404K|30.0|

9. Job accounting information

Command sacct can give formatted output.

To get help on the format options, run:

sacct --helpformat

For example, to get Job ID, Job name, Exit state, start date-time, and end date-time for a job with JobID 8:

sacct -j 8 --format="JobID,JobName,State,Start,End"

Let’s review the accounting info for an MPI job previously submitted with script mpi_batch.sh. Assuming the JobID was 5:

sacct -j 5 --format="JobID,JobName,State,NodeList,CPUTime"

The output should be similar to below:

JobID           JobName      State        NodeList    CPUTime 
------------ ---------- ---------- --------------- ---------- 
5            MPI_test_+  COMPLETED    compute[1-4]   00:00:48 
5.batch           batch  COMPLETED        compute1   00:00:12 
5.0               orted  COMPLETED    compute[2-4]   00:00:30 

The first line shows the total summary for the job with JobID 5.

The second line shows the first step summary, for the batch script submission.

The third line shows the main part related to the mpirun (orted).
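The CPUTime column is elapsed wall time multiplied by the number of allocated CPUs, so these figures can be sanity-checked by hand once converted to seconds. A minimal sketch (to_seconds is a hypothetical helper, not a SLURM command):

```shell
# Convert a sacct time value (HH:MM:SS) to seconds for quick arithmetic,
# e.g. CPUTime 00:00:48 across 8 CPUs means 48 / 8 = 6 s of elapsed time.
to_seconds() {
    h=${1%%:*}
    rest=${1#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # strip one leading zero so fields like 08 are not read as octal
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}
echo "$(to_seconds 00:00:48) seconds"    # prints: 48 seconds
```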

10. Command scontrol

It lets you read information about the SLURM configuration, compute nodes, and running jobs, and commit modifications and updates at runtime.

For example, to see the information about SLURM configuration:

scontrol show config

To get the info about a compute node, for example compute2:

scontrol show node compute2

To see detailed information about a submitted job, say with JobID 12:

scontrol show job 12

Submit another openmp_batch.sh job, and check its information with:

scontrol show job <jobid>
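scontrol prints space-separated Key=Value pairs, which plain shell can pick apart for scripting. A minimal sketch; the sample line below is illustrative only and would normally come from `scontrol show job <jobid>`:

```shell
# Pull selected fields out of a scontrol-style Key=Value line. The sample
# line stands in for real `scontrol show job` output (illustrative only).
line='JobId=12 JobName=OMP_run JobState=RUNNING NumNodes=1'
for kv in $line; do                   # unquoted: split on whitespace
    case $kv in
        JobName=*)  jobname=${kv#*=} ;;
        JobState=*) jobstate=${kv#*=} ;;
    esac
done
echo "$jobname is $jobstate"          # prints: OMP_run is RUNNING
```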