Lab: Build a Cluster: Run Application via Scheduler
Objective: learn the SLURM commands to submit, monitor, and terminate computational jobs, and to check accounting information for completed jobs.
Steps:
Create accounts and users in SLURM
Browse the cluster resources with sinfo
Resource allocation via salloc for application runs
Using srun for interactive runs
sbatch to submit job scripts
Terminate a job with scancel
Concurrently submitted jobs and resources
Job monitoring with the squeue and sstat commands
Job statistics with the sacct command
Command scontrol show exercises
1. Create cluster account and users in SLURM
For correct accounting, association, and resource assignments, users and accounts should be created in SLURM.
Accounts in SLURM play a role similar to POSIX groups in Linux.
We create the account (group) lci2023:
sudo sacctmgr -i add account lci2023 Description="LCI 2023 workshop"
We create users mpiuser and instructor and assign them to cluster “cluster” and account (group) lci2023:
sudo sacctmgr -i create user name=mpiuser cluster=cluster account=lci2023
sudo sacctmgr -i create user name=instructor cluster=cluster account=lci2023
Check the accounts and users:
sacctmgr list associations format=user,account
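The output should list both users under the lci2023 account; an illustrative example (the exact associations depend on what has already been created on your cluster):
      User    Account
---------- ----------
      root       root
   mpiuser    lci2023
instructor    lci2023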
2. Cluster resource monitoring
To see what nodes are allocated, used, and idle (available), run command sinfo:
sinfo
To see the details about available and allocated resources on the nodes:
sinfo -N -l
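sinfo can also print a custom selection of fields; a hedged example (the %n, %c, %m, and %t specifiers are standard sinfo format options) showing hostname, CPU count, memory, and node state:
sinfo -N -o "%n %c %m %t"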
To see running and pending jobs in the queue:
squeue
3. Cluster resource allocation with salloc
OpenMP run:
Copy directory OpenMP into the mpiuser home directory:
sudo cp -r OpenMP ~mpiuser/OpenMP
sudo chown -R mpiuser:mpiuser ~mpiuser/OpenMP
Become user mpiuser:
sudo su - mpiuser
Run command salloc to allocate 2 CPU cores on any of the nodes in the cluster:
salloc -n 2
See the allocated resources:
sinfo
squeue
sinfo -N -l
Step into directory OpenMP, set up 2 threads for the run, then run heated_plate.x:
cd OpenMP
export OMP_NUM_THREADS=2
./heated_plate.x
After the run finishes, exit from salloc, and check the resources:
exit
sinfo
Check the job accounting info for mpiuser:
sacct -u mpiuser
It should show one completed job.
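For reference, salloc accepts more specific requests as well; a hedged example (standard salloc options; the node name compute2 and the time limit are only illustrative) that asks for 2 cores on a particular node for 30 minutes:
salloc -n 2 -w compute2 -t 00:30:00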
4. Using command srun
Execute srun to allocate two CPUs on the cluster and get a shell on a compute node:
srun -n 2 --pty bash
Notice that you get a shell on one of the compute nodes. Now you can run interactive applications.
export OMP_NUM_THREADS=2
./heated_plate.x
See how many CPUs the run is utilizing. One CPU is dedicated to bash itself.
Exit from srun:
exit
MPI runs:
Change directory to MPI, and run mpi_heat2D.x through srun with 4, 6, and 8 processes:
cd
cd MPI
srun -n 4 mpi_heat2D.x
srun -n 6 mpi_heat2D.x
srun -n 8 mpi_heat2D.x
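srun can also control how tasks are placed on nodes; a hedged example (standard srun options) that runs 8 tasks as 4 per node on 2 nodes:
srun -N 2 --ntasks-per-node=4 mpi_heat2D.x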
Run command sacct to check out the job accounting:
sacct -u mpiuser
5. Using submit scripts and command sbatch
In directory MPI, check out the submit script mpi_batch.sh:
#!/bin/bash
#SBATCH --job-name=MPI_test_case
#SBATCH --ntasks-per-node=2
#SBATCH --nodes=4
#SBATCH --partition=lcilab
mpirun mpi_heat2D.x
Notice that mpirun is given neither the number of processes nor a hosts file.
SLURM takes care of the CPU and node allocation for mpirun through its environment variables.
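To see some of the variables SLURM sets for a job, you could add a few echo lines to the script before the mpirun line; a hedged sketch (the variable names are standard SLURM job environment variables):
echo "Allocated nodes:  $SLURM_JOB_NODELIST"
echo "Number of nodes:  $SLURM_JOB_NUM_NODES"
echo "Total tasks:      $SLURM_NTASKS"
echo "Tasks per node:   $SLURM_NTASKS_PER_NODE"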
Submit the script to run with command sbatch:
sbatch mpi_batch.sh
Run command squeue to see the running job:
squeue
Copy the submit script mpi_batch.sh into mpi_srun.sh:
cp mpi_batch.sh mpi_srun.sh
Edit the new submit script: replace mpirun with srun, and change --nodes=4 to --nodes=2.
The modified submit script, mpi_srun.sh, should look as follows:
#!/bin/bash
#SBATCH --job-name=MPI_test_case
#SBATCH --ntasks-per-node=2
#SBATCH --nodes=2
#SBATCH --partition=lcilab
srun mpi_heat2D.x
Submit the job to run on the cluster:
sbatch mpi_srun.sh
Check out the stdout output file, slurm-<job_id>.out
OpenMP runs:
Step into directory OpenMP:
cd
cd Application/OpenMP
Check out the submit script openmp_batch.sh. It uses SLURM environment variables and a local scratch directory.
I/O to the node-local scratch directory is faster than I/O to the NFS-shared file system.
#!/bin/bash
#SBATCH --job-name=OMP_run
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=lcilab
#SBATCH --ntasks-per-node=2
myrun=heated_plate.x # executable to run
export OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE # assign the number of threads
MYHDIR=$SLURM_SUBMIT_DIR # directory with input/output files
MYTMP="/tmp/$USER/$SLURM_JOB_ID" # local scratch directory on the node
mkdir -p $MYTMP # create scratch directory on the node
cp $MYHDIR/$myrun $MYTMP # copy the executable and input files into the scratch
cd $MYTMP # step into the scratch dir, and run tasks in the scratch
./$myrun > run.out-$SLURM_JOB_ID
cp $MYTMP/run.out-$SLURM_JOB_ID $MYHDIR # copy the output file back into the submit dir
rm -rf $MYTMP # remove the scratch directory
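A hedged refinement, not part of the original script: a bash trap set right after the mkdir line would remove the scratch directory even if the run fails partway, making the final rm -rf a safety net rather than the only cleanup:
trap 'rm -rf $MYTMP' EXIT   # clean up the scratch directory on any script exit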
Submit the script to run on the cluster via command sbatch:
sbatch openmp_batch.sh
After the job completes, check out the content of the output file run.out-<jobid> and the stdout file slurm.out
6. Terminate a job with command scancel
Submit the OpenMP job with sbatch to run on node compute2. Check out its status with command squeue.
Terminate the job with command scancel:
sbatch -w compute2 openmp_batch.sh
squeue
scancel <jobid>
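scancel can also target jobs by user or by job name; hedged examples (the -u and --name options are standard scancel flags; the job name comes from the openmp_batch.sh script above):
scancel -u mpiuser
scancel --name=OMP_run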
7. Concurrently submitted jobs
If resources are unavailable, jobs will stay in the queue.
Launch srun with bash, requesting 2 tasks:
srun -n 2 --pty bash
Submit mpi_batch.sh:
sbatch mpi_batch.sh
Run command squeue. It should show that the last job is waiting in the queue due to unavailable resources, namely the CPU cores:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
31 lcilab MPI_test mpiuser PD 0:00 4 (Resources)
30 lcilab bash mpiuser R 1:38 1 compute1
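While a job is pending, squeue can also report an estimated start time; a hedged example (--start is a standard squeue option, though the estimate may be blank on a small test cluster):
squeue --start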
Exit from srun, and run squeue again. The MPI job should begin running.
8. Command sstat
Compile poisson_mpi.c:
mpicc -o poisson_mpi.x poisson_mpi.c
To see the resource utilization of running jobs, use command sstat.
Submit poisson_batch.sh:
sbatch poisson_batch.sh
Run sstat to see the resource utilization of the running job:
sstat -j <jobid>
Selected and formatted output for a job with, for example, jobid 30:
sstat -p --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 30
AveCPU|AvePages|AveRSS|AveVMSize|JobID|
00:02:30|7|57000K|65404K|30.0|
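Other fields can be requested the same way; a hedged example (the field names are listed by sstat --helpformat) showing the peak memory use of the running step:
sstat -p --format=JobID,MaxRSS,MaxVMSize,NTasks -j <jobid>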
9. Job accounting information
Command sacct can give formatted output.
To get help on the format options, run:
sacct --helpformat
For example, to get the Job ID, job name, state, start date-time, and end date-time for a job with JobID 8:
sacct -j 8 --format="JobID,JobName,State,Start,End"
Let’s review the accounting info for an MPI job previously submitted with script mpi_batch.sh.
Assuming the JobID was 5:
sacct -j 5 --format="JobID,JobName,State,NodeList,CPUTime"
The output should be similar to the following:
JobID JobName State NodeList CPUTime
------------ ---------- ---------- --------------- ----------
5 MPI_test_+ COMPLETED compute[1-4] 00:00:48
5.batch batch COMPLETED compute1 00:00:12
5.0 orted COMPLETED compute[2-4] 00:00:30
The first line shows the total summary for the job with JobID 5.
The second line shows the first step summary, for the batch script submission.
The third line shows the main part related to the mpirun (orted).
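Additional fields can be requested in the same way; a hedged example (the field names are listed by sacct --helpformat) that adds elapsed time, allocated CPUs, peak memory, and exit code:
sacct -j 5 --format="JobID,JobName,Elapsed,AllocCPUS,MaxRSS,ExitCode"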
10. Command scontrol
It allows you to read information about the SLURM configuration, compute nodes, and running jobs, and to commit modifications and updates at runtime.
For example, to see the information about SLURM configuration:
scontrol show config
To get the info about a compute node, for example compute2:
scontrol show node compute2
To see detailed information about a submitted job, say with jobid 12:
scontrol show job 12
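Since scontrol can also modify jobs at runtime, hedged examples (standard scontrol subcommands; the new time limit is only illustrative) for holding, releasing, and updating a pending job would be:
scontrol hold 12
scontrol release 12
scontrol update JobId=12 TimeLimit=00:30:00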
Submit another openmp_batch.sh job, and check its information with:
scontrol show job <jobid>