Lab - Build a Cluster: install Scheduler¶
Objective: install and configure the SLURM scheduler on the head and compute nodes using an Ansible playbook.
Steps:
Create SLURM user & group.
Create directories for SLURM services and log files.
Build and install the SLURM packages on the head node.
Set up munge on the head node.
Create a MySQL database for SLURM accounting on the head node.
Start slurmdbd and slurmctld on the head node.
Copy the munge key to the compute nodes.
Install the SLURM packages on the compute nodes.
Start slurmd on the compute nodes.
Create SLURM user, group, and directories¶
The SLURM services, including slurmctld, slurmdbd, and slurmd, should run as user slurm.
Therefore, we need to create the group and account slurm on all the nodes.
The SLURM services need the following directories, owned by user slurm:
/var/spool/slurm
/var/spool/slurmctld
/var/spool/slurm/cluster_state
/var/log/slurm
In your build.yml playbook, add the slurm-user role:
- slurm-user
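The tasks inside the slurm-user role might look roughly like the sketch below. The UID/GID values and directory mode are assumptions, not the exact contents of the provided role; whatever values you use, keep the slurm UID/GID identical on all nodes.

# roles/slurm-user/tasks/main.yml -- illustrative sketch only
- name: Create the slurm group
  ansible.builtin.group:
    name: slurm
    gid: 982              # assumed GID; pick one that is free on every node
    state: present

- name: Create the slurm user
  ansible.builtin.user:
    name: slurm
    uid: 982              # assumed UID; must match on head and compute nodes
    group: slurm
    shell: /sbin/nologin
    home: /var/spool/slurm
    state: present

- name: Create the SLURM spool and log directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: slurm
    group: slurm
    mode: "0755"          # assumed mode
  loop:
    - /var/spool/slurm
    - /var/spool/slurmctld
    - /var/spool/slurm/cluster_state
    - /var/log/slurm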
Run the playbook:
ansible-playbook build.yml
Check if user slurm has been created:
ansible all -m ansible.builtin.shell -a 'id slurm'
Build and install SLURM on the head node¶
The latest SLURM source can be downloaded from schedmd.com, compiled, and packaged into RPM or DEB packages on the head node, then installed.
The head node already has all the development tools for that.
Add the role for the SLURM package build. On a Rocky head node:
- slurm-rpm-build
On an Ubuntu head node:
- slurm_build
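A build role such as slurm-rpm-build typically follows the rpmbuild -ta workflow documented by SchedMD. The sketch below is not the exact content of the provided role; the slurm_version variable, download URL, and output path are assumptions.

# roles/slurm-rpm-build/tasks/main.yml -- rough sketch; version and paths are assumptions
- name: Download the SLURM source tarball
  ansible.builtin.get_url:
    url: "https://download.schedmd.com/slurm/slurm-{{ slurm_version }}.tar.bz2"
    dest: "/tmp/slurm-{{ slurm_version }}.tar.bz2"

- name: Build the SLURM RPMs from the tarball
  ansible.builtin.command: rpmbuild -ta /tmp/slurm-{{ slurm_version }}.tar.bz2

- name: Install the resulting RPMs on the head node
  # not idempotent as written; good enough for a first pass
  ansible.builtin.shell: dnf -y install /root/rpmbuild/RPMS/x86_64/slurm*.rpm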
Run the build.yml playbook:
ansible-playbook build.yml
Set up munge on the head node¶
Munge is used for SLURM service authentication between the head and compute nodes.
munge description:
Munge is an authentication service for creating and validating credentials. It is designed to be highly scalable for use in an HPC cluster environment. It allows a process to authenticate the UID and GID of another local or remote process within a group of hosts having common users and groups. These hosts form a security realm that is defined by a shared cryptographic key. Clients within this security realm can create and validate credentials without the use of root privileges, reserved ports, or platform-specific methods.
Munge can be installed with dnf or apt.
The key can be generated by the command below:
/usr/bin/dd if=/dev/urandom bs=1 count=1024 of=munge.key
The key is then copied into the /etc/munge directory, assigned munge ownership, and given 400 permissions.
This is accomplished by the head-node_munge_key role in the playbook:
- head-node_munge_key
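A minimal sketch of what head-node_munge_key might do is shown below. The package task is an assumption (munge may already be installed by an earlier role), and here the key is generated directly in /etc/munge rather than generated elsewhere and copied.

# roles/head-node_munge_key/tasks/main.yml -- illustrative sketch
- name: Install munge
  ansible.builtin.package:
    name: munge
    state: present

- name: Generate a munge key if one does not exist yet
  ansible.builtin.command: /usr/bin/dd if=/dev/urandom bs=1 count=1024 of=/etc/munge/munge.key
  args:
    creates: /etc/munge/munge.key

- name: Restrict the key to the munge user
  ansible.builtin.file:
    path: /etc/munge/munge.key
    owner: munge
    group: munge
    mode: "0400"

- name: Enable and start munge
  ansible.builtin.service:
    name: munge
    state: started
    enabled: yes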
Apply it by running the playbook:
ansible-playbook build.yml
Check if munge is running on the cluster:
ansible all -m ansible.builtin.shell -a 'systemctl status munge'
Create a MySQL database for SLURM accounting¶
For SLURM to store associations, accounts, QOS, job accounting records, etc., it needs the slurm_acct_db database created in MySQL.
Database setup procedure:
systemctl restart mariadb
mysql -u root -p
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('lcilab2023');
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
Ctrl-d
systemctl restart slurmdbd
systemctl restart slurmctld
/usr/bin/sacctmgr -i add cluster cluster
Ansible playbook entry for the head node play:
- slurmdbd
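The slurmdbd role could wrap the same procedure with Ansible modules, roughly as sketched below. This assumes the community.mysql collection and the PyMySQL Python bindings are present on the head node, and the socket path shown is the MariaDB default on Rocky; none of this is guaranteed to match the provided role.

# roles/slurmdbd/tasks/main.yml -- rough sketch; collection, socket path, and password handling assumed
- name: Make sure mariadb is running
  ansible.builtin.service:
    name: mariadb
    state: started
    enabled: yes

- name: Create the slurm database user with privileges on slurm_acct_db
  community.mysql.mysql_user:
    name: slurm
    host: localhost
    password: lcilab2023
    priv: "slurm_acct_db.*:ALL"
    state: present
    login_unix_socket: /var/lib/mysql/mysql.sock

- name: Restart slurmdbd and slurmctld
  ansible.builtin.service:
    name: "{{ item }}"
    state: restarted
    enabled: yes
  loop:
    - slurmdbd
    - slurmctld

- name: Register the cluster in the accounting database
  # rerunning reports that the cluster already exists, hence ignore_errors
  ansible.builtin.command: /usr/bin/sacctmgr -i add cluster cluster
  ignore_errors: yes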
Run the playbook:
ansible-playbook build.yml
Copy munge key and install the SLURM packages on the compute nodes¶
Copy the munge key to the compute nodes.
Copy the RPM/DEB SLURM packages to the compute nodes and install them.
Copy the configuration files into /etc/slurm.
Start slurmd on the nodes.
The roles that accomplish these tasks:
- compute-node_munge_key
- slurmd
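These two roles might contain tasks roughly like the sketch below (RPM-based variant). It assumes the head node is the Ansible controller, that munge itself is already installed on the compute nodes, and that the RPMs sit in the default rpmbuild output directory; adjust accordingly for the DEB-based Ubuntu build.

# roles/compute-node_munge_key/tasks/main.yml -- sketch
- name: Copy the munge key from the head node
  ansible.builtin.copy:
    src: /etc/munge/munge.key          # read from the controller, i.e. the head node
    dest: /etc/munge/munge.key
    owner: munge
    group: munge
    mode: "0400"

- name: Restart munge so it picks up the key
  ansible.builtin.service:
    name: munge
    state: restarted
    enabled: yes

# roles/slurmd/tasks/main.yml -- sketch; package and config locations are assumptions
- name: Copy the SLURM RPMs built on the head node
  ansible.builtin.copy:
    src: /root/rpmbuild/RPMS/x86_64/   # assumed rpmbuild output directory
    dest: /tmp/slurm-rpms/

- name: Install the SLURM packages
  # not idempotent as written; good enough for a first pass
  ansible.builtin.shell: dnf -y install /tmp/slurm-rpms/slurm*.rpm

- name: Copy slurm.conf from the head node
  ansible.builtin.copy:
    src: /etc/slurm/slurm.conf
    dest: /etc/slurm/slurm.conf
    mode: "0644"

- name: Enable and start slurmd
  ansible.builtin.service:
    name: slurmd
    state: started
    enabled: yes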
The final build.yml playbook on Rocky Linux:
---
####### head node play ############
- name: Head node configuration
  become: yes
  hosts: head
  connection: local
  tags: head_node_play
  roles:
    - powertools
    - timesync
    - head-node_pkg_inst
    - head-node_nfs_server
    - slurm-user
    - slurm-rpm-build
    - head-node_munge_key
    - slurmdbd
####### compute node play ########
- name: Compute node configuration
  become: yes
  hosts: all_nodes
  tags: compute_node_play
  roles:
    - powertools
    - timesync
    - compute-node_pkg_inst
    - compute-node_autofs
    - slurm-user
    - compute-node_munge_key
    - slurmd
The final build.yml playbook on Ubuntu Linux:
---
####### head node play ############
- name: Head node configuration
  become: yes
  hosts: head
  connection: local
  tags: head_node_play
  roles:
    - timesync
    - head-node_pkg_inst
    - head-node_nfs_server
    - slurm-user
    - slurm_build
    - head-node_munge_key
    - slurmdbd
####### compute node play ########
- name: Compute node configuration
  become: yes
  hosts: all_nodes
  tags: compute_node_play
  roles:
    - timesync
    - compute-node_pkg_inst
    - compute-node_autofs
    - slurm-user
    - compute-node_munge_key
    - slurmd
Run the playbook:
ansible-playbook build.yml
Check if you can run SLURM commands:
sinfo -Nl
scontrol show node compute2
Your HPC cluster is now fully functional and ready for HPC application installation.