Slurm Deployment Guide

Instructions for deploying a GPU cluster with Slurm

Requirements

Control system to run the install process
One server to act as the Slurm controller/login node
One or more servers to act as the Slurm compute nodes

Installation Steps

Install a supported operating system on all nodes.

Use a third-party provisioning solution (e.g. MAAS or Foreman) or the provided OS install container to install a supported operating system on all servers.

Set up your provisioning machine.

The setup script installs Ansible and other software on the provisioning machine, which is then used to deploy all other software to the cluster. For more information on Ansible and why we use it, consult the Ansible Guide.
```
# Install software prerequisites and copy default configuration
./scripts/setup.sh
```
Create and edit the Ansible inventory.

Ansible uses an inventory which outlines the servers in your cluster. The setup script from the previous step will copy an example inventory configuration to the config directory.

Edit the inventory:
```
# Edit inventory
# Add the Slurm controller/login host to the `slurm-master` group
# Add the Slurm worker/compute hosts to the `slurm-node` group
vi config/inventory

# (optional) Modify `config/group_vars/*.yml` to set configuration parameters
```
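
To make the group layout concrete, here is a minimal sketch of what an edited inventory might look like. The hostnames and IP addresses are hypothetical, and the `[slurm-cluster:children]` parent group is an assumption inferred from the `-l slurm-cluster` limit used later in this guide; check the example inventory copied by the setup script for the actual layout. The sketch writes to a scratch file so that `config/inventory` is not overwritten.

```shell
# Write an illustrative inventory to a scratch file (not config/inventory)
example_inventory="$(mktemp)"
cat > "$example_inventory" <<'EOF'
# Hypothetical hosts; replace names and addresses with your own
[all]
login01   ansible_host=10.0.0.10
gpu01     ansible_host=10.0.0.11
gpu02     ansible_host=10.0.0.12

[slurm-master]
login01

[slurm-node]
gpu01
gpu02

# Assumed parent group, matching the `-l slurm-cluster` limit
[slurm-cluster:children]
slurm-master
slurm-node
EOF

# List the group headers that were defined
grep '^\[' "$example_inventory"
```
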
Note: Multiple hosts can be added to the slurm-master group for high availability. You must also set slurm_enable_ha: true in config/group_vars/slurm-cluster.yml. For more information about HA Slurm deployments, see: https://slurm.schedmd.com/quickstart_admin.html#HA

Verify the configuration.
```
ansible all -m raw -a "hostname"
```

Install Slurm.

```
# NOTE: If SSH requires a password, add: `-k`
# NOTE: If sudo on remote machine requires a password, add: `-K`
# NOTE: If SSH user is different than current user, add: `-u ubuntu`
ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml
```
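
The optional flags above can be combined in one invocation. The following sketch only prints the commands for review and does not run them. The `ubuntu` user is the example value from the note above, and whether you need `-k` or `-K` depends on your SSH and sudo setup. `--list-hosts` is a standard `ansible-playbook` option that prints the hosts a play would target without changing anything, which can be a useful preview before a long install.

```shell
PLAYBOOK="playbooks/slurm-cluster.yml"
LIMIT="slurm-cluster"
SSH_USER="ubuntu"   # assumed remote user; change to match your servers

# -k prompts for the SSH password, -K prompts for the sudo password
INSTALL_CMD="ansible-playbook -l ${LIMIT} -u ${SSH_USER} -k -K ${PLAYBOOK}"

# --list-hosts prints the targeted hosts and exits without making changes
PREVIEW_CMD="ansible-playbook -l ${LIMIT} ${PLAYBOOK} --list-hosts"

echo "Preview: ${PREVIEW_CMD}"
echo "Install: ${INSTALL_CMD}"
```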

Verify Pyxis and Enroot can run GPU jobs across all nodes.

```
# NOTE: This will use Pyxis to download a container and verify GPU functionality across all compute nodes
ansible-playbook -l slurm-cluster playbooks/slurm-validation.yml -e '{num_gpus: 1}'
```
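
If you also want to spot-check a node by hand, a sketch along the following lines can be run on the login node. It assumes that GPUs are exposed to Slurm as a `gpu` GRES and that the compute nodes can pull the public `ubuntu` image. `--container-image` is the option Pyxis adds to `srun`. The guard simply skips the checks when `srun` is not available on the current machine.

```shell
# Run on the Slurm login node after the validation playbook completes
if command -v srun >/dev/null 2>&1; then
  # Node-oriented listing of partitions and node states
  sinfo -N -l
  # Request one GPU on one node and list the GPUs visible to the job
  srun -N 1 --gres=gpu:1 nvidia-smi -L
  # Same request inside a container pulled through Pyxis/Enroot
  srun -N 1 --gres=gpu:1 --container-image=ubuntu grep PRETTY_NAME /etc/os-release
  status="checked"
else
  echo "srun not found: run this on the Slurm login node" >&2
  status="skipped"
fi
```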

Using Slurm

Now that Slurm is installed, try a "Hello World" example using MPI.

Read through the Slurm usage guide and the Open OnDemand guide for more information.

Node Health Check

The default Slurm deployment includes setting up Node Health Check (NHC). This tool runs periodically on idle nodes to validate that the hardware and software are set up as expected. Nodes that fail this check are automatically drained in Slurm to prevent jobs from running on potentially broken nodes.
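
When a node has been drained, standard Slurm commands show why and let you return it to service. This is a minimal sketch, with `gpu01` as a hypothetical node name; resume a node only after the underlying problem has been fixed.

```shell
NODE="gpu01"   # hypothetical node name
RESUME_CMD="scontrol update NodeName=${NODE} State=RESUME"

if command -v sinfo >/dev/null 2>&1; then
  # List down or drained nodes together with the recorded reason
  sinfo -R
else
  echo "sinfo not found: run this on the Slurm login node" >&2
fi

# Printed rather than executed, so nothing is resumed by accident
echo "To resume ${NODE}: ${RESUME_CMD}"
```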

However, the default configuration generated by DeepOps is very basic: it only checks that CPU, memory, and GPUs are present and that a few essential services are running. To customize it, set the nhc_config_template variable to point to your custom file. The NHC docs go into detail about the configuration language.
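
As an illustration of the configuration language, the sketch below writes a small custom file using NHC's `target || check` line format. The three checks are taken from the sample configuration shipped with NHC, but confirm them against the version installed on your nodes. The scratch-file location is arbitrary, and whether `nhc_config_template` expects an absolute or a repository-relative path is not stated here, so treat the path handling as an assumption.

```shell
# Write an illustrative NHC configuration to a scratch file
nhc_template="$(mktemp)"
cat > "$nhc_template" <<'EOF'
# Hypothetical custom checks in NHC's "target || check" form
# "*" applies the check to every node
* || check_ps_service -u root -S sshd
* || check_fs_mount_rw -f /
* || check_hw_physmem_free 1MB
EOF

# Show the checks that were written
grep -v '^#' "$nhc_template"
echo "Set nhc_config_template to the location where you keep this file"
```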

If you want to disable NHC completely, you can do so by setting slurm_install_nhc: no and un-defining the slurm_health_check_program variable.

Monitoring Slurm

As part of the Slurm installation, Grafana and Prometheus are both deployed.

The services can be reached from the following addresses:

Grafana: http://<slurm-master>:3000
Prometheus: http://<slurm-master>:9090
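
A quick way to confirm that both services are up is to query their health endpoints. The sketch below assumes a hypothetical head node name, `login01`, which you can override through the `HEAD_NODE` environment variable. `/api/health` and `/-/healthy` are the standard health endpoints of Grafana and Prometheus respectively.

```shell
HEAD_NODE="${HEAD_NODE:-login01}"   # hypothetical hostname of the slurm-master host
GRAFANA_URL="http://${HEAD_NODE}:3000/api/health"
PROMETHEUS_URL="http://${HEAD_NODE}:9090/-/healthy"

for url in "$GRAFANA_URL" "$PROMETHEUS_URL"; do
  # -f makes curl return non-zero on HTTP errors; timeouts keep the check short
  if curl -fsS --connect-timeout 3 --max-time 5 "$url" >/dev/null 2>&1; then
    echo "OK   $url"
  else
    echo "FAIL $url"
  fi
done
```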

Configuring shared filesystems

For information about configuring a shared NFS filesystem on your Slurm cluster, see the documentation on Slurm and NFS.

Installing tools and applications

You may optionally choose to install a tool for managing additional packages on your Slurm cluster. See the documentation on software modules for information on how to set this up.

Installing Open OnDemand

Open OnDemand can be installed by setting the install_open_ondemand variable to yes before running the slurm-cluster.yml playbook.

Pyxis, Enroot, and Singularity

Pyxis and Enroot are installed by default and can be disabled by setting slurm_install_enroot and slurm_install_pyxis to no. Singularity can be installed by setting the slurm_cluster_install_singularity variable to yes before running the slurm-cluster.yml playbook.
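
One way to set these optional variables without editing the group variable files is to pass them as extra variables, which take precedence over values defined elsewhere in Ansible. The sketch below writes a small overrides file and prints the matching command; `-e @file` is standard `ansible-playbook` syntax for loading variables from a file. The variable names come from this guide; the scratch-file location is arbitrary, and setting the same values in `config/group_vars/slurm-cluster.yml` is an assumed alternative.

```shell
# Write optional-component settings to a scratch overrides file
overrides="$(mktemp)"
cat > "$overrides" <<'EOF'
install_open_ondemand: yes
slurm_cluster_install_singularity: yes
# Uncomment to disable the container runtime components installed by default
# slurm_install_enroot: no
# slurm_install_pyxis: no
EOF

# Print the command for review; run it from the repository root when ready
OPTIONS_CMD="ansible-playbook -l slurm-cluster -e @${overrides} playbooks/slurm-cluster.yml"
echo "$OPTIONS_CMD"
```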

Large deployments

To minimize the requirements for the cluster management services, DeepOps deploys a single Slurm head node for cluster management, shared filesystems, and user login. However, for larger deployments, it often makes sense to run these functions on multiple separate machines. For instructions on separating these functions, see the large deployment guide.
