Nomad tests

These scripts test that a given set of Nomad nodes is correctly configured for the project requirements.

The tests are useful for checking:

new resources, when a given actor adds them to the existing Nomad cluster.
whether everyone in the cluster has applied Ansible role updates.
whether jobs can still be deployed, since some resource usage (e.g. disk) is not monitored by Nomad.
that newly developed Ansible roles do not break the expected functionality of the Nomad cluster.

Requirements

You need the appropriate certificates to connect to the Nomad cluster.

You need Nomad version >= 1.5, which supports the node meta command.
Before running the tests, you must provide an EGI token:
```
export TMP_EGI_TOKEN="$(oidc-token egi-checkin)"
```
See the PAPI docs to learn how to generate an EGI token.
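
As a rough aid for the Nomad version requirement above, the following sketch parses the text printed by `nomad version` and compares it with 1.5. The helper names and the sample output string are illustrative assumptions; they are not part of this repository.

```python
import re
import subprocess
from typing import Tuple

MIN_NOMAD_VERSION = (1, 5)  # minimum version stated in this README


def parse_nomad_version(output: str) -> Tuple[int, ...]:
    """Extract the numeric version from text such as 'Nomad v1.5.6'."""
    match = re.search(r"v?(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def meets_minimum(output: str, minimum: Tuple[int, ...] = MIN_NOMAD_VERSION) -> bool:
    """Return True if the parsed version is at least `minimum`."""
    return parse_nomad_version(output)[: len(minimum)] >= tuple(minimum)


def installed_nomad_output() -> str:
    """Run `nomad version` locally (assumes the binary is on PATH)."""
    result = subprocess.run(
        ["nomad", "version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `meets_minimum(installed_nomad_output())` on a machine with the Nomad binary installed gives a quick yes/no answer before running the test suite.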

Once this is ready, you can install the test suite with:

```
python -m venv --system-site-packages myenv
source myenv/bin/activate
pip install -e .
deactivate
```

Usage

You can test:

the whole cluster:
```
ai4-nomad-tests --cluster
```

a single datacenter:

```
ai4-nomad-tests --datacenter ifca-ai4eosc
```

a list of individual nodes:

```
ai4-nomad-tests --nodes ifca-node-gpu-1 --nodes ifca-node-gpu-2
```

💡 Tip: Periodic testing

You can schedule a cron job that periodically runs the tests on the whole cluster and automatically marks the nodes that fail as ineligible, so that jobs do not land there (and fail). For this, use the monitor-cluster.sh script.

Once the admin responsible for those Nomad nodes fixes the issues, the nodes can be manually marked as eligible again in the Nomad UI.
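
The eligibility toggling described in this tip can also be done from the command line with `nomad node eligibility`. The sketch below only builds the command lines from a hypothetical `{node_id: passed}` result mapping; the function names and node IDs are made up for illustration, and this is not necessarily how the monitoring script itself works.

```python
from typing import Dict, List


def disable_commands(results: Dict[str, bool]) -> List[List[str]]:
    """Build one `nomad node eligibility -disable` command per failing node."""
    return [
        ["nomad", "node", "eligibility", "-disable", node_id]
        for node_id, passed in sorted(results.items())
        if not passed
    ]


def enable_command(node_id: str) -> List[str]:
    """Build the command an admin could run once the node is fixed."""
    return ["nomad", "node", "eligibility", "-enable", node_id]


# Hypothetical test results keyed by Nomad node ID.
example_results = {"4d2ba53b": True, "9f1c0e77": False}
for command in disable_commands(example_results):
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(..., check=True)` on a machine that has the cluster certificates configured.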

Implementation notes

GPU deployment testing uses a Nomad job that doesn't request GPUs. This job is mainly intended to check that the Traefik endpoints of that node work correctly. We don't test with a GPU job because GPUs are a scarce resource, so the deployment could fail because no GPUs are available rather than because the node is malfunctioning.

To find GPU misconfigurations without actually making a GPU deployment, we try to catch as many errors as possible by parsing the node metadata (cf. gpu.node_info()). This is not perfect, but it catches most GPU errors (e.g. device not available).
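
To illustrate the idea of inspecting node metadata instead of deploying a GPU job, here is a minimal sketch. It is not the project's gpu.node_info(); the function name is hypothetical, and the field names (`NodeResources`, `Devices`, `Instances`, `Healthy`, `HealthDescription`) are assumptions about the JSON that Nomad reports for a node.

```python
from typing import Any, Dict, List


def find_gpu_problems(node: Dict[str, Any]) -> List[str]:
    """Return human-readable GPU problems found in a node description."""
    resources = node.get("NodeResources") or {}
    devices = resources.get("Devices") or []
    gpus = [device for device in devices if device.get("Type") == "gpu"]
    if not gpus:
        return ["no GPU device group reported by the node"]

    problems = []
    for gpu in gpus:
        name = gpu.get("Name", "unknown")
        instances = gpu.get("Instances") or []
        if not instances:
            problems.append(f"{name}: device group has no instances")
        for instance in instances:
            # A missing "Healthy" flag is treated as unhealthy (assumption).
            if not instance.get("Healthy", False):
                reason = instance.get("HealthDescription") or "no reason given"
                instance_id = instance.get("ID", "?")
                problems.append(f"{name}/{instance_id}: unhealthy ({reason})")
    return problems
```

An empty list means no problem was spotted in the metadata; it does not guarantee that a real GPU deployment would succeed.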
