Commit

Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe authored Jan 14, 2025
2 parents 81c316a + 2cac614 commit bccc88b
Showing 6 changed files with 48 additions and 27 deletions.
28 changes: 24 additions & 4 deletions .github/workflows/nightly-cleanup.yml
@@ -67,11 +67,31 @@ jobs:
for cluster_prefix in ${ci_clusters}
do
  echo "Processing cluster: $cluster_prefix"
-  TAGS=$(openstack server show ${cluster_prefix}-control --column tags --format value)
-  if [[ $TAGS =~ "keep" ]]; then
-    echo "Skipping ${cluster_prefix} - control instance is tagged as keep"
+  # Get all servers with the matching name for control node
+  CONTROL_SERVERS=$(openstack server list --name ${cluster_prefix}-control --format json)
+  SERVER_COUNT=$(echo "$CONTROL_SERVERS" | jq length)
+  if [[ $SERVER_COUNT -gt 1 ]]; then
+    echo "Multiple servers found for control node '${cluster_prefix}-control'. Checking tags for each..."
+    for server in $(echo "$CONTROL_SERVERS" | jq -r '.[].ID'); do
+      # Get tags for each control node
+      TAGS=$(openstack server show "$server" --column tags --format value)
+      if [[ $TAGS =~ "keep" ]]; then
+        echo "Skipping ${cluster_prefix} (server ${server}) - control instance is tagged as keep"
+      else
+        ./dev/delete-cluster.py ${cluster_prefix} --force
+      fi
+    done
  else
-    ./dev/delete-cluster.py ${cluster_prefix} --force
+    # If only one server, extract its tags and proceed
+    TAGS=$(echo "$CONTROL_SERVERS" | jq -r '.[0].Tags')
+    if [[ $TAGS =~ "keep" ]]; then
+      echo "Skipping ${cluster_prefix} - control instance is tagged as keep"
+    else
+      ./dev/delete-cluster.py ${cluster_prefix} --force
+    fi
  fi
done
shell: bash
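
The reworked cleanup logic above branches on how many control servers match the cluster name. A minimal, self-contained sketch of the same jq-based tag check, using hard-coded example JSON in place of a live `openstack server list` call (the `ID`/`Tags` field names are assumed to match the CLI's JSON output), might look like:

```bash
#!/usr/bin/env bash
# Stand-in for: openstack server list --name ${cluster_prefix}-control --format json
CONTROL_SERVERS='[
  {"ID": "11111111-aaaa", "Name": "demo-control", "Tags": ["keep"]},
  {"ID": "22222222-bbbb", "Name": "demo-control", "Tags": []}
]'

SERVER_COUNT=$(echo "$CONTROL_SERVERS" | jq length)

if [[ $SERVER_COUNT -gt 1 ]]; then
  # Check each duplicate control node's tags individually
  for server in $(echo "$CONTROL_SERVERS" | jq -r '.[].ID'); do
    TAGS=$(echo "$CONTROL_SERVERS" | jq -r --arg id "$server" '.[] | select(.ID == $id) | .Tags | join(",")')
    if [[ $TAGS =~ "keep" ]]; then
      echo "Would skip server ${server} - tagged as keep"
    else
      echo "Would delete cluster owning server ${server}"
    fi
  done
else
  # Single match: read tags straight from the list output
  TAGS=$(echo "$CONTROL_SERVERS" | jq -r '.[0].Tags | join(",")')
  if [[ $TAGS =~ "keep" ]]; then
    echo "Would skip - tagged as keep"
  else
    echo "Would delete cluster"
  fi
fi
```

With the example data above, the first server is skipped (tagged `keep`) and the second would trigger a delete.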
17 changes: 9 additions & 8 deletions README.md
@@ -6,10 +6,10 @@ This repository contains playbooks and configuration to define a Slurm-based HPC
- [Rocky Linux](https://rockylinux.org/)-based hosts.
- [OpenTofu](https://opentofu.org/) configurations to define the cluster's infrastructure-as-code.
- Packages for Slurm and MPI software stacks from [OpenHPC](https://openhpc.community/).
-- Shared fileystem(s) using NFS (with in-cluster or external servers) or [CephFS](https://docs.ceph.com/en/latest/cephfs/) via [Openstack Manila](https://wiki.openstack.org/wiki/Manila).
+- Shared filesystem(s) using NFS (with in-cluster or external servers) or [CephFS](https://docs.ceph.com/en/latest/cephfs/) via [OpenStack Manila](https://wiki.openstack.org/wiki/Manila).
- Slurm accounting using a MySQL database.
- Monitoring integrated with Slurm jobs using Prometheus, ElasticSearch and Grafana.
-- A web-based portal from [OpenOndemand](https://openondemand.org/).
+- A web-based portal from [Open OnDemand](https://openondemand.org/).
- Production-ready default Slurm configurations for access and memory limits.
- [Packer](https://developer.hashicorp.com/packer)-based image build configurations for node images.

@@ -25,15 +25,15 @@ The default configuration in this repository may be used to create a cluster to
- Persistent state backed by an OpenStack volume.
- NFS-based shared file system backed by another OpenStack volume.

-Note that the OpenOndemand portal and its remote apps are not usable with this default configuration.
+Note that the Open OnDemand portal and its remote apps are not usable with this default configuration.

It requires an OpenStack cloud, and an Ansible "deploy host" with access to that cloud.

Before starting ensure that:
- You have root access on the deploy host.
- You can create instances using a Rocky 9 GenericCloud image (or an image based on that).
- **NB**: In general it is recommended to use the [latest released image](https://github.com/stackhpc/ansible-slurm-appliance/releases) which already contains the required packages. This is built and tested in StackHPC's CI.
-- You have a SSH keypair defined in OpenStack, with the private key available on the deploy host.
+- You have an SSH keypair defined in OpenStack, with the private key available on the deploy host.
- Created instances have access to internet (note proxies can be setup through the appliance if necessary).
- Created instances have accurate/synchronised time (for VM instances this is usually provided by the hypervisor; if not or for bare metal instances it may be necessary to configure a time service via the appliance).
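
A few of the prerequisites above can be sanity-checked before proceeding; the commands below are illustrative (image and keypair names are site-specific) and assume the OpenStack CLI is already configured on the deploy host:

```bash
# On the deploy host: check a Rocky 9 image and your keypair are visible
openstack image list | grep -i rocky
openstack keypair list

# On a created instance: check the clock is synchronised
timedatectl status | grep -i 'synchronized'
```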

@@ -66,10 +66,11 @@ Use the `cookiecutter` template to create a new environment to hold your configu

and follow the prompts to complete the environment name and description.

-**NB:** In subsequent sections this new environment is refered to as `$ENV`.
+**NB:** In subsequent sections this new environment is referred to as `$ENV`.

-Activate the new environment:
+Go back to the root folder and activate the new environment:

+cd ..
. environments/$ENV/activate
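
As a quick, optional check that the intended environment is active (this assumes the activate script exports an `APPLIANCES_ENVIRONMENT_ROOT`-style variable; the exact name may differ by appliance version):

```bash
# Optional: confirm an environment is active before running any playbooks
echo "${APPLIANCES_ENVIRONMENT_ROOT:-no environment activated}"
```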

And generate secrets for it:
@@ -124,8 +125,8 @@ where the IP of the login node is given in `environments/$ENV/inventory/hosts.ym
## Overview of directory structure

- `environments/`: See [docs/environments.md](docs/environments.md).
-- `ansible/`: Contains the ansible playbooks to configure the infrastruture.
-- `packer/`: Contains automation to use Packer to build machine images for an enviromment - see the README in this directory for further information.
+- `ansible/`: Contains the ansible playbooks to configure the infrastructure.
+- `packer/`: Contains automation to use Packer to build machine images for an environment - see the README in this directory for further information.
- `dev/`: Contains development tools.

For further information see the [docs](docs/) directory.
2 changes: 1 addition & 1 deletion docs/image-build.md
@@ -51,7 +51,7 @@ To build either a site-specific fat image from scratch, or to extend an existing
openstack image unset --property signature_verified $SOURCE_IMAGE
-then delete the failed volume, select cancelling the build when Packer queries, and then retry. This is [Openstack bug 1823445](https://bugs.launchpad.net/cinder/+bug/1823445).
+then delete the failed volume, select cancelling the build when Packer queries, and then retry. This is [OpenStack bug 1823445](https://bugs.launchpad.net/cinder/+bug/1823445).
6. The built image will be automatically uploaded to OpenStack with a name prefixed `openhpc` and including a timestamp and a shortened git hash.
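
A hedged sketch of the recovery steps described above, assuming `$SOURCE_IMAGE` is set and that the leftover volume from the failed build is the only volume in `error` state:

```bash
# Clear the image property left over from the failing signature check
openstack image unset --property signature_verified "$SOURCE_IMAGE"

# Remove the volume left behind by the failed build (assumes it is the only 'error' volume)
openstack volume list --status error -f value -c ID | xargs -r openstack volume delete
```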
2 changes: 1 addition & 1 deletion docs/monitoring-and-logging.md
@@ -96,7 +96,7 @@ The `grafana` group controls the placement of the grafana service. Load balancing

### Access

-If Open Ondemand is enabled then by default this is used to proxy Grafana, otherwise Grafana is accessed through the first . See `grafana_url` in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). The port used (variable `grafana_port`) defaults to `3000`.
+If Open OnDemand is enabled then by default this is used to proxy Grafana, otherwise Grafana is accessed through the first host in the `grafana` group. See `grafana_url` in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). The port used (variable `grafana_port`) defaults to `3000`.
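
If Open OnDemand is not deployed, one common (though not appliance-mandated) way to reach Grafana on the control node is an SSH tunnel via the login node; the addresses and the `rocky` user below are illustrative:

```bash
# Forward local port 3000 (the grafana_port default) to Grafana on the control node,
# jumping via the login node's public IP
ssh -L 3000:CONTROL_NODE_IP:3000 rocky@LOGIN_NODE_IP
# then browse to http://localhost:3000
```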

The default credentials for the admin user are:

24 changes: 12 additions & 12 deletions docs/openondemand.md
@@ -1,36 +1,36 @@
# Overview

-The appliance can deploy the Open Ondemand portal. This page describes how to enable this and the default appliance configuration/behaviour. Note that detailed configuration documentation is provided by:
+The appliance can deploy the Open OnDemand portal. This page describes how to enable this and the default appliance configuration/behaviour. Note that detailed configuration documentation is provided by:

- The README for the included `openondemand` role in this repo - [ansible/roles/openondemand/README.md](../ansible/roles/openondemand/README.md).
- The README and default variables for the underlying "official" role which the above wraps - [Open OnDemand Ansible Role](https://github.com/OSC/ood-ansible)
-- The documentation for Open Ondemand [itself](https://osc.github.io/ood-documentation/latest/index.html)
+- The documentation for Open OnDemand [itself](https://osc.github.io/ood-documentation/latest/index.html)

This appliance can deploy and configure:
-- The Open Ondemand server itself (usually on a single login node).
+- The Open OnDemand server itself (usually on a single login node).
- User authentication using one of:
- An external OIDC provider.
- HTTP basic authentication and PAM.
- Virtual desktops on compute nodes.
- Jupyter notebook servers on compute nodes.
-- Proxying of Grafana (usually deployed on the control node) via the Open Ondemand portal.
-- Links to additional filesystems and pages from the Open Ondemand Dashboard.
-- A Prometheus exporter for the Open Ondemand server and related Grafana dashboard
+- Proxying of Grafana (usually deployed on the control node) via the Open OnDemand portal.
+- Links to additional filesystems and pages from the Open OnDemand Dashboard.
+- A Prometheus exporter for the Open OnDemand server and related Grafana dashboard

For examples of all of the above see the `smslabs-example` environment in this repo.

-# Enabling Open Ondemand
-To enable the Open Ondemand server, add single host to the `openondemand` inventory group. Generally, this should be a node in the `login` group, as Open Ondemand must be able to access Slurm commands.
+# Enabling Open OnDemand
+To enable the Open OnDemand server, add a single host to the `openondemand` inventory group. Generally, this should be a node in the `login` group, as Open OnDemand must be able to access Slurm commands.

-To enable compute nodes for virtual desktops or Jupyter notebook servers (accessed through the Open Ondemand portal), add nodes/groups to the `openondemand_desktop` and `openondemand_jupyter` inventory groups respectively. These may be all or a subset of the `compute` group.
+To enable compute nodes for virtual desktops or Jupyter notebook servers (accessed through the Open OnDemand portal), add nodes/groups to the `openondemand_desktop` and `openondemand_jupyter` inventory groups respectively. These may be all or a subset of the `compute` group.

The above functionality is configured by running the `ansible/portal.yml` playbook. This is automatically run as part of `ansible/site.yml`.

# Default configuration

See the [ansible/roles/openondemand/README.md](../ansible/roles/openondemand/README.md) for more details on the variables described below.

-The following variables have been given default values to allow Open Ondemand to work in a newly created environment without additional configuration, but generally should be overridden in `environment/site/inventory/group_vars/all/` with site-specific values:
+The following variables have been given default values to allow Open OnDemand to work in a newly created environment without additional configuration, but generally should be overridden in `environment/site/inventory/group_vars/all/` with site-specific values:
- `openondemand_servername` - this must be defined for both `openondemand` and `grafana` hosts (when Grafana is enabled). Default is `ansible_host` (i.e. the IP address) of the first host in the `openondemand` group.
- `openondemand_auth` and any corresponding options. Defaults to `basic_pam`.
- `openondemand_desktop_partition` and `openondemand_jupyter_partition` if the corresponding inventory groups are defined. Defaults to the first compute group defined in the `compute` Terraform variable in `environments/$ENV/terraform`.
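
As an illustration of the "Enabling Open OnDemand" step above, the inventory groups might be populated along the following lines; the exact file layout under `environments/$ENV/inventory/` is site-specific, so treat this as a sketch rather than the canonical format:

```bash
# Append example group definitions to the environment's inventory groups file
cat >> environments/$ENV/inventory/groups <<'EOF'
[openondemand:children]
login

[openondemand_desktop:children]
compute

[openondemand_jupyter:children]
compute
EOF
```

Re-running `ansible/site.yml` (or just `ansible/portal.yml`) then applies the change.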
@@ -41,9 +41,9 @@ It is also recommended to set:

If shared filesystems other than `$HOME` are available, add paths to `openondemand_filesapp_paths`.

-The appliance automatically configures Open Ondemand to proxy Grafana and adds a link to it on the Open Ondemand dashboard. This means no external IP (or SSH proxying etc) is required to access Grafana (which by default is deployed on the control node). To allow users to authenticate to Grafana, the simplest option is to enable anonymous (View-only) login by setting `grafana_auth_anonymous` (see [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml)[^1]).
+The appliance automatically configures Open OnDemand to proxy Grafana and adds a link to it on the Open OnDemand dashboard. This means no external IP (or SSH proxying etc) is required to access Grafana (which by default is deployed on the control node). To allow users to authenticate to Grafana, the simplest option is to enable anonymous (View-only) login by setting `grafana_auth_anonymous` (see [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml)[^1]).

-[^1]: Note that if `openondemand_auth` is `basic_pam` and anonymous Grafana login is enabled, the appliance will (by default) configure Open Ondemand's Apache server to remove the Authorisation header from proxying of all `node/` addresses. This is done as otherwise Grafana tries to use this header to authenticate, which fails with the default configuration where only the admin Grafana user `grafana` is created. Note that the removal of this header in this configuration means it cannot be used to authenticate proxied interactive applications - however the appliance-deployed remote desktop and Jupyter Notebook server applications use other authentication methods. An alternative if using `basic_pam` is not to enable anonymous Grafana login and to create Grafana users matching the local users (e.g. in `environments/<env>/hooks/post.yml`).
+[^1]: Note that if `openondemand_auth` is `basic_pam` and anonymous Grafana login is enabled, the appliance will (by default) configure Open OnDemand's Apache server to remove the Authorisation header from proxying of all `node/` addresses. This is done as otherwise Grafana tries to use this header to authenticate, which fails with the default configuration where only the admin Grafana user `grafana` is created. Note that the removal of this header in this configuration means it cannot be used to authenticate proxied interactive applications - however the appliance-deployed remote desktop and Jupyter Notebook server applications use other authentication methods. An alternative if using `basic_pam` is not to enable anonymous Grafana login and to create Grafana users matching the local users (e.g. in `environments/<env>/hooks/post.yml`).

# Access
By default the appliance authenticates against OOD with basic auth through PAM. When creating a new environment, a new user with username `demo_user` will be created. Its password is found under `vault_openondemand_default_user` in the appliance secrets store in `environments/{ENV}/inventory/group_vars/all/secrets.yml`. Other users can be defined by overriding the `basic_users_users` variable in your environment (templated into `environments/{ENV}/inventory/group_vars/all/basic_users.yml` by default).
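
A hedged sketch of overriding `basic_users_users` for additional users; the entry keys shown (`name`, `password`) are assumptions and should be checked against the role's README:

```bash
# Illustrative only - adjust keys to match the basic_users role's documented format
cat >> environments/$ENV/inventory/group_vars/all/basic_users.yml <<'EOF'
basic_users_users:
  - name: alice
    password: "{{ vault_alice_password }}"  # keep real secrets in the vaulted secrets file
EOF
```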
2 changes: 1 addition & 1 deletion docs/production.md
@@ -96,7 +96,7 @@ and referenced from the `site` and `production` environments, e.g.:
cluster
```

-- Configure Open OpenOndemand - see [specific documentation](openondemand.README.md).
+- Configure Open OnDemand - see [specific documentation](openondemand.README.md).

- Remove the `demo_user` user from `environments/$ENV/inventory/group_vars/all/basic_users.yml`
