# Manual Installation of Data Accelerator

While there is a work-in-progress Ansible-based development setup, this document describes the manual setup of the data-acc components.

## DACD and DACCTL

Both software components are statically compiled; a tagged release with binaries built by the CI system is ready to download from GitHub: https://github.com/RSE-Cambridge/data-acc/releases

Because the binaries are static, they can be copied to wherever they are required.

The binaries are currently installed to /usr/local/bin on each DAC node. The following systemd unit file is used to control the dacd program:

```ini
[Unit]
Description=dacd
ConditionPathExists=/usr/local/bin/dacd
After=network.target

[Service]
Type=simple
User=dacd
Group=dacd
Restart=on-failure
RestartSec=10

WorkingDirectory=/var/lib/data-acc
EnvironmentFile=/etc/dacd/dacd.conf
ExecStart=/usr/local/bin/dacd

# make sure log directory exists and owned by syslog
PermissionsStartOnly=true
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=dacd

[Install]
WantedBy=multi-user.target
```
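A minimal sketch of putting this into service, assuming the unit above is saved as /etc/systemd/system/dacd.service and that the dacd service account does not already exist:

```bash
# Create the service account and working directory referenced by the unit file
sudo useradd --system --home-dir /var/lib/data-acc --shell /sbin/nologin dacd
sudo mkdir -p /var/lib/data-acc /etc/dacd
sudo chown dacd:dacd /var/lib/data-acc

# Pick up the new unit, then start dacd and enable it at boot
sudo systemctl daemon-reload
sudo systemctl enable --now dacd
sudo systemctl status dacd
```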

The configuration in /etc/dacd/dacd.conf is covered in more detail below.

On the Slurm master node, the dacctl binary is used as a command-line tool. In the Slurm configuration shown below, GetSysState can be modified to point to the location where you installed the binary.

## etcd and TLS config

You need to install an etcd cluster. It can be installed via EPEL or from the upstream repository.

We recommend you run the latest 3.3.x release of etcd.

To secure the communication with etcd, TLS certificates should be used.

For more details please see the etcd docs, such as this page: https://etcd.io/docs/v3.3.12/op-guide/maintenance/
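As a quick sanity check that the certificates work, you can query cluster health with etcdctl v3; the endpoint and certificate paths below are the ones used in the examples later in this document and should be adjusted to your deployment:

```bash
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.43.21.71:2379 \
  --cacert=/dac/dacd/cert/ca.pem \
  --cert=/dac/dacd/cert/dac-e-24.data.cluster.pem \
  --key=/dac/dacd/cert/dac-e-24.data.cluster-key.pem \
  endpoint health
```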

## Example configuration

On each DAC node, the following environment file is referenced by the dacd unit file to give dacd the details of its environment, in particular the certificates needed to communicate with etcd:

### dacd

```
ETCDCTL_API=3
ETCDCTL_ENDPOINTS=https://10.43.21.71:2379
ETCD_CLIENT_DEBUG=1
ETCDCTL_CERT_FILE=/dac/dacd/cert/dac-e-24.data.cluster.pem
ETCDCTL_KEY_FILE=/dac/dacd/cert/dac-e-24.data.cluster-key.pem
ETCDCTL_CA_FILE=/dac/dacd/cert/ca.pem
DAC_LNET_SUFFIX="-opa@o2ib1"
DEVICE_COUNT=12
DAC_DEVICE_CAPACITY_GB=1400
DAC_POOL_NAME=default
DAC_MGS_DEV=sdb
DAC_HOST_GROUP=dac-prod
DAC_SKIP_ANSIBLE=false
DAC_MDT_SIZE="20g"
DAC_ANSIBLE_DIR=/var/lib/data-acc/fs-ansible/
```

Note that /var/lib/data-acc/fs-ansible/ should contain the fs-ansible scripts from the release tarball. In addition, there should be a working virtual environment in /var/lib/data-acc/fs-ansible/.venv. You can create this on each dacd node as follows:

```bash
virtualenv /var/lib/data-acc/fs-ansible/.venv
. /var/lib/data-acc/fs-ansible/.venv/bin/activate
pip install -U pip
pip install -U ansible
```

Note that failed Ansible runs are currently left in /tmp. This aids debugging but, left unchecked, will eventually consume all available disk space.
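Until that changes, a periodic cleanup is worth scheduling. A sketch of a daily cron script is shown below; note that the /tmp name pattern is an assumption, so adjust it to match the directories your dacd version actually leaves behind:

```bash
#!/bin/bash
# /etc/cron.daily/dacd-tmp-cleanup (hypothetical script name)
# Remove leftover ansible run directories older than 7 days.
# The 'dac*' pattern is an assumption: check /tmp after a failed run
# to see what your version creates, and adjust accordingly.
find /tmp -maxdepth 1 -type d -name 'dac*' -mtime +7 -exec rm -rf {} +
```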

The slurmctld unit file on the Slurm master is updated to reference this environment file so that the dacctl command line can communicate with etcd:

### slurmctld

```
ETCDCTL_API=3
ETCDCTL_ENDPOINTS=https://10.43.21.71:2379
ETCDCTL_CERT_FILE=/dac/dacd/cert/slurm-master.data.cluster.pem
ETCDCTL_KEY_FILE=/dac/dacd/cert/slurm-master.data.cluster-key.pem
ETCDCTL_CA_FILE=/dac/dacd/cert/ca.pem
```
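One way to attach this without editing the packaged slurmctld unit is a systemd drop-in. This is a sketch; the path /etc/slurm/dacctl.env is an assumption, so use wherever you saved the environment file:

```bash
# Point slurmctld at the dacctl environment file via a drop-in
sudo mkdir -p /etc/systemd/system/slurmctld.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/slurmctld.service.d/dacctl.conf
[Service]
# /etc/slurm/dacctl.env is a placeholder path for the file shown above
EnvironmentFile=/etc/slurm/dacctl.env
EOF
sudo systemctl daemon-reload
sudo systemctl restart slurmctld
```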

## Slurm Configuration

Here are the important parts of the Slurm configuration files that need merging with your existing Slurm configuration:

### slurm.conf

```
BurstBufferType=burst_buffer/cray
AccountingStorageTRES="bb/cray"
```

From Slurm 19.05 you instead put this:

```
BurstBufferType=burst_buffer/datawarp
AccountingStorageTRES="bb/datawarp"
```

### burst_buffer.conf

```
AllowUsers=root,slurm
Flags=EnablePersistent,PrivateData

StageInTimeout=3600
StageOutTimeout=3600
OtherTimeout=300

DefaultPool=default

GetSysState=/usr/local/bin/dacctl
```
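With Slurm reconfigured, you can exercise the burst buffer plugin end to end with a simple job script. The capacity and access mode here are illustrative, not prescriptive:

```bash
#!/bin/bash
#SBATCH --nodes=1
# Request a per-job scratch filesystem from the burst buffer pool
#DW jobdw capacity=1TiB access_mode=striped type=scratch
# The plugin exports the mount point as DW_JOB_STRIPED for striped access
df -h "$DW_JOB_STRIPED"
```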

## SSH and Service Accounts

The dacd daemon makes use of Ansible to create the filesystem. It then uses ssh to mount the filesystem on the compute nodes. As such, the user running dacd needs to be able to ssh to all the servers running dacd and to all the compute nodes, using the hostnames recorded in etcd and Slurm respectively.
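A minimal sketch of setting that up for the dacd user, assuming key-based authentication; the hostnames in the loop are illustrative placeholders:

```bash
# As the dacd user: generate a key once, then distribute it to every
# dacd server and compute node (hostnames below are placeholders)
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519
for host in dac-e-24 compute-01 compute-02; do
  ssh-copy-id "dacd@${host}"
done
# Confirm passwordless access works non-interactively
ssh -o BatchMode=yes dacd@compute-01 true
```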

For running Ansible, password-less sudo is needed for the user across all the nodes running dacd. On the compute nodes, it seems possible to restrict sudo access to the following:

```
dacd ALL=(ALL) NOPASSWD: /usr/bin/mkdir -p /dac/*, /usr/bin/chmod 770 /dac/*, /usr/bin/chmod 0600 /dac/*, /usr/bin/chown * /dac/*, /usr/bin/mount -t lustre * /dac/*, /usr/bin/umount -l /dac/*, /usr/bin/umount /dac/*, /usr/sbin/losetup /dev/loop* /dac/*, /usr/sbin/losetup -d /dev/loop*, /usr/sbin/mkswap /dev/loop*, /usr/sbin/swapon /dev/loop*, /usr/sbin/swapoff /dev/loop*, /usr/bin/ln -s /dac/* /dac/*, /usr/bin/dd if=/dev/zero of=/dac/*, /usr/bin/rm -rf /dac/*, /bin/grep /dac/* /etc/mtab
```
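If you install this as a drop-in under /etc/sudoers.d/, validate the syntax before sudo picks it up:

```bash
# A parse error in a sudoers file can lock out sudo, so always check first
sudo visudo -cf /etc/sudoers.d/dacd
```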