Added compute doc, adding other operations docs (e.g. backups)
dotsdl committed Aug 19, 2023
1 parent b831756 commit 057baa4
Showing 3 changed files with 197 additions and 1 deletion.
193 changes: 193 additions & 0 deletions docs/compute.rst
@@ -0,0 +1,193 @@
.. _compute:

#######
Compute
#######

To execute ``Transformation``\s and obtain free energy estimates, you must deploy compute services to resources suitable for these types of calculations.
This document details how to do this on several different types of compute resources.

There is currently a single implementation of an ``alchemiscale`` compute service: the :py:class:`~alchemiscale.compute.service.SynchronousComputeService`.
Other variants will likely be created in the future, optimized for different use cases.
This documentation will expand over time as these variants become available; for now, it assumes use of this variant.

In all cases, you will need to define a configuration file for your compute services to consume on startup.
A template for this file can be found at the URL below; replace ``$ALCHEMISCALE_VERSION`` with the version tag (e.g. ``v0.1.4``) of the ``alchemiscale`` release deployed on your server::

    https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/configs/synchronous-compute-settings.yaml
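
As a convenience, the template can be fetched from the command line. The following is a minimal sketch, assuming ``curl`` is available; the helper names ``config_url`` and ``fetch_compute_settings`` are hypothetical and not part of ``alchemiscale``:

```shell
# Hypothetical helper: print the template URL for a given version tag.
config_url() {
    printf 'https://raw.githubusercontent.com/openforcefield/alchemiscale/%s/devtools/configs/synchronous-compute-settings.yaml\n' "$1"
}

# Hypothetical helper: download the template into the current directory.
# -f: fail on HTTP errors; -sS: no progress bar, but show errors; -L: follow redirects
fetch_compute_settings() {
    curl -fsSL -o synchronous-compute-settings.yaml "$(config_url "$1")"
}

# Example (uses the example tag from the text above):
#   fetch_compute_settings v0.1.4
```

Nothing is downloaded until ``fetch_compute_settings`` is called, so the definitions can safely live in a setup script.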


***********
Single-Host
***********

To deploy a compute service (or multiple services) to a single host, we recommend one of two routes:

* installing all dependencies in a ``conda``/``mamba`` environment
* running the services as Docker containers, with all dependencies baked in


.. _compute_conda:

Deploying with conda/mamba
==========================

To deploy via ``conda``/``mamba``, first create an environment (we recommend ``mamba`` for its performance)::

    mamba env create -n alchemiscale-compute-$ALCHEMISCALE_VERSION -f https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/conda-envs/alchemiscale-compute.yml

Once created, activate the environment in your current shell::

    conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION

Then start a compute service, assuming your configuration file is in the current working directory, with::

    alchemiscale compute synchronous -c synchronous-compute-settings.yaml


.. _compute_docker:

Deploying with Docker
=====================

To deploy with Docker, assuming your configuration file is in the current working directory, you might use::

    docker run --gpus all --rm -v "$(pwd)":/mnt ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION compute synchronous -c /mnt/synchronous-compute-settings.yaml


See the `official Docker documentation on GPU use`_ for details on how to specify individual GPUs for each container you launch.
It may also make sense to apply constraints to the number of CPUs available to each container to avoid oversubscription.
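
As an illustration of both points, the sketch below starts one container bound to a single GPU with ``--gpus device=N`` and caps it at two CPUs with ``--cpus``. Both are standard ``docker run`` options; the helper name ``run_compute_on_gpu``, the CPU count, and the use of ``-d`` to run in the background are assumptions to adapt to your host:

```shell
# Hypothetical helper: start one compute service container bound to one GPU.
# $1: GPU index as reported by nvidia-smi; $2: alchemiscale version tag
run_compute_on_gpu() {
    docker run -d --rm \
        --gpus "device=$1" \
        --cpus 2 \
        -v "$(pwd)":/mnt \
        "ghcr.io/openforcefield/alchemiscale-compute:$2" \
        compute synchronous -c /mnt/synchronous-compute-settings.yaml
}

# Example: one service on each of GPUs 0 and 1
#   for gpu in 0 1; do run_compute_on_gpu "$gpu" "$ALCHEMISCALE_VERSION"; done
```

Keeping the CPU cap equal to the number of threads each service actually uses avoids several containers competing for the same cores.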


***********
HPC Cluster
***********

To deploy compute services to an HPC cluster, we recommend submitting them as individual jobs to the HPC cluster's scheduler.
Different clusters feature different schedulers (e.g. SLURM, LSF, TORQUE/PBS, etc.), and vary widely in their hardware and queue configurations.
You will need to tailor your specific approach to the constraints of the cluster you are targeting.

The following is an example of the *content* of a script submitted to an HPC cluster.
We have left off the top matter that is specific to the queueing system, and certain environment variables (e.g. ``JOBID``, ``JOBINDEX``) should be tailored to those presented by the queueing system.
Note that for this case we've made use of a ``conda``/``mamba``-based deployment, detailed above in :ref:`compute_conda`::

    # don't limit stack size
    ulimit -s unlimited
    # make scratch space
    mkdir -p /scratch/${USER}/${JOBID}-${JOBINDEX}
    # activate environment
    conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION
    # create a YAML file with specific substitutions
    # each service in this job can share the same config
    envsubst < settings.yaml > configs/settings.${JOBID}-${JOBINDEX}.yaml
    # start up a single service, using the config generated above
    alchemiscale compute synchronous -c configs/settings.${JOBID}-${JOBINDEX}.yaml
    # remove scratch space
    rm -r /scratch/${USER}/${JOBID}-${JOBINDEX}
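
The generic ``JOBID`` and ``JOBINDEX`` names used above can be derived from whatever the scheduler provides. The following is a minimal sketch, assuming SLURM or LSF; ``set_job_vars`` is a hypothetical helper, and other schedulers would need their own branch:

```shell
# Hypothetical helper: map scheduler-specific variables onto JOBID / JOBINDEX.
set_job_vars() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # SLURM: array jobs additionally set SLURM_ARRAY_JOB_ID / SLURM_ARRAY_TASK_ID
        JOBID="${SLURM_ARRAY_JOB_ID:-$SLURM_JOB_ID}"
        JOBINDEX="${SLURM_ARRAY_TASK_ID:-0}"
    elif [ -n "${LSB_JOBID:-}" ]; then
        # LSF: LSB_JOBINDEX is 0 for non-array jobs
        JOBID="$LSB_JOBID"
        JOBINDEX="${LSB_JOBINDEX:-0}"
    else
        echo "no supported scheduler detected" >&2
        return 1
    fi
    # export so that child processes such as envsubst can see them
    export JOBID JOBINDEX
}
```

Calling ``set_job_vars`` at the top of the job script keeps the rest of the script identical across schedulers.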


The ``envsubst`` line in particular will make a config specific to this job, with environment variable substitutions.
A subset of the options used in the config file is given below::

    ---
    # options for service initialization
    init:
      # Filesystem path to use for `ProtocolDAG` `shared` space.
      shared_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/shared"
      # Filesystem path to use for `ProtocolUnit` `scratch` space.
      scratch_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/scratch"
      # Path to file for logging output; if not set, logging will only go to
      # STDOUT.
      logfile: /home/${USER}/logs/service.${JOBID}.log
    # options for service execution
    start:
      # Max number of Tasks to execute before exiting. If `null`, the service will
      # have no task limit.
      max_tasks: 1
      # Max number of seconds to run before exiting. If `null`, the service will
      # have no time limit.
      max_time: 300


For HPC job-based execution, we recommend limiting the number of ``Task``\s the compute service executes to a small number, preferably 1, and setting a time limit beyond which the compute service will shut down.
With this configuration, when a compute service comes up and claims a ``Task``, it will have nearly the full walltime of its job to execute it.
Any compute service that fails to claim a ``Task`` will shut itself down, and the job will exit, avoiding waste and a scenario where a ``Task`` is claimed without enough walltime left on the job to complete it.
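
One practical caveat with the ``envsubst`` step in the job script: a variable that is unset (or not exported) is silently replaced by an empty string in the rendered config. A small pre-flight check can catch the unset case; ``require_vars`` below is a hypothetical helper, not part of ``alchemiscale``:

```shell
# Hypothetical helper: fail if any named variable is unset or empty.
require_vars() {
    for name in "$@"; do
        # look up the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            return 1
        fi
    done
}

# Example, placed before the envsubst line in the job script:
#   export JOBID JOBINDEX
#   require_vars USER JOBID JOBINDEX || exit 1
```

The check does not verify that a variable is exported, so the explicit ``export`` in the example is still needed for ``envsubst`` to see shell-local values.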


******************
Kubernetes Cluster
******************

To deploy compute services to a Kubernetes ("k8s") cluster, we use an approach similar to the Docker deployment detailed above in :ref:`compute_docker`.
We define a k8s `Deployment`_ featuring a single container spec as the file ``compute-services.yaml``::

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: alchemiscale-synchronouscompute
      labels:
        app: alchemiscale-synchronouscompute
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: alchemiscale-synchronouscompute
      template:
        metadata:
          labels:
            app: alchemiscale-synchronouscompute
        spec:
          containers:
          - name: alchemiscale-synchronous-container
            image: ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION
            args: ["compute", "synchronous", "-c", "/mnt/settings/synchronous-compute-settings.yaml"]
            resources:
              limits:
                cpu: 2
                memory: 12Gi
                ephemeral-storage: 48Gi
                nvidia.com/gpu: 1
              requests:
                cpu: 2
                memory: 12Gi
                ephemeral-storage: 48Gi
            volumeMounts:
              - name: alchemiscale-compute-settings-yaml
                mountPath: "/mnt/settings"
                readOnly: true
            env:
              - name: OPENMM_CPU_THREADS
                value: "2"
          volumes:
            - name: alchemiscale-compute-settings-yaml
              secret:
                secretName: alchemiscale-compute-settings-yaml


This assumes our configuration file has been defined as a *secret* in the cluster.
Assuming the file is in the current working directory, we can add it as a secret with::

    kubectl create secret generic alchemiscale-compute-settings-yaml --from-file=synchronous-compute-settings.yaml


We can then deploy the compute services with::

    kubectl apply -f compute-services.yaml

To scale up the number of compute services, increase the number of ``replicas`` to the number desired, and re-run the ``kubectl apply`` command above.
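
For a quick, temporary change in scale without editing the file, ``kubectl scale`` can also be used. The following is a sketch, assuming the Deployment name from ``compute-services.yaml`` above; ``scale_compute`` is a hypothetical wrapper:

```shell
# Hypothetical wrapper: set the number of compute service replicas.
scale_compute() {
    kubectl scale deployment/alchemiscale-synchronouscompute --replicas="$1"
}

# Example: run four compute services, then list their pods
#   scale_compute 4
#   kubectl get pods -l app=alchemiscale-synchronouscompute
```

Note that a later ``kubectl apply`` of the file resets the count to the value of ``replicas`` it contains, so persistent changes still belong in ``compute-services.yaml``.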

A more complete example of this type of deployment can be found in `alchemiscale-k8s`_.


.. _Deployment: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
.. _alchemiscale-k8s: https://github.com/datryllic/alchemiscale-k8s/tree/main/compute
.. _official Docker documentation on GPU use: https://docs.docker.com/config/containers/resource_constraints/#gpu
1 change: 1 addition & 0 deletions docs/index.rst
@@ -30,6 +30,7 @@ in particular the `OpenForceField`_ and `OpenFreeEnergy`_ ecosystems.
./overview
./user_guide
./deployment
./compute
./operations
./API_docs

4 changes: 3 additions & 1 deletion docs/operations.rst
@@ -9,7 +9,7 @@ Add Users
To add a new user identity, you will generally use the ``alchemiscale`` CLI::


- $ export NEO4J_URL=bolt://<NEO4J_HOSTNAME>7687
+ $ export NEO4J_URL=bolt://<NEO4J_HOSTNAME>:7687
$ export NEO4J_USER=<NEO4J_USERNAME>
$ export NEO4J_PASS=<NEO4J_PASSWORD>
$
@@ -51,3 +51,5 @@ The important bits here are:
Backups
*******

Performing regular backups of the state store is an important component for any production deployment of ``alchemiscale``.
To
