Added compute doc, adding other operations docs (e.g. backups)
Showing 3 changed files with 197 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,193 @@ | ||
.. _compute: | ||
|
||
####### | ||
Compute | ||
####### | ||
|
||
To execute ``Transformation``\s and obtain free energy estimates, you must deploy compute services to resources suited to these calculations. | ||
This document details how to do this on several different types of compute resources. | ||
|
||
There is currently a single implementation of an ``alchemiscale`` compute service: the :py:class:`~alchemiscale.compute.service.SynchronousComputeService`. | ||
Other variants will likely be created in the future, optimized for different use cases. | ||
This documentation will expand over time as these variants become available; for now, it assumes use of this variant. | ||
|
||
In all cases, you will need to define a configuration file for your compute services to consume on startup. | ||
A template for this file can be found at the URL below; replace ``$ALCHEMISCALE_VERSION`` with the version tag (e.g. ``v0.1.4``) deployed on your server:: | ||
|
||
https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/configs/synchronous-compute-settings.yaml | ||
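As a minimal sketch of fetching the template, the helpers below build the URL for a given version tag and download the file with ``curl``. The function names ``compute_config_url`` and ``fetch_compute_config`` are hypothetical (they are not part of ``alchemiscale``), and the use of ``curl`` is an assumption; any HTTP client would do.

```shell
# Hypothetical helpers (not part of alchemiscale).

# Print the template URL for a version tag, e.g. v0.1.4
compute_config_url() {
    echo "https://raw.githubusercontent.com/openforcefield/alchemiscale/$1/devtools/configs/synchronous-compute-settings.yaml"
}

# Download the template for a version tag into the current directory
fetch_compute_config() {
    # -f: fail on HTTP errors; -sS: quiet, but still report errors; -L: follow redirects
    curl -fsSL -o synchronous-compute-settings.yaml "$(compute_config_url "$1")"
}

# Usage: fetch_compute_config v0.1.4
```

After downloading, edit the file to supply the values appropriate to your deployment before starting any services.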
|
||
|
||
*********** | ||
Single-Host | ||
*********** | ||
|
||
To deploy a compute service (or multiple services) to a single host, we recommend one of two routes: | ||
|
||
* installing all dependencies in a ``conda``/``mamba`` environment | ||
* running the services as Docker containers, with all dependencies baked in | ||
|
||
|
||
.. _compute_conda: | ||
|
||
Deploying with conda/mamba | ||
========================== | ||
|
||
To deploy via ``conda``/``mamba``, first create an environment (we recommend ``mamba`` for its performance):: | ||
|
||
mamba env create -n alchemiscale-compute-$ALCHEMISCALE_VERSION -f https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/conda-envs/alchemiscale-compute.yml | ||
|
||
Once created, activate the environment in your current shell:: | ||
|
||
conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION | ||
|
||
Then start a compute service, assuming your configuration file is in the current working directory, with:: | ||
|
||
alchemiscale compute synchronous -c synchronous-compute-settings.yaml | ||
|
||
|
||
.. _compute_docker: | ||
|
||
Deploying with Docker | ||
===================== | ||
|
||
Assuming your configuration file is in the current working directory, to deploy with Docker, you might use:: | ||
|
||
docker run --gpus all --rm -v $(pwd):/mnt ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION compute synchronous -c /mnt/synchronous-compute-settings.yaml | ||
|
||
|
||
See the `official Docker documentation on GPU use`_ for details on how to specify individual GPUs for each container you launch. | ||
It may also make sense to apply constraints to the number of CPUs available to each container to avoid oversubscription. | ||
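For example, on a host with several GPUs, one container can be launched per GPU with a CPU limit applied to each. This is a sketch under stated assumptions rather than a tested recipe: the function name ``launch_compute_services`` is hypothetical, the limit of 2 CPUs per container is arbitrary, and GPUs are assumed to be indexed from 0. The ``--gpus device=N`` and ``--cpus`` flags are standard ``docker run`` options.

```shell
# Hypothetical helper: start one detached compute service container per GPU.
launch_compute_services() {
    # $1: number of GPUs on this host (assumed to be indexed 0..N-1)
    n_gpus="$1"
    i=0
    while [ "$i" -lt "$n_gpus" ]; do
        # pin this container to GPU $i and cap it at 2 CPUs (arbitrary choice)
        docker run -d --rm \
            --gpus "device=$i" \
            --cpus 2 \
            -v "$(pwd)":/mnt \
            "ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION" \
            compute synchronous -c /mnt/synchronous-compute-settings.yaml
        i=$((i + 1))
    done
}

# Usage: ALCHEMISCALE_VERSION=v0.1.4; launch_compute_services 2
```

Each container here reads the same configuration file from the mounted working directory.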
|
||
|
||
*********** | ||
HPC Cluster | ||
*********** | ||
|
||
To deploy compute services to an HPC cluster, we recommend submitting them as individual jobs to the HPC cluster's scheduler. | ||
Different clusters feature different schedulers (e.g. SLURM, LSF, TORQUE/PBS, etc.), and vary widely in their hardware and queue configurations. | ||
You will need to tailor your specific approach to the constraints of the cluster you are targeting. | ||
|
||
The following is an example of the *content* of a script submitted to an HPC cluster. | ||
We have left off the header directives specific to the queueing system; the environment variables ``JOBID`` and ``JOBINDEX`` are placeholders that should be replaced with those provided by your queueing system. | ||
Note that for this case we've made use of a ``conda``/``mamba``-based deployment, detailed above in :ref:`compute_conda`:: | ||
|
||
# don't limit stack size | ||
ulimit -s unlimited | ||
# make scratch space | ||
mkdir -p /scratch/${USER}/${JOBID}-${JOBINDEX} | ||
# activate environment | ||
conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION | ||
# create a YAML file with specific substitutions | ||
# each service in this job can share the same config | ||
envsubst < settings.yaml > configs/settings.${JOBID}-${JOBINDEX}.yaml | ||
# start up a single service | ||
alchemiscale compute synchronous -c configs/settings.${JOBID}-${JOBINDEX}.yaml | ||
# remove scratch space | ||
rm -r /scratch/${USER}/${JOBID}-${JOBINDEX} | ||
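One way to tailor ``JOBID`` and ``JOBINDEX`` is to derive them from whichever scheduler variables are present. The sketch below checks the variables set by SLURM (``SLURM_JOB_ID``, ``SLURM_ARRAY_TASK_ID``) and LSF (``LSB_JOBID``, ``LSB_JOBINDEX``); the function name ``set_job_vars`` and the fallback values are hypothetical. The variables are exported because ``envsubst`` only substitutes variables present in its environment.

```shell
# Hypothetical helper: map scheduler-specific variables to JOBID/JOBINDEX.
set_job_vars() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # SLURM; the array task ID is unset for non-array jobs
        JOBID="$SLURM_JOB_ID"
        JOBINDEX="${SLURM_ARRAY_TASK_ID:-0}"
    elif [ -n "${LSB_JOBID:-}" ]; then
        # LSF
        JOBID="$LSB_JOBID"
        JOBINDEX="${LSB_JOBINDEX:-0}"
    else
        # no known scheduler detected; fall back to this shell's PID
        JOBID="$$"
        JOBINDEX=0
    fi
    # export so that child processes such as envsubst can see them
    export JOBID JOBINDEX
}

set_job_vars
```

Calling ``set_job_vars`` near the top of the job script, before the scratch directory is created, keeps the rest of the script scheduler-agnostic.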
|
||
|
||
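The ``envsubst`` line in particular generates a config specific to this job by substituting environment variables into the template. | ||
Note that ``envsubst`` only substitutes variables that are exported to its environment. | ||
A subset of the options used in the config file is given below:: | ||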
|
||
--- | ||
# options for service initialization | ||
init: | ||
  # Filesystem path to use for `ProtocolDAG` `shared` space. | ||
  shared_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/shared" | ||
  # Filesystem path to use for `ProtocolUnit` `scratch` space. | ||
  scratch_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/scratch" | ||
  # Path to file for logging output; if not set, logging will only go to | ||
  # STDOUT. | ||
  logfile: /home/${USER}/logs/service.${JOBID}.log | ||
# options for service execution | ||
start: | ||
  # Max number of Tasks to execute before exiting. If `null`, the service will | ||
  # have no task limit. | ||
  max_tasks: 1 | ||
  # Max number of seconds to run before exiting. If `null`, the service will | ||
  # have no time limit. | ||
  max_time: 300 | ||
|
||
|
||
For HPC job-based execution, we recommend limiting the number of ``Task``\s the compute service executes to a small number, preferably 1, and setting a time limit beyond which the compute service will shut down. | ||
With this configuration, when a compute service comes up and claims a ``Task``, it will have nearly the full walltime of its job to execute it. | ||
Any compute service that fails to claim a ``Task`` will shut itself down, and the job will exit, avoiding waste and a scenario where a ``Task`` is claimed without enough walltime left on the job to complete it. | ||
|
||
|
||
****************** | ||
Kubernetes Cluster | ||
****************** | ||
|
||
To deploy compute services to a Kubernetes ("k8s") cluster, we use an approach similar to the Docker deployment detailed above in :ref:`compute_docker`. | ||
We define a k8s `Deployment`_ featuring a single container spec as the file ``compute-services.yaml``:: | ||
|
||
apiVersion: apps/v1 | ||
kind: Deployment | ||
metadata: | ||
  name: alchemiscale-synchronouscompute | ||
  labels: | ||
    app: alchemiscale-synchronouscompute | ||
spec: | ||
  replicas: 1 | ||
  selector: | ||
    matchLabels: | ||
      app: alchemiscale-synchronouscompute | ||
  template: | ||
    metadata: | ||
      labels: | ||
        app: alchemiscale-synchronouscompute | ||
    spec: | ||
      containers: | ||
      - name: alchemiscale-synchronous-container | ||
        image: ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION | ||
        args: ["compute", "synchronous", "-c", "/mnt/settings/synchronous-compute-settings.yaml"] | ||
        resources: | ||
          limits: | ||
            cpu: 2 | ||
            memory: 12Gi | ||
            ephemeral-storage: 48Gi | ||
            nvidia.com/gpu: 1 | ||
          requests: | ||
            cpu: 2 | ||
            memory: 12Gi | ||
            ephemeral-storage: 48Gi | ||
        volumeMounts: | ||
        - name: alchemiscale-compute-settings-yaml | ||
          mountPath: "/mnt/settings" | ||
          readOnly: true | ||
        env: | ||
        - name: OPENMM_CPU_THREADS | ||
          value: "2" | ||
      volumes: | ||
      - name: alchemiscale-compute-settings-yaml | ||
        secret: | ||
          secretName: alchemiscale-compute-settings-yaml | ||
|
||
|
||
This assumes our configuration file has been defined as a *secret* in the cluster. | ||
Assuming the file is in the current working directory, we can add it as a secret with:: | ||
|
||
kubectl create secret generic alchemiscale-compute-settings-yaml --from-file=synchronous-compute-settings.yaml | ||
|
||
|
||
We can then deploy the compute services with:: | ||
|
||
kubectl apply -f compute-services.yaml | ||
|
||
To scale up the number of compute services, increase the number of ``replicas`` to the number desired, and re-run the ``kubectl apply`` command above. | ||
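As an alternative to editing the file, the replica count can be changed directly with ``kubectl scale``. The following is a brief sketch that assumes the ``Deployment`` name used above; the wrapper function ``scale_compute`` is hypothetical.

```shell
# Hypothetical helper: set the number of compute service replicas.
scale_compute() {
    # $1: desired number of replicas
    kubectl scale deployment alchemiscale-synchronouscompute --replicas="$1"
}

# Usage: scale_compute 4
```

Note that a later ``kubectl apply`` of ``compute-services.yaml`` will reset the count to the ``replicas`` value in that file, so keep the file up to date if you want the change to persist.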
|
||
A more complete example of this type of deployment can be found in `alchemiscale-k8s`_. | ||
|
||
|
||
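.. _official Docker documentation on GPU use: https://docs.docker.com/config/containers/resource_constraints/#gpu | ||
.. _Deployment: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ | ||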
.. _alchemiscale-k8s: https://github.com/datryllic/alchemiscale-k8s/tree/main/compute |