Added compute doc, adding other operations docs (e.g. backups)

OpenFreeEnergy · Aug 19, 2023 · 057baa4
1 parent b831756
commit 057baa4

Showing 3 changed files with 197 additions and 1 deletion.
diff --git a/docs/compute.rst b/docs/compute.rst
@@ -0,0 +1,193 @@
+.. _compute:
+
+#######
+Compute
+#######
+
+To execute ``Transformation``\s and obtain free energy estimates, you must deploy compute services to resources suitable for these types of calculations.
+This document details how to do this on several different types of compute resources.
+
+There is currently a single implementation of an ``alchemiscale`` compute service: the :py:class:`~alchemiscale.compute.service.SynchronousComputeService`.
+Other variants will likely be created in the future, optimized for different use cases.
+This documentation will expand over time as these variants become available; for now, it assumes use of this variant.
+
+In all cases, you will need to define a configuration file for your compute services to consume on startup.
+A template for this file can be found at the URL below; replace ``$ALCHEMISCALE_VERSION`` with the version tag (e.g. ``v0.1.4``) that you have deployed for your server::
+
+    https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/configs/synchronous-compute-settings.yaml
+
+
+***********
+Single-Host
+***********
+
+To deploy a compute service (or multiple services) to a single host, we recommend one of two routes:
+
+* installing all dependencies in a ``conda``/``mamba`` environment
+* running the services as Docker containers, with all dependencies baked in
+
+
+.. _compute_conda:
+
+Deploying with conda/mamba
+==========================
+
+To deploy via ``conda``/``mamba``, first create an environment (we recommend ``mamba`` for its performance)::
+
+    mamba env create -n alchemiscale-compute-$ALCHEMISCALE_VERSION -f https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/conda-envs/alchemiscale-compute.yml
+
+Once created, activate the environment in your current shell::
+
+    conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION
+
+Then start a compute service, assuming your configuration file is in the current working directory, with::
+
+    alchemiscale compute synchronous -c synchronous-compute-settings.yaml
+
+
+.. _compute_docker:
+
+Deploying with Docker
+=====================
+
+Assuming your configuration file is in the current working directory, to deploy with Docker, you might use::
+
+    docker run --gpus all --rm -v $(pwd):/mnt ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION compute synchronous -c /mnt/synchronous-compute-settings.yaml
+
+
+See the `official Docker documentation on GPU use`_ for details on how to specify individual GPUs for each container you launch.
+It may also make sense to apply constraints to the number of CPUs available to each container to avoid oversubscription.
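+
+As a hedged illustration of both points, the following pins one container to a single GPU and caps its CPU usage with Docker's ``--gpus`` and ``--cpus`` options; the device index ``0`` and the limit of 4 CPUs are arbitrary example values, not recommendations::
+
+    docker run --gpus '"device=0"' --cpus 4 --rm -v $(pwd):/mnt ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION compute synchronous -c /mnt/synchronous-compute-settings.yaml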
+
+
+***********
+HPC Cluster
+***********
+
+To deploy compute services to an HPC cluster, we recommend submitting them as individual jobs to the HPC cluster's scheduler.
+Different clusters feature different schedulers (e.g. SLURM, LSF, TORQUE/PBS, etc.), and vary widely in their hardware and queue configurations.
+You will need to tailor your specific approach to the constraints of the cluster you are targeting.
+
+The following is an example of the *content* of a script submitted to an HPC cluster.
+We have left off the top matter that is specific to the queueing system; certain environment variables (e.g. ``JOBID``, ``JOBINDEX``) should be replaced with those provided by your queueing system.
+Note that for this case we've made use of a ``conda``/``mamba``-based deployment, detailed above in :ref:`compute_conda`::
+
+    # don't limit stack size
+    ulimit -s unlimited
+    
+    # make scratch space
+    mkdir -p /scratch/${USER}/${JOBID}-${JOBINDEX}
+    
+    # activate environment
+    conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION
+    
+    # create a YAML file with specific substitutions
+    # each service in this job can share the same config
+    envsubst < settings.yaml > configs/settings.${JOBID}-${JOBINDEX}.yaml
+    
+    # start up a single service
+    alchemiscale compute synchronous -c configs/settings.${JOBID}-${JOBINDEX}.yaml
+    
+    # remove scratch space
+    rm -r /scratch/${USER}/${JOBID}-${JOBINDEX}
+
+
+The ``envsubst`` line in particular will make a config specific to this job, with environment variable substitutions.
+A subset of the options used in the config file is given below::
+
+    ---
+    # options for service initialization
+    init:
+    
+      # Filesystem path to use for `ProtocolDAG` `shared` space.
+      shared_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/shared"
+    
+      # Filesystem path to use for `ProtocolUnit` `scratch` space.
+      scratch_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/scratch"
+    
+      # Path to file for logging output; if not set, logging will only go to
+      # STDOUT.
+      logfile: /home/${USER}/logs/service.${JOBID}.log
+    
+    # options for service execution
+    start:
+    
+      # Max number of Tasks to execute before exiting. If `null`, the service will
+      # have no task limit.
+      max_tasks: 1
+    
+      # Max number of seconds to run before exiting. If `null`, the service will
+      # have no time limit.
+      max_time: 300
+
+
+For HPC job-based execution, we recommend limiting the number of ``Task``\s the compute service executes to a small number, preferably 1, and setting a time limit beyond which the compute service will shut down.
+With this configuration, when a compute service comes up and claims a ``Task``, it will have nearly the full walltime of its job to execute it.
+Any compute service that fails to claim a ``Task`` will shut itself down, and the job will exit, avoiding waste and a scenario where a ``Task`` is claimed without enough walltime left on the job to complete it.
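+
+As a rough sketch of the omitted top matter, assuming a SLURM cluster, the submitted script might begin as follows.
+The job name, array size, and resource values are placeholders rather than recommendations; ``SLURM_ARRAY_JOB_ID`` and ``SLURM_ARRAY_TASK_ID`` are the variables SLURM sets for array jobs, mapped here onto the generic names used in the script above::
+
+    #!/bin/bash
+    #SBATCH --job-name=alchemiscale-compute
+    #SBATCH --array=1-10
+    #SBATCH --gres=gpu:1
+    #SBATCH --cpus-per-task=2
+    #SBATCH --time=06:00:00
+
+    # map SLURM's variables onto the generic names used in the script;
+    # export them so that envsubst can substitute them into the config
+    export JOBID=${SLURM_ARRAY_JOB_ID}
+    export JOBINDEX=${SLURM_ARRAY_TASK_ID}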
+
+
+******************
+Kubernetes Cluster
+******************
+
+To deploy compute services to a Kubernetes ("k8s") cluster, we use an approach similar to the Docker deployment detailed above in :ref:`compute_docker`.
+We define a k8s `Deployment`_ featuring a single container spec as the file ``compute-services.yaml``::
+
+    apiVersion: apps/v1
+    kind: Deployment
+    metadata:
+      name: alchemiscale-synchronouscompute
+      labels:
+        app: alchemiscale-synchronouscompute
+    spec:
+      replicas: 1
+      selector:
+        matchLabels:
+          app: alchemiscale-synchronouscompute
+      template:
+        metadata:
+          labels:
+            app: alchemiscale-synchronouscompute
+        spec:
+          containers:
+          - name: alchemiscale-synchronous-container
+            image: ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION
+            args: ["compute", "synchronous", "-c", "/mnt/settings/synchronous-compute-settings.yaml"]
+            resources:
+              limits:
+                cpu: 2
+                memory: 12Gi
+                ephemeral-storage: 48Gi
+                nvidia.com/gpu: 1
+              requests:
+                cpu: 2
+                memory: 12Gi
+                ephemeral-storage: 48Gi
+            volumeMounts:
+              - name: alchemiscale-compute-settings-yaml
+                mountPath: "/mnt/settings"
+                readOnly: true
+            env:
+              - name: OPENMM_CPU_THREADS
+                value: "2"
+          volumes:
+            - name: alchemiscale-compute-settings-yaml
+              secret:
+                secretName: alchemiscale-compute-settings-yaml
+
+
+This assumes our configuration file has been defined as a *secret* in the cluster.
+Assuming the file is in the current working directory, we can add it as a secret with::
+
+    kubectl create secret generic alchemiscale-compute-settings-yaml --from-file=synchronous-compute-settings.yaml
+
+
+We can then deploy the compute services with::
+
+    kubectl apply -f compute-services.yaml
+
+To scale up the number of compute services, increase the number of ``replicas`` to the number desired, and re-run the ``kubectl apply`` command above.
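+
+Alternatively, assuming the Deployment name used above, the standard ``kubectl scale`` command changes the replica count without editing the file (the count of 4 is an arbitrary example).
+Note that a later ``kubectl apply`` will reset the count to the value in ``compute-services.yaml``::
+
+    kubectl scale deployment alchemiscale-synchronouscompute --replicas=4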
+
+A more complete example of this type of deployment can be found in `alchemiscale-k8s`_.
+
+
+.. _Deployment: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
+.. _alchemiscale-k8s: https://github.com/datryllic/alchemiscale-k8s/tree/main/compute
+.. _official Docker documentation on GPU use: https://docs.docker.com/config/containers/resource_constraints/#gpu
diff --git a/docs/index.rst b/docs/index.rst
@@ -30,6 +30,7 @@ in particular the `OpenForceField`_ and `OpenFreeEnergy`_ ecosystems.
    ./overview
    ./user_guide
    ./deployment
+   ./compute
    ./operations
    ./API_docs
 

diff --git a/docs/operations.rst b/docs/operations.rst
@@ -9,7 +9,7 @@ Add Users
 To add a new user identity, you will generally use the ``alchemiscale`` CLI::
 
 
-    $ export NEO4J_URL=bolt://<NEO4J_HOSTNAME>7687
+    $ export NEO4J_URL=bolt://<NEO4J_HOSTNAME>:7687
     $ export NEO4J_USER=<NEO4J_USERNAME>
     $ export NEO4J_PASS=<NEO4J_PASSWORD>
     $
@@ -51,3 +51,5 @@ The important bits here are:
 Backups
 *******
 
+Performing regular backups of the state store is an important component for any production deployment of ``alchemiscale``.
+To