WIP based on the work of:

- Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng. 2023. A GPU-accelerated Molecular Docking Workflow with Kubernetes and Apache Airflow. WOCC'23.
- Christopher Woods, University of Bristol, UK. "Running Serverless HPC Workloads on Top of Kubernetes and Jupyter Notebooks". CNCF KubeCon18.
A workflow for molecular docking using AutoDock4. The workflow is implemented as a DAG and can be run in Apache Airflow on a Kubernetes cluster.

The main DAG is contained in autodock.py. We also provide the following folders:

- docker/: Docker builds for the images; each build includes a Dockerfile, along with the bash scripts that are included in the image.
- iac/: files related to infrastructure as code, including Ansible files for a quick deployment.
- rke2/: one folder per deployment, with the associated values and configurations.
## Setup and usage (adapted from Daniel's original README)
In summary, the workflow assumes:

- Kubernetes: a PersistentVolume with name pv-autodock
- Kubernetes: a PersistentVolumeClaim with name pvc-autodock
- Docker: a Docker image available in a public registry with name example/autodock:1.5.3
- Apache Airflow: a pool with name gpu_pool is created
- Apache Airflow: autodock.py is in Apache Airflow's DAG folder
- DAG: in autodock.py, PVC_NAME is set
- DAG: in autodock.py, IMAGE_NAME is set
- DAG: a .sdf ligand database is stored in the root of the PersistentVolume
- A Kubernetes cluster
- A working Apache Airflow setup: in particular, Apache Airflow must be configured to be able to run tasks on the Kubernetes cluster.
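For context, each task in this workflow runs as a pod on the Kubernetes cluster. The snippet below is only an illustrative sketch of how such a task is typically declared with Airflow's KubernetesPodOperator; it is not the actual code from autodock.py, and the command, volume name, and mount path are placeholders (the image, claim name, namespace, and pool match the values used later in this README).

```python
# Illustrative sketch only -- not the actual task definition from autodock.py.
# Assumes the apache-airflow-providers-cncf-kubernetes provider is installed.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
# (on older provider versions, the module is ...operators.kubernetes_pod instead)
from kubernetes.client import models as k8s

with DAG(dag_id="autodock_sketch", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # Mount the PersistentVolumeClaim so all tasks share files under /data.
    volume = k8s.V1Volume(
        name="workdir",
        persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
            claim_name="pvc-autodock"
        ),
    )
    volume_mount = k8s.V1VolumeMount(name="workdir", mount_path="/data")

    docking = KubernetesPodOperator(
        task_id="docking",
        name="docking",
        namespace="airflow",                 # namespace of the Airflow deployment
        image="gabinsc/autodock-gpu:1.5.3",  # the image built below
        cmds=["/bin/sh", "-c", "echo docking step runs here"],  # placeholder command
        volumes=[volume],
        volume_mounts=[volume_mount],
        pool="gpu_pool",                     # caps the number of concurrent GPU tasks
        get_logs=True,
    )
```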
The workflow relies on a specific PersistentVolumeClaim being present on the Kubernetes cluster to store files during execution. In this step, we describe how to create a PersistentVolume, and a PersistentVolumeClaim attached to this volume.

PersistentVolume. Your Kubernetes cluster administrator may provide you with the name of the PersistentVolume to use. However, if you manage your own Kubernetes cluster, you need to create a PersistentVolume yourself; we provide an example in misc/persistentVolume.yaml, which you can deploy using:
kubectl create -f misc/persistentVolume.yaml
In this example, and in the rest of this tutorial, the PersistentVolume is named pv-autodock. Please refer to the Kubernetes documentation to learn more about PersistentVolumes.
PersistentVolumeClaim. Once you know the name of the PersistentVolume (in this example, pv-autodock), you need to create a PersistentVolumeClaim, which specifies the requested storage size and refers to the underlying PersistentVolume. An example is provided in misc/pvclaim.yaml; you can create it using:
kubectl create -f misc/pvclaim.yaml -n airflow
In this example, and in the rest of this tutorial, the PersistentVolumeClaim is named pvc-autodock. Furthermore, it is created in the same namespace as the one in which the Apache Airflow setup is deployed; here we use airflow.
A Dockerfile is provided in the docker folder, along with the scripts that are included in the image.
To build and publish the image:
cd docker/
docker build -t gabinsc/autodock-gpu:1.5.3 .
docker push gabinsc/autodock-gpu:1.5.3
Please refer to Docker documentation for more details on building an image, and publishing it. Make sure that the image is published to a public Docker registry, or at least to a registry which is accessible from the Apache Airflow setup.
In order for the DAG to be executed in your specific environment, some adjustments are required.
- Place the autodock.py file in the DAG folder of your Apache Airflow setup.
- Adjust the following constants in autodock.py (see the sketch after this list):
  - IMAGE_NAME: name of the image that will be used for the containers.
  - PVC_NAME: name of the PersistentVolumeClaim created in step 1, pvc-autodock.
- Validate that you can see the DAG under the name autodock in the Apache Airflow UI. If not, DAG import errors are reported at the top of the UI.
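As an example, the top of autodock.py would then contain something along these lines (the values shown are the ones used throughout this tutorial; adjust them to your environment):

```python
# In autodock.py -- example values, adjust to your environment.
IMAGE_NAME = "gabinsc/autodock-gpu:1.5.3"  # image built and pushed in the previous step
PVC_NAME = "pvc-autodock"                  # PersistentVolumeClaim created in step 1
```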
The administrator of the Apache Airflow setup must create a pool named gpu_pool, which groups (and limits) the execution of GPU tasks. You can create the pool in the Apache Airflow UI under Admin > Pools. We recommend setting the pool size to the maximum number of GPUs you want to use in parallel.
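If you prefer to script this step instead of using the UI, the pool can also be created through Airflow's stable REST API. The sketch below assumes the API is reachable at http://localhost:8080 with basic authentication enabled, and a pool size of 4; adjust the URL, credentials, and slot count to your setup.

```python
# Sketch: create the gpu_pool through the Airflow stable REST API (Airflow 2.x).
import requests

resp = requests.post(
    "http://localhost:8080/api/v1/pools",   # adjust to your Airflow webserver URL
    json={"name": "gpu_pool", "slots": 4},  # slots = max number of GPU tasks in parallel
    auth=("admin", "admin"),                # replace with your credentials
)
resp.raise_for_status()
print(resp.json())
```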
Before you run the DAG, place your database of ligands, in .sdf format, in the root of the PersistentVolume you defined when configuring your Kubernetes cluster, for example sweetlead.sdf.
Click on "Trigger DAG" in the Apache Airflow UI to start the DAG with the default parameters. You can customize the DAG parameters to your needs by clicking "Trigger DAG w/ config":

- pdbid: PDB ID of the protein you want to use as a receptor. Note that in the PDB database, this generally refers to a protein-ligand complex; the workflow automatically keeps the longest chain in the complex.
- ligand_db: the name of the ligand database, without the .sdf extension.
- ligands_chunk_size: batch size (number of ligands per chunk).
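The same parameters can also be passed programmatically through the stable REST API. The sketch below is only an example: the URL and credentials are placeholders, and the parameter values (a PDB ID, the sweetlead database, a chunk size of 100) are illustrative.

```python
# Sketch: trigger the autodock DAG with custom parameters via the REST API.
import requests

conf = {
    "pdbid": "7cpa",            # example PDB ID (placeholder)
    "ligand_db": "sweetlead",   # name of the .sdf ligand database, without extension
    "ligands_chunk_size": 100,  # example batch size
}
resp = requests.post(
    "http://localhost:8080/api/v1/dags/autodock/dagRuns",  # adjust URL to your setup
    json={"conf": conf},
    auth=("admin", "admin"),    # replace with your credentials
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```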
We provide Python scripts to create readable Gantt charts based on the workflow execution. Note that a Gantt chart is available for each DAG execution in the Apache Airflow UI; however, that chart offers limited interactivity and can be hard to read for complex or long-running DAGs.
Requirements:

- python (≥ 3.9)
- python libraries: plotly, requests
Three scripts are available, each plotting a different Gantt chart:
- Resource view: each line in the chart represents a slot in a pool (note that multi-slot tasks are not supported)
- Task view: each line represents a task
- Multi-execution resource view: several DAG runs can be presented on the same Gantt chart, each run has its own color.
Before running those scripts, you need to set the constants in plot/constants.py:

- BASE_URL: base URL to access the Airflow API
- SESSION_COOKIE: session cookie, which can typically be obtained from the Network section of your browser's DevTools when logged in to the Apache Airflow UI
- DAG_ID: the name of the DAG, autodock
- POOL_ALIAS: alias names for the various pools, shown in the legend
To execute a script for a specific DAG execution, you need to provide DAG_RUN_ID, the ID of the particular DAG run you want to plot; it can be retrieved from the Apache Airflow UI, or listed via the API as sketched below.
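If you prefer not to copy the run ID from the UI, recent run IDs can also be listed through the stable REST API; a small sketch (URL and cookie are placeholders):

```python
# Sketch: list recent run IDs of the autodock DAG via the Airflow REST API.
import requests

resp = requests.get(
    "http://localhost:8080/api/v1/dags/autodock/dagRuns",
    params={"order_by": "-start_date", "limit": 10},
    headers={"Cookie": "session=<paste cookie>"},  # same session cookie as in constants.py
)
resp.raise_for_status()
for run in resp.json()["dag_runs"]:
    print(run["dag_run_id"], run["state"])
```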
When running the scripts, figures are written to the figures/ folder.
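For reference, the core idea behind the task view boils down to fetching the task instances of one run and handing their start and end times to Plotly. The following is a simplified sketch of that idea, not one of the provided scripts; it additionally assumes pandas (a Plotly Express dependency) and uses a placeholder URL, cookie, and run ID.

```python
# Simplified sketch of a task-view Gantt chart -- not one of the provided scripts.
import os

import pandas as pd
import plotly.express as px
import requests

BASE_URL = "http://localhost:8080"
DAG_ID = "autodock"
DAG_RUN_ID = "<your run id>"
HEADERS = {"Cookie": "session=<paste cookie>"}

# Fetch the task instances of the selected DAG run.
resp = requests.get(
    f"{BASE_URL}/api/v1/dags/{DAG_ID}/dagRuns/{DAG_RUN_ID}/taskInstances",
    params={"limit": 1000},
    headers=HEADERS,
)
resp.raise_for_status()
tis = [ti for ti in resp.json()["task_instances"] if ti["end_date"]]  # keep finished tasks

# One bar per task instance, from start_date to end_date, colored by pool.
df = pd.DataFrame(
    {
        "task": [ti["task_id"] for ti in tis],
        "start": pd.to_datetime([ti["start_date"] for ti in tis]),
        "end": pd.to_datetime([ti["end_date"] for ti in tis]),
        "pool": [ti["pool"] for ti in tis],
    }
)
fig = px.timeline(df, x_start="start", x_end="end", y="task", color="pool")

os.makedirs("figures", exist_ok=True)
fig.write_html("figures/gantt_task_view.html")
```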
- Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng. 2023. A GPU-accelerated Molecular Docking Workflow with Kubernetes and Apache Airflow. WOCC'23