WIP based on the work of:

- Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng. 2023. A GPU-accelerated Molecular Docking Workflow with Kubernetes and Apache Airflow. WOCC'23.
- Christopher Woods, University of Bristol, UK. "Running Serverless HPC Workloads on Top of Kubernetes and Jupyter Notebooks". CNCF KubeCon18.
A workflow for molecular docking using AutoDock4. The workflow is implemented as a DAG and can be run in Apache Airflow on a Kubernetes cluster.

The main DAG is contained in autodock.py. We also provide the following folders:

- docker/: Docker builds for the images; each build includes a Dockerfile, along with the bash scripts that are included in the image.
- iac/: files related to infrastructure as code, including Ansible files for a quick deployment.
- rke2/: one folder per deployment, with the associated values and configurations.
## Setup and usage (adapted from Daniel's original README)
In summary, the workflow assumes:

- Kubernetes: a PersistentVolume with name pv-autodock
- Kubernetes: a PersistentVolumeClaim with name pvc-autodock
- Docker: a Docker image available in a public registry with name example/autodock:1.5.3
- Apache Airflow: a pool with name gpu_pool is created
- Apache Airflow: autodock.py is in Apache Airflow's DAG folder
- DAG: in autodock.py, PVC_NAME is set
- DAG: in autodock.py, IMAGE_NAME is set
- DAG: a .sdf ligand database is stored in the root of the PersistentVolume
- A Kubernetes cluster
- A working Apache Airflow setup: in particular, Apache Airflow must be configured to be able to run tasks on the Kubernetes cluster.
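For context, each task in this workflow runs as a pod on the Kubernetes cluster. The snippet below is only an illustrative sketch of how such a task is typically declared with Airflow's KubernetesPodOperator; it is not the actual code from autodock.py, and the command, volume name, and mount path are placeholders (the image, claim name, namespace, and pool match the values used later in this README).

```python
# Illustrative sketch only -- not the actual task definition from autodock.py.
# Assumes the apache-airflow-providers-cncf-kubernetes provider is installed.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
# (on older provider versions, the module is ...operators.kubernetes_pod instead)
from kubernetes.client import models as k8s

with DAG(dag_id="autodock_sketch", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # Mount the PersistentVolumeClaim so all tasks share files under /data.
    volume = k8s.V1Volume(
        name="workdir",
        persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
            claim_name="pvc-autodock"
        ),
    )
    volume_mount = k8s.V1VolumeMount(name="workdir", mount_path="/data")

    docking = KubernetesPodOperator(
        task_id="docking",
        name="docking",
        namespace="airflow",                 # namespace of the Airflow deployment
        image="gabinsc/autodock-gpu:1.5.3",  # the image built below
        cmds=["/bin/sh", "-c", "echo docking step runs here"],  # placeholder command
        volumes=[volume],
        volume_mounts=[volume_mount],
        pool="gpu_pool",                     # caps the number of concurrent GPU tasks
        get_logs=True,
    )
```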
The workflow relies on a specific PersistentVolumeClaim being present on the Kubernetes cluster to store files during execution. In this step, we describe how to create a PersistentVolume, and a PersistentVolumeClaim attached to this volume.

PersistentVolume. Your Kubernetes cluster administrator may provide you with the name of the PersistentVolume to use. However, if you manage your own Kubernetes cluster, you need to create a PersistentVolume yourself; we provide an example in misc/persistentVolume.yaml, which you can deploy using:
kubectl create -f misc/persistentVolume.yaml
In this example, and in the rest of this tutorial, the PersistentVolume is named pv-autodock. Please refer to the Kubernetes documentation to learn more about PersistentVolumes.
PersistentVolumeClaim. Once you know the name of the PersistentVolume (in this example, pv-autodock), you need to create a PersistentVolumeClaim, which specifies the requested storage size and refers to the underlying PersistentVolume. An example is provided in misc/pvclaim.yaml; you can create it using:
kubectl create -f misc/pvclaim.yaml -n airflow
In this example, and in the rest of this tutorial, the PersistentVolumeClaim is named pvc-autodock. Furthermore, it is created in the same namespace as the one in which the Apache Airflow setup is deployed; here we use airflow.
A Dockerfile is provided in the docker folder, along with the scripts that are included in the image.
To build and publish the image:
cd docker/
docker build -t gabinsc/autodock-gpu:1.5.3 .
docker push gabinsc/autodock-gpu:1.5.3
Please refer to Docker documentation for more details on building an image, and publishing it. Make sure that the image is published to a public Docker registry, or at least to a registry which is accessible from the Apache Airflow setup.
In order for the DAG to be executed in your specific environment, some adjustments are required.
- Place the autodock.py file in the DAG folder of your Apache Airflow setup.
- Adjust the following constants in autodock.py (see the sketch after this list):
  - IMAGE_NAME: name of the image that will be used for the containers.
  - PVC_NAME: name of the PersistentVolumeClaim created in step 1, pvc-autodock.
- Validate that you can see the DAG under the name autodock in the Apache Airflow UI. If not, DAG import errors are reported at the top of the UI.
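As an example, the top of autodock.py would then contain something along these lines (the values shown are the ones used throughout this tutorial; adjust them to your environment):

```python
# In autodock.py -- example values, adjust to your environment.
IMAGE_NAME = "gabinsc/autodock-gpu:1.5.3"  # image built and pushed in the previous step
PVC_NAME = "pvc-autodock"                  # PersistentVolumeClaim created in step 1
```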
The administrator of the Apache Airflow setup must create a pool named gpu_pool, which groups (and limits) the execution of GPU tasks. You can create the pool in the Apache Airflow UI under Admin > Pools. We recommend setting the pool size to the maximum number of GPUs you want to use in parallel.
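If you prefer to script this step instead of using the UI, the pool can also be created through Airflow's stable REST API. The sketch below assumes the API is reachable at http://localhost:8080 with basic authentication enabled, and a pool size of 4; adjust the URL, credentials, and slot count to your setup.

```python
# Sketch: create the gpu_pool through the Airflow stable REST API (Airflow 2.x).
import requests

resp = requests.post(
    "http://localhost:8080/api/v1/pools",   # adjust to your Airflow webserver URL
    json={"name": "gpu_pool", "slots": 4},  # slots = max number of GPU tasks in parallel
    auth=("admin", "admin"),                # replace with your credentials
)
resp.raise_for_status()
print(resp.json())
```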
Before you run the DAG, place your database of ligands, in .sdf format, in the root of the PersistentVolume you defined when configuring your Kubernetes cluster, for example sweetlead.sdf.
Click on "Trigger DAG" in the Apache Airflow UI to start the DAG with the default parameters. You can customize the DAG parameters to your needs by clicking "Trigger DAG w/ config":

- pdbid: PDB ID of the protein you want to use as a receptor. Note that in the PDB database, this generally refers to a protein-ligand complex; the workflow automatically keeps the longest chain in the complex.
- ligand_db: the name of the ligand database, without the .sdf extension.
- ligands_chunk_size: batch size (number of ligands per chunk).
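The same parameters can also be passed programmatically through the stable REST API. The sketch below is only an example: the URL and credentials are placeholders, and the parameter values (a PDB ID, the sweetlead database, a chunk size of 100) are illustrative.

```python
# Sketch: trigger the autodock DAG with custom parameters via the REST API.
import requests

conf = {
    "pdbid": "7cpa",            # example PDB ID (placeholder)
    "ligand_db": "sweetlead",   # name of the .sdf ligand database, without extension
    "ligands_chunk_size": 100,  # example batch size
}
resp = requests.post(
    "http://localhost:8080/api/v1/dags/autodock/dagRuns",  # adjust URL to your setup
    json={"conf": conf},
    auth=("admin", "admin"),    # replace with your credentials
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```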
We provide Python scripts to create readable Gantt charts based on the workflow execution. Note that a Gantt chart is available for each DAG execution in the Apache Airflow UI; however, that chart offers limited interactivity and can be hard to read for complex or long-running DAGs.
Requirements:

- python (≥ 3.9)
- python libraries: plotly, requests
Three scripts are available, each plotting a different Gantt chart:
- Resource view: each line in the chart represents a slot in a pool (note that multi-slot tasks are not supported)
- Task view: each line represents a task
- Multi-execution resource view: several DAG runs can be presented on the same Gantt chart, each run has its own color.
Before running those scripts, you need to set the constants in plot/constants.py:

- BASE_URL: base URL to access the Airflow API
- SESSION_COOKIE: session cookie, which can typically be obtained from the Network section of your browser's DevTools when logged in to the Apache Airflow UI
- DAG_ID: the name of the DAG, autodock
- POOL_ALIAS: alias names for the various pools, shown in the legend
To execute a script for a specific DAG execution, you need to provide DAG_RUN_ID, the ID of the particular DAG run you want to plot; it can be retrieved from the Apache Airflow UI, or listed via the API as sketched below.
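If you prefer not to copy the run ID from the UI, recent run IDs can also be listed through the stable REST API; a small sketch (URL and cookie are placeholders):

```python
# Sketch: list recent run IDs of the autodock DAG via the Airflow REST API.
import requests

resp = requests.get(
    "http://localhost:8080/api/v1/dags/autodock/dagRuns",
    params={"order_by": "-start_date", "limit": 10},
    headers={"Cookie": "session=<paste cookie>"},  # same session cookie as in constants.py
)
resp.raise_for_status()
for run in resp.json()["dag_runs"]:
    print(run["dag_run_id"], run["state"])
```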
When running the scripts, figures are written to the figures/ folder.
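For reference, the core idea behind the task view boils down to fetching the task instances of one run and handing their start and end times to Plotly. The following is a simplified sketch of that idea, not one of the provided scripts; it additionally assumes pandas (a Plotly Express dependency) and uses a placeholder URL, cookie, and run ID.

```python
# Simplified sketch of a task-view Gantt chart -- not one of the provided scripts.
import os

import pandas as pd
import plotly.express as px
import requests

BASE_URL = "http://localhost:8080"
DAG_ID = "autodock"
DAG_RUN_ID = "<your run id>"
HEADERS = {"Cookie": "session=<paste cookie>"}

# Fetch the task instances of the selected DAG run.
resp = requests.get(
    f"{BASE_URL}/api/v1/dags/{DAG_ID}/dagRuns/{DAG_RUN_ID}/taskInstances",
    params={"limit": 1000},
    headers=HEADERS,
)
resp.raise_for_status()
tis = [ti for ti in resp.json()["task_instances"] if ti["end_date"]]  # keep finished tasks

# One bar per task instance, from start_date to end_date, colored by pool.
df = pd.DataFrame(
    {
        "task": [ti["task_id"] for ti in tis],
        "start": pd.to_datetime([ti["start_date"] for ti in tis]),
        "end": pd.to_datetime([ti["end_date"] for ti in tis]),
        "pool": [ti["pool"] for ti in tis],
    }
)
fig = px.timeline(df, x_start="start", x_end="end", y="task", color="pool")

os.makedirs("figures", exist_ok=True)
fig.write_html("figures/gantt_task_view.html")
```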
- Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng. 2023. A GPU-accelerated Molecular Docking Workflow with Kubernetes and Apache Airflow. WOCC'23