# fq2vcf: OpenOmics DeepVariant-based Variant Calling Pipeline
### Overview:
OpenOmics's fq2vcf is a highly optimized, distributed, deep-learning-based short-read germline variant calling pipeline for x86 CPUs.
The pipeline comprises:
1. bwa-mem2 (a highly optimized version of bwa-mem) for sequence mapping
2. SAM sorting using samtools
3. an optimized version of DeepVariant for variant calling

The following figure illustrates the pipeline.

<p align="center">
<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/deepvariant-fq2vcf.jpg"/><br>
</p>
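The three stages above can be sketched as a single shell flow. The snippet below is an illustrative dry run, not the repository's actual wrapper script: the file names, thread counts, and the `run_deepvariant` invocation are assumptions, so the commands are only printed rather than executed.

```shell
#!/usr/bin/env bash
# Illustrative dry run of the three pipeline stages.
# File names, thread counts, and paths are hypothetical.
R1=HG001_1.fastq.gz
R2=HG001_2.fastq.gz
REF=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

# Stages 1-2: map with bwa-mem2 and sort the SAM stream with samtools.
map_sort="bwa-mem2 mem -t 16 $REF $R1 $R2 | samtools sort -@ 16 -o HG001.sorted.bam -"

# Stage 3: call variants with the DeepVariant image.
call="run_deepvariant --model_type=WGS --ref=$REF --reads=HG001.sorted.bam --output_vcf=HG001.vcf.gz"

# Dry run: print the commands instead of executing them.
printf '%s\n%s\n' "$map_sort" "$call"
```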


# Using Dockerfile (Single Node)
### 1. Download the code:

```bash
cd Open-Omics-Acceleration-Framework/pipelines/deepvariant-based-germline-variant-calling-fq2vcf
```
### 2. Build Docker images
```bash
docker build -f Dockerfile_part1 -t deepvariant:part1 .   # Part I: fq2bam
docker build -f Dockerfile_part2 -t deepvariant:part2 .   # Part II: bam2vcf
```
### 3. Set up input parameters through the ```config``` file<sup>1</sup>
```bash
INPUT_DIR=<readdir>   # location of the input reads files (default location /reads in docker)
OUTPUT_DIR=<outdir>   # location of the output files (default location /output in docker)
REF_DIR=<refdir>      # location of the reference sequence (default location /ref in docker)
R1=HG001_1.fastq.gz   # name of input reads file1
R2=HG001_2.fastq.gz   # name of input reads file2
REF=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna   # name of the reference sequence
```
Update the above fields in the `./config` file and in the `extra_scripts/config` file.

<sup>**1**</sup> **All fields are mandatory**
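Since all fields are mandatory, a small pre-flight check can catch a missing one before the containers start. The helper below is a hedged sketch, not part of the repository's scripts; it assumes the config file uses plain `KEY=value` shell syntax as shown above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: source the config file and verify that
# every mandatory field is non-empty before launching the pipeline.
check_config() {
  local cfg="$1" missing=0
  # shellcheck disable=SC1090
  source "$cfg"
  for field in INPUT_DIR OUTPUT_DIR REF_DIR R1 R2 REF; do
    if [ -z "${!field}" ]; then
      echo "missing mandatory field: $field"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: `check_config ./config || exit 1` before step 4.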

### 4. Run Docker images
```bash
docker run -v ./config:/Open-Omics-Acceleration-Framework/pipelines/deepvariant-based-germline-variant-calling-fq2vcf/scripts/aws/config -v <readdir>:/reads -v <refdir>:/ref -v <outdir>:/output -it deepvariant:part1 bash run_pipeline_ec2_part1.sh
docker run -v ./extra_scripts/config:/opt/deepvariant/config -v <refdir>:/ref -v <outdir>:/output -it deepvariant:part2 bash run_pipeline_ec2_part2.sh
```
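The two `docker run` lines can also be assembled from shell variables so the mount paths can be inspected before anything runs. This wrapper is an illustrative sketch, not shipped with the repository; the `/data/...` paths are placeholder values, and the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: build the two docker run commands from
# variables and print them instead of executing.
READDIR=/data/reads REFDIR=/data/ref OUTDIR=/data/output
CFG_DEST=/Open-Omics-Acceleration-Framework/pipelines/deepvariant-based-germline-variant-calling-fq2vcf/scripts/aws/config

part1="docker run -v ./config:$CFG_DEST -v $READDIR:/reads -v $REFDIR:/ref -v $OUTDIR:/output -it deepvariant:part1 bash run_pipeline_ec2_part1.sh"
part2="docker run -v ./extra_scripts/config:/opt/deepvariant/config -v $REFDIR:/ref -v $OUTDIR:/output -it deepvariant:part2 bash run_pipeline_ec2_part2.sh"

echo "$part1"
echo "$part2"
```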

### General Notes:
* The source code of bwa-mem2, samtools, and DeepVariant resides in ```Open-Omics-Acceleration-Framework/applications/```.
* We build bwa-mem2 and samtools from source, while for DeepVariant we build a docker image and then use it when running the pipeline. Note that the pre-built image is not available on Docker Hub; it must be built from source.
* We provide scripts for setting up a miniconda environment (setup_env.sh) and for compiling the required pipeline tools and building & loading the DeepVariant docker image (setup.py). These scripts are located at _Open-Omics-Acceleration-Framework/pipelines/deepvariant-based-germline-variant-calling-fq2vcf_. Note that all the scripts are, by default, written for docker.

* Prerequisites: our scripts assume the following packages are already installed on the system:
* Docker / Podman
* gcc >= 8.5.0
* MPI
* make >= 4.3
* autoconf >= 2.69
* zlib1g-dev
* libncurses5-dev
* libbz2-dev
* liblzma-dev
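A quick way to verify the versioned prerequisites (gcc, make, autoconf) is a `sort -V` comparison. The snippet below is a hedged sketch, not part of the repository's setup scripts; checking the library packages is left to your distribution's package manager.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (uses sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | tail -n1)" = "$1" ]
}

# Example: compare the installed gcc against the required minimum.
gcc_version=$(gcc -dumpversion 2>/dev/null || echo 0)
if version_ge "$gcc_version" 8.5.0; then
  echo "gcc $gcc_version is new enough"
else
  echo "gcc >= 8.5.0 required (found $gcc_version)"
fi
```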

# Instructions to run the pipeline on-prem (Single node & Multi-node)
### 1. Download the latest release:
```bash
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/2.1/Source_code_with_submodules.tar.gz
```

```bash
bash run_pipeline_cluster.sh [docker | podman | "sudo docker"]
```
Our latest results are published in this [blog](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-Xeon-is-all-you-need-for-AI-inference-Performance/post/1506083).


# Instructions to run the pipeline on an AWS ec2 instance (Single Node)
The following instructions run seamlessly on a standalone AWS ec2 instance. To run them, create an ec2 instance with Ubuntu-22.04, at least 60GB of memory, and 500GB of disk. The input reference sequence and the paired-end read datasets must be downloaded and stored on the disk.

### One-time setup
```bash
bash run_pipeline_ec2.sh
```


# Instructions to run the pipeline on an AWS ParallelCluster (Multi-node)

The following instructions run seamlessly on AWS ParallelCluster. To run them, first create a cluster as follows:
- Cluster setup: follow these steps to set up an [AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/v2/ug/what-is-aws-parallelcluster.html). See the example [config file](scripts/aws/pcluster_example_config) for pcluster setup (the _config_ file resides at ~/.parallelcluster/config/ on the local machine). Please note: for best performance, use a shared file system with Amazon EBS _volume\_type = io2_ and _volume\_iops = 64000_ in the config file.
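For reference, the EBS performance settings mentioned above live in the shared-storage section of a ParallelCluster v2 config. The fragment below is illustrative only: the section names and the `shared_dir` value are assumptions, and only `volume_type`/`volume_iops` come from the text above; consult `scripts/aws/pcluster_example_config` for the real file.

```ini
; Illustrative fragment of ~/.parallelcluster/config (ParallelCluster v2).
; Section names and shared_dir are assumed for this sketch.
[cluster default]
ebs_settings = shared

[ebs shared]
shared_dir = /shared
volume_type = io2
volume_iops = 64000
```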
