# fq2vcf: OpenOmics DeepVariant-based Variant Calling Pipeline
### Overview:
OpenOmics's fq2vcf is a highly optimized, distributed, deep-learning-based short-read germline variant calling pipeline for x86 CPUs.
The pipeline comprises:
1. bwa-mem2 (a highly optimized version of bwa-mem) for sequence mapping
2. SAM sorting using samtools
3. an optimized version of DeepVariant for variant calling

The following figure illustrates the pipeline.

<p align="center">
<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/deepvariant-fq2vcf.jpg"/><br>
</p>
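The three stages above can be sketched as a single shell flow. The snippet below is an illustrative dry run, not the repository's actual wrapper script: the file names, thread counts, and the `run_deepvariant` invocation are assumptions, so the commands are only printed rather than executed.

```shell
#!/usr/bin/env bash
# Illustrative dry run of the three pipeline stages.
# File names, thread counts, and paths are hypothetical.
R1=HG001_1.fastq.gz
R2=HG001_2.fastq.gz
REF=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

# Stages 1-2: map with bwa-mem2 and sort the SAM stream with samtools.
map_sort="bwa-mem2 mem -t 16 $REF $R1 $R2 | samtools sort -@ 16 -o HG001.sorted.bam -"

# Stage 3: call variants with the DeepVariant image.
call="run_deepvariant --model_type=WGS --ref=$REF --reads=HG001.sorted.bam --output_vcf=HG001.vcf.gz"

# Dry run: print the commands instead of executing them.
printf '%s\n%s\n' "$map_sort" "$call"
```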


# Using Dockerfile (Single Node)
### 1. Download the code:

```bash
cd Open-Omics-Acceleration-Framework/pipelines/deepvariant-based-germline-variant-calling-fq2vcf
```
### 2. Build Docker images
```bash
docker build -f Dockerfile_part1 -t deepvariant:part1 .   # Part I: fq2bam
docker build -f Dockerfile_part2 -t deepvariant:part2 .   # Part II: bam2vcf
```
### 3. Set up input parameters through the ```config``` file<sup>1</sup>
```bash
INPUT_DIR=<readdir>   # location of the input reads files (default location /reads in docker)
OUTPUT_DIR=<outdir>   # location of the output files (default location /output in docker)
REF_DIR=<refdir>      # location of the reference sequence (default location /ref in docker)
R1=HG001_1.fastq.gz   # name of input reads file1
R2=HG001_2.fastq.gz   # name of input reads file2
REF=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna   # name of the reference sequence
```
Update the above fields in the `./config` file and in the `extra_scripts/config` file.

<sup>**1**</sup> **All fields are mandatory**
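Since all fields are mandatory, a small pre-flight check can catch a missing one before the containers start. The helper below is a hedged sketch, not part of the repository's scripts; it assumes the config file uses plain `KEY=value` shell syntax as shown above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: source the config file and verify that
# every mandatory field is non-empty before launching the pipeline.
check_config() {
  local cfg="$1" missing=0
  # shellcheck disable=SC1090
  source "$cfg"
  for field in INPUT_DIR OUTPUT_DIR REF_DIR R1 R2 REF; do
    if [ -z "${!field}" ]; then
      echo "missing mandatory field: $field"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: `check_config ./config || exit 1` before step 4.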

### 4. Run Docker images
```bash
docker run -v ./config:/Open-Omics-Acceleration-Framework/pipelines/deepvariant-based-germline-variant-calling-fq2vcf/scripts/aws/config -v <readdir>:/reads -v <refdir>:/ref -v <outdir>:/output -it deepvariant:part1 bash run_pipeline_ec2_part1.sh
docker run -v ./extra_scripts/config:/opt/deepvariant/config -v <refdir>:/ref -v <outdir>:/output -it deepvariant:part2 bash run_pipeline_ec2_part2.sh
```
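The two `docker run` lines can also be assembled from shell variables so the mount paths can be inspected before anything runs. This wrapper is an illustrative sketch, not shipped with the repository; the `/data/...` paths are placeholder values, and the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: build the two docker run commands from
# variables and print them instead of executing.
READDIR=/data/reads REFDIR=/data/ref OUTDIR=/data/output
CFG_DEST=/Open-Omics-Acceleration-Framework/pipelines/deepvariant-based-germline-variant-calling-fq2vcf/scripts/aws/config

part1="docker run -v ./config:$CFG_DEST -v $READDIR:/reads -v $REFDIR:/ref -v $OUTDIR:/output -it deepvariant:part1 bash run_pipeline_ec2_part1.sh"
part2="docker run -v ./extra_scripts/config:/opt/deepvariant/config -v $REFDIR:/ref -v $OUTDIR:/output -it deepvariant:part2 bash run_pipeline_ec2_part2.sh"

echo "$part1"
echo "$part2"
```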

### General Notes:
* The source code of bwa-mem2, samtools, and DeepVariant resides in ```Open-Omics-Acceleration-Framework/applications/```.
* We build bwa-mem2 and samtools from source, while for DeepVariant we build a docker image and then use it when running the pipeline. Note that the pre-built image is not available on Docker Hub; it must be built from source.
* We provide scripts for setting up a miniconda environment (setup_env.sh) and for compiling the required pipeline tools and building & loading the DeepVariant docker image (setup.py). These scripts are located at _Open-Omics-Acceleration-Framework/pipelines/deepvariant-based-germline-variant-calling-fq2vcf_. Note that all the scripts are, by default, written for docker.

* Prerequisites: our scripts assume the following packages are already installed on the system:
* Docker / Podman
* gcc >= 8.5.0
* MPI
* make >= 4.3
* autoconf >= 2.69
* zlib1g-dev
* libncurses5-dev
* libbz2-dev
* liblzma-dev
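A quick way to verify the versioned prerequisites (gcc, make, autoconf) is a `sort -V` comparison. The snippet below is a hedged sketch, not part of the repository's setup scripts; checking the library packages is left to your distribution's package manager.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (uses sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | tail -n1)" = "$1" ]
}

# Example: compare the installed gcc against the required minimum.
gcc_version=$(gcc -dumpversion 2>/dev/null || echo 0)
if version_ge "$gcc_version" 8.5.0; then
  echo "gcc $gcc_version is new enough"
else
  echo "gcc >= 8.5.0 required (found $gcc_version)"
fi
```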

# Instructions to run the pipeline on-prem (Single node & Multi-node)
### 1. Download the latest release:
```bash
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/2.1/Source_code_with_submodules.tar.gz
```

```bash
bash run_pipeline_cluster.sh [docker | podman | "sudo docker"]
```
Our latest results are published in this [blog](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-Xeon-is-all-you-need-for-AI-inference-Performance/post/1506083).


# Instructions to run the pipeline on an AWS ec2 instance (Single Node)
The following instructions run seamlessly on a standalone AWS ec2 instance. To run them, create an ec2 instance with Ubuntu-22.04, at least 60GB of memory, and 500GB of disk. The input reference sequence and the paired-end read datasets must be downloaded and stored on the disk.

### One-time setup
```bash
bash run_pipeline_ec2.sh
```


# Instructions to run the pipeline on an AWS ParallelCluster (Multi-node)

The following instructions run seamlessly on AWS ParallelCluster. To run them, first create a cluster as follows:
- Cluster setup: follow these steps to set up an [AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/v2/ug/what-is-aws-parallelcluster.html). See the example [config file](scripts/aws/pcluster_example_config) for pcluster setup (the _config_ file resides at ~/.parallelcluster/config/ on the local machine). Please note: for best performance, use a shared file system with Amazon EBS _volume\_type = io2_ and _volume\_iops = 64000_ in the config file.
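For reference, the EBS performance settings mentioned above live in the shared-storage section of a ParallelCluster v2 config. The fragment below is illustrative only: the section names and the `shared_dir` value are assumptions, and only `volume_type`/`volume_iops` come from the text above; consult `scripts/aws/pcluster_example_config` for the real file.

```ini
; Illustrative fragment of ~/.parallelcluster/config (ParallelCluster v2).
; Section names and shared_dir are assumed for this sketch.
[cluster default]
ebs_settings = shared

[ebs shared]
shared_dir = /shared
volume_type = io2
volume_iops = 64000
```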
