## AlphaFold2-based Protein Folding Pipeline

### Configuration Details

**BASELINE on m7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 384GB (1 slot/ 384GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, PyTorch - v1.12.1, JAX - v0.4.14, OpenFold - v 1.0.1, Hmmer - v3.3.2, hh-suite - v3.3.0, Kalign2 – v2.04, model name & version: AlphaFold2

**Open Omics on m7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 384GB (1 slot/ 384GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, workload version: Intel-python - 2022.1.0, JAX - v0.4.21, Open Omics Acceleration Framework v2.1, Open Omics AlphaFold2 - v1.0, IntelLabs Hmmer v1.0, IntelLabs hh-suite v1.0, Kalign2 – v2.04, framework version: PyTorch – v2.0.1, model name & version: AlphaFold2
### Step by step instructions to benchmark baseline and Open Omics Acceleration Framework

```sh
cd ~
git clone --recursive https://github.com/IntelLabs/Open-Omics-Acceleration-Framework.git
```
#### Baseline ([OpenFold](https://github.com/aqlaboratory/openfold))
EC2Instance: m7i.24xlarge\
Disk: 3.2TB(gp2)

#### Dataset Download
- The test dataset can be downloaded from https://www.uniprot.org/proteomes/UP000001940. Click 'Download' and select **Download only reviewed (Swiss-Prot) canonical proteins (4,463)**, Format: **FASTA**, Compressed: **No**. (A command-line alternative is sketched below.)

- Save the file as 'uniprotkb_proteome.fasta' inside the folder ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/
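If you prefer to script this step, the same reviewed subset can be fetched through the UniProt REST API. This is a sketch rather than part of the validated setup: the query parameters are an assumption, so check that the resulting file contains the expected 4,463 entries.

```sh
# Hypothetical command-line alternative to the web download described above.
cd ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/
wget -O uniprotkb_proteome.fasta \
  "https://rest.uniprot.org/uniprotkb/stream?query=proteome:UP000001940+AND+reviewed:true&format=fasta"
grep -c '^>' uniprotkb_proteome.fasta  # expect 4463 sequences
```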


#### Prepare Protein Dataset
```sh
# The script below generates two dataset folders: ~/celegans_samples (77 files) and ~/celegans_samples_long (18 files)
cd ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/
python3 proteome.py
```
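A quick sanity check that proteome.py produced the expected splits (counts taken from the comment above):

```sh
ls ~/celegans_samples | wc -l       # expect 77 FASTA files
ls ~/celegans_samples_long | wc -l  # expect 18 FASTA files
```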
#### Download Databases
```sh
mkdir -p ~/data
bash ~/Open-Omics-Acceleration-Framework/applications/alphafold/alphafold/scripts/download_all_data.sh ~/data/
```
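The database download is large, which is why the instance is provisioned with a 3.2TB disk. A quick way to confirm free space and see what was fetched:

```sh
df -h ~/data     # free space remaining on the data volume
du -sh ~/data/*  # per-directory footprint of the downloaded databases
```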


#### Download Model
```sh
mkdir -p ~/data/models
bash ~/Open-Omics-Acceleration-Framework/applications/alphafold/alphafold/scripts/download_alphafold_params.sh ~/data/models/
```
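A quick check that the weights landed where the --jax_param_path flag in the run command below expects them:

```sh
ls -lh ~/data/models/params/params_model_1.npz
```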

```sh
cd ~
git clone --recursive https://github.com/aqlaboratory/openfold.git
cd openfold
# Use mamba for a faster dependency solve
mamba env create -n openfold_env -f environment.yml
conda activate openfold_env
SAMPLES_DIR=$HOME/celegans_samples
# Run the inference script on the short samples
python3 run_pretrained_openfold.py \
$SAMPLES_DIR \
~/data/pdb_mmcif/mmcif_files/ \
--uniref90_database_path ~/data/uniref90/uniref90.fasta \
--mgnify_database_path ~/data/mgnify/mgy_clusters_2018_12.fa \
--pdb70_database_path ~/data/pdb70/pdb70 \
--uniclust30_database_path ~/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--bfd_database_path ~/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--jackhmmer_binary_path ~/miniconda3/envs/openfold_env/bin/jackhmmer \
--hhblits_binary_path ~/miniconda3/envs/openfold_env/bin/hhblits \
--hhsearch_binary_path ~/miniconda3/envs/openfold_env/bin/hhsearch \
--kalign_binary_path ~/miniconda3/envs/openfold_env/bin/kalign \
--config_preset "model_1" \
--model_device "cpu" \
--output_dir ./ \
--jax_param_path ~/data/models/params/params_model_1.npz \
--skip_relaxation \
--cpus 96
SAMPLES_DIR=$HOME/celegans_samples_long
python3 run_pretrained_openfold.py \
$SAMPLES_DIR \
~/data/pdb_mmcif/mmcif_files/ \
--uniref90_database_path ~/data/uniref90/uniref90.fasta \
--mgnify_database_path ~/data/mgnify/mgy_clusters_2018_12.fa \
--pdb70_database_path ~/data/pdb70/pdb70 \
--uniclust30_database_path ~/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--bfd_database_path ~/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--jackhmmer_binary_path ~/miniconda3/envs/openfold_env/bin/jackhmmer \
--hhblits_binary_path ~/miniconda3/envs/openfold_env/bin/hhblits \
--hhsearch_binary_path ~/miniconda3/envs/openfold_env/bin/hhsearch \
--kalign_binary_path ~/miniconda3/envs/openfold_env/bin/kalign \
--config_preset "model_1" \
--model_device "cpu" \
--output_dir ./ \
--jax_param_path ~/data/models/params/params_model_1.npz \
--skip_relaxation \
--long_sequence_inference \
--cpus 96
```
Note 1: If you are using a different instance type, adjust the --cpus flag to the number of available vCPUs.

Note 2: Use the --long_sequence_inference option when running long sequences, as in the second invocation above.
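If you script the run across instance types, one hedged way to follow Note 1 is to derive the value from nproc instead of hard-coding 96:

```sh
NCPUS=$(nproc)  # vCPUs visible to the OS on this instance
echo "Use --cpus $NCPUS with run_pretrained_openfold.py"
```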


#### Open Omics Acceleration Framework alphafold2-based-protein-folding pipeline


Follow [steps](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding) for more details:

EC2Instance: m7i.24xlarge, m7i.48xlarge\
Disk: 3.2TB(gp2)

Note: The same data disk can be shared across different instances.
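As a hedged sketch of the disk-sharing note above: after detaching the data volume and attaching it to the new instance in the AWS console, mount it at the expected path. The device name is an assumption, so confirm it with lsblk first.

```sh
lsblk                           # identify the attached data volume (device name varies)
mkdir -p ~/data
sudo mount /dev/nvme1n1 ~/data  # assumed device name for the shared database volume
```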
```sh
cd ~/Open-Omics-Acceleration-Framework/pipelines/alphafold2-based-protein-folding
docker build -t alphafold . # Build a docker image named alphafold
export DATA_DIR=~/data
export SAMPLES_DIR=$HOME/celegans_samples # or $HOME/celegans_samples_long
export OUTPUT_DIR=<path-to-output-directory>
export LOG_DIR=<path-to-log-directory>
docker run -it --cap-add SYS_NICE -v $DATA_DIR:/data \
  -v $SAMPLES_DIR:/samples \
  -v $OUTPUT_DIR:/output \
  -v $LOG_DIR:/logs \
  alphafold  # mount points above are an assumption; see the pipeline README linked above for the exact invocation
```

## DeepVariant-based germline variant calling pipeline (fq2vcf)

### Configuration Details

**BASELINE on c7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 192GB (1 slot/ 192GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, workload version: bwa-mem v0.7.17, Samtools v. 1.16.1, DeepVariant v1.5, framework version: Intel-tensorflow 2.11.0, model name & version: Inception V3

#### Dataset
```sh
sudo apt install awscli
mkdir -p ~/HG001
aws s3 cp s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R1.fastq.gz ~/HG001/
aws s3 cp s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R2.fastq.gz ~/HG001/
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta ~/HG001/
```
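A quick sanity check on the inputs before running the pipeline:

```sh
ls -lh ~/HG001/  # expect the two paired 30x FASTQ files and Homo_sapiens_assembly38.fasta
```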

#### Baseline

EC2Instance: c7i.24xlarge\
Disk: 500GB(gp2)

```sh
cd ~
cd ../../pipelines/deepvariant-based-germline-variant-calling-fq2vcf/
bash run_pipe_bwa.sh
```
#### Open Omics Acceleration Framework deepvariant-based-germline-variant-calling-fq2vcf pipeline
EC2Instance: c7i.24xlarge, c7i.48xlarge\
Disk: 500GB(gp2)

pcluster: 2 x c7i.48xlarge, 4 x c7i.48xlarge, 8 x c7i.48xlarge\
Disk: 500GB(io2)

To run on c7i.24xlarge or c7i.48xlarge, follow this [link](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/deepvariant-based-germline-variant-calling-fq2vcf#instructions-to-run-the-pipeline-on-an-aws-ec2-instance).

To run on 2 x c7i.48xlarge, 4 x c7i.48xlarge, or 8 x c7i.48xlarge, follow the pcluster instructions in the same [pipeline README](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/deepvariant-based-germline-variant-calling-fq2vcf).

## Single-cell RNA-seq analysis pipeline

### Step by step instructions to benchmark baseline and Open Omics Acceleration Framework

EC2Instance: r7i.24xlarge
#### Download Dataset
```sh
cd ~
mkdir -p data
wget -P ./data https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad
```
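A quick check that the dataset landed in the shared data folder:

```sh
ls -lh ~/data/1M_brain_cells_10X.sparse.h5ad
```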
#### Baseline ([rapids-single-cell-examples](https://github.com/NVIDIA-Genomics-Research/rapids-single-cell-examples))

```sh
cd ~
git clone https://github.com/NVIDIA-Genomics-Research/rapids-single-cell-examples.git
cd rapids-single-cell-examples
conda env create --name rapidgenomics -f conda/cpu_notebook_env.yml
conda activate rapidgenomics
ln -s ~/data/ data
cd notebooks
python -m ipykernel install --user --display-name "Python (rapidgenomics)"
```
Note: Open Jupyter Notebook, select the **1M_brain_cpu_analysis.ipynb** file, and run all cells.
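If a notebook server is not already running, a minimal way to start one from the notebooks directory (assuming Jupyter is provided by the rapidgenomics environment):

```sh
jupyter notebook --no-browser --port=8888  # open the printed URL, then pick the kernel installed above
```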

#### Open Omics Acceleration Framework single-cell-RNA-seq-analysis pipeline

EC2Instance: r7i.24xlarge and c7i.24xlarge
```sh
cd ~
git clone --recursive https://github.com/IntelLabs/Open-Omics-Acceleration-Framework.git
cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/
docker build -t scanpy . # Create a docker image named scanpy

# Link the dataset downloaded earlier into the pipeline directory
ln -s ~/data/ data
docker run -it -p 8888:8888 -v ~/data:/data scanpy # Run the container with the data folder mounted as a volume
```
Follow this [link](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/single-cell-RNA-seq-analysis#run-with-jupyter-notebook-interactive) to run the Open Omics Acceleration Framework single-cell-RNA-seq-analysis pipeline in interactive mode.
Note: Open Jupyter Notebook, select the **1.3_million_single_cell_analysis.ipynb** file, and run all cells.
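Since the container publishes port 8888, you can reach the notebook from a local browser over SSH port forwarding; a sketch, where the key path and instance address are placeholders:

```sh
ssh -i <your-key.pem> -L 8888:localhost:8888 ubuntu@<ec2-public-dns>  # then browse to http://localhost:8888
```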

# Cleanup
