Update README.md
ashish615 authored Mar 6, 2024
1 parent 48b2d33 commit 1dba972
Showing 1 changed file with 75 additions and 35 deletions.
110 changes: 75 additions & 35 deletions benchmarking/AWS-Intel-blog-v2.1-2024/README.md
@@ -17,9 +17,7 @@


## AlphaFold2-based Protein Folding Pipeline

### Configuration Details

**BASELINE on m7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 384GB (1 slot/ 384GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, PyTorch - v1.12.1, JAX - v0.4.14, OpenFold - v 1.0.1, Hmmer - v3.3.2, hh-suite - v3.3.0, Kalign2 – v2.04, model name & version: AlphaFold2

**Open Omics on m7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 384GB (1 slot/ 384GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, workload version: Intel-python - 2022.1.0, JAX - v0.4.21, Open Omics Acceleration Framework v2.1, Open Omics AlphaFold2 - v1.0, IntelLabs Hmmer v1.0, IntelLabs hh-suite v1.0, Kalign2 – v2.04, framework version: PyTorch – v2.0.1, model name & version: AlphaFold2
@@ -33,34 +31,65 @@
cd ~
git clone --recursive https://github.com/IntelLabs/Open-Omics-Acceleration-Framework.git
```
#### Baseline ([OpenFold](https://github.com/aqlaboratory/openfold))
EC2Instance: m7i.24xlarge\
Disk: 3.2TB(gp2)
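
Before downloading the databases, it may help to confirm that the data volume is large enough. The sketch below is not part of the repository: `free_gb` is a hypothetical helper, the checked directory defaults to `$HOME`, and the 3200GB threshold is simply the 3.2TB volume size listed above (the actual database footprint is not stated here).

```sh
# Hypothetical helper: print free space, in whole GB, on the filesystem holding $1.
free_gb() {
    # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is "Available".
    df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

required_gb=3200              # assumption: the 3.2TB volume size listed above
target_dir="${HOME:-/}"       # assumption: data will live under $HOME
avail_gb=$(free_gb "$target_dir")
if [ "$avail_gb" -lt "$required_gb" ]; then
    echo "Warning: ${avail_gb}GB free under $target_dir; ${required_gb}GB volume suggested" >&2
fi
```

If the data disk is mounted elsewhere, point `target_dir` at that mount instead.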

#### Dataset Download
- The test dataset can be downloaded from https://www.uniprot.org/proteomes/UP000001940. Click 'Download' and select **Download only reviewed (Swiss-Prot:) canonical proteins (4,463)**, Format: **Fasta**, and Compressed: **No**.

- Save the file as 'uniprotkb_proteome.fasta' in the folder ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/


#### Baseline ([OpenFold](https://github.com/aqlaboratory/openfold))
#### Prepare Protein Dataset
```sh
#The script below generates two dataset folders: ~/celegans_samples (77 files) and ~/celegans_samples_long (18 files)
cd ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/
python3 proteome.py
```
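
As a quick sanity check after `proteome.py` finishes, the file counts can be compared against the 77 and 18 files mentioned in the script comment. This is a minimal sketch; `count_files` is a hypothetical helper, and it assumes the folders contain only the generated sample files.

```sh
# Hypothetical helper: count regular files directly inside directory $1.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Each entry is "<folder under $HOME>:<expected file count>".
for entry in "celegans_samples:77" "celegans_samples_long:18"; do
    dir="$HOME/${entry%%:*}"
    expected="${entry##*:}"
    if [ -d "$dir" ]; then
        echo "$dir: $(count_files "$dir") files (expected $expected)"
    else
        echo "$dir: not found; run proteome.py first" >&2
    fi
done
```
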
#### Download Dataset
```sh
mkdir -p ~/data
bash ~/Open-Omics-Acceleration-Framework/applications/alphafold/alphafold/scripts/download_all_data.sh ~/data/
```

EC2Instance: m7i.24xlarge

#### Download Model
```sh
mkdir -p ~/data/models
bash ~/Open-Omics-Acceleration-Framework/applications/alphafold/alphafold/scripts/download_alphafold_params.sh ~/data/models/
cd ~/Open-Omics-Acceleration-Framework/
#The script below generates two dataset folders: ~/celegans_samples (77 files) and ~/celegans_samples_long (18 files)
cd ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/
python3 proteome.py
```

```sh
cd ~
git clone --recursive https://github.com/aqlaboratory/openfold.git
cd openfold
#use mamba for faster dependency solving
mamba env create -n openfold_env -f environment.yml
conda activate openfold_env
bash scripts/download_alphafold_dbs.sh ~/data/
fastadir=./celegans_samples
SAMPLES_DIR=$HOME/celegans_samples
#Run script
python3 run_pretrained_openfold.py \
$fasta_dir \
$SAMPLES_DIR \
~/data/pdb_mmcif/mmcif_files/ \
--uniref90_database_path ~/data/uniref90/uniref90.fasta \
--mgnify_database_path ~/data/mgnify/mgy_clusters_2018_12.fa \
--pdb70_database_path ~/data/pdb70/pdb70 \
--uniclust30_database_path ~/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--bfd_database_path ~/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--jackhmmer_binary_path ~/miniconda3/envs/openfold_env/bin/jackhmmer \
--hhblits_binary_path ~/miniconda3/envs/openfold_env/bin/hhblits \
--hhsearch_binary_path ~/miniconda3/envs/openfold_env/bin/hhsearch \
--kalign_binary_path ~/miniconda3/envs/openfold_env/bin/kalign \
--config_preset "model_1" \
--model_device "cpu" \
--output_dir ./ \
--jax_param_path ~/models/params/params_model_1.npz \
--skip_relaxation \
--cpus 96
SAMPLES_DIR=$HOME/celegans_samples_long
python3 run_pretrained_openfold.py \
$SAMPLES_DIR \
~/data/pdb_mmcif/mmcif_files/ \
--uniref90_database_path ~/data/uniref90/uniref90.fasta \
--mgnify_database_path ~/data/mgnify/mgy_clusters_2018_12.fa \
@@ -81,20 +110,16 @@ python3 run_pretrained_openfold.py \
```
Note 1: If you are using a different instance, set the --cpus flag to the number of available vCPUs.

Note 2: Use the --long_sequence_inference option when running long sequences.
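
For Note 1, the vCPU count does not have to be looked up by hand. A minimal sketch, assuming GNU coreutils `nproc` is installed (it is on stock Ubuntu); the `CPUS` variable name is only illustrative:

```sh
# nproc reports the number of processing units available to the current process.
CPUS=$(nproc)
echo "Pass this to run_pretrained_openfold.py: --cpus $CPUS"
```

On an m7i.24xlarge this should match the `--cpus 96` used in the commands above.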


#### Open Omics Acceleration Framework alphafold2-based-protein-folding pipeline

EC2Instance: m7i.24xlarge, m7i.48xlarge

Follow [steps](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding) for more details:

EC2Instance: m7i.24xlarge, m7i.48xlarge\
Disk: 3.2TB(gp2)\
Note: The same data disk can be shared across instances.
```sh
cd ~/Open-Omics-Acceleration-Framework/pipelines/alphafold2-based-protein-folding
docker build -t alphafold . # Build a docker image named alphafold
export DATA_DIR=~/data
export SAMPLES_DIR=<fasta_dir>
export SAMPLES_DIR=$HOME/celegans_samples # or $HOME/celegans_samples_long
export OUTPUT_DIR=<path-to-output-directory>
export LOG_DIR=<path-to-log-directory>
docker run -it --cap-add SYS_NICE -v $DATA_DIR:/data \
@@ -105,7 +130,6 @@ docker run -it --cap-add SYS_NICE -v $DATA_DIR:/data \
```
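
The `docker run` command mounts four host directories, so a missing or misspelled path is easy to overlook. The sketch below checks that each variable points to an existing directory before the container starts. `check_dirs` is a hypothetical helper, not something the repository provides.

```sh
# Hypothetical helper: succeed only if every named variable holds an existing directory.
check_dirs() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ] || [ ! -d "$value" ]; then
            echo "$name is unset or not a directory: '$value'" >&2
            status=1
        fi
    done
    return "$status"
}

check_dirs DATA_DIR SAMPLES_DIR OUTPUT_DIR LOG_DIR \
    || echo "Fix the paths above before running docker" >&2
```
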

## DeepVariant-based germline variant calling pipeline (fq2vcf)

### Configuration Details

**BASELINE on c7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 192GB (1 slot/ 192GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, workload version: bwa-mem v0.7.17, Samtools v. 1.16.1, DeepVariant v1.5, framework version: Intel-tensorflow 2.11.0, model name & version: Inception V3
@@ -119,16 +143,16 @@ docker run -it --cap-add SYS_NICE -v $DATA_DIR:/data \

#### Dataset
```sh
sudo apt install awscli
mkdir -p ~/HG001
wget https://genomics-benchmark-datasets.s3.amazonaws.com/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R1.fastq.gz -P ~/HG001/
wget https://genomics-benchmark-datasets.s3.amazonaws.com/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R2.fastq.gz -P ~/HG001/
wget https://broad-references.s3.amazonaws.com/hg38/v0/Homo_sapiens_assembly38.fasta -P ~/HG001/

aws s3 cp s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R1.fastq.gz ~/HG001/
aws s3 cp s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R2.fastq.gz ~/HG001/
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta ~/HG001/
```
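
Because the FASTQ files are large, an interrupted transfer can leave a truncated archive behind. One way to check, sketched below, is `gzip -t`, which decompresses each file in memory and reports corruption. `check_gz` is a hypothetical helper, and the test reads every file end to end, so it can take a while on 30x data.

```sh
# Hypothetical helper: report whether each given .gz file passes gzip's integrity test.
check_gz() {
    rc=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK       $f"
        else
            echo "CORRUPT  $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

for f in "$HOME"/HG001/*.fastq.gz; do
    # Skip the unexpanded glob pattern when no files are present.
    if [ -e "$f" ]; then
        check_gz "$f" || echo "Consider re-downloading $f" >&2
    fi
done
```
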

#### Baseline

EC2Instance: c7i.24xlarge
EC2Instance: c7i.24xlarge\
Disk: 500GB(gp2)

```sh
cd ~
@@ -173,10 +197,11 @@ cd ../../pipelines/deepvariant-based-germline-variant-calling-fq2vcf/
bash run_pipe_bwa.sh
```
#### Open Omics Acceleration Framework deepvariant-based-germline-variant-calling-fq2vcf pipeline
EC2Instance: c7i.24xlarge, c7i.48xlarge\
Disk: 500GB(gp2)

EC2Instance: c7i.24xlarge, c7i.48xlarge

pcluster: 2 x c7i.48xlarge, 4 x c7i.48xlarge, 8 x c7i.48xlarge,
pcluster: 2 x c7i.48xlarge, 4 x c7i.48xlarge, 8 x c7i.48xlarge\
Disk: 500GB(io2)

To run on c7i.24xlarge, c7i.48xlarge follow [link](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/deepvariant-based-germline-variant-calling-fq2vcf#instructions-to-run-the-pipeline-on-an-aws-ec2-instance).

@@ -196,27 +221,42 @@ To run on 2 x c7i.48xlarge, 4 x c7i.48xlarge, 8 x c7i.48xlarge follow [link](htt

### Step by step instructions to benchmark baseline and Open Omics Acceleration Framework

#### Baseline ([rapids-single-cell-examples](https://github.com/NVIDIA-Genomics-Research/rapids-single-cell-examples))

EC2Instance: r7i.24xlarge
#### Download Dataset
```sh
cd ~
mkdir -p data
wget -P ./data https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad
```
#### Baseline ([rapids-single-cell-examples](https://github.com/NVIDIA-Genomics-Research/rapids-single-cell-examples))

```sh
cd ~
git clone https://github.com/NVIDIA-Genomics-Research/rapids-single-cell-examples.git
cd rapids-single-cell-examples
conda env create --name rapidgenomics -f conda/cpu_notebook_env.yml
conda activate rapidgenomics
mkdir data
wget -P ./data https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad
ln -s ~/data/ data
cd notebooks
python -m ipykernel install --user --display-name "Python (rapidgenomics)"
```
Note: Open Jupyter Notebook, select the **1M_brain_cpu_analysis.ipynb** file, and run all cells.

#### Open Omics Acceleration Framework single-cell-RNA-seq-analysis pipeline

EC2Instance: r7i.24xlarge and c7i.24xlarge
EC2Instance: r7i.24xlarge and c7i.24xlarge
```sh
cd ~
git clone --recursive https://github.com/IntelLabs/Open-Omics-Acceleration-Framework.git
cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/
docker build -t scanpy . # Create a docker image named scanpy

Follow [link](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/single-cell-RNA-seq-analysis#run-with-jupyter-notebook-interactive) to run Open Omics Acceleration Framework single-cell-RNA-seq-analysis pipeline using interactive mode.
# Download dataset
cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/
ln -s ~/data/ data
docker run -it -p 8888:8888 -v ~/data:/data scanpy # run docker container with the data folder as volume
```
Note: Open Jupyter Notebook, select the **1.3_million_single_cell_analysis.ipynb** file, and run all cells.

# Cleanup
