diff --git a/benchmarking/AWS-Intel-blog-v2.1-2024/README.md b/benchmarking/AWS-Intel-blog-v2.1-2024/README.md
index 1b8b616..6ff322c 100644
--- a/benchmarking/AWS-Intel-blog-v2.1-2024/README.md
+++ b/benchmarking/AWS-Intel-blog-v2.1-2024/README.md
@@ -17,9 +17,7 @@
 ## AlphaFold2-based Protein Folding Pipeline
-
 ### Configuration Details
-
 **BASELINE on m7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 384GB (1 slot/ 384GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, PyTorch - v1.12.1, JAX - v0.4.14, OpenFold - v 1.0.1, Hmmer - v3.3.2, hh-suite - v3.3.0, Kalign2 – v2.04, model name & version: AlphaFold2
 **Open Omics on m7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 384GB (1 slot/ 384GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, workload version: Intel-python - 2022.1.0, JAX - v0.4.21, Open Omics Acceleration Framework v2.1, Open Omics AlphaFold2, - v1.0, IntelLabs Hmmer v1.0, IntelLabs hh-suite v1.0), Kalign2 – v2.04, framework version: PyTorch – v2.0.1, model name & version: AlphaFold2
@@ -33,34 +31,65 @@ cd ~
 git clone --recursive https://github.com/IntelLabs/Open-Omics-Acceleration-Framework.git
 ```
+#### Baseline ([OpenFold](https://github.com/aqlaboratory/openfold))
+EC2Instance: m7i.24xlarge\
+Disk: 3.2TB(gp2)
+#### Dataset Download
 - Test dataset can be downloaded from https://www.uniprot.org/proteomes/UP000001940. Click on 'Download' and select options **Download only reviewed (Swiss-Prot:) canonical proteins (4,463)**, Format: **Fasta** and Compressed: **No**.
 - Save the file as 'uniprotkb_proteome.fasta' inside folder ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/
-#### Baseline ([OpenFold](https://github.com/aqlaboratory/openfold))
+#### Prepare Protein Dataset
+```sh
+#The script below generates two dataset folders: ~/celegans_samples (77 files) and ~/celegans_samples_long (18 files)
+cd ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/
+python3 proteome.py
+```
+#### Download Dataset
+```sh
+mkdir -p ~/data
+bash ~/Open-Omics-Acceleration-Framework/applications/alphafold/alphafold/scripts/download_all_data.sh ~/data/
+```
-EC2Instance: m7i.24xlarge
+#### Download Model
 ```sh
 mkdir -p ~/data/models
 bash ~/Open-Omics-Acceleration-Framework/applications/alphafold/alphafold/scripts/download_alphafold_params.sh ~/data/models/
-cd ~/Open-Omics-Acceleration-Framework/
-#Below script generate dataset folder: ~/celegans_samples contains 77 files and ~/celegans_samples_long contain 18 files
-cd ~/Open-Omics-Acceleration-Framework/benchmarking/AWS-Intel-blog-v2.1-2024/
-python3 proteome.py
+```
+
+```sh
 cd ~
 git clone --recursive https://github.com/aqlaboratory/openfold.git
 cd openfold
 #use mamba for faster depedency solve
 mamba env create -n openfold_env -f environment.yml
 conda activate openfold_env
-bash scripts/download_alphafold_dbs.sh ~/data/
-fastadir=./celegans_samples
+SAMPLES_DIR=$HOME/celegans_samples
 #Run script
 python3 run_pretrained_openfold.py \
-    $fasta_dir \
+    $SAMPLES_DIR \
+    ~/data/pdb_mmcif/mmcif_files/ \
+    --uniref90_database_path ~/data/uniref90/uniref90.fasta \
+    --mgnify_database_path ~/data/mgnify/mgy_clusters_2018_12.fa \
+    --pdb70_database_path ~/data/pdb70/pdb70 \
+    --uniclust30_database_path ~/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
+    --bfd_database_path ~/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
+    --jackhmmer_binary_path ~/miniconda3/envs/openfold_venv/bin/jackhmmer \
+    --hhblits_binary_path ~/miniconda3/envs/openfold_venv/bin/hhblits \
+    --hhsearch_binary_path ~/miniconda3/envs/openfold_venv/bin/hhsearch \
+    --kalign_binary_path ~/miniconda3/envs/openfold_venv/bin/kalign \
+    --config_preset "model_1" \
+    --model_device "cpu" \
+    --output_dir ./ \
+    --jax_param_path ~/models/params/params_model_1.npz \
+    --skip_relaxation \
+    --cpus 96
+SAMPLES_DIR=$HOME/celegans_samples_long
+python3 run_pretrained_openfold.py \
+    $SAMPLES_DIR \
     ~/data/pdb_mmcif/mmcif_files/ \
     --uniref90_database_path ~/data/uniref90/uniref90.fasta \
     --mgnify_database_path ~/data/mgnify/mgy_clusters_2018_12.fa \
@@ -81,20 +110,16 @@ python3 run_pretrained_openfold.py \
 ```
 Note 1: If you are using a different instance, modify --cpus flag to available vcpus.
-Note 2: Use --long_sequence_inference option if you are running long sequences.
-
-
 #### Open Omics Acceleration Framework alphafold2-based-protein-folding pipeline
-EC2Instance: m7i.24xlarge, m7i.48xlarge
-
-Follow [steps](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding) for more details:
-
+EC2Instance: m7i.24xlarge, m7i.48xlarge\
+Disk: 3.2TB(gp2)\
+#Note: The same data disk can be shared across different instances.
 ```sh
 cd ~/Open-Omics-Acceleration-Framework/pipelines/alphafold2-based-protein-folding
 docker build -t alphafold . # Build a docker image named alphafold
 export DATA_DIR=~/data
-export SAMPLES_DIR=
+export SAMPLES_DIR=$HOME/celegans_samples # or $HOME/celegans_samples_long
 export OUTPUT_DIR=
 export LOG_DIR=
 docker run -it --cap-add SYS_NICE -v $DATA_DIR:/data \
@@ -105,7 +130,6 @@ docker run -it --cap-add SYS_NICE -v $DATA_DIR:/data \
 ```
 ## DeepVariant-based germline variant calling pipeline (fq2vcf)
-
 ### Configuration Details
 **BASELINE on c7i.24xlarge**: Test by Intel as of <11/30/23>. 1 instance, 1-socket, 1x Intel® Xeon® Platinum 8488C, 48 cores, HT On, Turbo On, Total Memory 192GB (1 slot/ 192GB/ DDR5 4800 MT/s), bios: Amazon EC2 v 1.0, ucode version: 0x2b0004b1, OS Version: Ubuntu 22.04.3 LTS, kernel version: 6.2.0-1017-aws, compiler version: g++ 11.4.0, workload version: bwa-mem v0.7.17, Samtools v. 1.16.1, DeepVariant v1.5, framework version: Intel-tensorflow 2.11.0, model name & version: Inception V3
@@ -119,16 +143,16 @@ docker run -it --cap-add SYS_NICE -v $DATA_DIR:/data \
 #### Dataset
 ```sh
+sudo apt install awscli
 mkdir -p ~/HG001
-wget https://genomics-benchmark-datasets.s3.amazonaws.com/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R1.fastq.gz -P ~/HG001/
-wget https://genomics-benchmark-datasets.s3.amazonaws.com/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R2.fastq.gz -P ~/HG001/
-wget https://broad-references.s3.amazonaws.com/hg38/v0/Homo_sapiens_assembly38.fasta -P ~/HG001/
-
+aws s3 cp s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R1.fastq.gz ~/HG001/
+aws s3 cp s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R2.fastq.gz ~/HG001/
+aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta ~/HG001/
 ```
 #### Baseline
-
-EC2Instance: c7i.24xlarge
+EC2Instance: c7i.24xlarge\
+Disk: 500GB(gp2)
 ```sh
 cd ~
@@ -173,10 +197,11 @@ cd ../../pipelines/deepvariant-based-germline-variant-calling-fq2vcf/
 bash run_pipe_bwa.sh
 ```
 #### Open Omics Acceleration Framework deepvariant-based-germline-variant-calling-fq2vcf pipeline
+EC2Instance: c7i.24xlarge, c7i.48xlarge\
+Disk: 500GB(gp2)
-EC2Instance: c7i.24xlarge, c7i.48xlarge
-
-pcluster: 2 x c7i.48xlarge, 4 x c7i.48xlarge, 8 x c7i.48xlarge,
+pcluster: 2 x c7i.48xlarge, 4 x c7i.48xlarge, 8 x c7i.48xlarge\
+Disk: 500GB(io2)
 To run on c7i.24xlarge, c7i.48xlarge follow [link](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/deepvariant-based-germline-variant-calling-fq2vcf#instructions-to-run-the-pipeline-on-an-aws-ec2-instance).
@@ -196,17 +221,22 @@ To run on 2 x c7i.48xlarge, 4 x c7i.48xlarge, 8 x c7i.48xlarge follow [link](htt
 ### Step by step instructions to benchmark baseline and Open Omics Acceleration Framework
-#### Baseline ([rapids-single-cell-examples](https://github.com/NVIDIA-Genomics-Research/rapids-single-cell-example))
-
 EC2Instance: r7i.24xlarge
+#### Download Dataset
+```sh
+cd ~
+mkdir -p data
+wget -P ./data https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad
+```
+#### Baseline ([rapids-single-cell-examples](https://github.com/NVIDIA-Genomics-Research/rapids-single-cell-example))
 ```sh
+cd ~
 git clone https://github.com/NVIDIA-Genomics-Research/rapids-single-cell-examples.git
 cd rapids-single-cell-examples
 conda env create --name rapidgenomics -f conda/cpu_notebook_env.yml
 conda activate rapidgenomics
-mkdir data
-wget -P ./data https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad
+ln -s ~/data/ data
 cd notebooks
 python -m ipykernel install --user --display-name "Python (rapidgenomics)"
 ```
@@ -214,9 +244,19 @@ Note: Open Jupyter notebook and Select **1M_brain_cpu_analysis.ipynb** file and
 #### Open Omics Acceleration Framework single-cell-RNA-seq-analysis pipeline
-EC2Instance: r7i.24xlarge and c7i.24xlarge
+EC2Instance: r7i.24xlarge and c7i.24xlarge
+```sh
+cd ~
+git clone --recursive https://github.com/IntelLabs/Open-Omics-Acceleration-Framework.git
+cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/
+docker build -t scanpy . # Create a docker image named scanpy
-Follow [link](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/single-cell-RNA-seq-analysis#run-with-jupyter-notebook-interactive) to run Open Omics Accelaration Framework single-cell-RNA-seq-analysis pipeline using interactive mode.
+# Download dataset
+cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/
+ln -s ~/data/ data
+docker run -it -p 8888:8888 -v ~/data:/data scanpy # run docker container with the data folder as volume
+```
+Note: Open the Jupyter notebook, select the **1.3_million_single_cell_analysis.ipynb** file, and run all cells.
 # Cleanup