df-parallel

This repo demonstrates how to setup CONDA environments for popular Dataframe libraries and process large tabular data files.

It compares parallel and out-of-core (data that are too large to fit into the computer's memory) reading and processing of large datasets on CPU and GPU.

Dataframe Library	Parallel	Out-of-core	CPU/GPU	Evaluation
Pandas	no	no [1]	CPU	eager
Dask	yes	yes	CPU	lazy
Spark	yes	yes	CPU	lazy
cuDF	yes	no	GPU	eager
Dask-cuDF	yes	yes	GPU	lazy

[1] Pandas can read data in chunks, but they have to be processed independently.

Running Jupyter Lab locally (CPU only)

Prerequisites: Miniconda3 (light-weight, preferred) or Anaconda3 and Mamba

Install Miniconda3
Install Mamba: conda install mamba -n base -c conda-forge

Clone this git repository

git clone https://github.com/sbl-sdsc/df-parallel.git

Create CONDA environment

mamba env create -f df-parallel/environment.yml

Activate the CONDA environment

conda activate df-parallel

Launch Jupyter Lab

jupyter lab

Deactivate the CONDA environment

conda deactivate

To remove the CONDA environment, run conda env remove -n df-parallel

Running Jupyter Lab on SDSC Expanse

To launch Jupyter Lab on Expanse, use the galyleo script. Specify your ACCESS account number with the --account option. If you do not have an ACCESS acount and allocation on Expanse, you can apply through NSF’s ACCESS program or for a trial allocation, contact consult@sdsc.edu.

Clone this git repository

git clone https://github.com/sbl-sdsc/df-parallel.git

2a. Run on CPU (Pandas, Dask, and Spark dataframes):

galyleo launch --account <account_number> --partition shared --cpus 10 --memory 20 --time-limit 00:30:00 --conda-env df-parallel --conda-yml "${HOME}/df-parallel/environment.yml" --mamba

2b. Run on GPU (required for cuDF and Dask-cuDF dataframes):

galyleo launch --account <account_number> --partition gpu-shared --cpus 10 --memory 92 --gpus 1 --time-limit 00:30:00 --conda-env df-parallel-gpu --conda-yml "${HOME}/df-parallel/environment-gpu.yml" --mamba

Running the example notebooks

After Jupyter Lab has been launched, run the Notebook DownloadData.ipynb to create a dataset. In this notebook, specify the number of copies (ncopies) to be made from the orignal dataset to increase its size. By default, a single copy is created. After the dataset has been created, run the dataframe specific notebooks. Note, the cuDF and Dask-cuDF dataframe libraries require a GPU.

Test results (not representative)

Results for running on SDSC Expanse GPU node with 10 CPU cores (Intel Xeon Gold 6248 2.5 GHz), 1 GPU (NVIDIA V100 SMX2, 32GB), and 92 GB of memory (DDR4 DRAM), local storage (1.6 TB Samsung PM1745b NVMe PCIe SSD).

Datafile size (gene_info.tsv as of June 2022):

Dataset 1: 5.4 GB (18 GB in Pandas)
Dataset 2: 21.4 GB (4 x Dataset 1) (62.4 GB in Pandas)
Dataset 3: 43.7 GB (8 x Dataset 1)

Dataframe Library	time(5.4 GB) (s)	time(21.4 GB) (s)	time(43.7 GB) (s)	Parallel	Out-of-core	CPU/GPU
Pandas	56.3	222.4	-- [2]	no	no	CPU
Dask	15.7	42.1	121.8	yes	yes	CPU
Spark	14.2	31.2	56.5	yes	yes	CPU
cuDF	3.2	-- [2]	-- [2]	yes	no	GPU
Dask-cuDF	7.3	11.9	19.0	yes	yes	GPU

[2] out of memory

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

df-parallel

Running Jupyter Lab locally (CPU only)

Running Jupyter Lab on SDSC Expanse

Running the example notebooks

Test results (not representative)

Files

README.md

Latest commit

History

README.md

File metadata and controls

df-parallel

Running Jupyter Lab locally (CPU only)

Running Jupyter Lab on SDSC Expanse

Running the example notebooks

Test results (not representative)