This repository demonstrates how to build a performant cloud-native PyTorch dataloader using Zarr, Dask, Xarray, and Xbatcher.
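For context, the core pattern looks roughly like the sketch below: open a cloud-hosted Zarr store lazily with Xarray/Dask, slice it into patches with Xbatcher, and expose those patches as a PyTorch `Dataset`. This is a minimal sketch under stated assumptions, not the repository's actual `main.py`; the store path, variable name, and patch sizes are hypothetical placeholders.

```python
# Minimal sketch (not the repo's exact implementation) of wiring
# Zarr + Xarray + Xbatcher into a PyTorch Dataset.
import torch
import xarray as xr
import xbatcher


class XBatcherDataset(torch.utils.data.Dataset):
    """Wrap an xbatcher.BatchGenerator so PyTorch can index patches."""

    def __init__(self, ds: xr.Dataset, input_dims: dict):
        self.bgen = xbatcher.BatchGenerator(ds, input_dims=input_dims)

    def __len__(self):
        return len(self.bgen)

    def __getitem__(self, idx):
        batch = self.bgen[idx]  # an xarray.Dataset holding one patch
        # .values triggers the Dask compute and returns a numpy array,
        # which torch.as_tensor wraps without an extra copy when possible.
        return torch.as_tensor(batch["precipitation"].values)  # hypothetical variable


# Hypothetical cloud-hosted Zarr store, opened lazily with Dask-backed arrays.
ds = xr.open_zarr("s3://example-bucket/era5.zarr", chunks={})
dataset = XBatcherDataset(ds, input_dims={"time": 4, "latitude": 128, "longitude": 128})
```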
The demo can be run from the command line as follows:
```bash
# cli help
python main.py --help

# example 1
python main.py \
    --batch-size 4 \
    --num-epochs 3 \
    --num-batches 500 \
    --shuffle \
    --source arraylake > logs-blog/fig2-log.txt

# example 2
python main.py \
    --batch-size 4 \
    --num-epochs 3 \
    --num-batches 500 \
    --num-workers 32 \
    --persistent-workers \
    --dask-threads 4 \
    --shuffle \
    --prefetch-factor 3 \
    --source arraylake > logs-blog/fig3-log.txt
```
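The flags above correspond to standard `torch.utils.data.DataLoader` and Dask settings. The sketch below shows one plausible mapping; it is an assumption rather than the actual contents of `main.py`, and it reuses the hypothetical `XBatcherDataset` from the earlier sketch.

```python
# Sketch of how the CLI flags might map onto DataLoader / Dask settings.
import dask
from torch.utils.data import DataLoader

# --dask-threads: limit Dask's threaded scheduler to 4 threads
# (one plausible reading of this flag).
dask.config.set(scheduler="threads", num_workers=4)

loader = DataLoader(
    dataset,
    batch_size=4,             # --batch-size
    shuffle=True,             # --shuffle
    num_workers=32,           # --num-workers (parallel loader processes)
    persistent_workers=True,  # --persistent-workers (keep workers alive across epochs)
    prefetch_factor=3,        # --prefetch-factor (batches prefetched per worker)
)

for epoch in range(3):        # --num-epochs
    for i, batch in enumerate(loader):
        if i >= 500:          # --num-batches per epoch
            break
        ...                   # training / timing step
```

The second example differs from the first mainly in these worker, prefetch, and Dask-thread settings, which appear to be what the two log files are intended to compare.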
The output of each run is a log file (e.g. `logs-blog/fig3-log.txt`) that can be visualized using the Jupyter notebook (`plot.ipynb`).
This work began as a collaboration between Earthmover and Zeus AI. It leverages a number of open-source projects, including Zarr, Dask, Xarray, and Xbatcher, all of which have been supported by grants from NSF, NASA, and CZI.
As of February 2024, some of the improvements discovered in this project are being upstreamed into Xbatcher.
This project is licensed under the MIT License.