DeepPool Artifact

Instructions on how to run the VGG example

Ensure you have NVIDIA Docker available on your system. Then download and run the PyTorch container:

docker run --gpus all --network="host" -it --rm nvcr.io/nvidia/pytorch:22.01-py3

In the container, clone the DeepPool repo:

git clone https://github.com/joshuafried/DeepPool-Artifact

Enter the directory and build DeepPool:

cd DeepPool-Artifact
bash build.sh

Now you can launch the DeepPool cluster coordinator as a background job:

python3 cluster.py --addrToBind 0.0.0.0:12347 --c10dBackend nccl --be_batch_size=0 --cpp --logdir=$PWD &

Once you see "Now, cluster is ready to accept training jobs." you may launch a job. For example, to run VGG across 8 GPUs in DataParallel mode with global batch size 32, run:
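If you script these steps instead of watching the terminal, the readiness check can be automated. The following is a minimal sketch, not part of DeepPool: the helper name `wait_for_ready` is hypothetical, and it assumes the coordinator writes the readiness message to a stream you can read line by line.

```python
# Hypothetical helper (not part of the DeepPool repo): scan a stream of
# text lines for the coordinator's readiness message quoted above.
READY_MARKER = "Now, cluster is ready to accept training jobs."


def wait_for_ready(lines, marker=READY_MARKER):
    """Return True once a line containing `marker` is seen.

    `lines` is any iterable of strings, for example an open log file or
    the stdout pipe of a subprocess. Returns False if the iterable is
    exhausted without the marker appearing.
    """
    for line in lines:
        if marker in line:
            return True
    return False
```

One way to use it is to start cluster.py with `subprocess.Popen(..., stdout=subprocess.PIPE, text=True)` and pass `proc.stdout` to `wait_for_ready`. Keep draining the pipe afterwards so the coordinator does not block on a full buffer.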

python3 examples/vgg.py 8 32 DP 0

To run VGG in BurstParallel mode with an amplification limit of 5.0:

python3 examples/vgg.py 8 32 5.0 0

To view the results of the run, inspect the contents of cpprt0.out:

tail cpprt0.out

When a job completes, you will see a line of output reporting the iteration count, time per iteration, and throughput, such as:

A training job vgg16_8_32_2.0_DP is completed (1800 iters, 13.57 ms/iter, 73.71 iter/s, 0.00 be img/s, 32 globalBatchSize).
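To compare runs, it can help to pull the numbers out of that line programmatically. The sketch below is an illustration only: the regular expression is written against the single sample line shown above, so the exact format is an assumption, and `parse_completion` is a hypothetical helper rather than part of DeepPool.

```python
import re

# Pattern derived from the one sample completion line in this README;
# the real log format may differ in other configurations (assumption).
COMPLETION_RE = re.compile(
    r"A training job (?P<name>\S+) is completed "
    r"\((?P<iters>\d+) iters, "
    r"(?P<ms_per_iter>[\d.]+) ms/iter, "
    r"(?P<iter_per_s>[\d.]+) iter/s, "
    r"(?P<be_img_per_s>[\d.]+) be img/s, "
    r"(?P<global_batch>\d+) globalBatchSize\)"
)


def parse_completion(line):
    """Return the job's metrics as a dict, or None if the line does not match."""
    m = COMPLETION_RE.search(line)
    if m is None:
        return None
    stats = {
        "name": m["name"],
        "iters": int(m["iters"]),
        "ms_per_iter": float(m["ms_per_iter"]),
        "iter_per_s": float(m["iter_per_s"]),
        "be_img_per_s": float(m["be_img_per_s"]),
        "global_batch": int(m["global_batch"]),
    }
    # Derived metric: foreground images per second is iterations per
    # second multiplied by the global batch size.
    stats["fg_img_per_s"] = stats["iter_per_s"] * stats["global_batch"]
    return stats
```

Applied to each line of cpprt0.out, this returns a dictionary for completion lines and `None` for everything else.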

To kill the cluster, run:

pkill runtime

Now re-run VGG with a background training job:

python3 examples/vgg_be.py
python3 cluster.py --addrToBind 0.0.0.0:12347 --c10dBackend nccl --be_batch_size=8 --cpp --logdir=$PWD --be_jit_file=vgg.jit --sample_per_kernel=8 &

Once the cluster is running:

python3 examples/vgg.py 8 32 DP 1
python3 examples/vgg.py 8 32 5.0 1
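The vgg.py invocations in this README share one positional layout: GPU count, global batch size, then either `DP` or a BurstParallel amplification limit, then a trailing 0 or 1. The sketch below builds such a command line for scripted sweeps. The function `vgg_command` is hypothetical, and reading the last argument as a background-job flag is an assumption inferred from the runs above (0 without a background job, 1 with one), not something the README states.

```python
def vgg_command(num_gpus, global_batch, mode, be_flag):
    """Build the argv list for examples/vgg.py.

    mode    -- "DP" for DataParallel, or a number used as the
               BurstParallel amplification limit (e.g. 5.0).
    be_flag -- 0 or 1; assumed to indicate whether a background
               training job is active (inferred, not documented).
    """
    if mode != "DP":
        # Normalize numeric limits to the decimal form used in the README.
        mode = str(float(mode))
    return [
        "python3",
        "examples/vgg.py",
        str(num_gpus),
        str(global_batch),
        mode,
        str(be_flag),
    ]
```

The returned list can be passed to `subprocess.run` from the repository root once the cluster coordinator is ready.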
