Multi-GPU support for unsupervised learning (#207)
* Cleanup device throughout the codebase
  1. PCALoss default device changed to cuda:LOCAL_RANK for multi-gpu
  2. Temporal loss device changed to use device of input arguments for multi-gpu
  3. Like above, generally functions should prefer to use the device of the input arguments.
  4. Removed _TORCH_DEVICE for the most part.
  5. Removed LightningModule parent class of Loss
  6. Fix test_train chdir causing subsequent tests to fail on file not found errors.
* Multi-GPU support for unsupervised learning
* Update docs for unsupervised multi-GPU
* Cleanup omegaconf.create
* sort imports
* PR comments
* batch size division ceiling
* context batch size adjustment
* update docs
* add doc file
* fold pytest.ini into setup.cfg and update docs

.. _multi_gpu_training:

###################
Multi-GPU Training
###################

Multi-GPU training allows you to distribute the load of model training across GPUs.
This helps overcome out-of-memory errors, in addition to accelerating training.

To use this feature, set :ref:`num_gpus <config_num_gpus>` in your config file.

How to choose batch_size
========================

Multi-GPU training distributes batches across multiple GPUs in a way that maintains the same
effective batch size as if you ran on 1 GPU. **Thus, if you reduced the batch size in order to
make your model fit on one GPU, you should increase it back to your desired effective batch
size.**

The batch size configuration parameters that this applies to are ``training.train_batch_size`` and
``training.val_batch_size`` for labeled frames, and ``dali.base.train.sequence_length`` and
``dali.context.train.batch_size`` for unlabeled video frames. Test batch sizes are not relevant
here, as testing only occurs on one GPU.

Calculation of per-GPU batch size
---------------------------------

Given the above, you need not worry about how lightning-pose calculates the per-GPU batch size,
but it is documented here for transparency.

In general, the per-GPU batch size will be:

.. code-block:: python

    ceil(batch_size / num_gpus)

The exception to this is the unlabeled per-GPU batch size for context models (``heatmap_mhcrnn``):

.. code-block:: python

    ceil((batch_size - 4) / num_gpus) + 4

The adjusted calculation for the unlabeled batch size of context models maintains the same
single-GPU effective batch size by accounting for the 4 context frames that are loaded with each
training frame.
For example, if you specified ``dali.context.train.batch_size=16``, then your effective batch size
was 16 - 4 = 12.
To maintain 12 with 2 GPUs, each GPU will load 6 frames + 4 context frames, for a per-GPU batch
size of 10.
This is larger than simply dividing the original batch size of 16 across 2 GPUs.
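
For reference, here is a minimal Python sketch of the two formulas above. It only illustrates
the arithmetic; the helper names are hypothetical and are not the actual lightning-pose
implementation:

.. code-block:: python

    import math

    def per_gpu_batch_size(batch_size: int, num_gpus: int) -> int:
        # Standard case: split the effective batch size across GPUs, rounding up.
        return math.ceil(batch_size / num_gpus)

    def per_gpu_context_batch_size(batch_size: int, num_gpus: int, num_context: int = 4) -> int:
        # Context models: only (batch_size - 4) frames count toward the effective
        # batch size; each GPU still loads the 4 context frames on top of its share.
        return math.ceil((batch_size - num_context) / num_gpus) + num_context

    # Example from above: dali.context.train.batch_size=16 on 2 GPUs -> 10 frames per GPU.
    assert per_gpu_batch_size(16, 2) == 8
    assert per_gpu_context_batch_size(16, 2) == 10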

.. _execution_model:

Execution model
===============

.. warning::

    The implementation spawns ``num_gpus - 1`` processes of the same command originally executed,
    repeating all of the command's execution per process.
    Thus it is advised to only run multi-GPU training in a dedicated training script
    (``scripts/train_hydra.py``). If you use lightning-pose as part of a custom script and don't
    want your entire script to run once per GPU, your script should run ``scripts/train_hydra.py``
    rather than directly calling the ``train`` method (see the sketch below).
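
For example, a custom Python script can launch the training script as a subprocess instead of
importing and calling ``train`` directly. This is a minimal sketch; any config overrides you
append to the command depend on your own setup:

.. code-block:: python

    import subprocess

    # Run multi-GPU training in its own process so that only scripts/train_hydra.py
    # (and not this entire custom script) is re-executed once per GPU.
    subprocess.run(["python", "scripts/train_hydra.py"], check=True)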

Tensorboard metric calculation
==============================

All metrics can be interpreted the same way as with a single GPU.
The metrics are the average value across the GPUs.

Specifying the GPUs to run on
=============================

Use the environment variable ``CUDA_VISIBLE_DEVICES`` if you want lightning-pose to run on certain
GPUs. For example, to train on only the first two GPUs on your machine:

.. code-block:: bash

    CUDA_VISIBLE_DEVICES=0,1 python scripts/train_hydra.py