Adding mixed precision training support #179
base: master
Conversation
…40% on GPUs with tensor cores
Please, could any maintainer of this repo help review this?
@vinhngx This seems interesting. How much gain in training time were you able to achieve by using mixed precision training?
@vinhngx Have you also prepared a script with mixed precision support for running inference?
@ghost Though the training is done in FP16, the weights are stored as FP32, so there is no change required when doing inference (in FP32).
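A quick way to see this (a sketch only, assuming APEX is installed and a GPU is available): with opt_level="O1" the parameters stored on the model stay in FP32, so the usual FP32 inference code needs no changes. The toy model here is a placeholder, not this repo's network.

```python
import torch
from apex import amp

# Placeholder model and optimizer, used only to inspect parameter dtypes.
model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Under O1 the stored weights remain FP32; selected ops are cast to FP16
# on the fly during training only.
print(next(model.parameters()).dtype)  # torch.float32
```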
@vinhngx I am using this script for mixed precision training (on multiple GPUs). However, it doesn't work well with O2 and O3; see #227. For O1, my GPU memory usage increases compared to FP32 training, and training time also increases. It seems we have to wait a little until mixed precision training support is introduced in PyTorch 1.5. I am also getting 'gradient overflow' when training on a single GPU with O1 (when setting batch_size_per_gpu = 1; for batch_size_per_gpu > 1 it works fine), even after…
The issue I am getting is described above. Do you know of any workaround for this?
After referencing hszhao's DDP implementation, my program now has half the GPU memory usage and runs about 3x faster during training on multiple devices.
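For reference, a minimal per-process DistributedDataParallel setup in the spirit of that implementation might look like the sketch below; the backend, init method, and toy model are assumptions, not code from this repo or from hszhao's.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main_worker(local_rank: int, world_size: int):
    # One process per GPU; gradients are all-reduced during backward,
    # which is where the speedup over single-process training comes from.
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=world_size, rank=local_rank)
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(512, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    return model
```

Each process would typically be launched one-per-GPU, for example via torch.distributed.launch or torch.multiprocessing.spawn.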
This PR adds mixed precision training support using APEX.
https://github.com/NVIDIA/apex
Automatic mixed precision training makes use of both FP32 and FP16 precisions where appropriate. FP16 operations can leverage the Tensor cores on NVIDIA GPUs (Volta, Turing or newer architectures) for improved throughput.
Mixed precision training can be enabled by passing the --apex flag to the training script.
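For example, an invocation might look like the following sketch; only the --apex flag comes from this PR, while the script name and any other arguments are placeholders for the repo's actual ones.

```bash
# Hypothetical invocation: script name and extra arguments are placeholders.
python train.py --apex [other training arguments...]
```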
How mixed precision works
Mixed precision is the use of both float16 and float32 data types when training a model.
Performing arithmetic operations in float16 takes advantage of the performance gains of using specialized processing units such as the Tensor cores on NVIDIA GPUs. Due to the smaller representable range of float16, performing the entire training with float16 data type can result in underflow of the gradients, leading to convergence or model quality issues.
However, performing only select arithmetic operations in float16 results in performance gains when using compatible hardware accelerators, decreasing training time and reducing memory usage, typically without sacrificing model performance.
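To make the loss-scaling point concrete, here is a minimal sketch of how APEX amp is typically wired into a PyTorch training loop. The toy model, optimizer, data, and the "O1" opt_level are illustrative assumptions, not necessarily what this PR uses.

```python
import torch
from apex import amp

# Placeholder model, optimizer, loss, and data; in this repo they would come
# from the training script itself.
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(8, 512), torch.randint(0, 10, (8,))) for _ in range(10)]

# amp.initialize patches selected ops to run in FP16 while keeping FP32
# master weights; "O1" is the commonly recommended mixed precision level.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for inputs, targets in loader:
    optimizer.zero_grad()
    outputs = model(inputs.cuda())
    loss = criterion(outputs, targets.cuda())
    # Loss scaling protects the small FP16 gradients from underflowing.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```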
To learn more about mixed precision and how it works:
Overview of Automatic Mixed Precision for Deep Learning
NVIDIA Mixed Precision Training Documentation
NVIDIA Deep Learning Performance Guide