CenterNet mixed-precision training fails with specific cuDNN versions.
Branch: master
Docker image: nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda110-mpi3.1.6-v1.34.0
Environment: cuda=11.0.3, CUDNN_VERSION=8.0.5.39
Command: python src/main.py ctdet --config_file=cfg/resnet_18_coco_mp.yaml --data_dir path_to_coco_dataset
2023-03-02 06:18:26,839 [nnabla][INFO]: Using DataIterator
2023-03-02 06:18:26,865 [nnabla][INFO]: Creating model...
2023-03-02 06:18:26,865 [nnabla][INFO]: {'hm': 80, 'wh': 2, 'reg': 2}
2023-03-02 06:18:26,865 [nnabla][INFO]: batch size per gpu: 24
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00: 0%| | 6/4929 [00:06<1:33:43, 1.14s/it]^C
or
2023-03-02 05:47:38,953 [nnabla][INFO]: Using DataIterator
2023-03-02 05:47:38,959 [nnabla][INFO]: Creating model...
2023-03-02 05:47:38,959 [nnabla][INFO]: {'hm': 80, 'reg': 2, 'wh': 2}
2023-03-02 05:47:38,964 [nnabla][INFO]: batch size per gpu: 32
 0%| | 0/3697 [00:00<?, ?it/s]
 0%| | 0/3697 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "nnabla-examples/object-detection/centernet/src/main.py", line 147, in <module>
    main(opt)
  File "nnabla-examples/object-detection/centernet/src/main.py", line 112, in main
    _ = trainer.update(epoch)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 191, in update
    total_loss, hm_loss, wh_loss, off_loss = self.compute_gradient(
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  [Previous line repeated 7 more times]
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 175, in compute_gradient
    raise RuntimeError(
RuntimeError: Something went wrong with gradient calculations.
--------------------------------------------------------------------------
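The repeated compute_gradient frames in the traceback suggest a retry loop that re-runs the backward pass until the gradients are finite and gives up after a limit. The following is only a minimal sketch of that general pattern (dynamic loss scaling with bounded retries), not the code in ctdet.py. The names `scaled_gradients`, `grad_fn`, and `max_retries`, and the choice to halve the scale, are assumptions made for illustration.

```python
import math


def scaled_gradients(grad_fn, scale, max_retries=10):
    """Hypothetical helper: retry a backward pass with a smaller loss scale.

    grad_fn(scale) is assumed to return a list of gradients that were
    computed from a loss multiplied by `scale`.
    """
    for _ in range(max_retries + 1):
        grads = grad_fn(scale)
        if all(math.isfinite(g) for g in grads):
            # Undo the loss scaling before the optimizer would use the gradients.
            return [g / scale for g in grads], scale
        # Overflow or NaN detected: halve the scale and try again.
        scale /= 2.0
    raise RuntimeError("Gradients stayed non-finite after all retries.")
```

If the gradients are NaN regardless of the scale, as a faulty kernel would produce, no amount of rescaling helps and the loop ends in the RuntimeError, which matches the shape of the traceback above.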
Using a newer cuDNN version solved this issue.
Docker image: nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda116-mpi3.1.6-v1.34.0
Environment: cuda=11.6.0, CUDNN_VERSION=8.4.0.27
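Before starting a long training run, it may help to check which cuDNN version the container provides. The sketch below assumes the image exports a `CUDNN_VERSION` environment variable in the same dotted form quoted in this report. The function name `cudnn_status` is hypothetical, and the two version constants are the only data points taken from this issue, so every other version is reported as untested.

```python
import os

# The only two versions this report gives results for.
REPORTED_FAILING = (8, 0, 5, 39)
REPORTED_WORKING = (8, 4, 0, 27)


def cudnn_status(version_text):
    """Classify a dotted cuDNN version string against the versions in this report."""
    version = tuple(int(part) for part in version_text.strip().split("."))
    if version == REPORTED_FAILING:
        return "reported failing"
    if version == REPORTED_WORKING:
        return "reported working"
    return "untested"


if __name__ == "__main__":
    # Assumption: the container sets CUDNN_VERSION, e.g. "8.0.5.39".
    text = os.environ.get("CUDNN_VERSION")
    if text is None:
        print("CUDNN_VERSION is not set")
    else:
        print(text, "->", cudnn_status(text))
```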
TomonobuTsujikawa
How to reproduce
Branch: master
Use nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda110-mpi3.1.6-v1.34.0 as the base image and install the necessary packages (see https://github.com/sony/nnabla-examples/blob/master/object-detection/centernet/requirements.txt).
cuda=11.0.3, CUDNN_VERSION=8.0.5.39
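Error messages
Training either reports nan for loss and hm_loss within the first few iterations, or aborts with "RuntimeError: Something went wrong with gradient calculations."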
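Because the NaN case does not stop training by itself, a small check on the progress output can flag it early. The sketch below assumes progress lines shaped like the `[Train] epoch:0/140||loss: nan, hm_loss: nan, ...` lines in this report. The function name `non_finite_losses` is hypothetical and is not part of the CenterNet example.

```python
import math
import re

# Matches "name: value" pairs such as "hm_loss:245.3517" or "loss: nan".
_PAIR = re.compile(
    r"(\w+):\s*(-?(?:nan|inf|\d+(?:\.\d+)?(?:e[+-]?\d+)?))",
    re.IGNORECASE,
)


def non_finite_losses(progress_line):
    """Return the names of loss fields whose value is NaN or infinite."""
    bad = []
    for name, value in _PAIR.findall(progress_line):
        if name.endswith("loss") and not math.isfinite(float(value)):
            bad.append(name)
    return bad
```

A training wrapper could call this on each progress line and stop the run as soon as the returned list is non-empty.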
How to solve
Using a newer cuDNN version solved this issue.
Use nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda116-mpi3.1.6-v1.34.0 as the base image and install the necessary packages (see https://github.com/sony/nnabla-examples/blob/master/object-detection/centernet/requirements.txt).
cuda=11.6.0, CUDNN_VERSION=8.4.0.27