CenterNet mixed-precision training cannot work well with specific cuDNN versions #373

hyingho opened this issue Mar 9, 2023

CenterNet mixed-precision training fails with specific cuDNN versions: the training loss becomes NaN, or gradient computation aborts with a RuntimeError.

How to reproduce

python src/main.py ctdet --config_file=cfg/resnet_18_coco_mp.yaml --data_dir path_to_coco_dataset

Error messages

2023-03-02 06:18:26,839 [nnabla][INFO]: Using DataIterator
2023-03-02 06:18:26,865 [nnabla][INFO]: Creating model...
2023-03-02 06:18:26,865 [nnabla][INFO]: {'hm': 80, 'wh': 2, 'reg': 2}
2023-03-02 06:18:26,865 [nnabla][INFO]: batch size per gpu: 24
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00:   0%|
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00:   0%|
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00:   0%|
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00:   0%|
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00:   0%|
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00:   0%|          | 6/4929 [00:06<1:33:43,  1.14s/it]^C

or

2023-03-02 05:47:38,953 [nnabla][INFO]: Using DataIterator
2023-03-02 05:47:38,959 [nnabla][INFO]: Creating model...
2023-03-02 05:47:38,959 [nnabla][INFO]: {'hm': 80, 'reg': 2, 'wh': 2}
2023-03-02 05:47:38,964 [nnabla][INFO]: batch size per gpu: 32
  0%|          | 0/3697 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "nnabla-examples/object-detection/centernet/src/main.py", line 147, in <module>
    main(opt)
  File "nnabla-examples/object-detection/centernet/src/main.py", line 112, in main
    _ = trainer.update(epoch)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 191, in update
    total_loss, hm_loss, wh_loss, off_loss = self.compute_gradient(
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  [Previous line repeated 7 more times]
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 175, in compute_gradient
    raise RuntimeError(
RuntimeError: Something went wrong with gradient calculations.
--------------------------------------------------------------------------
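
The recursive calls to compute_gradient in the traceback have the shape of a dynamic loss-scaling retry loop: when the scaled gradients overflow to Inf/NaN, the trainer lowers the loss scale and re-runs the same batch, and raises once a retry limit is hit (the three explicit frames plus "[Previous line repeated 7 more times]" suggest a limit around 10). A minimal sketch of that pattern, not the actual CenterNet trainer code; forward_backward, max_retries, and the retry limit are illustrative:

import math

def compute_gradient(forward_backward, scale, retries=0, max_retries=10):
    # forward_backward(scale) is assumed to run one scaled forward/backward
    # pass and return the resulting gradient values for this batch.
    grads = forward_backward(scale)
    if any(not math.isfinite(g) for g in grads):
        if retries >= max_retries:
            # After repeated overflow the trainer gives up, matching the
            # RuntimeError at the bottom of the traceback above.
            raise RuntimeError("Something went wrong with gradient calculations.")
        # Halve the loss scale and retry the same batch.
        return compute_gradient(forward_backward, scale / 2.0,
                                retries + 1, max_retries)
    return grads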

How to solve

Upgrading to a newer cuDNN version resolved the issue.
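
To confirm which cuDNN build is actually loaded at runtime, one quick check is to query the library's cudnnGetVersion entry point directly. A minimal sketch, assuming libcudnn.so is resolvable by the dynamic linker; cudnnGetVersion is part of the public cuDNN C API and returns an integer such as 8600 for cuDNN 8.6.0:

import ctypes

# Load whichever libcudnn the dynamic linker resolves -- the same one the
# training process would pick up. Some installs only ship a versioned
# soname, in which case use e.g. "libcudnn.so.8" instead.
libcudnn = ctypes.CDLL("libcudnn.so")
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN version:", libcudnn.cudnnGetVersion())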
