Great repo!
I'm studying this topic, and found that the official ImageNet classification example also uses multiprocessing.
I noticed one place where they not only use torch.cuda.set_device(local_rank) (L144) but also pass the specific GPU id everywhere (their args.gpu refers to the local rank).
This seems a bit redundant. I'm wondering if you have any idea about this?
And the documentation for torch.cuda.set_device says:
"Usage of this function is discouraged in favor of device. In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable."
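For reference, the CUDA_VISIBLE_DEVICES approach the docs suggest can be sketched as follows (a minimal illustration; the variable must be set before CUDA is initialized, typically in the launching shell or at the very top of the script):

```python
import os

# Hypothetical example: expose only physical GPU 0 to this process.
# Inside the process, that GPU then appears as device 0, so no
# torch.cuda.set_device() call is needed.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

In a multiprocessing launcher, each worker would get a different value here, so every worker sees exactly one GPU as device 0.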
Also, I noticed that even when using mp, I sometimes cannot kill all the processes with Ctrl+D and have to kill them individually by PID. Not sure if you have ever met this problem?
Thank you!
Of course. If you call torch.cuda.set_device(local_rank), it is fine to use model.cuda() instead of model.cuda(args.gpu). The official repo keeps the explicit-id style as a holdover from best practices for earlier versions.
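The equivalence can be sketched as follows (a minimal illustration, assuming torch is installed; the tiny Linear model and the setup function are hypothetical placeholders):

```python
import torch
import torch.nn as nn

def setup_model(local_rank: int) -> nn.Module:
    model = nn.Linear(8, 8)  # placeholder model
    if torch.cuda.is_available():
        # Option 1: set the default device once per process...
        torch.cuda.set_device(local_rank)
        model = model.cuda()  # ...then a bare .cuda() suffices
        # Option 2 (official repo style): pass the id explicitly:
        # model = model.cuda(local_rank)
    return model

model = setup_model(0)
```

Both options place the model on the same GPU; Option 1 just avoids threading the rank through every .cuda() call.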
And what do you mean by 'cannot kill all the processes by Ctrl+D'? Everything works fine for me.
Thank you!
I mean that when I want to terminate the training, I press Ctrl+D in the terminal. Sometimes one process is still there occupying the GPU, and I have to kill it by its PID.
Thank you!
> I mean when I want to terminate the training, I would use Ctrl+D in the terminal. Sometimes one process would still be there occupying the GPU. Then I have to kill the process by its ID.
I finally encountered a situation today where the GPU memory was not cleared after killing the process with Ctrl+C. It happened when I used a custom CUDA module. If you encounter the same situation, maybe you can refer to facebookresearch/fairseq#487, which works for me (even though it does not look elegant).
I will try to find the cause of this problem, but it does not seem easy to locate...