
[Discussion] mp: duplicate of torch.cuda.set_device(local_rank) and images = images.cuda(local_rank, non_blocking=True) #5

Open
laoreja opened this issue Jun 26, 2020 · 4 comments


laoreja commented Jun 26, 2020

Hi there,

Great repo!
I'm studying this topic and found that the official ImageNet classification example also uses multiprocessing.

I noticed that they not only call torch.cuda.set_device(local_rank) (L144), but also pass the specific GPU id everywhere (their args.gpu refers to the local rank):

model.cuda(args.gpu)  # L145
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])  # L151
criterion = nn.CrossEntropyLoss().cuda(args.gpu)  # L169

loc = 'cuda:{}'.format(args.gpu)  # L183
checkpoint = torch.load(args.resume, map_location=loc)

if args.gpu is not None:  # L282
    images = images.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)

This seems a bit redundant. I'm wondering if you have any idea why they do it this way?

Also, the documentation for torch.cuda.set_device says:
"Usage of this function is discouraged in favor of device. In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable."

Also, I noticed that even when using mp, sometimes I cannot kill all the processes with Ctrl+D and instead have to kill them individually by their PIDs. Have you ever run into this problem?

Thank you!

tczhangzhi (Owner) commented

Of course. If you call torch.cuda.set_device(local_rank), it is fine to use model.cuda() instead of model.cuda(args.gpu), since the current device is already set to the right GPU. The official repo just kept the older style from earlier best practices.
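Something like this is what I mean (a rough sketch, assuming the usual mp.spawn worker; the init address is just a placeholder):

import torch
import torch.distributed as dist
import torch.nn as nn

def worker(local_rank, world_size):
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:23456',
                            world_size=world_size, rank=local_rank)
    torch.cuda.set_device(local_rank)

    # After set_device, bare .cuda() calls already land on GPU local_rank,
    # so the explicit ids from the official example are not needed.
    model = nn.Linear(10, 10).cuda()
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss().cuda()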

And what do you mean by 'cannot kill all the processes with Ctrl+D'? Everything works fine for me.

tczhangzhi added the enhancement label on Jun 27, 2020

laoreja commented Jun 28, 2020

Thank you!
I mean that when I want to terminate training, I use Ctrl+D in the terminal. Sometimes one process is still left occupying the GPU, and I have to kill it by its PID.


lartpang commented Jul 3, 2020

> Thank you!
> I mean that when I want to terminate training, I use Ctrl+D in the terminal. Sometimes one process is still left occupying the GPU, and I have to kill it by its PID.

Me too... Do you have any advice?


tczhangzhi commented Jul 11, 2020

Today I finally ran into a situation where the GPU memory was not cleared after using Ctrl+C to kill the process. It happened when I was using a custom CUDA module. If you encounter the same situation, maybe you can refer to facebookresearch/fairseq#487, which works for me (even though it does not look elegant).

I will try to find the root cause of this problem, but it does not seem easy to track down...
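In the meantime, one defensive pattern is to spawn with join=False and terminate the workers yourself on KeyboardInterrupt (a rough sketch with my own names, where worker is the usual training entry point; it is not a verified fix for the custom CUDA module case above):

import torch.multiprocessing as mp

def main(world_size):
    # join=False returns a context exposing the spawned worker processes.
    ctx = mp.spawn(worker, args=(world_size,), nprocs=world_size, join=False)
    try:
        ctx.join()
    except KeyboardInterrupt:
        # Make sure every worker really exits so none of them keeps holding GPU memory.
        for p in ctx.processes:
            if p.is_alive():
                p.terminate()
            p.join()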
