
[Discussion] mp: duplicate of torch.cuda.set_device(local_rank) and images = images.cuda(local_rank, non_blocking=True) #5

Open
laoreja opened this issue Jun 26, 2020 · 4 comments


laoreja commented Jun 26, 2020

Hi there,

Great repo!
I'm studying this topic and found that the official ImageNet classification example also uses multiprocessing.

I noticed that they not only call torch.cuda.set_device(local_rank) (L144), but also pass the specific GPU id everywhere (their args.gpu refers to the local rank):

model.cuda(args.gpu)  # L145
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])  # L151
criterion = nn.CrossEntropyLoss().cuda(args.gpu)  # L169

loc = 'cuda:{}'.format(args.gpu)  # L183
checkpoint = torch.load(args.resume, map_location=loc)

if args.gpu is not None:  # L282
    images = images.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)

This seems a bit redundant. I'm wondering if you have any idea why they do it this way?

Also, the documentation for torch.cuda.set_device says:
"Usage of this function is discouraged in favor of device. In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable."

Also, I noticed that even when using mp, sometimes I cannot kill all the processes with Ctrl+D and instead have to kill them individually by their PIDs. Have you ever run into this problem?

Thank you!

tczhangzhi (Owner) commented

Of course. If you call torch.cuda.set_device(local_rank), it is fine to use model.cuda() instead of model.cuda(args.gpu), since the current device is already set to the right GPU. The official repo just kept the older style from earlier best practices.
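Something like this is what I mean (a rough sketch, assuming the usual mp.spawn worker; the init address is just a placeholder):

import torch
import torch.distributed as dist
import torch.nn as nn

def worker(local_rank, world_size):
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:23456',
                            world_size=world_size, rank=local_rank)
    torch.cuda.set_device(local_rank)

    # After set_device, bare .cuda() calls already land on GPU local_rank,
    # so the explicit ids from the official example are not needed.
    model = nn.Linear(10, 10).cuda()
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss().cuda()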

And what do you mean by 'cannot kill all the processes with Ctrl+D'? Everything works fine for me.

tczhangzhi added the enhancement label on Jun 27, 2020

laoreja commented Jun 28, 2020

Thank you!
I mean that when I want to terminate training, I use Ctrl+D in the terminal. Sometimes one process is still left occupying the GPU, and I have to kill it by its PID.


lartpang commented Jul 3, 2020

> Thank you!
> I mean that when I want to terminate training, I use Ctrl+D in the terminal. Sometimes one process is still left occupying the GPU, and I have to kill it by its PID.

Me too... Do you have any advice?


tczhangzhi commented Jul 11, 2020

Today I finally ran into a situation where the GPU memory was not cleared after using Ctrl+C to kill the process. It happened when I was using a custom CUDA module. If you encounter the same situation, maybe you can refer to facebookresearch/fairseq#487, which works for me (even though it does not look elegant).

I will try to find the root cause of this problem, but it does not seem easy to track down...
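In the meantime, one defensive pattern is to spawn with join=False and terminate the workers yourself on KeyboardInterrupt (a rough sketch with my own names, where worker is the usual training entry point; it is not a verified fix for the custom CUDA module case above):

import torch.multiprocessing as mp

def main(world_size):
    # join=False returns a context exposing the spawned worker processes.
    ctx = mp.spawn(worker, args=(world_size,), nprocs=world_size, join=False)
    try:
        ctx.join()
    except KeyboardInterrupt:
        # Make sure every worker really exits so none of them keeps holding GPU memory.
        for p in ctx.processes:
            if p.is_alive():
                p.terminate()
            p.join()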
