Training with Top-k fails on multi GPU #19

Open
thomas-riccardi opened this issue Sep 30, 2021 · 1 comment
@thomas-riccardi

Describe the bug
Running training with the Top-k feature on multiple GPUs fails with torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'epochs'.

It works with a single GPU, or with multiple GPUs without Top-k.

To Reproduce
Steps to reproduce the behavior:

  1. start the container with ./docker_run.sh
  2. run python train.py --outdir=/results --data=/images/ --resume=ffhq256 --gpus=2 --metrics=none --snap=1 --topk=0.9726
  3. result:
...
Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

Traceback (most recent call last):
  File "train.py", line 608, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 603, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 247, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 205, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 166, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/scratch/train.py", line 445, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/scratch/training/training_loop.py", line 305, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
  File "/scratch/training/loss.py", line 81, in accumulate_gradients
    k_frac = np.maximum(self.G_top_k_gamma ** self.G.epochs, self.G_top_k_frac)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 795, in __getattr__
    raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'epochs'

Expected behavior
Training with Top-k should work on multiple GPUs just as it works on a single GPU.
Alternatively, it should refuse to start (with an error) if this configuration is not supported.
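
For reference, a minimal sketch of a possible workaround, assuming the attribute names shown in the traceback (G_top_k_gamma, G_top_k_frac, and the epochs counter on G); the helper below is hypothetical and not part of this repo. DistributedDataParallel only forwards attributes defined by nn.Module itself, so a custom field like epochs has to be read from the wrapped .module:

    import numpy as np
    import torch.nn as nn

    def generator_epochs(G):
        # Hypothetical helper: unwrap DDP before reading the custom attribute.
        if isinstance(G, nn.parallel.DistributedDataParallel):
            return G.module.epochs
        return G.epochs

    # In training/loss.py, accumulate_gradients() could then compute:
    # k_frac = np.maximum(self.G_top_k_gamma ** generator_epochs(self.G), self.G_top_k_frac)

On a single GPU, G is not wrapped, so this would keep the current single-GPU behavior unchanged.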

Desktop (please complete the following information):

  • OS: Linux Ubuntu 20.04
  • NVIDIA driver version 460
  • Docker: nvcr.io/nvidia/pytorch:20.12-py3

Additional context
I use this repo's head (464100c for reference) + a merge of NVlabs#3; there were minimal conflicts, and I checked that the code touched by #16 was not changed by that merge.

@49xxy

49xxy commented Jul 6, 2022


Hi! Under what circumstances should Top-k be used?
