Problems when training with multiple GPUs #78

Open
EthanLeong opened this issue Feb 16, 2023 · 0 comments

Hi,

Thanks to the author for this amazing repository. I am having problems training the model with multiple GPUs, and I wonder whether anyone else has run into the same issue. Training works fine on a single RTX 3090, but whenever I try to use 2 GPUs with the following command:
python main.py configs/resa/resa34_openlane.py --gpus 0 1
the following error occurs:
/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
File "main.py", line 66, in <module>
main()
File "main.py", line 36, in main
runner.train()
File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 99, in train
self.train_epoch(epoch, train_loader)
File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 75, in train_epoch
loss.backward()
File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 141, in backward
grad_tensors_ = _make_grads(tensors, grad_tensors_)
File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 50, in _make_grads
raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
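
One plausible reading of this traceback, offered as a sketch rather than a confirmed diagnosis: when each GPU replica returns a 0-dim loss, the data-parallel gather step stacks them into a vector with one entry per GPU (which is what the UserWarning above describes), and backward() without an explicit gradient only accepts a scalar. The snippet below imitates that situation on CPU. The names w and per_gpu_losses are illustrative stand-ins, not lanedet code.

```python
import torch

# Stand-in for the gathered loss when two replicas each return a
# 0-dim loss: the result has shape (2,), not ().
w = torch.tensor(1.0, requires_grad=True)
per_gpu_losses = torch.stack([w * 2.0, w * 4.0])

try:
    per_gpu_losses.backward()  # non-scalar output, so autograd refuses
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

# Reduce to a single scalar first, then backpropagate.
loss = per_gpu_losses.mean()
loss.backward()
grad_value = w.grad.item()  # gradient of (2w + 4w) / 2 with respect to w
```

Using mean() rather than sum() keeps the gradient scale independent of the number of GPUs; sum() would multiply it by the GPU count.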

After searching online, I found that this error can be avoided by changing loss.backward() to loss.sum().backward(). However, with that change the recorder and logging code fails instead:
--- Logging error ---
Traceback (most recent call last):
File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 1085, in emit
msg = self.format(record)
File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 929, in format
return fmt.format(record)
File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 668, in format
record.message = record.getMessage()
File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 371, in getMessage
msg = str(self.msg)
File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 116, in __str__
loss_state.append('{}: {:.4f}'.format(k, v.avg))
File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 32, in avg
d = torch.tensor(list(self.deque))
ValueError: only one element tensors can be converted to Python scalars
Call stack:
File "main.py", line 66, in <module>
main()
File "main.py", line 36, in main
runner.train()
File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 99, in train
self.train_epoch(epoch, train_loader)
File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 89, in train_epoch
self.recorder.record('train')
File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 97, in record
self.logger.info(self)
Message: <lanedet.utils.recorder.Recorder object at 0x7fd865ac7eb0>
Arguments: ()
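
The ValueError above is consistent with the recorder's deque holding per-GPU loss vectors instead of single values, since torch.tensor(list(...)) cannot turn multi-element tensors into Python scalars. A minimal sketch of one way around this, assuming the recorded value may be either a 0-dim tensor or a per-GPU vector: collapse it to a Python float before storing it. SmoothedLoss is a hypothetical stand-in written for illustration, not lanedet's actual recorder class.

```python
from collections import deque

import torch


class SmoothedLoss:
    """Hypothetical deque-backed loss meter (not lanedet's actual class)."""

    def __init__(self, window_size=20):
        self.deque = deque(maxlen=window_size)

    def update(self, value):
        # Collapse a per-GPU vector (or a 0-dim tensor) to one Python float
        # so the deque only ever holds plain scalars.
        if torch.is_tensor(value):
            value = value.detach().float().mean().item()
        self.deque.append(float(value))

    @property
    def avg(self):
        # Safe now: every element of the deque is a Python float.
        return torch.tensor(list(self.deque)).mean().item()


meter = SmoothedLoss()
meter.update(torch.tensor([0.5, 1.5]))  # shape (2,), as gathered from 2 GPUs
meter.update(torch.tensor(3.0))         # 0-dim, as on a single GPU
avg_text = "{}: {:.4f}".format("loss", meter.avg)
print(avg_text)
```

An alternative with the same effect would be to reduce the loss once in the training loop (for example loss = loss.mean()) and pass that scalar to both backward() and the recorder.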

Does anyone have an idea how to solve this? Any help is appreciated. Thank you!
