Problems when training with multiple GPUs #78

EthanLeong · 2023-02-16T08:17:13Z

Hi,

Thanks to the author for this amazing repository. I am having problems training the model with multiple GPUs, and I wonder whether anyone else has run into the same issue. Training works fine on a single RTX 3090, but whenever I try to use 2 GPUs with the following command:
python main.py configs/resa/resa34_openlane.py --gpus 0 1
The following error occurs:
/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/nn/parallel/functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "main.py", line 66, in <module>
    main()
  File "main.py", line 36, in main
    runner.train()
  File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 99, in train
    self.train_epoch(epoch, train_loader)
  File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 75, in train_epoch
    loss.backward()
  File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 141, in backward
    grad_tensors = _make_grads(tensors, grad_tensors)
  File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 50, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
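
The UserWarning above suggests that the loss is being gathered from each GPU into a vector with one entry per device, which would explain why `backward()` no longer sees a scalar. Below is a minimal sketch of reducing such a vector before calling `backward()`. It assumes the model is wrapped in `torch.nn.DataParallel`; the helper `reduce_loss` and the simulated `gathered` tensor are hypothetical and are not taken from the lanedet code.

```python
import torch


def reduce_loss(loss: torch.Tensor) -> torch.Tensor:
    """Collapse a per-GPU loss vector to a scalar (hypothetical helper)."""
    # A 0-dim tensor is already a scalar; otherwise average over the replicas.
    return loss.mean() if loss.dim() > 0 else loss


# Simulate what a 2-GPU gather would return: one loss value per replica.
w = torch.tensor([1.0, 2.0], requires_grad=True)
gathered = torch.stack([w[0] * 3.0, w[1] * 5.0])  # shape (2,)

loss = reduce_loss(gathered)
loss.backward()  # succeeds because loss is now a 0-dim tensor
print(loss.item(), w.grad.tolist())
```

Using `.mean()` rather than `.sum()` keeps the loss magnitude comparable to the single-GPU run; `.sum()` also yields a scalar but scales the gradients by the number of GPUs.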

After searching on the Internet, I found out that this error can be avoided by changing loss.backward() to loss.sum().backward(). However, this would cause the recorder and logging function to fail:
--- Logging error ---
Traceback (most recent call last):
  File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 1085, in emit
    msg = self.format(record)
  File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 929, in format
    return fmt.format(record)
  File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 668, in format
    record.message = record.getMessage()
  File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 371, in getMessage
    msg = str(self.msg)
  File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 116, in __str__
    loss_state.append('{}: {:.4f}'.format(k, v.avg))
  File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 32, in avg
    d = torch.tensor(list(self.deque))
ValueError: only one element tensors can be converted to Python scalars
Call stack:
  File "main.py", line 66, in <module>
    main()
  File "main.py", line 36, in main
    runner.train()
  File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 99, in train
    self.train_epoch(epoch, train_loader)
  File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 89, in train_epoch
    self.recorder.record('train')
  File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 97, in record
    self.logger.info(self)
Message: <lanedet.utils.recorder.Recorder object at 0x7fd865ac7eb0>
Arguments: ()
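
The ValueError in this traceback is what `torch.tensor(list(...))` raises when the list holds multi-element tensors, so the recorder is probably still receiving the unreduced per-GPU loss values. One possible workaround is to convert every value to a plain Python float before it is stored. The sketch below illustrates the idea with a hypothetical `SmoothedValue` class; it is a stand-in written for this example, not the actual code in `lanedet/utils/recorder.py`.

```python
from collections import deque

import torch


class SmoothedValue:
    """Hypothetical stand-in for the recorder's running-average meter."""

    def __init__(self, window_size=20):
        self.deque = deque(maxlen=window_size)

    def update(self, value):
        # Tensors of any shape are averaged down to a single Python float,
        # so a per-GPU loss vector no longer ends up in the deque as-is.
        if isinstance(value, torch.Tensor):
            value = value.detach().float().mean().item()
        self.deque.append(float(value))

    @property
    def avg(self):
        # The deque now holds only floats, so this conversion is safe.
        return torch.tensor(list(self.deque)).mean().item()


meter = SmoothedValue()
meter.update(torch.tensor([0.5, 1.5]))  # simulated 2-GPU gathered loss
meter.update(torch.tensor(2.0))         # ordinary scalar loss
print('{}: {:.4f}'.format('loss', meter.avg))
```

Reducing at the point where values are recorded means the logging format string `'{}: {:.4f}'` keeps working for both single-GPU and multi-GPU runs.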

Does anyone have an idea how to solve this? Any help is appreciated! Thank you.
