torch.distributed.DistBackendError when training on multiple GPUs #2635
Comments
Any update?
It looks like the error refers to a timeout in the NCCL library. Could you paste the installed versions of PyTorch, Lightning, and NCCL? Also, are any training steps actually running, or is it crashing right from the start? (I see that it doesn't get past the first training epoch.)
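For reference, a minimal sketch of how one could print the versions requested above; it assumes the standalone `lightning` package and an NVIDIA GPU build of PyTorch:

```python
import torch
import lightning

print("PyTorch:", torch.__version__)
print("Lightning:", lightning.__version__)
# NCCL version bundled with the PyTorch build (returned as a tuple on recent releases)
print("NCCL:", torch.cuda.nccl.version())
```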
OK, looking at the traceback further, it looks like there's an issue with syncing losses to display them in the progress bar. Could you try passing in
Here is the log with
Installed versions: lightning 2.1.4. Yes, please see the (more complete) log attached above: base training succeeds; it's the transfer training afterwards that crashes.
scvi crashes when trying to train on multiple GPUs (2x Tesla P100-PCIE-16GB).
As an attempt to work around the Lightning-AI/pytorch-lightning#17212 issue,
strategy='ddp_find_unused_parameters_true'
was set. Versions:
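For context, a minimal sketch of the kind of multi-GPU setup described above, assuming scvi-tools forwards extra keyword arguments such as strategy to the underlying Lightning Trainer; the synthetic dataset and epoch count are placeholders, not the reporter's actual data:

```python
import scvi

# Placeholder data and model setup for illustration only
adata = scvi.data.synthetic_iid()
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)

# Multi-GPU training with the DDP strategy that keeps find_unused_parameters
# enabled, as a workaround for Lightning-AI/pytorch-lightning#17212
model.train(
    max_epochs=10,
    accelerator="gpu",
    devices=-1,  # use all available GPUs
    strategy="ddp_find_unused_parameters_true",
)
```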