Problem description: When attempting to run distributed training across multiple GPUs on a single machine, the training process hangs at the very beginning. Initialization completes without any errors, but the run stalls before the first training step is executed.
Command used to run training with app/main.py:
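The exact command was not included above; a typical single-node launch for this setup (4 GPUs, one process per GPU, assuming the entry point is app/main.py) would look something like:

```shell
# Hypothetical launch command -- the original invocation is not shown in the issue.
# --standalone handles rendezvous locally; --nproc_per_node spawns one worker per GPU.
torchrun --standalone --nproc_per_node=4 app/main.py
```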
Output:
Environment:
Operating System: Ubuntu 24.04 LTS x86_64
Python version: 3.9
PyTorch version: 2.4.1
CUDA version: 12.1
NCCL version: 2.20.5
GPUs: 4 x NVIDIA RTX A5000
What I've Tried:
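A useful first step when a multi-GPU job hangs at startup is to separate rendezvous/process-group setup from NCCL and the GPUs. The sketch below (my suggestion, not from the original report) initializes a single-process group on the CPU-only gloo backend and performs one collective; if this works but the NCCL launch does not, the hang is likely in NCCL or inter-GPU communication, and running with NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL should show where it stalls.

```python
import os
import torch
import torch.distributed as dist

# Minimal rendezvous sanity check: gloo backend, world_size=1, CPU only.
# This isolates process-group initialization from NCCL/GPU issues.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(1)
dist.all_reduce(t)  # sum across the (single-member) group; value stays 1.0
result = t.item()
print("all_reduce ok:", result)

dist.destroy_process_group()
```

If this script completes but the real job still hangs, re-run the real job with `NCCL_DEBUG=INFO` set in the environment to get per-rank NCCL logs.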