Training Stops at Initialization with Multi-GPU Setup on Local Machine #83

Open
qwertymert opened this issue Oct 24, 2024 · 0 comments
Problem description: When I run distributed training with multiple GPUs on a single machine, the run hangs at the very beginning. Initialization completes without any errors, but the process stops before training ever starts (no output is produced after the log lines below).

Command used to run training with app/main.py:

python main.py --fname=configs/pretrain/vitl16.yaml --devices cuda:0 cuda:1 cuda:2 cuda:3

Output

[INFO    ][2024-10-24 17:10:42][process_main             ] called-params configs/pretrain/vitl16.yaml
[INFO    ][2024-10-24 17:10:42][process_main             ] loaded params...
{   'app': 'vjepa',
    'data': {   'batch_size': 4,
                'clip_duration': None,
                'crop_size': 224,
                'dataset_type': 'XXXDataset',
                'datasets': [   '/home/xxx/xxx/jepa/src/datasets/xxx.csv'],
                'decode_one_clip': True,
                'filter_short_videos': False,
                'num_clips': 1,
                'num_frames': 16,
                'num_workers': 0,
                'patch_size': 16,
                'pin_mem': True,
                'sampling_rate': 1,
                'tubelet_size': 2},
    'data_aug': {   'auto_augment': False,
                    'motion_shift': False,
                    'random_resize_aspect_ratio': [0.75, 1.35],
                    'random_resize_scale': [0.3, 1.0],
                    'reprob': 0.0},
    'logging': {   'folder': '/home/xxx/xxx/jepa/evals/',
                   'write_tag': 'jepa'},
    'loss': {'loss_exp': 1.0, 'reg_coeff': 0.0},
    'mask': [   {   'aspect_ratio': [0.75, 1.5],
                    'max_keep': None,
                    'max_temporal_keep': 1.0,
                    'num_blocks': 8,
                    'spatial_scale': [0.15, 0.15],
                    'temporal_scale': [1.0, 1.0]},
                {   'aspect_ratio': [0.75, 1.5],
                    'max_keep': None,
                    'max_temporal_keep': 1.0,
                    'num_blocks': 2,
                    'spatial_scale': [0.7, 0.7],
                    'temporal_scale': [1.0, 1.0]}],
    'meta': {   'dtype': 'bfloat16',
                'eval_freq': 100,
                'load_checkpoint': True,
                'read_checkpoint': 'vitl16.pth.tar',
                'seed': 234,
                'use_sdpa': True},
    'model': {   'model_name': 'vit_large',
                 'pred_depth': 12,
                 'pred_embed_dim': 384,
                 'uniform_power': True,
                 'use_mask_tokens': True,
                 'zero_init_mask_tokens': True},
    'nodes': 1,
    'optimization': {   'clip_grad': 10.0,
                        'ema': [0.998, 1.0],
                        'epochs': 300,
                        'final_lr': 1e-06,
                        'final_weight_decay': 0.4,
                        'ipe': 300,
                        'ipe_scale': 1.25,
                        'lr': 0.000625,
                        'start_lr': 0.0002,
                        'warmup': 40,
                        'weight_decay': 0.04},
    'tasks_per_node': 4}
[INFO    ][2024-10-24 17:10:44][process_main             ] Running... (rank: 0/4)
[INFO    ][2024-10-24 17:10:44][main                     ] Running pre-training of app: vjepa

Environment:
Operating System: Ubuntu 24.04 LTS x86_64
Python version: 3.9
PyTorch version: 2.4.1
CUDA version: 12.1
NCCL version: 2.20.5
GPUs: 4 x NVIDIA RTX A5000
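
To help isolate whether the hang is in the training code or in the distributed setup itself, a minimal NCCL smoke test along these lines (my own sketch, not part of the jepa repository; the file name nccl_smoke_test.py is just a placeholder) should show whether process-group initialization works in this environment:

# nccl_smoke_test.py -- launch with: torchrun --nproc_per_node=4 nccl_smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(local_rank)

    # If this call never returns, the hang is in NCCL / process-group setup,
    # not in the V-JEPA training code.
    dist.init_process_group(backend="nccl")

    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # one collective to confirm the GPUs can communicate
    print(f"rank {rank}/{world_size}: all_reduce ok, value={x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this completes on all four A5000s, the environment (PyTorch 2.4.1, CUDA 12.1, NCCL 2.20.5) is probably fine and the problem is more likely in how the training processes are launched or synchronized.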

What I've Tried:

  • Verified that all GPUs are visible and available using nvidia-smi.
  • Verified that CUDA_VISIBLE_DEVICES is set correctly for each process.
  • Attempted to run the script with fewer GPUs (1 or 2), but hit the same issue.
  • Tried modifying main.py, but that did not solve the problem (see the diagnostic sketch below).
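
To see exactly where each rank gets stuck, a small addition like the following (hypothetical placement near the top of app/main.py) makes every process dump its Python traceback after 60 seconds if it is still running; combined with setting NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL in the environment, this should narrow down whether the processes are blocked inside init_process_group, a dataloader, or somewhere else:

# Sketch, hypothetical placement near the top of app/main.py:
import faulthandler

# Print each process's current stack traces to stderr every 60 s,
# so a hung rank reveals where it is blocked.
faulthandler.dump_traceback_later(60, repeat=True)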