Training Stops at Initialization with Multi-GPU Setup on Local Machine #83

Open
qwertymert opened this issue Oct 24, 2024 · 0 comments
Problem description: When I run distributed training with multiple GPUs on a single machine, the run hangs at the very beginning. Initialization completes without any errors, but the process stops before training ever starts (no output is produced after the log lines below).

Command used to run training with app/main.py:

python main.py --fname=configs/pretrain/vitl16.yaml --devices cuda:0 cuda:1 cuda:2 cuda:3

Output

[INFO    ][2024-10-24 17:10:42][process_main             ] called-params configs/pretrain/vitl16.yaml
[INFO    ][2024-10-24 17:10:42][process_main             ] loaded params...
{   'app': 'vjepa',
    'data': {   'batch_size': 4,
                'clip_duration': None,
                'crop_size': 224,
                'dataset_type': 'XXXDataset',
                'datasets': [   '/home/xxx/xxx/jepa/src/datasets/xxx.csv'],
                'decode_one_clip': True,
                'filter_short_videos': False,
                'num_clips': 1,
                'num_frames': 16,
                'num_workers': 0,
                'patch_size': 16,
                'pin_mem': True,
                'sampling_rate': 1,
                'tubelet_size': 2},
    'data_aug': {   'auto_augment': False,
                    'motion_shift': False,
                    'random_resize_aspect_ratio': [0.75, 1.35],
                    'random_resize_scale': [0.3, 1.0],
                    'reprob': 0.0},
    'logging': {   'folder': '/home/xxx/xxx/jepa/evals/',
                   'write_tag': 'jepa'},
    'loss': {'loss_exp': 1.0, 'reg_coeff': 0.0},
    'mask': [   {   'aspect_ratio': [0.75, 1.5],
                    'max_keep': None,
                    'max_temporal_keep': 1.0,
                    'num_blocks': 8,
                    'spatial_scale': [0.15, 0.15],
                    'temporal_scale': [1.0, 1.0]},
                {   'aspect_ratio': [0.75, 1.5],
                    'max_keep': None,
                    'max_temporal_keep': 1.0,
                    'num_blocks': 2,
                    'spatial_scale': [0.7, 0.7],
                    'temporal_scale': [1.0, 1.0]}],
    'meta': {   'dtype': 'bfloat16',
                'eval_freq': 100,
                'load_checkpoint': True,
                'read_checkpoint': 'vitl16.pth.tar',
                'seed': 234,
                'use_sdpa': True},
    'model': {   'model_name': 'vit_large',
                 'pred_depth': 12,
                 'pred_embed_dim': 384,
                 'uniform_power': True,
                 'use_mask_tokens': True,
                 'zero_init_mask_tokens': True},
    'nodes': 1,
    'optimization': {   'clip_grad': 10.0,
                        'ema': [0.998, 1.0],
                        'epochs': 300,
                        'final_lr': 1e-06,
                        'final_weight_decay': 0.4,
                        'ipe': 300,
                        'ipe_scale': 1.25,
                        'lr': 0.000625,
                        'start_lr': 0.0002,
                        'warmup': 40,
                        'weight_decay': 0.04},
    'tasks_per_node': 4}
[INFO    ][2024-10-24 17:10:44][process_main             ] Running... (rank: 0/4)
[INFO    ][2024-10-24 17:10:44][main                     ] Running pre-training of app: vjepa

Environment:
Operating System: Ubuntu 24.04 LTS x86_64
Python version: 3.9
PyTorch version: 2.4.1
CUDA version: 12.1
NCCL version: 2.20.5
GPUs: 4 x NVIDIA RTX A5000
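
To help isolate whether the hang is in the training code or in the distributed setup itself, a minimal NCCL smoke test along these lines (my own sketch, not part of the jepa repository; the file name nccl_smoke_test.py is just a placeholder) should show whether process-group initialization works in this environment:

# nccl_smoke_test.py -- launch with: torchrun --nproc_per_node=4 nccl_smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(local_rank)

    # If this call never returns, the hang is in NCCL / process-group setup,
    # not in the V-JEPA training code.
    dist.init_process_group(backend="nccl")

    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # one collective to confirm the GPUs can communicate
    print(f"rank {rank}/{world_size}: all_reduce ok, value={x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this completes on all four A5000s, the environment (PyTorch 2.4.1, CUDA 12.1, NCCL 2.20.5) is probably fine and the problem is more likely in how the training processes are launched or synchronized.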

What I've Tried:

  • Verified that all GPUs are visible and available using nvidia-smi.
  • Verified that CUDA_VISIBLE_DEVICES is set correctly for each process.
  • Attempted to run the script with fewer GPUs (1 or 2), but hit the same issue.
  • Tried modifying main.py, but that did not solve the problem (see the diagnostic sketch below).
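
To see exactly where each rank gets stuck, a small addition like the following (hypothetical placement near the top of app/main.py) makes every process dump its Python traceback after 60 seconds if it is still running; combined with setting NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL in the environment, this should narrow down whether the processes are blocked inside init_process_group, a dataloader, or somewhere else:

# Sketch, hypothetical placement near the top of app/main.py:
import faulthandler

# Print each process's current stack traces to stderr every 60 s,
# so a hung rank reveals where it is blocked.
faulthandler.dump_traceback_later(60, repeat=True)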