
How is the training synchronized across ranks during distributed training? #21

Open

function2-llx opened this issue Jun 10, 2024 · 0 comments
Dear authors,

As mentioned in #2 (comment), the model was trained with multiple GPUs. Also, according to the code, it is trained with DDP. Let's have a look at a piece of code in train_epoch:

SegVol/train.py

Lines 62 to 92 in 97f91e7

for batch in epoch_iterator:
    image, gt3D = batch["image"].cuda(), batch["post_label"].cuda()
    pseudo_seg_cleaned = batch['pseudo_seg_cleaned'].cuda()
    organ_name_list = batch['organ_name_list']
    loss_step_avg = 0
    sl_loss_step_avg = 0
    ssl_loss_step_avg = 0
    for cls_idx in range(len(organ_name_list)):
        optimizer.zero_grad()
        organs_cls = organ_name_list[cls_idx]
        labels_cls = gt3D[:, cls_idx]
        if torch.sum(labels_cls) == 0:
            print(f'[RANK {rank}: GPU {gpu}] ITER-{iter_num} --- No object, skip iter')
            continue
        sl_loss, ssl_loss = segvol_model(image, organs=None, boxes=None, points=None,
                                         train_organs=organs_cls,
                                         train_labels=labels_cls,
                                         pseudo_seg_cleaned=pseudo_seg_cleaned)
        if args.use_pseudo_label:
            loss = sl_loss + 0.1 * ssl_loss
            ssl_loss_step_avg += ssl_loss.item()
        sl_loss_step_avg += sl_loss.item()
        loss_step_avg += loss.item()
        loss.backward()
        optimizer.step()
        print(f'[RANK {rank}: GPU {gpu}] ITER-{iter_num} --- loss {loss.item()}, sl_loss, {sl_loss.item()}, ssl_loss {ssl_loss.item()}')
        iter_num += 1

It seems that for each batch, the loop enumerates every positive class in the batch, computes the loss, and calls backward, at which point gradients are synchronized across ranks by DDP. Even though each rank processes the same number of batches, the ranks may have different total numbers of positive classes. As a result, different ranks perform different numbers of actual optimization iterations, which can cause issues during DDP training.
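
To make the concern concrete, here is a minimal sketch (not from the SegVol repository; the toy linear model, the gloo/CPU setup, the port number, and the per-rank class counts are invented for illustration) of two DDP ranks that call loss.backward() a different number of times for the same batch, which is the situation described above. Every backward on a DDP model launches a gradient all-reduce, so the extra backward on rank 1 has no matching collective on rank 0; depending on the backend and timing, this can hang or leave the ranks exchanging mismatched tensors.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Rank 0 "sees" 2 positive classes in this batch, rank 1 "sees" 3,
    # mirroring how gt3D can contain different numbers of non-empty labels per rank.
    num_positive_classes = 2 if rank == 0 else 3
    for cls_idx in range(num_positive_classes):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 4)).mean()
        loss.backward()  # each backward on a DDP model triggers a gradient all-reduce
        optimizer.step()
        print(f"[rank {rank}] finished optimization step {cls_idx}")

    # Rank 1's third backward waits for an all-reduce that rank 0 never issues;
    # rank 0 has already moved on to this barrier, so the collectives no longer line up.
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)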

Could you please share your thoughts on this? Thanks.

Best wishes
