
How is the training synchronized across ranks during distributed training? #21

Open

function2-llx opened this issue Jun 10, 2024 · 0 comments
Dear authors,

As mentioned in #2 (comment), the model was trained with multiple GPUs. Also, according to the code, it is trained with DDP. Let's have a look at a piece of code in train_epoch:

SegVol/train.py

Lines 62 to 92 in 97f91e7

for batch in epoch_iterator:
    image, gt3D = batch["image"].cuda(), batch["post_label"].cuda()
    pseudo_seg_cleaned = batch['pseudo_seg_cleaned'].cuda()
    organ_name_list = batch['organ_name_list']
    loss_step_avg = 0
    sl_loss_step_avg = 0
    ssl_loss_step_avg = 0
    for cls_idx in range(len(organ_name_list)):
        optimizer.zero_grad()
        organs_cls = organ_name_list[cls_idx]
        labels_cls = gt3D[:, cls_idx]
        if torch.sum(labels_cls) == 0:
            print(f'[RANK {rank}: GPU {gpu}] ITER-{iter_num} --- No object, skip iter')
            continue
        sl_loss, ssl_loss = segvol_model(image, organs=None, boxes=None, points=None,
                                         train_organs=organs_cls,
                                         train_labels=labels_cls,
                                         pseudo_seg_cleaned=pseudo_seg_cleaned)
        if args.use_pseudo_label:
            loss = sl_loss + 0.1 * ssl_loss
            ssl_loss_step_avg += ssl_loss.item()
        sl_loss_step_avg += sl_loss.item()
        loss_step_avg += loss.item()
        loss.backward()
        optimizer.step()
        print(f'[RANK {rank}: GPU {gpu}] ITER-{iter_num} --- loss {loss.item()}, sl_loss, {sl_loss.item()}, ssl_loss {ssl_loss.item()}')
        iter_num += 1

It seems that for each batch, the loop enumerates every positive class in the batch, computes the loss, and calls backward, at which point gradients are synchronized across ranks by DDP. Even though each rank processes the same number of batches, the ranks may have different total numbers of positive classes. As a result, different ranks perform different numbers of actual optimization iterations, which can cause issues during DDP training.
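
To make the concern concrete, here is a minimal sketch (not from the SegVol repository; the toy linear model, the gloo/CPU setup, the port number, and the per-rank class counts are invented for illustration) of two DDP ranks that call loss.backward() a different number of times for the same batch, which is the situation described above. Every backward on a DDP model launches a gradient all-reduce, so the extra backward on rank 1 has no matching collective on rank 0; depending on the backend and timing, this can hang or leave the ranks exchanging mismatched tensors.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Rank 0 "sees" 2 positive classes in this batch, rank 1 "sees" 3,
    # mirroring how gt3D can contain different numbers of non-empty labels per rank.
    num_positive_classes = 2 if rank == 0 else 3
    for cls_idx in range(num_positive_classes):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 4)).mean()
        loss.backward()  # each backward on a DDP model triggers a gradient all-reduce
        optimizer.step()
        print(f"[rank {rank}] finished optimization step {cls_idx}")

    # Rank 1's third backward waits for an all-reduce that rank 0 never issues;
    # rank 0 has already moved on to this barrier, so the collectives no longer line up.
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)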

Could you please share your thoughts on this? Thanks.

Best wishes
