Dear authors,
As mentioned in #2 (comment), the model was trained with multiple GPUs. Also, according to the code, it is trained with DDP. Let's have a look at a piece of code in `train_epoch` (SegVol/train.py, lines 62 to 92 at commit 97f91e7):
It seems that for each batch, it enumerates each positive class of the batch, calculates the loss, and performs a backward pass, during which gradients are synchronized across ranks by DDP. Even though every rank has the same number of batches, the ranks may have different numbers of total positive classes. As a result, different ranks will run different numbers of actual optimization iterations, which can cause issues during DDP training.
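To make the concern concrete, here is a minimal sketch of the pattern I am describing (this is not the repository's actual `train_epoch`; the function name, the `class_idx` forward argument, and the `[B, num_classes, ...]` mask layout are assumptions for illustration):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

def train_epoch_sketch(model: DDP, loss_fn, optimizer, data_loader):
    """Illustrative sketch: per-batch, per-positive-class loss + backward."""
    for image, label_masks in data_loader:  # every rank sees the same number of batches
        # label_masks assumed shaped [B, num_classes, ...]; which classes are
        # actually present varies per batch, and therefore per rank.
        positive_classes = [c for c in range(label_masks.shape[1])
                            if label_masks[:, c].any()]
        for c in positive_classes:
            pred = model(image, class_idx=c)      # hypothetical per-class forward
            loss = loss_fn(pred, label_masks[:, c])
            loss.backward()                       # DDP all-reduces gradients here
            optimizer.step()
            optimizer.zero_grad()
```

Because `backward()` on a DDP-wrapped model triggers a collective all-reduce of gradients, every rank has to call it the same number of times; if one rank's batch contains three positive classes and another's contains two, the extra all-reduce has no matching call on the other rank, which can stall the job or pair up gradients from different iterations.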
Could you please share your thoughts on this? Thanks.
Best wishes