
Distributed training cannot achieve ideal accuracy #6

Open
Cathy0908 opened this issue Nov 3, 2022 · 1 comment

@Cathy0908

Some code in train.py looks like this:

if world_size > 1:
    with torch.no_grad():
        all_z = [torch.zeros_like(z) for _ in range(world_size)]
        torch.distributed.all_gather(all_z, z)
    all_z[cfg.local_rank] = z
    z = torch.cat(all_z)
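
For reference, here is a minimal, self-contained sketch of what I understand this gather-with-local-gradient pattern to do (my own paraphrase, not the repo's code; gather_embeddings is a made-up helper name, and it assumes the process group is already initialized). The key point is that tensors returned by all_gather carry no gradient, so the local rank's slot is overwritten with the original z to keep its autograd graph:

import torch
import torch.distributed as dist

def gather_embeddings(z: torch.Tensor) -> torch.Tensor:
    """Gather z from every rank; keep the autograd graph only for the local chunk."""
    world_size = dist.get_world_size()
    if world_size == 1:
        return z
    with torch.no_grad():
        # all_gather fills detached copies; these carry no gradient.
        all_z = [torch.zeros_like(z) for _ in range(world_size)]
        dist.all_gather(all_z, z)
    # Re-insert the local tensor so gradients flow for this rank's samples.
    # (dist.get_rank() is the global rank; on a single node it equals the local rank.)
    all_z[dist.get_rank()] = z
    return torch.cat(all_z, dim=0)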

I'm not sure whether this distributed training code is correct. I tried training with 8 GPUs and got lower accuracy than training with a single GPU. My settings are:

Train 1 GPU:

python train.py --ds CUB --model vit_small_patch16_224 --num_samples 9 --lr 3e-5 --ep 50 --resize 256 --bs 900

Train 8 GPU:

python -m torch.distributed.launch --nproc_per_node=4 train.py --ds CUB --model vit_small_patch16_224 --num_samples 9 --lr 3e-5 --ep 50 --resize 256 --bs 225

I think training with batch-size 4N on 1 GPU should be exactly equivalent to training with batch-size N on each of 4 GPUs (when SyncBN is on).
What's the correct way to train on multiple GPUs?
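
For context, this is roughly the single-node DDP setup I would expect (a sketch under my own assumptions, not the repo's actual code; setup_ddp is a made-up helper, and it assumes the launcher sets the LOCAL_RANK environment variable, i.e. torchrun or torch.distributed.launch --use_env):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / launch --use_env
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # Convert any BatchNorm layers to SyncBatchNorm so that per-GPU batch-size N
    # matches the BN statistics of batch-size N * world_size on a single GPU.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return DDP(model, device_ids=[local_rank])

(For vit_small_patch16_224 the SyncBN conversion is effectively a no-op, since ViTs use LayerNorm rather than BatchNorm.)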

@minhquoc0712

minhquoc0712 commented Feb 1, 2023

Hi @Cathy0908, how far are your results from the paper's? I trained on Cars196 with one NVIDIA A100, and all the results are lower than in the paper. The lower the K value in Recall@K, the larger the gap.
