
Distributed training cannot achieve ideal accuracy #6

Open
Cathy0908 opened this issue Nov 3, 2022 · 1 comment

@Cathy0908

Some code in train.py looks like this:

if world_size > 1:
    with torch.no_grad():
        all_z = [torch.zeros_like(z) for _ in range(world_size)]
        torch.distributed.all_gather(all_z, z)
    all_z[cfg.local_rank] = z
    z = torch.cat(all_z)
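
For reference, here is a minimal, self-contained sketch of what I understand this gather-with-local-gradient pattern to do (my own paraphrase, not the repo's code; gather_embeddings is a made-up helper name, and it assumes the process group is already initialized). The key point is that tensors returned by all_gather carry no gradient, so the local rank's slot is overwritten with the original z to keep its autograd graph:

import torch
import torch.distributed as dist

def gather_embeddings(z: torch.Tensor) -> torch.Tensor:
    """Gather z from every rank; keep the autograd graph only for the local chunk."""
    world_size = dist.get_world_size()
    if world_size == 1:
        return z
    with torch.no_grad():
        # all_gather fills detached copies; these carry no gradient.
        all_z = [torch.zeros_like(z) for _ in range(world_size)]
        dist.all_gather(all_z, z)
    # Re-insert the local tensor so gradients flow for this rank's samples.
    # (dist.get_rank() is the global rank; on a single node it equals the local rank.)
    all_z[dist.get_rank()] = z
    return torch.cat(all_z, dim=0)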

I'm not sure whether this distributed training code is correct. I tried training with 8 GPUs and got lower accuracy than training with a single GPU. My settings are:

Train 1 GPU:

python train.py --ds CUB --model vit_small_patch16_224 --num_samples 9 --lr 3e-5 --ep 50 --resize 256 --bs 900

Train 8 GPU:

python -m torch.distributed.launch --nproc_per_node=4 train.py --ds CUB --model vit_small_patch16_224 --num_samples 9 --lr 3e-5 --ep 50 --resize 256 --bs 225

I think training with batch-size 4N on 1 GPU should be exactly equivalent to training with batch-size N on each of 4 GPUs (when SyncBN is on).
What's the correct way to train on multiple GPUs?
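
For context, this is roughly the single-node DDP setup I would expect (a sketch under my own assumptions, not the repo's actual code; setup_ddp is a made-up helper, and it assumes the launcher sets the LOCAL_RANK environment variable, i.e. torchrun or torch.distributed.launch --use_env):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / launch --use_env
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # Convert any BatchNorm layers to SyncBatchNorm so that per-GPU batch-size N
    # matches the BN statistics of batch-size N * world_size on a single GPU.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return DDP(model, device_ids=[local_rank])

(For vit_small_patch16_224 the SyncBN conversion is effectively a no-op, since ViTs use LayerNorm rather than BatchNorm.)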

@minhquoc0712

minhquoc0712 commented Feb 1, 2023

Hi @Cathy0908, how far are your results from the paper's? I trained on Cars196 with one NVIDIA A100, and all the results are lower than in the paper. The lower the K value in Recall@K, the larger the gap.
