
about training device #13

Open
long280 opened this issue Mar 21, 2024 · 5 comments

long280 commented Mar 21, 2024

Hello, thank you very much for your work. I would like to ask what hardware you used when training your code. Also, why is the fc_weight file provided with your paper only 4 KB, while the model file I trained is 1-2 GB?

255doesnotexist commented Aug 12, 2024

They may have uploaded a git-lfs pointer file instead of the full pretrained model...
On a Tesla P40 a single iteration takes me 6-7 s, so it took about 20 min to see the first loss output.
Also, with the default hyperparameters it takes no less than 13.5 GB of VRAM to train on the GenImage SD-V1.4 dataset with batch size 256.
One epoch needs ~1.5 hrs.
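
If you want to confirm whether the 4 KB fc_weight file is an LFS pointer rather than the real weights, a quick check is to look at its first bytes: git-lfs pointer files are tiny text files that begin with a version line. A minimal sketch; the file name below is just a placeholder.

# Peek at the first bytes of the downloaded weight file (placeholder name).
with open("fc_weights.pth", "rb") as f:
    head = f.read(64)

# Git-lfs pointer files start with "version https://git-lfs.github.com/spec/v1".
if head.startswith(b"version https://git-lfs"):
    print("This is a git-lfs pointer file, not the actual checkpoint.")
else:
    print("This looks like a real binary checkpoint.")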

@255doesnotexist

UPD: but the validation results are not so good. I don't know if I got something wrong in the training process. I performed training by setting real_list_path/fake_list_path manually in the command args, just let it run, and saved the early-stop checkpoints, but it gives me a very low AP score on the GenImage validation set.

@oceanzhf

I would like to ask, why does it take two hours to run one epoch? Is this normal? How did you perform training by addressing fake/real_list_path manually in the command args?

255doesnotexist commented Sep 23, 2024

I would like to ask, why does it take two hours to run one epoch? Is this normal?

I don't know, but it may be normal, since no runtime exception was thrown during the training process.

How did you perform training by addressing fake/real_list_path manually in the command args?

You should modify data/datasets.py and add a data mode such as 'manually':

elif opt.data_mode == 'manually':
    # build the real/fake image lists directly from the paths given on the command line
    real_list = get_list(os.path.join(opt.real_list_path))
    fake_list = get_list(os.path.join(opt.fake_list_path))

Because there is no .pickle file at that path, it just triggers a recursive search through your image dataset path.

Point it at your real and fake image paths and it will work right away.
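
For reference, here is a minimal self-contained sketch of the command-line options this mode relies on. Whether the repo's option parser already defines real_list_path / fake_list_path (and under which names) is an assumption, so treat this as an illustration rather than the repo's actual argument list.

import argparse

# Hypothetical option definitions mirroring the opt.* fields used in the 'manually' mode above.
parser = argparse.ArgumentParser()
parser.add_argument('--data_mode', type=str, default='manually')
parser.add_argument('--real_list_path', type=str, default='/path/to/real_images')  # placeholder path
parser.add_argument('--fake_list_path', type=str, default='/path/to/fake_images')  # placeholder path
opt = parser.parse_args([])  # parse the defaults just to show the resulting opt fields
print(opt.data_mode, opt.real_list_path, opt.fake_list_path)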

@oceanzhf

Thank you very much for your answer. When you validate with your trained .pth file, do you encounter this error: RuntimeError: Error(s) in loading state_dict for Linear: Missing key(s) in state_dict: 'weight', 'bias'. Unexpected key(s) in state_dict: 'model', 'optimizer', 'total_steps'?
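
For what it's worth, that error suggests the saved .pth is a full training checkpoint, i.e. a dict with 'model', 'optimizer' and 'total_steps' keys, rather than the bare fc weights the validation script expects. A minimal sketch of unwrapping it first; the file names and the 'fc.' key prefix are assumptions, not the repo's actual naming.

import torch

# Load the full training checkpoint and pull out the model weights (placeholder file name).
ckpt = torch.load("model_epoch_best.pth", map_location="cpu")
model_sd = ckpt["model"]

# Keep only the final linear layer's parameters and strip the prefix so the keys become
# 'weight' and 'bias'. The 'fc.' prefix is an assumption about how that layer is named.
fc_sd = {k.replace("fc.", "", 1): v for k, v in model_sd.items() if k.startswith("fc.")}
torch.save(fc_sd, "fc_weights.pth")  # now loadable into a bare nn.Linear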
