
about training device #13

Open
long280 opened this issue Mar 21, 2024 · 5 comments

long280 commented Mar 21, 2024

Hello, thank you very much for your work. I would like to ask what hardware you used when training your code. Also, why is the fc_weight file provided with your paper only 4 KB, while the model file I trained is 1-2 GB?

255doesnotexist commented Aug 12, 2024

They may have uploaded a git-lfs pointer file instead of the full pretrained model...
On a Tesla P40 a single iteration takes me 6-7 s, so it took about 20 min to see the first loss output.
Also, with the default hyperparameters it takes no less than 13.5 GB of VRAM to train on the GenImage SD-V1.4 dataset with batch size 256.
One epoch needs ~1.5 hrs.
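
If you want to confirm whether the 4 KB fc_weight file is an LFS pointer rather than the real weights, a quick check is to look at its first bytes: git-lfs pointer files are tiny text files that begin with a version line. A minimal sketch; the file name below is just a placeholder.

# Peek at the first bytes of the downloaded weight file (placeholder name).
with open("fc_weights.pth", "rb") as f:
    head = f.read(64)

# Git-lfs pointer files start with "version https://git-lfs.github.com/spec/v1".
if head.startswith(b"version https://git-lfs"):
    print("This is a git-lfs pointer file, not the actual checkpoint.")
else:
    print("This looks like a real binary checkpoint.")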

@255doesnotexist

UPD: but the validation results are not so good. I don't know if I got something wrong in the training process. I performed training by setting real_list_path/fake_list_path manually in the command args, just let it run, and saved the early-stop checkpoints, but it gives me a very low AP score on the GenImage validation set.

@oceanzhf

I would like to ask, why does it take two hours to run one epoch? Is this normal? How did you perform training by addressing fake/real_list_path manually in the command args?

255doesnotexist commented Sep 23, 2024

I would like to ask, why does it take two hours to run one epoch? Is this normal?

I don't know, but it may be normal, since no runtime exception was thrown during the training process.

How did you perform training by addressing fake/real_list_path manually in the command args?

You should modify data/datasets.py and add a data mode such as 'manually':

elif opt.data_mode == 'manually':
    # build the real/fake image lists directly from the paths given on the command line
    real_list = get_list(os.path.join(opt.real_list_path))
    fake_list = get_list(os.path.join(opt.fake_list_path))

Because there is no .pickle file at that path, it just triggers a recursive search through your image dataset path.

Point it at your real and fake image paths and it will work right away.
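
For reference, here is a minimal self-contained sketch of the command-line options this mode relies on. Whether the repo's option parser already defines real_list_path / fake_list_path (and under which names) is an assumption, so treat this as an illustration rather than the repo's actual argument list.

import argparse

# Hypothetical option definitions mirroring the opt.* fields used in the 'manually' mode above.
parser = argparse.ArgumentParser()
parser.add_argument('--data_mode', type=str, default='manually')
parser.add_argument('--real_list_path', type=str, default='/path/to/real_images')  # placeholder path
parser.add_argument('--fake_list_path', type=str, default='/path/to/fake_images')  # placeholder path
opt = parser.parse_args([])  # parse the defaults just to show the resulting opt fields
print(opt.data_mode, opt.real_list_path, opt.fake_list_path)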

@oceanzhf

Thank you very much for your answer. When you validate with your trained .pth file, do you encounter this error: RuntimeError: Error(s) in loading state_dict for Linear: Missing key(s) in state_dict: 'weight', 'bias'. Unexpected key(s) in state_dict: 'model', 'optimizer', 'total_steps'?
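
For what it's worth, that error suggests the saved .pth is a full training checkpoint, i.e. a dict with 'model', 'optimizer' and 'total_steps' keys, rather than the bare fc weights the validation script expects. A minimal sketch of unwrapping it first; the file names and the 'fc.' key prefix are assumptions, not the repo's actual naming.

import torch

# Load the full training checkpoint and pull out the model weights (placeholder file name).
ckpt = torch.load("model_epoch_best.pth", map_location="cpu")
model_sd = ckpt["model"]

# Keep only the final linear layer's parameters and strip the prefix so the keys become
# 'weight' and 'bias'. The 'fc.' prefix is an assumption about how that layer is named.
fc_sd = {k.replace("fc.", "", 1): v for k, v in model_sd.items() if k.startswith("fc.")}
torch.save(fc_sd, "fc_weights.pth")  # now loadable into a bare nn.Linear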
