Training instability with motorcycle #138
Hi @pallgeuer, thanks for the great questions! I think the divergence might be because GANs are unstable to train. There are several options you can try to increase stability:
Re volume subdivision: Unfortunately, we didn't include the volume subdivision code in this codebase. Our paper has ablation studies on removing volume subdivision (Table 2), and the released pre-trained model was trained without volume subdivision.
Re R1 regularization: The gamma we used for motorbikes is 80.
Re SDF regularization: Which regularization do you mean exactly? We do have one regularization in the paper (Eq. 2), and its hyperparameter is fixed in the code for all experiments.
Re single vs. two discriminators: We always use two discriminators in all experiments (except in the ablation studies that compare two discriminators against a single discriminator).
Re Adam beta: We apologize for the typo in the paper; the (0, 0.99) in the code is the correct one.
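To illustrate what the confirmed setting of beta1 = 0 means in practice, here is a minimal sketch of a scalar Adam update in plain Python. It is not the repository's optimizer code; the learning rate, epsilon, gradient values, and the helper names `adam_step` and `first_moments` are assumptions made for the example. With beta1 = 0 the first-moment estimate is just the current gradient, so the update carries no momentum, whereas beta1 = 0.9 averages over past gradients.

```python
import math


def adam_step(param, grad, m, v, step, lr=0.002, beta1=0.0, beta2=0.99, eps=1e-8):
    """One scalar Adam update; lr and eps are illustrative assumptions."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum term)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment
    m_hat = m / (1.0 - beta1 ** step)             # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def first_moments(grads, beta1):
    """Return the first-moment estimate after each step of a gradient sequence."""
    param, m, v = 0.0, 0.0, 0.0
    history = []
    for step, grad in enumerate(grads, start=1):
        param, m, v = adam_step(param, grad, m, v, step, beta1=beta1)
        history.append(m)
    return history


grads = [1.0, -3.0, 0.5]  # made-up gradients, for illustration only
no_momentum = first_moments(grads, beta1=0.0)    # code setting (0, 0.99)
with_momentum = first_moments(grads, beta1=0.9)  # value stated in the paper
print(no_momentum)
print(with_momentum)
```

In the beta1 = 0 case the first moment tracks each gradient exactly, including sign flips, which is the behavioral difference between the two settings discussed in this thread.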
Hi, many thanks for the detailed answer. My original diverging training runs used a batch size of 64, which is close to the most I can fit on 2xA100 (96 fits but starts showing symptoms of hitting the GPU memory limit). Is gradient accumulation by any chance implemented to allow larger batch sizes? Was the pretrained model trained with
Weirdly, a training run with gamma=40 instead of gamma=80 was the first run to make it to 6000kimg, and it is currently still converging.
I got the name "SDF regularization" from the paper:
But yes, this is exactly the loss contribution described in Eqns 2 & 3, as you said.
Okay, so this GitHub repo uses two discriminators by default when called with parameters like the ones I specified?
Was the choice of Adam betas just inherited from another project, or did initial tests with beta1 >= 0.5 show a detrimental effect? Was training with it unstable?
Has a learning rate scheduler that reduces the learning rate over time been tested?
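On the gradient accumulation question above: the thread does not say whether this codebase implements it, so the following is only a generic sketch of the idea in plain Python, using a toy one-parameter model and made-up data. The names `grad_mse` and `accumulated_grad` are hypothetical. It shows that, for a loss that is a mean over independent samples, summing micro-batch gradients scaled by the number of micro-batches reproduces the full-batch gradient.

```python
import math


def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate scaled micro-batch gradients before a single optimizer step."""
    assert len(batch) % micro_batch_size == 0, "equal-sized micro-batches assumed"
    num_micro = len(batch) // micro_batch_size
    total = 0.0
    for i in range(num_micro):
        micro = batch[i * micro_batch_size:(i + 1) * micro_batch_size]
        # Dividing by num_micro keeps the result a mean over the whole batch.
        total += grad_mse(w, micro) / num_micro
    return total


# Made-up data: 8 samples standing in for one large batch.
data = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]
full = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
print(full, accumulated, math.isclose(full, accumulated))
```

The equivalence holds only for per-sample losses; any layer or statistic that depends on the whole batch would see the smaller micro-batch instead, so a real GAN training loop may not behave identically to a true large-batch run.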
Hi, I ran into a similar issue when trying to use this code to train on Chair. The following are the FID scores during training:
I have also evaluated the checkpoint provided in https://drive.google.com/drive/folders/1oJ-FmyVYjIwBZKDAQ4N1EEcE9dJjumdW:
{"results": {"fid50k": 22.706035931177578}, "metric": "fid50k", "total_time": 1566.5149657726288, "total_time_str": "26m 07s", "num_gpus": 1, "snapshot_pkl": "weights/shapenet_chair.pt", "timestamp": 1693843925.6001537}
The best model I achieved is network-snapshot-001433.pkl, which gets:
{"results": {"fid50k": 28.708304000589685}, "metric": "fid50k", "total_time": 1631.6883997917175, "total_time_str": "27m 12s", "num_gpus": 1, "snapshot_pkl": "../../../results/00001-stylegan2-03001627-gpus8-batch32-gamma400/network-snapshot-001433.pt", "timestamp": 1693845837.4513173}
Is there any problem in my training setting?
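For comparing such metric records, a small Python sketch along these lines may help. It parses the two JSON results quoted above and reads the training settings back out of the run directory name. The helper `run_settings` and its regular expression are assumptions based only on the directory name pattern visible in this comment, not on any documented naming scheme.

```python
import json
import re

released = json.loads(
    '{"results": {"fid50k": 22.706035931177578}, "metric": "fid50k", '
    '"total_time": 1566.5149657726288, "total_time_str": "26m 07s", "num_gpus": 1, '
    '"snapshot_pkl": "weights/shapenet_chair.pt", "timestamp": 1693843925.6001537}'
)
retrained = json.loads(
    '{"results": {"fid50k": 28.708304000589685}, "metric": "fid50k", '
    '"total_time": 1631.6883997917175, "total_time_str": "27m 12s", "num_gpus": 1, '
    '"snapshot_pkl": "../../../results/00001-stylegan2-03001627-gpus8-batch32-gamma400/'
    'network-snapshot-001433.pt", "timestamp": 1693845837.4513173}'
)


def run_settings(snapshot_path):
    """Extract gpus/batch/gamma from a path like '...-gpus8-batch32-gamma400/...'.

    Hypothetical helper: the pattern is inferred from the path quoted in the thread.
    """
    match = re.search(r"gpus(\d+)-batch(\d+)-gamma(\d+)", snapshot_path)
    if match is None:
        return None
    gpus, batch, gamma = (int(group) for group in match.groups())
    return {"gpus": gpus, "batch": batch, "gamma": gamma}


fid_gap = retrained["results"]["fid50k"] - released["results"]["fid50k"]
settings = run_settings(retrained["snapshot_pkl"])
print(f"FID gap to released checkpoint: {fid_gap:.2f}")
print(f"Settings encoded in run directory: {settings}")
```

Reading the settings from the path makes it easy to check which gamma and batch size a given snapshot was actually trained with before comparing its FID against the released checkpoint.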
Hi, I am trying to use this code to train on the motorcycle data, but the training is proving to be unstable. I have done the Blender renders as described and have all 337 models with 96 renders per model. I train as follows:
This should be essentially the default documented training parameters except that I'm running on 2xA100 instead of 8xA100.
My issue is that the FID50k only decreases from ~250 initially to ~85 (more than the 50-65 expected from the paper), and at around 2000-3000kimg (out of the planned 20000kimg) the training diverges and never recovers. What parameters should I use so that your code on your data can actually finish training?
I would also be interested to know what the differences are between the code and training commands provided in this GitHub repository and the ones that were used to train the pretrained motorcycle model. For one, volume subdivision isn't implemented, but what else differs (e.g. R1 regularization, SDF regularization, single vs. two discriminators)? The paper also says Adam beta = 0.9, but the code uses (0, 0.99), which is puzzling.