Questions on Loss scale, Hyperparameters #4

Open

lifelongeek opened this issue Feb 6, 2017 · 4 comments
lifelongeek commented Feb 6, 2017

Thanks for sharing this easy-to-follow code.
I am currently applying WGAN to learning text distribution.

Here are my questions regarding WGAN.

Question 1. In Figure 3, the losses of the MLP and the DCGAN seem comparable. However, I think the scale of the loss can vary depending on the weight initialization scale and the model size. (Please correct me if I am wrong.) In that case, how should one compare the learning results of two different models?

Question 2. Could you share which hyperparameters WGAN is sensitive to?
For example: weight initialization scale (0.02), clamping threshold (0.01), batch size (64), model size, #D/G steps (5), learning rate (0.00005)
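For reference, here are the defaults I'm referring to, collected in one place (this is only an illustrative sketch in my own naming, not code from this repository):

```python
# Illustrative only: the default WGAN hyperparameters listed above, gathered
# into a single dict. The dict and key names are my own, not from main.py.
wgan_defaults = {
    "weight_init_std": 0.02,    # std of the normal init for conv weights
    "clamp": 0.01,              # critic weights clipped to [-0.01, 0.01] after each update
    "batch_size": 64,
    "d_steps_per_g_step": 5,    # the --Diters value
    "lr": 0.00005,              # RMSProp learning rate from the paper
}
```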

Thank you

@lifelongeek (Author)

Oops, the paper already addresses Question 1:

"However, we do not claim that this is a new method to quantitatively evaluate
generative models yet. The constant scaling factor that depends on the critic’s
architecture means it’s hard to compare models with different critics."


gwern commented Feb 6, 2017

Question 2. Could you share which hyperparameters WGAN is sensitive to? For example: weight initialization scale (0.02), clamping threshold (0.01), batch size (64), model size, #D/G steps (5), learning rate (0.00005)

I can comment a bit on this: @FeepingCreature and I have been trying out WGAN for modeling anime images (64px cropped faces specifically, because attempts at larger images and more diverse datasets failed totally even with increased learning rates/discriminator steps). So far we've found that batch size doesn't seem especially important; model size and image size are very important (64px works great but 128px struggles to get anywhere, and we've had better results enlarging the model while keeping it at 64px); learning rate is important, and going higher than the defaults doesn't seem to work well; and the #D/G steps or --Diters can be useful to tweak and definitely must be increased if the learning rate is increased.

We haven't tried changing the weight initialization or the clamping, but we have tried adding 4 fully connected layers to the generator (between the latent z vector input and the convolutional layers) and the discriminator (at the top, before the final state output) to try to encourage more global coherency. This currently seems very promising, but we haven't run any of the FC models to convergence yet, so maybe it won't wind up helping.

The Loss_D is reasonably helpful but hasn't turned out to be a panacea - there are long stretches where it bounces up and down despite the apparent image quality increasing. Overfitting thus far has not been a problem: expanding our face dataset, cleaning out non-faces (using a modified version of main.py to score image files with the discriminator and find & delete non-faces), and aggressive data augmentation have not helped - the WGANs heavily underfit.
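For anyone curious, a minimal sketch of the kind of critic-scoring pass described above, assuming a trained DCGAN-style critic `netD` and a directory of candidate images (the function name and arguments are my own, not the actual main.py modification):

```python
# Hypothetical sketch of scoring images with a trained WGAN critic so that
# low-scoring (likely non-face) files can be reviewed and deleted.
import os
import torch
import torchvision.transforms as transforms
from PIL import Image

@torch.no_grad()
def score_images(netD, image_dir, image_size=64, device="cpu"):
    """Return (path, critic_score) pairs, lowest-scoring files first."""
    tf = transforms.Compose([
        transforms.Resize(image_size),
        transforms.CenterCrop(image_size),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    netD.eval()
    scores = []
    for name in os.listdir(image_dir):
        path = os.path.join(image_dir, name)
        try:
            img = tf(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        except OSError:
            continue  # skip unreadable files
        # The critic outputs a scalar; higher means "more real" under the sign
        # convention used during training (this may be flipped depending on
        # how the loss was written).
        scores.append((path, netD(img).mean().item()))
    return sorted(scores, key=lambda s: s[1])
```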

Personally, I'm still wondering what it'll take to get unsupervised GANs to generate really diverse scenes on the level of StackGAN. I thought perhaps regular DCGANs could do it except that they diverged before learning; but while, very impressively, none of my WGANs have diverged (they've just plateaued and stopped learning), they're still limited to highly homogeneous image sets.

@LukasMosser

@gmkim90 I also have some experience here to share:
Increasing the learning rate required me to also set the Diters parameter higher, to 20 instead of 5.
That produced a curve somewhat similar to the one shown in the paper, but it flattens out at a much higher loss.

I have also observed improvements in image quality with no decrease in the discriminator loss, although I am still working out whether I need to increase Diters further.

Leaving the learning rate and Diters at their defaults never lowers my loss for the datasets I'm running (possibly due to not enough Diters).
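In case it helps make the discussion concrete, here is a stripped-down sketch of one WGAN training step showing what Diters controls; it mirrors the structure of main.py, but the function and argument names are my own, and netD/netG, the optimizers, and the data iterator are assumed to exist:

```python
# Sketch of a single WGAN outer step: Diters critic updates (with weight
# clipping) per generator update. Assumes data_iter yields image batches.
import torch

def train_step(netD, netG, optimizerD, optimizerG, data_iter,
               nz=100, Diters=5, clamp=0.01, device="cpu"):
    # 1) Update the critic Diters times per generator step (Diters >= 1).
    for _ in range(Diters):
        for p in netD.parameters():
            p.data.clamp_(-clamp, clamp)  # weight clipping (Lipschitz constraint)
        real = next(data_iter).to(device)
        noise = torch.randn(real.size(0), nz, 1, 1, device=device)
        fake = netG(noise).detach()
        # The critic maximizes D(real) - D(fake); we minimize the negative.
        lossD = -(netD(real).mean() - netD(fake).mean())
        optimizerD.zero_grad()
        lossD.backward()
        optimizerD.step()

    # 2) One generator update: maximize D(G(z)), i.e. minimize -D(G(z)).
    noise = torch.randn(real.size(0), nz, 1, 1, device=device)
    lossG = -netD(netG(noise)).mean()
    optimizerG.zero_grad()
    lossG.backward()
    optimizerG.step()
    return lossD.item(), lossG.item()
```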


Kaede93 commented Jul 25, 2020

I have similar questions, but they relate to the Wasserstein distance.

I trained WGAN, WGAN-GP, and WGAN-DIV on the CelebA dataset with a DCGAN architecture at 64x64 image size (with the default hyperparameters recommended in the papers).

  1. The WD in WGAN stays in the range [0, 2], while WGAN-GP and WGAN-DIV can reach values in the hundreds at the beginning of training and converge to the range [0, 10]. The images generated by WGAN-DIV are much better than those from WGAN (using the same noise), so why is the "distance" between fake and real so much higher in WGAN-DIV? (A sketch below illustrates why these scales aren't directly comparable.)

  2. I replaced the LSGAN loss in a CycleGAN model with the WGAN loss, and the WD is extremely small (about 1e-4) at the beginning of training and then goes to NaN (which means the gradients vanished?). So I am wondering whether the WGAN loss works in other GANs (besides DCGAN)?
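For question 1, here is a sketch of the WGAN-GP gradient penalty term that illustrates the scale issue: the value reported as the "Wasserstein distance" is D(real) - D(fake) for a critic whose Lipschitz constant is only controlled up to a constant, and that constant depends on whether weights are clipped or gradients are penalized, so absolute scales are not comparable across WGAN variants. Names like `netD` and `lambda_gp` here are my own assumptions:

```python
# Sketch of the WGAN-GP gradient penalty on interpolates between real and
# fake batches; the critic loss would be -(D(real) - D(fake)) + this penalty.
import torch

def gradient_penalty(netD, real, fake, lambda_gp=10.0):
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_interp = netD(interp)
    grads = torch.autograd.grad(outputs=d_interp, inputs=interp,
                                grad_outputs=torch.ones_like(d_interp),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    # Penalize deviation of the critic's gradient norm from 1.
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```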

Please contact me if you have any advice, thank you!
