Questions on Loss scale, Hyperparameters #4
Thanks for sharing this easy-to-follow code.

I am currently applying WGAN to learning a text distribution. Here are my questions regarding WGAN.

Question 1. In Figure 3, the losses of the MLP and the DCGAN look comparable. However, I think the scale of the loss can vary depending on the weight initialization scale and the model size. (Please correct me if I am wrong.) If so, what would be the right way to compare the learning results of two different models?

Question 2. Could you share which hyperparameters WGAN is sensitive to? For example: weight initialization scale (0.02), clamping threshold (0.01), batch size (64), model size, #D/G steps (5), lr (0.00005).

Thank you
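A minimal sketch of where each of the hyperparameters in Question 2 enters the standard WGAN training loop, in PyTorch. `netD`, `netG`, and `dataloader` are assumed to be defined elsewhere, and for brevity the same real batch is reused across critic steps (the reference code draws a fresh batch for each):

```python
import torch
import torch.optim as optim

lr, clamp, Diters, nz = 5e-5, 0.01, 5, 100    # paper defaults from Question 2
optimizerD = optim.RMSprop(netD.parameters(), lr=lr)  # hypothetical critic
optimizerG = optim.RMSprop(netG.parameters(), lr=lr)  # hypothetical generator

for real, _ in dataloader:
    # Train the critic Diters times per generator step.
    for _ in range(Diters):
        for p in netD.parameters():
            p.data.clamp_(-clamp, clamp)      # weight clipping (Lipschitz constraint)
        optimizerD.zero_grad()
        fake = netG(torch.randn(real.size(0), nz, 1, 1)).detach()
        # The critic maximizes E[D(real)] - E[D(fake)], so minimize the negative.
        lossD = -(netD(real).mean() - netD(fake).mean())
        lossD.backward()
        optimizerD.step()
    # One generator step: minimize -E[D(G(z))].
    optimizerG.zero_grad()
    lossG = -netD(netG(torch.randn(real.size(0), nz, 1, 1))).mean()
    lossG.backward()
    optimizerG.step()
```

Note that `-lossD` is the critic's estimate of the Wasserstein distance, the quantity the paper's training curves plot; its scale depends on the critic's capacity and clipping range, which is exactly the concern raised in Question 1.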
Comments

Oops, the paper already touches on Question 1: "However, we do not claim that this is a new method to quantitatively evaluate …"
I can comment a bit on this: @FeepingCreature and I have been trying out WGAN for modeling anime images (specifically 64px cropped faces, since attempts at larger images and more diverse datasets failed totally even with increased learning rates / discriminator steps). So far we've found that batch size doesn't seem especially important; model size and image size are very important (64px works great, but 128px struggles to get anywhere, and we've had better results enlarging the model while keeping it at 64px); learning rate is important, and higher than the defaults doesn't seem to work well; and #D/G steps or …

Personally, I'm still wondering what it'll take to get unsupervised GANs to generate really diverse scenes on the level of StackGAN. I thought perhaps regular DCGANs could do it, except that they diverged before learning; but while, very impressively, none of my WGANs have diverged (they've just plateaued and stopped learning), they're still limited to highly homogeneous image sets.
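On "enlarging the model while keeping it at 64px": in this repo that amounts to raising the `ngf`/`ndf` feature-map widths (the `main.py` defaults are 64). A hypothetical sketch using this repo's `models/dcgan.py`; check the constructor signature against your checkout:

```python
import models.dcgan as dcgan

# Widen the generator and critic at 64px instead of moving to 128px.
# Positional arguments assumed here: isize, nz, nc, ngf-or-ndf, ngpu.
netG = dcgan.DCGAN_G(64, 100, 3, 128, 1)
netD = dcgan.DCGAN_D(64, 100, 3, 128, 1)
```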
@gmkim90 I also have some experience here to share: I have likewise observed improvements in image quality with no decrease in the discriminator loss, although I am still testing whether it is necessary for me to increase Diters further. Leaving the learning rate and Diters at their defaults never lowers my loss on the datasets I'm running (which may be due to not enough Diters).
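On Diters: note that the reference `main.py` does not keep it fixed. The critic is trained much longer early on and periodically thereafter, so its loss stays a reliable Wasserstein estimate before the generator chases it. Roughly (check against your copy):

```python
# gen_iterations counts generator updates; Diters is recomputed each outer loop.
if gen_iterations < 25 or gen_iterations % 500 == 0:
    Diters = 100    # long critic warm-up, repeated periodically
else:
    Diters = 5      # the --Diters default
```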
I have similar questions, but mine concern the Wasserstein distance itself. I trained WGAN, WGAN-GP, and WGAN-div with a DCGAN architecture on the CelebA dataset at 64×64 image size (with the default hyperparameters recommended in the papers).
Please contact me if you have any advice, thank you!
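Since WGAN-GP came up: there the weight clipping above is replaced by a gradient penalty on random interpolates between real and fake samples. A minimal PyTorch sketch, assuming a critic `netD` that returns per-sample scores and detached `real`/`fake` batches; `lambda_gp = 10` is the value suggested in the WGAN-GP paper:

```python
import torch

def gradient_penalty(netD, real, fake, lambda_gp=10.0):
    """Penalize (||grad_x D(x_hat)||_2 - 1)^2 at points between real and fake."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = netD(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    return lambda_gp * ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
```

This term is added to the critic loss in place of the `clamp_` calls, so the loss scales of the three variants won't generally be directly comparable.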