Unable to train to convergence (small dataset) #9
I'm noticing the same on my own dataset of ~175k text-image pairs, so maybe it's not a dataset size issue (or maybe ~200k also isn't enough)? To add my own experience trying to resolve the NaN issue: I originally trained with a learning rate of 1e-4 and a training batch size of 32. Training would run for about 8 hours or so on my 3090, but eventually NaNs would consistently take over, no matter how far I rolled back to a prior checkpoint. I also noticed that the loss would swing dramatically while training (+/- 3.0 between consecutive iterations), so it was hard to tell whether it was actually trending in any direction. I've tried a few different things so far to stabilize it.
I'll update here if I manage to get things working well. I think I have the learning rate too low now, but I'll give it some time to see how things play out.
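In case it's useful for comparison, here's a rough sketch of what my training step looks like now. It assumes `clip` is an already-constructed x-clip `CLIP` model called with `return_loss = True`; the 1e-5 learning rate and 0.5 clipping threshold are just the values I'm currently experimenting with, not a known-good recipe.

```python
import torch
from torch.optim import AdamW

# Assumes `clip` is an x_clip.CLIP instance already on the GPU.
opt = AdamW(clip.parameters(), lr = 1e-5)  # dropped from 1e-4 after the NaN blowups

def train_step(text, images, max_norm = 0.5):
    # text: (batch, seq_len) token ids, images: (batch, 3, image_size, image_size)
    loss = clip(text, images, return_loss = True)

    # Skip the update entirely if the loss is already non-finite,
    # instead of letting one bad batch poison the weights.
    if not torch.isfinite(loss):
        opt.zero_grad(set_to_none = True)
        return None

    loss.backward()
    torch.nn.utils.clip_grad_norm_(clip.parameters(), max_norm)
    opt.step()
    opt.zero_grad(set_to_none = True)
    return loss.item()
```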
Beautiful. Will kick off training tonight and get back to you when I have results.
Yes, definitely interested. Please share how it goes.
@lucidrains Almost identical training loss behavior to before. I have no strong intuition as to why. I'm gonna do some deeper logging of the various losses in the CLIP module and backtrack from there. Will update you when I know more.
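For anyone following along, the "deeper logging" I have in mind is nothing fancy: PyTorch's anomaly mode plus a quick check of which parameters' gradients go non-finite first. The function name here is just illustrative.

```python
import torch

# Anomaly mode makes the backward pass raise at the op that produced a NaN/Inf.
# It's slow, so only enable it for debugging runs.
torch.autograd.set_detect_anomaly(True)

def report_bad_grads(model):
    # Call right after loss.backward() to see which parameters blew up first.
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    if bad:
        print(f"non-finite grads in {len(bad)} params, e.g. {bad[:5]}")
    return bad
```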
My attempt with the larger batch size is still going without any NaNs so far, about 62 hours of training on my 3090. Currently the loss is hovering around -0.21, so it's possible they're just around the corner; I'll let you know if and when they show up. Still, I'd recommend trying your training with a similar batch size if you haven't already. You may need a batch size larger than 64, though, because I'm also passing in the augmentations for each image; my effective batch size would be 64*32=2048 with the augments.
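To clarify what I mean by "passing in the augmentations": I generate extra views of each image with torchvision and hand them to x-clip's multiview path. Note that I'm writing the keyword as `aug_image` from memory of the README's multiview example, so double-check the exact argument name against the version you're running.

```python
import torch
from torchvision import transforms as T

NUM_AUGS = 32  # 64 images * 32 views each = 2048 image embeddings per step

augment = T.Compose([
    T.RandomResizedCrop(256, scale = (0.6, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
])

def make_views(images):
    # images: (batch, 3, 256, 256) -> tuple of NUM_AUGS augmented copies of the batch.
    # (Applied batch-wise here for brevity; per-image augmentation in the dataloader
    # works the same way.)
    return tuple(augment(images) for _ in range(NUM_AUGS))

# loss = clip(text, images, aug_image = make_views(images), return_loss = True)
```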
@Netruk44 Thanks for the update. I've tried larger batch sizes (e.g. 384) and hit the same issue. The fact that the loss goes negative is still concerning. Still curious where you end up once training finishes, though. @lucidrains
@jacobwjs Unfortunately, my machine power-cycled itself for some reason, so training on my x-clip model has stopped for now. I wanted to test out lucidrains's imagen model with my text-image dataset anyway. I might come back to this, though.

For this model, as far as I could tell the loss never converged. It continued down below -1.0, but I never saw a NaN. Previous models would occasionally hit a NaN loss during training before eventually reaching the point where I couldn't train at all anymore because of the NaNs. So if the 384 batch size didn't help you, maybe it was the augmentations or the 1e-5 learning rate that stabilized things for me. I also set

Either way, it looks like you've found settings that'll let you train the model. I'd be curious to know how well it works for you.
Thanks for following up. Bummer on the lost run, but hopefully you have some checkpoints for when/if you come back to training. I assumed most would be interested in getting imagen up and running. Good luck with it, and I'll be following along there as well. I'll tweak a few more parts of x-clip before kicking off "real" training. Will get those results here eventually. Until then...
Hi, nice work with x-clip. Hoping to play around with it and eventually combine it into your DALLE2 work.

Currently having some trouble training on roughly 30k image-text pairs. The loss eventually goes negative and starts producing NaNs. I've dropped the learning rate (to 1e-4) and I'm clipping gradients (max_norm=0.5).

Any thoughts on sane training params/configs for such a small dataset with x-clip?
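For concreteness, here's a minimal version of how I'm constructing and calling the model, along the lines of the x-clip README example; the dimensions and vocab size below are placeholders rather than my actual config.

```python
import torch
from x_clip import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,   # placeholder -- should match your tokenizer's vocab size
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
).cuda()

# Dummy batch just to show the call signature.
text = torch.randint(0, 10000, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

loss = clip(text, images, return_loss = True)
loss.backward()
```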