Unable to train to convergence (small dataset) #9

jacobwjs · 2022-05-23T07:03:02Z

Hi nice work with x-clip. Hoping to play around with it and eventually combine it into your DALLE2 work.

Currently having some trouble training on roughly 30k image-text pairs. Loss eventually goes negative and starts producing Nan's. I've dropped learning rate down (1e-4) and I'm clipping gradients (max_norm=0.5).

Any thoughts on what are sane training params/configs on such a small dataset using x-clip?

Netruk44 · 2022-05-23T23:18:22Z

I'm noticing the same on my own dataset of ~175k text-image pairs, so maybe it's not a dataset size issue (or 200k is also not enough)?

To add my own data of what I've tried so far to resolve the NaN issue:

I originally tried training with a learning rate of 1e-4 and a training batch size of 32. Training would run for about 8 hours or so on my 3090, but eventually NaNs would consistently take over, no matter how much I rolled back to a prior checkpoint. I noticed that the loss value would also have dramatic swings while training (+/- 3.0 between consecutive training iterations, it was hard to tell if the loss was actually trending any direction).

I've tried the following so far:

Set is_multiview=true on the constructor, & supplying aug_text and aug_image while training
- I generated 32 augmentations for both image and text before passing it into the model.
  - I don't think it's documented anywhere, so if you want to do this yourself, aug_text and aug_image should be a tensor with the shape [augment_count, batch_size, ... ]. Alternatively you can pass a list of tensors the same shape as text or image, instead of a single tensor.
- This required using a different tokenizer since the SimpleTokenizer was slowing down training at this scale. (I wound up using huggingface's tokenizers, which was very straightforward to train and integrate into x-clip)
- To keep everything in memory on my gpu, I had to drop the training batch size down to 16.
- This didn't seem to help training stability very much, but I did start seeing the model produce some early results. It seemed to be able to identify some of the most common tokens in my training images. However, I didn't test things very thoroughly at this point, so I may have just have picked a good couple of images to manually inspect the model on.
- Unfortunately, the model never really progressed past that. Performance started to degrade, and then NaN loss values started happening again.
Increasing the batch size to 64 (+ 32 augmentations for each image and text), dropping learning rate to 1e-5
- This is my current attempt I'm training.
- I had to implement gradient accumulation for the training, since there was no hope of keeping everything in memory anymore.
- This seems to have increased the stability of the loss value, there's no dramatic swings anymore (it tends to keep to a range of about +/- 0.25), and the loss has been steadily trending towards 0.
- Results from this model have not been as good, at least not so far. I'm guessing that's because of the lowered learning rate.

I can update here if I manage to get things working well. I think I have the learning rate too low now, but I'll give it some time to see if things work out.

lucidrains · 2022-05-24T02:27:02Z

@jacobwjs hi Jacob! thanks for trying to train this! I've added numerically stable attention and some extra layernorms (recommended by a recent paper) to see if that helps with stability 9d9bae0 do let me know if it does or does not work, and we can try some other avenues

jacobwjs · 2022-05-24T06:13:23Z

@jacobwjs hi Jacob! thanks for trying to train this! I've added numerically stable attention and some extra layernorms (recommended by a recent paper) to see if that helps with stability 9d9bae0 do let me know if it does or does not work, and we can try some other avenues

Beautiful. Will kick off training tonight and get back to you when I have results.

jacobwjs · 2022-05-24T06:15:45Z

I'm noticing the same on my own dataset of ~175k text-image pairs, so maybe it's not a dataset size issue (or 200k is also not enough)?

To add my own data of what I've tried so far to resolve the NaN issue:

I originally tried training with a learning rate of 1e-4 and a training batch size of 32. Training would run for about 8 hours or so on my 3090, but eventually NaNs would consistently take over, no matter how much I rolled back to a prior checkpoint. I noticed that the loss value would also have dramatic swings while training (+/- 3.0 between consecutive training iterations, it was hard to tell if the loss was actually trending any direction).

I've tried the following so far:

* Set `is_multiview=true` on the constructor, & supplying aug_text and aug_image while training
  
  * I generated 32 augmentations for both image and text before passing it into the model.
    
    * I don't think it's documented anywhere, so if you want to do this yourself, aug_text and aug_image should be a tensor with the shape `[augment_count, batch_size, ... ]`. Alternatively you can pass a list of tensors the same shape as `text` or `image`, instead of a single tensor.
  * This required using a different tokenizer since the SimpleTokenizer was slowing down training at this scale. (I wound up using [huggingface's tokenizers](https://github.com/huggingface/tokenizers), which was very straightforward to train and integrate into x-clip)
  * To keep everything in memory on my gpu, I had to drop the training batch size down to 16.
  * This didn't seem to help training stability very much, but I did start seeing the model produce some early results. It seemed to be able to identify some of the most common tokens in my training images. However, I didn't test things very thoroughly at this point, so I may have just have picked a good couple of images to manually inspect the model on.
  * Unfortunately, the model never really progressed past that. Performance started to degrade, and then NaN loss values started happening again.

* Increasing the batch size to 64 (+ 32 augmentations for each image and text), dropping learning rate to 1e-5
  
  * This is my current attempt I'm training.
  * I had to implement gradient accumulation for the training, since there was no hope of keeping everything in memory anymore.
  * This seems to have increased the stability of the loss value, there's no dramatic swings anymore (it tends to keep to a range of about +/- 0.25), and the loss has been steadily trending towards 0.
  * Results from this model have not been as good, at least not so far. I'm guessing that's because of the lowered learning rate.

I can update here if I manage to get things working well. I think I have the learning rate too low now, but I'll give it some time to see if things work out.

Yes definitely interested. Please share how it goes.

jacobwjs · 2022-05-24T13:09:53Z

@lucidrains Almost identical training loss behavior as before. I have no strong intuition on why.

I'm gonna do some deeper logging of various losses in the CLIP module and backtrack from there. Will update you when I know more.

Netruk44 · 2022-05-25T16:27:37Z

My attempt with the larger batch size is still going without any NaNs so far in about 62 hours of training on my 3090. Currently the loss is hovering around -0.21, so it's possible they're just around the corner.

I'll let you know if and when they show up, but I would possibly recommend trying your training with a similar batch size if you haven't already. It's possible you may need a larger batch size than 64, because I'm also passing in the augmentations for each image. I guess my effective batch size would be 64*32=2048 with the augments.

jacobwjs · 2022-05-26T01:37:28Z

@Netruk44 Thanks for the update. I've tried larger batch sizes (e.g. 384) and same issue. The fact that the loss is negative is still concerning. Still curious where you end up once training finishes however.

@lucidrains
Everything converges when I turn off "decoupled_contrastive_loss". This seems to be the culprit.

Netruk44 · 2022-05-27T20:12:14Z

@jacobwjs Unfortunately, my machine power-cycled itself for some reason, so training on my x-clip model has stopped for now. I wanted to test out lucidrains's imagen model with my text-image dataset. I might come back to this, though.

For this model, as far as I could tell the loss never converged. It continued down below -1.0, but I never saw a NaN. Previous models would occasionally see a NaN loss while training before it got to the point where I couldn't train the model at all anymore due to the NaNs. So if the 384 batch size didn't help you, maybe the augmentations or the 1e-5 learning rate helped to stabilize things for me. I also set text_rotary_pos_emb = True, but I don't think that affected the convergence much.

Either way, it looks like you've found settings that'll let you train the model. I'd be curious to know how well the model works for you.

jacobwjs · 2022-05-29T07:43:47Z

@jacobwjs Unfortunately, my machine power-cycled itself for some reason, so training on my x-clip model has stopped for now. I wanted to test out lucidrains's imagen model with my text-image dataset. I might come back to this, though.

For this model, as far as I could tell the loss never converged. It continued down below -1.0, but I never saw a NaN. Previous models would occasionally see a NaN loss while training before it got to the point where I couldn't train the model at all anymore due to the NaNs. So if the 384 batch size didn't help you, maybe the augmentations or the 1e-5 learning rate helped to stabilize things for me. I also set text_rotary_pos_emb = True, but I don't think that affected the convergence much.

Either way, it looks like you've found settings that'll let you train the model. I'd be curious to know how well the model works for you.

Thanks for following up. Bummer on the lost run, but hopefully you have some checkpoints for when/if you come back to training.

I assumed most would be interested in getting imagen up and running. Good luck with it, and I'll be following along there as well.

I'll tweak a few more parts of x-clip before kicking off "real" training. Will get those results here eventually. Until then...

huutuongtu mentioned this issue Jan 18, 2024

Loss NaN when training CLIP lucidrains/DALLE2-pytorch#310

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to train to convergence (small dataset) #9

Unable to train to convergence (small dataset) #9

jacobwjs commented May 23, 2022

Netruk44 commented May 23, 2022 •

edited

Loading

lucidrains commented May 24, 2022

jacobwjs commented May 24, 2022

jacobwjs commented May 24, 2022

jacobwjs commented May 24, 2022

Netruk44 commented May 25, 2022

jacobwjs commented May 26, 2022

Netruk44 commented May 27, 2022 •

edited

Loading

jacobwjs commented May 29, 2022

Unable to train to convergence (small dataset) #9

Unable to train to convergence (small dataset) #9

Comments

jacobwjs commented May 23, 2022

Netruk44 commented May 23, 2022 • edited Loading

lucidrains commented May 24, 2022

jacobwjs commented May 24, 2022

jacobwjs commented May 24, 2022

jacobwjs commented May 24, 2022

Netruk44 commented May 25, 2022

jacobwjs commented May 26, 2022

Netruk44 commented May 27, 2022 • edited Loading

jacobwjs commented May 29, 2022

Netruk44 commented May 23, 2022 •

edited

Loading

Netruk44 commented May 27, 2022 •

edited

Loading