RuntimeError: CUDA error: out of memory when training the model for too long. #13

safranchik · 2022-08-30T20:21:36Z

There appears to be a memory leak in the validation routine: the GPU I'm using (A40, 48 GB VRAM) runs out of memory when training on the predict flu task for 125 epochs.


TheMikeMerrill · 2022-09-15T18:54:20Z

We discussed this on Slack, but I just wanted to confirm that this is a known bug. I believe there's a memory leak somewhere in on_train_epoch_end, but I haven't been able to find it.

TheMikeMerrill · 2022-10-27T00:10:26Z

This isn't a complete solution, but I think the memory leak is happening in the metric bootstrapping during the validation loops. I can't pinpoint exactly where it occurs, but setting --model.val_bootstraps=0 after this commit (370d41f) should stop the leak.
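
For anyone who wants to check whether the workaround helps on their setup, here is a rough sketch of how the leak could be confirmed from logged memory readings. It is not code from this repo: the helper name leak_rate_mib_per_epoch and the warmup parameter are made up for illustration, and it assumes you record one reading per epoch yourself (for example with torch.cuda.memory_allocated() at the end of each epoch).

```python
from typing import Optional, Sequence


def leak_rate_mib_per_epoch(
    readings_bytes: Sequence[int], warmup: int = 2
) -> Optional[float]:
    """Estimate steady memory growth from one reading per epoch.

    Hypothetical helper: readings are assumed to be byte counts logged
    once per epoch, e.g. from torch.cuda.memory_allocated().
    Returns the mean growth in MiB per epoch if memory increased on
    every epoch after the warmup epochs, otherwise None.
    """
    # Skip the first epochs, where allocator caching inflates the numbers.
    steady = list(readings_bytes[warmup:])
    if len(steady) < 3:
        return None  # too few points to call it a trend

    # Epoch-to-epoch differences; a leak shows up as growth on every step.
    deltas = [later - earlier for earlier, later in zip(steady, steady[1:])]
    if all(delta > 0 for delta in deltas):
        return sum(deltas) / len(deltas) / 2**20
    return None


if __name__ == "__main__":
    MIB = 2**20
    # Made-up readings for illustration only, not measurements from this issue.
    example = [100 * MIB, 500 * MIB, 600 * MIB, 700 * MIB, 800 * MIB]
    print(leak_rate_mib_per_epoch(example))
```

Logging the readings for one run with the default settings and one with --model.val_bootstraps=0, then comparing the two results, should show whether the bootstrapping is the source of the growth. A result of None for the second run would be consistent with the leak being gone, though a short run may simply have too few epochs to tell.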
