When I try to finetune GPT-2 355M (because GPT-Neo 350M is broken), no matter what I do I always get a CUDA OOM error: not with a T4 (and no, I'm not planning on getting Colab Pro just to play with a silly text-gen AI), not with fp16 (which doesn't work either, by the way), and not even with `gradient_checkpointing=True`.
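For reference, this is roughly the setup I mean (assuming a Hugging Face `transformers` Trainer, since that's where `gradient_checkpointing=True` comes from; dataset loading and `output_dir` are placeholders). It stacks every memory trick I know of and still OOMs:

```python
# Minimal sketch: micro-batch of 1, gradient accumulation, fp16,
# and gradient checkpointing to minimize peak VRAM during training.
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")  # the 355M checkpoint
model.gradient_checkpointing_enable()   # recompute activations: compute for memory
model.config.use_cache = False          # the KV cache is useless during training

args = TrainingArguments(
    output_dir="out",                   # placeholder
    per_device_train_batch_size=1,      # smallest possible micro-batch
    gradient_accumulation_steps=8,      # keep a usable effective batch size
    fp16=True,                          # half-precision activations/gradients
)

# trainer = Trainer(model=model, args=args, train_dataset=...)  # dataset not shown
# trainer.train()
```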
What am I supposed to do, create VRAM out of thin air? Can't there be something that caps VRAM usage and empties it afterwards? I find it astonishing that the only way to empty the VRAM seems to be a factory reset of the VM (it's VRAM, not disk storage). As it stands you either can't train at all, lose half your progress, or have to factory-reset because the VRAM is still full from the previous failed training attempt.
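For anyone landing here: assuming a PyTorch-based script and a kernel that survived the OOM, you can usually reclaim VRAM in-process by dropping the references and emptying the allocator cache, without resetting the VM:

```python
# Hedged sketch of reclaiming GPU memory after a failed run (requires a CUDA GPU).
import gc
import torch

# Stand-in for whatever big objects the failed run left behind:
leftover = torch.nn.Linear(4096, 4096).cuda()

del leftover                 # drop the Python references first
gc.collect()                 # collect anything kept alive by reference cycles
torch.cuda.empty_cache()     # hand the cached blocks back to the CUDA driver

print(f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB still allocated")
```

If the kernel itself died, restarting the runtime (Runtime → Restart runtime in Colab) also frees the GPU; a full factory reset shouldn't be necessary just to clear VRAM.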
Any potentially useful help is appreciated. Sorry if I was a little rude; I'm just tired of stuff on GitHub being more broken than a chair that has been set on fire.
This issue is most likely not due to the script; it has been reproduced in three different scripts this year. I reported it to Google and I hope they take a look at it. But yes, I've faced the same issue you're having. Personally, I don't see this getting fixed soon: it looks like Google is allocating fewer high-performance GPUs to free-tier users, which would account for more crashes when training bigger models. For the time being, consider training the 124M model on a GPU, or bigger models on a TPU, which is slower but at least it works. I don't recommend another service such as Gradient: their free tier won't get you anywhere with even the smallest model. Please consider wording this differently next time, though: there is a human on the other side.
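If it helps, with the Hugging Face checkpoints the 124M fallback is just a different model name (a hedged sketch; `"gpt2"` on the hub is the 124M checkpoint, `"gpt2-medium"` is the 355M one):

```python
# Load the 124M checkpoint instead of the 355M one.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # 124M parameters
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```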