GPU memory allocated increase during finetuning #792
Comments
Quick question @rong-hash: would changing "gradient_accumulation_steps 4" change the behavior for you? cc: @mreso
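For context on that flag, here is a minimal, self-contained sketch of what gradient accumulation does (toy model and illustrative values; this is not the llama-recipes training loop): gradients from several small micro-batches are accumulated before each optimizer step, so only one micro-batch's activations live in GPU memory at a time while the effective batch size stays `micro_batch_size * accumulation_steps`.

```python
import torch
from torch import nn

# Toy sketch of gradient accumulation (illustrative model and values, not the
# llama-recipes training loop).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accumulation_steps = 4  # plays the role of gradient_accumulation_steps

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(2, 16, device=device)       # small micro-batch
    loss = model(x).sum() / accumulation_steps  # scale so accumulated grads average
    loss.backward()
    if (step + 1) % accumulation_steps == 0:    # one optimizer step per effective batch
        optimizer.step()
        optimizer.zero_grad()
```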
Hi @rong-hash, you're using padding, which means that samples are first bucketed together with respect to their length. This is to minimize excessive padding when short and long sequences would otherwise be batched together. Because of the different sequence lengths of the batches, I suspect you're seeing jumps in memory usage whenever batches with longer sequences are processed. The OOM then occurs when an even longer sequence length is processed. Have you tried reducing the batch size?
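To make the bucketing effect concrete, here is a simplified sketch (illustrative only, not the actual llama-recipes sampler implementation) of why per-batch memory depends on the longest sequence in each bucket:

```python
# Simplified sketch of length-based bucketing: samples are ordered by length
# so each batch only pads up to the longest sequence within that batch. Peak
# activation memory therefore tracks the per-batch maximum length and jumps
# whenever a bucket of longer sequences comes up.
def length_bucketed_batches(lengths, batch_size):
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        padded_len = max(lengths[i] for i in batch)  # every sample padded to this
        yield batch, padded_len

if __name__ == "__main__":
    lengths = [32, 512, 40, 2048, 64, 700, 36, 48]  # hypothetical token counts
    for indices, padded_len in length_bucketed_batches(lengths, batch_size=2):
        print(f"batch {indices} padded to {padded_len} tokens")
```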
Hi @mreso @HamidShojanazeri, I tried your solutions but still get the same problem. I set the argument
The jumps will come from the different sequence lengths. The first time a sequence is longer than all the ones before it, more memory will be allocated to fit the intermediate tensors. Memory that has been allocated will usually not be freed unless you explicitly tell PyTorch to do so, even if subsequent samples are shorter and require less memory. Are you still seeing OOMs with bs=1?
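To see this in practice, a small logging helper around a training step makes the jumps visible (a sketch assuming a CUDA device; the training step itself is elided):

```python
import torch

# torch.cuda.memory_allocated() tracks live tensors, memory_reserved() tracks
# what the caching allocator is holding on to, and max_memory_allocated()
# records the peak. empty_cache() only returns cached, currently-unused blocks
# to the driver; it cannot shrink what live tensors require.
def log_memory(tag: str) -> None:
    gib = 2 ** 30
    print(
        f"{tag}: allocated={torch.cuda.memory_allocated() / gib:.2f} GiB "
        f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB "
        f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB"
    )

torch.cuda.reset_peak_memory_stats()
log_memory("before step")
# ... run one training step here ...
log_memory("after step")
torch.cuda.empty_cache()  # releases cached blocks, not memory held by live tensors
log_memory("after empty_cache")
```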
@mreso Yes, OOM still occurs with bs=1, even at the same step number.
Here's the latest script:
System Info
PyTorch version: 2.4.1+cu124
CUDA version: 12.7
GPU: 1× A100 80GB
Information
🐛 Describe the bug
My script is:
Everything goes well except for the allocated GPU memory: I found that the allocated memory increases at certain specific steps, which seems abnormal. The allocated-memory plot is shown below.
Eventually, this causes an OOM error.
Error logs
Expected behavior
I believe the expected behavior is that the allocated GPU memory stays stable during training.