
huge model training supermemory problem #17

Open
Airliin opened this issue Oct 24, 2023 · 3 comments

Comments


Airliin commented Oct 24, 2023

Hello!
I used a huge model for fine-tuning. I have 80 GB of GPU memory, yet training still fails with an error about exceeding GPU memory, even though the peak GPU memory usage I observed was only about 30 GB. How can I solve this problem? Thank you!
RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 44.56 GiB total capacity; 41.63 GiB already allocated; 217.56 MiB free; 42.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Exception in thread Thread-6:
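The error text itself points at allocator fragmentation (reserved memory much larger than allocated memory) and suggests tuning max_split_size_mb. Below is a minimal sketch of setting that option from Python; it assumes the variable is applied before the first CUDA allocation, and the 128 MB split size is only illustrative, not a tuned recommendation.

```python
import os

# Must be set before the first CUDA allocation, so do it before importing torch.
# max_split_size_mb:128 is an illustrative value, not a tuned recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable so the caching allocator picks it up

print(torch.cuda.is_available())
```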

@Ziyan-Huang (Collaborator)

Hello @Airliin

I'm not quite sure about the specific reason. Could you please provide more details on the patch size and batch size you are using during training?


Airliin commented Oct 25, 2023

Hello @Ziyan-Huang. Thank you for your response. Patch size: [96, 128, 128]; batch size: 2.
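For reference, here is a minimal sketch of measuring the peak memory of one forward/backward pass at that patch size and batch size. The small Conv3d stack is a hypothetical stand-in; the actual fine-tuned network would be loaded in its place.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in network; load the real fine-tuned model here instead.
model = nn.Sequential(
    nn.Conv3d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv3d(32, 1, kernel_size=3, padding=1),
).cuda()

# Batch size 2 and patch size [96, 128, 128], as reported in this thread.
x = torch.randn(2, 1, 96, 128, 128, device="cuda")

torch.cuda.reset_peak_memory_stats()
loss = model(x).mean()
loss.backward()

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.2f} GiB")
```

Comparing the "allocated" and "reserved" numbers against what an external monitor reports can indicate whether the gap comes from fragmentation inside PyTorch's caching allocator.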

@Ziyan-Huang (Collaborator)

Thank you for providing the details, @Airliin. Based on your settings, it seems that the training should be able to run smoothly on a GPU with 80 GB of VRAM. Unfortunately, I don't have additional suggestions at the moment.
