
Out of memory: RTX 4090 / 24GB #16

Open
boxabirds opened this issue Mar 21, 2024 · 21 comments

@boxabirds
Contributor

Hi, no matter what movie size I choose (e.g. 5 fps, 640x480), I get the error below. nvtop shows that the webUI triggers a pre-allocation of 21.5 GB, but then ... it's not used?

/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  0%| | 0/15 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/gradio/queueing.py", line 388, in call_prediction
    output = await route_utils.call_process_api(
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/gradio/route_utils.py", line 219, in call_process_api
    output = await app.get_blocks().process_api(
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/gradio/blocks.py", line 1437, in process_api
    result = await self.call_function(
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/gradio/blocks.py", line 1109, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/gradio/utils.py", line 650, in wrapper
    response = f(*args, **kwargs)
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/julian/sambashare/expts/FRESCO/webUI.py", line 159, in process
    keypath = process1(*args)
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/julian/sambashare/expts/FRESCO/webUI.py", line 280, in process1
    latents = inference(global_state.pipe, global_state.controlnet, global_state.frescoProc,
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/julian/sambashare/expts/FRESCO/src/pipe_FRESCO.py", line 201, in inference
    noise_pred = pipe.unet(
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/julian/sambashare/expts/FRESCO/src/diffusion_hacked.py", line 776, in forward
    sample = optimize_feature(sample, flows, occs, correlation_matrix,
  File "/home/julian/sambashare/expts/FRESCO/src/diffusion_hacked.py", line 485, in optimize_feature
    optimizer.step(closure)
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/optim/adam.py", line 143, in step
    loss = closure()
  File "/home/julian/sambashare/expts/FRESCO/src/diffusion_hacked.py", line 478, in closure
    loss.backward()
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/julian/.local/share/virtualenvs/FRESCO-jUStlEeO/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.06 GiB. GPU 0 has a total capacty of 23.65 GiB of which 525.94 MiB is free. Process 15481 has 1.27 GiB memory in use. Including non-PyTorch memory, this process has 21.00 GiB memory in use. Of the allocated memory 20.11 GiB is allocated by PyTorch, and 416.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
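The last line of the message itself suggests trying max_split_size_mb. For reference, a minimal sketch of how that could be set (the value 128 is purely illustrative, and the variable must be set before CUDA is first initialized, e.g. before launching webUI.py):

# Hedged sketch: the allocator hint from the error message above.
# "128" is illustrative; equivalently, run
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python webUI.py
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
import torch  # import torch / start the pipeline only after the variable is set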

@williamyang1991
Owner

You can set a smaller batch size to avoid OOM.

batch_size: 8

Our method optimizes the features during DDPM sampling, so memory usage reaches its peak when the optimization is applied.

@boxabirds
Contributor Author

I tried with batch_size: 4 and then 2 and it made no difference 🤔

I don't think that's it: it says it is trying to allocate 112 MB, the GPU has a capacity of 23.65 GiB but only 106 MB is free, and 20.85 GiB is allocated by PyTorch. But for what, I wonder?
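One way to see what that memory is actually holding would be to dump the allocator state right before the failing step; a rough sketch using standard PyTorch calls (exactly where to place it in webUI.py is left open):

import torch
# Per-pool allocator statistics: active vs. reserved memory, segment counts, etc.
print(torch.cuda.memory_summary(device=0, abbreviated=True))
# Peak allocation so far, in GiB (resettable via torch.cuda.reset_peak_memory_stats()).
print(f"peak allocated: {torch.cuda.max_memory_allocated(0) / 2**30:.2f} GiB")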

@boxabirds
Contributor Author

What GPUs did you do your work on? Might it simply be that there is a minimum GPU memory requirement of 40 GB or something?

@jinwyp
Contributor

jinwyp commented Mar 21, 2024

There is a bug with batch_size: 4.
Please pull the latest code.

#6

@boxabirds
Contributor Author

I get the same error when batch size is 2 as well though …?

@efwfe
Contributor

efwfe commented Mar 21, 2024

A10G 24G works fine with batch size = 8

@boxabirds
Contributor Author

boxabirds commented Mar 21, 2024 via email

@efwfe
Contributor

efwfe commented Mar 21, 2024

Great — based on my Stack trace, what am I doing wrong?


It's not clear what happened here. Pulling and using the latest code may be helpful.

@JPW0080

JPW0080 commented Mar 21, 2024

Is xformers installed?

@boxabirds
Contributor Author

boxabirds commented Mar 21, 2024 via email

@boxabirds
Contributor Author

It's not clear what happened here. Pulling and using the latest code may be helpful.

I checked, and this is against the latest code. I don't see any changes in the last 12 hours, and my pull was within that window.

@moosl

moosl commented Mar 21, 2024

I have the same issue here.

@williamyang1991
Owner

Then maybe you could turn off the optimization function to further save memory (but sacrifice performance)?
#14 (comment)

@williamyang1991
Owner

Is xformers installed?

I tried xformers.ops.memory_efficient_attention, but found it to be less memory efficient than F.scaled_dot_product_attention, so I didn't use xformers in my code:

'''
# for xformers implementation
if importlib.util.find_spec("xformers") is not None:
    hidden_states = xformers.ops.memory_efficient_attention(
        rearrange(query, "b h d c -> b d h c"), rearrange(key, "b h d c -> b d h c"),
        rearrange(value, "b h d c -> b d h c"),
        attn_bias=attention_mask, op=None
    )
    hidden_states = rearrange(hidden_states, "b d h c -> b h d c", h=attn.heads)
'''
# the output of sdp = (batch, num_heads, seq_len, head_dim)
# TODO: add support for attn.scale when we move to Torch 2.1
# output: BC * 8 * HW * D2
hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
)
#print('cross: ', GPU.getGPUs()[1].memoryUsed)
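For anyone who wants to reproduce that comparison on their own card, a self-contained sketch that measures the peak memory of both attention paths (the tensor shapes are made up for illustration and are not taken from the repo; the xformers branch only runs if the package is installed):

# Standalone peak-memory probe for the two attention paths discussed above.
import importlib.util
import torch
import torch.nn.functional as F

def peak_mem(fn):
    # Reset the allocator's peak counter, run fn, and report the peak in GiB.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30

# Illustrative shapes only: batch, heads, sequence length (H*W), head dim.
b, h, s, d = 8, 8, 64 * 64, 40
q = torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

print("sdpa peak: %.2f GiB" % peak_mem(
    lambda: F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)))

if importlib.util.find_spec("xformers") is not None:
    import xformers.ops
    # xformers expects (batch, seq_len, heads, head_dim)
    qx, kx, vx = (t.transpose(1, 2).contiguous() for t in (q, k, v))
    print("xformers peak: %.2f GiB" % peak_mem(
        lambda: xformers.ops.memory_efficient_attention(qx, kx, vx)))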

@boxabirds
Contributor Author

Then maybe you could turn off the optimization function to further save memory (but sacrifice performance)? #14 (comment)

There's something very strange going on, because #14 is about a 12 GB GPU and it works, yet I have a 24 GB GPU and it won't do even the most basic processing on an image sequence requiring 112 MB. Something's going on with the PyTorch allocation: why does it need 20 GB of GPU RAM? The only thing I can conclude is that #14 is against a different version of the code base.

@williamyang1991
Owner

I think maybe there is no problem with the code.
Maybe there are some specific GPU-allocation settings on your machine that cause the OOM?
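One quick sanity check along those lines is to print how much of the card is actually free before the pipeline starts, since the traceback above already shows another process (15481) holding about 1.27 GiB; a small sketch:

import torch
# Free vs. total device memory as reported by the driver (in bytes).
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
# Whatever is missing here is held by other processes (desktop, browser,
# another Python session) and is not available to FRESCO.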

@boxabirds
Contributor Author

boxabirds commented Mar 22, 2024 via email

@williamyang1991
Owner

You can print the memory usage in diffusion_hacked.py, e.g.

print('diffusion_hacked Line 286: ', GPU.getGPUs()[1].memoryUsed)

to see which part of the code is running when the OOM happens.
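(The GPU object above presumably comes from GPUtil, i.e. import GPUtil as GPU; that is an assumption.) A PyTorch-side complement that only counts this process's own tensor allocations can be sprinkled around the same spots:

import torch

def log_mem(tag):
    # Memory held by live tensors vs. memory reserved by the caching allocator.
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated {alloc:.2f} GiB, reserved {reserved:.2f} GiB")

# e.g. log_mem('before optimize_feature') and log_mem('after optimize_feature')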

@cvespaz

cvespaz commented Mar 22, 2024

same issue here, following

@cvespaz

cvespaz commented Mar 22, 2024

Even just running "run keyframes" OOMs on a 24 GB card? Am I missing something here? I ran the example test just fine with Gradio.

@williamyang1991
Owner

Even just running "run keyframes" OOMs on a 24 GB card? Am I missing something here? I ran the example test just fine with Gradio.

Full frames do not take more memory; the keyframe part uses the most memory.
Do you mean the example video works fine but your own video OOMs?
Maybe your video has too many pixels. The example video is 512*512 pixels.
If your video is larger, you can use a smaller resize parameter:

img = resize_image(frame, 512)
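For reference, a generic stand-in for that kind of resize (how the repo's resize_image actually behaves should be checked in its source; the rule below, scaling the shorter side to 512 and rounding both dimensions to multiples of 64, is an assumption):

import cv2
import numpy as np

def resize_short_side(frame: np.ndarray, target: int = 512) -> np.ndarray:
    # Scale so the shorter side is `target`, rounding both dims to multiples of 64.
    h, w = frame.shape[:2]
    k = target / min(h, w)
    new_h = int(round(h * k / 64)) * 64
    new_w = int(round(w * k / 64)) * 64
    interp = cv2.INTER_AREA if k < 1 else cv2.INTER_LANCZOS4
    return cv2.resize(frame, (new_w, new_h), interpolation=interp)

# e.g. a 1920x1080 frame would come out at 896x512 under these rules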
