CPU Offloading #8

Open
cocktailpeanut opened this issue Nov 8, 2024 · 8 comments

Comments

@cocktailpeanut

I saw the line pipe.enable_model_cpu_offload() here https://github.com/instantX-research/InstantIR/blob/main/pipelines/sdxl_instantir.py#L113C13-L113C44 and tried the approach with the gradio app, but got the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

What else needs to be done in the code to fix this error and make this work?

@JY-Joy
Collaborator

JY-Joy commented Nov 8, 2024

CPU offloading is used to reduce memory usage. The error you encountered occurs because our aggregator is not properly registered as one of the pipe's modules at present.
However, this line of code was added to the example by mistake: InstantIR already involves quite a few modules, and triggering CPU offloading will severely slow down its inference. Please just ignore this line and remove it from your script, or directly try our example usage in infer.py. We will remove this confusing code in the next commit, thanks a lot!
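
For readers hitting the same error: below is a minimal sketch of what "registered as the pipe's module" means in diffusers. enable_model_cpu_offload() only attaches offload hooks to modules the pipeline knows about, so a module kept outside the pipeline stays on CPU while its inputs land on cuda:0, which is exactly the RuntimeError above. The aggregator registration line is hypothetical and only illustrates the mechanism:

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
    )

    # Hypothetical: register the extra aggregator module with the pipeline so
    # that enable_model_cpu_offload() attaches an offload hook to it as well.
    # pipe.register_modules(aggregator=aggregator)

    # Moves each registered submodel to the GPU only while it runs.
    pipe.enable_model_cpu_offload()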

@cocktailpeanut
Author

@JY-Joy Thank you for the response. I wonder if there is a way to trim the VRAM usage a bit more. I tried experimenting with some of the diffusers memory-saving techniques with no success, but I think you might know best. It would be really great if this worked on consumer-grade PCs. I can confirm it works on a 4090 and on a Mac M1 Max with 64GB of memory, although it consumes around 47GB during inference. Here's one user who wanted to try but failed: https://x.com/Teslanaut/status/1854985331915034995

Do you think there's room for any optimization?
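
For reference, diffusers exposes two levels of offloading; both require every module to be registered with the pipeline, per the comment above. A sketch:

    # Whole submodels hop between CPU and GPU as they are needed:
    pipe.enable_model_cpu_offload()

    # Much smaller VRAM footprint at a large speed cost, offloading at the
    # level of individual leaf modules:
    # pipe.enable_sequential_cpu_offload()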

@JY-Joy
Collaborator

JY-Joy commented Nov 10, 2024

Yes, of course there is room for optimization; we will have those diffusers optimizations supported in the near future. However, it is quite strange that InstantIR consumes 47GB of VRAM. I checked the tweet you mentioned: the user reported coming up 20 MiB of VRAM short on a 3080 Ti, which has 12GB of total capacity. Sadly, with our current implementation it is recommended to deploy InstantIR on devices with at least 22GB of VRAM. We will try to optimize this, but I have to say there will be a trade-off with efficiency.
Thanks for your feedback and your tweet ❤️

@Akossimon

I run Mac and Pinokio. It might be useful to understand that Mac owners usually have 32GB of RAM and almost never 64GB, since macOS uses disk space as virtual memory, so there is rarely a need to purchase 64GB of RAM anymore like there used to be. Sadly, your amazing app does not work with 32GB of RAM on Macs when run through Pinokio.

I also think many Mac users are not aware that they could ask for such RAM optimization right here with the developers themselves; otherwise you would have many, many more people wondering whether this can be optimized for 32GB of RAM on Macs and Pinokio.

It would be so amazing if you could make it work on Macs with 32GB of RAM :)

@Skquark

Skquark commented Nov 22, 2024

I'd also love to CPU-offload this, with the ability to run it on a system with 16GB or less, understanding that it'll slow things down. I've integrated InstantIR as an upscale method in my app at AEIONic.com, as an option alongside RealESRGAN, AuraSR, etc., and got it working within all the pipelines. However, I misunderstood it: it's not necessarily an upscaler but more of an image repairer. I also didn't expect it to max out all the VRAM, but at least it still runs, even though it takes an hour per image. I tried enabling CPU offloading too, got the same error as above, dug into the pipelines, and realized it wasn't quite implemented in the aggregator; I couldn't figure out a hack. Wish I had a 24GB+ card, but is there any chance of optimizing it somehow? Maybe there's a way to use TorchAO, quantization, or bitsandbytes? It'd be nice if it officially got adapted into the Diffusers library.
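
A quantized load along these lines is presumably what is meant here. This is only a sketch, assuming a diffusers build with bitsandbytes quantization support, with SDXL's base UNet standing in for illustration:

    import torch
    from diffusers import UNet2DConditionModel, BitsAndBytesConfig

    # Load the SDXL UNet with 8-bit weights via bitsandbytes (requires a
    # diffusers version with quantization support and bitsandbytes installed).
    unet = UNet2DConditionModel.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        subfolder="unet",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )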

@JY-Joy
Collaborator

JY-Joy commented Nov 22, 2024

Thanks for your careful investigation @Skquark; InstantIR is indeed designed to be an image repairer. By upscaling, did you mean output images larger than 1024px? At present the maximum output resolution is constrained by SDXL's capacity, and of course by your device.
Given how many requests like this there are, we will prioritize the implementation of CPU offloading. Thanks, everyone, for your interest!

@Skquark

Skquark commented Nov 22, 2024

For sure, it still has its uses, just not for my upscaling intent. I'm going to move it to its own tab in my app and make it a utility tool instead of an image post-processor. I still won't be able to run it on my own computer, but others can get some use out of it. Thanks, let us know when we have a way to optimize...

@emil-malina

  • If you use infer.sh to do upscaling, make sure you set batch_size=1; it's 4 by default.
  • The other tip is to use VAE tiling and slicing to reduce the memory footprint:

    pipe.enable_vae_tiling()
    pipe.enable_vae_slicing()
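
What the two VAE calls trade off, for anyone copying them (both are standard diffusers APIs):

    pipe.enable_vae_slicing()  # decode batched latents one image at a time;
                               # only helps when the batch size is > 1
    pipe.enable_vae_tiling()   # decode the latent in spatial tiles;
                               # helps at high output resolutions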
