Solving Performance Issues #4457
Replies: 17 comments 25 replies
-
Hi, I ran several tests with a clean installation and a correctly configured env. Only with Flux did I notice a drop in performance. Out of curiosity I disabled xformers and used PyTorch cross attention, expecting a total collapse in performance, but the speed turned out to be the same. Has PyTorch cross attention improved to the point that xformers is no longer necessary? It could also be that the acceleration doesn't kick in here, whereas it does on Forge. To be clear: everything is correctly configured and there were no errors in the installation. Thank you. xformers 0.0.27, PyTorch 2.3.1
-
I found the workflow loading time is higher compared to the old version. And there are still issues with other nodes conflicting with the new UI.
-
What I can say is that I (RTX 2060 6 GB, 32 GB RAM, Windows 11) get vastly better performance on SD Forge with Flux Dev compared to Comfy (using the recommended standalone build). Around 11 s/it vs 7 s/it using the exact same settings (1024x1024, 20 steps, Euler) in fp16 with T5 XXL at fp8. This is strange because I was under the impression both were using similar engines under the hood.
-
Since the update yesterday, once in a while I get an out-of-memory error (I just need to press the Queue Prompt button again and it works), or I lose the connection while it's rendering (I need to restart the server). Before the update I didn't have those issues. I'm not using the --highvram setting and my PyTorch version is 2.3.1. The only suspect message in the command line is: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable Edit: My arguments are: set COMMANDLINE_ARGS= --cuda-malloc --no-half-vae --cuda-device 0 --use-pytorch-cross-attention
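The TensorFlow message quoted above is cut off; in current TensorFlow builds the variable it refers to is TF_ENABLE_ONEDNN_OPTS. A minimal sketch of setting it before launch (this only silences the oneDNN notice and makes numerics deterministic; it is unrelated to the OOM itself):

```shell
# Minimal sketch, POSIX shell (on Windows, add `set TF_ENABLE_ONEDNN_OPTS=0`
# to the launch .bat before the COMMANDLINE_ARGS line instead).
# Setting the variable to 0 turns off TensorFlow's oneDNN custom operations.
export TF_ENABLE_ONEDNN_OPTS=0
echo "TF_ENABLE_ONEDNN_OPTS=$TF_ENABLE_ONEDNN_OPTS"
```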
-
Should have left well alone.
-
GTX 1070 (8 GB). In my case, Flux fp8 got slower, but Flux fp16 got faster. This involves at least these 2 commits:
After the "fp16 support hack", both Flux fp8 and Flux fp16 got slower. But more recently, Flux fp16 got faster (even faster than before the "fp16 support hack"). EDIT: After the latest update, there was an improvement for fp16.
-
After the latest updates the situation has improved significantly. However, there is a problem after generation: the VRAM is sometimes not freed, which can lead to OOM, though it doesn't always happen. Otherwise, generation speed is much better. Keep optimizing and soon Forge will be behind you again :) FLUX fp8.
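For the VRAM-not-freed case, a hedged workaround sketch: ComfyUI has a `--disable-smart-memory` launch flag that aggressively offloads models from VRAM after use, and `--reserve-vram` (mentioned elsewhere in this thread) to keep headroom free. Treat these as workarounds, not fixes, and note the exact flag names in your build's `comfy/cli_args.py`:

```shell
# Workaround sketch: launch flags that can help when VRAM is not released
# between generations (slower, but more stable when VRAM is tight):
#   --disable-smart-memory  offload models from VRAM to RAM after each use
#   --reserve-vram 1.0      keep ~1 GB of VRAM free for the OS / other apps
ARGS="--disable-smart-memory --reserve-vram 1.0"
echo "python main.py $ARGS"
```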
-
Please fix the MPS Mac M2 Apple silicon issue!!! Requested to load AutoencoderKL
-
On Windows using ComfyUI: how do I either downgrade the embedded version of PyTorch to 2.3.1, or, with a newly installed copy of ComfyUI, stop it from upgrading to 2.4.0?
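A hedged sketch of one way to do this with the Windows standalone build. The folder and script names are assumptions based on the default ComfyUI_windows_portable layout, and the cu121 index URL is the one those builds shipped with at the time:

```shell
# Sketch: pin the embedded PyTorch to 2.3.1. On Windows, from the portable
# folder, the actual command would be:
#
#   python_embeded\python.exe -m pip install --force-reinstall torch==2.3.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
#
# To keep the pin afterwards, update with update_comfyui.bat rather than
# update_comfyui_and_python_dependencies.bat, which reinstalls a newer torch.
PIN="torch==2.3.1"
echo "pinning $PIN"
```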
-
2 days ago I sometimes had OOM; now, since today's update, my first render is good, my second is very noisy, and all the others are pure noise. I need to restart ComfyUI to be able to generate even one good render again. This happens when using a LoRA; without a LoRA it works.
-
I seldom, but do sometimes, have VRAM issues with my 4090, though with the right resolution it's never an issue. Some nodes or workflows are just really bad, ramping up VRAM usage for very little. Yes, 2.4 was a massive issue for running anything.
-
On Ubuntu 22.04 with CUDA 12.4, ComfyUI keeps crashing and failing to load IP-Adapter models. instead of: This part of the README wasn't updated in the last 3 months, so I wonder if you still recommend installing the --pre torch? @comfyanonymous
-
For NVIDIA 10xx series cards: I did several tests and ended up with the following flags to optimize performance: With --use-pytorch-cross-attention or xformers you can't use fp16; why?
-
I have exactly the same speed with ComfyUI and Forge.
-
Not long after my previous post I started getting perpetual VRAM issues. I think it does not unload VRAM and then starts another generation on top of it, making my computer lag and producing nothing because it jams up my whole system. It runs fine the first time, but if the queue stops and restarts, or after a couple of runs, it can end up in this state.
-
Hi, I dual-boot Windows 11 and Ubuntu 22.04, RTX 3090 and 32 GB RAM. Flux Dev 1 works with the text encoder at fp16 and fp8 in ComfyUI on Windows only; on Ubuntu only fp8 works, and fp16 causes the system to crash completely, forcing a hard reboot. I have tried --reserve-vram but it didn't help. Any ideas why fp16 only works on Windows? On WebUI Forge I can get it to work on Ubuntu by setting the GPU reserve to 12000 and Swap to CPU. I'd really like to use ComfyUI on Ubuntu :(
-
Hello, the latest version is super unstable on my config on a fresh install: Total VRAM 24563 MB, total RAM 32691 MB. Waiting for another update. Edit: GPU 3090 Ti OC + 32 GB RAM + NVMe + Ryzen 3950X (32 threads) works smoothly on: git checkout 14af129
-
There have been a number of big changes to the ComfyUI core recently which should improve performance across the board, but there might still be some bugs that slow things down for some people, and I want to find and fix them before the next stable release.
If you have performance issues:
For Windows Users:
If you still have performance issues, report them in this thread; make sure to post your full ComfyUI log and your workflow. The more information the better.
Some common sources of user errors: