
How to reduce condition graph computing time? #433

Open
hxgqh opened this issue Oct 7, 2024 · 7 comments
hxgqh commented Oct 7, 2024

Computing the condition graph takes 34 seconds, while sampling takes only 5.5 s per iteration.

[screenshot: log output with timings]

Green-Sky (Contributor) commented:

Flux uses t5xxl, which is relatively heavy, and the text encoder is currently implemented to run only on the CPU.

stduhpf (Contributor) commented Oct 7, 2024

@hxgqh Set the number of threads with the -t argument.
With a single thread it also takes around 34 seconds on my CPU.
With -t 24 (I have a 24-thread CPU), it only takes around 3.5 seconds.
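For example, you can derive a thread count from the machine's core count and pass it along (the binary name, model path, and prompt below are placeholders; check your build's --help for the exact flags):

```shell
# Pick a thread count from the logical core count
# (nproc on Linux, sysctl on macOS).
THREADS=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
echo "using $THREADS threads"
# Hypothetical sd.cpp invocation -- adjust model path and prompt:
# ./sd -t "$THREADS" -m model.safetensors -p "a lovely cat"
```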

hxgqh (Author) commented Oct 7, 2024

@stduhpf I tried setting the thread count to 2/4/8/24/48/96, and it doesn't seem to make a difference. Maybe the default thread count is already set to the number of CPU cores? #3

stduhpf (Contributor) commented Oct 7, 2024

You're right @hxgqh, it uses the number of physical cores of the CPU by default (so 12 in my case).
If I don't set the -t argument at all, it takes 4 seconds with 12 threads.

Then I guess either your CPU is too slow, or you're running out of system memory and it's swapping (if that's the case, using a quantized version of t5xxl might help).
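To check whether swapping is the culprit, compare free memory against the model size before running. A sketch on Linux (the flag names follow sd.cpp's flux examples, but verify them against your build's --help; the file names are placeholders):

```shell
# t5xxl in fp16 is roughly 9-10 GB of weights; if free memory is below
# that, the OS will swap and conditioning gets very slow.
free -h
# Hypothetical invocation with a quantized t5xxl to reduce memory use:
# ./sd --diffusion-model flux1-dev-q8_0.gguf --clip_l clip_l.safetensors \
#      --t5xxl t5xxl_q8_0.gguf --vae ae.safetensors -p "a lovely cat"
```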

hxgqh (Author) commented Oct 8, 2024

@Green-Sky Is there any plan to run t5xxl on the GPU?

actionless commented Nov 16, 2024

It seems something is off here: for the same t5xxl step on a Ryzen 5600X, the Python CPU implementation (from the diffusers library) runs 67 seconds (237%) faster than the C++ implementation.

And that is the best of the sd-cpp measurements, with 6 threads; setting it to 12 (nproc) makes it another 22 seconds slower:

  • python diffusers (cpu) - 52 seconds
  • sd-cpp 3 threads - 199 seconds
  • sd-cpp 6 threads - 119 seconds
  • sd-cpp 12 (nproc) threads - 141 seconds
  • sd-cpp 16 threads - 147 seconds

Off-topic: an inference step takes ~9 seconds on GPU for both diffusers and sd-cpp; however, to achieve the same quality of result, diffusers needs 4 steps while sd-cpp needs 20. But I guess the latter has something to do with the default parameters.

actionless commented Nov 20, 2024

I found the solution to my problem with the high condition graph computing time; maybe it will be useful for someone else.

I was building the package on an old Xeon server but running it on a new Ryzen workstation, so apparently some CPU optimizations were disabled during compilation.

After re-compiling it on the workstation itself, the condition graph now computes in 18 seconds in the same test case (34 seconds faster than diffusers on CPU, and 100 seconds faster than the previous build without those optimizations).
