-
This requires an overhaul of the execution model and in-depth consideration of various corner cases. For example, tasks that require VRAM should not run in parallel; otherwise you run straight into the nightmare of VRAM exhaustion. There are difficult questions such as the policy for deciding which tasks may run in parallel and which may not, and how to ensure that every node developer applies that policy correctly. This is not something that can be settled in a single issue; it will need to be refined through lengthy discussion in a forum thread.
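To make the concern concrete, here is a minimal sketch of one possible policy mechanism, assuming a hypothetical per-node flag (`REQUIRES_EXCLUSIVE_VRAM`) and an `execute` entry point. Neither exists in ComfyUI today; this only illustrates how a scheduler could serialize VRAM-heavy work while letting cheap nodes overlap.

```python
import threading

# One global lock serializes every node that declares it needs exclusive VRAM.
VRAM_LOCK = threading.Lock()

class ExampleSamplerNode:
    # Hypothetical policy flag set by the node author: VRAM-heavy nodes keep
    # the default True, while cheap CPU-only nodes could set this to False.
    REQUIRES_EXCLUSIVE_VRAM = True

    def execute(self, **inputs):
        ...  # placeholder for the node's actual work

def run_node(node, inputs):
    """Run a node, holding the VRAM lock only when its policy asks for it."""
    if getattr(node, "REQUIRES_EXCLUSIVE_VRAM", True):  # default to the safe, serial path
        with VRAM_LOCK:
            return node.execute(**inputs)
    return node.execute(**inputs)
```

The hard part the comment points at is not this mechanism but getting every custom node to declare its policy honestly; a wrong default here trades a crash for silent oversubscription.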
-
Could we create a tool that allows users to manually specify which nodes can run in parallel for each specific workflow? That would make it possible to calculate the VRAM load for each case and use memory more efficiently, avoiding VRAM shortages while still handling complex workflows. By letting users decide which tasks run in parallel, we sidestep the problems with automatic parallelization and give them control over resource use.
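A rough sketch of that manual approach, assuming a made-up group format and per-node VRAM estimates (neither is an existing ComfyUI feature): the user tags groups of node IDs that may run concurrently, and a checker verifies that the summed estimate fits the card before the group is allowed to overlap.

```python
def check_parallel_groups(parallel_groups, vram_estimates_mb, budget_mb):
    """Return the groups whose combined VRAM estimate fits within budget_mb."""
    safe_groups = []
    for group in parallel_groups:
        total = sum(vram_estimates_mb.get(node_id, 0) for node_id in group)
        if total <= budget_mb:
            safe_groups.append(group)
        else:
            print(f"group {group} needs ~{total} MB, over the {budget_mb} MB budget; run it serially")
    return safe_groups

# Example: two upscalers may overlap, but the sampler is too heavy to pair with either.
groups = [["upscale_a", "upscale_b"], ["sampler", "upscale_a"]]
estimates = {"upscale_a": 3000, "upscale_b": 3000, "sampler": 9000}
print(check_parallel_groups(groups, estimates, budget_mb=8000))
```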
-
I think this is a much-needed feature; otherwise ComfyUI will not harness its full potential. There are multiple use cases for this. A better approach would be a thread node/group to mark which nodes can be executed in threads, so that those nodes can run in parallel.
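A minimal sketch of that "thread group" idea, assuming a hypothetical `is_thread_safe` marker and a `run` method on each node: marked nodes go to a thread pool, everything else keeps today's serial behavior.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_level(nodes, pool: ThreadPoolExecutor):
    """Execute one dependency level: marked nodes in parallel, the rest serially."""
    # Submit the nodes the user explicitly marked as thread-safe.
    futures = [pool.submit(n.run) for n in nodes if getattr(n, "is_thread_safe", False)]
    # Unmarked nodes run on the main thread exactly as they do today.
    for n in nodes:
        if not getattr(n, "is_thread_safe", False):
            n.run()
    # Wait for the threaded nodes and propagate any exceptions they raised.
    for f in futures:
        f.result()
```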
-
@guill Previously it was somewhat premature to discuss parallelizing the execution model, but now that #2666 has been merged, it seems possible to discuss the next steps. Are you perhaps interested in this topic as well?
-
Here’s a concept I’d like your feedback on:
-
While features to control where different parts of a job execute, and/or to split models across GPUs, are necessary in the long term, in the near term the focus should be on letting the queue execute jobs against all available local GPUs (or a subset) plus a list of remote server instances. It should be relatively simple to support concurrent jobs at the queueing point and let the user select which compute resources are available for queued jobs. If I have several mixed GPUs locally, I may want to omit the one with less VRAM when I am running bigger models in the batch, or manually add a list of remote server addresses. There also needs to be a mechanism for remote jobs to return their output so the local pipeline can display it and/or write it to disk.

Once queueing takes advantage of distributing the workload, adding more granular controls for which parts of the pipeline run on which GPU or remote server instance, or tagging them with parallelism dependencies, makes a lot of sense. It's just a lot more complicated and should come after basic job queueing is working.

It's pretty easy to get multiple local GPUs or remote server instances going: older GPUs with decent VRAM are starting to get pretty cheap, and PCIe riser solutions plus mining-rig power supply setups let you easily add ~28 GPUs to a single system, so multi-GPU support is going to be increasingly expected and useful. For example, I have a server with 8x M40 24GB cards, one with 6x 1060 6GB cards, and another under assembly with 12x K80 (dual 12GB) plus 2x M40 24GB cards, all picked up on eBay for less than the cost of the (better, but expensive) current-generation cards. When prices on RTX cards start dropping after the NVIDIA 5000 series ships, even more good GPUs will flood the channels. As such, I think the priority should be getting the queue to schedule jobs across multiple GPUs first.
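As a rough illustration of queue-level distribution only (no per-node splitting), here is a sketch where each backend is either a local GPU index or a remote ComfyUI address, and whole queued jobs go to whichever backend frees up first. The `submit_to_backend` helper and the backend list are assumptions for illustration, not an existing API.

```python
import queue
import threading

# Backends the user opted in: local GPU indices and remote server addresses.
# Leaving a low-VRAM card out of this list is how you'd exclude it from big jobs.
backends = ["cuda:0", "cuda:1", "http://192.168.1.50:8188"]
jobs = queue.Queue()

def submit_to_backend(backend, workflow):
    # Placeholder: a real implementation would either run the workflow on the
    # given local GPU or send it to the remote server's queue, then fetch the
    # outputs back so the local pipeline can display/save them.
    print(f"dispatching job to {backend}")

def worker(backend):
    # Each backend pulls the next whole job as soon as it is free, so faster
    # GPUs naturally take more jobs than slower ones.
    while True:
        workflow = jobs.get()
        if workflow is None:
            break
        submit_to_backend(backend, workflow)
        jobs.task_done()

threads = [threading.Thread(target=worker, args=(b,), daemon=True) for b in backends]
for t in threads:
    t.start()
```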
-
Yeah, I have StableSwarmUI running and it spins up 8 instances for the GPUs easily enough. For some reason I can't install NetDistPlus (it fails), but I do have NetDist sort of working after looking at the examples more closely. It hangs if I include the GPU 0 server in the workspace, maybe because I'm remote-desktopped into the Ubuntu host and that session is using GPU 0 for acceleration. Anyway, it works, ish, but it's still pretty clunky. For example, it needs to finish all images in the current set before it can move on to the next queued job item; in my case some GPUs are on full x16-bandwidth riser cables and some are on x1 extenders, so the fast GPUs idle while waiting on the ones with slower links to finish transferring data. It's also super irritating that the remote server ports change when the servers crash or get restarted; the ports are held by Python, so the original ports don't get freed unless you restart Python or reboot. It also falls over if any one of the remote hosts isn't available. So, not very robust.

It works OK, but long term I'd rather see multi-GPU/remote-host support built into the queueing directly and more robustly. When designing the ability to split independent nodes across multiple GPUs and manage dependencies, the work to distribute items from the queue across local and remote GPUs should get done too. It's all interrelated, and there are valid reasons both for splitting a single job across multiple GPUs and for batching multiple jobs across a GPU pool. Fundamentally it needs to assign the various workstreams to the available local and remote GPUs, manage ordering dependencies (whether they are whole jobs or subcomponents), and manage caching/reuse in GPU memory. It's all the same thing once everything becomes work items that need to be scheduled on available GPUs, and there are different load-balancing strategies you might want depending on how the cluster is being used: for cranking out images in batch you maximize load efficiency by keeping the main model cached on all the GPUs, whereas for accelerating a single workflow you maximize parallelism by dedicating GPUs to specific models.
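A tiny sketch of those two balancing strategies, treating work items abstractly (the data structures here are purely illustrative): "throughput" round-robins whole jobs so every GPU keeps the same main model cached, while "latency" pins each model to its own GPU so one job's stages can overlap.

```python
def assign(work_items, gpus, mode="throughput"):
    """Map work items to GPUs under one of two illustrative strategies."""
    assignments = {g: [] for g in gpus}
    if mode == "throughput":
        # Batch mode: any GPU takes any job; the main model is cached everywhere.
        for i, item in enumerate(work_items):
            assignments[gpus[i % len(gpus)]].append(item)
    else:
        # Latency mode: dedicate GPUs per model so a single job's stages overlap.
        for item in work_items:
            target = gpus[hash(item["model"]) % len(gpus)]
            assignments[target].append(item)
    return assignments

jobs = [{"model": "sdxl"}, {"model": "upscaler"}, {"model": "sdxl"}]
print(assign(jobs, ["gpu0", "gpu1"], mode="latency"))
```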
-
What about the MultiGPU custom node?
-
Can we implement parallel execution of independent nodes in ComfyUI to improve performance?
Description:
Allow nodes that do not depend on each other to run simultaneously. This can optimize resource usage and reduce processing time.
Examples:
Image Processing:
All these operations can run in parallel.
Benefits:
If determining which nodes are independent is complex, this configuration could be set manually for each workflow.
Would it be possible to add this feature?
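For illustration, here is a sketch of the requested behavior using Python's standard `graphlib` for the dependency ordering and a thread pool for the independent nodes; `run_node` stands in for whatever actually executes a ComfyUI node, and the graph format is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def execute_graph(graph, run_node, max_workers=4):
    """graph maps node_id -> set of node_ids it depends on."""
    ts = TopologicalSorter(graph)
    ts.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while ts.is_active():
            ready = ts.get_ready()                  # all nodes whose dependencies are done
            futures = {pool.submit(run_node, n): n for n in ready}
            for fut, node in futures.items():
                fut.result()                        # wait for the node, surface errors
                ts.done(node)                       # unlock its dependents

# Example: a load node feeding two independent filters that can overlap.
execute_graph({"load": set(), "blur": {"load"}, "sharpen": {"load"}}, print)
```

If automatic detection turns out to be too fragile, the same runner could consume a manually specified grouping per workflow, as suggested above.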