-
This requires an overhaul of the execution model and in-depth consideration of various corner cases. For example, tasks that require VRAM should not run in parallel; otherwise you run straight into the nightmare of VRAM exhaustion. There are difficult questions such as the policy for deciding which tasks may run in parallel and which may not, and how to ensure that every node developer applies that policy correctly. This is not something that can be settled in a single issue; it will need to be refined through lengthy discussion in a forum thread.
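To make the concern concrete, here is a minimal sketch of one possible policy mechanism, assuming a hypothetical per-node flag (`REQUIRES_EXCLUSIVE_VRAM`) and an `execute` entry point. Neither exists in ComfyUI today; this only illustrates how a scheduler could serialize VRAM-heavy work while letting cheap nodes overlap.

```python
import threading

# One global lock serializes every node that declares it needs exclusive VRAM.
VRAM_LOCK = threading.Lock()

class ExampleSamplerNode:
    # Hypothetical policy flag set by the node author: VRAM-heavy nodes keep
    # the default True, while cheap CPU-only nodes could set this to False.
    REQUIRES_EXCLUSIVE_VRAM = True

    def execute(self, **inputs):
        ...  # placeholder for the node's actual work

def run_node(node, inputs):
    """Run a node, holding the VRAM lock only when its policy asks for it."""
    if getattr(node, "REQUIRES_EXCLUSIVE_VRAM", True):  # default to the safe, serial path
        with VRAM_LOCK:
            return node.execute(**inputs)
    return node.execute(**inputs)
```

The hard part the comment points at is not this mechanism but getting every custom node to declare its policy honestly; a wrong default here trades a crash for silent oversubscription.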
-
Could we create a tool that allows users to manually specify which nodes can run in parallel for each specific workflow? That would make it possible to calculate the VRAM load for each case and use memory more efficiently, avoiding VRAM shortages while still handling complex workflows. By letting users decide which tasks run in parallel, we sidestep the problems with automatic parallelization and give them control over resource use.
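A rough sketch of that manual approach, assuming a made-up group format and per-node VRAM estimates (neither is an existing ComfyUI feature): the user tags groups of node IDs that may run concurrently, and a checker verifies that the summed estimate fits the card before the group is allowed to overlap.

```python
def check_parallel_groups(parallel_groups, vram_estimates_mb, budget_mb):
    """Return the groups whose combined VRAM estimate fits within budget_mb."""
    safe_groups = []
    for group in parallel_groups:
        total = sum(vram_estimates_mb.get(node_id, 0) for node_id in group)
        if total <= budget_mb:
            safe_groups.append(group)
        else:
            print(f"group {group} needs ~{total} MB, over the {budget_mb} MB budget; run it serially")
    return safe_groups

# Example: two upscalers may overlap, but the sampler is too heavy to pair with either.
groups = [["upscale_a", "upscale_b"], ["sampler", "upscale_a"]]
estimates = {"upscale_a": 3000, "upscale_b": 3000, "sampler": 9000}
print(check_parallel_groups(groups, estimates, budget_mb=8000))
```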
-
I think this is a much-needed feature; otherwise ComfyUI will not harness its full potential. There are multiple use cases for this. A better approach would be a thread node/group to mark which nodes can be executed in threads, so that those nodes can run in parallel.
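A minimal sketch of that "thread group" idea, assuming a hypothetical `is_thread_safe` marker and a `run` method on each node: marked nodes go to a thread pool, everything else keeps today's serial behavior.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_level(nodes, pool: ThreadPoolExecutor):
    """Execute one dependency level: marked nodes in parallel, the rest serially."""
    # Submit the nodes the user explicitly marked as thread-safe.
    futures = [pool.submit(n.run) for n in nodes if getattr(n, "is_thread_safe", False)]
    # Unmarked nodes run on the main thread exactly as they do today.
    for n in nodes:
        if not getattr(n, "is_thread_safe", False):
            n.run()
    # Wait for the threaded nodes and propagate any exceptions they raised.
    for f in futures:
        f.result()
```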
-
@guill Previously it was somewhat premature to discuss parallelizing the execution model, but now that #2666 has been merged, it seems possible to discuss the next steps. Are you perhaps interested in this topic as well?
-
Here’s a concept I’d like your feedback on:
-
While features to control where different parts of a job execute, and/or to split models across GPUs, are necessary in the long term, in the near term the focus should be on letting the queue execute jobs against all available local GPUs (or a subset) plus a list of remote server instances. It should be relatively simple to support concurrent jobs at the queueing point and let the user select which compute resources are available for queued jobs. If I have several mixed GPUs locally, I may want to omit the one with less VRAM when I am running bigger models in the batch, or manually add a list of remote server addresses. There also needs to be a mechanism for remote jobs to return their output so the local pipeline can display it and/or write it to disk.

Once queueing takes advantage of distributing the workload, adding more granular controls for which parts of the pipeline run on which GPU or remote server instance, or tagging them with parallelism dependencies, makes a lot of sense. It's just a lot more complicated and should come after basic job queueing is working.

It's pretty easy to get multiple local GPUs or remote server instances going: older GPUs with decent VRAM are starting to get pretty cheap, and PCIe riser solutions plus mining-rig power supply setups let you easily add ~28 GPUs to a single system, so multi-GPU support is going to be increasingly expected and useful. For example, I have a server with 8x M40 24GB cards, one with 6x 1060 6GB cards, and another under assembly with 12x K80 (dual 12GB) plus 2x M40 24GB cards, all picked up on eBay for less than the cost of the (better, but expensive) current-generation cards. When prices on RTX cards start dropping after the NVIDIA 5000 series ships, even more good GPUs will flood the channels. As such, I think the priority should be getting the queue to schedule jobs across multiple GPUs first.
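As a rough illustration of queue-level distribution only (no per-node splitting), here is a sketch where each backend is either a local GPU index or a remote ComfyUI address, and whole queued jobs go to whichever backend frees up first. The `submit_to_backend` helper and the backend list are assumptions for illustration, not an existing API.

```python
import queue
import threading

# Backends the user opted in: local GPU indices and remote server addresses.
# Leaving a low-VRAM card out of this list is how you'd exclude it from big jobs.
backends = ["cuda:0", "cuda:1", "http://192.168.1.50:8188"]
jobs = queue.Queue()

def submit_to_backend(backend, workflow):
    # Placeholder: a real implementation would either run the workflow on the
    # given local GPU or send it to the remote server's queue, then fetch the
    # outputs back so the local pipeline can display/save them.
    print(f"dispatching job to {backend}")

def worker(backend):
    # Each backend pulls the next whole job as soon as it is free, so faster
    # GPUs naturally take more jobs than slower ones.
    while True:
        workflow = jobs.get()
        if workflow is None:
            break
        submit_to_backend(backend, workflow)
        jobs.task_done()

threads = [threading.Thread(target=worker, args=(b,), daemon=True) for b in backends]
for t in threads:
    t.start()
```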
-
Yeah, I have StableSwarmUI running and it spins up 8 instances for the GPUs easily enough. For some reason I can't install NetDistPlus (it fails), but I do have NetDist sort of working after looking at the examples more closely. It hangs if I include the GPU 0 server in the workspace, maybe because I'm remote-desktopped into the Ubuntu host and that session is using GPU 0 for acceleration. Anyway, it works, ish, but it's still pretty clunky. For example, it needs to finish all images in the current set before it can move on to the next queued job item; in my case some GPUs are on full x16-bandwidth riser cables and some are on x1 extenders, so the fast GPUs idle while waiting on the ones with slower links to finish transferring data. It's also super irritating that the remote server ports change when the servers crash or get restarted; the ports are held by Python, so the original ports don't get freed unless you restart Python or reboot. It also falls over if any one of the remote hosts isn't available. So, not very robust.

It works OK, but long term I'd rather see multi-GPU/remote-host support built into the queueing directly and more robustly. When designing the ability to split independent nodes across multiple GPUs and manage dependencies, the work to distribute items from the queue across local and remote GPUs should get done too. It's all interrelated, and there are valid reasons both for splitting a single job across multiple GPUs and for batching multiple jobs across a GPU pool. Fundamentally it needs to assign the various workstreams to the available local and remote GPUs, manage ordering dependencies (whether they are whole jobs or subcomponents), and manage caching/reuse in GPU memory. It's all the same thing once everything becomes work items that need to be scheduled on available GPUs, and there are different load-balancing strategies you might want depending on how the cluster is being used: for cranking out images in batch you maximize load efficiency by keeping the main model cached on all the GPUs, whereas for accelerating a single workflow you maximize parallelism by dedicating GPUs to specific models.
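A tiny sketch of those two balancing strategies, treating work items abstractly (the data structures here are purely illustrative): "throughput" round-robins whole jobs so every GPU keeps the same main model cached, while "latency" pins each model to its own GPU so one job's stages can overlap.

```python
def assign(work_items, gpus, mode="throughput"):
    """Map work items to GPUs under one of two illustrative strategies."""
    assignments = {g: [] for g in gpus}
    if mode == "throughput":
        # Batch mode: any GPU takes any job; the main model is cached everywhere.
        for i, item in enumerate(work_items):
            assignments[gpus[i % len(gpus)]].append(item)
    else:
        # Latency mode: dedicate GPUs per model so a single job's stages overlap.
        for item in work_items:
            target = gpus[hash(item["model"]) % len(gpus)]
            assignments[target].append(item)
    return assignments

jobs = [{"model": "sdxl"}, {"model": "upscaler"}, {"model": "sdxl"}]
print(assign(jobs, ["gpu0", "gpu1"], mode="latency"))
```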
-
What about the MultiGPU custom node?
-
Can we implement parallel execution of independent nodes in ComfyUI to improve performance?
Description:
Allow nodes that do not depend on each other to run simultaneously. This can optimize resource usage and reduce processing time.
Examples:
Image Processing:
All these operations can run in parallel.
Benefits:
If determining which nodes are independent is complex, this configuration could be set manually for each workflow.
Would it be possible to add this feature?
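For illustration, here is a sketch of the requested behavior using Python's standard `graphlib` for the dependency ordering and a thread pool for the independent nodes; `run_node` stands in for whatever actually executes a ComfyUI node, and the graph format is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def execute_graph(graph, run_node, max_workers=4):
    """graph maps node_id -> set of node_ids it depends on."""
    ts = TopologicalSorter(graph)
    ts.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while ts.is_active():
            ready = ts.get_ready()                  # all nodes whose dependencies are done
            futures = {pool.submit(run_node, n): n for n in ready}
            for fut, node in futures.items():
                fut.result()                        # wait for the node, surface errors
                ts.done(node)                       # unlock its dependents

# Example: a load node feeding two independent filters that can overlap.
execute_graph({"load": set(), "blur": {"load"}, "sharpen": {"load"}}, print)
```

If automatic detection turns out to be too fragile, the same runner could consume a manually specified grouping per workflow, as suggested above.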