
[QUESTION] How does Flux handle hardware resource contention? #39

Open
chenhongyu2048 opened this issue Sep 5, 2024 · 3 comments

@chenhongyu2048

Your question

I'm puzzled about how Flux handles computation and communication competing for hardware resources when they overlap.

In my own project, when I launch a GEMM kernel and an NCCL kernel at the same time, the NCCL kernel is often delayed until the GEMM completes, because there are not enough free SMs.
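
To make this concrete, here is roughly the launch pattern I am using (a simplified, single-GPU sketch; the sizes and buffers are made up for illustration, and the real code runs across multiple ranks):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Simplified sketch of the overlap I am attempting: a GEMM on one stream and
// an NCCL all-gather on another. In practice the all-gather is frequently
// delayed until the GEMM finishes, because the GEMM occupies all SMs.
int main() {
  const int m = 4096, n = 4096, k = 4096;
  const size_t count = 1 << 20;
  const int world_size = 1;  // single GPU, just to show the launch pattern
  int dev = 0;
  cudaSetDevice(dev);

  float *d_A, *d_B, *d_C, *d_send, *d_recv;
  cudaMalloc(&d_A, sizeof(float) * m * k);
  cudaMalloc(&d_B, sizeof(float) * k * n);
  cudaMalloc(&d_C, sizeof(float) * m * n);
  cudaMalloc(&d_send, sizeof(float) * count);
  cudaMalloc(&d_recv, sizeof(float) * count * world_size);

  cublasHandle_t cublas_handle;
  cublasCreate(&cublas_handle);

  ncclComm_t comm;
  ncclCommInitAll(&comm, world_size, &dev);

  cudaStream_t compute_stream, comm_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&comm_stream);

  const float alpha = 1.0f, beta = 0.0f;

  // GEMM on its own stream.
  cublasSetStream(cublas_handle, compute_stream);
  cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
              &alpha, d_A, m, d_B, k, &beta, d_C, m);

  // NCCL collective on a separate stream, intended to overlap with the GEMM.
  ncclAllGather(d_send, d_recv, count, ncclFloat, comm, comm_stream);

  cudaStreamSynchronize(compute_stream);
  cudaStreamSynchronize(comm_stream);

  ncclCommDestroy(comm);
  cublasDestroy(cublas_handle);
  return 0;
}
```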

What I've observed in Flux's code is that in src/all_gather/ths_op/all_gather_gemm_kernel_crossnode.cc, cutlass_op->run and copy_all2all are issued separately (my understanding is that how these two are scheduled is left entirely to the GPU).

In another similar work, NanoFlow, the number of SMs available to different kernels is set explicitly to avoid this interference; see the sketch below.
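
As I understand it (a hypothetical sketch of the idea, not NanoFlow's actual code), the trick is to launch a kernel with a grid-stride loop and a capped grid size, so it occupies only a chosen number of SMs and leaves the rest free for a concurrent kernel:

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch of SM capping: a grid-stride loop lets a small, fixed
// grid cover all n elements, so the kernel occupies only as many SMs as it
// has blocks (assuming one resident block per SM).
__global__ void capped_kernel(float* x, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    x[i] = x[i] * 2.0f + 1.0f;
  }
}

int main() {
  const int n = 1 << 24;
  float* d_x;
  cudaMalloc(&d_x, n * sizeof(float));

  // Budget of 32 SMs for this kernel: launch exactly 32 blocks and leave the
  // remaining SMs available to a kernel launched on another stream.
  const int sm_budget = 32;
  capped_kernel<<<sm_budget, 256>>>(d_x, n);

  cudaDeviceSynchronize();
  cudaFree(d_x);
  return 0;
}
```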

I would like to know how Flux handles this. Thank you for your generous help.

wenlei-bao self-assigned this Sep 10, 2024
wenlei-bao added the question (Further information is requested) label Sep 10, 2024
@wenlei-bao
Collaborator

@chenhongyu2048
On the Flux-related part:

What I've observed in Flux's code is that in src/all_gather/ths_op/all_gather_gemm_kernel_crossnode.cc, cutlass_op->run and copy_all2all are issued separately (my understanding is that how these two are scheduled is left entirely to the GPU).
copy_all2all uses the CUDA copy API, which runs on the copy engine underneath, so it does not compete with the GEMM kernel for compute resources. An NCCL kernel, yes, would use SMs, but usually only a limited number; you need to assign your compute resources carefully, otherwise the kernels can compete and even end up in deadlock.
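
To illustrate the copy-engine point (just an illustration, not the actual copy_all2all implementation): a peer copy issued with cudaMemcpyPeerAsync is handled by a DMA copy engine, so it can overlap with a compute kernel without taking SMs away from it.

```cpp
#include <cuda_runtime.h>

// Illustration only: GPU 0 runs a compute kernel on one stream while a copy
// engine pushes data to GPU 1 on another stream. With peer access enabled,
// the copy goes directly over NVLink/PCIe and does not use SMs.
__global__ void busy_kernel(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i];
    for (int it = 0; it < 1000; ++it) v = v * 1.0001f + 0.5f;
    x[i] = v;
  }
}

int main() {
  const int n = 1 << 24;

  // Destination buffer on GPU 1.
  float* d_peer;
  cudaSetDevice(1);
  cudaMalloc(&d_peer, n * sizeof(float));

  // Work and source buffers on GPU 0.
  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(1, 0);
  float *d_work, *d_src;
  cudaMalloc(&d_work, n * sizeof(float));
  cudaMalloc(&d_src, n * sizeof(float));

  cudaStream_t compute_stream, copy_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&copy_stream);

  // Compute kernel occupies GPU 0's SMs on one stream ...
  busy_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_work, n);
  // ... while a copy engine moves data to GPU 1 on another stream.
  cudaMemcpyPeerAsync(d_peer, 1, d_src, 0, n * sizeof(float), copy_stream);

  cudaStreamSynchronize(compute_stream);
  cudaStreamSynchronize(copy_stream);
  return 0;
}
```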

Besides, you can play with CUDA_DEVICE_MAX_CONNECTIONS, which to some extent lets you control the kernel launch/execution order.
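
For example (a hypothetical sketch; the variable has to be in the environment before the CUDA context is created, so you can also just set it on the command line):

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical sketch: CUDA_DEVICE_MAX_CONNECTIONS sets the number of
// hardware work queues between host and device. With a single connection,
// work from different streams shares one queue and tends to be issued to
// the GPU in launch order. Equivalent to running:
//   CUDA_DEVICE_MAX_CONNECTIONS=1 ./your_program
int main() {
  setenv("CUDA_DEVICE_MAX_CONNECTIONS", "1", /*overwrite=*/1);
  cudaSetDevice(0);  // the CUDA context is created after this point
  // ... launch the GEMM and communication kernels in the desired order ...
  return 0;
}
```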

@chenhongyu2048
Author

@wenlei-bao Hello, and thanks for your advice. CUDA_DEVICE_MAX_CONNECTIONS=1 helped me a lot.
I still have a question, also about resource contention.
I'm curious whether Flux has faced memory bandwidth contention in its implementation. When ncclSend/ncclRecv overlaps with a GEMM, the two compete for memory bandwidth (even though they use different SMs and streams); this has a negligible impact on the GEMM but slows the NCCL operations down significantly.
There doesn't seem to be any mention of this in the Flux paper: the decomposed time reported in the experiments is the exposed communication time, so the effect of overlapping on the communication itself can't be read from it.

@wenlei-bao
Collaborator

@chenhongyu2048 The overlapping metric we proposed in the paper should demonstrate the overlapping effect.
You probably want to calculate the bandwidth from the measured latency. It also depends on your hardware interconnect.
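
For example, a back-of-the-envelope calculation (not from the paper; the bus-bandwidth convention below is the one nccl-tests uses for all-gather, and the numbers are made up):

```cpp
#include <cstdio>

// Back-of-the-envelope sketch: estimate the bandwidth actually achieved by
// the communication from its measured latency, then compare it against the
// interconnect peak (and against a run without the overlapped GEMM).
int main() {
  const double bytes_per_rank = 64.0 * 1024 * 1024;  // example per-rank contribution
  const int world_size = 8;                           // example: 8 GPUs
  const double latency_s = 2.5e-3;                    // measured all-gather latency

  const double total_bytes = bytes_per_rank * world_size;
  const double alg_bw = total_bytes / latency_s;                 // algorithm bandwidth, B/s
  const double bus_bw = alg_bw * (world_size - 1) / world_size;  // bus bandwidth, B/s

  printf("algorithm bandwidth: %.1f GB/s\n", alg_bw / 1e9);
  printf("bus bandwidth:       %.1f GB/s\n", bus_bw / 1e9);
  return 0;
}
```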
