Your question
Ask a clear and concise question about Flux.
I'm puzzled by how Flux handles computation and communication competing for hardware resources when they overlap.
In my own project, when I launch a GEMM kernel and an NCCL kernel at the same time, the NCCL kernel tends to be delayed until the GEMM completes because there are not enough SM resources.
What I've observed in Flux's code is that in src/all_gather/ths_op/all_gather_gemm_kernel_crossnode.cc, the cutlass_op->run and copy_all2all functions execute separately (my understanding is that the scheduling of these two is left entirely to the GPU).
In another similar work, NanoFlow, each kernel is assigned a fixed number of SMs to avoid this interference.
I would like to know how Flux handles this. Thank you for your generous help.
> What I've observed from flux's code is that in src/all_gather/ths_op/all_gather_gemm_kernel_crossnode.cc, cutlass_op->run and copy_all2all functions will execute separately (my understanding is that the scheduling of these two is entirely determined by the GPU).
copy_all2all uses CUDA memcpy APIs, which are executed by the copy (DMA) engine under the hood, so it does not compete with the GEMM kernel for compute resources. NCCL kernels do use SMs, but typically only a limited number; you need to assign your compute resources carefully, otherwise the kernels can compete and even deadlock.
Besides, you can play with CUDA_DEVICE_MAX_CONNECTIONS, which limits the number of hardware work queues and thereby gives you some control over kernel launch/execution order.
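For anyone trying this: the variable must be set before the CUDA context is created, or it is ignored. A minimal sketch (the variable name is real; the surrounding comments describe its documented effect, and the setup around it is illustrative):

```python
import os

# CUDA_DEVICE_MAX_CONNECTIONS controls how many hardware work queues
# the CUDA runtime creates. It must be set before the CUDA context is
# initialized (e.g. before the first torch.cuda call in a PyTorch
# program), otherwise the runtime ignores it.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

# With a single work queue, kernels launched on different streams are
# picked up by the GPU in launch order, which makes the relative order
# of communication and compute kernels deterministic.
print(os.environ["CUDA_DEVICE_MAX_CONNECTIONS"])
```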
@wenlei-bao Hello, and thanks for your advice. CUDA_DEVICE_MAX_CONNECTIONS=1 helped me a lot.
I still have a question, also about resource contention.
I'm curious whether Flux has ever faced memory bandwidth contention in its implementation. When ncclSend/ncclRecv overlaps with a GEMM, they compete for memory bandwidth (even though they use different SMs and streams); this has a negligible impact on the GEMM but slows the NCCL operations down significantly.
There doesn't seem to be any mention of this in the Flux paper: the decomposed time reported in the experiments is the exposed communication time, so the effect of overlapping on the communication itself can't be seen.
@chenhongyu2048 The overlapping metric we proposed in the paper should demonstrate the overlapping effect.
You probably want to calculate the bandwidth from the measured latency. It also depends on your hardware interconnect.
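As a rough illustration of that calculation: all numbers below are made up for the example; only the arithmetic (bandwidth from payload size and latency, standalone vs. overlapped) is the point.

```python
# Hypothetical measurements: estimate effective communication bandwidth
# from latency, measured once standalone and once overlapped with GEMM.
bytes_moved = 1 << 30            # 1 GiB payload per rank (assumed)
comm_alone_us = 10_000           # standalone NCCL latency, microseconds
comm_overlap_us = 18_000         # latency when overlapped with a GEMM

bw_alone_gbps = bytes_moved / (comm_alone_us * 1e-6) / 1e9
bw_overlap_gbps = bytes_moved / (comm_overlap_us * 1e-6) / 1e9
slowdown = comm_overlap_us / comm_alone_us

print(f"standalone: {bw_alone_gbps:.1f} GB/s")
print(f"overlapped: {bw_overlap_gbps:.1f} GB/s ({slowdown:.1f}x slower)")
```

Comparing the two bandwidth figures against the interconnect's peak shows how much of the slowdown is attributable to contention rather than the link itself.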