
[QUESTION] How does Flux handle hardware resource contention? #39

Open
chenhongyu2048 opened this issue Sep 5, 2024 · 3 comments

@chenhongyu2048

Your question

I'm puzzled about how Flux handles computation and communication competing for hardware resources when they overlap.

In my own project, when I launch a GEMM kernel and an NCCL kernel at the same time, the NCCL kernel is often delayed until the GEMM completes, because there are not enough free SMs.
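
To make this concrete, here is roughly the launch pattern I am using (a simplified, single-GPU sketch; the sizes and buffers are made up for illustration, and the real code runs across multiple ranks):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Simplified sketch of the overlap I am attempting: a GEMM on one stream and
// an NCCL all-gather on another. In practice the all-gather is frequently
// delayed until the GEMM finishes, because the GEMM occupies all SMs.
int main() {
  const int m = 4096, n = 4096, k = 4096;
  const size_t count = 1 << 20;
  const int world_size = 1;  // single GPU, just to show the launch pattern
  int dev = 0;
  cudaSetDevice(dev);

  float *d_A, *d_B, *d_C, *d_send, *d_recv;
  cudaMalloc(&d_A, sizeof(float) * m * k);
  cudaMalloc(&d_B, sizeof(float) * k * n);
  cudaMalloc(&d_C, sizeof(float) * m * n);
  cudaMalloc(&d_send, sizeof(float) * count);
  cudaMalloc(&d_recv, sizeof(float) * count * world_size);

  cublasHandle_t cublas_handle;
  cublasCreate(&cublas_handle);

  ncclComm_t comm;
  ncclCommInitAll(&comm, world_size, &dev);

  cudaStream_t compute_stream, comm_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&comm_stream);

  const float alpha = 1.0f, beta = 0.0f;

  // GEMM on its own stream.
  cublasSetStream(cublas_handle, compute_stream);
  cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
              &alpha, d_A, m, d_B, k, &beta, d_C, m);

  // NCCL collective on a separate stream, intended to overlap with the GEMM.
  ncclAllGather(d_send, d_recv, count, ncclFloat, comm, comm_stream);

  cudaStreamSynchronize(compute_stream);
  cudaStreamSynchronize(comm_stream);

  ncclCommDestroy(comm);
  cublasDestroy(cublas_handle);
  return 0;
}
```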

What I've observed in Flux's code is that in src/all_gather/ths_op/all_gather_gemm_kernel_crossnode.cc, cutlass_op->run and copy_all2all are issued separately (my understanding is that how these two are scheduled is left entirely to the GPU).

In another similar work, NanoFlow, the number of SMs available to different kernels is set explicitly to avoid this interference; see the sketch below.
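
As I understand it (a hypothetical sketch of the idea, not NanoFlow's actual code), the trick is to launch a kernel with a grid-stride loop and a capped grid size, so it occupies only a chosen number of SMs and leaves the rest free for a concurrent kernel:

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch of SM capping: a grid-stride loop lets a small, fixed
// grid cover all n elements, so the kernel occupies only as many SMs as it
// has blocks (assuming one resident block per SM).
__global__ void capped_kernel(float* x, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    x[i] = x[i] * 2.0f + 1.0f;
  }
}

int main() {
  const int n = 1 << 24;
  float* d_x;
  cudaMalloc(&d_x, n * sizeof(float));

  // Budget of 32 SMs for this kernel: launch exactly 32 blocks and leave the
  // remaining SMs available to a kernel launched on another stream.
  const int sm_budget = 32;
  capped_kernel<<<sm_budget, 256>>>(d_x, n);

  cudaDeviceSynchronize();
  cudaFree(d_x);
  return 0;
}
```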

I would like to know how Flux handles this. Thank you for your generous help.

wenlei-bao self-assigned this Sep 10, 2024
wenlei-bao added the question (Further information is requested) label Sep 10, 2024
@wenlei-bao
Collaborator

@chenhongyu2048
On the Flux-related part:

What I've observed in Flux's code is that in src/all_gather/ths_op/all_gather_gemm_kernel_crossnode.cc, cutlass_op->run and copy_all2all are issued separately (my understanding is that how these two are scheduled is left entirely to the GPU).
copy_all2all uses the CUDA copy API, which runs on the copy engine underneath, so it does not compete with the GEMM kernel for compute resources. An NCCL kernel, yes, would use SMs, but usually only a limited number; you need to assign your compute resources carefully, otherwise the kernels can compete and even end up in deadlock.
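
To illustrate the copy-engine point (just an illustration, not the actual copy_all2all implementation): a peer copy issued with cudaMemcpyPeerAsync is handled by a DMA copy engine, so it can overlap with a compute kernel without taking SMs away from it.

```cpp
#include <cuda_runtime.h>

// Illustration only: GPU 0 runs a compute kernel on one stream while a copy
// engine pushes data to GPU 1 on another stream. With peer access enabled,
// the copy goes directly over NVLink/PCIe and does not use SMs.
__global__ void busy_kernel(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i];
    for (int it = 0; it < 1000; ++it) v = v * 1.0001f + 0.5f;
    x[i] = v;
  }
}

int main() {
  const int n = 1 << 24;

  // Destination buffer on GPU 1.
  float* d_peer;
  cudaSetDevice(1);
  cudaMalloc(&d_peer, n * sizeof(float));

  // Work and source buffers on GPU 0.
  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(1, 0);
  float *d_work, *d_src;
  cudaMalloc(&d_work, n * sizeof(float));
  cudaMalloc(&d_src, n * sizeof(float));

  cudaStream_t compute_stream, copy_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&copy_stream);

  // Compute kernel occupies GPU 0's SMs on one stream ...
  busy_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_work, n);
  // ... while a copy engine moves data to GPU 1 on another stream.
  cudaMemcpyPeerAsync(d_peer, 1, d_src, 0, n * sizeof(float), copy_stream);

  cudaStreamSynchronize(compute_stream);
  cudaStreamSynchronize(copy_stream);
  return 0;
}
```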

Besides, you can play with CUDA_DEVICE_MAX_CONNECTIONS, which to some extent lets you control the kernel launch/execution order.
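
For example (a hypothetical sketch; the variable has to be in the environment before the CUDA context is created, so you can also just set it on the command line):

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical sketch: CUDA_DEVICE_MAX_CONNECTIONS sets the number of
// hardware work queues between host and device. With a single connection,
// work from different streams shares one queue and tends to be issued to
// the GPU in launch order. Equivalent to running:
//   CUDA_DEVICE_MAX_CONNECTIONS=1 ./your_program
int main() {
  setenv("CUDA_DEVICE_MAX_CONNECTIONS", "1", /*overwrite=*/1);
  cudaSetDevice(0);  // the CUDA context is created after this point
  // ... launch the GEMM and communication kernels in the desired order ...
  return 0;
}
```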

@chenhongyu2048
Author

@wenlei-bao Hello, and thanks for your advice. CUDA_DEVICE_MAX_CONNECTIONS=1 helped me a lot.
I still have a question, also about resource contention.
I'm curious whether Flux has faced memory bandwidth contention in its implementation. When ncclSend/ncclRecv overlaps with a GEMM, the two compete for memory bandwidth (even though they use different SMs and streams); this has a negligible impact on the GEMM but slows the NCCL operations down significantly.
There doesn't seem to be any mention of this in the Flux paper: the decomposed time reported in the experiments is the exposed communication time, so the effect of overlapping on the communication itself can't be read from it.

@wenlei-bao
Collaborator

@chenhongyu2048 The overlapping metric we proposed in the paper should demonstrate the overlapping effect.
You probably want to calculate the bandwidth from the measured latency. It also depends on your hardware interconnect.
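
For example, a back-of-the-envelope calculation (not from the paper; the bus-bandwidth convention below is the one nccl-tests uses for all-gather, and the numbers are made up):

```cpp
#include <cstdio>

// Back-of-the-envelope sketch: estimate the bandwidth actually achieved by
// the communication from its measured latency, then compare it against the
// interconnect peak (and against a run without the overlapped GEMM).
int main() {
  const double bytes_per_rank = 64.0 * 1024 * 1024;  // example per-rank contribution
  const int world_size = 8;                           // example: 8 GPUs
  const double latency_s = 2.5e-3;                    // measured all-gather latency

  const double total_bytes = bytes_per_rank * world_size;
  const double alg_bw = total_bytes / latency_s;                 // algorithm bandwidth, B/s
  const double bus_bw = alg_bw * (world_size - 1) / world_size;  // bus bandwidth, B/s

  printf("algorithm bandwidth: %.1f GB/s\n", alg_bw / 1e9);
  printf("bus bandwidth:       %.1f GB/s\n", bus_bw / 1e9);
  return 0;
}
```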
