[QUESTION] The gemm time on GPUs of different ranks under tp8 is very different, causing low performance #36
Comments
You can check whether the frequencies of the different GPUs on the server are the same; some GPUs might have downclocked.
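For example (a minimal sketch; it only assumes nvidia-smi is on the PATH of the node running the 8 ranks):

```python
# Sketch: compare SM clocks across all GPUs on the node. A rank whose GPU is
# thermally throttled or power-capped will show clocks.sm well below clocks.max.sm.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,clocks.sm,clocks.max.sm,temperature.gpu,power.draw",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
for line in out.strip().splitlines():
    print(line)  # e.g. "0, 1410 MHz, 1410 MHz, 52, 210.33 W"
```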
How many times did Flux loop in this profile?
About 500 @zheng-ningxin
Do you also observe this phenomenon when you use torch.profiler?
@Rainlin007 The difference you showed in your profiling does look quite big. Of your 500 runs, which iteration does this screenshot belong to? Maybe check the later ones to see whether this is stable or only shows up occasionally.
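Something like this could capture only later iterations with torch.profiler (a minimal sketch; run_iteration() is a hypothetical stand-in for one gemm_rs step of your project):

```python
import os
import torch
from torch.profiler import profile, ProfilerActivity, schedule

rank = int(os.environ.get("RANK", 0))

# Skip the first 400 steps and record a few later ones, so we can see whether
# the per-rank gemm/barrier skew is stable or only shows up early.
prof_schedule = schedule(wait=400, warmup=5, active=10, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
    on_trace_ready=torch.profiler.tensorboard_trace_handler(f"./prof_rank{rank}"),
) as prof:
    for _ in range(500):
        run_iteration()  # hypothetical: one gemm_rs step of the project
        prof.step()
```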
Your question
Ask a clear and concise question about Flux.
There is a torch.Size([5120, 1024]) x torch.Size([8192, 1024]) gemm_rs op (fp16) in my project. I ran a benchmark on A100:
torch.Size([5120, 1024]) x torch.Size([8192, 1024]):
torch #0: gemm 0.358 ms, comm 0.416 ms, total 0.774 ms
torch #1: gemm 0.357 ms, comm 0.416 ms, total 0.773 ms
torch #2: gemm 0.354 ms, comm 0.418 ms, total 0.772 ms
torch #3: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms
torch #4: gemm 0.359 ms, comm 0.414 ms, total 0.773 ms
torch #5: gemm 0.355 ms, comm 0.418 ms, total 0.772 ms
torch #6: gemm 0.361 ms, comm 0.412 ms, total 0.773 ms
torch #7: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms
flux #0: gemm 0.386 ms, comm 0.138 ms, total 0.524 ms
flux #1: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
flux #2: gemm 0.382 ms, comm 0.142 ms, total 0.523 ms
flux #3: gemm 0.384 ms, comm 0.139 ms, total 0.523 ms
flux #4: gemm 0.387 ms, comm 0.136 ms, total 0.523 ms
flux #5: gemm 0.383 ms, comm 0.140 ms, total 0.523 ms
flux #6: gemm 0.388 ms, comm 0.135 ms, total 0.523 ms
flux #7: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
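For reference, here is a minimal sketch of how a per-rank torch baseline like the one above can be timed with CUDA events (shapes taken from the question; launched with one process per GPU via torchrun; the names are mine, not from Flux's benchmark script):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank)

# Per-rank local gemm [5120, 1024] x [1024, 8192] in fp16, followed by a
# reduce-scatter over the ranks (the "torch" baseline in the numbers above).
x = torch.randn(5120, 1024, dtype=torch.float16, device="cuda")
w = torch.randn(8192, 1024, dtype=torch.float16, device="cuda")
out_full = torch.empty(5120, 8192, dtype=torch.float16, device="cuda")
out_rs = torch.empty(5120 // world, 8192, dtype=torch.float16, device="cuda")

start, mid, end = (torch.cuda.Event(enable_timing=True) for _ in range(3))
for _ in range(10):  # warmup
    torch.matmul(x, w.t(), out=out_full)
    dist.reduce_scatter_tensor(out_rs, out_full)

torch.cuda.synchronize()
start.record()
torch.matmul(x, w.t(), out=out_full)
mid.record()
dist.reduce_scatter_tensor(out_rs, out_full)
end.record()
torch.cuda.synchronize()
print(f"rank {rank}: gemm {start.elapsed_time(mid):.3f} ms, "
      f"comm {mid.elapsed_time(end):.3f} ms")
```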
But in my project, flux takes over 900 us, and my nsys results are:
(screenshot: my proj)
(screenshot: benchmark)
We can see that the bytedance::flux::CudaIpcBarrierAllKernel times are not the same across ranks. How can I solve this problem?
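One interpretation worth ruling out (my assumption, not something from Flux's docs): if one rank launches its gemm_rs later or its GEMM runs slower, the faster ranks spend the difference waiting inside CudaIpcBarrierAllKernel, so the barrier kernel looks long on some ranks even though the barrier itself is cheap. A rough sketch for checking arrival skew right before the flux call:

```python
import time
import torch
import torch.distributed as dist

def report_arrival_skew(tag=""):
    # Record when this rank's stream has drained just before the flux call,
    # then gather all ranks' timestamps and print the spread on rank 0.
    # Assumes all ranks are on one node, so perf_counter shares a clock base.
    torch.cuda.synchronize()
    t = torch.tensor([time.perf_counter()], dtype=torch.float64, device="cuda")
    ts = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(ts, t)
    if dist.get_rank() == 0:
        vals = [v.item() for v in ts]
        print(f"{tag} arrival skew: {(max(vals) - min(vals)) * 1e3:.3f} ms")
```

If the skew is on the order of the extra 400 us you see, the barrier time difference is a symptom of rank imbalance in the surrounding work rather than a problem in the barrier kernel itself.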