
[QUESTION] The gemm time on GPUs of different ranks under tp8 varies widely, causing low performance #36

Open
Rainlin007 opened this issue Aug 22, 2024 · 8 comments

Comments

@Rainlin007 commented Aug 22, 2024

Your question

There is a torch.Size([5120, 1024]) x torch.Size([8192, 1024]) gemm_rs op (fp16) in my project. I ran a benchmark on A100:

torch.Size([5120, 1024]) x torch.Size([8192, 1024]):
torch #0: gemm 0.358 ms, comm 0.416 ms, total 0.774 ms
torch #1: gemm 0.357 ms, comm 0.416 ms, total 0.773 ms
torch #2: gemm 0.354 ms, comm 0.418 ms, total 0.772 ms
torch #3: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms
torch #4: gemm 0.359 ms, comm 0.414 ms, total 0.773 ms
torch #5: gemm 0.355 ms, comm 0.418 ms, total 0.772 ms
torch #6: gemm 0.361 ms, comm 0.412 ms, total 0.773 ms
torch #7: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms

flux #0: gemm 0.386 ms, comm 0.138 ms, total 0.524 ms
flux #1: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
flux #2: gemm 0.382 ms, comm 0.142 ms, total 0.523 ms
flux #3: gemm 0.384 ms, comm 0.139 ms, total 0.523 ms
flux #4: gemm 0.387 ms, comm 0.136 ms, total 0.523 ms
flux #5: gemm 0.383 ms, comm 0.140 ms, total 0.523 ms
flux #6: gemm 0.388 ms, comm 0.135 ms, total 0.523 ms
flux #7: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
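
For reference, a minimal sketch of what the per-rank torch baseline numbers above measure (not the actual benchmark script): each rank runs a local fp16 GEMM and then a reduce-scatter, both timed with CUDA events. The function name bench_torch_gemm_rs and the iteration count are placeholders of mine; it assumes torch.distributed has already been initialized with one process per GPU.

import torch
import torch.distributed as dist

def bench_torch_gemm_rs(iters: int = 100):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank)

    x = torch.randn(5120, 1024, dtype=torch.float16, device="cuda")
    w = torch.randn(8192, 1024, dtype=torch.float16, device="cuda")
    out = torch.empty(5120 // world_size, 8192, dtype=torch.float16, device="cuda")

    start, mid, end = (torch.cuda.Event(enable_timing=True) for _ in range(3))
    gemm_ms = comm_ms = 0.0
    for _ in range(iters):
        start.record()
        y = torch.matmul(x, w.t())          # local partial GEMM on this rank
        mid.record()
        dist.reduce_scatter_tensor(out, y)  # sum partials, scatter rows across ranks
        end.record()
        torch.cuda.synchronize()
        gemm_ms += start.elapsed_time(mid)
        comm_ms += mid.elapsed_time(end)
    print(f"rank {rank}: gemm {gemm_ms / iters:.3f} ms, comm {comm_ms / iters:.3f} ms")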

But in my project, the flux elapsed time is over 900 us. My nsys results are:

my project: [nsys timeline screenshot]

benchmark: [nsys timeline screenshot]

We can see that the bytedance::flux::CudaIpcBarrierAllKernel times are not the same. How can I solve this problem?

@Rainlin007 (Author)

I saw that the gemm time on different GPUs was different, which led to a subsequent increase in synchronization time, but I could not find the reason for the difference in gemm time.

[nsys screenshot]

@zheng-ningxin zheng-ningxin self-assigned this Aug 26, 2024
@zheng-ningxin (Collaborator)

You can check whether the frequencies of different GPUs on the server are the same; some GPUs might have downclocked.
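
In case it helps, a minimal sketch of that frequency check using pynvml (the nvidia-ml-py package); it reads the current and maximum SM clock of every GPU, equivalent to querying clocks.sm with nvidia-smi:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)      # current SM clock (MHz)
    max_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)  # max SM clock (MHz)
    print(f"GPU {i}: SM clock {sm_clock} MHz (max {max_clock} MHz)")
pynvml.nvmlShutdown()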

@Rainlin007 (Author)

I checked the frequency before, and it is indeed the same. I can also see that the times of other kernels are similar across GPUs; only this kernel shows a big difference. Have you encountered this before?
[nsys screenshot]

@Rainlin007 (Author)

@zheng-ningxin

@zheng-ningxin (Collaborator)

How many times did Flux loop in this profile?

@Rainlin007 commented Aug 26, 2024

> How many times did Flux loop in this profile?

about 500 @zheng-ningxin

@zheng-ningxin (Collaborator)

Do you also observe this phenomenon when you use torch.profiler?
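
A minimal sketch of such a torch.profiler check (run_one_step is a hypothetical placeholder for whatever launches the flux gemm_rs op in the project):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        run_one_step()          # hypothetical: one forward step that calls the flux gemm_rs op
    torch.cuda.synchronize()

# Compare the GEMM and CudaIpcBarrierAllKernel rows across ranks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))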

@wenlei-bao (Collaborator)

> I checked the frequency before, and it is indeed the same. I can also see that the times of other kernels are similar across GPUs; only this kernel shows a big difference. Have you encountered this before?

@Rainlin007
For long runs, the GPU might adjust its frequency. You can use a tool like nvidia-smi to monitor the frequency and do some sampling to check for changes.

The difference you showed in your profiling does look quite big. Of your 500 runs, which iteration does this screenshot belong to? Maybe check the later ones to see whether the gap is stable or only shows up occasionally. A sketch of such a check follows below.
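
One way to check that (a sketch; flux_gemm_rs_step is a hypothetical placeholder for the project's flux gemm_rs call) is to time every iteration with CUDA events and compare early and late iterations per rank:

import torch

times_ms = []
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for it in range(500):
    start.record()
    flux_gemm_rs_step()        # hypothetical wrapper around the flux gemm_rs op
    end.record()
    torch.cuda.synchronize()
    times_ms.append(start.elapsed_time(end))

# If the slowdown only appears later in the run, the two averages will drift apart.
print("first 100 iters avg:", sum(times_ms[:100]) / 100, "ms")
print("last 100 iters avg:", sum(times_ms[-100:]) / 100, "ms")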

@wenlei-bao wenlei-bao added the question Further information is requested label Aug 29, 2024