
[QUESTION] The gemm time on GPUs of different ranks under tp8 varies widely, causing low performance #36

Open
Rainlin007 opened this issue Aug 22, 2024 · 8 comments

Comments

@Rainlin007 commented Aug 22, 2024

Your question

There is a torch.Size([5120, 1024]) x torch.Size([8192, 1024]) gemm_rs op (fp16) in my project. I ran a benchmark on A100:

torch.Size([5120, 1024]) x torch.Size([8192, 1024]):
torch #0: gemm 0.358 ms, comm 0.416 ms, total 0.774 ms
torch #1: gemm 0.357 ms, comm 0.416 ms, total 0.773 ms
torch #2: gemm 0.354 ms, comm 0.418 ms, total 0.772 ms
torch #3: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms
torch #4: gemm 0.359 ms, comm 0.414 ms, total 0.773 ms
torch #5: gemm 0.355 ms, comm 0.418 ms, total 0.772 ms
torch #6: gemm 0.361 ms, comm 0.412 ms, total 0.773 ms
torch #7: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms

flux #0: gemm 0.386 ms, comm 0.138 ms, total 0.524 ms
flux #1: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
flux #2: gemm 0.382 ms, comm 0.142 ms, total 0.523 ms
flux #3: gemm 0.384 ms, comm 0.139 ms, total 0.523 ms
flux #4: gemm 0.387 ms, comm 0.136 ms, total 0.523 ms
flux #5: gemm 0.383 ms, comm 0.140 ms, total 0.523 ms
flux #6: gemm 0.388 ms, comm 0.135 ms, total 0.523 ms
flux #7: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
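
For reference, a minimal sketch of what the per-rank torch baseline numbers above measure (not the actual benchmark script): each rank runs a local fp16 GEMM and then a reduce-scatter, both timed with CUDA events. The function name bench_torch_gemm_rs and the iteration count are placeholders of mine; it assumes torch.distributed has already been initialized with one process per GPU.

import torch
import torch.distributed as dist

def bench_torch_gemm_rs(iters: int = 100):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank)

    x = torch.randn(5120, 1024, dtype=torch.float16, device="cuda")
    w = torch.randn(8192, 1024, dtype=torch.float16, device="cuda")
    out = torch.empty(5120 // world_size, 8192, dtype=torch.float16, device="cuda")

    start, mid, end = (torch.cuda.Event(enable_timing=True) for _ in range(3))
    gemm_ms = comm_ms = 0.0
    for _ in range(iters):
        start.record()
        y = torch.matmul(x, w.t())          # local partial GEMM on this rank
        mid.record()
        dist.reduce_scatter_tensor(out, y)  # sum partials, scatter rows across ranks
        end.record()
        torch.cuda.synchronize()
        gemm_ms += start.elapsed_time(mid)
        comm_ms += mid.elapsed_time(end)
    print(f"rank {rank}: gemm {gemm_ms / iters:.3f} ms, comm {comm_ms / iters:.3f} ms")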

But in my project, the flux elapsed time is over 900 us. My nsys results are:

my project: [nsys timeline screenshot]

benchmark: [nsys timeline screenshot]

We can see that the bytedance::flux::CudaIpcBarrierAllKernel times are not the same. How can I solve this problem?

@Rainlin007 (Author)

I saw that the gemm time on different GPUs was different, which led to a subsequent increase in synchronization time, but I could not find the reason for the difference in gemm time.

[nsys screenshot]

@zheng-ningxin zheng-ningxin self-assigned this Aug 26, 2024
@zheng-ningxin (Collaborator)

You can check whether the frequencies of different GPUs on the server are the same; some GPUs might have downclocked.
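
In case it helps, a minimal sketch of that frequency check using pynvml (the nvidia-ml-py package); it reads the current and maximum SM clock of every GPU, equivalent to querying clocks.sm with nvidia-smi:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)      # current SM clock (MHz)
    max_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)  # max SM clock (MHz)
    print(f"GPU {i}: SM clock {sm_clock} MHz (max {max_clock} MHz)")
pynvml.nvmlShutdown()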

@Rainlin007 (Author)

I checked the frequency before, and it is indeed the same. I can also see that the times of other kernels are similar across GPUs; only this kernel shows a big difference. Have you encountered this before?
[nsys screenshot]

@Rainlin007 (Author)

@zheng-ningxin

@zheng-ningxin (Collaborator)

How many times did Flux loop in this profile?

@Rainlin007 commented Aug 26, 2024

> How many times did Flux loop in this profile?

about 500 @zheng-ningxin

@zheng-ningxin (Collaborator)

Do you also observe this phenomenon when you use torch.profiler?
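
A minimal sketch of such a torch.profiler check (run_one_step is a hypothetical placeholder for whatever launches the flux gemm_rs op in the project):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        run_one_step()          # hypothetical: one forward step that calls the flux gemm_rs op
    torch.cuda.synchronize()

# Compare the GEMM and CudaIpcBarrierAllKernel rows across ranks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))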

@wenlei-bao (Collaborator)

> I checked the frequency before, and it is indeed the same. I can also see that the times of other kernels are similar across GPUs; only this kernel shows a big difference. Have you encountered this before?

@Rainlin007
For long runs, the GPU might adjust its frequency. You can use a tool like nvidia-smi to monitor the frequency and do some sampling to check for changes.

The difference you showed in your profiling does look quite big. Of your 500 runs, which iteration does this screenshot belong to? Maybe check the later ones to see whether the gap is stable or only shows up occasionally. A sketch of such a check follows below.
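
One way to check that (a sketch; flux_gemm_rs_step is a hypothetical placeholder for the project's flux gemm_rs call) is to time every iteration with CUDA events and compare early and late iterations per rank:

import torch

times_ms = []
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for it in range(500):
    start.record()
    flux_gemm_rs_step()        # hypothetical wrapper around the flux gemm_rs op
    end.record()
    torch.cuda.synchronize()
    times_ms.append(start.elapsed_time(end))

# If the slowdown only appears later in the run, the two averages will drift apart.
print("first 100 iters avg:", sum(times_ms[:100]) / 100, "ms")
print("last 100 iters avg:", sum(times_ms[-100:]) / 100, "ms")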

@wenlei-bao wenlei-bao added the question Further information is requested label Aug 29, 2024