Abnormal Performance in Multi-GPU Testing caused by All2All #199
Comments
Thanks for your feedback. The problem is caused by cp parallelism, which is now disabled in #205. For small videos the speedup will be lower; for longer videos (as in the PAB paper) the speedup will be near linear. I have tested open_sora (run_base) on A100, and the speedup is 3.5x for 4 GPUs. As for the all_to_all problem you mentioned: when I tested on H100 earlier, all2all accounted for only about 5-8% of the total time, even with 8 GPUs. If you still see this problem, I can test on A800 later.
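For context, here is a rough sketch of how the all2all share caps the attainable speedup (assuming communication is not overlapped with compute; the fractions are illustrative, not measurements from this issue):

```python
# If a fraction f of the multi-GPU wall time is spent in all_to_all (and is not
# overlapped with compute), the attainable speedup on N GPUs is roughly N * (1 - f).
# The fractions below are illustrative assumptions, not measurements from this issue.
for n_gpus, comm_frac in [(8, 0.05), (8, 0.08), (8, 0.50)]:
    print(f"{n_gpus} GPUs, all2all share {comm_frac:.0%}: "
          f"speedup <= ~{n_gpus * (1 - comm_frac):.1f}x")
```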
Closing due to inactivity.
We find that all2all is much slower on H800 and A800 due to their lower NVLink bandwidth. We may optimize this in the future.
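As a rough illustration of the bandwidth effect (the interconnect and tensor-shape numbers below are assumptions for the sake of the example, not measurements from this thread):

```python
# Back-of-envelope estimate of one all_to_all over a sequence-parallel activation.
# All shapes and bandwidths below are illustrative assumptions.

def all_to_all_ms(seq_len, hidden, num_gpus, bytes_per_elem=2, link_gb_s=400):
    """Ideal, bandwidth-bound time of one all_to_all in milliseconds.

    Each rank holds a (seq_len / num_gpus) x hidden fp16 shard and exchanges a
    (num_gpus - 1) / num_gpus fraction of it with its peers. Real runs are slower:
    small per-rank messages are latency-bound and reach far below peak bandwidth.
    """
    local_bytes = seq_len // num_gpus * hidden * bytes_per_elem
    exchanged = local_bytes * (num_gpus - 1) / num_gpus
    return exchanged / (link_gb_s * 1e9) * 1e3

# Hypothetical activation shape; per-GPU NVLink is roughly 600 GB/s on A100
# versus roughly 400 GB/s on A800/H800 (peak figures, assumption).
for label, bw in [("A100-like", 600), ("A800-like", 400)]:
    t = all_to_all_ms(seq_len=8192, hidden=1152, num_gpus=8, link_gb_s=bw)
    print(f"{label}: ~{t:.3f} ms per all_to_all")
```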
Hi, thanks for your great work!
I ran the DSP and PAB examples from examples/latte on A800 GPUs. The results I obtained are as follows:
Compared to single-GPU performance, the multi-GPU setup did not achieve the expected speedup.
Furthermore, I examined the trace profile of the inference process and found that the all-to-all communication in Dynamic Switch consumed more time than computation:
Communication overhead using 4x A800s ⬇️
Communication overhead using 8x A800s ⬇️ It is completely communication-bound!
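For reference, a trace like the ones above can be captured with a sketch along these lines (`run_inference` is a placeholder for the actual example's sampling call, not an API from this repo):

```python
# Minimal profiling sketch; run on each rank (or rank 0 only) of the multi-GPU job.
import torch
from torch.profiler import profile, ProfilerActivity

def run_inference():
    # Placeholder: swap in the Latte/DSP example's sampling call here.
    x = torch.randn(1024, 1024, device="cuda")
    for _ in range(10):
        x = x @ x

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    run_inference()
    torch.cuda.synchronize()  # make sure all CUDA kernels are captured

# Compare time spent in NCCL all_to_all kernels vs. compute kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("latte_multi_gpu_trace.json")  # open in Perfetto / chrome://tracing
```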
Could you please explain this situation? How would you interpret these results, particularly the unexpected performance of the multi-GPU setup and the disproportionate time spent on communication versus computation?
Additional Consideration:
Given that a single GPU can already complete inference for a single Latte video sample, it is understandable that the process becomes communication-bound in a multi-GPU setup: splitting the work shrinks the per-GPU compute while the communication overhead remains. I suspect that increasing the computational load might be necessary to achieve the expected performance on multiple GPUs. However, Latte doesn't support modifying video duration or resolution.
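To make that intuition concrete, here is a rough scaling sketch (the FLOP and byte counts are coarse assumptions about one sequence-parallel attention layer, not numbers taken from the code):

```python
# Per transformer layer, attention FLOPs grow roughly as O(S^2 * d) with sequence
# length S, while the all_to_all volume in sequence parallelism grows only as
# O(S * d), so the compute-to-communication ratio grows roughly linearly with S.
# The constants below are coarse assumptions, not measurements.

def flops_per_comm_byte(seq_len, hidden, num_gpus):
    attn_flops = 4 * seq_len**2 * hidden      # QK^T and attn @ V (rough)
    proj_flops = 8 * seq_len * hidden**2      # Q, K, V, O projections (rough)
    # Assume two all_to_alls around attention; each rank exchanges most of its
    # local (seq_len / num_gpus) x hidden fp16 shard.
    comm_bytes = 2 * (seq_len / num_gpus) * hidden * 2 * (num_gpus - 1) / num_gpus
    return (attn_flops + proj_flops) / num_gpus / comm_bytes

base = flops_per_comm_byte(2_048, 1152, 8)
for s in (2_048, 8_192, 32_768):
    ratio = flops_per_comm_byte(s, 1152, 8) / base
    print(f"S = {s:>6}: compute/comm ~{ratio:.1f}x the S = 2048 value")
```

Under these assumptions, growing the sequence length steadily raises the compute-to-communication ratio, which would be consistent with the comment above that longer videos scale nearly linearly.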
Questions:
Do you have any suggestions for modifying the computational load in your code?
Are there any other parameters or settings we can adjust to better balance computation and communication in multi-GPU scenarios?
Is this behavior expected for small video samples, and if so, what would be the recommended minimum sample size or computational load to effectively utilize multiple GPUs?
Any insights or guidance on optimizing performance for multi-GPU setups with Latte would be greatly appreciated.