
Question regarding MLA DP+TP parallel during inference #215

Open · TheTinyTeddy opened this issue Jan 3, 2025 · 2 comments

@TheTinyTeddy

Thank you for open-sourcing this great work!

I was wondering if you could provide a bit more detail on how you approach MLA parallelism. When you mention TP, do you mean partitioning only the projection weights (e.g. W_q1, W_q2, W_kv1, W_kv2, W_o)? And for DP, do you mean partitioning the batch of tokens, as in https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models ?
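For reference, a minimal NumPy sketch of the two schemes contrasted here. The weight name W_q1 is borrowed from the question above; all shapes and the TP/DP degrees are made up for illustration, and this is not the DeepSeek demo code:

```python
import numpy as np

d_model, d_lora, n_tp = 64, 16, 4           # illustrative sizes; TP degree 4

# --- TP: each rank holds a column slice of a projection weight ---
W_q1 = np.random.randn(d_model, d_lora)     # stand-in for one projection, e.g. W_q1
W_q1_shards = np.split(W_q1, n_tp, axis=1)  # rank i would store W_q1_shards[i]

x = np.random.randn(8, d_model)             # the same 8 tokens live on every TP rank
partial = [x @ w for w in W_q1_shards]      # each rank computes its output slice
assert np.allclose(np.concatenate(partial, axis=1), x @ W_q1)  # a gather restores the full output

# --- DP: each rank holds the full weight but only a slice of the batch ---
n_dp = 4
batch = np.random.randn(32, d_model)        # 32 requests in the running batch
token_shards = np.split(batch, n_dp, axis=0)
dp_outs = [t @ W_q1 for t in token_shards]  # no collective needed inside attention
assert np.allclose(np.concatenate(dp_outs, axis=0), batch @ W_q1)
```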

@GeeeekExplorer
Contributor

In our demo, we implemented a distributed parallel strategy with DP=1, TP=EP=N. In sglang, it seems that the strategy is TP=1, DP=EP=N.
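As a rough illustration of the two layouts described in this reply (my reading only, with made-up sizes; not code from either project): with N GPUs, the demo shards attention heads across all ranks and keeps the whole batch on every rank, while the sglang-style layout shards the batch and keeps the full attention weights on every rank; the MoE experts are sharded across all N ranks in both cases.

```python
N = 8                    # total GPUs in the serving group (illustrative)
n_heads = 128            # MLA attention heads
n_requests = 256         # requests in the running batch
n_experts = 256          # routed MoE experts

def show(dp, tp):
    assert dp * tp == N
    print(f"DP={dp}, TP={tp}, EP={N}:")
    print(f"  attention heads per GPU : {n_heads // tp}")     # TP shards heads/weights
    print(f"  requests per GPU        : {n_requests // dp}")  # DP shards the batch / KV cache
    print(f"  experts per GPU         : {n_experts // N}")    # EP shards experts in both cases

show(dp=1, tp=N)   # the demo in this repo: batch seen by all ranks, attention sharded
show(dp=N, tp=1)   # sglang-style: batch sharded, attention weights replicated
```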

@TheTinyTeddy
Author

TheTinyTeddy commented Jan 6, 2025

> In our demo, we implemented a distributed parallel strategy with DP=1, TP=EP=N. In sglang, it seems that the strategy is TP=1, DP=EP=N.

Thank you for the reply!

For decoding, it is mentioned that "The attention part employs TP4 with SP, combined with DP80". Could I ask what the SP is used for in this case? (Honestly, I think SP and DP are similar when using continuous batching.)
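For what it's worth, a small sketch of the distinction behind this question (my interpretation, with illustrative shapes only): DP splits the batch dimension across ranks, while SP splits the sequence/token dimension of the activations.

```python
import numpy as np

ranks = 4
# (batch, seq_len, d_model) activations; shapes are illustrative only
hidden = np.random.randn(16, 1024, 64)

dp_shards = np.split(hidden, ranks, axis=0)  # DP: each rank gets 4 whole requests
sp_shards = np.split(hidden, ranks, axis=1)  # SP: each rank gets 256 tokens of every request

# During decode with continuous batching, each request contributes a single new token
# per step, so "a slice of tokens" (SP) and "a slice of requests" (DP) end up looking
# similar -- which seems to be the point of the parenthetical above.
print(dp_shards[0].shape, sp_shards[0].shape)  # (4, 1024, 64) (16, 256, 64)
```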
