
Question regarding MLA DP+TP parallel during inference #215

Open · TheTinyTeddy opened this issue Jan 3, 2025 · 2 comments

@TheTinyTeddy

Thank you for open-sourcing this great work!

I was wondering if you could provide a bit more detail on how you approach MLA parallelism. When you mention TP, do you mean partitioning only the projection weights (e.g. W_q1, W_q2, W_kv1, W_kv2, W_o)? And for DP, do you mean partitioning the batch of tokens, as in https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models ?
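For reference, a minimal NumPy sketch of the two schemes contrasted here. The weight name W_q1 is borrowed from the question above; all shapes and the TP/DP degrees are made up for illustration, and this is not the DeepSeek demo code:

```python
import numpy as np

d_model, d_lora, n_tp = 64, 16, 4           # illustrative sizes; TP degree 4

# --- TP: each rank holds a column slice of a projection weight ---
W_q1 = np.random.randn(d_model, d_lora)     # stand-in for one projection, e.g. W_q1
W_q1_shards = np.split(W_q1, n_tp, axis=1)  # rank i would store W_q1_shards[i]

x = np.random.randn(8, d_model)             # the same 8 tokens live on every TP rank
partial = [x @ w for w in W_q1_shards]      # each rank computes its output slice
assert np.allclose(np.concatenate(partial, axis=1), x @ W_q1)  # a gather restores the full output

# --- DP: each rank holds the full weight but only a slice of the batch ---
n_dp = 4
batch = np.random.randn(32, d_model)        # 32 requests in the running batch
token_shards = np.split(batch, n_dp, axis=0)
dp_outs = [t @ W_q1 for t in token_shards]  # no collective needed inside attention
assert np.allclose(np.concatenate(dp_outs, axis=0), batch @ W_q1)
```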

@GeeeekExplorer
Contributor

In our demo, we implemented a distributed parallel strategy with DP=1, TP=EP=N. In sglang, it seems that the strategy is TP=1, DP=EP=N.
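As a rough illustration of the two layouts described in this reply (my reading only, with made-up sizes; not code from either project): with N GPUs, the demo shards attention heads across all ranks and keeps the whole batch on every rank, while the sglang-style layout shards the batch and keeps the full attention weights on every rank; the MoE experts are sharded across all N ranks in both cases.

```python
N = 8                    # total GPUs in the serving group (illustrative)
n_heads = 128            # MLA attention heads
n_requests = 256         # requests in the running batch
n_experts = 256          # routed MoE experts

def show(dp, tp):
    assert dp * tp == N
    print(f"DP={dp}, TP={tp}, EP={N}:")
    print(f"  attention heads per GPU : {n_heads // tp}")     # TP shards heads/weights
    print(f"  requests per GPU        : {n_requests // dp}")  # DP shards the batch / KV cache
    print(f"  experts per GPU         : {n_experts // N}")    # EP shards experts in both cases

show(dp=1, tp=N)   # the demo in this repo: batch seen by all ranks, attention sharded
show(dp=N, tp=1)   # sglang-style: batch sharded, attention weights replicated
```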

@TheTinyTeddy
Author

TheTinyTeddy commented Jan 6, 2025

> In our demo, we implemented a distributed parallel strategy with DP=1, TP=EP=N. In sglang, it seems that the strategy is TP=1, DP=EP=N.

Thank you for the reply!

For decoding, it is mentioned that "The attention part employs TP4 with SP, combined with DP80". Could I ask what the SP is used for in this case? (Honestly, I think SP and DP are similar when using continuous batching.)
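For what it's worth, a small sketch of the distinction behind this question (my interpretation, with illustrative shapes only): DP splits the batch dimension across ranks, while SP splits the sequence/token dimension of the activations.

```python
import numpy as np

ranks = 4
# (batch, seq_len, d_model) activations; shapes are illustrative only
hidden = np.random.randn(16, 1024, 64)

dp_shards = np.split(hidden, ranks, axis=0)  # DP: each rank gets 4 whole requests
sp_shards = np.split(hidden, ranks, axis=1)  # SP: each rank gets 256 tokens of every request

# During decode with continuous batching, each request contributes a single new token
# per step, so "a slice of tokens" (SP) and "a slice of requests" (DP) end up looking
# similar -- which seems to be the point of the parenthetical above.
print(dp_shards[0].shape, sp_shards[0].shape)  # (4, 1024, 64) (16, 256, 64)
```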
