In our demo, we implemented a distributed parallel strategy with DP=1, TP=EP=N. In sglang, it seems that the strategy is TP=1, DP=EP=N.
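For concreteness, here is a toy sketch (my own illustration, not code from either project) of how I understand the two layouts to differ for the attention weights and the token batch: under DP=1, TP=EP=N the attention projections are sharded and the batch is replicated, while under TP=1, DP=EP=N the weights are replicated and the batch of requests is sharded. All names and dimensions below are made up.

```python
import numpy as np

N = 4                                  # number of GPUs
d_model, d_head_total = 1024, 1024     # hypothetical model dimensions
n_tokens = 64                          # tokens in the current batch

W_o = np.zeros((d_head_total, d_model))    # stand-in for an attention projection
tokens = np.zeros((n_tokens, d_model))     # stand-in for the batched activations

# Layout A (our demo): DP=1, TP=EP=N -> weights sharded across ranks, batch replicated.
W_o_shard_A = np.array_split(W_o, N, axis=0)[0]   # (d_head_total / N, d_model) per rank
tokens_A = tokens                                 # full batch on every rank

# Layout B (sglang-style): TP=1, DP=EP=N -> weights replicated, batch sharded.
W_o_shard_B = W_o                                 # full weight on every rank
tokens_B = np.array_split(tokens, N, axis=0)[0]   # (n_tokens / N, d_model) per rank

print("A:", W_o_shard_A.shape, tokens_A.shape)
print("B:", W_o_shard_B.shape, tokens_B.shape)
```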
Thank you for the reply!
For decoding, it is mentioned that "The attention part employs TP4 with SP, combined with DP80". Could I ask what SP is used for in this case? (Honestly, I think SP and DP end up looking similar when using continuous batching.)
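To clarify what I mean by "similar", here is a shape-only sketch (my own reading, not the authors' answer) of the mechanical difference as I understand it: DP splits requests across attention groups, while SP splits the same activations over the token dimension inside a TP group so that per-token work is divided among the TP ranks before being re-gathered. All dimensions are hypothetical.

```python
import numpy as np

TP, DP = 4, 2
n_requests, d_model = 8, 16            # continuous batching: one token per request per decode step
hidden = np.random.randn(n_requests, d_model)

# DP: each DP group owns a disjoint slice of requests and runs attention independently.
dp_shards = np.array_split(hidden, DP, axis=0)        # DP x (n_requests/DP, d_model)

# SP inside one TP group: that group's slice is split again over the token dimension,
# then reassembled (conceptually an all-gather) before the TP-sharded matmuls.
sp_shards = np.array_split(dp_shards[0], TP, axis=0)  # TP x (n_requests/(DP*TP), d_model)
gathered = np.concatenate(sp_shards, axis=0)          # restores the DP group's slice

assert np.allclose(gathered, dp_shards[0])
print([s.shape for s in dp_shards], [s.shape for s in sp_shards])
```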
Thank you for open-sourcing this great work!
I was wondering if you could provide a bit more detail on how you approach MLA parallelism. When you mention TP, do you mean only partitioning the projection weights (e.g. W_q1, W_q2, W_kv1, W_kv2, W_o)? And for DP, do you mean partitioning the batch of tokens, just like in https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models ?
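To make sure I'm asking about the right thing, here is a self-contained sketch of the two interpretations in my question. The dimension values and the exact sharding axes are my assumptions for illustration, not something confirmed by the paper.

```python
import numpy as np

tp_size, dp_size = 4, 4
d_model, d_q, d_kv = 1024, 512, 256    # hypothetical model / latent dimensions
n_tokens = 32

# MLA projection weights (names follow my question: W_q1, W_q2, W_kv1, W_kv2, W_o).
W_q1 = np.zeros((d_model, d_q))        # down-projection to the query latent
W_q2 = np.zeros((d_q, d_model))        # up-projection from the query latent to per-head queries
W_kv1 = np.zeros((d_model, d_kv))      # down-projection to the shared KV latent
W_kv2 = np.zeros((d_kv, d_model))      # up-projection from the KV latent to per-head keys/values
W_o = np.zeros((d_model, d_model))     # output projection

# TP interpretation: shard the head dimension of the up/output projections across ranks
# (column-parallel for W_q2 / W_kv2, row-parallel for W_o); the small down-projections
# could stay replicated.
W_q2_shard = np.array_split(W_q2, tp_size, axis=1)[0]
W_kv2_shard = np.array_split(W_kv2, tp_size, axis=1)[0]
W_o_shard = np.array_split(W_o, tp_size, axis=0)[0]

# DP-attention interpretation (as in the SGLang blog post): every rank keeps the full
# MLA weights but only processes its own slice of the token batch.
tokens = np.zeros((n_tokens, d_model))
token_shard = np.array_split(tokens, dp_size, axis=0)[0]

print(W_q2_shard.shape, W_kv2_shard.shape, W_o_shard.shape, token_shard.shape)
```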