Any consideration on why use 4 sp & 32 tp? #74

ParanoidHW · 2024-04-28T07:26:53Z

Hi authors, great work!
I have a small question about the parallelism setup. It seems that ring attention can hide its communication time behind the local attention computation. So why use more tensor parallelism than sequence parallelism (e.g., 32 TP vs. 4 SP during inference) instead of the opposite, given that the communication cost of TP cannot be ignored or overlapped?
I hope you can answer my question. Many thanks~
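
To make the premise concrete, here is a rough back-of-the-envelope sketch of the two costs being compared. It is not taken from the project: the function names, the hardware numbers, and the simplifications (one attention layer, K/V with the full hidden width, a ring all-reduce for TP, two all-reduces per layer in a Megatron-style forward pass) are all assumptions for illustration.

```python
def ring_attention_step(seq_len, sp, hidden, flops_per_s, link_bytes_per_s,
                        bytes_per_elem=2):
    """Estimate compute vs. communication time for one ring-attention step.

    Assumes each of the `sp` devices holds seq_len // sp tokens and, per step,
    attends its local queries to one incoming K/V block while the next block
    is being transferred.
    """
    block = seq_len // sp
    # QK^T and (scores @ V): 2 matmuls, 2 FLOPs per multiply-add, all heads.
    compute_flops = 4 * block * block * hidden
    # One K block and one V block are passed to the neighbour each step.
    comm_bytes = 2 * block * hidden * bytes_per_elem
    return compute_flops / flops_per_s, comm_bytes / link_bytes_per_s


def tp_allreduce_bytes_per_layer(tokens, hidden, tp, bytes_per_elem=2):
    """Bytes each device sends for TP all-reduces in one layer's forward pass.

    Assumes two all-reduces per layer (after attention and after the MLP) and
    a ring all-reduce, which moves 2 * (tp - 1) / tp of the message per device.
    """
    message_bytes = tokens * hidden * bytes_per_elem
    return 2 * (2 * (tp - 1) / tp) * message_bytes


if __name__ == "__main__":
    # Hypothetical hardware numbers, chosen only to illustrate the comparison.
    compute_s, comm_s = ring_attention_step(
        seq_len=4096, sp=4, hidden=4096,
        flops_per_s=1e12, link_bytes_per_s=1e10,
    )
    print(f"ring step: compute {compute_s:.6f}s, comm {comm_s:.6f}s")
    print("overlap possible:", compute_s >= comm_s)
    print("TP bytes/layer:", tp_allreduce_bytes_per_layer(1, 4096, 32))
```

Under these assumptions, the ring transfer hides behind compute only when the per-device block is large enough, since compute grows with the square of the block length while the transfer grows linearly. Whether that holds for a given inference setup depends on the actual sequence length, batch shape, and interconnect.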
