[QUESTION] Why does Megatron-LM use the gloo backend when creating parallel groups? #1177

wuyingjun-lucky · 2024-03-21T13:11:33Z

Your question
Why does Megatron-LM use the gloo backend, rather than the value passed via --distributed-backend, when creating parallel groups?

    for i in range(pipeline_model_parallel_size):
        start_rank = i * num_pipeline_model_parallel_groups
        end_rank = (i + 1) * num_pipeline_model_parallel_groups
        for j in range(context_parallel_size * tensor_model_parallel_size):
            ranks = range(
                start_rank + j, end_rank, context_parallel_size * tensor_model_parallel_size
            )
            group = torch.distributed.new_group(
                ranks, pg_options=get_nccl_options('dp', nccl_comm_cfgs)
            )
            group_gloo = torch.distributed.new_group(ranks, backend="gloo")
            if rank in ranks:
                _DATA_PARALLEL_GROUP = group
                _DATA_PARALLEL_GROUP_GLOO = group_gloo
                _DATA_PARALLEL_GLOBAL_RANKS = ranks
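
To make the rank layout in the loop above easier to follow, here is a minimal pure-Python sketch that lists only the rank sets the loop would produce, without creating any process groups. The function name `data_parallel_rank_groups` is hypothetical, and the sketch assumes that `num_pipeline_model_parallel_groups` equals `world_size // pipeline_model_parallel_size`, which the excerpt itself does not show.

```python
def data_parallel_rank_groups(
    world_size,
    tensor_model_parallel_size,
    pipeline_model_parallel_size,
    context_parallel_size=1,
):
    """Return the rank lists produced by the loop in the excerpt above."""
    # Assumption (not shown in the excerpt): ranks are split evenly across
    # pipeline stages.
    num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size
    stride = context_parallel_size * tensor_model_parallel_size

    groups = []
    for i in range(pipeline_model_parallel_size):
        start_rank = i * num_pipeline_model_parallel_groups
        end_rank = (i + 1) * num_pipeline_model_parallel_groups
        for j in range(stride):
            # Same range() expression as in the excerpt, materialized as a list.
            groups.append(list(range(start_rank + j, end_rank, stride)))
    return groups


if __name__ == "__main__":
    for ranks in data_parallel_rank_groups(
        world_size=8, tensor_model_parallel_size=2, pipeline_model_parallel_size=2
    ):
        print(ranks)
```

In the excerpt, each of these rank sets is passed to `torch.distributed.new_group` twice: once with the default backend and once with `backend="gloo"`.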

yuantailing · 2024-03-27T17:01:24Z

Most groups use the default backend.
Only a few groups use the gloo backend, because gloo is needed for communication of CPU tensors.
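
As a rough illustration of this point, the sketch below creates a group pair over the same ranks, one on the default backend and one on gloo, and routes CPU tensors to the gloo one. It is not Megatron-LM's code: `make_group_pair`, `all_reduce_any_device`, and `demo` are hypothetical names, and the demo initializes a single-process gloo job so it can run without GPUs, whereas a real training job would typically use NCCL as the default backend.

```python
import tempfile

import torch
import torch.distributed as dist


def make_group_pair(ranks):
    """Create a default-backend group and a gloo group over the same ranks."""
    group = dist.new_group(ranks)  # default backend (NCCL in a typical GPU job)
    group_gloo = dist.new_group(ranks, backend="gloo")  # usable with CPU tensors
    return group, group_gloo


def all_reduce_any_device(tensor, group, group_gloo):
    """Hypothetical helper: send CPU tensors through the gloo group."""
    chosen = group_gloo if tensor.device.type == "cpu" else group
    dist.all_reduce(tensor, group=chosen)  # in-place sum across the group
    return tensor


def demo():
    # Single-process setup so the sketch runs on CPU only (an assumption made
    # for illustration; real jobs launch many ranks).
    init_file = tempfile.NamedTemporaryFile(delete=False)
    init_file.close()
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file.name}",
        rank=0,
        world_size=1,
    )
    try:
        group, group_gloo = make_group_pair([0])
        cpu_tensor = torch.tensor([3.0])
        return all_reduce_any_device(cpu_tensor, group, group_gloo)
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(demo())
```

With NCCL as the default backend, a collective on a CPU tensor needs a gloo group like the one above, since NCCL operates on GPU tensors.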

github-actions[bot] · 2024-05-26T18:20:21Z

Marking as stale. No activity in 60 days.

yangfuwei · 2024-07-22T02:03:54Z

> Most groups use the default backend.
> Only a few groups use the gloo backend, because gloo is needed for communication of CPU tensors.

Hi, I found some code that is hard-coded to use gloo, for example https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/parallel_state.py#L557, but I ran into an issue when creating the gloo groups. If I manually change every "gloo" to "nccl", it works. What is the impact of this change? Is it okay to replace all "gloo" with "nccl"? Thank you.

github-actions[bot] · 2024-09-20T18:21:34Z

Marking as stale. No activity in 60 days.
