[BUG]: weird stuck while training #6095

Open · 1 task done
ericxsun opened this issue Oct 19, 2024 · 8 comments
Labels: bug (Something isn't working)

@ericxsun (Contributor)

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

When training a language model with the GeminiPlugin, I encountered an issue where the process got stuck during the forward step. I was saving a checkpoint every 3000 steps, and when it got stuck, I had to kill the process and resume from the latest checkpoint.

The steps at which training got stuck:

| start step | stuck step | total steps in each run |
|-----------:|-----------:|------------------------:|
| 225000     | 271464     | 46464                   |
| 180000     | 226463     | 46463                   |
| 135000     | 181463     | 46463                   |
| 90000      | 136463     | 46463                   |
| 45000      | 91463      | 46463                   |
| 0          | 46465      | 46465                   |

Any idea how to find out why? Thanks a lot.

Environment

CUDA: 12.1
NCCL: 2.18
PyTorch: 2.1.2
Python: 3.8
ColossalAI: 0.4.2
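
For reference, a minimal sketch of the setup described above (Booster with GeminiPlugin, checkpointing every 3000 steps). This is illustrative only: the model/dataloader construction, paths, and loop structure are placeholders, not the actual training script.

```python
# Illustrative sketch only -- not the actual training script from this report.
# build_model_optimizer_dataloader() is a hypothetical helper standing in for
# the real model/data setup.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

colossalai.launch_from_torch()

plugin = GeminiPlugin()
booster = Booster(plugin=plugin)

model, optimizer, dataloader = build_model_optimizer_dataloader()
model, optimizer, _, dataloader, _ = booster.boost(model, optimizer, dataloader=dataloader)

for step, batch in enumerate(dataloader, start=1):
    loss = model(**batch).loss
    booster.backward(loss, optimizer)
    optimizer.step()
    optimizer.zero_grad()

    # Save a checkpoint every 3000 steps, as described in the report.
    if step % 3000 == 0:
        booster.save_model(model, f"ckpt/step_{step}/model", shard=True)
        booster.save_optimizer(optimizer, f"ckpt/step_{step}/optimizer", shard=True, size_per_shard=2048)
```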

ericxsun added the bug (Something isn't working) label on Oct 19, 2024
@Edenzzzz (Contributor) commented Oct 21, 2024

Can you share any relevant messages and the stack trace from when it gets stuck or exits?

@ericxsun (Contributor, Author)

> Can you share any relevant messages and the stack trace from when it gets stuck or exits?

I didn’t receive any useful information or logs. All nodes seem to be functioning correctly. The only option I have is to kill the training process and resume it.

With more logging added, I can see that the process gets stuck at the forward step.

@Edenzzzz (Contributor) commented Oct 22, 2024

Could you share the stack trace printed when you kill it with Ctrl-C, and a reproducible script?

@ericxsun (Contributor, Author) commented Nov 4, 2024

> Could you share the stack trace printed when you kill it with Ctrl-C, and a reproducible script?

Could it be caused by the weird behavior described in #6111?

@Edenzzzz (Contributor) commented Nov 4, 2024

You can probably test the behavior of all_gather_object and see if it spawns multiple processes.
What happens with booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048) is that it calls into save_sharded_optimizer, which all_gathers the states. You can try removing some barriers along this call stack and ping other members with your findings (i.e. whether that fixes the hang).
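
A minimal way to test this in isolation (a sketch, assuming one process per GPU launched via torchrun; the tensor contents just mimic the compacted optimizer states):

```python
# Sketch: check whether all_gather_object with a CUDA tensor in the payload
# makes extra PIDs show up on each GPU in nvidia-smi.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Stand-in for the compacted optimizer states gathered during save_sharded_optimizer.
    payload = torch.zeros(1024, device=torch.device(f"cuda:{torch.cuda.current_device()}"))

    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, [payload, rank], group=None)

    # Pause here and inspect `nvidia-smi`: if unpickling the gathered CUDA tensors
    # creates new CUDA contexts, each GPU will list the PIDs of the other ranks.
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=<num_gpus> test_all_gather_object.py
```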

@ericxsun (Contributor, Author) commented Nov 4, 2024

I observed that, after this line:

compacted_states = self.pack_optimizer_states_to_tensor(param_id, state_names) if own_param else None

the PIDs of the other ranks start appearing on rank 0.

Furthermore, after reaching this line:

compacted_states = torch.zeros(compacted_size, dtype=dtype, device=device, requires_grad=False)

if device is replaced with torch.device(f"cuda:{torch.cuda.current_device()}"), each rank retains only one PID, just as at the start:

compacted_states = torch.zeros(
    compacted_size,
    dtype=dtype,
    device=torch.device(f"cuda:{torch.cuda.current_device()}"),
    requires_grad=False
) 

And after reaching this line:

dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size], group=zero_group)

the PIDs of the other ranks still start appearing on each rank.
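
To isolate the device part of this observation, here is a small sketch (assuming one process per GPU; not the ColossalAI code itself) contrasting allocation on a fixed device index with allocation on the current device. It shows one plausible mechanism consistent with the observation above, not a confirmed diagnosis:

```python
# Sketch: extra PIDs from allocating on another rank's GPU vs. the current one.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Variant A (problematic): every rank allocates on cuda:0, so cuda:0 ends up
# with a CUDA context -- and a PID in nvidia-smi -- from every rank.
# t = torch.zeros(1024, device="cuda:0")

# Variant B (what the replacement above does): each rank stays on its own device,
# so each GPU keeps a single PID.
t = torch.zeros(1024, device=torch.device(f"cuda:{torch.cuda.current_device()}"), requires_grad=False)

dist.barrier()
dist.destroy_process_group()
```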

@ericxsun (Contributor, Author) commented Nov 4, 2024

Hi @ver217, could you take a look? Thanks very much.

@Edenzzzz (Contributor) commented Nov 6, 2024

> And after reaching this line:
>
> dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size], group=zero_group)
>
> the PIDs of the other ranks still start appearing on each rank.

This might just be the default behavior. All-gather by definition collects tensor-based objects from other ranks:
https://discuss.pytorch.org/t/distributed-all-gather-object-produces-multiple-additional-processes/164991
For the hang, please try removing the dist.barrier call.
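
If the extra CUDA contexts themselves are a concern, one common workaround (an assumption, not something verified in this thread) is to gather CPU copies so that unpickling on the receiving ranks does not touch their GPUs. A sketch against the names used above:

```python
# Hypothetical workaround sketch: gather CPU copies of the compacted states so
# that deserialization on other ranks does not create new CUDA contexts.
import torch
import torch.distributed as dist

def gather_state_shards(compacted_states, shard_offset, shard_size, zero_group):
    """Gather (states, offset, size) from every rank in `zero_group`."""
    world_size = dist.get_world_size(group=zero_group)
    gathered_state_shards = [None] * world_size
    payload = [
        compacted_states.cpu() if compacted_states is not None else None,
        shard_offset,
        shard_size,
    ]
    dist.all_gather_object(gathered_state_shards, payload, group=zero_group)
    return gathered_state_shards
```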
