[BUG]: Training gets stuck unexpectedly #6095
Comments
Can you share any relevant messages and the stack trace from when it got stuck or exited?
I didn’t receive any useful information or logs. All nodes seem to be functioning correctly. The only option I have is to kill the training process and resume it. After adding more logging, I found that the process gets stuck at the forward step.
Could you share the stack trace printed when you kill the process with Ctrl+C, along with a reproducible script?
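For reference, a Python-level stack trace can be captured from a hung rank without killing it by using the standard-library `faulthandler` module; the sketch below is a generic illustration (the signal and the timeout are arbitrary choices, not taken from this thread):

```python
import faulthandler
import signal
import sys

# Dump the stack of every Python thread when the process receives SIGUSR1,
# e.g. by running `kill -USR1 <pid>` on the stuck rank.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Or dump tracebacks automatically if no progress is made for 30 minutes;
# call faulthandler.cancel_dump_traceback_later() after each completed step
# to reset the timer.
faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=sys.stderr)
```

Attaching `py-spy dump --pid <pid>` to the stuck process is another option that requires no code changes.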
Could it be caused by the weird behavior described in #6111?
You can probably test the behavior of
I observed that, following this line:
Furthermore, after reaching this line:
If
And after reaching this line:
the PIDs of other ranks still start appearing on each rank.
Hi @ver217, could you take a look? Thanks very much.
This might just be the expected behavior: all-gather, by definition, collects tensor-based objects from the other ranks.
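As a generic illustration of that point (not the actual ColossalAI code path), `torch.distributed.all_gather_object` pickles each rank's Python object and hands a copy to every rank, so each rank seeing the PIDs of the others is expected:

```python
import os

import torch.distributed as dist


def gather_pids() -> list:
    """Collect every rank's PID; afterwards each rank holds the full list."""
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    # all_gather_object serializes the object on each rank and broadcasts it
    # to all other ranks, so other ranks' PIDs showing up locally is by design.
    dist.all_gather_object(gathered, os.getpid())
    return gathered
```

Run under `torchrun` with the process group already initialized; every rank returns the same list of PIDs.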
Is there an existing issue for this bug?
🐛 Describe the bug
When training a language model with the GeminiPlugin, I encountered an issue where the process got stuck during the forward step. I was saving a checkpoint every 3000 steps, and when it got stuck, I had to kill the process and resume from the latest checkpoint.
The steps at which training got stuck:
Do you have any ideas on how to find out why? Thanks a lot.
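For context, a minimal sketch of the setup described above, with a toy model, synthetic data, and a placeholder checkpoint path standing in for the real ones (none of these details come from the original report):

```python
# Hypothetical reproducer skeleton; model, data, and hyperparameters are placeholders.
import colossalai
import torch
import torch.nn as nn
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from torch.utils.data import DataLoader, TensorDataset

colossalai.launch_from_torch()  # assumes launch via `torchrun` or `colossalai run`

# Toy stand-in for the language model.
model = nn.Sequential(nn.Embedding(1000, 256), nn.Flatten(), nn.Linear(16 * 256, 1000))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randint(0, 1000, (512, 16)), torch.randint(0, 1000, (512,)))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

booster = Booster(plugin=GeminiPlugin())
model, optimizer, criterion, dataloader, _ = booster.boost(model, optimizer, criterion, dataloader)

for step, (tokens, labels) in enumerate(dataloader):
    tokens, labels = tokens.cuda(), labels.cuda()
    loss = criterion(model(tokens), labels)
    booster.backward(loss, optimizer)
    optimizer.step()
    optimizer.zero_grad()
    # Checkpoint every 3000 steps, mirroring the interval in the report.
    if step > 0 and step % 3000 == 0:
        booster.save_model(model, "ckpt", shard=True)
```

If the hang turns out to be related to checkpointing, shortening the save interval in a skeleton like this may help localize it.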
Environment
CUDA: 12.1
NCCL: 2.18
PyTorch: 2.1.2
Python: 3.8
ColossalAI: 0.4.2