[BUG]: Got nan during backward with zero2 #6091
Comments
FP16 ZeRO should auto-check for overflow and skip that step, though this seems unimplemented for bf16. @botbw would you be able to take a look? I haven't been maintaining this part.
Since you're using Open-Sora, feel free to open this kind of issue in their repo too.
Indeed bf16 has the same range as fp32, but in my opinion this check could be enforced for all precisions?
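For context on the check being discussed, here is a minimal sketch of a gradient-overflow guard that skips an optimizer step, assuming a plain PyTorch optimizer; the function name and structure are assumptions, and this is not ColossalAI's actual fp16 ZeRO implementation.

```python
# Sketch of an overflow guard: scan gradients for non-finite values and skip
# the optimizer step for that iteration if any are found.
# NOT ColossalAI's implementation; names and structure are assumed.
import torch
import torch.distributed as dist

def step_with_overflow_check(optimizer, parameters):
    device = torch.device("cuda", torch.cuda.current_device())
    found_overflow = torch.zeros(1, device=device)

    # Flag this rank if any gradient contains inf/nan.
    for p in parameters:
        if p.grad is not None and not torch.isfinite(p.grad).all():
            found_overflow.fill_(1.0)
            break

    # Make every rank agree on whether to skip, otherwise ranks would desync.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(found_overflow, op=dist.ReduceOp.MAX)

    if found_overflow.item() > 0:
        optimizer.zero_grad()  # drop this step's gradients entirely
        return False           # step skipped due to overflow

    optimizer.step()
    optimizer.zero_grad()
    return True
```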
Hi @Edenzzzz, thank you for getting involved. Please note that with the mentioned lines added, NaN does not occur again and the RuntimeError is never raised by these lines. Therefore, I don't think skipping a specific iteration would help. I suspect the bug comes from communication. That's also the reason I opened an issue in this repo.
Emm, then I think
@flymin Thanks for reporting this! Would it be possible to share the config/code snippet you are using? If not, could you try setting
Sorry, I cannot provide my current code. I may have some time in late November to work on this issue again. I have tried adding synchronization code in this function, but it does not help. I also tried From my workaround, I have to add two
If you suspect comm issues, you can put
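The concrete suggestion above is cut off, so as one hedged example of the kind of check that can be dropped around the suspect collective, the helper below gathers each rank's local gradient norm so a diverging rank stands out; the function name and placement are assumptions, not what was originally suggested.

```python
# Hypothetical debugging helper (not from the original comment): gather each
# rank's local gradient norm and print them, so a rank whose gradients have
# already diverged (or gone NaN) around a collective stands out.
import torch
import torch.distributed as dist

def report_grad_norms(parameters, tag=""):
    device = torch.device("cuda", torch.cuda.current_device())
    local_norm_sq = torch.zeros(1, device=device)
    for p in parameters:
        if p.grad is not None:
            local_norm_sq += p.grad.float().norm() ** 2
    local_norm = local_norm_sq.sqrt()

    gathered = [torch.zeros_like(local_norm) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_norm)

    if dist.get_rank() == 0:
        print(f"[{tag}] per-rank grad norms: {[t.item() for t in gathered]}")
```

Calling such a helper right before and right after the gradient reduction narrows down whether the values were already bad going into the collective.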
Is there an existing issue for this bug?
🐛 Describe the bug
My code is based on Open-Sora and runs without any issue on 32 GPUs using ZeRO-2.
However, when using 64 GPUs, NaN appears in the tensor gradients after the second backward step.
I have made a workaround to patch colossalai/zero/low_level/low_level_optim.py. With the patch, my code runs normally and the loss seems fine.
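The patch itself did not survive in the issue text, so the snippet below is only a guess at the shape of such a workaround, based on the description of adding two synchronization points around the gradient reduction; all_reduce_bucket_with_sync and its arguments are hypothetical, and this is not the actual change to low_level_optim.py.

```python
# Sketch only, assuming a flattened gradient bucket: force the default stream
# to finish before the collective reads the bucket, and again before anything
# downstream consumes the reduced result.
import torch
import torch.distributed as dist

def all_reduce_bucket_with_sync(flat_grad_bucket, process_group=None):
    # First sync: every compute kernel that writes into this bucket must
    # finish before the collective reads it.
    torch.cuda.synchronize()

    # ZeRO-2 would normally reduce-scatter the bucket; an all-reduce plus an
    # average keeps the sketch short.
    dist.all_reduce(flat_grad_bucket, op=dist.ReduceOp.SUM, group=process_group)
    flat_grad_bucket.div_(dist.get_world_size(process_group))

    # Second sync: block until the collective has completed so the optimizer
    # never consumes a partially reduced bucket.
    torch.cuda.synchronize()
```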
I think it may be related to unsynchronized state between CUDA streams. I do not know the exact reason, and I do not think my workaround really solves the issue.
Any ideas from the team?
Environment
NVIDIA H20
ColossalAI version: 0.4.3
CUDA 12.4
PyTorch 2.4