NaN loss due to mixed precision? #504
Comments
Don't have much to add, but just to note that I'm getting the same issue with the RAG model using a config with fp16 enabled, either as above without the ZeRO optimization or with it. Training runs without DeepSpeed, including with AMP enabled, but with DeepSpeed it will only train with AMP and the ZeRO optimizer.
I don't have any logs of what is going on inside DeepSpeed, and I'm not sure if there is an option to turn on a full stack trace. All I know is that the loss in the batch_outputs returned by the DeepSpeed model engine tests as NaN.
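For reference, here is a minimal sketch of the kind of config being described above, with fp16 enabled and ZeRO optionally turned on. The key names follow DeepSpeed's public config schema, but the specific values (batch size, ZeRO stage) are illustrative assumptions, not the config used in this thread:

```python
# Illustrative DeepSpeed config with mixed precision (fp16) enabled.
import json

ds_config = {
    "train_batch_size": 8,        # assumed value; must match your launcher settings
    "fp16": {
        "enabled": True,          # turn on mixed-precision training
        "loss_scale": 0,          # 0 selects dynamic loss scaling
        "initial_scale_power": 16,
    },
    # Remove this block to reproduce the "fp16 without ZeRO" case mentioned above.
    "zero_optimization": {
        "stage": 1,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# The resulting file can then be handed to deepspeed.initialize(...) or to the
# deepspeed launcher (e.g. via --deepspeed_config ds_config.json).
```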
@sabetAI I found a workaround. In RagTokenForGeneration or RagSequenceForGeneration, the get_nll function computes a smoothing loss, and in my training all of the smooth_obj values were infinity. I got the model to train by simply zeroing out those values: `smooth_obj[torch.isinf(smooth_obj)] = 0`. This effectively turns off the smoothing loss, and I'm not sure how that affects the model, but the losses are now non-NaN and can back-propagate. @samyam, sorry for not replying. I'm still not sure whether there is an outstanding DeepSpeed-specific issue, but training works for me now.
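A rough sketch of where that workaround sits, assuming the general shape of get_nll in the transformers RAG models. The names `ll`, `smooth_obj`, and `epsilon` follow the comment above; the surrounding code is simplified and is not the exact library source:

```python
import torch

def get_nll_with_workaround(seq_logprobs: torch.Tensor,
                            target: torch.Tensor,
                            epsilon: float = 0.1) -> torch.Tensor:
    """Simplified label-smoothed NLL in the spirit of RAG's get_nll.

    seq_logprobs: (batch, seq_len, vocab) log-probabilities.
    target:       (batch, seq_len) token ids.
    """
    # Log-likelihood of the target tokens.
    ll = seq_logprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)

    # Smoothing term: sum of log-probs over the vocabulary. Under fp16 this is
    # where the comment above reports infinities appearing.
    smooth_obj = seq_logprobs.sum(dim=-1)

    # Workaround from the comment: zero out infinite smoothing values so the
    # loss stays finite (this effectively disables smoothing for those rows).
    smooth_obj[torch.isinf(smooth_obj)] = 0

    nll_loss = -ll.sum()
    smooth_loss = -smooth_obj.sum()

    eps_i = epsilon / seq_logprobs.size(-1)
    return (1.0 - epsilon) * nll_loss + eps_i * smooth_loss
```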
I had a (presumably) similar issue with computing a cross-entropy loss.
I'm trying to train RAG using DeepSpeed, but I get a NaN loss, possibly due to mixed-precision errors.
This can be reproduced by running `sh train.sh` from https://github.com/sabetAI/rage.
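For anyone trying to narrow this down, here is a hedged sketch of how one might catch the first non-finite loss coming back from the DeepSpeed engine. `model_engine`, `dataloader`, `batch`, and `outputs.loss` are placeholders for whatever the training script actually uses, not code from this repository:

```python
import torch

# Hypothetical training-loop excerpt: model_engine is the object returned by
# deepspeed.initialize, and batch is whatever the dataloader yields.
torch.autograd.set_detect_anomaly(True)  # slower, but flags NaN/Inf produced in backward

for step, batch in enumerate(dataloader):
    outputs = model_engine(**batch)
    loss = outputs.loss  # assumes a HuggingFace-style output object

    if torch.isnan(loss) or torch.isinf(loss):
        # Stop at the first bad loss so the offending batch can be inspected.
        raise RuntimeError(f"Non-finite loss {loss.item()} at step {step}")

    model_engine.backward(loss)
    model_engine.step()
```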