NaN loss due to mixed precision? #504

Open
sabetAI opened this issue Nov 5, 2020 · 5 comments

sabetAI commented Nov 5, 2020

I'm trying to train RAG using deepspeed, but get a NaN loss, possibly due to mixed precision errors.

Can reproduce by running sh train.sh from https://github.com/sabetAI/rage

dwlmt commented Nov 9, 2020

Don't have much to add, but just to note that I'm getting the same issue with the RAG model when using a config with fp16 turned on, either as above without the ZeRO optimization or with it. The training runs without DeepSpeed, including with AMP enabled, but with DeepSpeed it will only train with AMP and the ZeRO optimizer.
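For reference, a minimal DeepSpeed config along the lines described above might look like the sketch below (batch sizes and scaling values are illustrative, not the exact settings from this issue); the fp16 and zero_optimization sections are the two switches in question, and dropping the zero_optimization block gives the "without ZeRO" variant:

import json

# Illustrative DeepSpeed config: fp16 with dynamic loss scaling, plus optional ZeRO.
ds_config = {
    "train_batch_size": 8,            # hypothetical value
    "gradient_accumulation_steps": 1,
    "fp16": {
        "enabled": True,              # mixed-precision training
        "loss_scale": 0,              # 0 = dynamic loss scaling
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "zero_optimization": {
        "stage": 1,                   # remove this block to train without ZeRO
    },
}

# Write it out in the JSON form DeepSpeed reads from disk.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)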

samyam (Contributor) commented Nov 10, 2020

@sabetAI, @dwlmt can you please share your logs?

dwlmt commented Nov 12, 2020

I don't have any logs of what is going on inside DeepSpeed, and I'm not sure if there is an option to turn on a full stack trace. All I know is that the loss in batch_outputs, which is returned from the DeepSpeed model engine, tests as a NaN:

loss = batch_outputs.get("loss")

if torch.isnan(loss.float()):
    raise ValueError("nan loss encountered")
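One way to narrow down where the non-finite values first appear (a debugging sketch, not part of the original report; the function name is illustrative) is to hook every submodule and raise on the first non-finite output; torch.autograd.set_detect_anomaly(True) can similarly point at the op that produced a NaN in the backward pass:

import torch

def install_nonfinite_hooks(model):
    # Raise as soon as any submodule returns a tensor containing NaN/inf,
    # so the offending layer is named in the error message.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite output in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))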

dwlmt commented Nov 13, 2020

@sabetAI I found a workaround. In RagTokenForGeneration or RagSequenceForGeneration, in the get_nll function, there is a smoothing loss, and in my training all of the smooth_obj values are infinite. I've got the model to train by just zeroing out these values:

smooth_obj[torch.isinf(smooth_obj)] = 0

This effectively turns off the smoothing loss, and I'm not sure how that affects the model, but the losses are non-NaN now and can back-propagate.
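For context, the smoothing term enters the loss roughly as in the generic label-smoothed NLL sketched below (a paraphrase, not the exact transformers get_nll source; seq_logprobs, target, and epsilon are illustrative names). Summing fp16 log-probabilities over the vocabulary can underflow to -inf, which is the value the guard above zeroes out:

import torch

def label_smoothed_nll(seq_logprobs, target, epsilon=0.1):
    # seq_logprobs: (batch, seq_len, vocab) log-probabilities
    # target:       (batch, seq_len) gold token ids
    nll = -seq_logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # per-token NLL
    smooth_obj = seq_logprobs.sum(dim=-1)        # sum of log-probs over the vocab
    smooth_obj[torch.isinf(smooth_obj)] = 0      # the workaround: drop -inf terms
    eps_i = epsilon / seq_logprobs.size(-1)
    return ((1.0 - epsilon) * nll - eps_i * smooth_obj).sum()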

@samyam sorry for not replying. I'm still not sure if there is an outstanding DeepSpeed-specific issue, but training works for me now.

jendrikjoe commented

I had a (presumably) similar issue when computing a cross-entropy loss.
What solved it for me was to safeguard the loss calculation so it is done in 32-bit:

with torch.autocast(device_type="cuda", dtype=torch.float32):
    loss = self.loss(output_logits, target_ids)
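Depending on the PyTorch version, CUDA autocast may only accept float16/bfloat16 as the target dtype, so an equivalent pattern (a sketch, reusing output_logits and target_ids from above and assuming a plain cross-entropy loss) is to disable autocast around the loss and upcast the logits explicitly:

import torch
import torch.nn.functional as F

with torch.autocast(device_type="cuda", enabled=False):
    # Upcast to float32 so the log-softmax inside cross_entropy runs in full precision.
    loss = F.cross_entropy(output_logits.float(), target_ids)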
