NaN loss due to mixed precision? #504

Open
sabetAI opened this issue Nov 5, 2020 · 5 comments

sabetAI commented Nov 5, 2020

I'm trying to train RAG using deepspeed, but get a NaN loss, possibly due to mixed precision errors.

Can reproduce by running sh train.sh from https://github.com/sabetAI/rage

dwlmt commented Nov 9, 2020

Don't have much to add, but just to note that I'm getting the same issue with the RAG model when using a config with fp16 turned on, either as above without the ZeRO optimization or with it. The training runs without DeepSpeed, including with AMP enabled, but with DeepSpeed it will only train with AMP and the ZeRO optimizer.
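For reference, a minimal DeepSpeed config along the lines described above might look like the sketch below (batch sizes and scaling values are illustrative, not the exact settings from this issue); the fp16 and zero_optimization sections are the two switches in question, and dropping the zero_optimization block gives the "without ZeRO" variant:

import json

# Illustrative DeepSpeed config: fp16 with dynamic loss scaling, plus optional ZeRO.
ds_config = {
    "train_batch_size": 8,            # hypothetical value
    "gradient_accumulation_steps": 1,
    "fp16": {
        "enabled": True,              # mixed-precision training
        "loss_scale": 0,              # 0 = dynamic loss scaling
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "zero_optimization": {
        "stage": 1,                   # remove this block to train without ZeRO
    },
}

# Write it out in the JSON form DeepSpeed reads from disk.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)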

samyam (Contributor) commented Nov 10, 2020

@sabetAI, @dwlmt can you please share your logs?

dwlmt commented Nov 12, 2020

I don't have any logs of what is going on inside DeepSpeed, and I'm not sure if there is an option to turn on a full stack trace. All I know is that the loss in batch_outputs, which is returned from the DeepSpeed model engine, tests as a NaN:

loss = batch_outputs.get("loss")

if torch.isnan(loss.float()):
    raise ValueError("nan loss encountered")
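One way to narrow down where the non-finite values first appear (a debugging sketch, not part of the original report; the function name is illustrative) is to hook every submodule and raise on the first non-finite output; torch.autograd.set_detect_anomaly(True) can similarly point at the op that produced a NaN in the backward pass:

import torch

def install_nonfinite_hooks(model):
    # Raise as soon as any submodule returns a tensor containing NaN/inf,
    # so the offending layer is named in the error message.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite output in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))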

dwlmt commented Nov 13, 2020

@sabetAI I found a workaround. In RagTokenForGeneration or RagSequenceForGeneration, in the get_nll function, there is a smoothing loss, and in my training all of the smooth_obj values are infinite. I've got the model to train by just zeroing out these values:

smooth_obj[torch.isinf(smooth_obj)] = 0

This effectively turns off the smoothing loss, and I'm not sure how that affects the model, but the losses are non-NaN now and can back-propagate.
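For context, the smoothing term enters the loss roughly as in the generic label-smoothed NLL sketched below (a paraphrase, not the exact transformers get_nll source; seq_logprobs, target, and epsilon are illustrative names). Summing fp16 log-probabilities over the vocabulary can underflow to -inf, which is the value the guard above zeroes out:

import torch

def label_smoothed_nll(seq_logprobs, target, epsilon=0.1):
    # seq_logprobs: (batch, seq_len, vocab) log-probabilities
    # target:       (batch, seq_len) gold token ids
    nll = -seq_logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # per-token NLL
    smooth_obj = seq_logprobs.sum(dim=-1)        # sum of log-probs over the vocab
    smooth_obj[torch.isinf(smooth_obj)] = 0      # the workaround: drop -inf terms
    eps_i = epsilon / seq_logprobs.size(-1)
    return ((1.0 - epsilon) * nll - eps_i * smooth_obj).sum()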

@samyam sorry for not replying. I'm still not sure if there is an outstanding DeepSpeed-specific issue, but training works for me now.

jendrikjoe commented

I had a (presumably) similar issue when computing a cross-entropy loss.
What solved it for me was to safeguard the loss calculation so it is done in 32-bit:

with torch.autocast(device_type="cuda", dtype=torch.float32):
    loss = self.loss(output_logits, target_ids)
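Depending on the PyTorch version, CUDA autocast may only accept float16/bfloat16 as the target dtype, so an equivalent pattern (a sketch, reusing output_logits and target_ids from above and assuming a plain cross-entropy loss) is to disable autocast around the loss and upcast the logits explicitly:

import torch
import torch.nn.functional as F

with torch.autocast(device_type="cuda", enabled=False):
    # Upcast to float32 so the log-softmax inside cross_entropy runs in full precision.
    loss = F.cross_entropy(output_logits.float(), target_ids)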
