Loss goes to -inf #1

Open
jdeschena opened this issue Aug 23, 2024 · 5 comments

Comments
@jdeschena

jdeschena commented Aug 23, 2024

Hello,

I am trying to run your code to reproduce your results, but with either the lambda_DCE or t_DCE loss, the loss quickly goes to -infinity, so I wonder whether a sign is missing somewhere. Could you take a look at the code? Simply negating the loss returned by get_loss_fn does not solve the issue.
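
In case it helps to localize the problem, here is a generic debugging sketch (not tied to this repo's actual get_loss_fn signature, which I am not assuming) that wraps a scalar-loss callable and flags the first step at which the loss becomes negative or non-finite:

```python
import torch

def watch_loss(loss_fn):
    """Wrap an arbitrary scalar-loss callable and raise on the first
    negative or non-finite value, so the offending step is easy to find.
    The wrapped callable's arguments are passed through unchanged."""
    step = 0

    def wrapped(*args, **kwargs):
        nonlocal step
        loss = loss_fn(*args, **kwargs)
        value = loss.detach().item()
        if not torch.isfinite(loss).all() or value < 0:
            raise RuntimeError(f"loss became {value:.4e} at step {step}")
        step += 1
        return loss

    return wrapped
```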

Thanks in advance.

@jdeschena jdeschena changed the title lambda_DCE loss goes to -inf Loss goes to -inf Aug 23, 2024
@JingyangOu
Collaborator

Thanks for flagging this. I haven't encountered this issue in my runs, so it would help to have more details. Could you please share your training logs and any specific settings or modifications you made? That will help me diagnose the problem more accurately.

@dongzhuoyao

Can you share a screenshot of your loss trend? For me, the training loss curve is not stable. @JingyangOu

I am using 128 tokens, a 10k vocabulary, and a model with 130M parameters.

[screenshot: unstable training loss curve]

@JingyangOu
Collaborator

This is my training loss with the default configuration (1024 tokens, GPT-2 tokenizer with a 50k vocabulary, 130M parameters):

[screenshot: training loss curve, consistently positive]

Can you reproduce my result? The loss should always be positive. I'm not sure whether this bug is related to your code modifications.
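
For reference, a minimal illustration (generic PyTorch, not code from this repo) of why a loss built from non-negatively weighted cross-entropy terms cannot go below zero: each term is -log p with 0 < p ≤ 1, so it is always ≥ 0.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 50257)            # arbitrary model outputs
targets = torch.randint(0, 50257, (4,))   # arbitrary token ids
ce = F.cross_entropy(logits, targets, reduction="none")
assert (ce >= 0).all()                    # -log(p) >= 0 whenever p <= 1
print(ce)
```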

@dongzhuoyao

How large is your batch size? 512×16? For me, the batch size is only 32.

@JingyangOu
Collaborator

The batch size in the config refers to the equivalent batch size after combining all GPUs and applying gradient accumulation. Therefore, the batch size I used is 512. I also tried training with a batch size of 32, and in this case, the resulting curve was similar to the one with a batch size of 512, with no signs of training instability.
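
To make the arithmetic concrete, a small sketch (the variable names are illustrative, not the repo's actual config keys) of how the effective batch size is obtained:

```python
# Illustrative only: these names are not the repo's config keys.
per_gpu_batch_size = 32      # micro-batch processed by each GPU per forward pass
num_gpus = 8                 # data-parallel workers
grad_accum_steps = 2         # micro-batches accumulated before each optimizer step

effective_batch_size = per_gpu_batch_size * num_gpus * grad_accum_steps
print(effective_batch_size)  # 512 with these example numbers
```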
