NaN when training OccWorld #9
Comments
I get the same problem.
Me too.
I also have the NaN problem when using 4 RTX 4090s with batch size 1.
I also train the model on 4 GPUs, but on RTX 3090s. So does the NaN problem only happen on certain cards? Why does this happen...
Hello, what results did you get training on 8 RTX 4090s?
I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).
This is my training log. |
20231222_222239.log |
Thanks for your advice, I will try again. |
I changed the learning_rate from 1e-3 to 5e-4 and trained with 2x RTX 3090, and it trained successfully. The result is a little bit lower than the paper's.
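For anyone reproducing this fix, below is a minimal PyTorch sketch of the mitigations discussed in this thread: the lowered learning rate (5e-4 instead of 1e-3), plus a non-finite loss guard and gradient clipping as common extra safeguards (the latter two are not mentioned above). `model` and `batch` are placeholders, not OccWorld's actual training objects or config.

```python
import torch

def build_optimizer(model):
    # Lower learning rate (5e-4 instead of 1e-3), as reported above to avoid NaN.
    return torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)

def train_step(model, optimizer, batch):
    optimizer.zero_grad(set_to_none=True)
    loss = model(batch)                       # assumes the model returns a scalar loss
    if not torch.isfinite(loss):              # guard: skip the step on NaN/Inf loss
        print("non-finite loss, skipping this batch")
        return None
    loss.backward()
    # Gradient-norm clipping is a common extra safeguard against exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```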
Hi, thanks for your wonderful work.
When I trained the OccWorld model in the second stage, I obtained NaN after epoch 35. In the last epoch, the evaluation metrics are:
I'm not sure why this happens.
For the first stage, training the VQVAE, I obtained the best results at epoch_146; the results are:
Could you help me find the reasons? Thanks in advance.
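A minimal diagnostic sketch for locating where the NaN first appears, assuming a standard PyTorch training loop; `model` is a placeholder for the stage-2 OccWorld model, and the calls below are generic PyTorch utilities, not code from this repository.

```python
import torch

def find_nonfinite(model):
    """List the first parameters or gradients that contain NaN/Inf."""
    bad = []
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            bad.append(f"param: {name}")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(f"grad: {name}")
    return bad

# Pinpoints the backward op that produced NaN/Inf, at the cost of slower training.
torch.autograd.set_detect_anomaly(True)
```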