nan when training the OccWorld #9

Open
SeaBird-Go opened this issue Dec 11, 2023 · 10 comments
Comments

@SeaBird-Go

Hi, thanks for your wonderful work.

When I trained the OccWorld model in the second stage, I got NaN values after epoch 35. In the last epoch, the evaluation metrics were:

2023/12/10 16:25:53 - mmengine - INFO - Current val iou is [11.32928878068924, 11.171085387468338, 11.522497981786728, 11.246349662542343, 11.434061080217361, 11.45360991358757, 11.66500672698021, 11.577576398849487, 11.563246697187424, 11.626467853784561, 11.70591413974762, 11.869920790195465, 11.870458722114563, 11.824838817119598, 11.806531250476837] while the best val iou is [30.42658567428589, 34.664684534072876, 35.00288724899292, 34.887245297431946, 34.17853116989136, 34.29064750671387, 34.21706259250641, 34.07915234565735, 33.918631076812744, 33.681508898735046, 33.49155783653259, 33.52125287055969, 33.82187783718109, 33.90370011329651, 33.536869287490845]
2023/12/10 16:25:53 - mmengine - INFO - Current val miou is [1.2357821030123513, 1.3048479963532267, 1.3326199349274461, 1.286004960317822, 1.3091895496900714, 1.2971919745562928, 1.313220287727959, 1.2903256293879273, 1.2739735182977336, 1.2695381125120226, 1.268610595691237, 1.278414056115948, 1.2753155520733666, 1.2545286294292002, 1.2432696814785766] while the best val miou is [18.178497386329315, 21.72152917174732, 22.131569595897897, 22.358492542715634, 21.344122553572937, 21.65812363519388, 21.089138879495508, 20.842806030722226, 20.499638687161838, 20.376817619099334, 20.342478550532284, 20.2685889952323, 20.782285183668137, 20.823075622320175, 20.936191344962403]

I'm not sure why this happens.

For the first stage (training the VQVAE), I obtained the best results at epoch 146; the results are:

2023/12/09 09:21:31 - mmengine - INFO - Current val iou is [63.11628818511963, 63.11272978782654, 63.088274002075195, 63.07087540626526, 63.056015968322754, 63.0810022354126, 63.00719976425171, 62.911272048950195, 62.83813118934631, 62.745869159698486] while the best val iou is [63.202375173568726, 63.272738456726074, 63.23975324630737, 63.15116882324219, 63.2210910320282, 63.14346790313721, 63.13709020614624, 63.109904527664185, 63.041579723358154, 63.05258274078369]
2023/12/09 09:21:31 - mmengine - INFO - Current val miou is [66.91237109548905, 66.66993341025184, 66.66781867251676, 66.60014688968658, 66.64343599010917, 66.61616002812106, 66.6671146364773, 66.45773175884696, 66.08651210280026, 66.2292769726585] while the best val miou is [66.91237109548905, 66.98254785116981, 67.08006876356461, 66.98910103124732, 66.95465007249047, 66.91935833762673, 66.94301542113809, 67.0062887317994, 66.87079089529374, 66.90992292235879]

Could you help me find the reasons? Thanks in advance.
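
A minimal debugging sketch, assuming a standard PyTorch training loop (the model, loss_fn, batch, and optimizer names below are placeholders, not OccWorld's actual code), for catching the first non-finite loss and clipping gradients so the diverging step can be inspected:

```python
import torch

def training_step(model, loss_fn, batch, optimizer, step):
    # Hypothetical second-stage training step; adapt the names to the real loop.
    optimizer.zero_grad()
    loss = loss_fn(model(batch))

    # Fail fast on the first non-finite loss instead of letting NaN propagate
    # into the weights and the later evaluation metrics.
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")

    loss.backward()
    # Gradient clipping is a common mitigation for this kind of divergence;
    # max_norm=1.0 is an assumed value, not a setting from the repo.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss
```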

@VitaLemonTea1

I get the same problem.

@dk-liang

me too

@LMD0311

LMD0311 commented Dec 24, 2023

I also have the NaN problem when using 4 RTX 4090s with batch size 1.
When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results.
I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng
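
For reference, one way to reason about why 4 GPUs x batch 1 behaves differently from 8 GPUs is the linear scaling rule between learning rate and effective batch size. The sketch below is only an illustration with assumed reference values (8 GPUs, batch 1 per GPU, lr 1e-3 as the default mentioned later in this thread), not an official OccWorld recipe:

```python
# Illustrative only: scale the learning rate linearly with the effective batch size.
reference_gpus, reference_batch_per_gpu, reference_lr = 8, 1, 1e-3  # assumed reference setup
gpus, batch_per_gpu = 4, 1  # the setup that produced NaN in this thread

scaled_lr = reference_lr * (gpus * batch_per_gpu) / (reference_gpus * reference_batch_per_gpu)
print(f"{gpus} GPUs x batch {batch_per_gpu}: lr ~ {scaled_lr:.1e}")  # 5.0e-04
```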

@SeaBird-Go
Author

I also train the model on 4 GPUs, but on RTX 3090s. So the NaN problem happens with fewer cards? Why does this happen...

@liuziyang123

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

@LMD0311

LMD0311 commented Dec 25, 2023

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and an mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).

@liuziyang123

liuziyang123 commented Dec 26, 2023

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and an mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).

This is my training log.
20231211_114353.log
This is the eval log.
eval_stp3_0_5_11_20231212_165943.log
It seems that the results are different from the paper's. Could you provide your training log? Thanks!

@LMD0311

LMD0311 commented Dec 26, 2023

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and an mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).

This is my training log. 20231211_114353.log This is the eval log. eval_stp3_0_5_11_20231212_165943.log It seems that the results are different from the paper's. Could you provide your training log? Thanks!

20231222_222239.log
eval_stp3_0_5_11_20231223_135753.log
As I said, I didn't actually finish my training. BTW, maybe you can try the best VQVAE checkpoint to train OccWorld.
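
A rough sketch of what loading that checkpoint could look like; the path and the 'state_dict' key follow common mmengine conventions, and the 'vae.' prefix is purely a guess about how the repo nests the VQVAE inside the second-stage model:

```python
import torch

# Hypothetical checkpoint path; point this at the best first-stage checkpoint.
ckpt = torch.load("out/vqvae/epoch_146.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # mmengine checkpoints wrap weights in 'state_dict'

# If the world model nests the VQVAE under a prefix such as 'vae.', remap the
# keys before calling load_state_dict(..., strict=False) on the stage-2 model.
vae_weights = {f"vae.{k}": v for k, v in state_dict.items()}
print(f"prepared {len(vae_weights)} tensors from the stage-1 checkpoint")
```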

@liuziyang123

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and an mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).

This is my training log. 20231211_114353.log This is the eval log. eval_stp3_0_5_11_20231212_165943.log It seems that the results are different from the paper's. Could you provide your training log? Thanks!

20231222_222239.log eval_stp3_0_5_11_20231223_135753.log As I said, I didn't actually finish my training. BTW, maybe you can try the best VQVAE checkpoint to train OccWorld.

Thanks for your advice, I will try again.

@VitaLemonTea1

I changed the learning rate from 1e-3 to 5e-4 and trained with 2x RTX 3090, and it trained successfully. The result is a little bit lower than the paper's.
Here is my result.
12/30 09:33:57 - mmengine - INFO - avg val iou is 25.569845736026764
12/30 09:33:57 - mmengine - INFO - avg val miou is 15.504468316394911
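
For anyone trying the same fix, a hypothetical mmengine-style config fragment is sketched below; the optim_wrapper/optimizer layout and the AdamW type are assumptions about how the OccWorld config is organized, so verify against the repo's actual config files:

```python
# Hypothetical override of the second-stage config (field names assumed).
optim_wrapper = dict(
    optimizer=dict(
        type='AdamW',
        lr=5e-4,  # lowered from 1e-3, as reported to avoid NaN on fewer GPUs
    ),
)
```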
