nan when training the OccWorld #9

Open
SeaBird-Go opened this issue Dec 11, 2023 · 10 comments
Comments

@SeaBird-Go

Hi, thanks for your wonderful work.

When I trained the OccWorld model in the second stage, I got NaN values after epoch 35. In the last epoch, the evaluation metrics were:

2023/12/10 16:25:53 - mmengine - INFO - Current val iou is [11.32928878068924, 11.171085387468338, 11.522497981786728, 11.246349662542343, 11.434061080217361, 11.45360991358757, 11.66500672698021, 11.577576398849487, 11.563246697187424, 11.626467853784561, 11.70591413974762, 11.869920790195465, 11.870458722114563, 11.824838817119598, 11.806531250476837] while the best val iou is [30.42658567428589, 34.664684534072876, 35.00288724899292, 34.887245297431946, 34.17853116989136, 34.29064750671387, 34.21706259250641, 34.07915234565735, 33.918631076812744, 33.681508898735046, 33.49155783653259, 33.52125287055969, 33.82187783718109, 33.90370011329651, 33.536869287490845]
2023/12/10 16:25:53 - mmengine - INFO - Current val miou is [1.2357821030123513, 1.3048479963532267, 1.3326199349274461, 1.286004960317822, 1.3091895496900714, 1.2971919745562928, 1.313220287727959, 1.2903256293879273, 1.2739735182977336, 1.2695381125120226, 1.268610595691237, 1.278414056115948, 1.2753155520733666, 1.2545286294292002, 1.2432696814785766] while the best val miou is [18.178497386329315, 21.72152917174732, 22.131569595897897, 22.358492542715634, 21.344122553572937, 21.65812363519388, 21.089138879495508, 20.842806030722226, 20.499638687161838, 20.376817619099334, 20.342478550532284, 20.2685889952323, 20.782285183668137, 20.823075622320175, 20.936191344962403]

I'm not sure why this happens.

For the first stage (training the VQVAE), I obtained the best results at epoch 146; the results are:

2023/12/09 09:21:31 - mmengine - INFO - Current val iou is [63.11628818511963, 63.11272978782654, 63.088274002075195, 63.07087540626526, 63.056015968322754, 63.0810022354126, 63.00719976425171, 62.911272048950195, 62.83813118934631, 62.745869159698486] while the best val iou is [63.202375173568726, 63.272738456726074, 63.23975324630737, 63.15116882324219, 63.2210910320282, 63.14346790313721, 63.13709020614624, 63.109904527664185, 63.041579723358154, 63.05258274078369]
2023/12/09 09:21:31 - mmengine - INFO - Current val miou is [66.91237109548905, 66.66993341025184, 66.66781867251676, 66.60014688968658, 66.64343599010917, 66.61616002812106, 66.6671146364773, 66.45773175884696, 66.08651210280026, 66.2292769726585] while the best val miou is [66.91237109548905, 66.98254785116981, 67.08006876356461, 66.98910103124732, 66.95465007249047, 66.91935833762673, 66.94301542113809, 67.0062887317994, 66.87079089529374, 66.90992292235879]

Could you help me find the reasons? Thanks in advance.
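
A minimal debugging sketch, assuming a standard PyTorch training loop (the model, loss_fn, batch, and optimizer names below are placeholders, not OccWorld's actual code), for catching the first non-finite loss and clipping gradients so the diverging step can be inspected:

```python
import torch

def training_step(model, loss_fn, batch, optimizer, step):
    # Hypothetical second-stage training step; adapt the names to the real loop.
    optimizer.zero_grad()
    loss = loss_fn(model(batch))

    # Fail fast on the first non-finite loss instead of letting NaN propagate
    # into the weights and the later evaluation metrics.
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")

    loss.backward()
    # Gradient clipping is a common mitigation for this kind of divergence;
    # max_norm=1.0 is an assumed value, not a setting from the repo.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss
```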

@VitaLemonTea1

I get the same problem.

@dk-liang

me too

@LMD0311

LMD0311 commented Dec 24, 2023

I also have the NaN problem when using 4 RTX 4090s with batch size 1.
When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results.
I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng
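
For reference, one way to reason about why 4 GPUs x batch 1 behaves differently from 8 GPUs is the linear scaling rule between learning rate and effective batch size. The sketch below is only an illustration with assumed reference values (8 GPUs, batch 1 per GPU, lr 1e-3 as the default mentioned later in this thread), not an official OccWorld recipe:

```python
# Illustrative only: scale the learning rate linearly with the effective batch size.
reference_gpus, reference_batch_per_gpu, reference_lr = 8, 1, 1e-3  # assumed reference setup
gpus, batch_per_gpu = 4, 1  # the setup that produced NaN in this thread

scaled_lr = reference_lr * (gpus * batch_per_gpu) / (reference_gpus * reference_batch_per_gpu)
print(f"{gpus} GPUs x batch {batch_per_gpu}: lr ~ {scaled_lr:.1e}")  # 5.0e-04
```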

@SeaBird-Go
Author

I also train the model on 4 GPUs, but on RTX 3090s. So the NaN problem happens with fewer cards? Why does this happen...

@liuziyang123

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

@LMD0311

LMD0311 commented Dec 25, 2023

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and an mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).

@liuziyang123

liuziyang123 commented Dec 26, 2023

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and an mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).

This is my training log.
20231211_114353.log
This is the eval log.
eval_stp3_0_5_11_20231212_165943.log
It seems that the results are different from the paper's. Could you provide your training log? Thanks!

@LMD0311

LMD0311 commented Dec 26, 2023

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and an mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).

This is my training log. 20231211_114353.log This is the eval log. eval_stp3_0_5_11_20231212_165943.log It seems that the results are different from the paper's. Could you provide your training log? Thanks!

20231222_222239.log
eval_stp3_0_5_11_20231223_135753.log
As I said, I didn't actually finish my training. BTW, maybe you can try the best VQVAE checkpoint to train OccWorld.
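
A rough sketch of what loading that checkpoint could look like; the path and the 'state_dict' key follow common mmengine conventions, and the 'vae.' prefix is purely a guess about how the repo nests the VQVAE inside the second-stage model:

```python
import torch

# Hypothetical checkpoint path; point this at the best first-stage checkpoint.
ckpt = torch.load("out/vqvae/epoch_146.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # mmengine checkpoints wrap weights in 'state_dict'

# If the world model nests the VQVAE under a prefix such as 'vae.', remap the
# keys before calling load_state_dict(..., strict=False) on the stage-2 model.
vae_weights = {f"vae.{k}": v for k, v in state_dict.items()}
print(f"prepared {len(vae_weights)} tensors from the stage-1 checkpoint")
```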

@liuziyang123

I also have the NaN problem when using 4 RTX 4090s with batch size 1. When I use 4 RTX 4090s with batch size 2, I avoid the NaN problem, but I can't get similar results. I find that only 8 RTX 4090s work fine. Any suggestions? Thanks a lot! @wzzheng

Hello, what are the results that you got training on 8 RTX 4090s?

I didn't actually finish my training, but the epoch-155 checkpoint obtains an avg IoU of 26.05 and an mIoU of 16.47, which are similar to the paper's results (26.63 / 17.14).

This is my training log. 20231211_114353.log This is the eval log. eval_stp3_0_5_11_20231212_165943.log It seems that the results are different from the paper's. Could you provide your training log? Thanks!

20231222_222239.log eval_stp3_0_5_11_20231223_135753.log As I said, I didn't actually finish my training. BTW, maybe you can try the best VQVAE checkpoint to train OccWorld.

Thanks for your advice, I will try again.

@VitaLemonTea1

I changed the learning rate from 1e-3 to 5e-4 and trained with 2x RTX 3090, and it trained successfully. The result is a little bit lower than the paper's.
Here is my result.
12/30 09:33:57 - mmengine - INFO - avg val iou is 25.569845736026764
12/30 09:33:57 - mmengine - INFO - avg val miou is 15.504468316394911
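
For anyone trying the same fix, a hypothetical mmengine-style config fragment is sketched below; the optim_wrapper/optimizer layout and the AdamW type are assumptions about how the OccWorld config is organized, so verify against the repo's actual config files:

```python
# Hypothetical override of the second-stage config (field names assumed).
optim_wrapper = dict(
    optimizer=dict(
        type='AdamW',
        lr=5e-4,  # lowered from 1e-3, as reported to avoid NaN on fewer GPUs
    ),
)
```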
