
one problem in the network training #18

Open · WBinke opened this issue Jul 2, 2018 · 2 comments

WBinke commented Jul 2, 2018

When I train on COCO as the README describes, I run into the problem shown in the log below, and then NMSLoss_pos and NMSLoss_neg become NaN. Has anyone met the same problem and can give me some help?

('lr', 0.0005, 'lr_epoch_diff', [5.33], 'lr_iters', [625027])
Epoch[0] Batch [100] Speed: 5.08 samples/sec Train-RPNAcc=0.847250, RPNLogLoss=0.376764, RPNL1Loss=0.187504, RCNNAcc=0.801361, RCNNLogLoss=1.674762, RCNNL1Loss=0.311297, NMSLoss_pos=0.035744, NMSLoss_neg=0.016391, NMSAcc_pos=0.000000, NMSAcc_neg=1.000000,
Epoch[0] Batch [200] Speed: 5.10 samples/sec Train-RPNAcc=0.865089, RPNLogLoss=0.328289, RPNL1Loss=0.176516, RCNNAcc=0.811237, RCNNLogLoss=1.380794, RCNNL1Loss=0.316205, NMSLoss_pos=0.048681, NMSLoss_neg=0.013534, NMSAcc_pos=0.000000, NMSAcc_neg=1.000000,
Epoch[0] Batch [300] Speed: 5.11 samples/sec Train-RPNAcc=0.874916, RPNLogLoss=0.302038, RPNL1Loss=0.159570, RCNNAcc=0.802546, RCNNLogLoss=1.319950, RCNNL1Loss=0.352934, NMSLoss_pos=0.057433, NMSLoss_neg=0.013499, NMSAcc_pos=0.000000, NMSAcc_neg=1.000000,
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:128: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:129: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:133: RuntimeWarning: invalid value encountered in subtract
pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * (pred_w - 1.0)
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:135: RuntimeWarning: invalid value encountered in subtract
pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * (pred_h - 1.0)
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:137: RuntimeWarning: invalid value encountered in add
pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * (pred_w - 1.0)
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:139: RuntimeWarning: invalid value encountered in add
pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * (pred_h - 1.0)
experiments/relation_rcnn/../../relation_rcnn/operator_py/proposal.py:180: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]

Epoch[0] Batch [400] Speed: 5.02 samples/sec Train-RPNAcc=0.871289, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.810123, RCNNLogLoss=1.576645, RCNNL1Loss=0.334166, NMSLoss_pos=0.054120, NMSLoss_neg=nan, NMSAcc_pos=0.000000, NMSAcc_neg=0.999650,
Epoch[0] Batch [500] Speed: 4.91 samples/sec Train-RPNAcc=0.859804, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.836702, RCNNLogLoss=1.888214, RCNNL1Loss=0.267614, NMSLoss_pos=nan, NMSLoss_neg=nan, NMSAcc_pos=0.000000, NMSAcc_neg=0.999720,
Epoch[0] Batch [600] Speed: 4.99 samples/sec Train-RPNAcc=0.850682, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.853031, RCNNLogLoss=1.725999, RCNNL1Loss=0.223882, NMSLoss_pos=nan, NMSLoss_neg=nan, NMSAcc_pos=0.000000, NMSAcc_neg=0.999767,
Epoch[0] Batch [700] Speed: 4.98 samples/sec Train-RPNAcc=0.844466, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.865544, RCNNLogLoss=1.547918, RCNNL1Loss=0.192278, NMSLoss_pos=nan, NMSLoss_neg=nan, NMSAcc_pos=0.000000, NMSAcc_neg=0.999800,
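
For context on the warnings: the overflow comes from calling np.exp on unbounded regression deltas, so once a delta blows up the decoded boxes become inf/NaN and the losses follow. Below is a minimal sketch of the clipping that py-faster-rcnn-style decoders commonly apply before the exp; the clip constant and function name are illustrative assumptions, not this repository's actual bbox_transform.py.

```python
import numpy as np

# Assumed clip value (log(1000/16) is commonly used in Detectron-style code);
# it bounds the argument of np.exp so the decoded width/height stay finite.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def decode_boxes(boxes, deltas):
    """Decode (dx, dy, dw, dh) deltas against reference boxes, clipping dw/dh."""
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * (widths - 1.0)
    ctr_y = boxes[:, 1] + 0.5 * (heights - 1.0)

    dx = deltas[:, 0::4]
    dy = deltas[:, 1::4]
    dw = np.minimum(deltas[:, 2::4], BBOX_XFORM_CLIP)  # prevents the overflow in exp
    dh = np.minimum(deltas[:, 3::4], BBOX_XFORM_CLIP)

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=np.float64)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * (pred_w - 1.0)
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * (pred_h - 1.0)
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * (pred_w - 1.0)
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * (pred_h - 1.0)
    return pred_boxes
```

Note that clipping only hides the symptom; the divergence itself (see the replies below) still needs a smaller lr or a luckier initialization.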

ancientmooner (Member) commented

If you encounter NaN, please try a few more times until there is no NaN; some random initializations can cause divergence. If the problem still persists, it might be because the base lr is too large for your task. In that case, please use a smaller base lr.
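
As a rough illustration of catching this early instead of waiting for the printed metrics to show nan, one could check the metric values at each logging interval and abort; the dict-based metric layout below is an assumption for illustration, not this repository's training loop.

```python
import math

def assert_no_nan(metric_values):
    """metric_values: dict such as {'RPNLogLoss': 0.33, 'NMSLoss_neg': float('nan')}."""
    bad = [name for name, value in metric_values.items() if math.isnan(value)]
    if bad:
        raise RuntimeError(
            "NaN detected in %s -- restart with a different seed or a smaller base lr"
            % ", ".join(bad)
        )
```

With the base lr of 0.0005 shown in the log, "a smaller base lr" would mean something like 0.00025 or lower.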

yafz commented Apr 18, 2019

Seconded. Either your data layer is incorrect or you need to adjust the learning policy (use a smaller base lr, try warmup, ...).
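
For the warmup option, a minimal linear-warmup sketch is below, assuming the learning rate is computed per iteration. The base lr of 0.0005 matches the log above; the warmup length and starting factor are illustrative values, not settings taken from this repository's configs.

```python
def lr_at_iter(it, base_lr=0.0005, warmup_iters=500, warmup_factor=1.0 / 3.0):
    """Ramp the lr linearly from base_lr * warmup_factor up to base_lr."""
    if it < warmup_iters:
        alpha = it / float(warmup_iters)
        return base_lr * (warmup_factor * (1.0 - alpha) + alpha)
    return base_lr
```

Starting small for the first few hundred iterations gives the box regression a chance to settle before the full lr is applied.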
