Getting bounding box loss as nan #2
Hi @ravikantb, loss going to NaN usually happens when the learning rate is too high. However, I will check the implementation of the layers again to make sure there is no bug. Can you paste the Caffe output log (like the one you have above) for all iterations from iteration 0 to iteration 20 or so? Also, can you paste your training prototxt file here? Please note that I am currently not working on the project due to other work, so my responses may be delayed. Thanks
Hi @ravikantb, When training Mask-RCNN with the ROIAlign layer, I do not get nan in the loss.
Hi @jasjeetIM, thanks for your detailed response and apologies for the delay in mine. Please find attached a zip file containing the solver.prototxt, the training prototxt, and sample logs for 100 iterations. I have tried learning rates ranging from 0.001 to 0.000001, but the loss always becomes NaN after some time. To give you a bit more detail about my implementation: I took your ROIAlign layers (both CPU and GPU) and added them to the Caffe framework as per the instructions given in the two links below. After that I replaced the ROIPooling layers in my Faster R-CNN prototxts with ROIAlign. Since I am using the alternating optimization technique there, I have sent you the prototxt used for training the Fast R-CNN component of Faster R-CNN (stage 2: in this stage, output proposals from the RPN are converted to fixed-length vectors using the ROIAlign/ROIPooling layers). I really appreciate you going back and checking your implementation for this issue. Please have a look at the provided documents and let me know if you find something wrong there. Thanks,
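For reference, a minimal sketch of what such a swap typically looks like in the stage-2 train prototxt, assuming the layer registers as type "ROIAlign" (matching the type() string quoted later in this thread) and reads the same pooling geometry as ROIPooling. The layer/blob names and the parameter message name are illustrative, not taken from the attached prototxt:

```
# Illustrative stage-2 train.prototxt excerpt; names are hypothetical.
# Before: the standard ROIPooling layer from py-faster-rcnn.
# layer {
#   name: "roi_pool5"
#   type: "ROIPooling"
#   bottom: "conv5_3"
#   bottom: "rois"
#   top: "pool5"
#   roi_pooling_param { pooled_w: 7 pooled_h: 7 spatial_scale: 0.0625 }
# }
# After: same bottoms, top, and pooling geometry; only the type changes.
layer {
  name: "roi_align5"
  type: "ROIAlign"
  bottom: "conv5_3"
  bottom: "rois"
  top: "pool5"
  # Whether this is roi_pooling_param or a dedicated roi_align_param
  # depends on how the layer was registered in caffe.proto.
  roi_pooling_param {
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625  # 1/16 for a VGG16 conv5_3 feature map
  }
}
```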
Please change the values listed below in your solver.prototxt and send the logs again. Please make sure the logs contain training for at least 100 iterations. Also, provide logs for the above solver.prototxt with the ROIPooling layer used instead of the ROIAlign layer on the same training set for 100 iterations. That gives two training logs: 1) ROIAlign, 2) ROIPooling, where both runs use the solver.prototxt values pasted above. Lastly, do you have your code available in an online repo? Thanks
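The specific values are not reproduced here. Purely as a hypothetical illustration of the kind of solver settings used for a short side-by-side debugging run (a lowered learning rate, per-iteration display, and Caffe's debug_info, which would match the debug-mode logs mentioned in the next reply), such a solver.prototxt might look like this; every value below is illustrative, not the one actually requested:

```
# Hypothetical solver.prototxt for a short debugging run; all values are
# illustrative, not the ones actually requested in this thread.
train_net: "models/fast_rcnn/stage2_train.prototxt"  # hypothetical path
base_lr: 0.0001          # lowered learning rate
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.0005
display: 1               # print losses every iteration
max_iter: 100            # 100 iterations, as requested above
debug_info: true         # per-blob forward/backward statistics in the log
snapshot: 0              # no snapshots needed for a debug run
```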
Hi @jasjeetIM, please find attached logs for both runs with the changes you suggested. Our code base is not yet public since it is our organization's property, but I shall talk to my team and see if I can give you access to it; I will get back to you on this soon. Honestly, I had not worked with Caffe's debug mode before as I found it too verbose, but it seems to output useful information in this case. I shall try to see if I can find the root cause using these logs. Meanwhile, if you get time to look at them and find anything useful, please let me know. Thanks,
Hi @ravikantb, Okay, thanks. Can you do the following to troubleshoot:
Add the following after line 164:
This will help me look at the boundary condition that may be causing 'inf' as one of the pooled values.
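A minimal sketch of the kind of check being described, a guard inside the ROIAlign forward pooling loop that logs whenever a pooled output comes out non-finite. The variable names (maxval, n, c, ph, pw, and the ROI bounds) are placeholders for whatever the actual roi_align_layer.cpp uses around that point; they are not quoted from the implementation:

```cpp
// Hypothetical debug check to paste inside the ROIAlign forward pooling loop.
// Variable names are placeholders for the ones used in the actual layer code.
// Requires <cmath> for std::isfinite; LOG(INFO) is the glog macro Caffe uses.
if (!std::isfinite(static_cast<double>(maxval))) {
  LOG(INFO) << "Non-finite pooled value at (n, c, ph, pw) = ("
            << n << ", " << c << ", " << ph << ", " << pw << ")"
            << "  roi_start_w=" << roi_start_w
            << "  roi_end_w="   << roi_end_w
            << "  roi_start_h=" << roi_start_h
            << "  roi_end_h="   << roi_end_h;
}
```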
Hi @jasjeetIM, thanks for all the help. Due to some unforeseen circumstances, I have to step away from this project for 2-3 days. I will get back to you with the required data after that. Please keep this issue open till then. Thanks,
Hi @ravikantb, could you share how you added the RoIAlign layer to the Caffe framework? Many thanks in advance.
@MartinPlantinga: I added the following code to the 'fast_rcnn_layers.hpp' file to do this. Hope it helps. (P.S.: I didn't use GitHub's code formatting, as it was messing with my code snippet.)

/* ROIAlignLayer - Region of Interest Align Layer */
virtual inline const char* type() const { return "ROIAlign"; }
virtual inline int MinBottomBlobs() const { return 2; }
protected:
int channels_;
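For a fuller picture, here is a hedged sketch of how such a declaration usually sits in fast_rcnn_layers.hpp, following the pattern of the ROIPoolingLayer declaration already in that header. Everything beyond the lines quoted above (the constructor, the Forward/Backward declarations, and the extra member variables) is an assumption about the implementation, not a quote from it:

```cpp
// Hypothetical ROIAlignLayer declaration, inside namespace caffe in
// fast_rcnn_layers.hpp, mirroring the existing ROIPoolingLayer pattern.
template <typename Dtype>
class ROIAlignLayer : public Layer<Dtype> {
 public:
  explicit ROIAlignLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "ROIAlign"; }
  virtual inline int MinBottomBlobs() const { return 2; }  // feature map + rois
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int MinTopBlobs() const { return 1; }
  virtual inline int MaxTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  // Pooling geometry and input shape cached in LayerSetUp/Reshape
  // (names assumed, by analogy with ROIPoolingLayer).
  int channels_;
  int height_;
  int width_;
  int pooled_height_;
  int pooled_width_;
  Dtype spatial_scale_;
};
```

Note that declaring the class is only part of the job: the layer also needs to be registered in its .cpp file (Caffe's REGISTER_LAYER_CLASS macro) and, if it uses its own parameter message, added to caffe.proto; neither step is shown in the quoted snippet.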
Thanks @ravikantb!!
Hi @jasjeetIM,
I took your ROIAlign layers and integrated them into my py-faster-rcnn code base to replace the ROIPooling layers. However, I am getting the bounding box loss as NaN for all iterations so far while training Fast R-CNN (stage 2) of the alternating training optimization. Did you also face this problem while training? Please let me know.
I shall dig deeper and get back to you if I find anything worth sharing. The following are my sample logs of the loss, for your reference.
Thanks