Getting bounding box loss as nan #2

Open
ravikantb opened this issue Jul 7, 2017 · 10 comments

@ravikantb

Hi @jasjeetIM ,

I took your ROIAlign layers and integrated them into my py-Faster-RCNN code base to replace the ROIPooling layers. But I am getting the bounding-box loss as NaN for all iterations so far while training Fast-RCNN (stage 2) in the alternating training optimization. Did you also face any such problem while training? Please let me know.
I shall dig deeper and get back if I find anything worth sharing. Below are my sample logs of the loss for your reference.

I0707 10:38:07.833065 10773 solver.cpp:228] Iteration 1940, loss = nan
I0707 10:38:07.833112 10773 solver.cpp:244]     Train net output #0: loss_bbox = nan (* 1 = nan loss)
I0707 10:38:07.833133 10773 solver.cpp:244]     Train net output #1: loss_cls = 87.3365 (* 1 = 87.3365 loss)

Thanks

@jasjeetIM
Owner

Hi @ravikantb,

Loss going to nan usually happens when the learning rate is too high. However, I will check the implementation of the layers again to make sure there is no bug.

Can you paste the Caffe output log (like you have above) for all iterations from Iteration 0 - Iteration 20 or so? Also, can you paste your training prototxt file here?

Please note that I am currently not working on this project due to other work and hence may be delayed in responding.

Thanks
Jay

@jasjeetIM
Owner

Hi @ravikantb,

When training Mask-RCNN with the ROIAlign layer, I do not get nan in the loss.

I0707 14:32:50.012135 3002 solver.cpp:229] Train net output #0: accuarcy = 0
I0707 14:32:50.012141 3002 solver.cpp:229] Train net output #1: loss_bbox = 6.75187 (* 1 = 6.75187 loss)
I0707 14:32:50.012145 3002 solver.cpp:229] Train net output #2: loss_cls = 4.39446 (* 1 = 4.39446 loss)
I0707 14:32:50.012151 3002 solver.cpp:229] Train net output #3: loss_mask = 2.45052e+20 (* 1 = 2.45052e+20 loss)
I have also checked the forward and backward passes of the ROIAlign layer (I used the Caffe gradient checker for the backward pass). I can have a look at your output logs, solver.prototxt, and train.prototxt.
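For reference, the backward-pass check follows the same pattern as Caffe's existing layer tests. Below is a minimal sketch only, assuming a gtest fixture (here called ROIAlignLayerTest) that fills blob_bottom_vec_ with the feature-map and ROI blobs and blob_top_vec_ with the output blob, and assuming the layer reuses the roi_pooling_param message; the include paths are also assumptions:

#include "caffe/fast_rcnn_layers.hpp"                // assumed header location
#include "caffe/test/test_gradient_check_util.hpp"   // Caffe's GradientChecker

TYPED_TEST(ROIAlignLayerTest, TestGradient) {
  typedef typename TypeParam::Dtype Dtype;
  LayerParameter layer_param;
  // Assumption: ROIAlign reuses the ROIPooling parameter message.
  ROIPoolingParameter* roi_param = layer_param.mutable_roi_pooling_param();
  roi_param->set_pooled_h(6);
  roi_param->set_pooled_w(6);
  roi_param->set_spatial_scale(0.0625);  // 1/16 for conv5 features
  ROIAlignLayer<Dtype> layer(layer_param);
  GradientChecker<Dtype> checker(1e-4, 1e-2);
  // Check gradients w.r.t. the feature map (bottom 0) only; the ROI
  // coordinates (bottom 1) are not differentiated through.
  checker.CheckGradientExhaustive(&layer, this->blob_bottom_vec_,
                                  this->blob_top_vec_, 0);
}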

@ravikantb
Author

Hi @jasjeetIM ,

Thanks for your detailed response and apologies for the delay in mine. Please find attached a zip file containing the solver.prototxt, the training prototxt, and sample logs for 100 iterations. I have tried learning rates ranging from 0.001 to 0.000001, but the loss always becomes NaN after some time.

Just to give you a bit more detail about my implementation, I took your implementation of the ROIAlign layers (both CPU and GPU) and added them to the Caffe framework as per the instructions given in the two links below:
https://github.com/BVLC/caffe/wiki/Development
https://github.com/BVLC/caffe/wiki/Simple-Example:-Sin-Layer
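(For anyone following the same steps: besides the .hpp declaration, the .cpp/.cu sources need the standard Caffe registration at the bottom. A rough sketch of that boilerplate is below; whether this repo uses exactly this form is an assumption based on common Caffe practice.)

// End of src/caffe/layers/roi_align_layer.cpp (sketch):
#ifdef CPU_ONLY
STUB_GPU(ROIAlignLayer);            // fall back to stubs when built without CUDA
#endif

INSTANTIATE_CLASS(ROIAlignLayer);   // instantiate the float/double versions
REGISTER_LAYER_CLASS(ROIAlign);     // makes type: "ROIAlign" resolvable from prototxt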

After that I replaced the ROIPooling layers in my Faster-RCNN prototxts with ROIAlign. Since I am using the alternating optimization technique there, I have sent you the prototxt that is used for training the Fast-RCNN component of Faster-RCNN (stage 2: in this stage, output proposals from the RPN are converted to fixed-length vectors using the ROIAlign/ROIPooling layers).
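As an illustration of that swap, here is a sketch based on the stock py-faster-rcnn VGG16 prototxt; the layer/blob names and the assumption that ROIAlign reuses roi_pooling_param may differ from the actual prototxt in use:

layer {
  name: "roi_align5"
  type: "ROIAlign"            # was: type: "ROIPooling" with name: "roi_pool5"
  bottom: "conv5_3"
  bottom: "rois"
  top: "pool5"
  roi_pooling_param {         # assumed: ROIAlign reuses the ROIPooling parameters
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625     # 1/16, the conv5 feature stride
  }
}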

I really appreciate you going back and checking your implementation for this issue. Please have a look at the provided documents and let me know if you find something wrong there.
solver_prototxt_logs.zip

Thanks,
Ravikant

@jasjeetIM
Owner

jasjeetIM commented Jul 10, 2017

Please change the values below in your solver.prototxt and send the logs again:
display: 1
base_lr: 0.0001
clip_gradients: 100
debug_info: true

Please make sure the logs contain training for at least 100 iterations. Also, provide logs for the above solver.prototxt with the ROIPooling layer used instead of the ROIAlign layer, on the same training set, for 100 iterations. That gives two training logs: 1) ROIAlign, 2) ROIPooling, where both trainings use the solver.prototxt values pasted above.
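For clarity, those values sit directly in the solver file like this; everything other than the four changed fields is an illustrative placeholder, not a copy of your actual solver.prototxt:

train_net: "models/pascal_voc/VGG16/fast_rcnn/train.prototxt"   # placeholder path
base_lr: 0.0001
lr_policy: "step"
gamma: 0.1
stepsize: 30000
momentum: 0.9
weight_decay: 0.0005
display: 1            # log every iteration
clip_gradients: 100   # cap the global gradient norm
debug_info: true      # per-blob data/diff statistics in the log
snapshot: 0
snapshot_prefix: "vgg16_fast_rcnn"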

Lastly, do you have your code available on an online repo?

Thanks

@ravikantb
Author

Hi @jasjeetIM

Please find attached logs for both the runs with the changes you suggested.
logs.zip

Our code base is not yet public as it is our organization's property, but I shall talk to my team and see if I can give you access to it. I will get back to you on this soon.

Honestly, I had not worked with Caffe's debug mode before as I found it too verbose, but it seems to output useful information in this case. I shall try to see if I can find the root cause using these logs. Meanwhile, if you get time to look at them and find anything useful, please let me know.

Thanks,
Ravikant

@jasjeetIM
Owner

Hi @ravikantb,

Okay, thanks. Can you do the following to troubleshoot:

  1. Run Caffe in CPU mode (see the sketch after this list).
  2. Modify the code in the roi_align_layer.cpp file (https://github.com/jasjeetIM/Mask-RCNN/blob/master/external/caffe/src/caffe/layers/roi_align_layer.cpp) by adding the following after line 164:
     LOG(INFO) << "(h_idx, w_idx, h_idx_n, w_idx_n) = (" << h_idx << "," << w_idx << "," << h_idx_n << "," << w_idx_n << ")";
     LOG(INFO) << "Multiplier = " << multiplier[counter];
     LOG(INFO) << "Data value = " << batch_data[b_index_curr[counter]];
     LOG(INFO) << "Current Pooled value = " << bisampled[smp/2];
  3. Recompile Caffe.
  4. Run the same experiment as you did here: https://github.com/jasjeetIM/Mask-RCNN/issues/2#issuecomment-314138508. However, you only need to run it for 10 iterations, as the output will be very verbose and noisy.
  5. Once done, please send the logs to me again.

This will help me look at the boundary condition that may be causing 'inf' as one of the pooled values.
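For step 1, a one-line change in the solver is usually enough; a minimal sketch is below (Caffe defaults to GPU mode, and if the py-faster-rcnn training scripts also select the device explicitly, e.g. via a --gpu flag, that would need to be bypassed as well):

# solver.prototxt
solver_mode: CPU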

@ravikantb
Author

Hi @jasjeetIM ,

Thanks for all the help. Due to some unforeseen circumstances, however, I have to step away from this project for 2-3 days. I will get back to you with the required data after that. Please keep this issue open till then.

Thanks,
Ravikant

@MartinPlantinga

MartinPlantinga commented Aug 24, 2017

Hi @ravikantb,

Could you share how you added the ROIAlign layer to include/caffe/layers/fast_rcnn_layers.hpp (step 1 in https://github.com/BVLC/caffe/wiki/Development)?

Many thanks in advance.

@ravikantb
Author

ravikantb commented Oct 16, 2017

@MartinPlantinga: I added the following code to the 'fast_rcnn_layers.hpp' file to do this. Hope it helps.

(P.S.: I didn't use GitHub's code formatting as it was messing with my code snippet.)

/* ROIAlignLayer - Region of Interest Align Layer */
template <typename Dtype>
class ROIAlignLayer : public Layer<Dtype> {
 public:
  explicit ROIAlignLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "ROIAlign"; }

  virtual inline int MinBottomBlobs() const { return 2; }
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int MinTopBlobs() const { return 1; }
  virtual inline int MaxTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  int channels_;
  int height_;
  int width_;
  int pooled_height_;
  int pooled_width_;
  Dtype spatial_scale_;
  Blob<int> max_idx_;     // element types of these buffers are assumed here:
  Blob<Dtype> max_mult_;  // indices as int, bilinear weights as Dtype
  Blob<int> max_pts_;
};
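Besides the header, the layer also needs its parameters wired into src/caffe/proto/caffe.proto, either by reusing the existing roi_pooling_param or with a dedicated message. A rough sketch of the dedicated-message option is below; the field number 150 is purely illustrative (pick the next unused ID in your fork), and the field layout mirrors ROIPoolingParameter rather than copying this repo's actual proto:

// Inside message LayerParameter:
optional ROIAlignParameter roi_align_param = 150;  // 150 is illustrative

message ROIAlignParameter {
  optional uint32 pooled_h = 1 [default = 0];
  optional uint32 pooled_w = 2 [default = 0];
  // Multiplicative factor to map ROI coordinates to the feature-map scale,
  // e.g. 1/16 for VGG16 conv5 features.
  optional float spatial_scale = 3 [default = 1];
}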

@MartinPlantinga

Thanks @ravikantb !!
