Getting bounding box loss as nan #2
Hi @ravikantb, loss going to NaN usually happens when the learning rate is too high. However, I will check the implementation of the layers again to make sure there is no bug. Can you paste the Caffe output log (like the one you have above) for all iterations from iteration 0 to iteration 20 or so? Also, can you paste your training prototxt file here? Please note that I am currently not working on the project due to other work, so my responses may be delayed. Thanks
Hi @ravikantb, When training Mask-RCNN with the ROIAlign layer, I do not get nan in the loss.
Hi @jasjeetIM, thanks for your detailed response and apologies for the delay in mine. Please find attached a zip file containing the solver.prototxt, the training prototxt, and sample logs for 100 iterations. I have tried learning rates ranging from 0.001 to 0.000001, but the loss always becomes NaN after some time. To give you a bit more detail about my implementation: I took your ROIAlign layers (both CPU and GPU) and added them to the Caffe framework as per the instructions given in the two links below. After that I replaced the ROIPooling layers in my Faster R-CNN prototxts with ROIAlign. Since I am using the alternating optimization technique there, I have sent you the prototxt used for training the Fast R-CNN component of Faster R-CNN (stage 2: in this stage, output proposals from the RPN are converted to fixed-length vectors using the ROIAlign/ROIPooling layers). I really appreciate you going back and checking your implementation for this issue. Please have a look at the provided documents and let me know if you find something wrong there. Thanks,
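For reference, a minimal sketch of what such a swap typically looks like in the stage-2 train prototxt, assuming the layer registers as type "ROIAlign" (matching the type() string quoted later in this thread) and reads the same pooling geometry as ROIPooling. The layer/blob names and the parameter message name are illustrative, not taken from the attached prototxt:

```
# Illustrative stage-2 train.prototxt excerpt; names are hypothetical.
# Before: the standard ROIPooling layer from py-faster-rcnn.
# layer {
#   name: "roi_pool5"
#   type: "ROIPooling"
#   bottom: "conv5_3"
#   bottom: "rois"
#   top: "pool5"
#   roi_pooling_param { pooled_w: 7 pooled_h: 7 spatial_scale: 0.0625 }
# }
# After: same bottoms, top, and pooling geometry; only the type changes.
layer {
  name: "roi_align5"
  type: "ROIAlign"
  bottom: "conv5_3"
  bottom: "rois"
  top: "pool5"
  # Whether this is roi_pooling_param or a dedicated roi_align_param
  # depends on how the layer was registered in caffe.proto.
  roi_pooling_param {
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625  # 1/16 for a VGG16 conv5_3 feature map
  }
}
```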
Please change the values listed below in your solver.prototxt and send the logs again. Please make sure the logs contain training for at least 100 iterations. Also, provide logs for the above solver.prototxt with the ROIPooling layer used instead of the ROIAlign layer on the same training set for 100 iterations. That gives two training logs: 1) ROIAlign, 2) ROIPooling, where both runs use the solver.prototxt values pasted above. Lastly, do you have your code available in an online repo? Thanks
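The specific values are not reproduced here. Purely as a hypothetical illustration of the kind of solver settings used for a short side-by-side debugging run (a lowered learning rate, per-iteration display, and Caffe's debug_info, which would match the debug-mode logs mentioned in the next reply), such a solver.prototxt might look like this; every value below is illustrative, not the one actually requested:

```
# Hypothetical solver.prototxt for a short debugging run; all values are
# illustrative, not the ones actually requested in this thread.
train_net: "models/fast_rcnn/stage2_train.prototxt"  # hypothetical path
base_lr: 0.0001          # lowered learning rate
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.0005
display: 1               # print losses every iteration
max_iter: 100            # 100 iterations, as requested above
debug_info: true         # per-blob forward/backward statistics in the log
snapshot: 0              # no snapshots needed for a debug run
```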
Hi @jasjeetIM, please find attached logs for both runs with the changes you suggested. Our code base is not yet public since it is our organization's property, but I shall talk to my team and see if I can give you access to it; I will get back to you on this soon. Honestly, I had not worked with Caffe's debug mode before as I found it too verbose, but it seems to output useful information in this case. I shall try to see if I can find the root cause using these logs. Meanwhile, if you get time to look at them and find anything useful, please let me know. Thanks,
Hi @ravikantb, Okay, thanks. Can you do the following to troubleshoot:
Add the following after line 164:
This will help me look at the boundary condition that may be causing 'inf' as one of the pooled values.
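A minimal sketch of the kind of check being described, a guard inside the ROIAlign forward pooling loop that logs whenever a pooled output comes out non-finite. The variable names (maxval, n, c, ph, pw, and the ROI bounds) are placeholders for whatever the actual roi_align_layer.cpp uses around that point; they are not quoted from the implementation:

```cpp
// Hypothetical debug check to paste inside the ROIAlign forward pooling loop.
// Variable names are placeholders for the ones used in the actual layer code.
// Requires <cmath> for std::isfinite; LOG(INFO) is the glog macro Caffe uses.
if (!std::isfinite(static_cast<double>(maxval))) {
  LOG(INFO) << "Non-finite pooled value at (n, c, ph, pw) = ("
            << n << ", " << c << ", " << ph << ", " << pw << ")"
            << "  roi_start_w=" << roi_start_w
            << "  roi_end_w="   << roi_end_w
            << "  roi_start_h=" << roi_start_h
            << "  roi_end_h="   << roi_end_h;
}
```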
Hi @jasjeetIM, thanks for all the help. Due to some unforeseen circumstances, I have to step away from this project for 2-3 days. I will get back to you with the required data after that. Please keep this issue open till then. Thanks,
Hi @ravikantb, could you share how you added the RoIAlign layer to the Caffe framework? Many thanks in advance.
@MartinPlantinga: I added the following code to the 'fast_rcnn_layers.hpp' file to do this. Hope it helps. (P.S.: I didn't use GitHub's code formatting, as it was messing with my code snippet.)

/* ROIAlignLayer - Region of Interest Align Layer */
virtual inline const char* type() const { return "ROIAlign"; }
virtual inline int MinBottomBlobs() const { return 2; }
protected:
int channels_;
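For a fuller picture, here is a hedged sketch of how such a declaration usually sits in fast_rcnn_layers.hpp, following the pattern of the ROIPoolingLayer declaration already in that header. Everything beyond the lines quoted above (the constructor, the Forward/Backward declarations, and the extra member variables) is an assumption about the implementation, not a quote from it:

```cpp
// Hypothetical ROIAlignLayer declaration, inside namespace caffe in
// fast_rcnn_layers.hpp, mirroring the existing ROIPoolingLayer pattern.
template <typename Dtype>
class ROIAlignLayer : public Layer<Dtype> {
 public:
  explicit ROIAlignLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "ROIAlign"; }
  virtual inline int MinBottomBlobs() const { return 2; }  // feature map + rois
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int MinTopBlobs() const { return 1; }
  virtual inline int MaxTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  // Pooling geometry and input shape cached in LayerSetUp/Reshape
  // (names assumed, by analogy with ROIPoolingLayer).
  int channels_;
  int height_;
  int width_;
  int pooled_height_;
  int pooled_width_;
  Dtype spatial_scale_;
};
```

Note that declaring the class is only part of the job: the layer also needs to be registered in its .cpp file (Caffe's REGISTER_LAYER_CLASS macro) and, if it uses its own parameter message, added to caffe.proto; neither step is shown in the quoted snippet.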
Thanks @ravikantb!!
Hi @jasjeetIM,
I took your ROIAlign layers and integrated them into my py-faster-rcnn code base to replace the ROIPooling layers. However, I am getting the bounding box loss as NaN for all iterations so far while training Fast R-CNN (stage 2) of the alternating training optimization. Did you also face this problem while training? Please let me know.
I shall dig deeper and get back to you if I find anything worth sharing. The following are my sample logs of the loss, for your reference.
Thanks