Detectron training schedule #327
Hmm, so it appears to be stuck around 0.180 mAP. Weird, I don't know why.
Yes. My take on it is that adopting the Detectron schedule directly can't be a good solution; we should instead inspect our SGD progress and devise a new schedule accordingly. The Detectron training logs were recently released, so we can inspect them to take lessons.
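For reference, here is a minimal Keras sketch of a stepwise schedule using the published Detectron 1x values (base LR 0.01 at batch size 16, divided by 10 at 60k and 80k iterations, 90k total). This is not the repository's code, and how to rescale these numbers for our setup is exactly the open question:

```python
# Sketch only: a stepwise LR schedule with Detectron's 1x values for batch size 16.
# Rescaling for a different batch size is left to the caller.
import keras


class StepwiseLR(keras.callbacks.Callback):
    def __init__(self, base_lr=0.01, steps=(60000, 80000), gamma=0.1):
        super(StepwiseLR, self).__init__()
        self.base_lr = base_lr
        self.steps = steps
        self.gamma = gamma
        self.iteration = 0

    def on_batch_begin(self, batch, logs=None):
        # count how many step boundaries have been passed and decay accordingly
        drops = sum(1 for s in self.steps if self.iteration >= s)
        keras.backend.set_value(self.model.optimizer.lr, self.base_lr * (self.gamma ** drops))
        self.iteration += 1
```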
Hmm, interesting, I hadn't looked at those logs yet. They only share a log for ResNeXt101, but still, the regression loss appears to be approximately 0.045403 at the end and the classification loss approximately 0.069822. That's quite different from what we have (around 1.0 for regression and 0.2 for classification). Any thoughts on this difference? I have the feeling that we at least normalize the regression loss incorrectly. I think we should divide the result by 4, since there are 4 values; in other words, we should compute the mean over every value that contributes (only positive anchors). Right now we only divide by the number of positive anchors.
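To make the normalization question concrete, here is a small numpy sketch (variable and function names are mine, not the repository's) contrasting dividing by the number of positive anchors with dividing by the number of contributing values:

```python
import numpy as np

def smooth_l1(x, sigma=3.0):
    # elementwise smooth L1, the form commonly used for RetinaNet regression
    sigma2 = sigma ** 2
    return np.where(np.abs(x) < 1.0 / sigma2,
                    0.5 * sigma2 * x ** 2,
                    np.abs(x) - 0.5 / sigma2)

def regression_loss(diff, positive_mask, per_value=False):
    # diff:          (num_anchors, 4) regression targets minus predictions
    # positive_mask: (num_anchors,)   True for positive anchors
    total = smooth_l1(diff[positive_mask]).sum()
    num_positive = max(1, int(positive_mask.sum()))
    if per_value:
        # proposed: mean over every contributing value (4 per positive anchor)
        return total / (4.0 * num_positive)
    # current behaviour described above: divide by the number of positive anchors only
    return total / num_positive
```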
@hgaiser, sorry for the late reply. I downloaded all the logs; you can find the RetinaNet R50 log at this link. As you said, there's a problem with our normalization, and they also calculate the loss in each FPN layer separately, as you discussed earlier. You're right about the regression loss, but the classification loss is different too. So we should inspect the loss in detail; I can take a look at that in the upcoming days. The first and last batch outputs of the log in that link are worth comparing against our numbers.
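Related to the per-level point, here is a toy example (illustrative only, not necessarily Detectron's actual normalizer) of why normalizing each FPN level separately gives a different total than pooling all levels before normalizing:

```python
def combine_levels(level_losses, level_positives):
    # level_losses:    summed (unnormalized) losses per FPN level
    # level_positives: number of positive anchors per FPN level
    per_level = sum(l / max(1, n) for l, n in zip(level_losses, level_positives))
    pooled = sum(level_losses) / max(1, sum(level_positives))
    return per_level, pooled

# When positives are distributed unevenly across levels the two schemes diverge:
print(combine_levels([4.0, 1.0], [8, 1]))  # (1.5, 0.555...)
```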
Have you guys figured out what makes the loss values different from Detectron's?
I would also be interested in a training log for a fully converged model trained with this repo. Is there any hope of seeing one?
No hope there. The current model was trained over many different runs, so there is no single log. It is also pretty difficult, in my opinion, because of the optimizer and training settings used. We would greatly benefit if someone took the time to investigate how to train a COCO model more quickly.
I have started training with ImageNet weights to test the training-schedule branch. I will update this issue regularly to report the findings. Training settings:
- dataset: COCO
- batch-size: 1
- GPU: 1 x GTX1080Ti (I can switch to a P100 in the following days.)
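If the Detectron schedule is the reference point, one way to adapt it to this batch-size-1 run is the linear scaling rule (LR proportional to batch size, iteration counts inversely proportional). Whether that rule holds all the way down to batch size 1 is an assumption; the batch-16 numbers are the standard Detectron 1x settings:

```python
# Sketch: rescale the Detectron batch-16 schedule for a smaller batch size,
# assuming linear scaling applies.
def rescale_schedule(base_lr=0.01, steps=(60000, 80000), max_iter=90000,
                     reference_batch=16, target_batch=1):
    scale = float(target_batch) / reference_batch
    return {
        'lr': base_lr * scale,
        'steps': tuple(int(s / scale) for s in steps),
        'max_iter': int(max_iter / scale),
    }

print(rescale_schedule())
# {'lr': 0.000625, 'steps': (960000, 1280000), 'max_iter': 1440000}
```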