EMA Update of bn buffer #73
@luyvlei ohh yes, i do believe you are correct, and this paper also came to a similar conclusion https://arxiv.org/abs/2101.07525
Should I run a test to verify this modification? If it turns out to be effective, I can submit a PR along with a test report. @lucidrains
@luyvlei so i think the issue is because the batchnorm statistics are already a moving average - i'll have to read the momentum squared paper above in detail and see if the conclusions are sound

as an aside, there are papers that are starting to use SimSiam (kaiming's work where the teacher is the same as the student, but with a stop gradient) successfully, and which does not require exponential moving averages as BYOL does. so i'm wondering how important these little details are, and whether it is worth the time to even debug https://arxiv.org/abs/2111.00210
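The stop-gradient idea mentioned above can be sketched in a few lines of PyTorch (a hedged illustration, not code from this repo; `simsiam_loss` is a hypothetical name):

```python
import torch
import torch.nn.functional as F

# SimSiam-style loss: the "teacher" branch is the student itself,
# with gradients blocked via detach() (the stop-gradient).
# No EMA copy of the network is kept, unlike BYOL.
def simsiam_loss(predictor_out, projector_out):
    target = projector_out.detach()  # stop-gradient on the target branch
    p = F.normalize(predictor_out, dim=-1)
    z = F.normalize(target, dim=-1)
    # negative cosine similarity
    return -(p * z).sum(dim=-1).mean()
```

Because the target branch is detached, no gradient (and therefore no EMA bookkeeping, BN buffer or otherwise) is needed on a separate teacher network.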
Hi all, I am also trying to reproduce BYOL results and am falling a bit short (~1%), and am wondering if this might be related to the reason why. I figure there are two options during pretraining:

1. Keep the target network in train mode, so its BatchNorm layers normalize with the current batch statistics.
2. Run the target network in eval mode, which would require keeping its BatchNorm buffers in sync with the online network.
I believe #1 is correct based on my reading of the paper and looking through some implementations. If #1 is correct, no changes are needed -- also, since we feed the exact same images to the target and online networks, the running mean and running var they compute should end up the same. If #2 is correct, then we would have to copy the buffers as suggested above. As an aside, I believe the issue in my repro is that I am following #1 and have SyncBatchNorm for the online network, but not for the target network.
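If option #2 were taken, one way to keep the target's statistics in sync is to copy the buffers outright after each EMA step; a minimal sketch, assuming PyTorch modules with matching architectures (`copy_buffers` is a hypothetical helper, not from this repo):

```python
import torch

def copy_buffers(online: torch.nn.Module, target: torch.nn.Module):
    # Parameters are EMA-averaged elsewhere; buffers (running_mean,
    # running_var, num_batches_tracked) are copied verbatim so that
    # an eval-mode target normalizes with the online network's stats.
    with torch.no_grad():
        for online_buf, target_buf in zip(online.buffers(), target.buffers()):
            target_buf.copy_(online_buf)
```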
Yep, if the target model uses train mode, the BN statistics don't matter, since they will never be used -- and this implementation also uses train mode. But it's not clear that eval mode would have yielded any better results.
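The point that train-mode BN never reads its running statistics is easy to verify directly (a small sanity check, not code from the repo):

```python
import torch

# In train mode, BatchNorm normalizes with the current batch's
# statistics, so the stored running_mean / running_var never enter
# the forward computation.
bn = torch.nn.BatchNorm1d(1, affine=False)
x = torch.randn(32, 1)

bn.train()
y_before = bn(x)

# Corrupt the running statistics; the train-mode output is unchanged.
with torch.no_grad():
    bn.running_mean.fill_(100.0)
    bn.running_var.fill_(100.0)
y_after = bn(x)
```

In eval mode, by contrast, the (now corrupted) running statistics would be used and the output would change.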
The following function applies the moving average to the EMA model. But it doesn't update the statistics (running_mean and running_var), since these two are buffers rather than parameters.
Should I use this function instead?
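The code blocks from this comment were not preserved here; below is a hedged reconstruction of what an EMA update covering both parameters and buffers might look like (names such as `update_moving_average` and `beta` follow the byol-pytorch helper's conventions but are assumptions, not quoted code):

```python
import torch

@torch.no_grad()
def update_moving_average(ema_model, model, beta=0.99):
    # standard EMA over the learnable parameters
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - beta)
    # also smooth the buffers (running_mean / running_var); integer
    # buffers such as num_batches_tracked are copied directly
    for ema_b, b in zip(ema_model.buffers(), model.buffers()):
        if ema_b.dtype.is_floating_point:
            ema_b.lerp_(b, 1.0 - beta)
        else:
            ema_b.copy_(b)
```

Note that since BN running statistics are themselves already a moving average (as pointed out above), applying a second EMA on top of them effectively squares the momentum, which is exactly the concern raised in the linked paper.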