EWC mini-batch sampling #10

Open
ashra-main opened this issue Jun 3, 2022 · 2 comments


@ashra-main

Hi, Thank you so much for this awesome repo. It's the clearest implementation I found out there :)

I have a question about the mini-batch sampling. A comment in the code says that it gives similar performance to sub-sampling with batch_size=1 (i.e., the mathematically correct way), but I'm worried that the two are very different.
So I'm curious whether any papers have used this sampling instead and confirmed that it performs similarly.

The reason for my doubt is that, in general, the expected value of the squared per-sample gradients of the log-likelihood (which is an estimator for the diagonal of the Fisher matrix) is not the same as the expected value of the squared mini-batch-averaged gradients of the log-likelihood.
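
To make the gap concrete, here is a short derivation for a single parameter. It is a sketch under an assumption that is not stated in the thread: the per-sample gradients $g_1, \dots, g_B$ in a mini-batch are i.i.d. with mean $\mu$ and finite variance. The symbols $g_i$, $B$, and $\mu$ are introduced here only for illustration.

```latex
\mathbb{E}\!\left[\Big(\tfrac{1}{B}\sum_{i=1}^{B} g_i\Big)^{2}\right]
  = \operatorname{Var}\!\Big(\tfrac{1}{B}\sum_{i=1}^{B} g_i\Big) + \mu^{2}
  = \frac{\mathbb{E}[g^{2}] - \mu^{2}}{B} + \mu^{2}
  = \frac{1}{B}\,\mathbb{E}[g^{2}] + \Big(1 - \frac{1}{B}\Big)\mu^{2}
```

Under that assumption the two quantities coincide only for $B = 1$. When $\mu = 0$ they differ by exactly the factor $1/B$; when $\mu \neq 0$ no single rescaling makes them equal.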

Thank you for your consideration,
Arash

@yenchanghsu
Collaborator

yenchanghsu commented Jun 3, 2022

Thanks for your interest in our repo. You are right: batch_size=1 and batch_size!=1 are different. That's why the default setting in the MNIST demo uses batch_size=1 (see here), although empirically batch_size!=1 gives similar MNIST performance. A different dataset or experimental setting may show a different result.

What could make the results similar/dissimilar? The total number of samples may be the major factor. Consider the two cases:

  1. 60000 samples with batch_size=1 versus 60000 samples with batch_size=10.
  2. 10 samples with batch_size=1 versus 10 samples with batch_size=10.

In case 1, the resulting Fisher diagonals are likely to be similar after being multiplied by a scaling factor, while in case 2 they will be very different.
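
The two cases can be tried out with a toy simulation. This is only a rough sketch, not the repo's EWC code: it assumes a single scalar parameter whose per-sample gradients are i.i.d. draws from a zero-mean, unit-variance Gaussian, and the names `fisher_diag_estimate` and `compare` are hypothetical helpers made up for this illustration. Under those assumptions the scaling factor is simply the batch size.

```python
import random


def fisher_diag_estimate(grads, batch_size):
    """Average of squared mini-batch-mean gradients for one scalar parameter."""
    squares = []
    for start in range(0, len(grads), batch_size):
        batch = grads[start:start + batch_size]
        mean_grad = sum(batch) / len(batch)
        squares.append(mean_grad ** 2)
    return sum(squares) / len(squares)


def compare(num_samples, batch_size, seed):
    """Return (batch_size=1 estimate, rescaled mini-batch estimate)."""
    rng = random.Random(seed)
    # Assumption: per-sample gradients are i.i.d., zero mean, unit variance.
    grads = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    per_sample = fisher_diag_estimate(grads, 1)
    # Multiply by batch_size to undo the 1/B shrinkage of the squared mean.
    batched = fisher_diag_estimate(grads, batch_size) * batch_size
    return per_sample, batched


if __name__ == "__main__":
    for num_samples in (60000, 10):
        per_sample, batched = compare(num_samples, batch_size=10, seed=0)
        print(num_samples, per_sample, batched)
```

With 60000 samples the rescaled estimate averages 6000 squared batch means, so it should land close to the batch_size=1 estimate. With 10 samples and batch_size=10 it rests on a single squared mean, so it can swing widely from one seed to the next.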

Please feel free to share your findings with us. Thank you!

@ashra-main
Author

That makes sense. Thank you.

Cheers!
Arash
