This is a list of (mostly ML) papers where the description of the method is buried under fluff and equation theatre, and could be shortened significantly and explained much better.
This does not mean that the idea in a paper is bad or that its results are worthless. It just means that, in my opinion, they could be presented in a much better fashion.
There is no importance sampling! Nothing. Zilch! The proposed optimizer always updates the embedding and LM head, and selects transformer blocks uniformly at random. And they call this importance sampling because the first and the last layer have a "higher importance"? At least the results look promising.
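A minimal sketch of what the paper actually describes, stripped of the "importance sampling" framing. The function name, the block count, and `k` are hypothetical, not from the paper:

```python
import random

def select_trainable_layers(n_blocks, k):
    # Hypothetical sketch: the embedding and LM head are always updated,
    # and k transformer blocks are chosen uniformly at random.
    # This is plain uniform sampling, not importance sampling.
    blocks = sorted(random.sample(range(n_blocks), k))
    return ["embedding"] + [f"block_{i}" for i in blocks] + ["lm_head"]
```

In other words, the entire "sampling distribution" is: two layers with probability 1, the rest uniform.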
The idea is to replace (PyTorch pseudocode follows):

```python
Conv2d(in_ch, out_ch, kernel_size)
```

with:

```python
Sequential(
    Conv2d(in_ch, small, kernel_size),
    Conv2d(small, out_ch, kernel_size2, groups=small),
)
```

I.e., yet another factorized convolution: a smaller convolution followed by a grouped (depthwise-style) convolution.
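To see why the factorization pays off, here is a parameter-count sketch. The channel sizes (256 in/out, bottleneck of 64) are made-up numbers for illustration, not from the paper:

```python
def conv2d_params(in_ch, out_ch, k, groups=1):
    # A Conv2d weight has shape (out_ch, in_ch // groups, k, k),
    # plus one bias per output channel.
    return out_ch * (in_ch // groups) * k * k + out_ch

full = conv2d_params(256, 256, 3)              # standard 3x3 convolution
factored = (conv2d_params(256, 64, 3)          # shrink to 64 channels first
            + conv2d_params(64, 256, 3, groups=64))  # grouped 3x3 back to 256
print(full, factored)  # the factorized version is roughly 4x smaller
```

Same input/output shapes, a fraction of the weights; that is the whole trick.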
Instead of a bad figure and an important piece of the algorithm hidden in the middle of the page:
We could have a much better figure (parts taken from ShuffleNet):
With this, the paper could be understood in seconds instead of hours.
30 pages of proofs, lingo, etc., could be condensed to one sentence:
I.e., sample words whose log-probability is close to the entropy of the distribution.
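That one-sentence version can even be sketched in a few lines of code. This is my own hedged reading of the criterion, not the paper's reference implementation; the function name and the cutoff parameter `tau` are invented for illustration:

```python
import math

def entropy_close_tokens(probs, tau):
    # Keep the tokens whose surprisal (-log p) is closest to the
    # entropy of the distribution, up to cumulative mass tau.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    by_closeness = sorted(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, mass = [], 0.0
    for i in by_closeness:
        kept.append(i)
        mass += probs[i]
        if mass >= tau:
            break
    return kept  # token indices to sample from
```

Sampling then proceeds from the kept tokens with renormalized probabilities; everything else in the 30 pages is justification for this filter.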