This paper could be a tweet

This is a list of (mostly ML) papers where the description of the method contains a lot of fluff and equation theatre, and could be shortened significantly and explained much better.

This does not mean that the ideas in these papers are bad or that their results are worthless. It just means that, in my opinion, they could be presented in a much better fashion.

There is no importance sampling! Nothing. Zilch! The proposed optimizer always updates the embedding and the LM head, and randomly selects which transformer blocks to update. And they call this importance sampling because the first and the last layers have "higher importance"? At least the results look promising.

[image]
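If I read the paper correctly, the whole scheme amounts to something like the sketch below. This is my own illustrative code, not the authors' implementation; the attribute names (`embedding`, `lm_head`, `blocks`) and `num_active_blocks` are assumptions made for the example.

```python
import random

def resample_trainable_layers(model, num_active_blocks):
    """Sketch of my reading of the scheme; names are assumptions, not the paper's."""
    # Freeze everything first.
    for p in model.parameters():
        p.requires_grad_(False)
    # Embedding and LM head are always trained ("higher importance").
    for p in list(model.embedding.parameters()) + list(model.lm_head.parameters()):
        p.requires_grad_(True)
    # Transformer blocks are picked uniformly at random -- no importance sampling anywhere.
    for block in random.sample(list(model.blocks), num_active_blocks):
        for p in block.parameters():
            p.requires_grad_(True)

# Re-drawn every so many optimizer steps; frozen parameters simply receive no updates.
```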

The idea is to replace (PyTorch pseudocode follows):

nn.Conv2d(in_ch, out_ch, kernel_size)

with:

nn.Sequential(
    nn.Conv2d(in_ch, small, kernel_size),
    nn.Conv2d(small, out_ch, kernel_size2, groups=small),
)

In other words, yet another way to factorize a convolution, this time into a smaller convolution followed by a depthwise convolution.
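For concreteness, a runnable version of the replacement could look like this; the channel counts and kernel sizes are placeholders I picked for the example, not values from the paper.

```python
import torch
import torch.nn as nn

in_ch, out_ch, small = 64, 64, 16  # placeholder sizes chosen for this example

# Original: one full 3x3 convolution.
full = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Factorized: a smaller convolution followed by a grouped (depthwise-style) convolution.
factorized = nn.Sequential(
    nn.Conv2d(in_ch, small, kernel_size=3, padding=1),
    nn.Conv2d(small, out_ch, kernel_size=3, padding=1, groups=small),
)

x = torch.randn(1, in_ch, 32, 32)
assert full(x).shape == factorized(x).shape  # same output shape

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(full), n_params(factorized))  # 36928 vs. 9872 parameters
```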

Instead of a bad figure and an important piece of the algorithm hidden in the middle of the page:

[image]

We could have a much better figure (parts taken from ShuffleNet):

[image]

With this, the paper could be understood in seconds instead of hours.

30 pages of proofs, lingo, etc. could be simplified as:

[image]

I.e., sample words whose negative log-probability (surprisal) is close to the entropy of the predicted distribution.
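A minimal sketch of that selection rule, written by me for illustration (the function name and the mass threshold `tau` are my assumptions, not the paper's notation):

```python
import torch

def filter_typical(logits, tau=0.95):
    """Keep tokens whose surprisal is closest to the entropy, covering `tau` of the mass."""
    probs = torch.softmax(logits, dim=-1)
    surprisal = -torch.log(probs)             # -log p(x)
    entropy = (probs * surprisal).sum()       # H(p)
    # Sort tokens by how far their surprisal is from the entropy.
    order = (surprisal - entropy).abs().argsort()
    # Keep the closest ones until they cover `tau` of the probability mass.
    cutoff = int((probs[order].cumsum(0) < tau).sum()) + 1
    keep = order[:cutoff]
    filtered = torch.full_like(logits, float("-inf"))
    filtered[keep] = logits[keep]
    return filtered

# Usage: sample the next token from the filtered distribution.
# next_token = torch.multinomial(torch.softmax(filter_typical(logits), dim=-1), 1)
```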
