Regarding Reformer: paper | code

From the paper:

"... show that it performs the same as the normal Transformer when using the same number of parameters; we achieve this by having both x1 and x2 have size d_model."

I see how the parameters of the Attention and MLP blocks do not increase. But what about
(1) the embedding layer and
(2) the final projection layer?

Question 0: Why do the parameters of the initial embedding layer not increase if we double d_model?
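For concreteness, here is a minimal sketch (not the Reformer implementation) of the parameter counts the question is about. It assumes a vocabulary size V, hidden size d_model, 8 attention heads, and an MLP expansion factor of 4; the reversible scheme from the quote is modelled only as two d_model-sized streams (x1, x2) passing through the same d_model-wide blocks, so the per-layer counts below match a standard Transformer layer, while the embedding and final projection sizes scale with the width of the vectors they produce or consume.

```python
# Minimal sketch of the parameter counts discussed above (assumptions:
# V = 32000, d_model = 512, 8 heads, MLP expansion 4; not the Reformer code).
import torch.nn as nn

V, d_model = 32000, 512

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Per-layer blocks: both operate on vectors of size d_model, exactly as in a
# standard Transformer layer, regardless of whether the layer is wrapped in a
# reversible residual scheme with two d_model-sized streams.
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=8)
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)
print("attention params:", n_params(attention))   # ~4 * d_model^2 (+ biases)
print("mlp params:      ", n_params(mlp))         # ~8 * d_model^2 (+ biases)

# The layers the question asks about: the input embedding and the final
# projection. Each holds V * d_model weights, i.e. their size scales directly
# with the width of the vectors they map to/from the vocabulary.
embedding = nn.Embedding(V, d_model)
final_proj = nn.Linear(d_model, V, bias=False)
print("embedding params: ", n_params(embedding))  # V * d_model
print("final proj params:", n_params(final_proj)) # V * d_model
```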