Hi All,

I've looked at the paper and the source code, but I have some questions. Based on the paper, we need to scale both the activations X and the weights W as

Y = (X diag(s)^-1) (diag(s) W)
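Just to convince myself the math checks out, here is a quick numerical check I put together (my own sketch, not code from the repo) showing that the per-channel scales cancel and Y is unchanged:

```python
# Quick check (my own sketch, not from the repo): the per-channel smoothing
# scales cancel mathematically, so (X diag(s)^-1)(diag(s) W) equals X W.
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)        # activations, shape (tokens, in_features)
W = torch.randn(8, 16)       # weights, shape (in_features, out_features)
s = torch.rand(8) + 0.5      # per-input-channel smoothing factors

Y_ref = X @ W
Y_smoothed = (X / s) @ (s.unsqueeze(1) * W)   # (X diag(s)^-1) (diag(s) W)
print(torch.allclose(Y_ref, Y_smoothed, atol=1e-5))   # True
```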
However, in the "smooth.py" file, I could only see that the weights W are scaled by multiplying with diag(s):
```python
for fc in fcs:
    fc.weight.mul_(scales.view(1, -1))
```
I couldn't find where the activation X is scaled, i.e., the diag(s)^-1 factor seems to be missing. I then assumed X was scaled at inference time, but in the example notebooks the models are smoothed and then run inference directly as usual. So the question is: where is X scaled?
In the paper, it says:

> Considering input X is usually produced from previous linear operations (e.g., linear layers, layer norms, etc.), we can easily fuse the smoothing factor into previous layers' parameters offline, which does not incur kernel call overhead from an extra scaling. For some other cases, when the input is from a residual add, we can add an extra scaling to the residual branch similar to Wei et al. (2022).
I can't find which part of the code handles fusing the scaling of X into the weights of the previous layer, other than something that would just cancel the whole scaling procedure.
Why do we scale the LayerNorm just like X? Is this related to the previous question?
```python
ln.weight.div_(scales)
ln.bias.div_(scales)
```
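To spell out what I think is going on (a minimal sketch of my own, with toy shapes and names, not code from the repo): dividing the LayerNorm weight and bias by s makes its output equal to X diag(s)^-1, and multiplying the following linear layer's weight by s restores the original output, which seems to be the offline fusion the paper describes:

```python
# Sketch of my reading of the fusion (toy example, not the repo's code):
# fold diag(s)^-1 into the preceding LayerNorm and diag(s) into the linear layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8
ln = nn.LayerNorm(d)
fc = nn.Linear(d, 16)
x = torch.randn(4, d)
s = torch.rand(d) + 0.5

y_ref = fc(ln(x))                      # original computation

with torch.no_grad():
    ln.weight.div_(s)                  # LN now outputs X * diag(s)^-1
    ln.bias.div_(s)
    fc.weight.mul_(s.view(1, -1))      # fc.weight is (out, in): scale input
                                       # channels, i.e., apply diag(s) to W

y_smoothed = fc(ln(x))
print(torch.allclose(y_ref, y_smoothed, atol=1e-5))   # numerically unchanged
```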
Why don't we need ln.bias.div_(scales) for Llama-like models?
Which part of the code handles the scaling for residual connections?
I think I've got the idea: the scaling for X is absorbed into the layer norm. Now an extra question: the fake_quant file does NOT actually change the data type to int8, right? If so, how can we actually convert the model to int8?
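For clarity, this is what I mean by fake quantization (a generic sketch of the technique, not the repo's exact implementation): values are rounded onto the int8 grid and immediately dequantized, so the tensor's dtype stays floating point; actually storing and running the model in int8 would additionally require casting and INT8 kernels.

```python
# Generic per-tensor symmetric fake quantization (my sketch, not the repo's
# exact code): quantize to the int8 grid, then dequantize right away, so the
# dtype stays floating point and only the values are rounded.
import torch

def fake_quantize_per_tensor(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for int8
    scale = w.abs().max() / qmax                 # symmetric per-tensor scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantize: still fp32/fp16

w = torch.randn(4, 4)
print(fake_quantize_per_tensor(w).dtype)         # torch.float32, not torch.int8
```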