Hi All,

I've looked at the paper and the source code, but I have some questions. Based on the paper, we need to scale both the activations X and the weights W as

Y = (X diag(s)^-1) (diag(s) W)
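Just to convince myself the math checks out, here is a quick numerical check I put together (my own sketch, not code from the repo) showing that the per-channel scales cancel and Y is unchanged:

```python
# Quick check (my own sketch, not from the repo): the per-channel smoothing
# scales cancel mathematically, so (X diag(s)^-1)(diag(s) W) equals X W.
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)        # activations, shape (tokens, in_features)
W = torch.randn(8, 16)       # weights, shape (in_features, out_features)
s = torch.rand(8) + 0.5      # per-input-channel smoothing factors

Y_ref = X @ W
Y_smoothed = (X / s) @ (s.unsqueeze(1) * W)   # (X diag(s)^-1) (diag(s) W)
print(torch.allclose(Y_ref, Y_smoothed, atol=1e-5))   # True
```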
However, in the "smooth.py" file, I could only see that the weights W are scaled by multiplying with diag(s):
```python
for fc in fcs:
    fc.weight.mul_(scales.view(1, -1))
```
I couldn't find where the activation X is scaled, i.e., the diag(s)^-1 factor seems to be missing. I then assumed X was scaled at inference time, but in the example notebooks the models are smoothed and then run inference directly as usual. So the question is: where is X scaled?
In the paper, it says:

> Considering input X is usually produced from previous linear operations (e.g., linear layers, layer norms, etc.), we can easily fuse the smoothing factor into previous layers' parameters offline, which does not incur kernel call overhead from an extra scaling. For some other cases, when the input is from a residual add, we can add an extra scaling to the residual branch similar to Wei et al. (2022).
I can't find which part of the code handles fusing the scaling of X into the weights of the previous layer, other than something that would just cancel the whole scaling procedure.
Why do we scale the LayerNorm just like X? Is this related to the previous question?
```python
ln.weight.div_(scales)
ln.bias.div_(scales)
```
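To spell out what I think is going on (a minimal sketch of my own, with toy shapes and names, not code from the repo): dividing the LayerNorm weight and bias by s makes its output equal to X diag(s)^-1, and multiplying the following linear layer's weight by s restores the original output, which seems to be the offline fusion the paper describes:

```python
# Sketch of my reading of the fusion (toy example, not the repo's code):
# fold diag(s)^-1 into the preceding LayerNorm and diag(s) into the linear layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8
ln = nn.LayerNorm(d)
fc = nn.Linear(d, 16)
x = torch.randn(4, d)
s = torch.rand(d) + 0.5

y_ref = fc(ln(x))                      # original computation

with torch.no_grad():
    ln.weight.div_(s)                  # LN now outputs X * diag(s)^-1
    ln.bias.div_(s)
    fc.weight.mul_(s.view(1, -1))      # fc.weight is (out, in): scale input
                                       # channels, i.e., apply diag(s) to W

y_smoothed = fc(ln(x))
print(torch.allclose(y_ref, y_smoothed, atol=1e-5))   # numerically unchanged
```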
Why don't we need ln.bias.div_(scales) for Llama-like models?
Which part of the code handles the scaling for residual connections?
I think I've got the idea: the scaling for X is absorbed into the layer norm. Now an extra question: the fake_quant file does NOT actually change the data type to int8, right? If so, how can we actually convert the model to int8?
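For clarity, this is what I mean by fake quantization (a generic sketch of the technique, not the repo's exact implementation): values are rounded onto the int8 grid and immediately dequantized, so the tensor's dtype stays floating point; actually storing and running the model in int8 would additionally require casting and INT8 kernels.

```python
# Generic per-tensor symmetric fake quantization (my sketch, not the repo's
# exact code): quantize to the int8 grid, then dequantize right away, so the
# dtype stays floating point and only the values are rounded.
import torch

def fake_quantize_per_tensor(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for int8
    scale = w.abs().max() / qmax                 # symmetric per-tensor scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantize: still fp32/fp16

w = torch.randn(4, 4)
print(fake_quantize_per_tensor(w).dtype)         # torch.float32, not torch.int8
```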