Questions about weight[j] #140

DavidZyy · 2024-12-13T17:50:05Z

DavidZyy
Dec 13, 2024

Hi @ikawrakow, your work on quantization is amazing and I really admire it. Recently I have been reading the code and have some questions.
For example, in the function quantize_row_q4_0_impl and in other places, weight[j] is:

weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);

I have already seen some discussions here, but I still don't quite understand. Can you give me some guidance? Why not use the following directly?

weight[j] = qw[j];

ikawrakow · 2024-12-14T08:13:19Z

ikawrakow
Dec 14, 2024
Maintainer

Hi @DavidZyy,

this is simply an empirical correction; there is no science behind it (and it was amusing to observe people trying to make scientific sense out of it). From the pre-imatrix days we have learned that it is better to assign higher weights (importance) to model weights with larger magnitudes in a weighted RMSE minimization. As there is no precise science behind that, it was just a matter of experimentation to determine what this higher importance should look like ($x^2$, $|x|$, $\sigma^2 + x^2$, $\sigma + |x|$, etc., are all variations that have been tried). When I introduced the imatrix, the hope was of course that one could get rid of such non-scientific stuff and just use the diagonal elements of the Hessian. But in practice it is rarely as simple as that. Having the $\sqrt{\sigma^2 + x^2}$ in there does improve quantization accuracy, at least as measured by perplexity or KL-divergence.

Why $\sqrt{\sigma^2 + x^2}$ and not something else?

- As the Hessian already gives a lot of information about model weight importance, at some level it should be clear that the empirical correction cannot be as strongly magnitude dependent as it was without the imatrix.
- We definitely do not want to have the importance of small-magnitude weights become (nearly) zero.
- Based on the above two bullet points, and the experience from pre-imatrix quantization, $\sqrt{\sigma^2 + x^2}$ was an obvious choice that turned out to work better than anything else I tried.

Why the need for correcting the Hessian in the first place?

- We are using just the diagonal elements, which is an approximation. In my experience adding a correction to an approximation often improves things.
- From a more conceptual point of view, even if we did use the full Hessian, we still don't know if RMSE between the quantized and the full model weights is the similarity measure that we should be minimizing. RMSE is of course very convenient (expressions are very simple), so not knowing what to minimize we just use that. But in reality another similarity measure may be better, and it will have a different Hessian, so a different importance matrix, so we are back to square one where the importances being used are just a matter of empirical experimentation.
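To illustrate how per-weight importances enter a weighted RMSE fit, here is a minimal C sketch; it is not the ggml implementation. With only diagonal importances the objective reduces to $\sum_j w_j (x_j - d\,q_j)^2$, and for fixed integer levels $q_j$ the minimizing block scale is $d = \sum_j w_j x_j q_j / \sum_j w_j q_j^2$. The names `weighted_scale` and `weighted_error` are hypothetical, and the sketch assumes the levels `q[j]` have already been chosen.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: scale d that minimizes sum_j w[j]*(x[j] - d*q[j])^2
   for fixed quantized levels q[j]. Returns 0 if all weighted levels vanish. */
static float weighted_scale(const float *x, const int *q, const float *w, int n) {
    assert(n > 0);
    float sumqx = 0.0f; /* sum of w * x * q */
    float sumq2 = 0.0f; /* sum of w * q * q */
    for (int j = 0; j < n; ++j) {
        sumqx += w[j] * x[j] * (float)q[j];
        sumq2 += w[j] * (float)q[j] * (float)q[j];
    }
    return sumq2 > 0.0f ? sumqx / sumq2 : 0.0f;
}

/* Hypothetical helper: the weighted squared error for a given scale d. */
static float weighted_error(const float *x, const int *q, const float *w, float d, int n) {
    float err = 0.0f;
    for (int j = 0; j < n; ++j) {
        const float diff = x[j] - d * (float)q[j];
        err += w[j] * diff * diff;
    }
    return err;
}
```

For `x = {1.0, 2.2}` and `q = {1, 2}`, uniform weights give `d = 1.08`, while putting all the importance on the first element gives `d = 1.0`. Changing the importances therefore changes which reconstruction counts as best, which is why the choice of `weight[j]` matters.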


DavidZyy · 2024-12-14T13:58:43Z

DavidZyy
Dec 14, 2024
Author

Thanks for taking the time to answer this question and share this information. I learned a lot from your answers.
Yes, it's very interesting :)

> (and it was amusing to observe people trying to make scientific sense out of it)

