In trained-ternary-quantization/utils/quantization.py, line 42
The last returned value of the function "get_grads" is the gradient w.r.t. the negative scaling factor. I think the code might be wrong (not sure).
Consider the simple case where the kernel is a 1x1 tensor.
According to my understanding, during the forward pass we have:
t = ternarize(fp_kernel),
where "ternarize" maps fp_kernel, the full-precision kernel, to {+1, -1, 0}: for the negative part of fp_kernel, t = -1; for the positive part, t = +1.
The negative part of the scaled ternary kernel is then:
y = w_n * t
In my opinion, the gradient w.r.t. w_n should equal "grad_y * t", where "grad_y" is the gradient w.r.t. the negative part of the scaled ternary kernel and corresponds to "b*kernel_grad" in your code (line 42).
Because t = -1 on the negative part of the kernel, I think the gradient w.r.t. w_n should be
grad_y * t = grad_y * (-1) = -b*kernel_grad
This suggests that the last return value of "get_grads" should instead be
"(-b*kernel_grad).sum()"
Am I right?
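For what it's worth, here is a minimal PyTorch autograd check of this claim. It is a standalone sketch, not the repo's code; the values and names are illustrative:

```python
import torch

# One negative weight: after ternarization, t = -1.
w_n = torch.tensor(0.7, requires_grad=True)  # negative scaling factor
t = torch.tensor(-1.0)                       # ternarized negative weight

y = w_n * t          # negative part of the scaled ternary kernel
loss = 3.0 * y       # toy loss, so that grad_y = d(loss)/dy = 3.0
loss.backward()

print(w_n.grad)      # tensor(-3.) == grad_y * t = -grad_y
```

Autograd agrees with grad_y * t = -grad_y, i.e. with the "(-b*kernel_grad).sum()" version.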
Yeah, that sounds right.
But the original paper computes the gradient the same way I do in my implementation (see page 4, equation 7). Maybe there is an error in the paper. Try writing to the authors.
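For anyone reading along, here is a self-contained sketch of the TTQ backward step with the proposed sign fix applied. The function signature, the threshold, and the masks a/b are reconstructed from this thread, not copied from the repo:

```python
import torch

def get_grads(kernel_grad, kernel, w_p, w_n, threshold=0.05):
    # Illustrative TTQ backward step, not the repo's exact code.
    # kernel_grad: gradient w.r.t. the scaled ternary kernel
    # kernel:      full-precision kernel
    # w_p, w_n:    positive/negative scaling factors (stored positive)
    a = (kernel > threshold).float()    # mask of the positive part
    b = (kernel < -threshold).float()   # mask of the negative part
    # straight-through gradient for the full-precision kernel
    fp_grad = (w_p * a + (1.0 - a - b) + w_n * b) * kernel_grad
    # gradients w.r.t. the scaling factors; note the minus sign on the
    # negative part, since t = -1 there (the fix proposed in this issue)
    return fp_grad, (a * kernel_grad).sum(), (-b * kernel_grad).sum()
```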