-
I've read through the GPT-J example (and the pending MR which adds the quantization script for that example). I understand that if a model is in, say, ONNX format, the framework's post-training quantization components may not be usable, but I know that Meta uses torch extensively, so I suspect LLaMA was trained with torch. That said, I would love to learn more about this. I understand many theoretical aspects of quantization (quantizing weights, activations, gradients, and biases; that there are vector and gradient sparsification methods; that linear and logarithmic quantization can be applied). I understand that here we're quantizing only the weights, but I don't really understand how this differs from the post-training quantization support that already exists in these frameworks. Thanks for the explanation!
-
I believe Meta used fairseq, which is built on top of torch, to train LLaMA. But I'd love an answer to this too. I saw that the GPT-J script does apply fp16 quantization before dumping the model, but the quantization diff here seems significantly more involved?
-
This is more of a hacky version of quantization that works: a small drop in output quality with huge savings in inference memory and time.
PyTorch and friends ship default post-training quantization (PTQ) that quantizes an entire matrix with a single scaling factor and a single zero offset. Because of that, it is hard to go below int8. More recent approaches like bitsandbytes and GPTQ divide the matrix into bins/groups and keep a separate scaling factor and zero offset for each group.
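To make the contrast concrete, here is a minimal numpy sketch of per-tensor affine quantization (one scale and one zero point for the whole matrix). It only illustrates the idea, it is not PyTorch's actual PTQ code, and the function names are made up for this example:

```python
# Minimal sketch of per-tensor affine quantization, the idea behind default PTQ.
# One scale and one zero point cover the whole matrix, so a few outliers
# stretch the quantization range for every element.
import numpy as np

def quantize_per_tensor_int8(w: np.ndarray):
    lo, hi = float(w.min()), float(w.max())
    scale = max((hi - lo) / 255.0, 1e-12)        # map [lo, hi] onto 256 levels
    zero_point = round(-lo / scale)              # integer that real 0.0 maps to
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_per_tensor_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```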
The approach here instead divides the matrix into consecutive groups of 32 elements and figures out a per-group scaling factor, with the minimum and maximum elements mapping to 0 and 15 respectively, given either a fixed zero offset in Q4_0 or a variable zero offset in Q4_1.
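Below is a rough numpy sketch of that block scheme, assuming 32-element blocks as described above. The real ggml code additionally packs two 4-bit codes per byte and stores the scale (and, for Q4_1, the minimum) inside each block, so this only shows the rounding math; the function names are hypothetical:

```python
# Rough sketch of 32-element block quantization as described above.
# Q4_1-style: per-block scale plus per-block minimum (variable zero offset).
# Q4_0-style: symmetric scale only, so the zero offset is fixed.
import numpy as np

BLOCK = 32  # block size from the description above; tensor size must be a multiple of it

def quantize_q4_1_like(x: np.ndarray):
    """Each block of 32 floats -> 32 codes in [0, 15] plus (scale, min) per block."""
    blocks = x.reshape(-1, BLOCK)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / 15.0
    scales[scales == 0] = 1.0                      # guard against constant blocks
    q = np.clip(np.round((blocks - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins

def dequantize_q4_1_like(q, scales, mins):
    return q.astype(np.float32) * scales + mins    # per-block reconstruction

def quantize_q4_0_like(x: np.ndarray):
    """Symmetric variant: the scale comes from the largest magnitude, offset is fixed."""
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales) + 8, 0, 15).astype(np.uint8)  # +8 = fixed offset
    return q, scales
```

Keeping the scale (and offset) per 32-element block is what lets a large outlier hurt only its own block rather than the whole tensor, which is why 4-bit works here where whole-matrix PTQ struggles.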