-
I've read through the GPT-J example (and the pending MR which adds the quantization script for that example). I understand that if a model is in, say, ONNX format, the framework's post-training quantization components may not be usable, but I know that Meta uses torch extensively, so I suspect LLaMA was trained with torch. That said, I would love to learn more about this. I understand many theoretical aspects of quantization (quantizing weights, activations, gradients, and biases; that there are vector and gradient sparsification methods; that linear and logarithmic quantization can be applied). I understand that here we're quantizing only the weights, but I don't really understand how this differs from the post-training quantization support that already exists in these frameworks. Thanks for the explanation!
-
I believe Meta used fairseq, which is built on top of torch, to train LLaMA. But I'd love an answer to this too. I saw that the GPT-J script does apply fp16 quantization before dumping the model, but the quantization diff here seems significantly more involved?
-
This is more of a hacky version of quantization that works: a small drop in output quality with huge savings in inference memory and time.
PyTorch and friends ship default post-training quantization (PTQ) that quantizes an entire matrix with a single scaling factor and a single zero offset. Because of that, it is hard to go below int8. More recent approaches like bitsandbytes and GPTQ divide the matrix into bins/groups and keep a separate scaling factor and zero offset for each group.
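To make the contrast concrete, here is a minimal numpy sketch of per-tensor affine quantization (one scale and one zero point for the whole matrix). It only illustrates the idea, it is not PyTorch's actual PTQ code, and the function names are made up for this example:

```python
# Minimal sketch of per-tensor affine quantization, the idea behind default PTQ.
# One scale and one zero point cover the whole matrix, so a few outliers
# stretch the quantization range for every element.
import numpy as np

def quantize_per_tensor_int8(w: np.ndarray):
    lo, hi = float(w.min()), float(w.max())
    scale = max((hi - lo) / 255.0, 1e-12)        # map [lo, hi] onto 256 levels
    zero_point = round(-lo / scale)              # integer that real 0.0 maps to
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_per_tensor_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```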
The approach here instead divides the matrix into consecutive groups of 32 elements and figures out a per-group scaling factor, with the minimum and maximum elements mapping to 0 and 15 respectively, given either a fixed zero offset in Q4_0 or a variable zero offset in Q4_1.
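Below is a rough numpy sketch of that block scheme, assuming 32-element blocks as described above. The real ggml code additionally packs two 4-bit codes per byte and stores the scale (and, for Q4_1, the minimum) inside each block, so this only shows the rounding math; the function names are hypothetical:

```python
# Rough sketch of 32-element block quantization as described above.
# Q4_1-style: per-block scale plus per-block minimum (variable zero offset).
# Q4_0-style: symmetric scale only, so the zero offset is fixed.
import numpy as np

BLOCK = 32  # block size from the description above; tensor size must be a multiple of it

def quantize_q4_1_like(x: np.ndarray):
    """Each block of 32 floats -> 32 codes in [0, 15] plus (scale, min) per block."""
    blocks = x.reshape(-1, BLOCK)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / 15.0
    scales[scales == 0] = 1.0                      # guard against constant blocks
    q = np.clip(np.round((blocks - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins

def dequantize_q4_1_like(q, scales, mins):
    return q.astype(np.float32) * scales + mins    # per-block reconstruction

def quantize_q4_0_like(x: np.ndarray):
    """Symmetric variant: the scale comes from the largest magnitude, offset is fixed."""
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales) + 8, 0, 15).astype(np.uint8)  # +8 = fixed offset
    return q, scales
```

Keeping the scale (and offset) per 32-element block is what lets a large outlier hurt only its own block rather than the whole tensor, which is why 4-bit works here where whole-matrix PTQ struggles.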