Could someone explain a bit about how this is different from post-training quantization supported by typical frameworks like torch and TF? #41

Answered by Ayushk4
ghost asked this question in Q&A

This is more of a hacky version of quantization that works: little drop in quality, with large savings in inference memory and time.
PyTorch and friends ship a default Post-Training Quantization (PTQ) that quantizes an entire matrix with a single scaling factor and a single zero-offset factor. Because of that, it is hard to go below int8. More recent approaches like bitsandbytes and GPTQ divide the weights into bins/groups and keep a separate scaling/zero-offset factor for each of those groups.
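A minimal sketch of the per-tensor affine scheme described above, assuming a simple unsigned integer range; the function names are hypothetical and real frameworks expose this through their own quantization APIs:

```python
import numpy as np

def quantize_per_tensor(w: np.ndarray, bits: int = 8):
    """Quantize a whole tensor with one scale and one zero-offset (classic PTQ)."""
    qmax = (1 << bits) - 1                       # e.g. 255 for 8-bit
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = lo                                    # a single zero-offset for the whole tensor
    q = np.clip(np.round((w - zero) / scale), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_per_tensor(q, scale, zero):
    return q.astype(np.float32) * scale + zero
```

With one scale/offset per tensor, a few outlier weights stretch the range for everything else, which is part of why going below 8 bits this way hurts accuracy.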

This divides each matrix into consecutive groups of 32 elements and figures out a per-group scaling factor from the group's min and max elements, which map to 0 and 15 respectively, with either a fixed zero-offset in Q4_0 or a variable zero-offset in Q4_1.
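A rough sketch of the group-of-32 idea in the Q4_1 style (variable zero-offset per group), under the simplifying assumptions that the tensor length is a multiple of 32 and that the 4-bit values are kept unpacked; the actual ggml code is C and packs two 4-bit values per byte:

```python
import numpy as np

GROUP = 32  # block size used by Q4_0 / Q4_1

def quantize_q4_1_like(w: np.ndarray):
    """Group-wise 4-bit quantization: one scale and one zero-offset per 32 elements."""
    w = w.reshape(-1, GROUP)
    lo = w.min(axis=1, keepdims=True)            # group min, maps to 0
    hi = w.max(axis=1, keepdims=True)            # group max, maps to 15
    scale = (hi - lo) / 15.0
    scale[scale == 0] = 1.0                      # guard against constant groups
    q = np.clip(np.round((w - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo                          # the zero-offset (lo) varies per group

def dequantize_q4_1_like(q, scale, lo):
    return (q.astype(np.float32) * scale + lo).reshape(-1)
```

For Q4_0 the per-group zero-offset is fixed rather than stored, so only the scale is kept per group, trading a little accuracy for less metadata.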
