Add support for BitnetForCausalLM (new model / new datatype) #7931
This is only taking the last non-zero value of the tensor as the scale, if I understand correctly?
The other quants use the absmax, so this looks a bit weird.
Does it work as expected? If so, how or why?
Should it be the absmean of the non-zero values instead?
Thanks for the reminder!
Actually, at this point the weight matrix only contains three distinct values: (1, 0, -1) * scale (for example, 0.66, 0, -0.66). So as long as we don't pick 0 as the scale, any non-zero value can serve as the scale. If we pick 0.66 as the scale, we transform (0.66, 0, -0.66) into (1, 0, -1); if we pick -0.66, we transform (0.66, 0, -0.66) into (-1, 0, 1).
I can add a break to the loop so it picks the first non-zero value and avoids useless looping.
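For illustration, here is a minimal sketch of that idea (not the PR's exact code; the function name is made up). It returns the magnitude of the first non-zero element, which for a pre-quantized ternary tensor is the same value absmax would find:

```c
#include <math.h>
#include <stddef.h>

// Sketch: the pre-quantized weights only take the values {-scale, 0, +scale},
// so any non-zero element gives the scale. Breaking on the first one avoids
// scanning the whole tensor; fabsf() fixes the sign convention.
static float find_scale(const float * w, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (w[i] != 0.0f) {
            return fabsf(w[i]); // first non-zero value, sign stripped
        }
    }
    return 1.0f; // all-zero tensor: fall back to a neutral scale
}
```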
This would slightly complicate the eventual Numpy implementation in gguf-py/gguf/quants.py for convert-hf-to-gguf.py. Quantization doesn't need to be particularly fast (I think?), but it needs to be reproducible (and possibly sane even on non-bitnet models, since people will try to apply this to models for which it's not appropriate). If all absolute non-zero values are the same in bitnet models, picking the absmax might be fine then. (It's dequantization that needs to be fast for good inference speed.)
Got it. I will change it to absmax to make it more reproducible.
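For reference, a hedged sketch of an absmax tensor-wide scale (illustrative only, not this PR's code). On a pre-quantized ternary tensor whose non-zero entries all share the same magnitude it returns the same value as picking any non-zero element, but it stays well-defined and reproducible on arbitrary tensors:

```c
#include <math.h>
#include <stddef.h>

// Sketch: maximum absolute value over the tensor.
static float absmax_scale(const float * w, size_t n) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        const float a = fabsf(w[i]);
        if (a > amax) {
            amax = a;
        }
    }
    return amax;
}
```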
@Eddie-Wang1120, actually (I only noticed it now), in section 2 of the BitNet 1.58b paper, they specifically say they use absmean:
See https://arxiv.org/html/2402.17764v1#S2
But if it's applied twice (e.g. on pre-quantized weights), then maybe the mean shouldn't include the zero values. (absmax is still fine, but only when the weights are pre-quantized.)
An interesting fact is that if we don't pre-quantize the weights to {-1, 0, +1} * scale, the generated tokens will be wrong. That's why I put the absmean quantization (weight pre-quantization) in convert-hf-to-gguf.py; otherwise we'd get a meaningless fp32/fp16 gguf model.
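For clarity, a minimal sketch of absmean ternary quantization as described in section 2 of the paper (the function name and epsilon value are illustrative, not taken from this PR): scale by the mean absolute value gamma, then round and clip to {-1, 0, +1}.

```c
#include <math.h>
#include <stddef.h>

// Sketch of W~ = RoundClip(W / (gamma + eps), -1, 1), gamma = mean(|W|).
static float absmean_quantize(const float * w, signed char * q, size_t n) {
    const float eps = 1e-6f;           // illustrative epsilon
    float scale = 0.0f;
    for (size_t i = 0; i < n; i++) {
        scale += fabsf(w[i]);
    }
    scale = scale / (float) n + eps;   // gamma + eps
    for (size_t i = 0; i < n; i++) {
        float v = roundf(w[i] / scale);
        if (v >  1.0f) v =  1.0f;      // clip to [-1, 1]
        if (v < -1.0f) v = -1.0f;
        q[i] = (signed char) v;
    }
    return scale;                      // kept so that W ~= q * scale
}
```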
I wonder, maybe this could be made even more compact?
Instead of fitting only 4 ternary values per byte, it would be possible to fit 5 of them (because 3⁵ = 243, which is smaller than 256).
To avoid using modulo when dequantizing, assuming multiplication by 3 is fast (it can be turned into an addition and a bit shift), maybe storing an inverted value would work.
Not sure what speed difference it would have compared to bit shifts and masks, though.
Here's an example program verifying that multiplication can be an alternative to modulo by 3. (The program, its compile-and-run command, and its output were attached as collapsed snippets.)
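A minimal sketch of the idea (this is not the original collapsed program, and it assumes one particular "inverted" encoding: the byte stores ceil(v * 256 / 243) for a value v packing five base-3 digits). Repeated multiplication by 3 then brings each digit into the top byte, so no modulo is needed; the program checks all 243 encodings and prints a message if they all round-trip.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Check every packing of 5 ternary digits d0..d4 (v = d0*81 + d1*27 + d2*9 + d3*3 + d4).
    for (uint32_t v = 0; v < 243; v++) {
        // "Inverted" byte: q = ceil(v * 256 / 243), i.e. v/243 as a 0.8 fixed-point
        // fraction rounded up, so truncation never loses a digit.
        uint32_t q    = (v * 256 + 242) / 243;
        uint32_t rest = v;
        for (int i = 0; i < 5; i++) {
            q *= 3;
            uint32_t digit    = q >> 8;     // multiply + shift instead of modulo
            uint32_t expected = rest / 81;  // most-significant remaining base-3 digit
            assert(digit == expected);
            q    &= 0xFF;
            rest  = (rest % 81) * 3;
        }
    }
    printf("multiplication-based base-3 unpacking matches for all 243 values\n");
    return 0;
}
```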
That is a very good thought!
The weights can certainly be compacted even further; as you said, we can treat each weight as a ternary value and pack several of them into a byte. What makes it less efficient is that we haven't found a suitable SIMD solution for that packing. Whereas the 2-bit compaction can use _mm256_sign_epi16, the base-3 packing doesn't seem to work with those strict alignment requirements.
So I chose the 2-bit compaction to begin with, while still looking for a more efficient solution.
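For reference, this is roughly what the 2-bit path relies on; a minimal AVX2 sketch (illustrative only, not the PR's actual vec_dot):

```c
#include <immintrin.h>

// Sketch (requires AVX2): applying ternary weights {-1, 0, +1} to int16 activations
// without a multiply. _mm256_sign_epi16(a, w) negates lanes of 'a' where the matching
// lane of 'w' is negative, zeroes them where 'w' is zero, and passes them through where
// 'w' is positive -- exactly a * w for ternary w.
static inline __m256i ternary_madd(__m256i acc, __m256i a, __m256i w) {
    const __m256i prod = _mm256_sign_epi16(a, w);
    // widen pairwise to int32 and accumulate
    return _mm256_add_epi32(acc, _mm256_madd_epi16(prod, _mm256_set1_epi16(1)));
}
```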
Why is the same scale stored 8 times?
I noticed that there is an alignment restriction in gguf of 32 bytes, so I stored the float32 scale 8 times to fill the alignment. It would still work if I changed it to scale_ptr[0] = i2_scale.
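A hedged sketch of what that looks like (names are illustrative, not the PR's exact code): the tensor-wide f32 scale is appended after the packed 2-bit weights, repeated 8 times so the appended data is 8 * sizeof(float) = 32 bytes and GGUF's 32-byte alignment is preserved, even though only scale_ptr[0] is ever read back.

```c
#include <stddef.h>
#include <stdint.h>

// Sketch: dst + packed_size is assumed to be suitably aligned for float writes.
static void append_scale(uint8_t * dst, size_t packed_size, float i2_scale) {
    float * scale_ptr = (float *)(dst + packed_size);
    for (int k = 0; k < 8; k++) {
        scale_ptr[k] = i2_scale; // 8 * 4 bytes = 32 bytes of padding-with-scale
    }
}
```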
Regarding the tensor-wide scales, even though the paper suggests using them, I wonder if using block scales would work too, so that it works better with the existing ggml type infrastructure. The scales could even all be made equal if that's a problem.

When (or if) packing 5 values per byte, since the row sizes of the bitnet models are usually not multiples of 5 (e.g. 1536, 2048), and since the 3B model uses a hidden_size of 3200 which isn't a multiple of 256, using blocks of 128 elements could work: two groups of 64 elements, each group having 12 bytes with 5 elements per byte plus 1 more byte with 4 elements, so 2*(12+1) = 26 bytes, and then a scale. If the scale is an f32, that would make this 1.875 bits per weight, while an f16 or bf16 scale would make this 1.75 bits per weight. (No scale would be 1.625 bpw, which is very close to the ideal of 1.5849625 bpw.)

If packing only 4 ternary values per byte (as in i2_s), then using blocks of 128 elements with an f32 scale would make this 2.25 bits per weight, while a scale in f16 or bf16 would make this 2.125 bits per weight, which would pretty much be like Q2_0 if that existed.
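For concreteness, a hedged sketch of the 128-element, 5-per-byte layout described above (the struct and field names are made up; no such ggml type exists in this PR):

```c
#include <stdint.h>

// Hypothetical block: 128 ternary elements, two 64-element groups of
// 12 bytes x 5 elements + 1 byte x 4 elements, plus one f32 scale.
// 2*(12+1) + 4 = 30 bytes per 128 elements -> 30*8/128 = 1.875 bits per weight.
typedef struct {
    uint8_t qs[2*(12 + 1)]; // 26 bytes of base-3 packed ternary values
    float   d;              // block scale
} block_ternary_128;
```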
Well, personally, I still think the tensor-wide scale is a better fit for bitnet. At least for the 2-bit compaction, 2.25 bpw means around 10% of wasted model size, which is a bit beyond what's acceptable.
Since 2 bits is already wasting 20% of the tensor size compared to the 1.6 bpw ideal for ternary, maybe there could be a way to still make this a block-wise quant (e.g. 4 elements per block of 1 byte), and have a row-wise/tensor-wise scale by somehow encoding it in unused bits in the first few blocks of each row? Might be a bad idea though, but I don't know yet why (maybe the overhead in recovering the scale?). (This would also require asserting a minimal row size in the quantize function.)
Because 4 ternary values fit in 7 bits (3^4 == 81 < 128), and you're already using a lookup table to expand the packed bits into 4 bytes, this could let the SIMD vec_dot stay pretty much identical to how it is now, except maybe it could include the scaling in its result?
Not sure yet how to pack the scale in i8_s, though, or some other way to let ggml_vec_dot_i2_i8_s have access to its scale.
Anyway, at least this gives some ideas to try eventually.
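As a rough illustration of the "scale hidden in unused bits" idea (purely hypothetical names, not ggml code): keep 4 ternary values in the low 7 bits of every byte (3^4 = 81 < 128) and spread a 32-bit f32 row scale across the top bits of the first 32 bytes of the row.

```c
#include <stdint.h>
#include <string.h>

// Sketch: store one bit of the f32 scale in the MSB of each of the first 32 bytes.
static void pack_row_scale(uint8_t * qs /* at least 32 bytes */, float scale) {
    uint32_t bits;
    memcpy(&bits, &scale, sizeof(bits));
    for (int i = 0; i < 32; i++) {
        qs[i] = (uint8_t)((qs[i] & 0x7F) | (((bits >> i) & 1u) << 7));
    }
}

// Sketch: recover the scale from those 32 top bits.
static float unpack_row_scale(const uint8_t * qs) {
    uint32_t bits = 0;
    for (int i = 0; i < 32; i++) {
        bits |= (uint32_t)(qs[i] >> 7) << i;
    }
    float scale;
    memcpy(&scale, &bits, sizeof(scale));
    return scale;
}
```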
Got your idea. Assume we have 128 contiguous values: we can compact them into (128 / 4) * 7 = 224 bits, and the other 32 bits would be a float32 scale, so it's still 2 bpw. The block size could be 28 chars (224 bits) + 1 float32 (32 bits) == 32 bytes. One thing that worries me a little is that we'd need to do some shifting to align the weights so we can index into the lookup table; it may slow down the kernel, but it's worth a try.
Note that the unit for block sizes is elements, while type sizes are in bytes.
To keep the alignment, my suggestion was actually to keep using 8 bits per 4 elements (so that alignment remains easy), but also use the top bit of the first 16 or 32 bytes to store the scale. Only the lower (or upper? doesn't matter) 7 bits of those bytes would store the 4 elements, using the fact that 3^4 == 81 < 128 == 2^7.
To go for maximum compactness, the same idea could be applied to 5 elements per byte to achieve 1.625 bpw. The type size would be 13 bytes and the block size 64 elements: 12 bytes of 5 elements per byte and 1 byte of 4 elements, plus part of the scale. But this is more complicated, because 5 isn't a power of 2, so the SIMD vec_dot would need lots of non-trivial modifications, unlike with the other 2 bpw suggestion.
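A hedged sketch of that 1.625 bpw variant (hypothetical, not an existing ggml type; where the scale bits actually live is left open):

```c
#include <stdint.h>

// Hypothetical block: 64 ternary elements in 13 bytes.
// Twelve bytes carry 5 base-3 packed elements each and the last byte carries 4,
// so 12*5 + 4 = 64 elements in 13*8 = 104 bits -> 1.625 bits per weight;
// spare bits (and/or the last byte's unused range) would have to carry the scale.
typedef struct {
    uint8_t qs[13];
} block_ternary_64;
```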