Support BitNet b1.58 ternary models #5761
Comments
Wow, that is indeed very promising, and quite different from the current quant approaches too. It seems that instead of quantizing models post-training, it quantizes them during training. I am sure, though, that if this approach proves to be successful, model trainers like Jon Durbin, Teknium and Eric Hartford will jump in quickly. Aside from the obvious benefits during inference, in theory this could also allow much higher-quality LoRA training at a lower memory cost? You can theoretically train on GGUF models, but that is generally not recommended since quality suffers too much compared to an fp16 model, so it seems this approach would help in that regard as well. @ikawrakow What do you think about this paper?
Well, I have been wondering for a while why nobody is training quantized models directly, given how close we can come to the performance of the full-precision model with post-training quantization. Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16.
Let's wait till they post the actual code up... then maybe it will be clearer :)
Is there something to implement on the inference side? Seems like it's just the training method that is different. The produced model (be it 1-bit, 2-bit or N-bit) should be possible to infer as usual, correct? But I share @ikawrakow's sentiment - let's wait and see first
:) Yes. Let's wait till the authors' code is up. Really hoping this is going to be the way of the future :)
As I understand it, the figures in that table are not meant to represent the model size, but the actual GPU memory usage during inference. So those 2.22 GB include the KV cache. Given it's LLaMA without GQA, I would imagine it being quite big.
The `IQ1_S` quantization is already essentially ternary. If we do get meaningful trained quantized models, I would finally be able to retire from contributing quantization methods to llama.cpp.
Let's pray they used 256 block size 😄
No, I'm actually hoping the hidden dimension is from the Fibonacci sequence. So we finally get a…
The models presented in these papers are not quantized. They are using ternary parameters (-1, 0, 1), not quantization, so it's a full-sized model. So I don't think expectations for the size of quantized models would apply in this case. Either way, we'll know when they release code.
I think the reason for that is that the Nvidia GPUs all those companies are using are designed and intended for native fp16 operations. It was fp32 before, then it was discovered that fp16 has negligible performance loss, so they started using that, and now they're working on fp8 as well. Also, our quants do some sort of scaling operation to convert the compressed integer weights back into their original floats, with the same scaling factor applied across a group of weights to save space. It's easy to compute that going from an fp16 model into a q4_0, but I'm not really sure how to optimally do this backwards in the training phase. This paper seems not to be quantizing/dequantizing the model during training; rather, the model itself is literally built with ternary weights and 8-bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach if it works well.
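For reference, the b1.58 paper describes the weight rounding used during training as an absmean scheme: scale the weight matrix by its mean absolute value, then round each entry to the nearest value in {-1, 0, +1}. A minimal sketch of that step (function and variable names are mine, not from any released code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative only: absmean rounding as described in the BitNet b1.58 paper.
// Scale the weights by their mean absolute value, round to the nearest
// integer, then clip to the ternary set {-1, 0, +1}.
std::vector<int8_t> absmean_ternarize(const std::vector<float> & w, float & scale_out) {
    double sum_abs = 0.0;
    for (float x : w) sum_abs += std::fabs(x);
    const float gamma = (float)(sum_abs / w.size()) + 1e-8f;  // mean |w|; eps guards against /0

    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        const int v = (int)std::lround(w[i] / gamma);         // nearest integer
        q[i] = (int8_t)std::max(-1, std::min(1, v));          // clip to {-1, 0, +1}
    }
    scale_out = gamma;  // kept so outputs can be rescaled at inference time
    return q;
}
```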
Same, and I was also hoping that given 3-4 bit weights, it might reduce the solution surface so dramatically that we might even drop the backprop nonsense entirely and use something else for the pretraining (for fine-tuning it kinda makes sense, because you want just a little nudge, not a dramatic change). If 1-2 bit is feasible, then this might again change the problem space and maybe we could go straight to random evolutionary algorithms or something like that. I wonder why nobody has tried that (and I hope it's not because I'm an idiot).
If this pans out, we should see everyone switching to it and throwing 10 times more parameters in the model. Plus NVIDIA should take notice of this.
Designing hardware around pure adders seems so damn juicy, god damn that would be so insanely fast.
Those addition-only matrix operations are brilliant. This could be so fast in the future with dedicated ASICs. @igorbarshteyn could you clean up the title of this issue a bit though? Maybe just something like: "Support BitNet b1.58 ternary models"
Done @EwoutH
Code will be populated here when they are ready:
Have you run any benchmarks? Obviously after-the-fact 1-2 bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible, right?
What are you talking about? These are not quants. It's a model with 1.58-bit weights instead of FP16.
If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S), and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low-bit quantizations work better. Obviously this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6-bit quantizations possible for existing models via distillation techniques. For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit... so yeah, I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from-scratch model.
It's probably easier to train a model with data output from a better F16 model.
Yeah, I agree! Though broadly I think of SPIN as one of the class of teacher-student distillation techniques. Either way - this should be possible, and has incredible potential. I really don't see the community investing in training cutting-edge 60B+ parameter <2-bit models, so we really need to find clever ways to extract the right weights starting from successful fp16 models.
These papers might be a practical approach for existing model conversion:
The 1-bit idea in the BitNet paper (https://arxiv.org/abs/2310.11453) has been adopted in this recent 1-bit quantization paper (https://arxiv.org/abs/2402.11295).
What CPU is that?
I literally just got the numbers from @catid's AMD 7950X 3B example and extrapolated them up. From my own experience, the inference speed for non-BitNet models is inversely proportional to model size, provided you have enough memory to hold everything.
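(That rule of thumb follows from single-batch token generation being memory-bandwidth bound: every generated token has to stream essentially all of the weights from memory, so roughly

$$\text{tokens/s} \;\approx\; \frac{\text{memory bandwidth [GB/s]}}{\text{model size [GB]}}$$

which is why shrinking the weights by some factor should speed up generation by about the same factor.)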
A new implementation, BitMat, just got open-sourced.
Some more context on Reddit:
Sorry to reach out again, but looking over the repository you point to in the quote above, those model weights are all fp32. Now, I don't have the expertise to go in and look at those weights - perhaps they are all just 1.000000, 0.000000, -1.000000 - but have you actually examined those models to confirm that is really what BitNet is using for weights (i.e. fp32 for each ternary weight)? I would have expected some other container than fp32 weights for storing these ternary values (some sort of int8 construct). I don't think llama.cpp can start working on an implementation for supporting that model type as it is, because those model weights aren't in a format we'd expect anyone to actually load. If you could share a method to convert those weights to the correct storage format, we could start working from there. Thanks in advance - there doesn't seem to be much open progress yet on ternary BitNets despite the obvious benefits.
@ExeVirus I agree that it's likely because of a lack of support for loading ternary models. I think there's nothing more to add to this discussion until @shumingma finishes training their models. Perhaps this isn't just a llama.cpp issue, but more of a ggml/gguf file format support issue?
Sounds like it, but if that's true, we already have an open ternary-weight model available - it's just stored as fp32 per weight. That sounds relatively straightforward to at least get a CPU-only version working with some bit packing, since it's just LLaMA with BitLinear. At the very least, isn't Q2-Q3 post-quantization enough to represent -1, 0, 1?
Two more pre-trained models: It now seems we have enough models to test on, and inference implementations (like bitnet_cpu and BitMat) to take inspiration from. What would be a good approach to implementing support for ternary models in llama.cpp, and how can we move that forward?
My guess is that we'll need a new quantization type for this, say QB, with one sign bit and one data bit for a total of two bits per weight. Activations will be 8 bits, as specified in the paper. We shouldn't use an existing quant for this, as those group the weights together with a scaling factor and we don't need that for BitNet. This quant will be lossless! Assuming the models come in as f32 with values -1.00, 0.00, 1.00, we would need to create the new QB quant and add some check in the convert script for it.
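To make that concrete, here is a rough sketch of how such a two-bits-per-weight layout could be packed; the QB name and this exact bit assignment are hypothetical, not an agreed-on ggml format:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "QB"-style layout: 2 bits per ternary weight, low bit = magnitude,
// high bit = sign; four weights packed per byte.
//   +1 -> 0b01,   -1 -> 0b11,   0 -> 0b00
static inline uint8_t encode_trit(int8_t w) {   // w in {-1, 0, +1}
    return w == 0 ? 0x0 : (w > 0 ? 0x1 : 0x3);
}

static inline int8_t decode_trit(uint8_t c) {
    const int8_t mag = c & 0x1;
    return (c & 0x2) ? (int8_t)-mag : mag;
}

void pack_qb(const int8_t * w, uint8_t * out, size_t n) {  // n must be a multiple of 4
    for (size_t i = 0; i < n; i += 4) {
        out[i / 4] = (uint8_t)(encode_trit(w[i    ])
                             | encode_trit(w[i + 1]) << 2
                             | encode_trit(w[i + 2]) << 4
                             | encode_trit(w[i + 3]) << 6);
    }
}
```

A conversion path could then verify that every incoming f32 value is exactly -1.0, 0.0 or 1.0 before packing, which is also what makes the format lossless as noted above.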
Personally I'd vote for a two-bit ones' complement format: one bit set for the value +1, the other bit set for the value -1, and neither bit set for the value 0. I believe this will lead to the most efficient kernels, using SIMD mask-and-add on both Intel and ARM CPUs.
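A scalar sketch of how that representation could be consumed; splitting the two bits into a "plus" plane and a "minus" plane is just one way to realize the idea, and a real kernel would do the same mask-and-add with SIMD registers:

```cpp
#include <cstdint>

// Ternary weights as two bit-planes: bit i of `plus` set means w[i] = +1,
// bit i of `minus` set means w[i] = -1; neither set means w[i] = 0.
// The dot product with 8-bit activations then needs no multiplications.
int32_t dot_ternary_8(uint8_t plus, uint8_t minus, const int8_t * act) {
    int32_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        const uint8_t bit = (uint8_t)(1u << i);
        if (plus  & bit) acc += act[i];  // +1 weight: add the activation
        if (minus & bit) acc -= act[i];  // -1 weight: subtract it
    }
    return acc;
}
```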
The 2.22 GB is "memory use", which probably means during training. That would presumably include the un-quantized fp16 version of the parameters needed during training (and possibly the optimizer state, which for Adam would be two fp32s per parameter, although I doubt they included that).
Update: Microsoft quietly released BitBLAS: github.com/microsoft/BitBLAS. What's more important is that they used BitBLAS with the open-source-reproduced 3B ternary BitNet model, and they released all that code for testing and inference here: github.com/microsoft/BitBLAS/tree/main/integration/BitNet. From my quick reading, that's more than enough information to write the inference kernels and probably make GGUFs for the 3B/1.3B/700M models. (Everything is MIT licensed.)
If their own researchers are using the 1bitLLM reproduced model then from the looks of it Microsoft is never going to release the original BitNet paper models 😞.
The issue with that is that those are toy models solely for testing BitNet, and they aren't really suitable for actual chatting. What will get a dev onboard (maybe even me, if I have the time!) is for some company to release a fully trained BitNet model of Llama quality for us to play with. Otherwise the implementation will remain an experiment, and it'll probably have few users and little support. Now of course this is also a chicken-and-egg problem, since if we can demonstrate how effective BitNet inference is with llama.cpp, then some company may be incentivized to train some proper models.
Exactly. I'm actually having one of the researchers in the RWKV community run some tests where they replace their weights with ternary weights. Seeing some initial success with that test, albeit it trains more slowly than they are used to for their architecture. If they do start that work, I'll be sure to report. Update on RWKV: they weren't satisfied with such a low learning rate compared to their normal training speeds (too expensive to give it a full run). If someone has the experience with hyperparameter tuning and the hardware to test, reach out to me and I can get you set up with what they would need to prove to themselves that ternary is worth it.
Haven't looked into it in detail yet, but someone just submitted a BitNet PR! The new format uses 2 bits per weight.
Another paper has been released that builds on BitNet with ternary weights. What's interesting here is that they made an FPGA implementation designed for ternary math.
I suppose the issue is that FPGAs just don't have the raw FLOPS that GPUs have - so even if you can program them to run more efficiently, they'll be much slower?
Kinda. FPGAs let you lay out the gates exactly to match the algorithm you want to run. With something like BitNet, that means you only need to design matmul units for f8, and the ternary 'multipliers' can be laid out in more optimal ways, so that more operations are done per 'FLOP' than on a GPU. With a ton of optimization, FPGAs can get much more out of their FLOPs than GPUs. The issue is, if you have that much time to optimize, then an ASIC might have made sense in the first place. If only there were machine-learning FPGA algorithm optimizers... There was that one built on AlphaZero, but it was very manual: https://deepmind.google/discover/blog/alphadev-discovers-faster-sorting-algorithms/
Yeah, to be more concrete: with an FPGA, the total number of logic-gate operations required can be ~8x smaller (if 2-bit vs 16-bit) or ~5x smaller (if ternary vs 16-bit). However, FPGAs will just have far lower operations per second than GPUs, right? At least for current FPGAs and GPUs. Said differently, the cost per FLOP of GPUs is just very low, and you can't make up for that merely by being more efficient on more primitive operations with FPGAs, or probably ASICs... unless mass-manufactured.
The less power used, the faster you can run the clock, for the most part.
I think of the FPGA implementation as more of a way to show companies that, hey, it's possible to run BitNet very efficiently on custom hardware, and if it all pans out then it might be worth having special ternary units inside future CUDA cores, or maybe even a special chip just for BitNet. I don't think we'll see people installing FPGA cards in their computers for running LLMs.
That would be nice, however BitNet is being held back by the undertrained models currently available. If the original authors were to release their 7B, 13B, and up, it would go a long way toward convincing people this model architecture really does scale as promised.
I feel like the gap would only get smaller compared to full-fat FP16 as it scales, but we will have to see in reality. I wonder who will be first to train a bigger one. I think someone could do a bit better than what we have right now, but we need a company willing to take a risk on even bigger ones. Maybe the Jamba people would be interested; they took a risk with trying Mamba.
The development was continued in #8151
This issue was closed because it has been inactive for 14 days since being marked as stale.
New paper just dropped on arXiv describing a way to train models in 1.58 bits (with ternary values: 1, 0, -1). The paper reports performance on par with, and in places better than, equivalently-sized fp16 models, and perplexity nearly equal to fp16 models. The authors state that their test model is built on the LLaMA architecture and can be easily adapted to llama.cpp.
[Edited to add: Further reading into it by fellow Redditors shows that we can't use this to quantize existing models trained to fp16. They'd have to be trained in this ternary mode from the start. But I think it would still be something that we should implement, because models of that flavor will be coming soon.]
This is all over r/LocalLLaMA right now:
https://www.reddit.com/r/LocalLLaMA/comments/1b21bbx/this_is_pretty_revolutionary_for_the_local_llm/
I think, if my napkin math is right, it would let us run something like 120B models in 24 GB VRAM, or 30B in... 8 GB?
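(Rough check of that napkin math, assuming ~1.58 bits per weight and ignoring the KV cache and activations:

$$120 \times 10^9 \times \tfrac{1.58}{8}\ \text{bytes} \approx 23.7\ \text{GB}, \qquad 30 \times 10^9 \times \tfrac{1.58}{8}\ \text{bytes} \approx 5.9\ \text{GB}$$

so a 120B model would just about fit in 24 GB, and a 30B model in well under 8 GB.)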
Please implement @ggerganov and friends!
https://arxiv.org/abs/2402.17764