
Support BitNet b1.58 ternary models #5761

Closed
igorbarshteyn opened this issue Feb 28, 2024 · 90 comments
Labels
enhancement (New feature or request), stale, Tensor Encoding Scheme (https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes)

Comments

@igorbarshteyn

igorbarshteyn commented Feb 28, 2024

New paper just dropped on arXiv describing a way to train models in 1.58 bits (with ternary values: 1, 0, -1). The paper shows performance gains over equivalently-sized fp16 models, with perplexity nearly equal to fp16 models. The authors state that their test model is built on the LLaMA architecture and can be easily adapted to llama.cpp.
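
For reference, the weight quantization described in the paper (absmean, as I read it): scale the weight matrix by its mean absolute value, then round each entry to the nearest of {-1, 0, +1}:

$$\tilde{W} = \mathrm{RoundClip}\left(\frac{W}{\gamma + \epsilon},\, -1,\, 1\right), \qquad \gamma = \frac{1}{nm}\sum_{ij} |W_{ij}|$$

with activations quantized to 8 bits, so the matrix multiplications reduce to integer additions.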

[Edited to add: Further reading into it by fellow Redditors shows that we can't use this to quantize existing models trained to fp16. They'd have to be trained in this ternary mode from the start. But I think it would still be something that we should implement, because models of that flavor will be coming soon.]

This is all over Reddit /LocalLLaMA right now:

https://www.reddit.com/r/LocalLLaMA/comments/1b21bbx/this_is_pretty_revolutionary_for_the_local_llm/

I think, if my napkin math is right, it would let us run something like 120B models in 24 GB VRAM, or 30B in... 8 GB?
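
(Weights-only sanity check on that napkin math, ignoring KV cache and activations: 120e9 params × 1.58 bits ≈ 1.9e11 bits ≈ 23.7 GB, and 30e9 × 1.58 bits ≈ 5.9 GB, so 24 GB and 8 GB are at least in the right ballpark.)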

Please implement @ggerganov and friends!

https://arxiv.org/abs/2402.17764

@igorbarshteyn igorbarshteyn added the enhancement label Feb 28, 2024
@igorbarshteyn igorbarshteyn changed the title This new quantization method (BitNet 1.58b) is revolutionary - and according to this new paper, can be easily built into llama.cpp This new quantization method (BitNet b1.58) is revolutionary - and according to this new paper, can be easily built into llama.cpp Feb 28, 2024
@igorbarshteyn igorbarshteyn changed the title This new quantization method (BitNet b1.58) is revolutionary - and according to this new paper, can be easily built into llama.cpp This new model training method (BitNet b1.58) is revolutionary - and according to this new paper, support can be easily built into llama.cpp Feb 28, 2024
@Dampfinchen

Wow, that is indeed very promising. And quite different from the current quant approaches too. Seems like instead of quantizing models post-training, it quantizes them during training. I am sure, though, that if this approach proves to be successful, model trainers like Jon Durbin, Teknium and Eric Hartford will jump in quickly.

Aside from the obvious benefits during inference, in theory that could also allow much higher quality LoRA training at a lower memory cost? You could theoretically train on GGUF models, but that is generally not recommended as quality suffers too much compared to an fp16 model, so it seems this approach would help in that regard as well.

@ikawrakow What do you think about this paper?

@ikawrakow
Contributor

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPUs is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arxiv (or HF), so I find it very hard to make myself read these revolutionary papers more carefully.

@igorbarshteyn
Author

Let's wait till they post the actual code up... then maybe it will be more clear :)

@ggerganov
Owner

Please implement @ggerganov and friends!

Is there something to implement on the inference side? Seems like it's just the training method that is different. The produced model (be it 1-bit, 2-bit or N-bit) should be possible to infer as usual, correct?

But I share @ikawrakow's sentiment - let's wait and see first

@igorbarshteyn
Author

:) Yes. Let's wait till the authors' code is up. Really hoping this is going to be the way of the future :)

@Dampfinchen

Dampfinchen commented Feb 28, 2024

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPUs is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arxiv (or HF), so I find it very hard to make myself read these revolutionary papers more carefully.

As I understand it, the figures in that table are not meant to represent the model size, but the actual GPU memory usage during inference. So those 2.22 GB include the KV cache. Given it's LLaMA without GQA, I would imagine it's quite big.
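
For rough intuition, the usual KV-cache estimate is 2 (K and V) × n_layers × n_heads × head_dim × context length × bytes per element. For a hypothetical 3B-class config (26 layers, hidden size 3200, fp16 cache, no GQA; these values are assumed here only for illustration) at 2048 tokens that is 2 × 26 × 3200 × 2048 × 2 ≈ 0.68 GB, so the cache alone could be a sizeable chunk of a 2.22 GB figure.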

@ikawrakow
Contributor

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code if necessary.

If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

@ggerganov
Owner

Let's pray they used 256 block size 😄

@ikawrakow
Contributor

No, I'm actually hoping the hidden dimension is from the Fibonacci sequence. So we finally get a ggml that does not use blocks 😄

@jetro30087

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPUs is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arxiv (or HF), so I find it very hard to make myself read these revolutionary papers more carefully.

The models presented in these papers are not quantized. They are using ternary parameters (-1, 0, 1) not quantization, so it's a full-sized model. So, I don't think expectations for the size of quantized models would apply in this case. Either way, we'll know when they release code.

@netrunnereve
Collaborator

netrunnereve commented Feb 28, 2024

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

I think the reason for that is that the Nvidia GPUs that all those companies are using are designed and intended for native fp16 operations. I mean, it was fp32 before; then it was discovered that fp16 has negligible performance loss, so they started using that. Now they're working on fp8 as well.

Also our quants do some sort of scaling operation to convert the compressed integer weights back into their original floats, with the same scaling factor applied across a group of weights to save space. It's easy to compute that going from an fp16 model into a q4_0, but I'm not really sure how to optimally do this backwards in the training phase. This paper seems not to be quantizing/dequantizing the model during training; rather, the model itself is literally built with ternary weights and 8-bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach if it works well.
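
To make the contrast concrete, a minimal sketch (illustrative structs only, not llama.cpp's actual q4_0 layout or API): a block-scaled quant reconstructs approximate floats from a shared scale, while a ternary weight is already just -1, 0 or +1, possibly times a single per-tensor scale:

```cpp
// Illustrative only: these are not llama.cpp's real structs or API.
#include <cstdint>

// Block-scaled quant: a group of integer weights shares one float scale.
struct BlockQ4Sketch {
    float  scale;   // per-block scaling factor
    int8_t q[32];   // quantized weights (packed to 4 bits in the real format)
};

// Reconstruct one approximate float weight: scale * integer.
inline float dequant_block(const BlockQ4Sketch &b, int i) {
    return b.scale * b.q[i];
}

// Ternary weight as described in the paper: the stored value already is
// -1, 0 or +1, so there is nothing to reconstruct beyond an optional
// per-tensor scale.
inline float dequant_ternary(int8_t t, float tensor_scale = 1.0f) {
    return tensor_scale * static_cast<float>(t);
}
```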

@cztomsik
Contributor

Well, I have been wondering for a while why nobody is training quantized models directly

Same, and I was also hoping that given 3-4 bit weights, it might reduce the solution surface so dramatically that we might even drop the backprop nonsense entirely and use something else... (for the pretraining; for fine-tuning it kinda makes sense, because you want just a little nudge, not a dramatic change)

If 1-2 bit is feasible, then this might again change the problem space and maybe we could go straight to random evolutionary algorithms or something like that. I wonder why nobody has tried that (and I hope it's not because I'm an idiot).

@errorsandwarnings

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

I think the reason for that is that the Nvidia GPUs that all those companies are using are designed and intended for native fp16 operations. I mean, it was fp32 before; then it was discovered that fp16 has negligible performance loss, so they started using that. Now they're working on fp8 as well.

Also our quants do some sort of scaling operation to convert the compressed integer weights back into their original floats, with the same scaling factor applied across a group of weights to save space. It's easy to compute that going from an fp16 model into a q4_0, but I'm not really sure how to optimally do this backwards in the training phase. This paper seems not to be quantizing/dequantizing the model during training; rather, the model itself is literally built with ternary weights and 8-bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach if it works well.

If this pans out, we should see everyone switching to it and throwing 10 times more parameters in the model. Plus NVIDIA should take notice of this.

@Gobz

Gobz commented Feb 28, 2024

Designing hardware around pure adders seems so damn juicy, god damn that would be so insanely fast.

@Gobz

Gobz commented Feb 28, 2024

[table omitted: extrapolated memory figures from the paper's data]
Did some simple linear regression from the data in the paper, I hope their data is legit

@EwoutH
Contributor

EwoutH commented Feb 28, 2024

Those addition-only matrix operations are brilliant. This could be so fast in the future with dedicated ASICs.
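
A tiny scalar sketch of why (hypothetical function, just to illustrate the point): with weights restricted to {-1, 0, +1}, a dot product needs no multiplications at all; every term is an add, a subtract, or a skip:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Dot product against ternary weights: every multiply becomes an
// add, a subtract, or nothing.
float ternary_dot(const std::vector<int8_t> &w, const std::vector<float> &x) {
    float acc = 0.0f;
    for (size_t i = 0; i < w.size(); ++i) {
        if      (w[i] > 0) acc += x[i];  // weight +1
        else if (w[i] < 0) acc -= x[i];  // weight -1
                                         // weight 0: contributes nothing
    }
    return acc;
}

int main() {
    std::vector<int8_t> w = {1, 0, -1, 1};
    std::vector<float>  x = {0.5f, 2.0f, 1.5f, -1.0f};
    std::printf("%.1f\n", ternary_dot(w, x)); // prints -2.0
}
```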

@igorbarshteyn could you clean up the title of this issue a bit though? Maybe just something like:

Support BitNet b1.58 ternary models

@igorbarshteyn igorbarshteyn changed the title This new model training method (BitNet b1.58) is revolutionary - and according to this new paper, support can be easily built into llama.cpp Support BitNet b1.58 ternary models Feb 28, 2024
@igorbarshteyn
Author

Done @EwoutH

@Dampfinchen

Did some simple linear regression from the data in the paper, I hope their data is legit

Nice table, thank you for the demonstration. The cool thing is that these figures are from inference with outdated non-GQA models. So with modern GQA models, the VRAM usage would be even smaller than what's listed here.

@igorbarshteyn
Author

Code will be populated here when they are ready:

https://github.com/microsoft/unilm/tree/master/bitnet

@kinchahoy

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code if necessary.

If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

@sorasoras

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code if necessary.
If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? This is not a quant. It's a model with 1.58-bit weights instead of FP16.

@kinchahoy

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code if necessary.
If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? This is not a quant. It's a model with 1.58-bit weights instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better.

Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6-bit quantizations possible for existing models via distillation techniques.

For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

@sorasoras

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code if necessary.
If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? This is not a quant. It's a model with 1.58-bit weights instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better.

Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6-bit quantizations possible for existing models via distillation techniques.

For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

It's probably easier to train a model on data output from a better F16 model (SPIN). I imagine it would be difficult to do distillation via quants, because the imatrix is already sort of like a distillation; it's hard to do better for now.

@kinchahoy

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code if necessary.
If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? This is not a quant. It's a model with 1.58-bit weights instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better.
Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6-bit quantizations possible for existing models via distillation techniques.
For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

It's probably easier to train a model on data output from a better F16 model (SPIN). I imagine it would be difficult to do distillation via quants, because the imatrix is already sort of like a distillation; it's hard to do better for now.

Yeah, I agree! Though broadly I think of SPIN as one of a class of teacher-student distillation techniques. Either way, this should be possible and has incredible potential. I really don't see the community investing in training cutting-edge 60B+ parameter <2-bit models, so we really need to find clever ways to extract the right weights starting from successful fp16 models.

@WebsiteInc

WebsiteInc commented Feb 29, 2024

These papers might be a practical approach for existing model conversion:
[Token-Scaled Logit Distillation for Ternary Weight Generative Language Models](https://openreview.net/forum?id=FUnEkOkodU)
[Binary and Ternary Natural Language Generation](https://huggingface.co/papers/2306.01841)

@tuyen-huynh

The 1-bit idea in the Bitnet paper (https://arxiv.org/abs/2310.11453) has been adopted in this recent 1-bit quantization paper (https://arxiv.org/abs/2402.11295).

@sorasoras

I was just curious how fast it could run on CPU. The answer is not very fast compared to GPUs sadly, even for 3B models. The best I could do was upper-bound about 40 tokens/second on Xeon workstation processors. I mean it's like 5x faster than Gemma 3B but nothing stellar in the grand scheme of things.

BitNet speedup on AVX2: lithium0003 on Github suggested using the _mm256_sign_epi8() intrinsic to greatly speed up the AVX2 version. It's now running at 28 tokens/second using AVX2 on Intel 12th gen CPU and 50 tokens/second using AMD Ryzen 9 7950X with AVX-512. Code is checked in

It may not be GPU fast, but it's decently fast. Inference speeds scale up pretty linearly so you'll get something like this:
| Model size | Speed (t/s) |
| --- | --- |
| 3B | 50 |
| 7B | 25 |
| 13B | 12.5 |
| 30B | 6 |
| 70B | 3.5 |

Keep in mind that BitNet is supposed to have F16-equivalent performance, so it may make more sense to compare those numbers against the F16 or perhaps Q8_0 equivalent. People who want something faster can always run on GPU and I'm sure someone will create good kernels for BitNet.

what cpu is that?

@netrunnereve
Collaborator

netrunnereve commented Apr 2, 2024

I literally just got the numbers from @catid's AMD 7950X 3B example and extrapolated them up. From my own experience, the inference speed for non-BitNet models is inversely proportional to their size, provided you have enough memory to hold everything.

@EwoutH
Contributor

EwoutH commented Apr 3, 2024

A new implementation, BitMat, just got open-sourced.

BitMat: Improving Ternary Matrix Multiplication with Triton

BitMat is a Python package designed to optimize matrix multiplication operations by utilizing custom kernels written in Triton. Our package leverages the principles outlined in the "1bit-LLM Era" paper, specifically utilizing packed int8 data to enhance computational efficiency and performance in deep learning and numerical computing tasks.

Some more context on Reddit:

During the training phase, we implement a custom forward and backward propagation mechanism that calculates the gradient as outlined in the paper. This is done in FP16 precision using the pre-quantized weights (W) and inputs (X). For inference, we've optimized storage by packing the model weights into int8 format, achieving a 4x reduction in size. Our kernels then unpack these values during computation, perform the necessary multiplications, and accumulate the results. This approach significantly reduces the memory footprint required for storing and operating the model, aligning with the paper's findings.
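
My guess at what the packed int8 layout could look like (purely an assumption based on the description above, not BitMat's actual code): four 2-bit ternary codes per byte, unpacked on the fly during the dot product:

```cpp
#include <cstdint>
#include <vector>

// Assumed encoding (not BitMat's real one): 0 -> 0, 1 -> +1, 2 -> -1,
// four weights per byte, i.e. 4x smaller than one int8 per weight.
std::vector<uint8_t> pack_ternary(const std::vector<int8_t> &w) {
    std::vector<uint8_t> packed((w.size() + 3) / 4, 0);
    for (size_t i = 0; i < w.size(); ++i) {
        uint8_t code = (w[i] == 0) ? 0 : (w[i] > 0 ? 1 : 2);
        packed[i / 4] |= code << (2 * (i % 4));
    }
    return packed;
}

// Unpack during computation and accumulate against int8 activations.
int32_t packed_dot(const std::vector<uint8_t> &packed,
                   const std::vector<int8_t> &x) {
    int32_t acc = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        uint8_t code = (packed[i / 4] >> (2 * (i % 4))) & 0x3;
        if      (code == 1) acc += x[i];
        else if (code == 2) acc -= x[i];
    }
    return acc;
}
```

Storage drops to 0.25 bytes per weight this way; the real kernels presumably do the unpacking in registers rather than element by element.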

@ExeVirus

Thank you all for the interest in our BitNet b1.58 work! I believe that it will be a great benefit for the community to implement it in the awesome llama.cpp.

I noticed that someone had open-sourced some models (https://huggingface.co/1bitLLM) that reproduced the results in our paper. The implementation and the results look good to me. I think we should be able to get started from it.

Please let me know if I can help with the implementation, thank you!

@shumingma

Sorry to reach out again, but looking over the repository you point to from the quote above, those model weights are all fp32.

Now, I don't have the expertise to go in and look at those weights (perhaps they are all just 1.000000, 0.000000, -1.000000), but have you actually examined those models to confirm that is really what BitNet is using for weights (i.e. fp32 for each ternary weight)? I would have expected some other container than fp32 weights for storing these ternary values (some sort of int8 construct).

I don't think llama.cpp can start working on an implementation supporting that model type as it is, because those model weights aren't in a format we'd expect anyone to actually load. If you could possibly share a method to convert those weights to the correct storage format, we could start working from there.

Thanks in advance; there doesn't seem to be much open progress yet on ternary BitNets despite the obvious benefits.

@paperdev-code

paperdev-code commented Apr 21, 2024

those model weights are all fp32.

@ExeVirus I agree that it's likely because of a lack of support for loading ternary models. I think there's nothing more to add to this discussion until @shumingma finishes training their models. Perhaps this isn't just a llama.cpp issue, but more of a ggml/gguf file format support issue?

@ExeVirus

Sounds like it, but if that's true, we already have an open ternary weight model available, it's just in fp32 per weight. That sounds relatively straightforward to at least get a CPU only version working and doing some bit packing, since it's just llama with bit linear. At the very least, isn't Q2-Q3 post-quantization enough to represent -1, 0, 1?

@EwoutH
Contributor

EwoutH commented Apr 22, 2024

Two more pre-trained models:

It now seems we have enough models to test on, and inference implementations (like bitnet_cpu and BitMat) to be inspired by.

What would be a good approach to implement support for ternary models in llama.cpp, and how can we move that forward?

@netrunnereve
Collaborator

Sounds like it, but if that's true, we already have an open ternary weight model available, it's just in fp32 per weight. That sounds relatively straightforward to at least get a CPU only version working and doing some bit packing, since it's just llama with bit linear. At the very least, isn't Q2-Q3 post-quantization enough to represent -1, 0, 1?

What would be a good approach to implement support for ternary models in llama.cpp, and how can we move that forward?

My guess is that we'll need a new quantization type for this, say QB with one sign bit and one data bit for a total of two bits per weight. Activations will be 8 bits as specified in the paper. We shouldn't use an existing quant for this as those group the weights together with a scaling factor and we don't need that for Bitnet. This quant will be lossless!

Assuming that the models come in f32 as -1.00, 0.00, 1.00, we would need to create the new QB quant and have some check in quantize to make sure that all the f32 weights are in the correct ternary format before quantizing. Looking into the K-quants PRs should give some guidance on how to create a new quant, which also requires you to write the math kernels for the actual calculations. Now this may sound straightforward for someone who's done this before, but it'll probably be really tricky in practice 😉.
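
A minimal sketch of the "check in quantize" part (hypothetical helper, not an existing llama.cpp function): confirm the f32 tensor really is ternary before converting, and refuse otherwise:

```cpp
#include <cmath>
#include <cstddef>

// Returns true only if every f32 weight is (numerically) -1, 0 or +1,
// so a 2-bit ternary encoding of this tensor would be lossless.
bool is_ternary_f32(const float *w, size_t n, float eps = 1e-6f) {
    for (size_t i = 0; i < n; ++i) {
        const float v = w[i];
        if (std::fabs(v) > eps &&
            std::fabs(v - 1.0f) > eps &&
            std::fabs(v + 1.0f) > eps) {
            return false;  // found a non-ternary weight
        }
    }
    return true;
}
```

An existing 2-bit quant could hold these values too, but its per-block scales would be wasted; a dedicated type keeps the encoding exact.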

@nickovs

nickovs commented Apr 23, 2024

My guess is that we'll need a new quantization type for this, say QB with one sign bit and one data bit for a total of two bits per weight.

Personally I'd vote for a two bit ones' complement format, which would mean that you'd have one bit set for the value +1, the other bit set for the value -1 and neither bit set for the value 0. I believe that this will lead to the most efficient kernels using SIMD mask-and-add on both Intel and ARM CPUs.
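
A scalar illustration of the idea (a sketch, not a finished spec): bit 0 marks a +1 weight, bit 1 marks a -1 weight, neither bit set means 0. A SIMD kernel would apply the same two masks a whole register at a time:

```cpp
#include <cstdint>
#include <cstddef>

// Mask-and-add: select the activation via the +1 bit, subtract it via
// the -1 bit, otherwise contribute zero.
float masked_dot(const uint8_t *codes, const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float plus  = (codes[i] & 0x1) ? x[i] : 0.0f;  // +1 plane
        const float minus = (codes[i] & 0x2) ? x[i] : 0.0f;  // -1 plane
        acc += plus - minus;
    }
    return acc;
}
```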

@danpovey

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPUs is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arxiv (or HF), so I find it very hard to make myself read these revolutionary papers more carefully.

The 2.22GB is "memory use" which probably means for training. That would presumably include the un-quantized fp16 version of the parameters, needed during training. (And possibly the optimizer state, which for Adam would be two fp32's per parameter, although I doubt they included that).

@ExeVirus

ExeVirus commented Jun 9, 2024

Update

Microsoft quietly released BitBLAS: github.com/microsoft/BitBLAS
This is a release of their specialized kernel creator for quantized bit inference.

What's more important is that they used this BitBLAS with the open-source-reproduced 3B ternary BitNet model.

They released all that code for testing and inference here: github.com/microsoft/BitBLAS/tree/main/integration/BitNet.

From my quick reading - that's more than enough information to write the kernels for inference and probably make GGUFs for the 3B/1.3B/700M models.

[Everything is MIT licensed]

@netrunnereve
Collaborator

What's more important is that they used this BitBLAS with the open-source-reproduced 3B ternary BitNet model.

If their own researchers are using the 1bitLLM reproduced model then from the looks of it Microsoft is never going to release the original BitNet paper models 😞.

From my quick reading - that's more than enough information to write the kernels for inference and probably make GGUFs for the 3B/1.3B/700M models.

The issue with that is that those are toy models solely for testing BitNet and they aren't really suitable for actual chatting. What will get a dev onboard (maybe even me if I have the time!) is for some company to release a fully trained BitNet model of Llama quality for us to play with. Otherwise the implementation will remain as an experiment and it'll probably have few users and little support.

Now of course this is also a chicken and egg problem since if we can demonstrate how effective BitNet inference is with llama.cpp then some company may be incentivized to train some proper models.

@ExeVirus

ExeVirus commented Jun 10, 2024

Exactly. I'm actually having one of the researchers in the RWKV community run some tests where they replace their weights with ternary weights.

Seeing some initial success with that test, although it trains more slowly than they are used to for their architecture.

If they do start that work, I'll be sure to report.

Update on RWKV: they weren't satisfied with such a low learning rate compared to their normal training speeds (too expensive to give it a full run). If someone has experience with hyperparameter tuning and the hardware to test, reach out to me and I can get you set up with what they would need to prove to themselves that ternary is worth it.

@netrunnereve
Collaborator

Haven't looked into it in detail yet but someone just submitted a BitNet PR! The new format uses 2 bits per weight.

#7931

@mofosyne mofosyne added the Tensor Encoding Scheme label Jun 15, 2024
@netrunnereve
Collaborator

Another paper has been released that builds on BitNet with ternary weights. What's interesting here is that they made an FPGA implementation designed for ternary math.


https://arxiv.org/pdf/2406.02528

@RonanKMcGovern

RonanKMcGovern commented Jun 29, 2024 via email

@ExeVirus

ExeVirus commented Jun 29, 2024 via email

@RonanKMcGovern

RonanKMcGovern commented Jun 29, 2024 via email

@ExeVirus

ExeVirus commented Jun 29, 2024 via email

@netrunnereve
Collaborator

However, FPGAs will just have far lower operations per second versus GPUs, right? (At least for current FPGAs and GPUs. Or, said differently, the cost per FLOP of GPUs is just very low and you can't make up for that just by being more efficient on more primitive operations with FPGAs, or probably ASICs... unless mass manufactured.)

I think of the FPGA implementation as more of a way to show companies that, hey, it's possible to run BitNet very efficiently on custom hardware and if it all pans out then it might be worth having special ternary units inside future CUDA cores or maybe even have a special chip just for BitNet. I don't think we'll see people installing FPGA cards in their computers for running LLMs.

@paperdev-code

paperdev-code commented Jun 29, 2024

maybe even have a special chip just for BitNet.

That would be nice; however, BitNet is being held back by the undertrained models currently available. If the original authors were to release their 7B, 13B, and up, it would go a long way toward convincing people this model architecture really does scale as promised.

@nonetrix

I feel like the gap compared to full-fat FP16 would only get smaller as it scales, but we will have to see in practice. I wonder who will be first to train a bigger one. I think someone could do a bit better than what we have right now, but we need a company willing to take a risk on even bigger ones. Maybe the Jamba people would be interested; they took a risk trying Mamba.

@github-actions github-actions bot added the stale label Jul 30, 2024
@Green-Sky
Collaborator

The development was continued in #8151

@Green-Sky Green-Sky removed the stale label Aug 4, 2024
@github-actions github-actions bot added the stale label Sep 4, 2024
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.
