Add support for BitnetForCausalLM (new model / new datatype) #7931
Conversation
Am I understanding correctly that these new quant types (I2_S, I8_S) will ONLY work with BitNet models, and not across all models? The code itself doesn't imply that (but it also doesn't include I8_S in QUANT_OPTIONS), so I just want to clarify.
Oh god, it's finally happening. You are doing the lord's work! Super excited for this.
First, good work! I've been trying out the build with CUDA support and encountered an error. Here are the steps I followed and the results:
However, when executing the main command, the following error was produced:

llama_new_context_with_model: n_ctx = 2048
========

Interestingly, when compiling and running with just CPU support, everything works fine. It seems like there might be an issue specifically related to CUDA integration. Any insights or help would be greatly appreciated!
Thanks for the advice; I've already merged master and fixed the whitespace.
ctx_split should only be used for matrices.
After changing a little bit of the …
Hello! I don't know whether this BitNet-b1.58 is a faithful reproduction of the architecture proposed in the original research. They say on page 5, section 3, that all RMSNorm before Attention and SwiGLU (MLP?) should be removed, but it seems that both layers are still present in the decoder block: hidden_states = self.input_layernorm(hidden_states) and hidden_states = self.post_attention_layernorm(hidden_states). Furthermore, it is not entirely clear to me whether this RMSNorm should be a parameter-free layer; otherwise this would conflict with the inference proposed on page 6, section 3, since in that case there is only one identical RMSNorm for all. I don't know if this is entirely true; I left an issue in the HuggingFace thread to get the doubt resolved. Can anyone help clarify this? Thanks for the great work you are doing to integrate new technologies into useful libraries like llama.cpp 😄
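To make the "parameter-free" point above concrete, here is a small standalone sketch (not llama.cpp code; the function name is made up for illustration) of RMSNorm with and without a learned weight. The parameter-free variant only rescales by the root mean square of the input:

#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm over a single vector. Passing weight = nullptr gives the
// parameter-free variant (pure rescaling); a non-null weight applies the
// usual learned per-channel gain on top.
static void rms_norm(std::vector<float> & x, const float * weight, float eps = 1e-6f) {
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }
    const float inv_rms = 1.0f / std::sqrt(sum_sq / x.size() + eps);
    for (size_t i = 0; i < x.size(); ++i) {
        x[i] *= inv_rms;
        if (weight) {
            x[i] *= weight[i];
        }
    }
}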
Does it matter? Does implementing this flavour limit a future implementation of the originally proposed version? "if it works, it ain't stupid"
// input for next layer
inpL = cur;
}
Maybe should extend llm_build_ffn() to support _scale tensors and reuse it here.
I'm a little worried that if I change the llm_build_ffn API, for example by adding ffn_gate_scale / ffn_up_scale / ffn_down_scale to the function parameters, then I would have to change the code for all models that use llm_build_ffn, which doesn't seem like something I should do in this PR. If supporting _scale tensors is necessary, I can contribute a new PR and make this change fit all models after this PR is merged.
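For illustration, one possible shape for such a follow-up, shown here only as a hedged sketch (the helper name and parameter list are hypothetical, not the existing llm_build_ffn API), would be to let each matmul take an optional per-tensor scale:

#include "ggml.h"

// Hypothetical helper: apply an optional per-tensor scale right after a matmul,
// which is what the BitNet FFN needs for ffn_up_scale / ffn_gate_scale / ffn_down_scale.
static struct ggml_tensor * llm_build_mm_scaled(
        struct ggml_context * ctx,
        struct ggml_tensor  * w,        // weight matrix
        struct ggml_tensor  * w_scale,  // optional per-tensor scale, may be NULL
        struct ggml_tensor  * cur) {
    cur = ggml_mul_mat(ctx, w, cur);
    if (w_scale) {
        cur = ggml_mul(ctx, cur, w_scale);
    }
    return cur;
}

Non-BitNet models would simply pass NULL for the scale, so existing call sites would not need to change behaviour.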
llama.cpp (outdated)
cb(Kcur, "Kcur", il);

cur = llm_build_kv(ctx0, model, hparams, cparams, kv_self, gf,
        nullptr, model.layers[il].bo,
Passing model.layers[il].bo here seems incorrect. I think it should be added below, after the projection block:
cur = ggml_mul_mat(ctx0, model.layers[il].wo, cur);
cur = ggml_mul(ctx0, cur, model.layers[il].wo_scale);
cb(cur, "attn_o_out", il);
Indeed, the code around model.layers[il].bo may cause some misunderstanding of BitNet, even though at this time model.layers[il].bo is nullptr. I've already changed it in the latest commit, please review. @ggerganov
This is probably not relevant, but I just excitedly tried to compile llama.cpp from this PR through Wllama (Emscripten/Wasm), to see if I could run "BitNet in the browser". I tried the ggml-model-q2_k-pad.gguf BitNet model that ggerganov provided. Unfortunately I got this error:

I then tried to compile the current llama.cpp version from a few minutes ago, without the Wllama wrapper, but then I saw an error too, the same one on both models. (macOS, M1 Pro)
Truly, q2_2 looks absolutely insane. Splendid work! I do wonder whether it also makes fine-tuning models on lower-end hardware possible, as the quantized models were previously not of high enough quality to fine-tune on and fp16 had to be used instead. I think this is a great opportunity to remove the need for fp16 models when it comes to training.
* hf bitnet v1 * hf bitnet e2e v2 * finish bitnet e2e * finish f16 hf bitnet e2e * remove unsed * finish bitnet i2 e2e * move i2s to quantize v1 * move i2 to quantize * clean code * clean code 2 * fix codestyle * fix code * fix * fix code * fix merge * remove unused * change table name * fix whitespace * delete redundant * i2_s to absmax * finish i2_s/i8_s vec_dot x86 simd * i2s->q22 * fix code * remove block scale * add dequantize * fix seq * update avx2 * remove q2_2 * remove q22_grid * fix whitespace * reuse llm_build_kv * fix bo --------- Co-authored-by: root <root@wangjinheng>
PR Intro
This PR adds support for BitnetForCausalLM to llama.cpp, which includes several points:
- add new datatype I2_S / I8_S (deprecated)
- add new datatype Q2_2 (deprecated)
- add Q2_2 quantization / matmul kernel (deprecated)

Note
This PR only contains BitNet model support.
Q2_2 / I2_S and I8_S are deprecated now; you can still try them by checking out the corresponding commit.
Also, many thanks to @compilade for a new 1.625 bpw datatype Q1_3, which can be found in compilade/bitnet-ternary.
How to use Q2_2?
Q2_2 Results
(tested with llama.cpp on wikitext-2)
(tested with llama-bench on a 12th Gen Intel(R) Core(TM) i5-12500H)
Why add I2_S and I8_S?
BitNet uses per-tensor rather than per-channel quantization, both for activations (int8) and for weights (1, 0, -1). This means each activation or weight tensor involved in the relevant matmul operations (attn_q / attn_k / attn_v / attn_o / ffn_up / ffn_gate / ffn_down) has only one scale. However, the quantization types in llama.cpp all use a block as the basic unit, which suits per-channel quantization but does not work for per-tensor quantization. To solve this, I designed two new datatypes for 2-bit and 8-bit per-tensor quantization, called I2_S and I8_S.
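As a rough illustration of the difference (a standalone sketch, not the actual I2_S / I8_S kernels; the helper names are made up), per-tensor quantization stores a single scale for the whole weight or activation tensor, and a quantized dot product can then be dequantized with just the product of the two scales:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Ternary weights {-1, 0, +1} with one absmax-derived scale for the whole tensor.
static float quantize_weights_ternary(const std::vector<float> & w, std::vector<int8_t> & q) {
    float amax = 0.0f;
    for (float x : w) amax = std::max(amax, std::fabs(x));
    const float scale = amax > 0.0f ? amax : 1.0f;
    q.resize(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        q[i] = (int8_t) std::lround(w[i] / scale); // lands in {-1, 0, +1}
    }
    return scale;
}

// Int8 activations, also with a single per-tensor scale.
static float quantize_activations_int8(const std::vector<float> & a, std::vector<int8_t> & q) {
    float amax = 0.0f;
    for (float x : a) amax = std::max(amax, std::fabs(x));
    const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    q.resize(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        q[i] = (int8_t) std::lround(a[i] / scale);
    }
    return scale;
}

int main() {
    const std::vector<float> w = { 0.9f, -0.1f, -0.8f, 0.0f };
    const std::vector<float> a = { 1.5f, -2.0f, 0.5f, 3.0f };
    std::vector<int8_t> qw, qa;
    const float sw = quantize_weights_ternary(w, qw);
    const float sa = quantize_activations_int8(a, qa);
    // The whole dot product is computed in integers and rescaled once,
    // using only the two per-tensor scales (no per-block scales involved).
    int32_t acc = 0;
    for (size_t i = 0; i < qw.size(); ++i) acc += (int32_t) qw[i] * qa[i];
    std::printf("approx dot = %f\n", acc * sw * sa);
    return 0;
}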
How to use I2_S and I8_S?
I2_S / I8_S Results
(tested with llama.cpp on wikitext-2)
(tested with llama-bench on a 13th Gen Intel(R) Core(TM) i5-13400F)
I2_S achieves lower perplexity with a model size less than half that of q4_0 and iq4_nl, and it also improves inference speed over q4_0 and iq4_nl.
Questions
Will llama.cpp support a non-block quantization datatype? @ggerganov I tried my best, but the new datatype can't be merged into llama.cpp without a special case (src0->type == GGML_TYPE_I2_S). It would be great if llama.cpp could support it.
TODO