Add Support for IBM Granite #7116
Comments
The PR to add granite support for transformers (add MLP bias - gate, up, down) can be found here: https://github.com/huggingface/transformers/pull/30031/files |
Based on the discussion in the transformers mlp_bias PR, it's similar to Llama with just the MLP bias added. |
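For reference, here is a minimal sketch (not the actual transformers code; the class name and toy sizes are illustrative) of what a Llama-style gated MLP looks like once the bias terms from that PR are switched on:

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Llama-style SwiGLU MLP; `mlp_bias` mirrors the flag added in the transformers PR."""
    def __init__(self, hidden_size: int, intermediate_size: int, mlp_bias: bool = True):
        super().__init__()
        # Granite sets bias=True on gate/up/down projections; plain Llama uses bias=False.
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=mlp_bias)
        self.up_proj   = nn.Linear(hidden_size, intermediate_size, bias=mlp_bias)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=mlp_bias)
        self.act_fn = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

mlp = GatedMLP(hidden_size=512, intermediate_size=1376, mlp_bias=True)  # toy sizes
print(mlp(torch.randn(1, 4, 512)).shape)  # torch.Size([1, 4, 512])
```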
I tried to do this here: https://github.com/sroecker/llama.cpp/tree/add_mlp_bias
|
@sroecker are you tying the word embeddings? Unlike Llama, the input word embeddings and output projection matrix are tied for Granite models. |
Ah, not yet. Thanks! I guess then we need to define an additional ARCH (or
save the mlp_bias boolean in the GGUF) and implement it like with MPT
https://github.com/ggerganov/llama.cpp/blob/7e0b6a7b3ba94ff624dc27c1e0e735fded8819b8/llama.cpp#L5287
|
Is there a way to add mlp_bias to an already-made GGUF? I ask because you mentioned my q8 GGUF in one of your previous messages. |
You could hack something together with the gguf writer: https://pypi.org/project/gguf/ |
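A minimal sketch of the sort of hack meant here, using the gguf package's writer (the key name and file name are made-up examples; patching a real model file would also require copying over all of its existing metadata and tensors, which is omitted):

```python
from gguf import GGUFWriter  # pip install gguf

# Illustrative only: write a tiny GGUF whose metadata carries a custom boolean.
# To patch an existing model you would additionally copy its metadata and
# tensors over (e.g. via gguf.GGUFReader), which is not shown here.
writer = GGUFWriter("mlp-bias-example.gguf", arch="llama")
writer.add_bool("llama.mlp_bias", True)  # made-up key name, not an agreed-upon one
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```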
So I've adapted build_llama to include the MLP biases as well. I've added a few FIXMEs to my branch to indicate places that might need to be adapted for the different Granite models. |
I'm here to write words of support; I am interested in exploring what IBM + Ollama can do. |
+1 to this |
Add optional MLP bias for ARCH_LLAMA to support Granite models. Partially addresses ggerganov/issues/7116 Still needs some more changes to properly support Granite.
Is there any progress on support for Granite models? |
AFAIK, we have been stuck on the issue of repeating text output. The tokenizer appears to be the culprit, but it does seem to be in order (correct token IDs, etc.). I don't know if @sroecker has made any strides since. |
Yes, unfortunately. The lab version of granite works well with llama.cpp: https://huggingface.co/instructlab/granite-7b-lab-GGUF It doesn't have the MLP bias nodes and uses a different tokenizer though. |
the lab version is a different model |
I'm aware of that; it did work out of the box with LLM_ARCH_LLAMA settings though, so I'm trying to find out why exactly. But you're right to point this out, as a few people have mixed these up. I will check the |
Hmm, a quick question: are we tying the word embeddings and output logits matrix? |
If no output layer is found, the word embeddings are used instead: Lines 4926 to 4932 in 5416002
|
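For anyone following along, tying here just means the output projection reuses the input embedding matrix. A minimal PyTorch illustration (generic names and example sizes, not llama.cpp or transformers internals):

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 49152, 4096  # example sizes only

embed_tokens = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Tie the weights: the output logits matrix is literally the same tensor as the
# input embedding table, which is why a GGUF without a separate output tensor
# can fall back to the token-embedding tensor, as described above.
lm_head.weight = embed_tokens.weight
assert lm_head.weight.data_ptr() == embed_tokens.weight.data_ptr()
```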
Hmm, OK, so there are these differences between Llama and Granite:
|
Tie the weights for ARCH_STARCODER to support the larger Granite code models. Partially addresses ggerganov/issues/7116. There are still a few things to fix. Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
Do all Granite code models use the StarCoder tokenizer? Based on your HF repo comment I tried to get the 20b and 34b to run. They are recognized as the StarCoder arch by the convert-hf-to-gguf script, and all I had to modify was tying the embedding weights. 20b instruct works quite well, even with the BOS token. The Q3_K_L quant comes down to 11 GB. For the 3b and 8b models, 1) and 4) remain. We have to check if the attention bias is set up correctly in |
yeah all are using starcoder tokenizer. |
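To see which of these switches apply to a given checkpoint before converting, a quick stdlib-only look at the HF config.json can help. The key names come from the upstream configs; the script itself is just a convenience sketch, not part of any tooling in this thread:

```python
import json
import sys
from pathlib import Path

# Usage (hypothetical script name): python check_granite_config.py /path/to/hf-model-dir
cfg = json.loads((Path(sys.argv[1]) / "config.json").read_text())

# Flags discussed in this thread: MLP bias, attention bias, tied embeddings,
# and the tokenizer/BOS handling that differs between the Granite variants.
for key in ("architectures", "mlp_bias", "attention_bias",
            "tie_word_embeddings", "vocab_size", "bos_token_id"):
    print(f"{key}: {cfg.get(key)}")
```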
If it helps, for the 8b-instruct model:
I got this error when starting ./llama.cpp/main -m ./granite-8b-ins/granite-8b-instruct.bin
The same numbers show when using convert-hf-to-gguf.py; during the conversion it shows 578. Tried with q8_0, f16 and f32, same error. Thank you for this great work! |
I don't think the 3b and 8b are working yet, @DigitLib. The 20b-base GGUF is available now: https://huggingface.co/ibm-granite/granite-20b-code-base-GGUF |
@mayank31398 I know, I just wanted to help with the 8b-instruct. Thank you! |
@DigitLib you need sroecker@36dc5bb to load the smaller models |
If that commit is working, can we open a PR, @sroecker? |
That doesn't seem to be enough. The model is loaded but it doesn't produce any good results: #7116 (comment) |
I've uploaded the q8_0, q6_K, and q4_0 GGUF files for the 20B Instruct model here: https://huggingface.co/coder543/granite-20b-code-instruct-GGUF/tree/main I've only lightly tested them, and this is my first time quantizing any LLMs, but it seemed like they were working okay? If anyone wants to test them, I'm curious whether they work for you. The chat template seems to be something like this:
|
I've managed to get some output that makes some sense with the 3b model, and I've opened a PR. IMHO it makes sense to define a new architecture for Granite, as there are substantial differences from the base Llama model. To convert the HF model using the code in my PR, I modified the config.json file in the Granite model and used:

"architectures": [
    "GraniteForCausalLM"
],

@mayank31398 what do you think? |
To reproduce locally you can run the following:
Inference output should be something like the following (ignoring logging output for brevity): print("Hello World") |
@giuseppe I think the problem is that it won't work out of the box when converting the model. |
I've tried to extend the new arch to the bigger models, but it doesn't make sense as To avoid confusion, I've renamed the new architecture to |
Could we fix the name in the model files, or would it cause other issues? |
Is there no other solution? Maybe by checking whether mlp_bias is true or not? |
That looks like a generic setting that other models could use in the future, but I have no weight in this decision; I will implement whatever fits better for the Granite model and the llama.cpp maintainers. Could a flag to the conversion script that forces the arch be good enough? @ggerganov, do you have any suggestions? |
I've added the following patch to the PR:

```diff
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 2d05de42..eb7d061a 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -2571,6 +2571,10 @@ def parse_args() -> argparse.Namespace:
         "--no-lazy", action="store_true",
         help="use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)",
     )
+    parser.add_argument(
+        "--architecture", type=str, default=None,
+        help="force the architecture to use",
+    )
     parser.add_argument(
         "--model-name", type=str, default=None,
         help="name of the model",
@@ -2626,7 +2630,7 @@ def main() -> None:
     hparams = Model.load_hparams(dir_model)
 
     with torch.inference_mode():
-        model_class = Model.from_model_architecture(hparams["architectures"][0])
+        model_class = Model.from_model_architecture(args.architecture if args.architecture is not None else hparams["architectures"][0])
         model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy)
 
         logger.info("Set model parameters")
```

so we can just add the new `--architecture` flag when converting. Is this an acceptable solution? |
I think it's better if @ggerganov gives this a review. |
* Add optional MLP bias for Granite models

  Add optional MLP bias for ARCH_LLAMA to support Granite models. Partially addresses /issues/7116. Still needs some more changes to properly support Granite.

* llama: honor add_space_prefix from the model configuration

  propagate the add_space_prefix configuration from the HF model configuration to the gguf file and honor it with the gpt2 tokenizer.

  Signed-off-by: Giuseppe Scrivano <[email protected]>

* llama: add support for small granite models

  it works only for the small models 3b and 8b. The convert-hf-to-gguf.py script uses the vocabulary size of the granite models to detect granite and set the correct configuration.

  Signed-off-by: Giuseppe Scrivano <[email protected]>

---------

Signed-off-by: Giuseppe Scrivano <[email protected]>
Co-authored-by: Steffen Roecker <[email protected]>
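A rough sketch of the detection idea described in that commit message (the constant, key names, and function name are assumptions for illustration, not the actual convert-hf-to-gguf.py code):

```python
# Assumed vocabulary size for the small Granite code models; verify against the real checkpoints.
GRANITE_SMALL_VOCAB_SIZES = {49152}

def looks_like_small_granite(hparams: dict) -> bool:
    """Heuristic in the spirit of the merged change: a Llama-shaped checkpoint whose
    vocabulary size matches the Granite code models gets the Granite-specific settings
    (MLP bias, tied embeddings, no leading-space prefix for the tokenizer)."""
    return hparams.get("vocab_size") in GRANITE_SMALL_VOCAB_SIZES
```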
== Relevant log messages from source repo:

commit b864b50ce5e2beefc8c2fd31733e4e1a978b7754
Author: Meng, Hengyu <[email protected]>
Date: Wed May 29 07:00:24 2024 +0800
    [SYCL] Align GEMM dispatch (#7566)
    * align GEMM dispatch

commit 02c1ecad07f0e2d2febe8196271bcc64bdc9c006
Author: jaime-m-p <[email protected]>
Date: Tue May 28 21:46:34 2024 +0200
    Tokenizer WPM fixes (#7500)
    * Update random test: add_bos_token.
    * Update random test: add WPM models for testing.
    * Build vocab.special_tokens_cache using vocab token types.
    * Fix and improve WPM preprocessing.
      - Fix unicode edge case combinations.
      - Split by whitspace in the same pass.
    * Discard all tokens when no matching found.

commit 6bd12ce409f949012935b7d1b15a21ffa473a565
Author: Georgi Gerganov <[email protected]>
Date: Tue May 28 22:22:50 2024 +0300
    sycl : fix assert (#7563)

commit 5442939fcc5e6ae41abf40612a95fd71377e487e
Author: Giuseppe Scrivano <[email protected]>
Date: Tue May 28 20:49:49 2024 +0200
    llama : support small Granite models (#7481)
    * Add optional MLP bias for Granite models
      Add optional MLP bias for ARCH_LLAMA to support Granite models. Partially addresses ggerganov/llama.cpp/issues/7116. Still needs some more changes to properly support Granite.
    * llama: honor add_space_prefix from the model configuration
      propagate the add_space_prefix configuration from the HF model configuration to the gguf file and honor it with the gpt2 tokenizer.
      Signed-off-by: Giuseppe Scrivano <[email protected]>
    * llama: add support for small granite models
      it works only for the small models 3b and 8b. The convert-hf-to-gguf.py script uses the vocabulary size of the granite models to detect granite and set the correct configuration.
      Signed-off-by: Giuseppe Scrivano <[email protected]>
    ---------
    Signed-off-by: Giuseppe Scrivano <[email protected]>
    Co-authored-by: Steffen Roecker <[email protected]>

commit 56411a950f255b523a9edd684fd1632752474399
Author: k.h.lai <[email protected]>
Date: Wed May 29 01:25:08 2024 +0800
    vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (#7552)

commit 2b737caae100cf0ac963206984332e422058f2b9
Author: Radoslav Gerganov <[email protected]>
Date: Tue May 28 18:13:36 2024 +0300
    rpc : resource management rework (#7562)
    * rpc : resource management rework
    * address review comments
So, how do we convert https://huggingface.co/ibm-granite/granite-3.0-8b-instruct to GGUF now? |
@0wwafa - it simply converted with the tool for me.
Unfortunately I couldn't actually use the container for inference because it was seg-faulting on this platform (AmpereOne), but the resulting gguf file worked fine with a llama.cpp built from source.
|
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
IBM recently released their Granite models: a series of 3b to 34b coding models with base and instruct finetunes.
https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330
https://github.com/ibm-granite
Many thanks to the llama.cpp community for their awesome work! It would be awesome to see this feature added. GGUFs can be made already, but when you try to load them you get a tokenizer error.