Add Nemotron/Minitron GGUF Conversion & Inference Support #8922
Conversation
Note: As of Transformers 4.44.0 Nemotron is supported, so no need to install transformers from source.
Awesome! Thank you for sharing @Vaibhavs10! I've updated the original PR description.
Thank you @compilade for the comments and suggestions! Committed changes accordingly.
Hi @suhara - can you rebase onto main? Specifically, make sure this commit is in - this should fix the failing requirements.txt test.
Force-pushed from 4bb8d50 to bd76198.
Hi @Vaibhavs10, thanks for reviewing! I rebased it onto the latest main branch.
Not sure what I am missing here, but I wasn't able to make the GGUF run. I tried to test the PR via this:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && gh pr checkout 8922
huggingface-cli download nvidia/Minitron-4B-Base --local-dir minitron --local-dir-use-symlinks False
python convert_hf_to_gguf.py minitron --outtype f16 --outfile model.gguf
llama-cli -m model.gguf -p "Meaning to life is"
I get: error loading model architecture: unknown model architecture: 'nemotron'
EDIT: I'm stupid, I was using an older binary!
Tested it (using the steps mentioned above), it works quite well!
Let's wait for @compilade to review + approve then we can merge! 🤗
src/llama.cpp (outdated)
// optional MLP bias
layer.ffn_down_b = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "bias", i), {n_embd}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.ffn_up_b = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "bias", i), {n_ff}, llama_model_loader::TENSOR_NOT_REQUIRED);
It's not correct to use ctx_split for bias tensors; it should use ctx_layer instead.
Thank you for your comment @slaren!
Sorry for the naive question. What's the difference between ctx_split and ctx_layer?
Something that's not clear to me is that some parts of llama.cpp use ctx_split for bias tensors as well. For example:
- https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6108-L6110
- https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6343
Should they be corrected? (That's out of the scope of this PR, but I wanted to ask to better understand them.)
ctx_split only makes a difference when using tensor parallelism with -sm row, which is only supported on the CUDA backend when using multiple GPUs. When using -sm row, ctx_split splits the rows of the matrix between the available GPUs. This is only supported for matrix multiplication, so it should only be used with the matrix portion of linear/dense layers. The other cases are also wrong and should be corrected as well, but it doesn't need to be done here.
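To make the distinction concrete, here is a minimal sketch modeled on the tensor-creation pattern used elsewhere in llama.cpp (an illustration of the rule above, not a verbatim patch):

// Weight matrix of a dense layer: it participates in matrix multiplication, so
// with -sm row its rows can be split across GPUs -> create it on ctx_split.
layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {n_ff, n_embd});
// Bias vectors (and other 1D tensors) are only added element-wise and are never
// row-split, so they should be created on ctx_layer, as in the corrected lines below.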
Thanks for the explanation! Updated the two lines accordingly. Agree with you that the other parts should be fixed outside this PR.
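With that change, the two flagged bias lines should read roughly as follows (the snippet quoted above, with ctx_layer substituted for ctx_split):

layer.ffn_down_b = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_DOWN, "bias", i), {n_embd}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.ffn_up_b = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_UP, "bias", i), {n_ff}, llama_model_loader::TENSOR_NOT_REQUIRED);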
Hi @compilade
Looks good to me.
Thank you all for your reviews and support @compilade @Vaibhavs10 @ggerganov @slaren! Could anybody help merge this PR? Thank you!
Sorry for disturbing, but when I try to convert the linked Minitron-4B model with transformers 4.44.0 and current llama.cpp, it simply complains about a missing tokenizer.model. Any idea why that could be?
Hi @schmorp, I think the repo has been updated and you can actually extract the tokenizer from it. There are two tokenizer files, but they are the same and either can be renamed as tokenizer.model.
@suhara thanks a lot!
Minitron-8B converts, but then can't be used: llm_load_tensors: ggml ctx size = 0.15 MiB
Minitron-4B seems to work. So it seems Minitron-8B is not quite supported yet.
I'll look into this but I think I know the root cause. 8B uses … Many HF models including Llama assert …
FYI, for 4B, …
That's good news, thanks for looking into this. I'll have a try at the 340B.
For the 340B, conversion instantly fails because there isn't a config.json file.
I tried … The only option seems to be using the SafeTensor conversion provided by @mgoin under https://huggingface.co/collections/mgoin/nemotron-in-vllm-66a151b4240bcd9c28735ec5. He unfortunately never shared how he converted NeMo into safetensors.
@nicoboss if the conversion steps and script would be useful, I can document this tomorrow!
This would be absolutely awesome. Thanks a lot! I'm very interested in how the conversion works. Maybe it would even be possible to implement it inside convert_hf_to_gguf.py. I'm currently working together with @schmorp to GGUF-quantize all Nemotron-3, Nemotron-4 and "Minitron" models. While your collection is great, it unfortunately misses many Nemotron-3 models, which we could convert on our own if you share your tools and knowledge. Nemotron-4-340B-Instruct is one of my favorite models and I can't thank you enough for converting it into a usable format.
And just to document this here, Llama-3.1-Minitron-4B-Width-Base fails with: cvs/llama.cpp/ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed
Add Nemotron/Minitron GGUF Conversion & Inference Support (#8922)

* Add nemotron GGUF conversion & inference support
* Fix formatting issues
* Remove unnecessary write_tensors()
* Update convert_hf_to_gguf.py (Co-authored-by: compilade <[email protected]>)
* Update src/llama.cpp (Co-authored-by: compilade <[email protected]>)
* Address comments by @compilade
* Replace ggml_mul_mat() -> llm_build_lora_mm()
* Remove mutable variable
* Use ctx_layer for bias tensors
* Cover corner case for role_scaling not in config.json

Co-authored-by: compilade <[email protected]>
This PR adds HF->GGUF conversion & inference support for Nemotron models, including Nemotron-3, Nemotron-4 and "Minitron" models.
The PR should support any Nemotron/Minitron model but has been primarily tested with the following Minitron model: nvidia/Minitron-4B-Base.
HF support for Nemotron was recently added; as of Transformers 4.44.0, Nemotron is supported (thank you @Vaibhavs10 for the information!). You may need to install a newer version of the transformers library by running
pip install transformers>=4.44.0
Please see this PR for details.
The Nemotron architecture is similar to the Llama-2 architecture with a few key differences:
You can find details about the model architecture in the following papers:
This PR was created in collaboration with @SpaceCowboy850, another contributor to this PR.