Feature Request: support for nvidia/Llama-3.1-Minitron-4B-Width-Base #9060
Comments
Here is another point that could fit under the "Motivation" category for this feature request: the community seems to be quite interested in this new model 🤗
Also experiencing this. Will post here if I get a working solution.
This seems to be an issue with the Llama 3.1 rope scaling and the custom head_dim being specified together; you can make a working quant by removing the
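For anyone wanting to try that workaround locally, here is a minimal sketch. The comment above is cut off, so it assumes the field being removed is the rope_scaling entry in the model's config.json (an assumption, though it would explain the context-length question further down); the local path is just an example.

```python
# Hedged sketch: strip the (assumed) rope_scaling entry from config.json
# before GGUF conversion. The exact field to remove is an assumption, since
# the original comment is truncated. Dropping rope_scaling also drops the
# extended-context behaviour (see the follow-up question below).
import json
from pathlib import Path

# Example path to a local Hugging Face snapshot of the model.
config_path = Path("Llama-3.1-Minitron-4B-Width-Base/config.json")

config = json.loads(config_path.read_text())
removed = config.pop("rope_scaling", None)  # assumed field; None if absent

config_path.write_text(json.dumps(config, indent=2))
print("rope_scaling removed:", removed is not None)
```

After the edit, the usual convert_hf_to_gguf.py / llama-quantize flow should produce a loadable GGUF, presumably at the cost of whatever extended context the rope scaling provided.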
Sucks… 4B looks like a good candidate to FFT on.
Llama-3.1-Minitron-4B-Width-Base looks amazing. It's really a pity that llama.cpp does not support it.
This is the fix, it works.
Ok but doesn't this fix limit the context from 32k to 8k? |
Prerequisites
Feature Description
Please support https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base
When I try to run the F16 model with llama-cli or produce an imatrix using llama-imatrix, I get the following crash:
Motivation
This 4B model is pruned and distilled from Llama 3.1 8B. It would be a great alternative to Gemma 2B.
https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
Possible Implementation
No response