feat(convert_hf_to_gguf): support q4_0 and q4_1 quantizations #10008

Open · wants to merge 1 commit into base: master
Conversation

@trufae commented Oct 22, 2024

@github-actions bot added the python label (python script changes) on Oct 22, 2024
@trufae (Author) commented Oct 31, 2024

ping

@wooooyeahhhh commented
Wouldn't this have a negative effect on output quality compared to converting to f16 and then using the quantize program? Because the output and embedding tensors would be converted to q4_0/q4_1, and I don't think the quantize program produces a pure quant.

@compilade (Collaborator) commented
Related to #9022.

Basically, the q4_0 and q4_1 options of llama-quantize also use q4_k and q6_k for the token embeddings and output tensors, and those types are not yet supported by the Python re-implementation in gguf-py/gguf/quants.py, partly because it would be slow, but mostly because the k-quants rounding is not platform-independent (the result differs depending on whether or not FMA was used).
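For a rough idea of why FMA matters here (an illustrative example with made-up values, not taken from the k-quants code): with FMA, `a*b + c` is rounded once, while without it the product is rounded to float32 first and the sum is rounded again, and the two results can disagree in the last bit, which is enough to flip a nearest-value decision during quantization.

```python
import numpy as np

# Illustrative only (made-up values): one rounding (FMA-like, emulated here
# with float64) versus two roundings (round the product to float32, then
# round the sum).
a = np.float32(1.0000001)
b = np.float32(3.1415927)
c = np.float32(-3.1415932)

two_roundings = np.float32(a * b) + c                                     # no FMA
one_rounding  = np.float32(np.float64(a) * np.float64(b) + np.float64(c))  # FMA-like

print(two_roundings, one_rounding)  # the two can differ in the last bits
```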

But for quantization types smaller than Q8_0, there are also a lot of heuristics in llama_tensor_get_type to "choose" the type of each tensor, which is more complicated than the current type selection logic of convert_hf_to_gguf.py (which fortunately gives exactly the same selections for {F32, F16, BF16, Q8_0}, but not for other types).
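For illustration, the kind of per-tensor mixing this implies could be sketched roughly like this (a hypothetical simplification, not the actual llama_tensor_get_type logic, which has many more rules):

```python
# Hypothetical sketch of the kind of mixing llama-quantize does when the
# requested type is Q4_0; the real C++ heuristics consider many more things
# (layer index, expert counts, row-size fallbacks, ...).
def pick_tensor_type(name: str, n_dims: int, requested: str = "Q4_0") -> str:
    if n_dims == 1:
        return "F32"      # norms and biases are typically left unquantized
    if name == "output.weight":
        return "Q6_K"     # output tensor gets a higher-precision k-quant
    if name == "token_embd.weight":
        return "Q4_K"     # token embeddings also use a k-quant
    return requested      # everything else uses the requested type
```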

Ideally, convert_hf_to_gguf.py should produce exactly the same model files as llama-quantize (which it does for F32, F16, BF16, and Q8_0) to reduce confusion, but as explained above, doing that for smaller types is more complicated unless the existing mixtures produced by llama-quantize are changed.

Eventually, the k-quants rounding will be platform-independent and k-quantization will be implemented in gguf-py/gguf/quants.py, and then direct conversion to Q4_0, Q4_1, Q5_0, and Q5_1 could be added to convert_hf_to_gguf.py, but the type selection heuristics for smaller quants would need to be ported to the convert scripts too.
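For context, the per-block math for Q4_0 itself is simple; here is a rough numpy sketch (not the actual gguf-py/gguf/quants.py code; the real implementation is vectorized over all blocks and packs the 4-bit values into nibbles):

```python
import numpy as np

def quantize_q4_0_block(x: np.ndarray) -> tuple[np.float16, np.ndarray]:
    """Rough sketch of Q4_0 quantization for one block of 32 floats."""
    assert x.size == 32
    # The scale comes from the element with the largest magnitude, sign kept,
    # so that element maps to exactly -8.
    d = x[np.abs(x).argmax()] / -8.0
    inv_d = 1.0 / d if d != 0.0 else 0.0
    # Offset by 8 so the 4-bit values land in 0..15.
    q = np.clip(np.trunc(x * inv_d + 8.5), 0, 15).astype(np.uint8)
    return np.float16(d), q

# Dequantization is just (q - 8) * d:
block = np.random.randn(32).astype(np.float32)
d, q = quantize_q4_0_block(block)
restored = (q.astype(np.float32) - 8.0) * np.float32(d)
```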
