Changes for the existing quant strategies / FTYPEs and new ones #8836
base: master
Conversation
Here are a few edits I consider useful to improve the IQ2 model quant strategies a bit for some models:
- The tensor attn.v.weight passed in Q4_K for models like Gemma (GQA 2) and the various franken-MoEs having 2 experts, so as not to sabotage them with a too-small value-head quant (Q2_K is meh for such an important head), while the size of that head is low relative to the total size of the affected models (a minimal sketch of this rule follows just below).
- The tensor attn.k.weight passed in Q4_K for models with 8 experts or more, rather than simply 8 experts.
- The tensor attn.output.weight passed in IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to have some progressiveness between the IQ2_XS quant strategies (which use IQ2_XS for attn.output.weight) and the IQ3_XXS quant strategies (which use IQ3_S for attn.output.weight). The benefit of an IQ3_S quant instead of an IQ3_XXS for that tensor is quasi-nonexistent for the IQ2_S and IQ2_M quant strategies, especially compared to the size bump it provokes.

More broadly, I think the whole IQ2 bunch of quant strategies should be harmonized/refactored the way the rest of the quant strategies are established (tensor by tensor), rather than under a different kind of tree mixing these 5 quant strategies.
I've been using these settings (and many more edits) for a long time, with benefit, and I think they could be standard.
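A minimal sketch of what the attn.v.weight rule above could look like in a llama.cpp-style per-tensor type selector. The helper name and the n_gqa / n_expert parameters are assumptions for illustration; only the llama_ftype / ggml_type enum values follow the actual llama.h / ggml.h naming.

```cpp
#include "llama.h"   // llama_ftype, LLAMA_FTYPE_MOSTLY_IQ2_*
#include "ggml.h"    // ggml_type, GGML_TYPE_Q4_K

// Hypothetical helper: on the low-bit IQ2 ftypes, bump attn_v.weight to Q4_K
// for GQA-2 models (e.g. Gemma 2) and 2-expert franken-MoEs, where the value
// head is small relative to the whole model and Q2_K hurts it disproportionately.
static ggml_type adjust_attn_v_type(llama_ftype ftype, int n_gqa, int n_expert, ggml_type new_type) {
    const bool iq2_ftype =
        ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
        ftype == LLAMA_FTYPE_MOSTLY_IQ2_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M;
    if (iq2_ftype && (n_gqa == 2 || n_expert == 2)) {
        return GGML_TYPE_Q4_K;   // instead of the default Q2_K-class choice
    }
    return new_type;             // keep whatever was already selected
}
```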
Consistent with the proposed modifications to the IQ2 quant strategies, which make even more sense for the IQ3 quant strategies.
And application of the attn.v.weight logic I used for IQ2 and IQ3, but only where such logic is already implied by the existing quant strategies, as a compromise to avoid disturbing Ikawrakow's quant strategies too much.
If FTYPE IQ4_XS has attn_output.weight in IQ4_XS (4.25 bpw), there's no reason for FTYPE IQ3_M to have attn_output.weight in Q4_K (4.5 bpw). In terms of perplexity, on a Llama 3.1 70b model, the proposed change reduces the size by 1% and increases the perplexity by 0.25%.
If FTYPE IQ4_XS has attn_qkv.weight in IQ4_XS, then FTYPE IQ3_M should not have it in Q4_K (4.5 bpw), but in IQ4_XS (4.25 bpw) as well.
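A hedged sketch of that IQ3_M change in the same llama.cpp style; the helper and its parameters are illustrative assumptions, only the enum values and the GGUF tensor-name suffixes follow upstream conventions.

```cpp
#include <string>
#include "llama.h"   // llama_ftype, LLAMA_FTYPE_MOSTLY_IQ3_M
#include "ggml.h"    // ggml_type, GGML_TYPE_IQ4_XS

// Hypothetical helper: for FTYPE IQ3_M, pick IQ4_XS (4.25 bpw) instead of
// Q4_K (4.5 bpw) for attn_output.weight and attn_qkv.weight, matching what
// FTYPE IQ4_XS already uses for both tensors.
static ggml_type adjust_iq3_m_attn_type(llama_ftype ftype, const std::string & name, ggml_type new_type) {
    if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M &&
        (name.find("attn_output.weight") != std::string::npos ||
         name.find("attn_qkv.weight")    != std::string::npos)) {
        return GGML_TYPE_IQ4_XS;   // was GGML_TYPE_Q4_K
    }
    return new_type;
}
```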
It would be useful to have some objective data (such as perplexity tests) to evaluate the effect of these changes.
Relevant examples are in the head post. Considering the results obtained, I think it's worth it: the size remains around 260 MB below IQ3_XS. The overall high BPW of Gemma is justified mostly by the monolithic embd/output tensor in Q5_K, as the output.weight is usually on FTYPE IQ3_XXS. The non-GQA / non-MoE models are not affected.
For FTYPE IQ3_M and its attn_output.weight and attn_qkv.weight in Q4_K, Ikawrakow's setting was made prior to the IQ4_XS quants and never edited since, even though FTYPE IQ4_XS has attn_output.weight in IQ4_XS without any problem signaled about it. A PPL test is imo not even necessary there.
For FTYPE IQ2_M, jumping from attn_output in IQ2_XS (FTYPE IQ2_XS) straight to attn_output in IQ3_S (FTYPE IQ2_S, which is mainly made of IQ2_XS tensors as well) is literally overkill; Ikawrakow likely didn't pay attention to it and just threw in the value: IQ2_S was simply skipped over for that tensor, as well as IQ3_XXS.
Edit : I will actually do the tests, and share the quants on Huggingface.
Edit 2 : tests made on Gemma 2 9b it. I think they're conclusive.
attn.v in Q5_K, attn.k in IQ4_XS
Attn_q in Q3_K for experts >= 8; Attn_k in Q5_K for experts >= 8; Attn_v in Q6_K for experts >= 8, and in IQ3_XXS for IQ2_XXS and IQ2_XS; Attn_output in Q4_K for experts >= 8
With attn_k set for all quants below 3 bpw except Q2_K_S.
And also lower attn_q for IQ2_XS, in order to separate it more from the quite misnamed IQ2_S.
- Progressivity for token embeddings and attn_qkv
- FFN down for IQ1 and IQ2 quants
- FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.
Plus some adjustments on the FFNs
Merge b3565
and fix parenthesis mistake on IQ3_S
this to clarify what it does, especially with the 5 additional levels of difquant
And fix mistakes for the attn.output of IQ2_XL and the ffn gate and up of IQ2_XS. Reformat the attn_output mess and split GQA4/GQA2.
IQ2_XS doesn't seem to work as such, back to IQ2_S
And complete FFN up. Shrink a bit more the non-GQA models
for more granularity in low quants.
whose size is more sensitive to the non-repeating tensors.
Q3_K_XL takes the place of Q3_K_L. Q3_K_L becomes an intermediary between Q3_K_M and XL.
Here are a few edits I consider useful to improve the IQ2 model quant strategies a bit for some models (edit: it's turning into an overhaul of the quant strategies ^^):
- The tensor attn.v.weight passed in Q4_K for models like Gemma 2 (GQA 2) and the various franken-MoEs having 2 experts, so as not to sabotage them with a too-small value-head quant (Q2_K is meh for such an important head), while the size of that head is low relative to the total size of the affected models.
- The tensor attn.k.weight passed in Q4_K for models with 8 experts or more, rather than simply 8 experts.
- The tensor attn.output.weight passed in IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to have some progressiveness between the IQ2_XS quant strategies (which use IQ2_XS for attn.output.weight) and the IQ3_XXS quant strategies (which use IQ3_S for attn.output.weight). The benefit of an IQ3_S quant instead of an IQ3_XXS for that tensor is quasi-nonexistent for the IQ2_S and IQ2_M quant strategies, especially compared to the size bump it provokes (a sketch of this rule and the attn.k one follows after this list).
More broadly, I think that the whole IQ2 quant strategies bunch should be harmonized/refactored like the rest of the quant strategies are established (tensor by tensor), rather than under a different kind of tree mixing these 5 quant strategies.
I've been using these settings (and many more edits) for a long time, with benefit, and I think they could be used as defaults.
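For the attn.k.weight and attn.output.weight points above, a minimal sketch in the same spirit; the helper and its parameters are assumptions for illustration, not the upstream selection function.

```cpp
#include <string>
#include "llama.h"   // llama_ftype enums
#include "ggml.h"    // ggml_type enums

// Hypothetical helper for the two remaining points:
//  - attn_k.weight in Q4_K for models with 8 experts or more;
//  - attn_output.weight in IQ3_XXS (instead of IQ3_S) for the IQ2_S / IQ2_M
//    ftypes, so the progression IQ2_XS -> IQ3_XXS -> IQ3_S stays gradual.
static ggml_type adjust_low_bit_attn_type(llama_ftype ftype, const std::string & name,
                                          int n_expert, ggml_type new_type) {
    if (n_expert >= 8 && name.find("attn_k.weight") != std::string::npos) {
        return GGML_TYPE_Q4_K;
    }
    if ((ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) &&
        name.find("attn_output.weight") != std::string::npos) {
        return GGML_TYPE_IQ3_XXS;
    }
    return new_type;
}
```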
Partial "changelog" :
Edit : I applied the attn.v.weight modifications to the IQ3 quant strategies as well.
Edit 2 : I looked further at the attn.v.weight "tree" and made changes consistent with what I did for IQ2 and IQ3, but only where Ikawrakow's code was clearly going in that direction already.
Edit 3 : I harmonized all the n_expert == 8 into n_expert >= 8 within the quant strategies.
Edit 4 and 5 : attn_output.weight and attn_qkv.weight changed from Q4_K (4.5bpw) to IQ4_XS (4.25bpw) for FTYPE IQ3_M, considering that FTYPE IQ4_XS is a more qualitative quant and has both tensors in IQ4_XS.
Edit 6/7 : attn.k.weight is relatively small on GQA & MoE models, and penalizing it on the IQ3_XXS and IQ3_XS quant strategies isn't pertinent imo for such models. I simply removed the penalty in those cases.
Edit 8 : bolster a bit IQ3_M with a bump on its attn_v and attn_k tensors.
Edit 9 : get rid of the IQ1/IQ2 quant strategy tree, move these quants into the usual per-tensor tree, and increase their attn_ tensors when they are used to quantize MoEs.
Edit 10 : Shorten a bit the formatting to remove a few lines.
Edit 11 : Refactor partly the attn_k tensors tree and add progressivity in the quants.
Edit 12 : Lower the threshold of 8 to 4 for the big MOE-specific quant parameters.
Edit 13 : Rework a bit the attn.v tensors tree for more progressivity.
Edit 14 : Some revamp done on token embeddings, attn_qkv, and the ffns.
Edit 15 and 16 : New quants : Q2_K_L, IQ2_XL, IQ3_XL
Edit 17 : Merge master b3565
Edit 18 and 19 : New Quant : IQ1_XS
Edit 20 and 21 : Some adjustments and reversals.
Edit 21 : New IQ1_XL quant strategy, and some corrections
Edit 22 : Merge master b3569
Examples:
Current Gemma 2 9b It testing quants are here : https://huggingface.co/Nexesenex/google_gemma-2-9b-it_iMat.GGUF/tree/main
Current Llama 3.1 8b It testing quants are here : https://huggingface.co/Nexesenex/tomMeta_Llama-3.1-8b-it_iMat_Custom_Quant_Stategies-GGUF/tree/main
Results for Gemma 2 9b It :