Changes for the existing quant strategies / FTYPEs and new ones #8836

Draft · wants to merge 70 commits into master
Conversation

Nexesenex
Contributor

@Nexesenex Nexesenex commented Aug 2, 2024

Here are a few edits I consider useful to improve the IQ2 model quant strategies a bit for some models (edit: this is turning into an overhaul of the quant strategies ^^):

  • The tensor attn.v.weight is bumped to Q4_K for models like Gemma v2 (GQA 2) and the various franken-MoEs with 2 experts, so as not to sabotage them with an overly small value-head quant (Q2_K is poor for such an important head), while the size of that head stays low relative to the total size of the affected models.

  • The tensor attn.k.weight is bumped to Q4_K for models with 8 or more experts, rather than exactly 8 experts.

  • The tensor attn.output.weight is passed in IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to create a progression between the IQ2_XS quant strategies (which use IQ2_XS for attn.output.weight) and the IQ3_XXS quant strategies (which use IQ3_S for attn.output.weight). The benefit of IQ3_S over IQ3_XXS for that tensor is nearly nonexistent on the IQ2_S and IQ2_M quant strategies, especially compared to the size bump it provokes.

More broadly, I think the whole IQ2 group of quant strategies should be harmonized/refactored the way the rest of the quant strategies are laid out (tensor by tensor), rather than under a separate kind of tree mixing these 5 quant strategies.

I have been using these settings (and many more edits) for a long time, with benefit, and I think they could be used as defaults.
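
To make the intent concrete, here is a minimal sketch in the style of llama.cpp's per-tensor type selection (llama_tensor_get_type); it is not the actual diff of this PR, and the standalone helper and its parameters (n_gqa, n_expert) are illustrative assumptions, though the GGML_TYPE_* / LLAMA_FTYPE_MOSTLY_* identifiers follow the library's naming:

```cpp
// Sketch only: the real logic lives inside llama_tensor_get_type() in llama.cpp and
// uses the quantize state / hparams available there; this standalone helper just
// illustrates the three overrides described above.
#include <string>
#include "ggml.h"    // ggml_type, GGML_TYPE_*
#include "llama.h"   // llama_ftype, LLAMA_FTYPE_MOSTLY_*

static ggml_type sketch_attn_override(const std::string & name, llama_ftype ftype,
                                      int n_expert, int n_gqa, ggml_type new_type) {
    const bool iq2_family =
        ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
        ftype == LLAMA_FTYPE_MOSTLY_IQ2_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M;

    if (name.find("attn_v.weight") != std::string::npos) {
        // GQA-2 models (e.g. Gemma v2) and 2-expert franken-MoEs: the value head is
        // cheap relative to the whole model, so Q2_K there hurts more than it saves.
        if (iq2_family && (n_gqa == 2 || n_expert == 2)) new_type = GGML_TYPE_Q4_K;
    } else if (name.find("attn_k.weight") != std::string::npos) {
        // 8 experts *or more*, instead of exactly 8.
        if (n_expert >= 8) new_type = GGML_TYPE_Q4_K;
    } else if (name.find("attn_output.weight") != std::string::npos) {
        // Progression: the IQ2_XS ftype uses IQ2_XS here and the IQ3_XXS ftype uses
        // IQ3_S, so IQ2_S/IQ2_M sit in between with IQ3_XXS.
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
    }
    return new_type;
}
```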

Partial "changelog" :

Edit : I applied the attn.v.weight modifications to the IQ3 quant strategies as well.
Edit 2 : I looked further at the attn.v.weight "tree" and made changes consistent with what I did for IQ2 and IQ3, but only where Ikawrakow's code was already clearly going in that direction.
Edit 3 : I harmonized all the n_expert == 8 checks into n_expert >= 8 within the quant strategies.
Edit 4 and 5 : attn_output.weight and attn_qkv.weight changed from Q4_K (4.5 bpw) to IQ4_XS (4.25 bpw) for FTYPE IQ3_M, considering that FTYPE IQ4_XS is a higher-quality quant and already has both tensors in IQ4_XS.

Edit 6/7 : attn.k.weight is relatively small on GQA & MoE models, and penalizing it in the IQ3_XXS and IQ3_XS quant strategies isn't pertinent imo for such models. I simply removed the penalty in those cases.
Edit 8 : bolster IQ3_M a bit with a bump on its attn_v and attn_k tensors.
Edit 9 : get rid of the separate IQ1/IQ2 quant strategy tree, move these quants into the usual per-tensor tree, and bump their attn_* tensors when they are used to quantize MoEs.
Edit 10 : Shorten the formatting a bit to remove a few lines.

Edit 11 : Partly refactor the attn_k tensor tree and add progressivity to the quants.
Edit 12 : Lower the threshold from 8 to 4 experts for the big-MoE-specific quant parameters.
Edit 13 : Rework the attn.v tensor tree a bit for more progressivity.
Edit 14 : Some revamp done on token embeddings, attn_qkv, and the ffns.
Edit 15 and 16 : New quants : Q2_K_L, IQ2_XL, IQ3_XL

Edit 17 : Merge master b3565
Edit 18 and 19 : New Quant : IQ1_XS
Edit 20 and 21 : Some adjustments and reversals.
Edit 21 : New IQ1_XL quant strategy, and some corrections
Edit 22 : Merge master b3569

Examples:

Current Gemma 2 9b It testing quants are here : https://huggingface.co/Nexesenex/google_gemma-2-9b-it_iMat.GGUF/tree/main

Current Llama 3.1 8b It testing quants are here : https://huggingface.co/Nexesenex/tomMeta_Llama-3.1-8b-it_iMat_Custom_Quant_Stategies-GGUF/tree/main
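
For reading the Size / BPW lines in the results below: BPW is just total bits divided by parameter count, so the reported sizes can be sanity-checked against Gemma 2 9b's roughly 9.2B weights. A small self-contained check (the GiB and BPW figures are copied from the tables below; the parameter count is derived, not measured):

```cpp
#include <cstdio>

// Rough sanity check of the Size (GiB) vs BPW figures reported below:
// params ≈ size_in_bits / bpw. Both examples land near Gemma 2 9b's ~9.2B weights.
int main() {
    const double GiB = 1024.0 * 1024.0 * 1024.0;

    const double iq1_xs_gib = 2.15,  iq1_xs_bpw = 2.00;  // from the IQ1_XS entry
    const double f16_gib    = 17.22, f16_bpw    = 16.00; // from the F16 entry

    std::printf("IQ1_XS implies %.2f B params\n", iq1_xs_gib * GiB * 8.0 / iq1_xs_bpw / 1e9);
    std::printf("F16    implies %.2f B params\n", f16_gib    * GiB * 8.0 / f16_bpw    / 1e9);
    return 0;
}
```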

Results for Gemma 2 9b It :

IQ1_XS

PR Init : Gemma 2 9b It IQ1_XS quant made from BF16
Size : 2.15 GiB (2.00 BPW)
Arc-C 299     42.80936455   
Arc-E 570     68.24561404  
PPL 512 wikitext : 15.1105 +/- 0.11363

PR current : Gemma 2 9b It IQ1_XS quant made from BF16
Size : 2.16 GiB (2.01 BPW)
PPL 512 wikitext : 14.9768 +/- 0.11234

IQ1_S

MASTER : Gemma 2 9b It IQ1_S, quant made from BF16
Size : 2.21 GiB (2.05 BPW)
Arc-C 299     42.47491639
Arc-E 570     66.84210526
PPL 512 wikitext : 15.9317 +/- 0.11979

PR Init : Gemma 2 9b It IQ1_S quant made from BF16
Size : 2.23 GiB (2.07 BPW)
Arc-C 299     43.14381271
Arc-E 570     68.42105263
PPL 512 wikitext : 14.1578 +/- 0.10530

PR current : Gemma 2 9b It IQ1_S quant made from BF16
Size : 2.24 GiB (2.08 BPW)
PPL 512 wikitext : 14.0207 +/- 0.10399

IQ1_M

MASTER : Gemma 2 9b It IQ1_M, quant made from BF16
Size : 2.37 GiB (2.20 BPW)
Arc-C 299     45.81939799  
Arc-E 570     73.85964912
PPL 512 wikitext : 13.7215 +/- 0.10231

PR Current : Gemma 2 9b It IQ1_M quant made from BF16
Size : 2.36 GiB (2.19 BPW)
Arc-C 299     45.81939799
Arc-E 570     74.56140351
PPL 512 wikitext : 12.6773 +/- 0.09336

IQ1_XL

PR Init : Gemma 2 9b It IQ1_XL quant made from BF16
Size : 2.48 GiB (2.30 BPW)
Arc-C 299     47.49163880  
Arc-E 570     73.33333333
PPL 512 wikitext : 11.5001 +/- 0.08487

PR Current : Gemma 2 9b It IQ1_XL quant made from BF16
Size : 2.47 GiB (2.29 BPW)
PPL 512 wikitext : 11.4824 +/- 0.08451

IQ2_XXS

MASTER : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.63 GiB (2.44 BPW)
Arc-C 299     48.16053512   
Arc-E 570     73.15789474   
PPL 512 wikitext : 11.2527 +/- 0.08307

PR Init : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.73 GiB (2.54 BPW)
Arc-C 299     48.82943144
Arc-E 570     74.56140351
PPL 512 wikitext : 10.8439 +/- 0.08026

PR 2 : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.72 GiB (2.53 BPW)
PPL 512 wikitext : 10.8173 +/- 0.07986

PR Current : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.62 GiB (2.43 BPW)
PPL 512 wikitext : 10.8388 +/- 0.08010

IQ2_XS

MASTER : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.85 GiB (2.65 BPW)
Arc-C 299     49.49832776
Arc-E 570     78.24561404  
PPL 512 wikitext : 10.5698 +/- 0.07803

PR Init : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.91 GiB (2.70 BPW)
Arc-C 299     49.16387960
Arc-E 570     78.59649123
PPL 512 wikitext : 10.3607 +/- 0.07660

PR Current : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.77 GiB (2.58 BPW)
PPL 512 wikitext : 10.3922 +/- 0.07672

IQ2_S

MASTER : Gemma 2 9b It IQ2_S (with iMatrix, attn_output and attn.v in IQ3_S), quant made from BF16
Size : 2.99 GiB (2.77 BPW)
Arc-C 299     52.84280936
Arc-E 570     77.54385965
PPL 512 wikitext : 10.3868 +/- 0.07787

PR Init : Gemma 2 9b It IQ2_S (with Imatrix, attn_output in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.00 GiB (2.79 BPW)
Arc-C 299     49.83277592
Arc-E 570     77.71929825
PPL 512 wikitext : 10.1303 +/- 0.07486

PR Current : Gemma 2 9b It IQ2_S, quant made from BF16
Size : 3.00 GiB (2.79 BPW)
Arc-C 299     52.17391304
Arc-E 570     77.89473684
PPL 512 wikitext : 10.1071 +/- 0.07450 

IQ2_M

MASTER : Gemma 2 9b It IQ2_M (with iMatrix, attn_output and attn.v in IQ3_S), quant made from BF16
Size : 3.19 GiB (2.97 BPW)
Arc-C 299     56.52173913
Arc-E 570     77.01754386
PPL 512 wikitext : 9.8154 +/- 0.07324

PR init : Gemma 2 9b It IQ2_M (with Imatrix, attn_output in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.20 GiB (2.98 BPW)
Arc-C 299     54.18060201
Arc-E 570     78.07017544
PPL 512 wikitext :  9.5734 +/- 0.07040

PR CURRENT : Gemma 2 9b It IQ2_M, quant made from BF16
Size : 3.29 GiB (3.06 BPW)
Arc-C 299     55.85284281
Arc-E 570     78.07017544
PPL 512 wikitext : 9.4128 +/- 0.06881

IQ2_XL

PR CURRENT : Gemma 2 9b It IQ2_XL, quant made from BF16
Size : 3.41 GiB (3.17 BPW)
Arc-C 299     56.18729097
Arc-E 570     78.07017544
PPL 512 wikitext : 9.3283 +/- 0.06820

Q2_K_L

PR CURRENT : Gemma 2 9b It Q2_K_L, quant made from BF16
Size : 3.70 GiB (3.44 BPW)
Arc-C 299     58.19397993
Arc-E 570     79.29824561
PPL 512 wikitext : around 9.25

IQ3_XXS

MASTER : Gemma 2 9b It IQ3_XXS (with iMatrix, attn_k in IQ2_S, and attn_v in IQ3_XXS), quant made from BF16
Size : 3.53 GiB (3.28 BPW)
Arc-C 299 56.52173913
Arc-E 570 79.12280702
PPL 512 wikitext : 9.4116 +/- 0.06982

PR CURRENT : Gemma 2 9b It IQ3_XXS (with Imatrix, attn_k in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.60 GiB (3.35 BPW)
Arc-C 299 56.18729097
Arc-E 570 78.77192982
PPL 512 wikitext : 9.2026 +/- 0.06781

IQ3_XS

MASTER : Gemma 2 9b It IQ3_XS (with iMatrix), quant made from BF16
Size : 3.85 GiB (3.58 BPW)
Arc-C 299     58.86287625
Arc-E 570     78.94736842
PPL 512 wikitext : 9.2584 +/- 0.06866

PR CURRENT : Gemma 2 9b It IQ3_XS (with Imatrix), quant made from BF16
Size : 3.82 GiB (3.55 BPW)
Arc-C 299     57.19063545
Arc-E 570     78.07017544
PPL 512 wikitext :  9.0658 +/- 0.06633

IQ3_S

MASTER : Gemma 2 9b It IQ3_S (with iMatrix, attn_v in IQ3_S), quant made from BF16
Size : 4.03 GiB (3.75 BPW)
Arc-C 299     57.52508361
Arc-E 570     77.71929825
PPL 512 wikitext : 9.2100 +/- 0.06859

PR : Gemma 2 9b It IQ3_S (with Imatrix, attn_v in Q4_K), quant made from BF16
Size : 4.07 GiB (3.79 BPW)
Arc-C 299     57.19063545
Arc-E 570     78.07017544
PPL 512 wikitext : 9.0082 +/- 0.06633

PR rev 2: Gemma 2 9b It IQ3_S (with Imatrix), quant made from BF16
Size : 4.07 GiB (3.79 BPW)
Arc-C 299     56.85618729
Arc-E 570     78.42105263
PPL 512 wikitext : 9.0082 +/- 0.06633
(I think ARC differences are due to the b3565 merge)

PR rev3 - CURRENT: Gemma 2 9b It IQ3_S (with Imatrix), quant made from BF16
Size : 4.05 GiB (3.76 BPW)
Arc-C 299     57.52508361
Arc-E 570     78.42105263
PPL 512 wikitext : 8.9969 +/- 0.06610

IQ3_M

MASTER : Gemma 2 9b It IQ3_M (with iMatrix, attn_output in Q4_K), quant made from BF16
Size : 4.18 GiB (3.89 BPW)
Arc-C 299     56.85618729
Arc-E 570     77.71929825
PPL 512 wikitext : 8.9697 +/- 0.06598

PR : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS), quant made from BF16
Size : 4.16 GiB (3.87 BPW)
Arc-C 299     57.19063545
Arc-E 570     77.71929825
PPL 512 wikitext : 8.9556 +/- 0.06586

PR rev2 : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS, attn.v Q5_K), quant made from BF16
Size : 4.20 GiB (3.90 BPW)
Arc-C 299     58.52842809
Arc-E 570     77.54385965
PPL 512 wikitext : 8.9445 +/- 0.06576

PR rev3 - CURRENT : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS, attn.v Q5_K, attn.k IQ4_XS), quant made from BF16
Size : 4.23 GiB (3.93 BPW)
Arc-C 299     58.19397993
Arc-E 570     77.19298246
PPL 512 wikitext : 8.9082 +/- 0.06536

IQ3_XL

PR CURRENT : Gemma 2 9b It IQ3_XL (with Imatrix), quant made from BF16
Size : 4.50 GiB (4.18 BPW)
Arc-C 299     56.85618729 
Arc-E 570     78.42105263
PPL 512 wikitext : 8.8843 +/- 0.06558

IQ4_XS

MASTER : Gemma 2 9b It IQ4_XS (with iMatrix), quant made from BF16
Size : 4.87 GiB (4.52 BPW)
Arc-C 299     57.52508361
Arc-E 570     78.24561404
PPL 512 wikitext : 8.8456 +/- 0.06533

PR CURRENT : Gemma 2 9b It IQ4_XS (with iMatrix), quant made from BF16
Size : 4.91 GiB (4.56 BPW)
PPL 512 wikitext : 8.8370 +/- 0.06525

FP16

MASTER : Gemma 2 9b It F16.
Size : 17.22 GiB (16.00 BPW)
Arc-C 299     59.53177258
Arc-E 570     78.77192982
PPL 512 wikitext : 8.7881 +/- 0.06533

This is consistent with the proposed modifications to the IQ2 quant strategies, which make even more sense for the IQ3 quant strategies.
@Nexesenex Nexesenex changed the title Small changes for IQ2 quant strategies (notably IQ2_S and IQ2_M) Small changes for IQ2/IQ3 quant strategies Aug 2, 2024
And application of the attn.v.weight logic I used for IQ2 and IQ3, but only where such logic is already implied by the existing quant strategies, as a compromise not to disturb Ikawrakow's quant strategies too much.
@Nexesenex Nexesenex changed the title Small changes for IQ2/IQ3 quant strategies Small changes for IQ2/IQ3 and attn.v.weight related quant strategies Aug 2, 2024
If FTYPE IQ4_XS has attn_output.weight in IQ4_XS (4.25 bpw), there's no reason for FTYPE IQ3_M to have attn_output.weight in Q4_K (4.5 bpw).
In terms of perplexity, on a Llama 3.1 70b model, the proposed change reduces the size by 1% and increases the perplexity by 0.25%.
If FTYPE IQ4_XS has attn_qkv.weight in IQ4_XS, then FTYPE IQ3_M should not have it in Q4_K (4.5 bpw), but in IQ4_XS (4.25 bpw) as well.
@Nexesenex Nexesenex changed the title Small changes for IQ2/IQ3 and attn.v.weight related quant strategies Small changes for the quant strategies (mostly on the attn_*.weight tensors) Aug 4, 2024
@mofosyne mofosyne added the "Review Complexity : Low" label (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) Aug 5, 2024
@slaren
Collaborator

slaren commented Aug 8, 2024

It would be useful to have some objective data (such as perplexity tests) to evaluate the effect of these changes.

@Nexesenex
Contributor Author

Nexesenex commented Aug 9, 2024

@slaren:

Relevant examples in head post.

Considering the results obtained, I think it's worth it; the size remains around 260 MB below IQ3_XS. The overall high BPW of Gemma is justified mostly by its monolithic embd/output tensor in Q5_K, as the output.weight usually is on FTYPE IQ3_XXS.

The non-GQA / non-MoE models are not affected.

For FTYPE IQ3_M and its attn_output.weight and attn_qkv.weight in Q4_K, Ikawrakow's setting was made prior to the IQ4_XS quants and never edited since, while FTYPE IQ4_XS has its attn_output.weight in.. IQ4_XS without any problem reported about it. A PPL test is imo not even necessary there.

For FTYPE IQ2_M, jumping from attn_output in IQ2_XS (FTYPE IQ2_XS) to attn_output in IQ3_S (FTYPE IQ2_S, which is mainly made of IQ2_XS tensors as well) is literally overkill, and Ikawrakow likely didn't pay attention to it and just threw in the value: IQ2_S was simply skipped over, as was IQ3_XXS, for that tensor.
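
Concretely, for the IQ3_M / IQ4_XS alignment mentioned above, the change boils down to picking a different target type in the attn_output branch. A tiny sketch under the same caveats as before (illustrative helper, not the literal PR diff; only the GGML_TYPE_* / LLAMA_FTYPE_MOSTLY_* names are taken from the library):

```cpp
#include <string>
#include "ggml.h"
#include "llama.h"

// Sketch of the FTYPE IQ3_M attn_output choice discussed above: align it with
// FTYPE IQ4_XS (4.25 bpw) instead of Q4_K (4.5 bpw). Not the literal PR diff.
static ggml_type sketch_iq3_m_attn_output(const std::string & name, llama_ftype ftype,
                                          ggml_type new_type) {
    if (name.find("attn_output.weight") != std::string::npos &&
        ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
        new_type = GGML_TYPE_IQ4_XS; // master picks Q4_K here
    }
    return new_type;
}
```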

Edit : I will actually do the tests, and share the quants on Huggingface.

Edit 2 : tests made on Gemma v2 9b It. I think it's conclusive.

attn.v in Q5_K
attn.k in IQ4_XS
@Nexesenex Nexesenex marked this pull request as draft August 9, 2024 23:16
Attn_q in Q3_K for experts >= 8
Attn_k in Q5_K for experts >= 8
Attn_v in Q6_K for experts >= 8, in IQ3_XXS for IQ2_XXS and IQ2_XS
Attn_output in Q4_K for experts >= 8
@Nexesenex Nexesenex changed the title Small changes for the quant strategies (mostly on the attn_*.weight tensors) Changes for the quant strategies (mostly on the attn_*.weight tensors) Aug 10, 2024
With attn_k set for all quants below 3 bpw except Q2_K_S.
And also lower attn_q for IQ2_XS, in order to separate it more from the quite misnamed IQ2_S
- Progressivity for token embeddings and attn_qkv
- FFN down for IQ1 and IQ2 quants
- FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.
Plus some adjustments on the FFNs
Merge b3565
@Nexesenex Nexesenex closed this Aug 10, 2024
@Nexesenex Nexesenex deleted the patch-1 branch August 10, 2024 18:46
@Nexesenex Nexesenex restored the patch-1 branch August 10, 2024 18:47
and fix a parenthesis mistake on IQ3_S
this to clarify what it does, especially with the 5 additional levels of difquant
And fix mistakes for the attn.output of IQ2_XL and the ffn gate and up of IQ2_XS

Reformat the attn_output mess and split GQA4/GQA2
IQ2_XS doesn't seem to work as such, back to IQ2_S
And complete FFN up
Shrink a bit more non GQA models
for more granularity in low quants.
Of which the size is more sensitive to the non repeating tensors
Q3_K_XL takes the place of Q3_K_L.
Q3_K_L becomes intermediary between Q3_K_M and XL.