Changes for the existing quant strategies / FTYPEs and new ones #8836
base: master
Conversation
Here are a few edits I consider useful to improve the IQ2 model quant strategies a bit for some models:
- The tensor attn.v.weight passed in Q4_K for models like Gemma (GQA 2) and the various franken-MoEs having 2 experts, so as not to sabotage them with a too-small value-head quant (Q2_K is meh for such an important head), while the size of that head is low relative to the total size of the affected models (a minimal sketch of this rule follows just below).
- The tensor attn.k.weight passed in Q4_K for models with 8 experts or more, rather than simply 8 experts.
- The tensor attn.output.weight passed in IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to have some progressiveness between the IQ2_XS quant strategies (which use IQ2_XS for attn.output.weight) and the IQ3_XXS quant strategies (which use IQ3_S for attn.output.weight). The benefit of an IQ3_S quant instead of an IQ3_XXS for that tensor is quasi-nonexistent for the IQ2_S and IQ2_M quant strategies, especially compared to the size bump it provokes.

More broadly, I think the whole IQ2 bunch of quant strategies should be harmonized/refactored the way the rest of the quant strategies are established (tensor by tensor), rather than under a different kind of tree mixing these 5 quant strategies.
I've been using these settings (and many more edits) for a long time, with benefit, and I think they could be standard.
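A minimal sketch of what the attn.v.weight rule above could look like in a llama.cpp-style per-tensor type selector. The helper name and the n_gqa / n_expert parameters are assumptions for illustration; only the llama_ftype / ggml_type enum values follow the actual llama.h / ggml.h naming.

```cpp
#include "llama.h"   // llama_ftype, LLAMA_FTYPE_MOSTLY_IQ2_*
#include "ggml.h"    // ggml_type, GGML_TYPE_Q4_K

// Hypothetical helper: on the low-bit IQ2 ftypes, bump attn_v.weight to Q4_K
// for GQA-2 models (e.g. Gemma 2) and 2-expert franken-MoEs, where the value
// head is small relative to the whole model and Q2_K hurts it disproportionately.
static ggml_type adjust_attn_v_type(llama_ftype ftype, int n_gqa, int n_expert, ggml_type new_type) {
    const bool iq2_ftype =
        ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
        ftype == LLAMA_FTYPE_MOSTLY_IQ2_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M;
    if (iq2_ftype && (n_gqa == 2 || n_expert == 2)) {
        return GGML_TYPE_Q4_K;   // instead of the default Q2_K-class choice
    }
    return new_type;             // keep whatever was already selected
}
```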
Consistent with the proposed modifications to the IQ2 quant strategies, which make even more sense for the IQ3 quant strategies.
And application of the attn.v.weight logic I used for IQ2 and IQ3, but only where such logic is already implied by the existing quant strategies, as a compromise to avoid disturbing Ikawrakow's quant strategies too much.
If FTYPE IQ4_XS has attn_output.weight in IQ4_XS (4.25 bpw), there's no reason for FTYPE IQ3_M to have attn_output.weight in Q4_K (4.5 bpw). In terms of perplexity, on a Llama 3.1 70b model, the proposed change reduces the size by 1% and increases the perplexity by 0.25%.
If FTYPE IQ4_XS has attn_qkv.weight in IQ4_XS, then FTYPE IQ3_M should not have it in Q4_K (4.5 bpw), but in IQ4_XS (4.25 bpw) as well.
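A hedged sketch of that IQ3_M change in the same llama.cpp style; the helper and its parameters are illustrative assumptions, only the enum values and the GGUF tensor-name suffixes follow upstream conventions.

```cpp
#include <string>
#include "llama.h"   // llama_ftype, LLAMA_FTYPE_MOSTLY_IQ3_M
#include "ggml.h"    // ggml_type, GGML_TYPE_IQ4_XS

// Hypothetical helper: for FTYPE IQ3_M, pick IQ4_XS (4.25 bpw) instead of
// Q4_K (4.5 bpw) for attn_output.weight and attn_qkv.weight, matching what
// FTYPE IQ4_XS already uses for both tensors.
static ggml_type adjust_iq3_m_attn_type(llama_ftype ftype, const std::string & name, ggml_type new_type) {
    if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M &&
        (name.find("attn_output.weight") != std::string::npos ||
         name.find("attn_qkv.weight")    != std::string::npos)) {
        return GGML_TYPE_IQ4_XS;   // was GGML_TYPE_Q4_K
    }
    return new_type;
}
```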
It would be useful to have some objective data (such as perplexity tests) to evaluate the effect of these changes.
Relevant examples are in the head post. Considering the results obtained, I think it's worth it: the size remains around 260 MB below IQ3_XS. The overall high BPW of Gemma is justified mostly by the monolithic embd/output tensor in Q5_K, as the output.weight is usually on FTYPE IQ3_XXS. The non-GQA / non-MoE models are not affected.
For FTYPE IQ3_M and its attn_output.weight and attn_qkv.weight in Q4_K, Ikawrakow's setting was made prior to the IQ4_XS quants and never edited since, even though FTYPE IQ4_XS has attn_output.weight in IQ4_XS without any problem signaled about it. A PPL test is imo not even necessary there.
For FTYPE IQ2_M, jumping from attn_output in IQ2_XS (FTYPE IQ2_XS) straight to attn_output in IQ3_S (FTYPE IQ2_S, which is mainly made of IQ2_XS tensors as well) is literally overkill; Ikawrakow likely didn't pay attention to it and just threw in the value: IQ2_S was simply skipped over for that tensor, as well as IQ3_XXS.
Edit : I will actually do the tests, and share the quants on Huggingface.
Edit 2 : tests made on Gemma 2 9b it. I think they're conclusive.
attn.v in Q5_K, attn.k in IQ4_XS
Attn_q in Q3_K for experts >= 8; Attn_k in Q5_K for experts >= 8; Attn_v in Q6_K for experts >= 8, and in IQ3_XXS for IQ2_XXS and IQ2_XS; Attn_output in Q4_K for experts >= 8
With attn_k set for all quants below 3 bpw except Q2_K_S.
And also lower attn_q for IQ2_XS, in order to separate it more from the quite misnamed IQ2_S.
- Progressivity for token embeddings and attn_qkv
- FFN down for IQ1 and IQ2 quants
- FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.
Plus some adjustments on the FFNs
Merge b3565
and fix parenthesis mistake on IQ3_S
this to clarify what it does, especially with the 5 additional levels of difquant
And fix mistakes for the attn.output of IQ2_XL and the ffn gate and up of IQ2_XS. Reformat the attn_output mess and split GQA4/GQA2.
IQ2_XS doesn't seem to work as such, back to IQ2_S
And complete FFN up. Shrink a bit more the non-GQA models
for more granularity in low quants.
whose size is more sensitive to the non-repeating tensors.
Q3_K_XL takes the place of Q3_K_L. Q3_K_L becomes an intermediary between Q3_K_M and XL.
Here are a few edits I consider useful to improve the IQ2 model quant strategies a bit for some models (edit: it's turning into an overhaul of the quant strategies ^^):
- The tensor attn.v.weight passed in Q4_K for models like Gemma 2 (GQA 2) and the various franken-MoEs having 2 experts, so as not to sabotage them with a too-small value-head quant (Q2_K is meh for such an important head), while the size of that head is low relative to the total size of the affected models.
- The tensor attn.k.weight passed in Q4_K for models with 8 experts or more, rather than simply 8 experts.
- The tensor attn.output.weight passed in IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to have some progressiveness between the IQ2_XS quant strategies (which use IQ2_XS for attn.output.weight) and the IQ3_XXS quant strategies (which use IQ3_S for attn.output.weight). The benefit of an IQ3_S quant instead of an IQ3_XXS for that tensor is quasi-nonexistent for the IQ2_S and IQ2_M quant strategies, especially compared to the size bump it provokes (a sketch of this rule and the attn.k one follows after this list).
More broadly, I think that the whole IQ2 quant strategies bunch should be harmonized/refactored like the rest of the quant strategies are established (tensor by tensor), rather than under a different kind of tree mixing these 5 quant strategies.
I've been using these settings (and many more edits) for a long time, with benefit, and I think they could be used as defaults.
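For the attn.k.weight and attn.output.weight points above, a minimal sketch in the same spirit; the helper and its parameters are assumptions for illustration, not the upstream selection function.

```cpp
#include <string>
#include "llama.h"   // llama_ftype enums
#include "ggml.h"    // ggml_type enums

// Hypothetical helper for the two remaining points:
//  - attn_k.weight in Q4_K for models with 8 experts or more;
//  - attn_output.weight in IQ3_XXS (instead of IQ3_S) for the IQ2_S / IQ2_M
//    ftypes, so the progression IQ2_XS -> IQ3_XXS -> IQ3_S stays gradual.
static ggml_type adjust_low_bit_attn_type(llama_ftype ftype, const std::string & name,
                                          int n_expert, ggml_type new_type) {
    if (n_expert >= 8 && name.find("attn_k.weight") != std::string::npos) {
        return GGML_TYPE_Q4_K;
    }
    if ((ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) &&
        name.find("attn_output.weight") != std::string::npos) {
        return GGML_TYPE_IQ3_XXS;
    }
    return new_type;
}
```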
Partial "changelog" :
Edit : I applied the attn.v.weight modifications to the IQ3 quant strategies as well.
Edit 2 : I looked further at the attn.v.weight "tree" and made changes consistent with what I did for IQ2 and IQ3, but only where Ikawrakow's code was clearly going in that direction already.
Edit 3 : I harmonized all the n_expert == 8 into n_expert >= 8 within the quant strategies.
Edit 4 and 5 : attn_output.weight and attn_qkv.weight changed from Q4_K (4.5bpw) to IQ4_XS (4.25bpw) for FTYPE IQ3_M, considering that FTYPE IQ4_XS is a more qualitative quant and has both tensors in IQ4_XS.
Edit 6/7 : attn.k.weight is relatively small on GQA & MoE models, and penalizing it on the IQ3_XXS and IQ3_XS quant strategies isn't pertinent imo for such models. I simply removed the penalty in those cases.
Edit 8 : bolster a bit IQ3_M with a bump on its attn_v and attn_k tensors.
Edit 9 : get rid of the IQ1/IQ2 quant strategy tree, move these quants into the usual per-tensor tree, and increase their attn_ tensors when they are used to quantize MoEs.
Edit 10 : Shorten a bit the formatting to remove a few lines.
Edit 11 : Refactor partly the attn_k tensors tree and add progressivity in the quants.
Edit 12 : Lower the threshold of 8 to 4 for the big MOE-specific quant parameters.
Edit 13 : Rework a bit the attn.v tensors tree for more progressivity.
Edit 14 : Some revamp done on token embeddings, attn_qkv, and the ffns.
Edit 15 and 16 : New quants : Q2_K_L, IQ2_XL, IQ3_XL
Edit 17 : Merge master b3565
Edit 18 and 19 : New Quant : IQ1_XS
Edit 20 and 21 : Some adjustments and reversals.
Edit 21 : New IQ1_XL quant strategy, and some corrections
Edit 22 : Merge master b3569
Examples:
Current Gemma 2 9b It testing quants are here : https://huggingface.co/Nexesenex/google_gemma-2-9b-it_iMat.GGUF/tree/main
Current Llama 3.1 8b It testing quants are here : https://huggingface.co/Nexesenex/tomMeta_Llama-3.1-8b-it_iMat_Custom_Quant_Stategies-GGUF/tree/main
Results for Gemma 2 9b It :