Changes for the existing quant strategies / FTYPEs and new ones #8836

Draft: wants to merge 70 commits into base: master

70 commits (changes shown from all commits):
b77cdd8
Small changes for IQ2 quant strategies (notably IQ2_S and IQ2_M)
Nexesenex Aug 2, 2024
6398663
Apply the GQA2/Expert2 conditionality to the IQ3 quants
Nexesenex Aug 2, 2024
7d337d0
Slight reorder of the attn.weight tree
Nexesenex Aug 2, 2024
d5779c2
More occurrences of n_experts == 8 changed to >= in quant strategies
Nexesenex Aug 3, 2024
93c35f8
attn.output.tensor of FTYPE IQ3_M in IQ4_XS
Nexesenex Aug 4, 2024
59c5d47
attn_qkv.weight in IQ4_XS for FTYPE IQ3_M
Nexesenex Aug 4, 2024
8006b15
Avoid shrinking attn.k.weight for IQ3_XS and XXS when GQA or MOE
Nexesenex Aug 8, 2024
1118c04
correct mistake in conditionality for attn.k
Nexesenex Aug 8, 2024
1bc4dc5
Bump IQ3_M
Nexesenex Aug 9, 2024
7212098
IQ1 and IQ2 refactor
Nexesenex Aug 10, 2024
8f1b99f
Shortening formatting
Nexesenex Aug 10, 2024
aa4eb59
Further refactor attn_k
Nexesenex Aug 10, 2024
8c8e43c
Settings for MOE >= 8 experts applied to >= 4 experts
Nexesenex Aug 10, 2024
415d5e4
Further refactor attn.v
Nexesenex Aug 10, 2024
49617b1
Advancing on several tensors
Nexesenex Aug 10, 2024
f0806ac
IQ2_XL , IQ3_XL , Q2_K_L
Nexesenex Aug 10, 2024
8bc7a98
2 forgotten files
Nexesenex Aug 10, 2024
14f4f40
Merge b3565
Nexesenex Aug 10, 2024
8ad71f4
IQ1_XS
Nexesenex Aug 10, 2024
e2e2d77
misplaced file lol
Nexesenex Aug 10, 2024
ef83a87
Revert of ffn gate and up on IQ3_M
Nexesenex Aug 10, 2024
1268d58
More adjustments
Nexesenex Aug 11, 2024
91db53b
IQ1_XL and some corrections
Nexesenex Aug 11, 2024
8c2c03f
Merge b3569
Nexesenex Aug 11, 2024
1ad18f8
Adjustments on attn_k
Nexesenex Aug 11, 2024
df9e6fd
Adjustments on output and embeddings
Nexesenex Aug 11, 2024
3e2eb6d
Merge branch 'master' into pr/8836
Nexesenex Aug 12, 2024
cd92ba6
IQ4_XSR (test FTYPE) and attention_wv logic for all attn_*.weights
Nexesenex Aug 12, 2024
8c10533
Merge branch 'master' into pr/8836
Nexesenex Aug 12, 2024
8c9017b
Simplify IQ4_XSR
Nexesenex Aug 12, 2024
eeccd31
Merge branch 'master' into pr/8836
Nexesenex Aug 15, 2024
e4c506d
Merge branch 'master' into pr/8836
Nexesenex Aug 18, 2024
17b7151
Update IQ3_M attn_k and IQ3_XL token_embd
Nexesenex Aug 16, 2024
4ba5618
Adapt token embeddings and output.weight to vocab size
Nexesenex Aug 17, 2024
b02eaf6
Mass use of the few/some/more/many bits bump logic
Nexesenex Aug 17, 2024
a79633b
Merge branch 'master' into pr/8836
Nexesenex Aug 18, 2024
ddb1373
IQ3_XXL and IQ3_XXXL
Nexesenex Aug 18, 2024
503048a
Correct IQ3_M
Nexesenex Aug 18, 2024
caeb839
Boost embeddings and output weights for MOEs.
Nexesenex Aug 18, 2024
a7f9164
Fix mistake
Nexesenex Aug 19, 2024
8c1a3c5
Merge branch 'master' into pr/8836
Nexesenex Aug 19, 2024
207ffe6
Reorder, corrections, settling lower IQ3 quants
Nexesenex Aug 18, 2024
fddff02
Rework IQ3_XXS and IQ3_XS
Nexesenex Aug 18, 2024
cfe866e
Merge branch 'master' into pr/8836
Nexesenex Aug 21, 2024
ce86019
change function use_*_bits into difquant_*_tensors
Nexesenex Aug 21, 2024
dbadcdd
harmonize formatting of tensor type conditions
Nexesenex Aug 20, 2024
d7b9d21
Shrink a bit IQ3_XXS, bump a bit IQ3_M
Nexesenex Aug 20, 2024
32f6ead
Improve IQ1 and IQ2 quants
Nexesenex Aug 19, 2024
644aa9f
Correction for too-small tensor embeddings to quantize
Nexesenex Aug 21, 2024
179ad0f
Little rework of the difquant formulas
Nexesenex Aug 21, 2024
1607a02
Further adjustments difquant formulas
Nexesenex Aug 23, 2024
e05da54
Overhaul of FFN, with and without GQA
Nexesenex Aug 22, 2024
3a027b8
Revamp IQ4_XSR, remove IQ3_XXXL
Nexesenex Aug 22, 2024
fb2b9ea
Merge branch 'master' into pr/8836
Nexesenex Aug 25, 2024
596a4ae
Readd variable attn_k, attn_q, attn_o after merge
Nexesenex Aug 22, 2024
f796954
Revamp FFN down and attn_k
Nexesenex Aug 23, 2024
6b5cebf
Revamp a bit output weight
Nexesenex Aug 23, 2024
6081085
Revamp attn_output
Nexesenex Aug 23, 2024
380b53d
Fix IQ4_XSR
Nexesenex Aug 23, 2024
16e9c37
various corrections on IQ2_S+ and IQ3 quants
Nexesenex Aug 23, 2024
1bde168
Usage of n_head to discriminate very small models
Nexesenex Aug 23, 2024
5ae5971
Revamp Q2_K and Q3_K quants
Nexesenex Aug 24, 2024
844d11b
bad indent
Nexesenex Aug 24, 2024
53b8eaa
Remove deprecated rules for token embeddings
Nexesenex Aug 24, 2024
8fc46df
Bump a bit ffn_gate and down for some GQA<2 models
Nexesenex Aug 24, 2024
f63860e
Put back ffn_down tree where it was before.
Nexesenex Aug 25, 2024
dd3df75
Bad indents and trailing whitespaces
Nexesenex Aug 25, 2024
16aee45
correction
Nexesenex Aug 25, 2024
5644d4c
Merge branch 'master' into pr/8836
Nexesenex Sep 19, 2024
26aac8e
Soften the token embeddings bump for experts >= 4
Nexesenex Aug 25, 2024
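
Several commits above refer to a "difquant" logic ("Mass use of the few/some/more/many bits bump logic", "change function use_*_bits into difquant_*_tensors", "Little rework of the difquant formulas"): only a fraction of the layers of a given tensor class gets bumped to a higher quant type. Below is a minimal sketch of that idea; every name, signature, and layer fraction here is an assumption inferred from the commit messages, not this PR's actual code.

// Hypothetical sketch of the "difquant" idea inferred from the commit
// messages above; names and layer fractions are assumptions, not PR code.
// Each predicate answers: does layer i_layer belong to the fraction of
// n_layers whose tensors get bumped to a higher-bit quant type?
#include <algorithm>

static bool difquant_few_tensors (int i_layer, int n_layers) { return i_layer < std::max(1, n_layers/8);   } // ~1/8 of layers
static bool difquant_some_tensors(int i_layer, int n_layers) { return i_layer < std::max(1, n_layers/4);   } // ~1/4 of layers
static bool difquant_more_tensors(int i_layer, int n_layers) { return i_layer < std::max(1, n_layers/2);   } // ~1/2 of layers
static bool difquant_many_tensors(int i_layer, int n_layers) { return i_layer < std::max(1, 3*n_layers/4); } // ~3/4 of layers

A quant mix would then choose, per tensor type, which predicate applies: for example, bump attn_v.weight wherever difquant_some_tensors() holds and leave the remaining layers at the base type.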
22 changes: 15 additions & 7 deletions examples/quantize/quantize.cpp
@@ -24,22 +24,30 @@ static const std::vector<struct quant_option> QUANT_OPTIONS = {
     { "IQ2_XS", LLAMA_FTYPE_MOSTLY_IQ2_XS, " 2.31 bpw quantization", },
     { "IQ2_S", LLAMA_FTYPE_MOSTLY_IQ2_S, " 2.5 bpw quantization", },
     { "IQ2_M", LLAMA_FTYPE_MOSTLY_IQ2_M, " 2.7 bpw quantization", },
+    { "IQ2_XL", LLAMA_FTYPE_MOSTLY_IQ2_XL, " 2.85 bpw quantization mix", },
+    { "IQ1_XS", LLAMA_FTYPE_MOSTLY_IQ1_XS, " 1.6-1.7 bpw quantization mix", },
     { "IQ1_S", LLAMA_FTYPE_MOSTLY_IQ1_S, " 1.56 bpw quantization", },
     { "IQ1_M", LLAMA_FTYPE_MOSTLY_IQ1_M, " 1.75 bpw quantization", },
+    { "IQ1_XL", LLAMA_FTYPE_MOSTLY_IQ1_XL, " 1.90 bpw quantization", },
     { "TQ1_0", LLAMA_FTYPE_MOSTLY_TQ1_0, " 1.69 bpw ternarization", },
     { "TQ2_0", LLAMA_FTYPE_MOSTLY_TQ2_0, " 2.06 bpw ternarization", },
     { "Q2_K", LLAMA_FTYPE_MOSTLY_Q2_K, " 2.96G, +3.5199 ppl @ Llama-3-8B", },
     { "Q2_K_S", LLAMA_FTYPE_MOSTLY_Q2_K_S, " 2.96G, +3.1836 ppl @ Llama-3-8B", },
+    { "Q2_K_L", LLAMA_FTYPE_MOSTLY_Q2_K_L, " 3.20G, +3.1836 ppl @ Llama-3-8B", },
     { "IQ3_XXS", LLAMA_FTYPE_MOSTLY_IQ3_XXS, " 3.06 bpw quantization", },
     { "IQ3_S", LLAMA_FTYPE_MOSTLY_IQ3_S, " 3.44 bpw quantization", },
-    { "IQ3_M", LLAMA_FTYPE_MOSTLY_IQ3_M, " 3.66 bpw quantization mix", },
+    { "IQ3_M", LLAMA_FTYPE_MOSTLY_IQ3_M, " 3.70 bpw quantization mix", },
+    { "IQ3_XL", LLAMA_FTYPE_MOSTLY_IQ3_XL, " 3.90 bpw quantization mix", },
+    { "IQ3_XXL", LLAMA_FTYPE_MOSTLY_IQ3_XXL, " 4.10 bpw quantization mix", },
     { "Q3_K", LLAMA_FTYPE_MOSTLY_Q3_K_M, "alias for Q3_K_M" },
     { "IQ3_XS", LLAMA_FTYPE_MOSTLY_IQ3_XS, " 3.3 bpw quantization", },
     { "Q3_K_S", LLAMA_FTYPE_MOSTLY_Q3_K_S, " 3.41G, +1.6321 ppl @ Llama-3-8B", },
     { "Q3_K_M", LLAMA_FTYPE_MOSTLY_Q3_K_M, " 3.74G, +0.6569 ppl @ Llama-3-8B", },
-    { "Q3_K_L", LLAMA_FTYPE_MOSTLY_Q3_K_L, " 4.03G, +0.5562 ppl @ Llama-3-8B", },
+    { "Q3_K_L", LLAMA_FTYPE_MOSTLY_Q3_K_L, " 4.10 bpw quantization mix", },
+    { "Q3_K_XL", LLAMA_FTYPE_MOSTLY_Q3_K_XL, " 4.03G, +0.5562 ppl @ Llama-3-8B", },
     { "IQ4_NL", LLAMA_FTYPE_MOSTLY_IQ4_NL, " 4.50 bpw non-linear quantization", },
     { "IQ4_XS", LLAMA_FTYPE_MOSTLY_IQ4_XS, " 4.25 bpw non-linear quantization", },
+    { "IQ4_XSR", LLAMA_FTYPE_MOSTLY_IQ4_XSR, " 4.xx bpw non-linear quantization", },
     { "Q4_K", LLAMA_FTYPE_MOSTLY_Q4_K_M, "alias for Q4_K_M", },
     { "Q4_K_S", LLAMA_FTYPE_MOSTLY_Q4_K_S, " 4.37G, +0.2689 ppl @ Llama-3-8B", },
     { "Q4_K_M", LLAMA_FTYPE_MOSTLY_Q4_K_M, " 4.58G, +0.1754 ppl @ Llama-3-8B", },
@@ -406,13 +414,13 @@ int main(int argc, char ** argv) {
     }

     if ((params.ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || params.ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS ||
-         params.ftype == LLAMA_FTYPE_MOSTLY_IQ2_S ||
-         params.ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S ||
+         params.ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || params.ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ||
+         params.ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S || params.ftype == LLAMA_FTYPE_MOSTLY_Q2_K ||
          params.ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
          params.ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) && imatrix_data.empty()) {
-        fprintf(stderr, "\n==========================================================================================================\n");
-        fprintf(stderr, "Please do not use IQ1_S, IQ1_M, IQ2_S, IQ2_XXS, IQ2_XS or Q2_K_S quantization without an importance matrix\n");
-        fprintf(stderr, "==========================================================================================================\n\n\n");
+        fprintf(stderr, "\n==========================================================================================\n");
+        fprintf(stderr, "Please do not use IQ1_*, IQ2_*, Q2_K_S, or Q2_K quantization without an importance matrix!\n");
+        fprintf(stderr, "==========================================================================================\n\n\n");
         return 1;
     }

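The expanded guard now also rejects IQ2_M and Q2_K when no importance matrix is supplied. For reference, an imatrix is typically generated and applied along these lines (file names here are placeholders): llama-imatrix -m model-F16.gguf -f calibration.txt -o imatrix.dat, then llama-quantize --imatrix imatrix.dat model-F16.gguf model-IQ2_M.gguf IQ2_M.
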
10 changes: 9 additions & 1 deletion gguf-py/gguf/constants.py
@@ -1370,7 +1370,7 @@ class LlamaFileType(IntEnum):
     MOSTLY_Q2_K = 10 # except 1d tensors
     MOSTLY_Q3_K_S = 11 # except 1d tensors
     MOSTLY_Q3_K_M = 12 # except 1d tensors
-    MOSTLY_Q3_K_L = 13 # except 1d tensors
+    MOSTLY_Q3_K_XL = 13 # except 1d tensors
     MOSTLY_Q4_K_S = 14 # except 1d tensors
     MOSTLY_Q4_K_M = 15 # except 1d tensors
     MOSTLY_Q5_K_S = 16 # except 1d tensors
@@ -1395,6 +1395,14 @@ class LlamaFileType(IntEnum):
     MOSTLY_Q4_0_8_8 = 35 # except 1d tensors
     MOSTLY_TQ1_0 = 36 # except 1d tensors
     MOSTLY_TQ2_0 = 37 # except 1d tensors
+    MOSTLY_IQ2_XL = 38 # except 1d tensors
+    MOSTLY_IQ3_XL = 39 # except 1d tensors
+    MOSTLY_Q2_K_L = 40 # except 1d tensors
+    MOSTLY_IQ1_XS = 41 # except 1d tensors
+    MOSTLY_IQ1_XL = 42 # except 1d tensors
+    MOSTLY_IQ4_XSR = 43 # except 1d tensors
+    MOSTLY_IQ3_XXL = 44 # except 1d tensors
+    MOSTLY_Q3_K_L = 45 # except 1d tensors

     GUESSED = 1024 # not specified in the model file

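These Python-side values mirror the C enum in include/llama.h below; gguf-py uses LlamaFileType to write the general.file_type metadata field that llama.cpp reads back, so the two lists have to stay in sync, including the relocation of Q3_K_L from value 13 to 45.
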
10 changes: 9 additions & 1 deletion include/llama.h
@@ -149,7 +149,7 @@ extern "C" {
     LLAMA_FTYPE_MOSTLY_Q2_K = 10, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_Q3_K_S = 11, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_Q3_K_M = 12, // except 1d tensors
-    LLAMA_FTYPE_MOSTLY_Q3_K_L = 13, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_Q3_K_XL = 13, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_Q4_K_S = 14, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_Q4_K_M = 15, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_Q5_K_S = 16, // except 1d tensors
@@ -174,6 +174,14 @@ extern "C" {
     LLAMA_FTYPE_MOSTLY_Q4_0_8_8 = 35, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_TQ1_0 = 36, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_TQ2_0 = 37, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_IQ2_XL = 38, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_IQ3_XL = 39, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_Q2_K_L = 40, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_IQ1_XS = 41, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_IQ1_XL = 42, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_IQ4_XSR = 43, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_IQ3_XXL = 44, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_Q3_K_L = 45, // except 1d tensors

     LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
 };
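
With the new enum values in place, the added FTYPEs are reachable through the regular C API. A minimal sketch follows, assuming this branch is built; the file paths and thread count are placeholders.

// Minimal sketch: quantize a GGUF model to one of the FTYPEs added by this
// PR via the public C API declared in llama.h. Paths are placeholders.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype   = LLAMA_FTYPE_MOSTLY_IQ2_XL; // one of the new FTYPEs (value 38)
    params.nthread = 8;

    // Returns 0 on success, non-zero on failure.
    const uint32_t rc = llama_model_quantize("model-F16.gguf", "model-IQ2_XL.gguf", &params);
    if (rc != 0) {
        fprintf(stderr, "quantization failed\n");
    }

    llama_backend_free();
    return rc == 0 ? 0 : 1;
}

Note that the imatrix guard shown earlier lives in the quantize example program, not in the library, so direct callers of llama_model_quantize must supply an importance matrix themselves where one is warranted.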