Bug: Coredump when quanting to Q4_0_*_* with imatrix #8767

Closed
ThomasBaruzier opened this issue Jul 30, 2024 · 2 comments
Labels
bug-unconfirmed medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still usable)

Comments

ThomasBaruzier commented Jul 30, 2024

What happened?

Hello,

I don't know if I am supposed to use imatrix with the new ARM-dedicated quants (Q4_0_*_*). However, when I try to, I get Aborted (core dumped).

Is imatrix intentionally unsupported for these quants? If so, why does quantizing to q4_* and q5_* with imatrix work?
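
For context, the imatrix.dat used below was generated with llama-imatrix on the calibration file shown in the log. Roughly like this; the exact flags are from memory and may differ slightly between builds:

~/files/ai/llama.cpp/git/llama-imatrix \
    -m Meta-Llama-3.1-8B-Instruct-F16.gguf \
    -f misc/calibration_datav3.txt \
    -o imatrix.dat

The same imatrix.dat quantizes this model to q4_* and q5_* types without issue, which is why the Q4_0_4_4 failure surprised me.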

Name and Version

The rope scaling fix for Llama 3.1 commit (build 83, b5e9546, per the log below). I can't test the latest build due to a new build error I am investigating.

What operating system are you seeing the problem on?

Linux

Relevant log output

~/files/ai/llama.cpp/git/llama-quantize --imatrix imatrix.dat Meta-Llama-3.1-8B-Instruct-F16.gguf Q4_0_4_4
load_imatrix: imatrix dataset='misc/calibration_datav3.txt'
load_imatrix: loaded 224 importance matrix entries from imatrix.dat computed on 125 chunks
prepare_imatrix: have 224 importance matrix entries
main: build = 83 (b5e9546)
main: built with cc (GCC) 14.1.1 20240720 for x86_64-pc-linux-gnu
main: quantizing 'Meta-Llama-3.1-8B-Instruct-F16.gguf' to 'ggml-model-Q4_0_4_4.gguf' as Q4_0_4_4
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from Meta-Llama-3.1-8B-Instruct-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta llama_Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = meta-llama_Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 1
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:  226 tensors
================================ Have weights data with 224 entries
[   1/ 292]                    rope_freqs.weight - [   64,     1,     1,     1], type =    f32, size =    0.000 MB
[   2/ 292]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f16, 
====== llama_model_quantize_internal: did not find weights for token_embd.weight
converting to q4_0 .. size =  1002.00 MiB ->   281.81 MiB
[   3/ 292]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   4/ 292]                blk.0.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0_4x4 .. ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ptrace: Operation not permitted.
No stack.
The program is not being run.
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ptrace: Operation not permitted.
No stack.
The program is not being run.
ptrace: Operation not permitted.
No stack.
The program is not being run.
ptrace: Operation not permitted.
No stack.
The program is not being run.
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
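
For anyone triaging: the assertion lives in ggml_quantize_chunk, which dispatches to a per-type quantize function and then checks that it wrote exactly the expected number of bytes. A rough paraphrase of that code path, reconstructed from the assertion text and ggml's public headers rather than copied from this exact commit:

// Paraphrase of the failing check in ggml.c (a sketch, not the exact source).
// Every quantize_<type> function returns the number of bytes it wrote to dst;
// the chunk-level wrapper asserts that this matches the expected size.
size_t ggml_quantize_chunk(enum ggml_type type, const float * src, void * dst,
                           int64_t start, int64_t nrows, int64_t n_per_row,
                           const float * imatrix) {
    const size_t start_row = start / n_per_row;
    const size_t row_size  = ggml_row_size(type, n_per_row);

    size_t result = 0;
    switch (type) {
        case GGML_TYPE_Q4_0_4_4:
            // returns the number of bytes written into dst
            result = quantize_q4_0_4x4(src + start, (char *) dst + start_row * row_size,
                                       nrows, n_per_row, imatrix);
            break;
        // ... every other quant type dispatches the same way ...
        default: break;
    }
    // This is the check that fires above (ggml/src/ggml.c:20623): once an
    // imatrix is passed, the Q4_0_4x4 path apparently writes a different
    // amount of data than nrows * row_size.
    GGML_ASSERT(result == nrows * row_size);
    return result;
}

The repeated "ptrace: Operation not permitted" lines are likely just ggml's assert handler failing to attach gdb for a backtrace (kernel.yama.ptrace_scope blocks that on many distros); they are noise, not part of the bug.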
ThomasBaruzier added the bug-unconfirmed and medium severity labels Jul 30, 2024
ThomasBaruzier (Author) commented

Never mind: the build error is because llama.cpp doesn't support CUDA 12.5. It builds fine with 12.4, though.
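
In case anyone hits the same thing, one way to pin the build to the 12.4 toolkit when several CUDA versions are installed is to point CMake at its nvcc explicitly. The install path below is an example; adjust it for your system:

cmake -B build -DGGML_CUDA=ON \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc
cmake --build build --config Release -j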

I'm now running the latest commit, and I see the same issue as above:

version: 1 (c887d8b)
built with cc (GCC) 14.1.1 20240720 for x86_64-pc-linux-gnu

ThomasBaruzier (Author) commented

Fixed by #9192
