Bug: Coredump when quanting to Q4_0_*_* with imatrix #8767

Closed
ThomasBaruzier opened this issue Jul 30, 2024 · 2 comments
Labels
bug-unconfirmed medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still usable)

Comments

ThomasBaruzier commented Jul 30, 2024

What happened?

Hello,

I don't know if I am supposed to use imatrix with the new ARM-dedicated quants (Q4_0_*_*). However, when I try to, I get Aborted (core dumped).

Is imatrix intentionally unsupported for these quants? If so, why does quantizing to q4_* and q5_* with imatrix work?
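
For context, the imatrix.dat used below was generated with llama-imatrix on the calibration file shown in the log. Roughly like this; the exact flags are from memory and may differ slightly between builds:

~/files/ai/llama.cpp/git/llama-imatrix \
    -m Meta-Llama-3.1-8B-Instruct-F16.gguf \
    -f misc/calibration_datav3.txt \
    -o imatrix.dat

The same imatrix.dat quantizes this model to q4_* and q5_* types without issue, which is why the Q4_0_4_4 failure surprised me.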

Name and Version

The rope scaling fix for Llama 3.1 commit (build 83, b5e9546, per the log below). I can't test the latest build due to a new build error I am investigating.

What operating system are you seeing the problem on?

Linux

Relevant log output

~/files/ai/llama.cpp/git/llama-quantize --imatrix imatrix.dat Meta-Llama-3.1-8B-Instruct-F16.gguf Q4_0_4_4
load_imatrix: imatrix dataset='misc/calibration_datav3.txt'
load_imatrix: loaded 224 importance matrix entries from imatrix.dat computed on 125 chunks
prepare_imatrix: have 224 importance matrix entries
main: build = 83 (b5e9546)
main: built with cc (GCC) 14.1.1 20240720 for x86_64-pc-linux-gnu
main: quantizing 'Meta-Llama-3.1-8B-Instruct-F16.gguf' to 'ggml-model-Q4_0_4_4.gguf' as Q4_0_4_4
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from Meta-Llama-3.1-8B-Instruct-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta llama_Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = meta-llama_Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 1
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:  226 tensors
================================ Have weights data with 224 entries
[   1/ 292]                    rope_freqs.weight - [   64,     1,     1,     1], type =    f32, size =    0.000 MB
[   2/ 292]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f16, 
====== llama_model_quantize_internal: did not find weights for token_embd.weight
converting to q4_0 .. size =  1002.00 MiB ->   281.81 MiB
[   3/ 292]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   4/ 292]                blk.0.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0_4x4 .. ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ptrace: Operation not permitted.
No stack.
The program is not being run.
ggml/src/ggml.c:20623: GGML_ASSERT(result == nrows * row_size) failed
ptrace: Operation not permitted.
No stack.
The program is not being run.
ptrace: Operation not permitted.
No stack.
The program is not being run.
ptrace: Operation not permitted.
No stack.
The program is not being run.
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
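
For anyone triaging: the assertion lives in ggml_quantize_chunk, which dispatches to a per-type quantize function and then checks that it wrote exactly the expected number of bytes. A rough paraphrase of that code path, reconstructed from the assertion text and ggml's public headers rather than copied from this exact commit:

// Paraphrase of the failing check in ggml.c (a sketch, not the exact source).
// Every quantize_<type> function returns the number of bytes it wrote to dst;
// the chunk-level wrapper asserts that this matches the expected size.
size_t ggml_quantize_chunk(enum ggml_type type, const float * src, void * dst,
                           int64_t start, int64_t nrows, int64_t n_per_row,
                           const float * imatrix) {
    const size_t start_row = start / n_per_row;
    const size_t row_size  = ggml_row_size(type, n_per_row);

    size_t result = 0;
    switch (type) {
        case GGML_TYPE_Q4_0_4_4:
            // returns the number of bytes written into dst
            result = quantize_q4_0_4x4(src + start, (char *) dst + start_row * row_size,
                                       nrows, n_per_row, imatrix);
            break;
        // ... every other quant type dispatches the same way ...
        default: break;
    }
    // This is the check that fires above (ggml/src/ggml.c:20623): once an
    // imatrix is passed, the Q4_0_4x4 path apparently writes a different
    // amount of data than nrows * row_size.
    GGML_ASSERT(result == nrows * row_size);
    return result;
}

The repeated "ptrace: Operation not permitted" lines are likely just ggml's assert handler failing to attach gdb for a backtrace (kernel.yama.ptrace_scope blocks that on many distros); they are noise, not part of the bug.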
ThomasBaruzier added the bug-unconfirmed and medium severity labels Jul 30, 2024
ThomasBaruzier (Author) commented

Never mind: the build error is because llama.cpp doesn't support CUDA 12.5. It builds fine with 12.4, though.
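
In case anyone hits the same thing, one way to pin the build to the 12.4 toolkit when several CUDA versions are installed is to point CMake at its nvcc explicitly. The install path below is an example; adjust it for your system:

cmake -B build -DGGML_CUDA=ON \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc
cmake --build build --config Release -j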

I'm now running the latest commit, and I see the same issue as above:

version: 1 (c887d8b)
built with cc (GCC) 14.1.1 20240720 for x86_64-pc-linux-gnu

ThomasBaruzier (Author) commented

Fixed by #9192
