
Bug: GGML_ASSERT(llama_add_eos_token(model) != 1) failed llama-server critical error with flan-t5 models #8990

Closed
fabiomatricardi opened this issue Aug 12, 2024 · 7 comments
Labels
bug-unconfirmed, medium severity, stale

Comments

@fabiomatricardi

What happened?

A direct llama-cli call with Flan-T5 based models works fine, but when trying to set up the server, a critical error stops execution:

.\llama-server.exe -m .\models\LaMini-Flan-T5-248M.Q8_0.gguf -c 512

GGML_ASSERT(llama_add_eos_token(model) != 1) failed

model repo: https://huggingface.co/Felladrin/gguf-LaMini-Flan-T5-248M

model file: https://huggingface.co/Felladrin/gguf-LaMini-Flan-T5-248M/resolve/main/LaMini-Flan-T5-248M.Q8_0.gguf

Name and Version

.\llama-cli.exe --version

version: 3570 (4134999e)
built with cc (GCC) 14.2.0 for x86_64-w64-mingw32

Windows 11 with Python 3.11

What operating system are you seeing the problem on?

Windows

Relevant log output

.\llama-server.exe -m .\models\LaMini-Flan-T5-248M.Q8_0.gguf -c 512
INFO [                    main] build info | tid="1" timestamp=1723424718 build=3570 commit="4134999e"
INFO [                    main] system info | tid="1" timestamp=1723424718 n_threads=4 n_threads_batch=-1 total_threads=4 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 28 key-value pairs and 282 tensors from .\models\LaMini-Flan-T5-248M.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = t5
llama_model_loader: - kv   1:                               general.name str              = T5
llama_model_loader: - kv   2:                          t5.context_length u32              = 512
llama_model_loader: - kv   3:                        t5.embedding_length u32              = 768
llama_model_loader: - kv   4:                     t5.feed_forward_length u32              = 2048
llama_model_loader: - kv   5:                             t5.block_count u32              = 12
llama_model_loader: - kv   6:                    t5.attention.head_count u32              = 12
llama_model_loader: - kv   7:                    t5.attention.key_length u32              = 64
llama_model_loader: - kv   8:                  t5.attention.value_length u32              = 64
llama_model_loader: - kv   9:            t5.attention.layer_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  10:        t5.attention.relative_buckets_count u32              = 32
llama_model_loader: - kv  11:        t5.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                  t5.decoder_start_token_id u32              = 0
llama_model_loader: - kv  13:                          general.file_type u32              = 7
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32128]   = ["<pad>", "</s>", "<unk>", "▁", "X"...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32128]   = [0.000000, 0.000000, 0.000000, -2.012...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32128]   = [3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  20:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  21:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  23:            tokenizer.ggml.unknown_token_id u32              = 2
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   64 tensors
llama_model_loader: - type  f16:    2 tensors
llama_model_loader: - type q8_0:  216 tensors
llm_load_vocab: special tokens cache size = 131
llm_load_vocab: token to piece cache size = 0.2123 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = t5
llm_load_print_meta: vocab type       = UGM
llm_load_print_meta: n_vocab          = 32128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 2048
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = -1
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 250M
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 247.58 M
llm_load_print_meta: model size       = 295.12 MiB (10.00 BPW)
llm_load_print_meta: general.name     = T5
llm_load_print_meta: EOS token        = 1 '</s>'
llm_load_print_meta: UNK token        = 2 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 3 '▁'
llm_load_print_meta: max token length = 20
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =   295.12 MiB
.......................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    18.00 MiB
llama_new_context_with_model: KV self size  =   18.00 MiB, K (f16):    9.00 MiB, V (f16):    9.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.25 MiB
llama_new_context_with_model:        CPU compute buffer size =    45.50 MiB
llama_new_context_with_model: graph nodes  = 425
llama_new_context_with_model: graph splits = 1
examples/server/server.cpp:696: GGML_ASSERT(llama_add_eos_token(model) != 1) failed
fabiomatricardi added the bug-unconfirmed and medium severity labels on Aug 12, 2024
@ggerganov
Owner

Probably fixed via #8997

@fairydreaming
Collaborator

@ggerganov I think #8997 only fixes T5 model loading, but T5 models still won't work correctly. For them to work, llama-server also needs to call llama_encode() and prepare the input for llama_decode() with the decoder start token, the way it's done in llama-cli:

if (llama_model_has_encoder(model)) {
    // run the encoder over the full tokenized prompt first
    int enc_input_size = embd_inp.size();
    llama_token * enc_input_buf = embd_inp.data();
    if (llama_encode(ctx, llama_batch_get_one(enc_input_buf, enc_input_size, 0, 0))) {
        LOG_TEE("%s : failed to eval\n", __func__);
        return 1;
    }
    // then seed the decoder input with the model's decoder-start token (falling back to BOS)
    llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
    if (decoder_start_token_id == -1) {
        decoder_start_token_id = llama_token_bos(model);
    }
    embd_inp.clear();
    embd_inp.push_back(decoder_start_token_id);
}
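
A minimal sketch of the step that would follow (not verbatim from llama-cli; it reuses the ctx and embd_inp names from the excerpt above): the decoder is then driven through the usual generation loop, starting from that single decoder-start token.

// hypothetical continuation of the excerpt above:
// feed the decoder-start token to the decoder; sampling then proceeds as for any
// other model until the decoder emits EOS on its own
if (llama_decode(ctx, llama_batch_get_one(embd_inp.data(), (int) embd_inp.size(), 0, 0))) {
    LOG_TEE("%s : failed to decode\n", __func__);
    return 1;
}

llama-server has no such encode-then-decode path in its request handling yet, which is why fixing loading alone is not enough.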

@joelbarmettlerUZH

I have the same issue with

./llama-server -m models/gte-Qwen2-1.5B-instruct-Q4_K_M.gguf -ngl 29 --embedding --pooling mean -c 32000

Pulled #8997, but the issue remains.

version: 3539 (a8dbc6f7)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu

@Aisuko
Contributor

Aisuko commented Aug 17, 2024

In my case, I quantized the original HuggingFace SmolLM to https://huggingface.co/aisuko/SmolLM-135M-Instruct-gguf, and it works fine with llama-cli.

However, it doesn't work with the fine-tuned version of SmolLM-135M-Instruct: https://huggingface.co/aisuko/ft-smollm-135M-instruct-on-hf-ultrafeedback.

When fine-tuning SmolLM, the tokenizer setup was as follows:
https://www.kaggle.com/code/aisuko/ft-smollm-135m-instruct-on-hf-ultrafeedback

import os
from transformers import AutoTokenizer

# note: add_eos_token=True is saved into tokenizer_config.json
tokenizer = AutoTokenizer.from_pretrained(os.getenv("TOKENIZER_NAME"), add_eos_token=True)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

After converting ft-smollm-135M-instruct-on-hf-ultrafeedback to GGUF and launching it through llama-cli, I get the error log below.

ec2-user@ip-10-110-145-139:~/workspace$ ./llama.cpp/llama-cli -m ft-smollm-135M-instruct-on-hf-ultrafeedback-f16.gguf -n 128
Log start
main: build = 3584 (5fd89a70)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1723856531
llama_model_loader: loaded meta data with 38 key-value pairs and 272 tensors from ft-smollm-135M-instruct-on-hf-ultrafeedback-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SmolLM 135M Instruct
llama_model_loader: - kv   3:                       general.organization str              = HuggingFaceTB
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = SmolLM
llama_model_loader: - kv   6:                         general.size_label str              = 135M
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = SmolLM 135M Instruct
llama_model_loader: - kv  10:          general.base_model.0.organization str              = HuggingFaceTB
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv  12:                               general.tags arr[str,3]       = ["trl", "orpo", "generated_from_train...
llama_model_loader: - kv  13:                          llama.block_count u32              = 30
llama_model_loader: - kv  14:                       llama.context_length u32              = 2048
llama_model_loader: - kv  15:                     llama.embedding_length u32              = 576
llama_model_loader: - kv  16:                  llama.feed_forward_length u32              = 1536
llama_model_loader: - kv  17:                 llama.attention.head_count u32              = 9
llama_model_loader: - kv  18:              llama.attention.head_count_kv u32              = 3
llama_model_loader: - kv  19:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  20:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                          general.file_type u32              = 1
llama_model_loader: - kv  22:                           llama.vocab_size u32              = 49152
llama_model_loader: - kv  23:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  24:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = smollm
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,49152]   = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,48900]   = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   61 tensors
llama_model_loader: - type  f16:  211 tensors
llm_load_vocab: special tokens cache size = 17
llm_load_vocab: token to piece cache size = 0.3170 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 49152
llm_load_print_meta: n_merges         = 48900
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 576
llm_load_print_meta: n_layer          = 30
llm_load_print_meta: n_head           = 9
llm_load_print_meta: n_head_kv        = 3
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 192
llm_load_print_meta: n_embd_v_gqa     = 192
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 134.52 M
llm_load_print_meta: model size       = 256.63 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = SmolLM 135M Instruct
llm_load_print_meta: BOS token        = 1 '<|im_start|>'
llm_load_print_meta: EOS token        = 2 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: PAD token        = 2 '<|im_end|>'
llm_load_print_meta: LF token         = 143 'Ä'
llm_load_print_meta: EOT token        = 0 '<|endoftext|>'
llm_load_print_meta: max token length = 162
llm_load_tensors: ggml ctx size =    0.13 MiB
llm_load_tensors:        CPU buffer size =   256.63 MiB
....................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    45.00 MiB
llama_new_context_with_model: KV self size  =   45.00 MiB, K (f16):   22.50 MiB, V (f16):   22.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.19 MiB
llama_new_context_with_model:        CPU compute buffer size =    98.25 MiB
llama_new_context_with_model: graph nodes  = 966
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
examples/main/main.cpp:272: GGML_ASSERT(llama_add_eos_token(model) != 1) failed
./llama.cpp/llama-cli(+0x5f5bb)[0x5ecb6f5815bb]
./llama.cpp/llama-cli(+0x61477)[0x5ecb6f583477]
./llama.cpp/llama-cli(+0x437e9)[0x5ecb6f5657e9]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x759a1ea29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x759a1ea29e40]
./llama.cpp/llama-cli(+0x43ca5)[0x5ecb6f565ca5]
Aborted (core dumped)

@fairydreaming
Collaborator

fairydreaming commented Aug 17, 2024

@Aisuko I think the problem is that your model has "add_eos_token": true in its tokenizer_config.json file. Since it's a decoder-only model and should generate the EOS token by itself, there's no need for this to be true. Perhaps you should simply set it to false and repeat the model conversion process?
Edit: I think you can also set it to false directly in the GGUF file with gguf-py/scripts/gguf_set_metadata.py
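
To check which way a converted file is set before or after patching, here is a minimal read-only sketch using the gguf Python package from the llama.cpp repo (hedged: GGUFReader.get_field is real, but the ReaderField value layout used below is an assumption and may differ between gguf-py versions):

from gguf import GGUFReader

# inspect the flag that trips GGML_ASSERT(llama_add_eos_token(model) != 1)
reader = GGUFReader("ft-smollm-135M-instruct-on-hf-ultrafeedback-f16.gguf")
field = reader.get_field("tokenizer.ggml.add_eos_token")
if field is None:
    print("tokenizer.ggml.add_eos_token is not set")
else:
    # for simple scalar key-value pairs the raw value is stored in the last part of the field
    print("tokenizer.ggml.add_eos_token =", bool(field.parts[-1][0]))

If it reads True, either re-convert with add_eos_token=False in the tokenizer config or patch the field with gguf_set_metadata.py as suggested above.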

@Aisuko
Contributor

Aisuko commented Aug 17, 2024


COOL COOL, thank you @fairydreaming, I will test it later.

Update: It works. And you are right. Thanks.

github-actions bot added the stale label on Sep 17, 2024
Contributor

github-actions bot commented Oct 1, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Oct 1, 2024