Eval bug: CANN error E89999 on Ascend 910b #10777

Open
JerryKwan opened this issue Dec 11, 2024 · 0 comments

Name and Version

./llama-cli --version
version: 4302 (43041d2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

Operating systems

Linux

GGML backends

CANN

Hardware

Huawei Ascend 910b

Models

QwQ-32B-Q4_0

Problem description & steps to reproduce

When I run the following command to start llama-cli, it crashes while warming up the model with "CANN error: E89999: Inner Error!". The failing call is aclnnArange in aclnn_arange (ggml-cann/aclnn_ops.cpp:300).

./llama-cli -m /models/QwQ-32B-Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer

First Bad Commit

No response

Relevant log output

./llama-cli -m /models/QwQ-32B-Q4_0.ggup -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
build: 4302 (43041d2e) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CANN0 (Ascend910B2) - 62152 MiB free
llama_load_model_from_file: using device CANN1 (Ascend910B2) - 62152 MiB free
llama_load_model_from_file: using device CANN2 (Ascend910B2) - 62152 MiB free
llama_load_model_from_file: using device CANN3 (Ascend910B2) - 62152 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 771 tensors from /models/QwQ-32B-Q4_0.ggup (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = QwQ32B
llama_model_loader: - kv   3:                         general.size_label str              = 33B
llama_model_loader: - kv   4:                          qwen2.block_count u32              = 64
llama_model_loader: - kv   5:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   6:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   7:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv   8:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv   9:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_0:  449 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 27648
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 32B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 32.76 B
llm_load_print_meta: model size       = 17.35 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = QwQ32B
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 33 repeating layers to GPU
llm_load_tensors: offloaded 33/65 layers to GPU
llm_load_tensors:        CANN0 model buffer size =  2354.66 MiB
llm_load_tensors:        CANN1 model buffer size =  2093.03 MiB
llm_load_tensors:        CANN2 model buffer size =  2093.03 MiB
llm_load_tensors:        CANN3 model buffer size =  2093.03 MiB
llm_load_tensors:   CPU_Mapped model buffer size =  9137.25 MiB
................................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:      CANN0 KV buffer size =   144.00 MiB
llama_kv_cache_init:      CANN1 KV buffer size =   128.00 MiB
llama_kv_cache_init:      CANN2 KV buffer size =   128.00 MiB
llama_kv_cache_init:      CANN3 KV buffer size =   128.00 MiB
llama_kv_cache_init:        CPU KV buffer size =   496.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:      CANN0 compute buffer size =   368.02 MiB
llama_new_context_with_model:      CANN1 compute buffer size =   368.00 MiB
llama_new_context_with_model:      CANN2 compute buffer size =   368.00 MiB
llama_new_context_with_model:      CANN3 compute buffer size =   368.00 MiB
llama_new_context_with_model:  CANN_Host compute buffer size =   307.00 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 441 (with bs=512), 6 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/workspace/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:63: CANN error: E89999: Inner Error!
E89999: [PID: 174412] 2024-12-11-06:51:31.011.789 op[Range], outSize from framework (OFF) is 1, but outSize from tiling (OFT) is 64,which maybe calc OFF by double, but calc OFT by floatplease use float to calc OFF while you wanner input's dtype is float[FUNC:CalculateOutputNum][FILE:range.cc][LINE:113]
        TraceBack (most recent call last):
       op[Range], calculate output_total_num value fail.[FUNC:AppendTilingArgs][FILE:range.cc][LINE:182]
       op[Range], append tiling args fail.[FUNC:Tiling4Range][FILE:range.cc][LINE:255]
       Tiling failed
       Tiling Failed.
       Kernel Run failed. opType: 7, Range
       launch failed for Range, errno:561103.

  current device: 0, in function aclnn_arange at /workspace/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:300
CANN error
  aclnnArange(workspaceAddr, workspaceSize, executor, ctx.stream())
Aborted (core dumped)
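
As a sanity check on the log above, the reported KV buffer sizes can be used to work out how the 64 layers were distributed before the crash. This is a minimal sketch, not llama.cpp code: it assumes the f16 KV cache is allocated per layer on whichever device holds that layer, with K and V each taking n_ctx * n_embd_k_gqa * 2 bytes per layer. The helper names are hypothetical; all numbers are copied from the log.

```python
# Values copied from the log above.
N_CTX = 4096          # llama_new_context_with_model: n_ctx
N_LAYER = 64          # llm_load_print_meta: n_layer
N_EMBD_K_GQA = 1024   # llm_load_print_meta: n_embd_k_gqa
N_EMBD_V_GQA = 1024   # llm_load_print_meta: n_embd_v_gqa
F16_BYTES = 2         # K and V are reported as f16

MIB = 1024 * 1024


def kv_mib_per_layer() -> float:
    """K + V cache size of one layer in MiB (hypothetical helper, not a llama.cpp API)."""
    k_bytes = N_CTX * N_EMBD_K_GQA * F16_BYTES
    v_bytes = N_CTX * N_EMBD_V_GQA * F16_BYTES
    return (k_bytes + v_bytes) / MIB


def layers_for(buffer_mib: float) -> float:
    """Number of layers implied by a reported KV buffer size (assumes per-layer allocation)."""
    return buffer_mib / kv_mib_per_layer()


# KV buffer sizes reported by llama_kv_cache_init, in MiB.
kv_buffers = {"CANN0": 144.0, "CANN1": 128.0, "CANN2": 128.0, "CANN3": 128.0, "CPU": 496.0}

layers = {name: layers_for(size) for name, size in kv_buffers.items()}
device_layers = sum(n for name, n in layers.items() if name != "CPU")

print(f"per-layer KV: {kv_mib_per_layer()} MiB")
print(f"layers per buffer: {layers}")
print(f"device layers: {device_layers}, total: {sum(layers.values())}")
```

Under these assumptions the buffers correspond to 9 + 8 + 8 + 8 = 33 layers on the four Ascend devices and 31 layers on the CPU, which agrees with "offloaded 33/65 layers to GPU" and with the reported KV self size of 1024.00 MiB. With -ngl 33, roughly half of the model therefore still runs on the CPU.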