Name and Version

./llama-cli --version
version: 4302 (43041d2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

Operating systems

Linux

GGML backends

CANN

Hardware

Huawei Ascend 910B

Models

QwQ-32B-Q4_0

Problem description & steps to reproduce

When I run the following command to start llama-cli, it crashes with a CANN error: E89999: Inner Error!

./llama-cli -m /models/QwQ-32B-Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer

First Bad Commit

No response
Relevant log output

./llama-cli -m /models/QwQ-32B-Q4_0.ggup -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
build: 4302 (43041d2e) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CANN0 (Ascend910B2) - 62152 MiB free
llama_load_model_from_file: using device CANN1 (Ascend910B2) - 62152 MiB free
llama_load_model_from_file: using device CANN2 (Ascend910B2) - 62152 MiB free
llama_load_model_from_file: using device CANN3 (Ascend910B2) - 62152 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 771 tensors from /models/QwQ-32B-Q4_0.ggup (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = qwen2
llama_model_loader: - kv   1: general.type str = model
llama_model_loader: - kv   2: general.name str = QwQ32B
llama_model_loader: - kv   3: general.size_label str = 33B
llama_model_loader: - kv   4: qwen2.block_count u32 = 64
llama_model_loader: - kv   5: qwen2.context_length u32 = 32768
llama_model_loader: - kv   6: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv   7: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv   8: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv   9: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv  10: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv  11: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv  12: general.file_type u32 = 2
llama_model_loader: - kv  13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv  14: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv  15: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t", ...
llama_model_loader: - kv  18: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv  19: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv  20: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv  21: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv  22: tokenizer.chat_template str = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  23: general.quantization_version u32 = 2
llama_model_loader: - type  f32: 321 tensors
llama_model_loader: - type q4_0: 449 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 5
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 27648
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 32B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 32.76 B
llm_load_print_meta: model size = 17.35 GiB (4.55 BPW)
llm_load_print_meta: general.name = QwQ32B
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 33 repeating layers to GPU
llm_load_tensors: offloaded 33/65 layers to GPU
llm_load_tensors: CANN0 model buffer size = 2354.66 MiB
llm_load_tensors: CANN1 model buffer size = 2093.03 MiB
llm_load_tensors: CANN2 model buffer size = 2093.03 MiB
llm_load_tensors: CANN3 model buffer size = 2093.03 MiB
llm_load_tensors: CPU_Mapped model buffer size = 9137.25 MiB
................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CANN0 KV buffer size = 144.00 MiB
llama_kv_cache_init: CANN1 KV buffer size = 128.00 MiB
llama_kv_cache_init: CANN2 KV buffer size = 128.00 MiB
llama_kv_cache_init: CANN3 KV buffer size = 128.00 MiB
llama_kv_cache_init: CPU KV buffer size = 496.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CANN0 compute buffer size = 368.02 MiB
llama_new_context_with_model: CANN1 compute buffer size = 368.00 MiB
llama_new_context_with_model: CANN2 compute buffer size = 368.00 MiB
llama_new_context_with_model: CANN3 compute buffer size = 368.00 MiB
llama_new_context_with_model: CANN_Host compute buffer size = 307.00 MiB
llama_new_context_with_model: graph nodes = 2246
llama_new_context_with_model: graph splits = 441 (with bs=512), 6 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/workspace/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:63: CANN error: E89999: Inner Error!
E89999: [PID: 174412] 2024-12-11-06:51:31.011.789 op[Range], outSize from framework (OFF) is 1, but outSize from tiling (OFT) is 64, which maybe calc OFF by double, but calc OFT by float, please use float to calc OFF while you wanner input's dtype is float[FUNC:CalculateOutputNum][FILE:range.cc][LINE:113]
        TraceBack (most recent call last):
        op[Range], calculate output_total_num value fail.[FUNC:AppendTilingArgs][FILE:range.cc][LINE:182]
        op[Range], append tiling args fail.[FUNC:Tiling4Range][FILE:range.cc][LINE:255]
        Tiling failed
        Tiling Failed.
        Kernel Run failed. opType: 7, Range launch failed for Range, errno:561103.
  current device: 0, in function aclnn_arange at /workspace/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:300
  aclnnArange(workspaceAddr, workspaceSize, executor, ctx.stream())
Aborted (core dumped)
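A note on what the tiling failure is complaining about: a Range/arange op conventionally derives its output element count as ceil((stop - start) / step), and the error text states that the framework computed this count in double ("OFF is 1") while the tiling code computed it in float ("OFT is 64"). The sketch below is not the CANN or llama.cpp source, only a minimal self-contained illustration of how evaluating the same count formula in two different precisions can disagree; the constants are chosen purely to make the divergence visible.

#include <cmath>
#include <cstdint>
#include <cstdio>

// Element count of arange(start, stop, step), computed in precision T.
// Mirrors the conventional ceil((stop - start) / step) definition; an
// illustrative stand-in, not the actual CANN tiling code.
template <typename T>
int64_t arange_count(T start, T stop, T step) {
    return (int64_t) std::ceil((stop - start) / step);
}

int main() {
    // 2^24 + 1 is exactly representable as a double but rounds down to 2^24
    // as a float, so the two precisions disagree on the element count of the
    // same logical range -- the class of OFF/OFT mismatch the error reports.
    double stop = 16777217.0; // 2^24 + 1
    printf("count in double: %lld\n", (long long) arange_count<double>(0.0, stop, 1.0));          // 16777217
    printf("count in float:  %lld\n", (long long) arange_count<float>(0.0f, (float) stop, 1.0f)); // 16777216
    return 0;
}

The 1-vs-64 gap in the log is far larger than any rounding difference, so the framework side may be misinterpreting the float scalars outright rather than merely losing precision; either way, the remedy the message itself suggests is to compute both counts in the operands' dtype.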