[CANN] Support Q8_0 #8805

Merged: 1 commit merged into ggerganov:master on Aug 1, 2024
Conversation

wangshuai09
Contributor

This PR fixes the MulMat operator for Q8_0 on the CANN backend; with this change, the CANN backend supports Q8_0 models.

Tested with Llama-2-7B-Chat Q8_0:

(base) root@4018f537ff75:/home/wangshuai/downloads/src/llama.cpp/build# ./bin/llama-cli -m /home/wangshuai/models/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q8_0.gguf -ngl 32 -p "Building a website can be done in 10 simple steps:" --seed 1024
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 3489 (5af1609e)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1024
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/wangshuai/models/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 6.67 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors:        CPU buffer size =  6828.64 MiB
llm_load_tensors:       CANN buffer size =  6563.01 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:       CANN KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:       CANN compute buffer size =   296.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

system_info: n_threads = 192 / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1


 Building a website can be done in 10 simple steps:
1. Define your website's purpose: Before you begin building your website, you need to determine its purpose. What do you want your website to achieve? What message do you want to convey to your visitors? Knowing the purpose of your website will help you determine the type of content and features you need to include.
2. Choose a domain name: Your domain name is the address of your website (e.g., [www.yoursite.com](http://www.yoursite.com)). It should be easy to remember, easy to spell, and relevant to your website's content.
3. Select a web hosting provider: Your web hosting provider will store your website's files and make them accessible to visitors. Look for a provider that offers reliable uptime, adequate storage, and good customer support.
4. Plan your website's structure: Determine how you want your website to be organized. This will involve creating a site map, which is a visual representation of your website's pages and how they are linked together.
5. Design your website: Use a website builder or a graphics program to create the visual design of your website. This includes choosing a color scheme, selecting fonts, and creating images or graphics.
6. Write and edit content: Create the content for your website, including text, images, and other media. Make sure your content is engaging, informative, and optimized for search engines.
7. Add functionality: Consider adding interactive features to your website, such as forms, galleries, and videos. You can also integrate third-party tools and services, such as social media widgets or e-commerce plugins.
8. Launch your website: Once you have built and tested your website, it's time to launch it. Make sure to check for any bugs or technical issues before going live.
9. Maintain and update your website: Your website is not a static entity – it needs to be updated and maintained regularly. This includes adding new content, fixing broken links, and keeping your website's software and plugins up to date.
10. Monitor and analyze your website's performance: Use tools like Google Analytics to track your website's traffic, engagement, and conversion rates. This will help you identify areas for improvement and optimize your website for better results.
Building a website can be a complex process, but following these 10 simple steps can help you get started. [end of text]

llama_print_timings:        load time =   11814.90 ms
llama_print_timings:      sample time =      39.35 ms /   519 runs   (    0.08 ms per token, 13190.33 tokens per second)
llama_print_timings: prompt eval time =      64.72 ms /    14 tokens (    4.62 ms per token,   216.32 tokens per second)
llama_print_timings:        eval time =   26427.93 ms /   518 runs   (   51.02 ms per token,    19.60 tokens per second)
llama_print_timings:       total time =   26862.97 ms /   532 tokens
Log end

@hipudding self-requested a review on Aug 1, 2024 at 01:19
@hipudding merged commit c8a0090 into ggerganov:master on Aug 1, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Aug 2, 2024
@hipudding added the "Ascend NPU" label (issues specific to Ascend NPUs) on Nov 13, 2024