[BMG dgfx][ipex-llm[cpp]] low performance and gpu using when running llama.cpp inference on B580 #12586

Open
jianjungu opened this issue Dec 20, 2024 · 1 comment

@jianjungu

I'm running llama.cpp from the ipex-llm[cpp] build 2024.12.17 on an Intel Arc B580 dGfx.

GPU usage is only around 40% and generation speed is around 20 tokens per second (TPS).

image

The command I'm using is:

set SYCL_CACHE_PERSISTENT=1

llama-cli -m ..\glm4.gguf -n 32 --prompt "why sky is blue?" -c 2048 -e -ngl 999 --color --no-mmap -n 4096

The console output is:



C:\Users\test\Documents\bmg_test\libs>llama-cli -m ..\glm4.gguf -n 32 --prompt "why sky is blue?" -c 2048 -e -ngl 999 --color --no-mmap -n 4096
build: 1 (1133019) with MSVC 19.38.33133.0 for
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 24 key-value pairs and 283 tensors from ..\glm4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = chatglm
llama_model_loader: - kv   1:                               general.name str              = glm-4-9b-chat
llama_model_loader: - kv   2:                     chatglm.context_length u32              = 131072
llama_model_loader: - kv   3:                   chatglm.embedding_length u32              = 4096
llama_model_loader: - kv   4:                chatglm.feed_forward_length u32              = 13696
llama_model_loader: - kv   5:                        chatglm.block_count u32              = 40
llama_model_loader: - kv   6:               chatglm.attention.head_count u32              = 32
llama_model_loader: - kv   7:            chatglm.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:   chatglm.attention.layer_norm_rms_epsilon f32              = 0.000000
llama_model_loader: - kv   9:                          general.file_type u32              = 2
llama_model_loader: - kv  10:               chatglm.rope.dimension_count u32              = 64
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                     chatglm.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = chatglm-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151073]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  20:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = [gMASK]<sop>{% for item in messages %...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  161 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 223
llm_load_vocab: token to piece cache size = 0.9732 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = chatglm
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151552
llm_load_print_meta: n_merges         = 151073
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.6e-07
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 5.08 GiB (4.64 BPW)
llm_load_print_meta: general.name     = glm-4-9b-chat
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '<|user|>'
llm_load_print_meta: EOG token        = 151329 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  4863.85 MiB
llm_load_tensors:  SYCL_Host buffer size =   333.00 MiB
................................................................................
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc B580 Graphics|    1.6|    160|    1024|   32| 12450M|            1.3.31155|
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =    80.00 MiB
llama_new_context_with_model: KV self size  =   80.00 MiB, K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  1248.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    48.02 MiB
llama_new_context_with_model: graph nodes  = 1446
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 20

system_info: n_threads = 20 (n_threads_batch = 20) / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 4071779348
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 4096, n_predict = 4096, n_keep = 0

why sky is blue? what makes the color of the sky blue?
The color of the sky is blue because of the way Earth's atmosphere interacts with sunlight. When sunlight passes through the Earth's atmosphere, it encounters molecules and particles of gas, dust, and water vapor. These particles scatter the light in all directions.
The scattering of light is more effective for shorter wavelengths, such as blue light, than for longer wavelengths, such as red light. This is because the shorter waves of blue light are more easily bent as they pass around the particles in the atmosphere. This phenomenon is known as Rayleigh scattering.
As the blue light scatters in all directions, it illuminates the sky from all angles, making the sky appear blue during the day. The intensity of the blue color is strongest at the horizon and becomes lighter towards the zenith (the highest point in the sky), which is why the sky often appears a lighter blue or even white at high altitudes or during sunrise and sunset.
At night, when the sun is below the horizon, the sky is no longer illuminated by sunlight, and therefore does not appear blue. Instead, the sky often appears black, or a dark blue, because we are looking into space rather than through the atmosphere.
In addition to Rayleigh scattering, which is responsible for the blue color of the sky during the day, there are other factors that can affect the color of the sky, such as:
- **Rayleigh scattering by larger particles**: This can cause the sky to appear whitish or grayish when there are a lot of particles in the air, such as during dust storms or volcanic eruptions.
- **Mie scattering**: This type of scattering, which is caused by larger particles such as water droplets, can cause the sky to appear red or orange during sunrise and sunset.
- **Molecular absorption**: Certain gases in the atmosphere, such as ozone, can absorb certain wavelengths of light, which can affect the color of the sky.
In summary, the blue color of the sky is primarily due to the scattering of sunlight by molecules in the Earth's atmosphere, with shorter wavelengths (like blue) being scattered more than longer wavelengths (like red). This phenomenon is known as Rayleigh scattering, and it is the primary reason we see a blue sky during the day. During sunrise and sunset, other scattering processes can also contribute to the colors we see. [end of text]


llama_perf_sampler_print:    sampling time =      40.85 ms /   482 runs   (    0.08 ms per token, 11800.42 tokens per second)
llama_perf_context_print:        load time =    8706.65 ms
llama_perf_context_print: prompt eval time =     120.56 ms /     5 tokens (   24.11 ms per token,    41.47 tokens per second)
llama_perf_context_print:        eval time =   22397.03 ms /   476 runs   (   47.05 ms per token,    21.25 tokens per second)
llama_perf_context_print:       total time =   22625.44 ms /   481 tokens
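
For comparing runs, the timing lines above can be reduced to a few numbers. Below is a minimal Python sketch, assuming the `llama_perf_context_print` line layout shown in this log (other llama.cpp builds may format these lines differently). `PERF_RE` and `parse_perf` are hypothetical helper names for illustration and are not part of llama.cpp or ipex-llm.

```python
import re

# Matches lines such as:
#   llama_perf_context_print:        eval time =   22397.03 ms /   476 runs   (...)
# The layout is an assumption taken from the log in this issue.
PERF_RE = re.compile(
    r"llama_perf_context_print:\s+(?P<name>prompt eval|eval) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<count>\d+) (?:tokens|runs)"
)


def parse_perf(log_text):
    """Return {"prompt eval": (ms, count, tokens_per_s), "eval": (...)}."""
    stats = {}
    for match in PERF_RE.finditer(log_text):
        ms = float(match.group("ms"))
        count = int(match.group("count"))
        # Throughput is recomputed from the raw totals, not read from the log.
        stats[match.group("name")] = (ms, count, count / (ms / 1000.0))
    return stats


# The two timing lines copied from the console output above.
log = """\
llama_perf_context_print: prompt eval time =     120.56 ms /     5 tokens (   24.11 ms per token,    41.47 tokens per second)
llama_perf_context_print:        eval time =   22397.03 ms /   476 runs   (   47.05 ms per token,    21.25 tokens per second)
"""

for name, (ms, count, tps) in parse_perf(log).items():
    print(f"{name}: {count} tokens in {ms:.2f} ms -> {tps:.2f} tokens/s")
```

The `eval` entry is the generation (decode) rate, which is the figure this issue is about. The `prompt eval` entry covers only 5 prompt tokens here, so it says little about prefill speed.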

@leonardozcm
Contributor

Hi, I have tried glm4 on BMG and it runs faster (25.81 ms per token) than your 47.05 ms per token. Since your driver version is newer than mine (101.6236), I wonder which oneAPI version you are using (2024.2.1 is recommended).
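
To put the two per-token latencies side by side, here is a small worked calculation in Python. Only the two latency values come from this thread. `tokens_per_second` is a hypothetical helper name, and the two runs may differ in driver, oneAPI version, and prompt, so treat the ratio as a rough indication only.

```python
def tokens_per_second(ms_per_token):
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


reported_ms = 47.05   # eval latency from the log in the original post
reference_ms = 25.81  # latency quoted in this comment

reported_tps = tokens_per_second(reported_ms)
reference_tps = tokens_per_second(reference_ms)
speedup = reported_ms / reference_ms

print(f"reported:  {reported_tps:.2f} tokens/s")
print(f"reference: {reference_tps:.2f} tokens/s")
print(f"reference run is about {speedup:.2f}x faster per token")
```

That works out to roughly 38.7 tokens per second for the reference run against about 21.3 for the reported one, or around 1.8x.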
