With Intel GPU on Windows, llama_perf_context_print reports invalid performance metrics #1853

dnoliver · 2024-12-02T19:07:47Z

Prerequisites

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Running on an Intel GPU should report valid performance metrics (for example, tokens generated per second).

Current Behavior

The prompt eval and eval metrics are reported with invalid values: 0.00 ms and inf tokens per second (see logs).

Environment and Context

12th Gen Intel Core i7-1270P
Intel Iris Xe Graphics
Windows 11
Python 3.11.10
Visual Studio 2022
Intel oneAPI Toolkit 2025.0

Failure Information (for bugs)

The sample code runs fine, but the prompt eval and eval timings come back as 0.00 ms, so the derived rates print as inf.

Steps to Reproduce

Follow the steps at https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md#windows to get the SYCL build ready
Follow the build process described in the Expected Behavior section
Run the build with:

set CMAKE_GENERATOR=Ninja
set CMAKE_ARGS=-DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release
pip install -e .

Run the example code posted below

from llama_cpp import Llama

llm = Llama(
      model_path="C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf",
      n_gpu_layers=-1,
      seed=1337,
      n_ctx=2048,
)
output = llm(
      "Name the planets in the solar system.",
      max_tokens=256,
      echo=True
)
print(output)
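
Until the built-in timings are fixed, throughput can be estimated from wall-clock time instead. The following is a minimal sketch, not part of llama-cpp-python: `timed_completion` and `tokens_per_second` are hypothetical helper names, and the only thing assumed about the library is the `usage` dictionary that appears in the output printed in the logs. The measured time includes prompt processing, so the result is a rough lower bound on the generation rate.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed_completion(
    generate: Callable[[], Dict[str, Any]],
) -> Tuple[Dict[str, Any], float]:
    """Run generate() and return (output, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    output = generate()
    return output, time.perf_counter() - start


def tokens_per_second(usage: Dict[str, int], elapsed_s: float) -> float:
    """Completion tokens divided by wall-clock seconds.

    elapsed_s covers the whole call, prompt processing included.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return usage["completion_tokens"] / elapsed_s


# Hypothetical usage with the llm object from the script above:
# output, elapsed = timed_completion(
#     lambda: llm("Name the planets in the solar system.",
#                 max_tokens=256, echo=True)
# )
# print(f"{tokens_per_second(output['usage'], elapsed):.2f} tokens/s")
```

Passing the call as a zero-argument callable keeps the timing helper independent of the model object, so it can be exercised without loading a model.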

Failure Logs

This is a follow up from #1709, which describes the build steps for using an Intel iGPU with oneAPI.

(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>git log -n 1
commit f3fb90b114835cc50c4816787d56bac2fe1180c3 (HEAD -> main, origin/main, origin/HEAD)
Author: Andrei Betlen <[email protected]>
Date:   Thu Nov 28 18:27:55 2024 -0500

    feat: Update llama.cpp

(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>python --version
Python 3.11.10

(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>pip list | findstr /C:numpy /C:fastapi /C:sse-starlette /C:uvicorn
numpy                        1.26.4

(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>python test.py
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Iris(R) Xe Graphics) - 14798 MiB free
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: control token:      2 '</s>' is not marked as EOG
llm_load_vocab: control token:      1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: tensor 'token_embd.weight' (q4_0) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =  3577.56 MiB
llm_load_tensors:   CPU_Mapped model buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|   12.3|     96|     512|   32| 15517M|            1.3.28044|
llama_kv_cache_init:      SYCL0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   210.32 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     4.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1 (with bs=512), 2 (with bs=1)
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
Model metadata: {'general.name': 'LLaMA v2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '11008', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '2', 'llama.attention.head_count_kv': '32', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0'}
Using fallback chat format: llama-2
llama_perf_context_print:        load time =   15870.59 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    10 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   255 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  109988.14 ms /   265 tokens
{'id': 'cmpl-c992020f-eb88-4493-b18e-37db1ea42ec8', 'object': 'text_completion', 'created': 1733166304, 'model': 'C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf', 'choices': [{'text': 'Name the planets in the solar system.ϊ. What are the names of the nine planets in the solar system? 2. Name the five largest planets in the solar system. 3. Name the four inner planets in the solar system. 4. Name the outer planets in the solar system. 5. Name the four inner planets of the solar system. 6. Name the four outer planets of the solar system. 7. Name the seven planets of the solar system. 8. Name the nine planets of the solar system.\nThe Solar System 3.2\nA planet is a round celestial body that orbits around the Sun. There are 8 planets in the Solar System. These are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. The Sun is not a planet. The Solar System is the collection of all the planets, their moons, and other objects that are gravitationally bound to the Sun. The planets of the Solar System are made of rock and metal, and they orbit the Sun in a flat, circular path called an ellipse. They travel around the Sun in an almost circular, flat path', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 10, 'completion_tokens': 256, 'total_tokens': 266}}

The sample works and uses the GPU, but the metrics section is invalid:

llama_perf_context_print:        load time =   15870.59 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    10 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   255 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  109988.14 ms /   265 tokens
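
To flag this condition automatically, the four lines above can be parsed and checked for a zero duration paired with a non-zero token count. This is a sketch only: the regular expression is derived from the log lines shown in this issue rather than from any documented format, and `parse_perf`, `zero_time_metrics`, and `overall_rate` are hypothetical names.

```python
import re
from typing import Dict, List, Optional, Tuple

# Pattern inferred from the log excerpt in this issue (an assumption,
# not a documented llama.cpp format).
PERF_LINE = re.compile(
    r"llama_perf_context_print:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:tokens|runs))?"
)

SAMPLE = """\
llama_perf_context_print:        load time =   15870.59 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    10 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   255 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  109988.14 ms /   265 tokens
"""

Metrics = Dict[str, Tuple[float, Optional[int]]]


def parse_perf(log: str) -> Metrics:
    """Map each metric name to (milliseconds, token/run count or None)."""
    metrics: Metrics = {}
    for m in PERF_LINE.finditer(log):
        count = m.group("count")
        metrics[m.group("name")] = (float(m.group("ms")), int(count) if count else None)
    return metrics


def zero_time_metrics(metrics: Metrics) -> List[str]:
    """Names of metrics that claim work was done in zero time."""
    return [name for name, (ms, count) in metrics.items() if count and ms == 0.0]


def overall_rate(metrics: Metrics) -> float:
    """Tokens per second computed from the 'total' line only."""
    ms, count = metrics["total"]
    return count / (ms / 1000.0)


if __name__ == "__main__":
    parsed = parse_perf(SAMPLE)
    print(zero_time_metrics(parsed))
    print(f"{overall_rate(parsed):.2f} tokens/s overall")
```

On the excerpt above, the check flags `prompt eval` and `eval`, and the total line works out to about 2.41 tokens per second (265 tokens in 109988.14 ms). That end-to-end figure is only a rough indication and not a substitute for the missing eval rate.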

dnoliver mentioned this issue Dec 2, 2024

igpu #1709

Open
