from llama_cpp import Llama

llm = Llama(
    model_path="C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf",
    n_gpu_layers=-1,
    seed=1337,
    n_ctx=2048,
)
output = llm(
    "Name the planets in the solar system.",
    max_tokens=256,
    echo=True,
)
print(output)
Failure Logs
This is a follow up from #1709, which describes the build steps for using an Intel iGPU with oneAPI.
(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>git log -n 1
commit f3fb90b114835cc50c4816787d56bac2fe1180c3 (HEAD -> main, origin/main, origin/HEAD)
Author: Andrei Betlen <[email protected]>
Date: Thu Nov 28 18:27:55 2024 -0500
feat: Update llama.cpp
(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>python --version
Python 3.11.10
(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>pip list | findstr /C:numpy /C:fastapi /C:sse-starlette /C:uvicorn
numpy 1.26.4
(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>python test.py
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Iris(R) Xe Graphics) - 14798 MiB free
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: control token: 2 '</s>' is not marked as EOG
llm_load_vocab: control token: 1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: tensor 'token_embd.weight' (q4_0) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 model buffer size = 3577.56 MiB
llm_load_tensors: CPU_Mapped model buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|   12.3|     96|     512|   32| 15517M|            1.3.28044|
llama_kv_cache_init: SYCL0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.12 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 210.32 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 4.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1 (with bs=512), 2 (with bs=1)
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
Model metadata: {'general.name': 'LLaMA v2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '11008', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '2', 'llama.attention.head_count_kv': '32', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0'}
Using fallback chat format: llama-2
llama_perf_context_print: load time = 15870.59 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 10 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 255 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 109988.14 ms / 265 tokens
{'id': 'cmpl-c992020f-eb88-4493-b18e-37db1ea42ec8', 'object': 'text_completion', 'created': 1733166304, 'model': 'C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf', 'choices': [{'text': 'Name the planets in the solar system.ϊ. What are the names of the nine planets in the solar system? 2. Name the five largest planets in the solar system. 3. Name the four inner planets in the solar system. 4. Name the outer planets in the solar system. 5. Name the four inner planets of the solar system. 6. Name the four outer planets of the solar system. 7. Name the seven planets of the solar system. 8. Name the nine planets of the solar system.\nThe Solar System 3.2\nA planet is a round celestial body that orbits around the Sun. There are 8 planets in the Solar System. These are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. The Sun is not a planet. The Solar System is the collection of all the planets, their moons, and other objects that are gravitationally bound to the Sun. The planets of the Solar System are made of rock and metal, and they orbit the Sun in a flat, circular path called an ellipse. They travel around the Sun in an almost circular, flat path', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 10, 'completion_tokens': 256, 'total_tokens': 266}}
The sample works and uses the GPU, but the metrics section is invalid:
llama_perf_context_print: load time = 15870.59 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 10 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 255 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 109988.14 ms / 265 tokens
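Until the per-phase timings are fixed, the one line that still carries usable numbers is the `total time` line. A minimal sketch of recovering an overall throughput figure from it is shown here; the regular expression is an assumption based only on the log format printed above, and the function name `overall_tokens_per_second` is hypothetical, not part of llama-cpp-python.

```python
import re

# Pattern inferred from the log excerpt in this issue; other llama.cpp
# versions may format the line differently (assumption).
TOTAL_RE = re.compile(r"total time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens")


def overall_tokens_per_second(log_line):
    """Return tokens per second from a 'total time' perf line, or None."""
    match = TOTAL_RE.search(log_line)
    if match is None:
        return None
    elapsed_ms = float(match.group(1))
    tokens = int(match.group(2))
    if elapsed_ms <= 0:
        # Avoid reproducing the 'inf tokens per second' seen in the logs.
        return None
    return tokens / (elapsed_ms / 1000.0)


line = "llama_perf_context_print: total time = 109988.14 ms / 265 tokens"
print(overall_tokens_per_second(line))
```

For the run above this works out to roughly 2.4 tokens per second overall. It mixes prompt evaluation and generation, so it is only a coarse stand-in for the missing per-phase numbers.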
Prerequisites
Expected Behavior
Running on an Intel GPU should report valid metrics (prompt eval time, eval time, tokens per second, and so on).
Current Behavior
Metrics are reported with invalid values (see logs)
Environment and Context
Failure Information (for bugs)
The sample code runs fine, but the prompt eval and eval timings come back as 0.00 ms, which makes the reported rate "inf tokens per second".
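As a possible workaround while the built-in timings are zero, the call can be timed from Python with a wall clock. The sketch below is not how llama.cpp computes its own metrics; `timed_call`, `completion_tokens_per_second`, and the `fake_llm` stand-in are hypothetical names used for illustration. The only thing taken from this issue is the `usage` dictionary layout shown in the printed output.

```python
import time


def timed_call(generate, *args, **kwargs):
    """Run generate(*args, **kwargs) and return (output, elapsed seconds)."""
    start = time.perf_counter()
    output = generate(*args, **kwargs)
    elapsed_s = time.perf_counter() - start
    return output, elapsed_s


def completion_tokens_per_second(output, elapsed_s):
    """Completion tokens divided by wall-clock time for the whole call."""
    completion_tokens = output["usage"]["completion_tokens"]
    if elapsed_s <= 0:
        return float("nan")
    return completion_tokens / elapsed_s


# Stand-in for the real `llm` object so the sketch runs without a model;
# it returns only the `usage` part of the response seen in the logs.
def fake_llm(prompt, max_tokens=256, echo=True):
    return {"usage": {"prompt_tokens": 10, "completion_tokens": 256, "total_tokens": 266}}


output, elapsed_s = timed_call(fake_llm, "Name the planets in the solar system.", max_tokens=256)
rate = completion_tokens_per_second(output, elapsed_s)
print(f"{rate:.2f} completion tokens/s (wall clock)")
```

With the real model, `fake_llm` would be replaced by the `llm` object from the sample script. The measured time includes prompt evaluation, so the resulting rate slightly understates pure generation speed.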
Steps to Reproduce