Intel GPU not enabled when using -DLLAVA_BUILD=OFF #1851

Open
4 tasks done
dnoliver opened this issue Dec 2, 2024 · 0 comments

dnoliver commented Dec 2, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Disabling LLAVA for GPU builds should still produce a build that uses the Intel GPU.

Current Behavior

This is a follow-up to #1709, which describes the build steps for using an Intel iGPU with oneAPI.
We noted that when you use -DLLAVA_BUILD=OFF, the resulting build doesn't have iGPU support.
So building this project like this:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
set CMAKE_ARGS="-DLLAVA_BUILD=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DBUILD_SHARED_LIBS=ON"
pip install -e . --verbose
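
As a sanity check (not part of the original report), one way to see whether the SYCL backend actually made it into the wheel is to list the native libraries shipped inside the installed llama_cpp package. The lib subfolder layout below is an assumption about the wheel layout and may differ between versions:

# Hypothetical check: list the native libraries bundled with the installed
# llama_cpp package. The "lib" subdirectory is an assumption about the wheel
# layout; a SYCL-enabled build would be expected to ship a ggml SYCL DLL
# alongside llama.dll, while this build apparently does not.
import os
import llama_cpp

lib_dir = os.path.join(os.path.dirname(llama_cpp.__file__), "lib")
for name in sorted(os.listdir(lib_dir)):
    if name.lower().endswith((".dll", ".so")):
        print(name)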

And then running the following sample code:

from llama_cpp import Llama

llm = Llama(
      model_path="C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf",
      n_gpu_layers=-1,
      seed=1337,
      n_ctx=2048,
)
output = llm(
      "Name the planets in the solar system.",
      max_tokens=256,
      echo=True
)
print(output)

Works, but it only uses the CPU.
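
A minimal runtime check (a sketch, not from the original report) is to query the low-level bindings; llama_print_system_info and llama_supports_gpu_offload are assumed to be exposed by the installed llama_cpp version:

# Hypothetical verification snippet, assuming these low-level bindings are
# exposed by the installed llama_cpp version.
import llama_cpp

# Prints the same backend/feature line that appears at the end of the load log
# (e.g. "CPU : SSE3 = 1 | ..."); a SYCL build would be expected to mention SYCL here.
print(llama_cpp.llama_print_system_info().decode())

# Reports whether the compiled backend is able to offload layers to a GPU.
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())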

Environment and Context

12th Gen Intel Core i7-1270P
Intel Iris Xe Graphics
Windows 11
Python 3.11.10
Visual Studio 2022
Intel oneAPI Toolkit 2025.0

Failure Information (for bugs)

There is no GPU usage, as shown in the failure log below.

Steps to Reproduce

  1. Follow the steps at https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md#windows to get the SYCL build ready
  2. Follow the build process described in the Current Behavior section
  3. Run the Python code in the Current Behavior section

Failure Logs

(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>git log -n 1
commit f3fb90b114835cc50c4816787d56bac2fe1180c3 (HEAD -> main, origin/main, origin/HEAD)
Author: Andrei Betlen <[email protected]>
Date:   Thu Nov 28 18:27:55 2024 -0500

    feat: Update llama.cpp

(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>python --version
Python 3.11.10

(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>pip list | findstr /C:numpy /C:fastapi /C:sse-starlette /C:uvicorn
numpy                        1.26.4

(poc) C:\Users\dnoliver\GitHub\dnoliver\llama-cpp-python>python test.py
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: control token:      2 '</s>' is not marked as EOG
llm_load_vocab: control token:      1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: tensor 'token_embd.weight' (q4_0) (and 290 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors:   CPU_Mapped model buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   164.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
Model metadata: {'general.name': 'LLaMA v2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '11008', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '2', 'llama.attention.head_count_kv': '32', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0'}
Using fallback chat format: llama-2
llama_perf_context_print:        load time =    4966.41 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    10 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   255 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  100959.94 ms /   265 tokens
{'id': 'cmpl-a95738f7-4f38-499e-a512-cca7a4dd30c3', 'object': 'text_completion', 'created': 1733165270, 'model': 'C:/Users/dnoliver/Downloads/llama-2-7b.Q4_0.gguf', 'choices': [{'text': 'Name the planets in the solar system.ϊν. The Sun, the Moon, and the planets (Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, and Pluto).\nA planet is a large celestial body that orbits a star and rotates on its axis. Planets are the second largest class of objects in the solar system. The largest are the stars.\nThe solar system consists of the Sun, planets, and other celestial bodies that revolve around the Sun. The nine planets revolve around the Sun in the order: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, and Pluto.\nThe Sun is the star at the center of our solar system. It is the only star in our solar system that has planets. The Sun is the largest object in the solar system and the most massive. The Sun is about 109 times the size of Earth and its mass is 333,000 times the mass of Earth.\nThe planets are large objects that revolve around the Sun. The nine planets in our solar system are Mercury, Venus, Earth,', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 10, 'completion_tokens': 256, 'total_tokens': 266}}

It works, but it doesn't use the GPU backend.

@dnoliver dnoliver changed the title Intel iGPU not enabled when using -DLLAVA_BUILD=OFF Intel GPU not enabled when using -DLLAVA_BUILD=OFF Dec 2, 2024
@dnoliver dnoliver mentioned this issue Dec 2, 2024