RuntimeError: Unable to run llama3.2 on ipex-llm[cpp] #12598

Open
ajatprabha opened this issue Dec 23, 2024 · 15 comments

@ajatprabha

I'm trying to run Ollama on the integrated GPU of an Intel i5-1240P processor. I followed this doc.

Everything installed fine; however, when I try to run the model, it crashes at runtime.

Attaching the error details:

Details

[GIN] 2024/12/23 - 19:55:38 | 200 |      19.579µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/12/23 - 19:55:38 | 200 |   13.552991ms |       127.0.0.1 | POST     "/api/show"
time=2024-12-23T19:55:38.186Z level=INFO source=server.go:105 msg="system memory" total="16.0 GiB" free="15.2 GiB" free_swap="8.0 GiB"
time=2024-12-23T19:55:38.186Z level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[15.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.3 GiB" memory.required.partial="0 B" memory.required.kv="896.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2024-12-23T19:55:38.187Z level=INFO source=server.go:401 msg="starting llama server" cmd="/tmp/ollama1778822854/runners/ipex_llm/ollama_llama_server --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 999 --threads 4 --no-mmap --parallel 4 --port 46365"
time=2024-12-23T19:55:38.191Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-23T19:55:38.191Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2024-12-23T19:55:38.191Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2024-12-23T19:55:38.221Z level=INFO source=runner.go:941 msg="starting go runner"
time=2024-12-23T19:55:38.221Z level=INFO source=runner.go:942 msg=system info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=4
time=2024-12-23T19:55:38.222Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:46365"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2024-12-23T19:55:38.442Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  1918.36 MiB
llm_load_tensors:  SYCL_Host buffer size =   308.23 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel UHD Graphics|    1.3|     80|     512|   32| 30747M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     2.00 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   256.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 2
time=2024-12-23T19:55:40.702Z level=INFO source=server.go:619 msg="llama runner started in 2.51 seconds"
[GIN] 2024/12/23 - 19:55:40 | 200 |  2.554739911s |       127.0.0.1 | POST     "/api/generate"
ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
SIGABRT: abort
PC=0x7d574ae969fc m=3 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 7 gp=0xc0000e2000 m=3 mp=0xc00005b008 [syscall]:
runtime.cgocall(0x5f36e170f540, 0xc000069b48)
	runtime/cgocall.go:157 +0x4b fp=0xc000069b20 sp=0xc000069ae8 pc=0x5f36e149042b
ollama/llama/llamafile._Cfunc_llama_decode(0x7d56b2384880, {0x21, 0x7d56b238cac0, 0x0, 0x0, 0x7d56d8027bc0, 0x7d56d80233e0, 0x7d56d800d350, 0x7d56d807cb70, 0x0, ...})
	_cgo_gotypes.go:548 +0x52 fp=0xc000069b48 sp=0xc000069b20 pc=0x5f36e158d9b2
ollama/llama/llamafile.(*Context).Decode.func1(0x5f36e170afab?, 0x7d56b2384880?)
	ollama/llama/llamafile/llama.go:121 +0xd8 fp=0xc000069c68 sp=0xc000069b48 pc=0x5f36e158ffd8
ollama/llama/llamafile.(*Context).Decode(0xc000069d58?, 0x0?)
	ollama/llama/llamafile/llama.go:121 +0x13 fp=0xc000069cb0 sp=0xc000069c68 pc=0x5f36e158fe73
main.(*Server).processBatch(0xc0000b2120, 0xc000136000, 0xc000069f10)
	ollama/llama/runner/runner.go:434 +0x24d fp=0xc000069ed0 sp=0xc000069cb0 pc=0x5f36e1709c6d
main.(*Server).run(0xc0000b2120, {0x5f36e1a15e00, 0xc0000880a0})
	ollama/llama/runner/runner.go:342 +0x1e5 fp=0xc000069fb8 sp=0xc000069ed0 pc=0x5f36e17096e5
main.main.gowrap2()
	ollama/llama/runner/runner.go:980 +0x28 fp=0xc000069fe0 sp=0xc000069fb8 pc=0x5f36e170e548
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000069fe8 sp=0xc000069fe0 pc=0x5f36e14f8e41
created by main.main in goroutine 1
	ollama/llama/runner/runner.go:980 +0xd3e

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0xc000042008?, 0x0?, 0xc0?, 0x61?, 0xc00003b898?)
	runtime/proc.go:402 +0xce fp=0xc00003b860 sp=0xc00003b840 pc=0x5f36e14c706e
runtime.netpollblock(0xc00003b8f8?, 0xe148fb86?, 0x36?)
	runtime/netpoll.go:573 +0xf7 fp=0xc00003b898 sp=0xc00003b860 pc=0x5f36e14bf2b7
internal/poll.runtime_pollWait(0x7d574ec49020, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc00003b8b8 sp=0xc00003b898 pc=0x5f36e14f3b05
internal/poll.(*pollDesc).wait(0x3?, 0x3fe?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00003b8e0 sp=0xc00003b8b8 pc=0x5f36e1543a27
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000de080)
	internal/poll/fd_unix.go:611 +0x2ac fp=0xc00003b988 sp=0xc00003b8e0 pc=0x5f36e1544eec
net.(*netFD).accept(0xc0000de080)
	net/fd_unix.go:172 +0x29 fp=0xc00003ba40 sp=0xc00003b988 pc=0x5f36e15b3a69
net.(*TCPListener).accept(0xc000040220)
	net/tcpsock_posix.go:159 +0x1e fp=0xc00003ba68 sp=0xc00003ba40 pc=0x5f36e15c479e
net.(*TCPListener).Accept(0xc000040220)
	net/tcpsock.go:327 +0x30 fp=0xc00003ba98 sp=0xc00003ba68 pc=0x5f36e15c3af0
net/http.(*onceCloseListener).Accept(0xc0000b21b0?)
	<autogenerated>:1 +0x24 fp=0xc00003bab0 sp=0xc00003ba98 pc=0x5f36e16ead04
net/http.(*Server).Serve(0xc0000f4000, {0x5f36e1a157c0, 0xc000040220})
	net/http/server.go:3260 +0x33e fp=0xc00003bbe0 sp=0xc00003bab0 pc=0x5f36e16e1b1e
main.main()
	ollama/llama/runner/runner.go:1000 +0x10cd fp=0xc00003bf50 sp=0xc00003bbe0 pc=0x5f36e170e2cd
runtime.main()
	runtime/proc.go:271 +0x29d fp=0xc00003bfe0 sp=0xc00003bf50 pc=0x5f36e14c6c3d
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc00003bfe8 sp=0xc00003bfe0 pc=0x5f36e14f8e41

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000054fa8 sp=0xc000054f88 pc=0x5f36e14c706e
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.forcegchelper()
	runtime/proc.go:326 +0xb8 fp=0xc000054fe0 sp=0xc000054fa8 pc=0x5f36e14c6ef8
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000054fe8 sp=0xc000054fe0 pc=0x5f36e14f8e41
created by runtime.init.6 in goroutine 1
	runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000055780 sp=0xc000055760 pc=0x5f36e14c706e
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.bgsweep(0xc000024150)
	runtime/mgcsweep.go:278 +0x94 fp=0xc0000557c8 sp=0xc000055780 pc=0x5f36e14b1bb4
runtime.gcenable.gowrap1()
	runtime/mgc.go:203 +0x25 fp=0xc0000557e0 sp=0xc0000557c8 pc=0x5f36e14a66e5
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000557e8 sp=0xc0000557e0 pc=0x5f36e14f8e41
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc000024150?, 0x5f36e178de60?, 0x1?, 0x0?, 0xc000007340?)
	runtime/proc.go:402 +0xce fp=0xc000055f78 sp=0xc000055f58 pc=0x5f36e14c706e
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.(*scavengerState).park(0x5f36e1bdf660)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000055fa8 sp=0xc000055f78 pc=0x5f36e14af5a9
runtime.bgscavenge(0xc000024150)
	runtime/mgcscavenge.go:653 +0x3c fp=0xc000055fc8 sp=0xc000055fa8 pc=0x5f36e14afb3c
runtime.gcenable.gowrap2()
	runtime/mgc.go:204 +0x25 fp=0xc000055fe0 sp=0xc000055fc8 pc=0x5f36e14a6685
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000055fe8 sp=0xc000055fe0 pc=0x5f36e14f8e41
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0xa5

goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0xc000054648?, 0x5f36e1499fe5?, 0xa8?, 0x1?, 0xc0000061c0?)
	runtime/proc.go:402 +0xce fp=0xc000054620 sp=0xc000054600 pc=0x5f36e14c706e
runtime.runfinq()
	runtime/mfinal.go:194 +0x107 fp=0xc0000547e0 sp=0xc000054620 pc=0x5f36e14a5727
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000547e8 sp=0xc0000547e0 pc=0x5f36e14f8e41
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:164 +0x3d

goroutine 37 gp=0xc000007dc0 m=nil [IO wait]:
runtime.gopark(0x10?, 0x10?, 0xf0?, 0x6d?, 0xb?)
	runtime/proc.go:402 +0xce fp=0xc000056da8 sp=0xc000056d88 pc=0x5f36e14c706e
runtime.netpollblock(0x5f36e152d5b8?, 0xe148fb86?, 0x36?)
	runtime/netpoll.go:573 +0xf7 fp=0xc000056de0 sp=0xc000056da8 pc=0x5f36e14bf2b7
internal/poll.runtime_pollWait(0x7d574ec48f28, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc000056e00 sp=0xc000056de0 pc=0x5f36e14f3b05
internal/poll.(*pollDesc).wait(0xc0000de100?, 0xc000182041?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000056e28 sp=0xc000056e00 pc=0x5f36e1543a27
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0000de100, {0xc000182041, 0x1, 0x1})
	internal/poll/fd_unix.go:164 +0x27a fp=0xc000056ec0 sp=0xc000056e28 pc=0x5f36e154457a
net.(*netFD).Read(0xc0000de100, {0xc000182041?, 0xc000056f48?, 0x5f36e14f5730?})
	net/fd_posix.go:55 +0x25 fp=0xc000056f08 sp=0xc000056ec0 pc=0x5f36e15b2965
net.(*conn).Read(0xc000058098, {0xc000182041?, 0x0?, 0x5f36e1c3f9a0?})
	net/net.go:185 +0x45 fp=0xc000056f50 sp=0xc000056f08 pc=0x5f36e15bcc25
net.(*TCPConn).Read(0x5f36e1ba2040?, {0xc000182041?, 0x0?, 0x0?})
	<autogenerated>:1 +0x25 fp=0xc000056f80 sp=0xc000056f50 pc=0x5f36e15c8605
net/http.(*connReader).backgroundRead(0xc000182030)
	net/http/server.go:681 +0x37 fp=0xc000056fc8 sp=0xc000056f80 pc=0x5f36e16d7497
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:677 +0x25 fp=0xc000056fe0 sp=0xc000056fc8 pc=0x5f36e16d73c5
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000056fe8 sp=0xc000056fe0 pc=0x5f36e14f8e41
created by net/http.(*connReader).startBackgroundRead in goroutine 8
	net/http/server.go:677 +0xba

goroutine 8 gp=0xc0000e21c0 m=nil [select]:
runtime.gopark(0xc00015ba28?, 0x2?, 0x50?, 0x61?, 0xc00015b7ec?)
	runtime/proc.go:402 +0xce fp=0xc00015b660 sp=0xc00015b640 pc=0x5f36e14c706e
runtime.selectgo(0xc00015ba28, 0xc00015b7e8, 0x21?, 0x0, 0x1?, 0x1)
	runtime/select.go:327 +0x725 fp=0xc00015b780 sp=0xc00015b660 pc=0x5f36e14d8445
main.(*Server).completion(0xc0000b2120, {0x5f36e1a15970, 0xc00012c2a0}, 0xc00011c360)
	ollama/llama/runner/runner.go:698 +0xa86 fp=0xc00015bab8 sp=0xc00015b780 pc=0x5f36e170bac6
main.(*Server).completion-fm({0x5f36e1a15970?, 0xc00012c2a0?}, 0x5f36e16e5e4d?)
	<autogenerated>:1 +0x36 fp=0xc00015bae8 sp=0xc00015bab8 pc=0x5f36e170ed76
net/http.HandlerFunc.ServeHTTP(0xc000098ea0?, {0x5f36e1a15970?, 0xc00012c2a0?}, 0x10?)
	net/http/server.go:2171 +0x29 fp=0xc00015bb10 sp=0xc00015bae8 pc=0x5f36e16de8e9
net/http.(*ServeMux).ServeHTTP(0x5f36e1499fe5?, {0x5f36e1a15970, 0xc00012c2a0}, 0xc00011c360)
	net/http/server.go:2688 +0x1ad fp=0xc00015bb60 sp=0xc00015bb10 pc=0x5f36e16e076d
net/http.serverHandler.ServeHTTP({0x5f36e1a14cc0?}, {0x5f36e1a15970?, 0xc00012c2a0?}, 0x6?)
	net/http/server.go:3142 +0x8e fp=0xc00015bb90 sp=0xc00015bb60 pc=0x5f36e16e178e
net/http.(*conn).serve(0xc0000b21b0, {0x5f36e1a15dc8, 0xc000096db0})
	net/http/server.go:2044 +0x5e8 fp=0xc00015bfb8 sp=0xc00015bb90 pc=0x5f36e16dd528
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3290 +0x28 fp=0xc00015bfe0 sp=0xc00015bfb8 pc=0x5f36e16e1f08
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc00015bfe8 sp=0xc00015bfe0 pc=0x5f36e14f8e41
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3290 +0x4b4

rax    0x0
rbx    0x7d56ee400640
rcx    0x7d574ae969fc
rdx    0x6
rdi    0x145a
rsi    0x145c
rbp    0x145c
rsp    0x7d56ee3fefa0
r8     0x7d56ee3ff070
r9     0x0
r10    0x8
r11    0x246
r12    0x6
r13    0x16
r14    0x7d574d53eb0c
r15    0xffffaaae15800000
rip    0x7d574ae969fc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
[GIN] 2024/12/23 - 19:55:43 | 200 |  194.012636ms |       127.0.0.1 | POST     "/api/chat"
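
As a side check on the memory figures in this log, the reported KV cache size (`KV self size = 896.00 MiB`, split evenly between K and V) is consistent with a simple product of context length, layer count, per-layer K/V width, and 2 bytes per f16 element. The sketch below reproduces that arithmetic; the function name `kv_cache_mib` is made up for illustration, and the formula is an assumption that happens to match the logged numbers, not a statement of how the runner computes it.

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate K and V cache sizes in MiB (illustrative helper, not part of ollama).

    bytes_per_elem=2 assumes f16 cache entries, as the log reports "K (f16)" and "V (f16)".
    """
    mib = 1024 * 1024
    k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem
    return k_bytes / mib, v_bytes / mib


# Values taken from the log above: n_ctx=8192, n_layer=28,
# n_embd_k_gqa=1024, n_embd_v_gqa=1024.
k_mib, v_mib = kv_cache_mib(8192, 28, 1024, 1024)
print(k_mib, v_mib, k_mib + v_mib)
```

With the logged values this gives 448 MiB for K and 448 MiB for V, so the cache allocation itself looks as expected and is unlikely to be what triggers the abort.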

The CGO call fails with:

ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
SIGABRT: abort
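
For anyone triaging a long server log like the one above, the assertion line can be pulled out automatically. This is a minimal sketch that assumes the message follows the usual glibc `assert` layout (`program: file:line: function: Assertion ... failed.`), which is what the line above looks like; the names `ASSERT_RE` and `find_assertions` are hypothetical.

```python
import re

# Assumed layout: "<prog>: <file>:<line>: <function signature>: Assertion `<expr>' failed."
ASSERT_RE = re.compile(
    r"^(?P<prog>[^:\s]+): (?P<file>[^:]+):(?P<line>\d+): (?P<func>.*): "
    r"Assertion `(?P<expr>.*)' failed\.$"
)


def find_assertions(log_text):
    """Return a list of (file, line, function_name, expression) tuples."""
    hits = []
    for raw in log_text.splitlines():
        m = ASSERT_RE.match(raw.strip())
        if not m:
            continue
        # Reduce the full signature to the bare function name:
        # drop the argument list, then drop the return type.
        head = m.group("func").split("(", 1)[0].split()
        name = head[-1] if head else m.group("func")
        hits.append((m.group("file"), int(m.group("line")), name, m.group("expr")))
    return hits
```

Feeding the full log text to `find_assertions` should yield one entry pointing at `sdp_xmx_kernel.cpp`, line 439, which is the useful part to quote when searching for related reports.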

Installed versions:

Details

./ollama --version
ollama version is 0.4.6-ipexllm-20241223
intel-oneapi-ccl-2021.11/all,now 2021.11.2-5 amd64 [installed,automatic]
intel-oneapi-ccl-devel-2021.11/all,now 2021.11.2-5 amd64 [installed,automatic]
intel-oneapi-ccl-devel/all,now 2021.11.2-5 amd64 [installed,upgradable to: 2021.14.0-505]
intel-oneapi-ccl/all,now 2021.11.2-5 amd64 [installed,upgradable to: 2021.14.0-505]
intel-oneapi-common-licensing-2024.0/all,now 2024.0.0-49406 all [installed,automatic]
intel-oneapi-common-oneapi-vars-2024.0/all,now 2024.0.0-49406 all [installed,automatic]
intel-oneapi-common-oneapi-vars/all,now 2024.0.0-49406 all [installed,upgradable to: 2025.0.1-15]
intel-oneapi-common-vars/all,now 2024.0.0-49406 all [installed,upgradable to: 2025.0.1-15]
intel-oneapi-compiler-cpp-eclipse-cfg-2024.0/all,now 2024.0.2-49895 all [installed,automatic]
intel-oneapi-compiler-dpcpp-cpp-2024.0/all,now 2024.0.2-49895 amd64 [installed,automatic]
intel-oneapi-compiler-dpcpp-cpp-common-2024.0/all,now 2024.0.2-49895 all [installed,automatic]
intel-oneapi-compiler-dpcpp-cpp-runtime-2024.0/all,now 2024.0.2-49895 amd64 [installed,automatic]
intel-oneapi-compiler-dpcpp-cpp/all,now 2024.0.2-49895 amd64 [installed,upgradable to: 2025.0.4-1519]
intel-oneapi-compiler-dpcpp-eclipse-cfg-2024.0/all,now 2024.0.2-49895 all [installed,automatic]
intel-oneapi-compiler-shared-2024.0/all,now 2024.0.2-49895 amd64 [installed,automatic]
intel-oneapi-compiler-shared-common-2024.0/all,now 2024.0.2-49895 all [installed,automatic]
intel-oneapi-compiler-shared-runtime-2024.0/all,now 2024.0.2-49895 amd64 [installed,automatic]
intel-oneapi-dal-2024.0/all,now 2024.0.1-25 amd64 [installed,automatic]
intel-oneapi-dal-common-2024.0/all,now 2024.0.1-25 all [installed,automatic]
intel-oneapi-dal-common-devel-2024.0/all,now 2024.0.1-25 all [installed,automatic]
intel-oneapi-dal-devel-2024.0/all,now 2024.0.1-25 amd64 [installed,automatic]
intel-oneapi-dal-devel/all,now 2024.0.1-25 amd64 [installed,upgradable to: 2025.0.1-9]
intel-oneapi-dal/all,now 2024.0.1-25 amd64 [installed,upgradable to: 2025.0.1-9]
intel-oneapi-dev-utilities-2024.0/all,now 2024.0.0-49320 amd64 [installed,automatic]
intel-oneapi-dev-utilities-eclipse-cfg-2024.0/all,now 2024.0.0-49320 all [installed,automatic]
intel-oneapi-diagnostics-utility-2024.0/all,now 2024.0.0-49093 amd64 [installed,automatic]
intel-oneapi-diagnostics-utility/all,now 2024.0.0-49093 amd64 [installed,upgradable to: 2024.2.1-13]
intel-oneapi-dnnl-2024.0/all,now 2024.0.0-49521 amd64 [installed,automatic]
intel-oneapi-dnnl-devel-2024.0/all,now 2024.0.0-49521 amd64 [installed,automatic]
intel-oneapi-dnnl-devel/all,now 2024.0.0-49521 amd64 [installed,upgradable to: 2025.0.1-6]
intel-oneapi-dnnl/all,now 2024.0.0-49521 amd64 [installed,upgradable to: 2025.0.1-6]
intel-oneapi-dpcpp-cpp-2024.0/all,now 2024.0.2-49895 amd64 [installed,automatic]
intel-oneapi-dpcpp-ct-2024.0/all,now 2024.0.0-49381 amd64 [installed,automatic]
intel-oneapi-dpcpp-ct-eclipse-cfg-2024.0/all,now 2024.0.0-49381 all [installed,automatic]
intel-oneapi-dpcpp-ct/all,now 2024.0.0-49381 amd64 [installed,upgradable to: 2025.0.1-17]
intel-oneapi-dpcpp-debugger-2024.0/all,now 2024.0.1-6 amd64 [installed,automatic]
intel-oneapi-icc-eclipse-plugin-cpp-2024.0/all,now 2024.0.2-49895 all [installed,automatic]
intel-oneapi-ipp-2021.10/all,now 2021.10.1-13 amd64 [installed,automatic]
intel-oneapi-ipp-common-2021.10/all,now 2021.10.1-13 all [installed,automatic]
intel-oneapi-ipp-common-devel-2021.10/all,now 2021.10.1-13 all [installed,automatic]
intel-oneapi-ipp-devel-2021.10/all,now 2021.10.1-13 amd64 [installed,automatic]
intel-oneapi-ipp-devel/all,now 2021.10.1-13 amd64 [installed,upgradable to: 2022.0.0-808]
intel-oneapi-ipp/all,now 2021.10.1-13 amd64 [installed,upgradable to: 2022.0.0-808]
intel-oneapi-ippcp-2021.9/all,now 2021.9.1-5 amd64 [installed,automatic]
intel-oneapi-ippcp-common-2021.9/all,now 2021.9.1-5 all [installed,automatic]
intel-oneapi-ippcp-common-devel-2021.9/all,now 2021.9.1-5 all [installed,automatic]
intel-oneapi-ippcp-devel-2021.9/all,now 2021.9.1-5 amd64 [installed,automatic]
intel-oneapi-ippcp-devel/all,now 2021.9.1-5 amd64 [installed,upgradable to: 2025.0.0-615]
intel-oneapi-ippcp/all,now 2021.9.1-5 amd64 [installed,upgradable to: 2025.0.0-615]
intel-oneapi-libdpstd-devel-2022.3/all,now 2022.3.0-49369 amd64 [installed,automatic]
intel-oneapi-mkl-2024.0/all,now 2024.0.0-49656 amd64 [installed,automatic]
intel-oneapi-mkl-common-2024.0/all,now 2024.0.0-49656 all [installed,automatic]
intel-oneapi-mkl-common-devel-2024.0/all,now 2024.0.0-49656 all [installed,automatic]
intel-oneapi-mkl-devel-2024.0/all,now 2024.0.0-49656 amd64 [installed,automatic]
intel-oneapi-mkl-devel/all,now 2024.0.0-49656 amd64 [installed,upgradable to: 2025.0.1-14]
intel-oneapi-mkl/all,now 2024.0.0-49656 amd64 [installed,upgradable to: 2025.0.1-14]
intel-oneapi-mpi-2021.11/all,now 2021.11.0-49493 amd64 [installed,automatic]
intel-oneapi-mpi-devel-2021.11/all,now 2021.11.0-49493 amd64 [installed,automatic]
intel-oneapi-mpi-devel/all,now 2021.11.0-49493 amd64 [installed,upgradable to: 2021.14.1-5]
intel-oneapi-mpi/all,now 2021.11.0-49493 amd64 [installed,upgradable to: 2021.14.1-5]
intel-oneapi-openmp-2024.0/all,now 2024.0.2-49895 amd64 [installed,automatic]
intel-oneapi-openmp-common-2024.0/all,now 2024.0.2-49895 all [installed,automatic]
intel-oneapi-tbb-2021.11/all,now 2021.11.0-49513 amd64 [installed,automatic]
intel-oneapi-tbb-common-2021.11/all,now 2021.11.0-49513 all [installed,automatic]
intel-oneapi-tbb-common-devel-2021.11/all,now 2021.11.0-49513 all [installed,automatic]
intel-oneapi-tbb-devel-2021.11/all,now 2021.11.0-49513 amd64 [installed,automatic]
intel-oneapi-tcm-1.0/all,now 1.0.0-435 amd64 [installed,upgradable to: 1.0.1-175]
intel-oneapi-tlt-2024.0/all,now 2024.0.0-352 amd64 [installed,automatic]
intel-oneapi-tlt/all,now 2024.0.0-352 amd64 [installed,upgradable to: 2025.0.0-550]

Is this a known issue? I didn't see this GPU listed as supported anywhere, but I went ahead and gave it a try anyway.

@ajatprabha
Author

ajatprabha commented Dec 23, 2024

I tried llama3.1:8b and that worked well. It is also able to use the GPU:

Details

intel-gpu-top: 8086:4626 @ /dev/dri/card0 - 1298/1298 MHz;   0% RC6; 13.50/35.05 W
         71 irqs/s

         ENGINES     BUSY                                                    MI_SEMA MI_WAIT
       Render/3D   99.08% |███████████████████████████████████████████████▋      0%      0%
         Blitter    0.00% |                                                |      0%      0%
           Video    0.00% |                                                |      0%      0%
    VideoEnhance    0.00% |                                                |      0%      0%

Update: It happened with llama3.1:8b too. It looks flaky on 3.1 but always fails on 3.2.

@qiuxin2012
Contributor

qiuxin2012 commented Dec 24, 2024

Llama 3.2 works fine on our i9-13900H, which also has Iris Graphics.
Can you check your GPU in Task Manager? The i5-1240P should show Iris Graphics.
image
But your log shows UHD Graphics:

found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel UHD Graphics|    1.3|     80|     512|   32| 30747M|            1.3.29735|
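
To compare device names across machines without eyeballing the table, the data rows can be parsed into dictionaries. This is a rough sketch that assumes the nine-column layout shown above; the function name `parse_sycl_devices` and the dictionary keys are invented for illustration and are not part of ipex-llm or llama.cpp.

```python
def parse_sycl_devices(table_text):
    """Parse data rows of the 'found N SYCL devices' table into dicts."""
    devices = []
    for raw in table_text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Data rows have 9 cells and a numeric ID first;
        # header and separator rows do not.
        if len(cells) != 9 or not cells[0].isdigit():
            continue
        devices.append({
            "id": int(cells[0]),
            "device_type": cells[1],
            "name": cells[2],
            "version": cells[3],
            "max_compute_units": int(cells[4]),
            "max_work_group": int(cells[5]),
            "max_sub_group": int(cells[6]),
            "global_mem": cells[7],
            "driver_version": cells[8],
        })
    return devices
```

Applied to the table above, this returns a single device named `Intel UHD Graphics` with 80 compute units, which makes the name mismatch easy to check in a script.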

@ajatprabha
Author

I'm on a Linux machine. The difference in device name may be due to the RAM installed.

Intel specs say:

Intel® Iris® Xe Graphics only: to use the Intel® Iris® Xe brand, the system must be populated with 128-bit (dual channel) memory. Otherwise, use the Intel® UHD brand.

I only have a single 32 GB RAM module installed, which could explain the difference in device name.

@ajatprabha
Author

The error is intermittent! I have been able to run both models every now and then, but most of the time it fails with an assertion failure.

@ajatprabha
Author

I ran lspci -k to check the video device details. It could be a driver issue as well:

00:02.0 VGA compatible controller: Intel Corporation Alder Lake-P Integrated Graphics Controller (rev 0c)
        DeviceName: Onboard - Video
        Subsystem: Intel Corporation Alder Lake-P Integrated Graphics Controller
        Kernel driver in use: i915
        Kernel modules: i915, xe

But I can't tell from this output whether this is an "i915 or xe" situation or an "i915 and xe" situation.
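
On reading that output: in `lspci -k`, "Kernel driver in use" names the driver currently bound to the device, while "Kernel modules" lists the modules capable of handling it. A small sketch that separates the two fields is shown here; the function name `parse_lspci_k` and its return shape are illustrative assumptions.

```python
def parse_lspci_k(block):
    """Extract the bound driver and the candidate modules from one lspci -k device block."""
    info = {"driver_in_use": None, "modules": []}
    for raw in block.splitlines():
        line = raw.strip()
        if line.startswith("Kernel driver in use:"):
            # The driver actually bound to the device right now.
            info["driver_in_use"] = line.split(":", 1)[1].strip()
        elif line.startswith("Kernel modules:"):
            # Modules that could handle the device; not all are necessarily loaded or bound.
            names = line.split(":", 1)[1].split(",")
            info["modules"] = [n.strip() for n in names if n.strip()]
    return info
```

For the output above this gives `driver_in_use` as `i915` and `modules` as `i915` and `xe`, which suggests the device is driven by i915 with xe available as an alternative rather than both being active at once.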

@qiuxin2012
Contributor

qiuxin2012 commented Dec 24, 2024

We found that it's a bug in our checking; we are fixing it.

@qiuxin2012
Contributor

You can try updating ipex-llm[cpp] to 2.2.0b20241226 tomorrow. I have fixed this bug and tested the fix on a similar device, an i7-1270P.
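
To confirm that an environment has picked up at least that build, the version line from `pip freeze` can be checked in a script. This is a sketch under one assumption: that the `bYYYYMMDD` suffix in versions such as `2.2.0b20241226` is a build date, so later builds compare greater as strings. The helper names `ipex_llm_build_date` and `is_at_least_build` are hypothetical.

```python
import re


def ipex_llm_build_date(freeze_output):
    """Return the 8-digit build suffix of ipex-llm from `pip freeze` text, or None."""
    for line in freeze_output.splitlines():
        m = re.fullmatch(r"ipex-llm==\d+(?:\.\d+)*b(\d{8})", line.strip())
        if m:
            return m.group(1)
    return None


def is_at_least_build(freeze_output, minimum="20241226"):
    """True if the installed ipex-llm build suffix is >= minimum (assumed YYYYMMDD)."""
    build = ipex_llm_build_date(freeze_output)
    # Fixed-width digit strings compare correctly as plain strings.
    return build is not None and build >= minimum
```

The text passed in would be the captured output of `pip freeze`; a `False` result means the package is either missing or older than the requested build.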

@ajatprabha
Author

I tried again after upgrading and calling init-ollama.
The upgrade appears to have taken effect:

pip freeze | grep ipex-llm
ipex-llm==2.2.0b20241226

However, I still get the same error.

Details

[GIN] 2024/12/27 - 09:48:25 | 200 |     444.146µs |  192.168.100.21 | GET      "/api/tags"
time=2024-12-27T09:48:25.713Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-12-27T09:48:25.714Z level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-12-27T09:48:25.714Z level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-12-27T09:48:25.715Z level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-12-27T09:48:25.722Z level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-12-27T09:48:25.723Z level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-12-27T09:48:25.792Z level=INFO source=server.go:105 msg="system memory" total="16.0 GiB" free="15.6 GiB" free_swap="8.0 GiB"
time=2024-12-27T09:48:25.792Z level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[0 B]" memory.gpu_overhead="0 B" memory.required.full="2.2 GiB" memory.required.partial="0 B" memory.required.kv="224.0 MiB" memory.required.allocations="[0 B]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2024-12-27T09:48:25.793Z level=INFO source=server.go:401 msg="starting llama server" cmd="/tmp/ollama1273445517/runners/ipex_llm/ollama_llama_server --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 999 --threads 4 --no-mmap --parallel 1 --port 39835"
time=2024-12-27T09:48:25.793Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-27T09:48:25.793Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2024-12-27T09:48:25.793Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2024-12-27T09:48:25.831Z level=INFO source=runner.go:956 msg="starting go runner"
time=2024-12-27T09:48:25.831Z level=INFO source=runner.go:957 msg=system info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=4
time=2024-12-27T09:48:25.831Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:39835"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2024-12-27T09:48:26.045Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  1918.36 MiB
llm_load_tensors:  SYCL_Host buffer size =   308.23 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel UHD Graphics|    1.3|     80|     512|   32| 30747M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   256.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 2
time=2024-12-27T09:48:27.992Z level=WARN source=runner.go:894 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2024-12-27T09:48:29.060Z level=INFO source=server.go:619 msg="llama runner started in 3.27 seconds"
ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
SIGABRT: abort
PC=0x7081bbe969fc m=3 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 20 gp=0xc000102a80 m=3 mp=0xc00005b008 [syscall]:
runtime.cgocall(0x5d853c50f820, 0xc000069b48)
        runtime/cgocall.go:157 +0x4b fp=0xc000069b20 sp=0xc000069ae8 pc=0x5d853c29044b
ollama/llama/llamafile._Cfunc_llama_decode(0x708144006280, {0x20, 0x7080ed512cf0, 0x0, 0x0, 0x7080ed503040, 0x7080ed503850, 0x7080ed504060, 0x7080ed50f000, 0x0, ...})
        _cgo_gotypes.go:548 +0x52 fp=0xc000069b48 sp=0xc000069b20 pc=0x5d853c38d9d2
ollama/llama/llamafile.(*Context).Decode.func1(0x5d853c50b06b?, 0x708144006280?)
        ollama/llama/llamafile/llama.go:121 +0xd8 fp=0xc000069c68 sp=0xc000069b48 pc=0x5d853c390098
ollama/llama/llamafile.(*Context).Decode(0xc000069d58?, 0x0?)
        ollama/llama/llamafile/llama.go:121 +0x13 fp=0xc000069cb0 sp=0xc000069c68 pc=0x5d853c38ff33
main.(*Server).processBatch(0xc000146120, 0xc0000a6000, 0xc000069f10)
        ollama/llama/runner/runner.go:434 +0x24d fp=0xc000069ed0 sp=0xc000069cb0 pc=0x5d853c509d2d
main.(*Server).run(0xc000146120, {0x5d853c816ba0, 0xc000182050})
        ollama/llama/runner/runner.go:342 +0x1e5 fp=0xc000069fb8 sp=0xc000069ed0 pc=0x5d853c5097a5
main.main.gowrap2()
        ollama/llama/runner/runner.go:995 +0x28 fp=0xc000069fe0 sp=0xc000069fb8 pc=0x5d853c50e828
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000069fe8 sp=0xc000069fe0 pc=0x5d853c2f8e61
created by main.main in goroutine 1
        ollama/llama/runner/runner.go:995 +0xd3e
goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x1?, 0xc00003b8e0?, 0x74?, 0x6e?, 0xc00003b8c0?)
        runtime/proc.go:402 +0xce fp=0xc00003b860 sp=0xc00003b840 pc=0x5d853c2c708e
runtime.netpollblock(0x10?, 0x3c28fba6?, 0x85?)
        runtime/netpoll.go:573 +0xf7 fp=0xc00003b898 sp=0xc00003b860 pc=0x5d853c2bf2d7
internal/poll.runtime_pollWait(0x7081bfc46f50, 0x72)
        runtime/netpoll.go:345 +0x85 fp=0xc00003b8b8 sp=0xc00003b898 pc=0x5d853c2f3b25
internal/poll.(*pollDesc).wait(0x3?, 0x7081bf7c0288?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00003b8e0 sp=0xc00003b8b8 pc=0x5d853c343a47
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00017c080)
        internal/poll/fd_unix.go:611 +0x2ac fp=0xc00003b988 sp=0xc00003b8e0 pc=0x5d853c344f0c
net.(*netFD).accept(0xc00017c080)
        net/fd_unix.go:172 +0x29 fp=0xc00003ba40 sp=0xc00003b988 pc=0x5d853c3b3b29
net.(*TCPListener).accept(0xc0001481c0)
        net/tcpsock_posix.go:159 +0x1e fp=0xc00003ba68 sp=0xc00003ba40 pc=0x5d853c3c485e
net.(*TCPListener).Accept(0xc0001481c0)
        net/tcpsock.go:327 +0x30 fp=0xc00003ba98 sp=0xc00003ba68 pc=0x5d853c3c3bb0
net/http.(*onceCloseListener).Accept(0xc000218000?)
        <autogenerated>:1 +0x24 fp=0xc00003bab0 sp=0xc00003ba98 pc=0x5d853c4eadc4
net/http.(*Server).Serve(0xc00019a000, {0x5d853c816560, 0xc0001481c0})
        net/http/server.go:3260 +0x33e fp=0xc00003bbe0 sp=0xc00003bab0 pc=0x5d853c4e1bde
main.main()
        ollama/llama/runner/runner.go:1015 +0x10cd fp=0xc00003bf50 sp=0xc00003bbe0 pc=0x5d853c50e5ad
runtime.main()
        runtime/proc.go:271 +0x29d fp=0xc00003bfe0 sp=0xc00003bf50 pc=0x5d853c2c6c5d
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc00003bfe8 sp=0xc00003bfe0 pc=0x5d853c2f8e61
goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:402 +0xce fp=0xc000054fa8 sp=0xc000054f88 pc=0x5d853c2c708e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.forcegchelper()
        runtime/proc.go:326 +0xb8 fp=0xc000054fe0 sp=0xc000054fa8 pc=0x5d853c2c6f18
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000054fe8 sp=0xc000054fe0 pc=0x5d853c2f8e61
created by runtime.init.6 in goroutine 1
        runtime/proc.go:314 +0x1a
goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:402 +0xce fp=0xc000055780 sp=0xc000055760 pc=0x5d853c2c708e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.bgsweep(0xc000022150)
        runtime/mgcsweep.go:278 +0x94 fp=0xc0000557c8 sp=0xc000055780 pc=0x5d853c2b1bd4
runtime.gcenable.gowrap1()
        runtime/mgc.go:203 +0x25 fp=0xc0000557e0 sp=0xc0000557c8 pc=0x5d853c2a6705
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000557e8 sp=0xc0000557e0 pc=0x5d853c2f8e61
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:203 +0x66
goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc000022150?, 0x5d853c58e1e8?, 0x1?, 0x0?, 0xc000007340?)
        runtime/proc.go:402 +0xce fp=0xc000055f78 sp=0xc000055f58 pc=0x5d853c2c708e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.(*scavengerState).park(0x5d853c9e0680)
        runtime/mgcscavenge.go:425 +0x49 fp=0xc000055fa8 sp=0xc000055f78 pc=0x5d853c2af5c9
runtime.bgscavenge(0xc000022150)
        runtime/mgcscavenge.go:653 +0x3c fp=0xc000055fc8 sp=0xc000055fa8 pc=0x5d853c2afb5c
runtime.gcenable.gowrap2()
        runtime/mgc.go:204 +0x25 fp=0xc000055fe0 sp=0xc000055fc8 pc=0x5d853c2a66a5
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000055fe8 sp=0xc000055fe0 pc=0x5d853c2f8e61
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:204 +0xa5
goroutine 18 gp=0xc000102700 m=nil [finalizer wait]:
runtime.gopark(0xc000054648?, 0x5d853c29a005?, 0xa8?, 0x1?, 0xc0000061c0?)
        runtime/proc.go:402 +0xce fp=0xc000054620 sp=0xc000054600 pc=0x5d853c2c708e
runtime.runfinq()
        runtime/mfinal.go:194 +0x107 fp=0xc0000547e0 sp=0xc000054620 pc=0x5d853c2a5747
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000547e8 sp=0xc0000547e0 pc=0x5d853c2f8e61
created by runtime.createfing in goroutine 1
        runtime/mfinal.go:164 +0x3d
goroutine 34 gp=0xc00021e000 m=nil [select]:
runtime.gopark(0xc000265a28?, 0x2?, 0x10?, 0x81?, 0xc0002657ec?)
        runtime/proc.go:402 +0xce fp=0xc000265660 sp=0xc000265640 pc=0x5d853c2c708e
runtime.selectgo(0xc000265a28, 0xc0002657e8, 0x20?, 0x0, 0x1?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc000265780 sp=0xc000265660 pc=0x5d853c2d8465
main.(*Server).completion(0xc000146120, {0x5d853c816710, 0xc000228540}, 0xc0002205a0)
        ollama/llama/runner/runner.go:698 +0xa86 fp=0xc000265ab8 sp=0xc000265780 pc=0x5d853c50bb86
main.(*Server).completion-fm({0x5d853c816710?, 0xc000228540?}, 0x5d853c4e5f0d?)
        <autogenerated>:1 +0x36 fp=0xc000265ae8 sp=0xc000265ab8 pc=0x5d853c50f056
net/http.HandlerFunc.ServeHTTP(0xc00011edd0?, {0x5d853c816710?, 0xc000228540?}, 0x10?)
        net/http/server.go:2171 +0x29 fp=0xc000265b10 sp=0xc000265ae8 pc=0x5d853c4de9a9
net/http.(*ServeMux).ServeHTTP(0x5d853c29a005?, {0x5d853c816710, 0xc000228540}, 0xc0002205a0)
        net/http/server.go:2688 +0x1ad fp=0xc000265b60 sp=0xc000265b10 pc=0x5d853c4e082d
net/http.serverHandler.ServeHTTP({0x5d853c815a60?}, {0x5d853c816710?, 0xc000228540?}, 0x6?)
        net/http/server.go:3142 +0x8e fp=0xc000265b90 sp=0xc000265b60 pc=0x5d853c4e184e
net/http.(*conn).serve(0xc000218000, {0x5d853c816b68, 0xc00011cdb0})
        net/http/server.go:2044 +0x5e8 fp=0xc000265fb8 sp=0xc000265b90 pc=0x5d853c4dd5e8
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3290 +0x28 fp=0xc000265fe0 sp=0xc000265fb8 pc=0x5d853c4e1fc8
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000265fe8 sp=0xc000265fe0 pc=0x5d853c2f8e61
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3290 +0x4b4
goroutine 40 gp=0xc00021e1c0 m=nil [IO wait]:
runtime.gopark(0x10?, 0x10?, 0xf0?, 0x5?, 0xb?)
        runtime/proc.go:402 +0xce fp=0xc0002305a8 sp=0xc000230588 pc=0x5d853c2c708e
runtime.netpollblock(0x5d853c32d5d8?, 0x3c28fba6?, 0x85?)
        runtime/netpoll.go:573 +0xf7 fp=0xc0002305e0 sp=0xc0002305a8 pc=0x5d853c2bf2d7
internal/poll.runtime_pollWait(0x7081bfc46e58, 0x72)
        runtime/netpoll.go:345 +0x85 fp=0xc000230600 sp=0xc0002305e0 pc=0x5d853c2f3b25
internal/poll.(*pollDesc).wait(0xc000216000?, 0xc00008a041?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000230628 sp=0xc000230600 pc=0x5d853c343a47
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000216000, {0xc00008a041, 0x1, 0x1})
        internal/poll/fd_unix.go:164 +0x27a fp=0xc0002306c0 sp=0xc000230628 pc=0x5d853c34459a
net.(*netFD).Read(0xc000216000, {0xc00008a041?, 0xc000230748?, 0x5d853c2f5750?})
        net/fd_posix.go:55 +0x25 fp=0xc000230708 sp=0xc0002306c0 pc=0x5d853c3b2a25
net.(*conn).Read(0xc00020e008, {0xc00008a041?, 0x0?, 0x5d853ca409c0?})
        net/net.go:185 +0x45 fp=0xc000230750 sp=0xc000230708 pc=0x5d853c3bcce5
net.(*TCPConn).Read(0x5d853c9a3050?, {0xc00008a041?, 0x0?, 0x0?})
        <autogenerated>:1 +0x25 fp=0xc000230780 sp=0xc000230750 pc=0x5d853c3c86c5
net/http.(*connReader).backgroundRead(0xc00008a030)
        net/http/server.go:681 +0x37 fp=0xc0002307c8 sp=0xc000230780 pc=0x5d853c4d7557
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:677 +0x25 fp=0xc0002307e0 sp=0xc0002307c8 pc=0x5d853c4d7485
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0002307e8 sp=0xc0002307e0 pc=0x5d853c2f8e61
created by net/http.(*connReader).startBackgroundRead in goroutine 34
        net/http/server.go:677 +0xba
rax    0x0
rbx    0x70815f400640
rcx    0x7081bbe969fc
rdx    0x6
rdi    0x5c0b
rsi    0x5c0d
rbp    0x5c0d
rsp    0x70815f3fefa0
r8     0x70815f3ff070
r9     0x0
r10    0x8
r11    0x246
r12    0x6
r13    0x16
r14    0x7081be53ed6c
r15    0xffffd556aa790000
rip    0x7081bbe969fc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
[GIN] 2024/12/27 - 09:48:29 | 200 |    3.3945751s |  192.168.100.21 | POST     "/api/chat"
[GIN] 2024/12/27 - 09:48:30 | 200 |     356.158µs |  192.168.100.21 | GET      "/api/tags"
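Since the actual failure is a single line buried between the model metadata and the goroutine dump, a small filter can pull it out of a saved log. This is only a sketch: it assumes the glibc-style `prog: file:line: function: Assertion ... failed.` layout seen above, and `AssertionFailure` and `find_assertion_failures` are hypothetical names.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class AssertionFailure(NamedTuple):
    program: str
    source_file: str
    line: int
    function: str
    expression: str


# Matches lines such as:
#   prog: /path/to/file.cpp:439: <function signature>: Assertion `expr' failed.
# search() is used so that a container-name prefix before the program name is tolerated.
_ASSERT_RE = re.compile(
    r"(?P<prog>[\w.-]+): (?P<file>/[^:\s]+):(?P<line>\d+): "
    r"(?P<func>.+): Assertion `(?P<expr>.+)' failed\."
)


def find_assertion_failures(lines: Iterable[str]) -> Iterator[AssertionFailure]:
    """Yield one record per assertion-failure line found in the log."""
    for text in lines:
        match = _ASSERT_RE.search(text)
        if match:
            yield AssertionFailure(
                match["prog"],
                match["file"],
                int(match["line"]),
                match["func"],
                match["expr"],
            )
```

Feeding it the lines of a saved server log (for example, an open file object) would yield the `sdp_xmx_kernel.cpp:439` failure without the surrounding stack trace.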

@ajatprabha
Author

@qiuxin2012 Is there something else that can be looked at?

@qiuxin2012
Contributor

> @qiuxin2012 Is there something else that can be looked at?

Happy New Year, and sorry for the late reply. I can reproduce your error, and I'm fixing it.

@ajatprabha
Author

HNY @qiuxin2012! No problem, this isn't urgent on my end. I just wanted to follow up on whether this is still an issue in the library or a problem with my system. Thanks for the update!

@qiuxin2012
Contributor

@ajatprabha I have tested llama3.2:3b with ipex-llm 2.2.0b20250102, and it works fine now. Please try again.

@tklengyel

tklengyel commented Jan 17, 2025

I'm still seeing this same crash with llama3.2:3b on ipex-llm 2.2.0b20250116, so it either wasn't fully resolved or has regressed.

Edit: if I set OLLAMA_INTEL_GPU=1, it forces ollama to use the integrated GPU and I don't see the crash. However, with OLLAMA_INTEL_GPU=0 (the default for the intelanalytics/ipex-llm-inference-cpp-xpu image), the crash occurs:

ollama-intel-gpu  | found 2 SYCL devices:
ollama-intel-gpu  | |  |                   |                                       |       |Max    |        |Max  |Global |                     |
ollama-intel-gpu  | |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
ollama-intel-gpu  | |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
ollama-intel-gpu  | |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
ollama-intel-gpu  | | 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.6|    512|    1024|   32| 16225M|            1.3.31294|
ollama-intel-gpu  | | 1| [level_zero:gpu:1]|                 Intel UHD Graphics 770|    1.6|     32|     512|   32| 62690M|            1.3.31294|
ollama-intel-gpu  | llama_kv_cache_init:      SYCL0 KV buffer size =   192.00 MiB
ollama-intel-gpu  | llama_kv_cache_init:      SYCL1 KV buffer size =   704.00 MiB
ollama-intel-gpu  | llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
ollama-intel-gpu  | llama_new_context_with_model:  SYCL_Host  output buffer size =     2.00 MiB
ollama-intel-gpu  | llama_new_context_with_model:      SYCL0 compute buffer size =    66.00 MiB
ollama-intel-gpu  | llama_new_context_with_model:      SYCL1 compute buffer size =   256.50 MiB
ollama-intel-gpu  | llama_new_context_with_model:  SYCL_Host compute buffer size =    22.01 MiB
ollama-intel-gpu  | llama_new_context_with_model: graph nodes  = 790
ollama-intel-gpu  | llama_new_context_with_model: graph splits = 3
ollama-intel-gpu  | time=2025-01-17T08:24:34.410+08:00 level=WARN source=runner.go:894 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
ollama-intel-gpu  | time=2025-01-17T08:24:35.347+08:00 level=INFO source=server.go:619 msg="llama runner started in 2.51 seconds"
ollama-intel-gpu  | ollama_llama_server: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.

I would obviously prefer to use the Arc card instead of the integrated one.

@qiuxin2012
Contributor

qiuxin2012 commented Jan 17, 2025

@tklengyel Could you share your OS and CPU info?
To select the GPU, can you try ONEAPI_DEVICE_SELECTOR instead of OLLAMA_INTEL_GPU? See run-ollama-serve for usage.
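As a rough illustration of the suggestion above, the sketch below reads the SYCL device table from a log like the one posted earlier and builds a `backend:index` value for ONEAPI_DEVICE_SELECTOR. The helper name `select_device` is hypothetical, and the run-ollama-serve doc remains the authoritative reference for how the variable should be set.

```python
import re
from typing import Iterable, Optional

# Matches a device row of the table printed by ggml_check_sycl, for example:
#   | 0| [level_zero:gpu:0]|   Intel Arc A770 Graphics|    1.6| ...
# search() is used so a container-name prefix before the row is tolerated.
_ROW_RE = re.compile(
    r"\|\s*(?P<id>\d+)\|\s*\[(?P<backend>\w+):(?P<kind>\w+):(?P<index>\d+)\]\|"
    r"\s*(?P<name>[^|]+?)\s*\|"
)


def select_device(lines: Iterable[str], name_part: str) -> Optional[str]:
    """Return 'backend:index' for the first device whose name contains
    `name_part` (case-insensitive), or None if no row matches."""
    wanted = name_part.lower()
    for text in lines:
        match = _ROW_RE.search(text)
        if match and wanted in match["name"].lower():
            return f"{match['backend']}:{match['index']}"
    return None
```

With the two-device table posted above, asking for "Arc" gives `level_zero:0`, which would then be exported as ONEAPI_DEVICE_SELECTOR before starting the server.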

@ajatprabha
Author

I wasn't able to verify 2.2.0b20250102 either, because I started getting linker errors after upgrading. I plan to try a clean install but haven't had the chance yet.
