Bug: Vulkan, I-quants partially working since PR #6210 (very slow, only with all repeating layers offloaded) #7976

Closed · opened by stduhpf on Jun 17, 2024 · 1 comment · Fixed by #7977
Labels: bug-unconfirmed, low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)

stduhpf (Contributor) commented on Jun 17, 2024

What happened?

I-quants suddenly started working on the Vulkan backend after #6210 was merged, albeit at very slow speeds (token generation is even slower than with a single CPU thread).

However, it only works if at least all layers except the last one (i.e. all the repeating layers) are offloaded to the GPU. Anything else (even `-ngl 0`) crashes with `GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03`.
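For reference, here is a minimal sketch of what that assert checks (hypothetical, simplified C++, not the actual ggml-vulkan.cpp code): assuming `d_X` is the device buffer that receives the source matrix, `x_sz` the size of one 2D slice, and `ne02`/`ne03` the higher (batch) dimensions of the tensor, the buffer has to be large enough for every slice.

```cpp
// Hypothetical, simplified illustration of the failing check; not the real
// ggml-vulkan.cpp code. A matmul source with batch dimensions ne02 and ne03
// needs a device buffer of at least ne02 * ne03 slices of x_sz bytes each.
#include <cassert>
#include <cstddef>
#include <cstdint>

struct vk_buffer_sketch {
    size_t size; // bytes allocated in VRAM
};

static void check_src_buffer(const vk_buffer_sketch *d_X, size_t x_sz,
                             int64_t ne02, int64_t ne03) {
    // In the report, runs with -ngl below the repeating-layer count hit the
    // failing side of this condition.
    assert(d_X->size >= x_sz * (size_t) ne02 * (size_t) ne03);
}

int main() {
    vk_buffer_sketch buf{1024};
    check_src_buffer(&buf, /*x_sz=*/256, /*ne02=*/2, /*ne03=*/1); // 512 <= 1024: fine
    // check_src_buffer(&buf, 256, 8, 1); // 2048 > 1024: this is the failure mode
    return 0;
}
```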

Example llama-bench outputs:

Vulkan (Q6_K):

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64

| model | size | params | backend | ngl | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | pp512 | 512.52 ± 0.18 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | tg512 | 159.35 ± 0.32 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | pp512 | 498.63 ± 0.26 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | tg512 | 141.69 ± 0.38 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 21 | 6 | 32 | pp512 | 462.52 ± 0.19 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 21 | 6 | 32 | tg512 | 127.42 ± 0.55 |

build: ba68309d (3163)

Vulkan (IQ4_XS):

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64

| model | size | params | backend | ngl | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | pp512 | 98.00 ± 0.20 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | tg512 | 12.60 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | pp512 | 94.57 ± 1.02 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | tg512 | 12.43 ± 0.15 |

`GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03`

CPU (IQ4_XS):

| model | size | params | backend | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 12 | 32 | pp512 | 185.04 ± 4.81 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 12 | 32 | tg512 | 57.17 ± 1.08 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 6 | 32 | pp512 | 127.78 ± 2.52 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 6 | 32 | tg512 | 61.14 ± 1.07 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 1 | 32 | pp512 | 24.71 ± 0.05 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 1 | 32 | tg512 | 21.14 ± 0.05 |

build: ba68309d (3163)

Additional info

Vulkan backend built using: `cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_VULKAN=1 -G "Visual Studio 17 2022" -A x64`

The output with I-quants doesn't look broken when it works; it's just far too slow compared to legacy quants or K-quants.

(The current build sha doesn't match any upstream commit because of unrelated local changes on my end, rebased on top of 21be9ca; please disregard it.)

Name and Version

version: 3163 (ba68309d)
built with MSVC 19.39.33523.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

`GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03`
slaren (Collaborator) commented on Jun 17, 2024

The performance is not good because the weights are offloaded to VRAM even if the backend cannot use them, which then causes the weights to be copied to RAM on every evaluation. The assert should be fixed in #7977.
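To make the above concrete, here is a hedged, simplified sketch of that behaviour (generic C++, not the real ggml scheduler or backend API): when the backend holding the offloaded weights has no kernel for their quantization type, the weight data has to be copied back to host memory on every evaluation before the CPU can run the op.

```cpp
// Hypothetical sketch of the behaviour described above; these types and
// functions are illustrative only, not the ggml API.
#include <cstring>
#include <vector>

struct weight_tensor {
    std::vector<unsigned char> device_data; // stand-in for a buffer in VRAM
    bool backend_supports_type;             // e.g. IQ4_XS matmul missing on Vulkan
};

// One layer evaluation. The expensive branch is taken when the weights live in
// VRAM but the device backend has no kernel for their quantization type.
void eval_layer(const weight_tensor &w, std::vector<unsigned char> &host_scratch) {
    if (w.backend_supports_type) {
        // run the matmul directly on the device buffer (fast path)
    } else {
        // Fallback: copy the whole weight tensor from VRAM to host memory so
        // the CPU backend can run the op. Repeating this for every layer on
        // every token is why IQ4_XS generation ends up slower than pure CPU
        // in the benchmarks above.
        host_scratch.resize(w.device_data.size());
        std::memcpy(host_scratch.data(), w.device_data.data(), w.device_data.size());
        // ... CPU matmul on host_scratch ...
    }
}

int main() { return 0; }
```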
