Bug: Vulkan, I-quants partially working since PR #6210 (very slow, only with all repeating layers offloaded) #7976
Labels
bug-unconfirmed
low severity
What happened?
I-quants suddenly started working on the Vulkan backend after #6210 was merged, albeit at very slow speeds (token generation is even slower than when using a single CPU thread).
However, it only works if at least all layers except the last one (that is, all the "repeating layers") are offloaded to the GPU. With anything else (even -ngl 0), it crashes with GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03
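For readers unfamiliar with the failing check, here is a minimal sketch of what the assertion text appears to require, written only from the expression in the error message. It is not the code from ggml-vulkan.cpp: the TensorShape struct and both helper functions are hypothetical, and treating x_sz as the byte size of one 2D slice with no alignment padding is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a tensor shape; field names follow the
// ne00..ne03 naming visible in the assertion, nothing else is from ggml.
struct TensorShape {
    uint64_t ne00; // elements per row
    uint64_t ne01; // rows per 2D slice
    uint64_t ne02; // number of slices along dimension 2
    uint64_t ne03; // number of slices along dimension 3
};

// Assumption: x_sz is the byte size of one 2D slice, without padding.
uint64_t slice_bytes(const TensorShape & s, uint64_t bytes_per_element) {
    return s.ne00 * s.ne01 * bytes_per_element;
}

// Same form as the failing check: the buffer must be able to hold
// x_sz bytes for every slice along dimensions 2 and 3.
bool buffer_large_enough(uint64_t buffer_size, uint64_t x_sz, const TensorShape & s) {
    return buffer_size >= x_sz * s.ne02 * s.ne03;
}
```

Under this reading, the crash means the buffer behind d_X was smaller than the size the kernel expected for the full tensor; why that happens only when some repeating layers stay on the CPU is not something this sketch explains.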
Example llama-bench outputs:
Vulkan (q6-k):
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
build: ba68309d (3163)
Vulkan (iq4-xs):
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
GGML_ASSERT: C:[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03
CPU (iq4-xs):
build: ba68309d (3163)
Additional info
Vulkan backend built using:
cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_VULKAN=1 -G "Visual Studio 17 2022" -A x64
The output with I-quants doesn't look broken when it works; it's just far too slow compared to legacy or k-quants.
(The current build sha doesn't match any commit because of some unrelated local changes on my end that are rebased on top of 21be9ca, don't mind it)
Name and Version
version: 3163 (ba68309d)
built with MSVC 19.39.33523.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output