Bug: Vulkan, I-quants partially working since PR #6210 (very slow, only with all repeating layers offloaded) #7976

Closed · opened by stduhpf on Jun 17, 2024 · 1 comment · Fixed by #7977
Labels: bug-unconfirmed, low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)

stduhpf (Contributor) commented on Jun 17, 2024

What happened?

I-quants suddenly started working on the Vulkan backend after #6210 was merged, albeit at very slow speeds (token generation is even slower than with a single CPU thread).

However, it only works if at least all layers except the last one (i.e. all the repeating layers) are offloaded to the GPU. Anything else (even `-ngl 0`) crashes with `GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03`.
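For reference, here is a minimal sketch of what that assert checks (hypothetical, simplified C++, not the actual ggml-vulkan.cpp code): assuming `d_X` is the device buffer that receives the source matrix, `x_sz` the size of one 2D slice, and `ne02`/`ne03` the higher (batch) dimensions of the tensor, the buffer has to be large enough for every slice.

```cpp
// Hypothetical, simplified illustration of the failing check; not the real
// ggml-vulkan.cpp code. A matmul source with batch dimensions ne02 and ne03
// needs a device buffer of at least ne02 * ne03 slices of x_sz bytes each.
#include <cassert>
#include <cstddef>
#include <cstdint>

struct vk_buffer_sketch {
    size_t size; // bytes allocated in VRAM
};

static void check_src_buffer(const vk_buffer_sketch *d_X, size_t x_sz,
                             int64_t ne02, int64_t ne03) {
    // In the report, runs with -ngl below the repeating-layer count hit the
    // failing side of this condition.
    assert(d_X->size >= x_sz * (size_t) ne02 * (size_t) ne03);
}

int main() {
    vk_buffer_sketch buf{1024};
    check_src_buffer(&buf, /*x_sz=*/256, /*ne02=*/2, /*ne03=*/1); // 512 <= 1024: fine
    // check_src_buffer(&buf, 256, 8, 1); // 2048 > 1024: this is the failure mode
    return 0;
}
```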

Example llama-bench outputs:

Vulkan (Q6_K):

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64

| model | size | params | backend | ngl | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | pp512 | 512.52 ± 0.18 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | tg512 | 159.35 ± 0.32 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | pp512 | 498.63 ± 0.26 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | tg512 | 141.69 ± 0.38 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 21 | 6 | 32 | pp512 | 462.52 ± 0.19 |
| llama 1B Q6_K | 860.87 MiB | 1.10 B | Vulkan | 21 | 6 | 32 | tg512 | 127.42 ± 0.55 |

build: ba68309d (3163)

Vulkan (IQ4_XS):

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64

| model | size | params | backend | ngl | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | pp512 | 98.00 ± 0.20 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 23 | 6 | 32 | tg512 | 12.60 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | pp512 | 94.57 ± 1.02 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | Vulkan | 22 | 6 | 32 | tg512 | 12.43 ± 0.15 |

`GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03`

CPU (IQ4_XS):

| model | size | params | backend | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 12 | 32 | pp512 | 185.04 ± 4.81 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 12 | 32 | tg512 | 57.17 ± 1.08 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 6 | 32 | pp512 | 127.78 ± 2.52 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 6 | 32 | tg512 | 61.14 ± 1.07 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 1 | 32 | pp512 | 24.71 ± 0.05 |
| llama 1B IQ4_XS - 4.25 bpw | 577.42 MiB | 1.10 B | CPU | 1 | 32 | tg512 | 21.14 ± 0.05 |

build: ba68309d (3163)

Additional info

Vulkan backend built using: `cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_VULKAN=1 -G "Visual Studio 17 2022" -A x64`

The output with I-quants doesn't look broken when it works; it's just far too slow compared to legacy quants or K-quants.

(The current build sha doesn't match any upstream commit because of unrelated local changes on my end, rebased on top of 21be9ca; please disregard it.)

Name and Version

version: 3163 (ba68309d)
built with MSVC 19.39.33523.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

`GGML_ASSERT: C:\[...]\llama.cpp\ggml-vulkan.cpp:3006: d_X->size >= x_sz * ne02 * ne03`
slaren (Collaborator) commented on Jun 17, 2024

The performance is not good because the weights are offloaded to VRAM even if the backend cannot use them, which then causes the weights to be copied to RAM on every evaluation. The assert should be fixed in #7977.
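To make the above concrete, here is a hedged, simplified sketch of that behaviour (generic C++, not the real ggml scheduler or backend API): when the backend holding the offloaded weights has no kernel for their quantization type, the weight data has to be copied back to host memory on every evaluation before the CPU can run the op.

```cpp
// Hypothetical sketch of the behaviour described above; these types and
// functions are illustrative only, not the ggml API.
#include <cstring>
#include <vector>

struct weight_tensor {
    std::vector<unsigned char> device_data; // stand-in for a buffer in VRAM
    bool backend_supports_type;             // e.g. IQ4_XS matmul missing on Vulkan
};

// One layer evaluation. The expensive branch is taken when the weights live in
// VRAM but the device backend has no kernel for their quantization type.
void eval_layer(const weight_tensor &w, std::vector<unsigned char> &host_scratch) {
    if (w.backend_supports_type) {
        // run the matmul directly on the device buffer (fast path)
    } else {
        // Fallback: copy the whole weight tensor from VRAM to host memory so
        // the CPU backend can run the op. Repeating this for every layer on
        // every token is why IQ4_XS generation ends up slower than pure CPU
        // in the benchmarks above.
        host_scratch.resize(w.device_data.size());
        std::memcpy(host_scratch.data(), w.device_data.data(), w.device_data.size());
        // ... CPU matmul on host_scratch ...
    }
}

int main() { return 0; }
```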
