Refactor Vulkan backend to allow multiple contexts #7961

0cc4m · 2024-06-16T18:44:31Z


I reworked how Vulkan backends are handled. This should fix #7575, and hopefully some issues with RAM overuse.

@MaggotHATE Can you check whether this has fixed your issue?

I'll leave this on draft while I run further tests to make sure I didn't break anything.

MaggotHATE · 2024-06-16T19:14:06Z

@0cc4m Looks like it has, thank you! I will run more tests tomorrow, but so far it works on non-MoE models (not just the same one, testing 11b llama3-based in Q5_K_M). Speeds are quite good too.


MaggotHATE · 2024-06-18T07:12:06Z

Maybe I'm spoiled by CLBlast, but so far the Vulkan backend has 2 annoyances:

1. Shader runtime compilation (?) takes too much time (GPU utilization is less than 30%), and it happens quite often, because...
2. ...running models requires a precise number of layers offloaded: one more or one less, and it will either crash or halt, doing nothing.

I don't see other major problems for now, but waiting times are quite long, and it (or Windows?) clears or forgets the cache after some time, so even if I haven't changed anything in settings, I'll have to wait again.

lin72h · 2024-06-18T07:21:06Z

> Shader runtime compilation (?) takes too much time (GPU utilization is less than 30%), and it happens quite often, because...

Thanks for the feedback. Just wondering: is there a way to cache the shaders as SPIR-V binaries to make loading fast?

0cc4m · 2024-06-18T19:01:29Z

> Maybe I'm spoiled by CLBlast, but so far the Vulkan backend has 2 annoyances:
>
> 1. Shader runtime compilation (?) takes too much time (GPU utilization is less than 30%), and it happens quite often, because...
> 2. ...running models requires a precise number of layers offloaded: one more or one less, and it will either crash or halt, doing nothing.
>
> I don't see other major problems for now, but waiting times are quite long, and it (or Windows?) clears or forgets the cache after some time, so even if I haven't changed anything in settings, I'll have to wait again.

Both of those issues are specific to you. I have never had those on any system I tested.

Shader compilation and caching is the responsibility of your driver, and both Linux and Windows drivers do that well in my experience.
It should never halt. It will crash when it runs out of VRAM.

Can you give me more details of your system? It doesn't seem to behave as expected.

MaggotHATE · 2024-06-18T19:53:20Z

> Can you give me more details of your system? It doesn't seem to behave as expected

Windows 10, 16GB DDR4, i7 8700, GTX 1060 3GB (536.23). I remember reporting about it first here, so maybe it's Windows 10 specifically. I use both iGPU and 1060 now, so it's not related.

It may be relevant that I use LunarG's Vulkan SDK 1.3.283.

> It should never halt. It will crash when it runs out of VRAM.

It does crash when it's out of VRAM, but it halts when not enough layers are offloaded.

0cc4m · 2024-06-18T20:12:30Z

> > Can you give me more details of your system? It doesn't seem to behave as expected
>
> Windows 10, 16GB DDR4, i7 8700, GTX 1060 3GB (536.23). I remember reporting about it first here, so maybe it's Windows 10 specifically. I use both iGPU and 1060 now, so it's not related.

That seems perfectly normal, yeah. I don't have a Windows Nvidia setup myself, but many people run that. Odd.

> > It should never halt. It will crash when it runs out of VRAM.
>
> It does crash when it's out of VRAM, but it halts when not enough layers are offloaded.

That doesn't make sense, you can run with 0 layers offloaded and it should work just fine. In that case it'll only offload the big matrix multiplications to the GPU. Any number of layers up to your VRAM limit should work.

MaggotHATE · 2024-06-19T06:52:32Z

Shader compilation is almost random: it didn't happen today in a fresh session the first time I ran a model. Not even once so far.

> That doesn't make sense, you can run with 0 layers offloaded and it should work just fine. In that case it'll only offload the big matrix multiplications to the GPU. Any number of layers up to your VRAM limit should work.

Here's the report: vk_report_hermes_halt.txt. The model is Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q6_K.gguf, 8 layers, ctx-size is 8192, n_batch is 4096. It runs perfectly with 9 layers and crashes with 10 layers.

UPD: This is specifically 8 layers that halt - 7 and 6 work.

0cc4m · 2024-06-19T07:23:56Z

> > That doesn't make sense, you can run with 0 layers offloaded and it should work just fine. In that case it'll only offload the big matrix multiplications to the GPU. Any number of layers up to your VRAM limit should work.
>
> Here's the report: vk_report_hermes_halt.txt. The model is Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q6_K.gguf, 8 layers, ctx-size is 8192, n_batch is 4096. It runs perfectly with 9 layers and crashes with 10 layers.
>
> UPD: This is specifically 8 layers that halt - 7 and 6 work.

Can you build and run with validation and vulkan debug enabled and upload a log where it got stuck?

MaggotHATE · 2024-06-19T07:46:21Z

> Can you build and run with validation and vulkan debug enabled and upload a log where it got stuck?

Here's the log: vk_report_validation_hermes_halt.txt

GPU just stays on 21% power, but it doesn't seem to do anything.

UPD: now that I remember, this was an issue with Vulkan backend my friend had on Win 10 back before the new backend system. It was fixed just before moving onto the new backend. In fact, my friend still uses that pre-backend sometimes.

0cc4m · 2024-06-19T07:59:43Z

Can you enable the debug output? LLAMA_VULKAN_DEBUG=1

MaggotHATE · 2024-06-19T08:05:46Z

> Can you enable the debug output? LLAMA_VULKAN_DEBUG=1

I'm getting a compilation error with GGML_VULKAN_DEBUG:

base/ggml-vulkan.cpp:2040:58: error: 'dev_num' was not declared in this scope; did you mean 'dev_t'?
 2040 | VK_LOG_DEBUG("ggml_vk_init(" << ctx->name << ", " << dev_num << ")");

UPD: looks like it should be idx instead of dev_num?

0cc4m · 2024-06-19T08:10:05Z

> > Can you enable the debug output? LLAMA_VULKAN_DEBUG=1
>
> I'm getting compilation error with GGML_VULKAN_DEBUG
>
> base/ggml-vulkan.cpp:2040:58: error: 'dev_num' was not declared in this scope; did you mean 'dev_t'?
>  2040 | VK_LOG_DEBUG("ggml_vk_init(" << ctx->name << ", " << dev_num << ")");

That's my bad, sorry. I forgot to update it after refactoring the function.

If it's just that line, you could delete it. Or you wait until I fix it later today.
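For readers following along, here is a minimal sketch of why a stale identifier like this only breaks the debug build. It assumes the logging macro expands to nothing unless GGML_VULKAN_DEBUG is defined; the macro body, the `vk_context_sketch` type, and the `describe_init` helper are illustrative stand-ins, not the actual code in ggml-vulkan.cpp.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>

// Assumed shape of the debug macro: its argument is only compiled when
// GGML_VULKAN_DEBUG is defined, so a stale name inside it goes unnoticed
// in regular builds.
#ifdef GGML_VULKAN_DEBUG
#define VK_LOG_DEBUG(msg) std::cerr << msg << std::endl
#else
#define VK_LOG_DEBUG(msg) ((void) 0)
#endif

// Hypothetical, heavily reduced context type; only `name` is taken from the thread.
struct vk_context_sketch {
    std::string name;
};

// Hypothetical helper mirroring the failing call site. After the refactor the
// parameter is called `idx`, so the log statement has to use `idx` as well.
std::string describe_init(const vk_context_sketch & ctx, std::size_t idx) {
    VK_LOG_DEBUG("ggml_vk_init(" << ctx.name << ", " << idx << ")");

    // Build the same text unconditionally so the result can be inspected.
    std::ostringstream out;
    out << "ggml_vk_init(" << ctx.name << ", " << idx << ")";
    return out.str();
}
```

Compiling this with `-DGGML_VULKAN_DEBUG` makes the macro argument part of the program, so writing `dev_num` instead of `idx` in the log line would then fail with the same kind of "not declared in this scope" error, while the default build would still compile.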

MaggotHATE · 2024-06-19T08:12:23Z

> If it's just that line, you could delete it. Or you wait until I fix it later today.

I think I fixed it by going with idx. Here's the log: vk_report_debug_memdebug_validation_hermes_halt.txt

0cc4m · 2024-06-19T08:16:57Z

> > If it's just that line, you could delete it. Or you wait until I fix it later today.
>
> I think I fixed it by going with idx. Here's the log: vk_report_debug_memdebug_validation_hermes_halt.txt

Thank you, looks like it gets stuck in a small copy to GPU operation. Weird. I'll take a closer look later, maybe I can figure out what's going on.

0cc4m · 2024-06-19T17:28:13Z

I have no idea why it's stopping there, I think that's some issue with your driver. Your log looks normal up to that point, and I can't reproduce that issue with any of my GPUs.

* Refactor Vulkan backend to allow multiple contexts
* Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs
* Fix Vulkan debug build error

Refactor Vulkan backend to allow multiple contexts

d63aca3

github-actions bot added the "Vulkan" label (Issues specific to the Vulkan backend) on Jun 16, 2024

Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs

0a321fc

mofosyne added the "Review Complexity : High" label (Generally require indepth knowledge of LLMs or GPUs) on Jun 18, 2024

Fix Vulkan debug build error

fad942b

0cc4m marked this pull request as ready for review June 19, 2024 17:28

slaren approved these changes Jun 19, 2024


slaren mentioned this pull request Jun 20, 2024

ggml : remove ggml_task_type and GGML_PERF #8017

Merged

0cc4m merged commit 45c0e2e into master Jun 23, 2024
57 checks passed

0cc4m deleted the 0cc4m/vulkan-backend-context-fix branch June 23, 2024 08:21
