Refactor Vulkan backend to allow multiple contexts #7961
@0cc4m Looks like it has, thank you! I will run more tests tomorrow, but so far it works on non-MoE models (not just the same one; I'm also testing an 11B Llama-3-based model in Q5_K_M). Speeds are quite good too.
Maybe I'm spoiled by CLBlast, but so far the Vulkan backend has two annoyances:
I don't see any other major problems for now, but the waiting times are quite long, and it (or Windows?) clears or forgets the cache after some time, so even if I haven't changed anything in the settings, I have to wait again.
Thanks for the feedback. Just wondering: is there a way to cache the shaders as SPIR-V binaries to make loading faster?
Both of those issues are specific to you. I have never had those on any system I tested.
Can you give me more details about your system? It doesn't seem to behave as expected.
Windows 10, 16GB DDR4, i7 8700, GTX 1060 3GB (536.23). I remember reporting about it first here, so maybe it's Windows 10 specifically. I use both the iGPU and the 1060 now, so it's not related. It may be relevant that I use LunarG's Vulkan SDK 1.3.283.
It does crash when it's out of VRAM, but it halts when not enough layers are offloaded.
That seems perfectly normal, yeah. I don't have a Windows Nvidia setup myself, but many people run that. Odd.
That doesn't make sense, you can run with 0 layers offloaded and it should work just fine. In that case it'll only offload the big matrix multiplications to the GPU. Any number of layers up to your VRAM limit should work.
The shader compilation is almost random: it didn't happen today in a fresh session when running a model for the first time. Not even once so far.
Here's the report: vk_report_hermes_halt.txt. The model is Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q6_K.gguf, 8 layers, ctx-size is 8192, n_batch is 4096. It runs perfectly with 9 layers and crashes with 10 layers. UPD: It is specifically 8 layers that halts; 7 and 6 work.
Can you build and run with validation and Vulkan debug enabled and upload a log from a run where it got stuck?
Here's the log: vk_report_validation_hermes_halt.txt. The GPU just stays at 21% power, but it doesn't seem to do anything. UPD: now that I remember, this was an issue with the Vulkan backend that my friend had on Win 10 back before the new backend system. It was fixed just before the move to the new backend. In fact, my friend still uses that pre-backend version sometimes.
Can you enable the debug output?
I'm getting compilation error with
UPD: looks like it should be
That's my bad, sorry. I forgot to update it after refactoring the function. If it's just that line, you could delete it. Or you can wait until I fix it later today.
I think I fixed it by going with
Thank you, looks like it gets stuck in a small copy-to-GPU operation. Weird. I'll take a closer look later, maybe I can figure out what's going on.
I have no idea why it's stopping there, I think that's some issue with your driver. Your log looks normal up to that point, and I can't reproduce that issue with any of my GPUs.
* Refactor Vulkan backend to allow multiple contexts
* Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs
* Fix Vulkan debug build error
I reworked how Vulkan backends are handled. This should fix #7575, and hopefully some issues with RAM overuse.
@MaggotHATE Can you check whether this has fixed your issue?
I'll leave this on draft while I run further tests to make sure I didn't break anything.