Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. #9118
Conversation
The Vulkan documentation states that a pipeline barrier defines a memory dependency between commands submitted to the same queue before it and commands submitted after it, regardless of command buffer boundaries. Thus there is no need for additional pipeline barriers between multiple submits to the same queue, which is what happens here. The change is ready to submit.
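A rough sketch of the idea, with illustrative names rather than the actual backend code: record the graph into several small command buffers and hand each one to the queue as soon as it is recorded, so the GPU starts executing while the CPU is still recording the rest.

```cpp
#include <vulkan/vulkan.h>

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative stand-ins for the real graph node type and recording function.
struct node_t;
void record_node(VkCommandBuffer cmd, const node_t * node);

// Sketch: record the compute graph in small chunks and submit each chunk to the
// queue as soon as it is recorded, so the GPU starts executing while the CPU is
// still recording the rest. Barriers recorded inside the command buffers keep
// ordering the work, because every submit goes to the same queue.
void submit_graph_in_chunks(VkDevice device, VkQueue queue, VkCommandPool pool,
                            const std::vector<node_t *> & nodes, size_t nodes_per_submit,
                            VkFence last_fence /* signaled by the final submit only */) {
    size_t i = 0;
    while (i < nodes.size()) {
        VkCommandBufferAllocateInfo alloc_info = {};
        alloc_info.sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
        alloc_info.commandPool        = pool;
        alloc_info.level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
        alloc_info.commandBufferCount = 1;

        VkCommandBuffer cmd = VK_NULL_HANDLE;
        vkAllocateCommandBuffers(device, &alloc_info, &cmd);  // reclaimed when the pool is reset

        VkCommandBufferBeginInfo begin_info = {};
        begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        begin_info.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
        vkBeginCommandBuffer(cmd, &begin_info);

        // Record the next chunk of the graph into this small command buffer.
        const size_t end = std::min(i + nodes_per_submit, nodes.size());
        for (; i < end; ++i) {
            record_node(cmd, nodes[i]);
        }
        vkEndCommandBuffer(cmd);

        VkSubmitInfo submit_info = {};
        submit_info.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submit_info.commandBufferCount = 1;
        submit_info.pCommandBuffers    = &cmd;

        // Submit immediately; only the last submit signals the fence the caller waits on.
        vkQueueSubmit(queue, 1, &submit_info, i == nodes.size() ? last_fence : VK_NULL_HANDLE);
    }
}
```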
Trying to compile on w64devkit (Win10, Vulkan SDK 1.3.280.0, upgrading to 1.3.290.0 doesn't help), getting errors:
While the first ones can be fixed by moving the forward declaration, the rest remain.
I fixed the forward declaration and the type of the fence.
@mtavenrath Thank you! It compiles now; however, it crashes after:
Testing setup:
The previous version works correctly with these settings and model. Update: added the related portion of the debugging info: report_cmdbuffer.txt
@MaggotHATE I've rewritten the portion of the code which handles the 'last node' of a graph, which caused the failures you mentioned. I'm no longer able to reproduce the crash with no_kv_offload = true.
@mtavenrath works now, thank you! Tested with DonutHole-8x7B.Q4_K_S.gguf, can fit 2 layers. |
* the original oobabooga/text-generation-webui#6335 calls for 2 minimum
* some older commits from llama.cpp I forgot earlier
* Vulkan commit ggerganov/llama.cpp#9118
* fixed wrong `xtc_probability_once` type
Increase submit counter only if actual work has been submitted and increase submit count to 100.
Everything works correctly so far, and prompt processing is very good, especially with short prompts now. Unrelated, but I forgot to mention earlier: tedious initialization is back. It takes quite a lot of time on each launch and barely uses the GPU. I believe it was fixed in befddd0, but I'm not sure when exactly it broke again. It's definitely not a RAM issue now, since I've upgraded to 64GB. VRAM utilization is also low.
* "fix llama3.1 rope_freqs" commit
@MaggotHATE Long initialization means it's compiling the shaders to driver-internal representations. That should only happen once after updates and after driver updates; if it happens on each launch, there's an issue with your driver.
@0cc4m Is there a way to debug that? I'm pretty sure it was fixed at some point on the same machine I use now, and there has been no driver update since then. The only change I can think of is switching to the iGPU as the main display device and updating its driver, but Vulkan detects and uses the external GPU, not the iGPU (judging from the logs, at least).
@MaggotHATE You can build with
I'd be quite surprised if this has been fixed unless someone added a pipeline cache. On NVIDIA drivers the pipelines are cached by the driver, and thus there is only a one-time cost after each driver change / shader change. On Intel systems there doesn't seem to be a driver cache; I have to wait a few seconds on each launch to recompile the shaders. This time could be reduced by adding a pipeline cache, compiling only the pipelines which are required, and/or using multiple threads for pipeline compilation.
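For illustration, such a cache could look roughly like this; it is a sketch rather than existing ggml code, and the helper names and file path are made up:

```cpp
#include <vulkan/vulkan.h>

#include <fstream>
#include <iterator>
#include <vector>

// Load the cache blob from disk (if any) and create a VkPipelineCache from it.
// Stale or missing data simply results in a cold cache.
VkPipelineCache load_pipeline_cache(VkDevice device, const char * path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());

    VkPipelineCacheCreateInfo info = {};
    info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = blob.size();                      // empty -> cold cache
    info.pInitialData    = blob.empty() ? nullptr : blob.data();

    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache;  // pass this handle to every vkCreateComputePipelines call
}

// Serialize the cache back to disk on shutdown so the next launch starts warm.
void save_pipeline_cache(VkDevice device, VkPipelineCache cache, const char * path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);   // query the size first
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    std::ofstream(path, std::ios::binary).write(blob.data(), static_cast<std::streamsize>(size));
}
```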
AMD and Intel also cache the pipelines, at least on Linux. But multithreading the pipeline compilation and adding a manual cache would be good.
It seems like shaders are compiled over and over. From:
And it goes on each time I start my program. I suspect it has something to do with my Intel iGPU being the primary device. However, this is mainline Vulkan, as I haven't had the time to merge this PR into the recent updates.
No, it shouldn't be related to your Intel iGPU. But let's not sidetrack this PR; you could open a separate issue for the compilation time, and at a future point I or someone else can implement one of the improvements mtavenrath proposed.
I've created ggerganov/ggml#963 to multithread Vulkan pipeline compilation.
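For reference, the multithreaded compilation can be sketched like this (illustrative types and function names, not the code from that issue; a real implementation would likely use a bounded thread pool):

```cpp
#include <vulkan/vulkan.h>

#include <future>
#include <vector>

// Illustrative description of one compute pipeline to build; not a ggml type.
struct pipeline_job {
    VkComputePipelineCreateInfo create_info;
    VkPipeline                  pipeline = VK_NULL_HANDLE;
};

// Fan the pipeline builds out over worker threads. A VkPipelineCache created
// without VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT may be used from
// multiple threads concurrently, so each task can compile into the shared cache.
void compile_pipelines_parallel(VkDevice device, VkPipelineCache cache,
                                std::vector<pipeline_job> & jobs) {
    std::vector<std::future<void>> tasks;
    tasks.reserve(jobs.size());

    for (size_t i = 0; i < jobs.size(); ++i) {
        tasks.push_back(std::async(std::launch::async, [&, i]() {
            // Each task writes to its own job entry, so no extra locking is needed here.
            vkCreateComputePipelines(device, cache, 1, &jobs[i].create_info, nullptr, &jobs[i].pipeline);
        }));
    }
    for (auto & t : tasks) {
        t.wait();  // all pipelines are ready once every task has finished
    }
}
```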
Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (ggerganov#9118)

* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.
* fix compile issues
* Fix issues where the last submit wasn't executed or handled properly.
* remove trailing whitespace
* Repair GGML_VULKAN_CHECK_RESULTS
* Increase submit counter only if actual work has been submitted and increase submit count to 100.
* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
This is an incremental improvement over #9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space. With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.
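The ramp-up itself only needs a few lines; the sketch below uses made-up constants (a starting batch of 10, doubling per submit), not the actual patch:

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of the ramp described above: start with a small batch so the GPU gets
// its first work almost immediately, then grow toward a steady state of 100
// nodes per submit. The starting size and doubling schedule are illustrative.
static size_t nodes_in_next_submit(size_t submits_done) {
    size_t batch = 10;  // small first batch -> the first submit happens early
    for (size_t s = 0; s < submits_done; ++s) {
        batch = std::min<size_t>(batch * 2, 100);  // ramp up to 100 nodes/submit
    }
    return batch;
}
```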
By overlapping cmdbuffer creation and cmdbuffer submission, the GPU spends less time idle, resulting in a 10% perf increase for stablelm 3B Q8_0 on an RTX 6000 Ada.