
Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. #9118

Merged: 7 commits, Sep 8, 2024

Conversation

@mtavenrath (Contributor) commented Aug 21, 2024

By overlapping cmdbuffer creation and cmdbuffer submission, the GPU spends less time idle, resulting in a 10% perf increase for stablelm 3B Q8_0 on an RTX 6000 Ada.

master:
| model                          |       size |     params | backend    | ngl | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | ------------: | ---------------: |
| stablelm 3B Q8_0               |   2.77 GiB |     2.80 B | Vulkan     | 100 |    0 |         pp512 |   3255.25 ± 7.62 |
| stablelm 3B Q8_0               |   2.77 GiB |     2.80 B | Vulkan     | 100 |    0 |         tg512 |    133.49 ± 0.42 |

master async:
| model                          |       size |     params | backend    | ngl | mmap |          test |              t/s | speedup |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | ------------: | ---------------: | ------: |
| stablelm 3B Q8_0               |   2.77 GiB |     2.80 B | Vulkan     | 100 |    0 |         pp512 |   3289.29 ± 5.79 |   1.01x |
| stablelm 3B Q8_0               |   2.77 GiB |     2.80 B | Vulkan     | 100 |    0 |         tg512 |    146.49 ± 2.24 |   1.10x |
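
For illustration, the submission strategy can be sketched as follows. This is a minimal standalone sketch with hypothetical helper names (record_node_commands, submit_async, submit_with_fence_and_wait), not the actual ggml-vulkan code:

```cpp
#include <cstdio>
#include <vector>

// Minimal stand-ins for the real ggml/Vulkan types and helpers.
struct Node { int id; };

static void record_node_commands(const Node & n) { std::printf("record node %d\n", n.id); }
static void submit_async()                       { std::printf("submit batch (no fence)\n"); }
static void submit_with_fence_and_wait()         { std::printf("submit final batch + wait on fence\n"); }

// Instead of recording the whole graph into one command buffer and submitting
// once at the end, record smaller batches and submit them early so the GPU can
// start executing while the CPU is still encoding the remaining nodes.
static void compute_graph(const std::vector<Node> & nodes) {
    const size_t nodes_per_submit = 100;  // batch size the PR settled on
    size_t pending = 0;

    for (size_t i = 0; i < nodes.size(); ++i) {
        record_node_commands(nodes[i]);
        ++pending;

        const bool last = (i + 1 == nodes.size());
        if (pending >= nodes_per_submit && !last) {
            submit_async();   // GPU starts working on this batch immediately
            pending = 0;
        }
    }
    submit_with_fence_and_wait();  // only the final submit is synchronized with
}

int main() {
    std::vector<Node> graph(250);
    for (size_t i = 0; i < graph.size(); ++i) graph[i].id = (int) i;
    compute_graph(graph);
}
```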

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 21, 2024
mtavenrath changed the title from "[Draft] Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early." to "Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early." on Aug 21, 2024
@mtavenrath (Contributor, author)

The Vulkan documentation states:

When vkCmdPipelineBarrier is submitted to a queue, it defines a memory dependency between commands that were submitted to the same queue before it, and those submitted to the same queue after it.

Thus there is no need for additional pipeline barriers between multiple submits to the same queue, which is exactly what happens here. The change is ready for review.
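
For illustration, a minimal sketch of the pattern this enables, assuming cmd_a already ends with the vkCmdPipelineBarrier that covers the writes cmd_b reads. Plain Vulkan C API, not the backend's actual helpers:

```cpp
#include <vulkan/vulkan.h>

// Submit two command buffers recorded for the same queue. The barrier at the
// end of cmd_a also orders against the commands in cmd_b, because barrier
// scope covers all earlier/later commands on the queue, not just commands in
// the same command buffer, so no extra synchronization is needed between the
// two submits.
void submit_in_two_batches(VkQueue queue, VkCommandBuffer cmd_a, VkCommandBuffer cmd_b, VkFence fence) {
    VkSubmitInfo submit_a = {};
    submit_a.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_a.commandBufferCount = 1;
    submit_a.pCommandBuffers    = &cmd_a;
    vkQueueSubmit(queue, 1, &submit_a, VK_NULL_HANDLE);  // submitted early, no fence

    VkSubmitInfo submit_b = {};
    submit_b.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_b.commandBufferCount = 1;
    submit_b.pCommandBuffers    = &cmd_b;
    vkQueueSubmit(queue, 1, &submit_b, fence);            // only the last submit signals the fence
}
```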

@MaggotHATE (Contributor) commented Aug 21, 2024

Trying to compile on w64devkit (Win10, Vulkan SDK 1.3.280.0, upgrading to 1.3.290.0 doesn't help), getting errors:

base/ggml/ggml-vulkan.cpp:5860:13: error: 'bool ggml_vk_compute_forward(ggml_backend_vk_context*, ggml_tensor*, int, bool)' was declared 'extern' and later 'static' [-fpermissive]
 5860 | static bool ggml_vk_compute_forward(ggml_backend_vk_context * ctx, ggml_tensor * tensor, int tensor_idx, bool use_fence = true){
      |             ^~~~~~~~~~~~~~~~~~~~~~~
base/ggml/ggml-vulkan.cpp:5620:6: note: previous declaration of 'bool ggml_vk_compute_forward(ggml_backend_vk_context*, ggml_tensor*, int, bool)'
 5620 | bool ggml_vk_compute_forward(ggml_backend_vk_context* ctx, ggml_tensor* tensor, int tensor_idx, bool use_fence);
      |      ^~~~~~~~~~~~~~~~~~~~~~~
base/ggml/ggml-vulkan.cpp: In function 'bool ggml_vk_compute_forward(ggml_backend_vk_context*, ggml_tensor*, int, bool)':
base/ggml/ggml-vulkan.cpp:5932:31: error: operands to '?:' have different types 'vk::Fence' and 'VkFence' {aka 'VkFence_T*'}
 5932 |     VkFence fence = use_fence ? ctx->fence : VkFence{};
      |                     ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
base/ggml/ggml-vulkan.cpp:5932:31: note:   and each type can be converted to the other

While the declaration errors can be fixed by moving `static bool ggml_vk_compute_forward(...)` above `static bool ggml_vk_build_graph(...)`, I'm not sure how to resolve the VkFence one.

@mtavenrath (Contributor, author)

I fixed the forward declaration and type of the fence.
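
A sketch of what such a fix can look like, based on the declarations in the error output above (a hypothetical sketch, not necessarily the exact commit):

```cpp
// Hypothetical sketch based on the errors above, not the exact change:

// 1. The forward declaration must also be `static` so it matches the later
//    static definition, and the default argument belongs on the declaration:
static bool ggml_vk_compute_forward(ggml_backend_vk_context * ctx, ggml_tensor * tensor, int tensor_idx, bool use_fence = true);

// 2. Keeping both branches of the conditional as the same type avoids the
//    mixed vk::Fence / VkFence operands flagged by the compiler:
vk::Fence fence = use_fence ? ctx->fence : vk::Fence{};
```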

@MaggotHATE (Contributor) commented Aug 21, 2024

@mtavenrath Thank you! It compiles now; however, it crashes after:

ggml_vulkan memory: ggml_vk_preallocate_buffers(x_size: 0)
ggml_vulkan memory: ggml_vk_preallocate_buffers(y_size: 0)
ggml_vulkan memory: ggml_vk_preallocate_buffers(split_k_size: 0)
ggml_backend_vk_graph_compute: error: failed to enqueue cmdbuffer

Testing setup:

  • GPU - 1060 3GB
  • model - Mistral-Nemo-Instruct-2407.q5_k.gguf
  • n_gpu_layers = 2 (can fit more, just for testing)
  • flash_attn = false
  • no_kv_offload = true

Previous version works correctly on these settings and model.

Update: added the relevant portion of the debugging info: report_cmdbuffer.txt

0cc4m self-requested a review on August 22, 2024 at 13:06
@mtavenrath (Contributor, author)

@MaggotHATE I've rewritten the portion of the code that handles the 'last node' of a graph, which was causing the failures you mentioned. I'm no longer able to reproduce the crash with no_kv_offload = true.

@MaggotHATE (Contributor)

@mtavenrath works now, thank you! Tested with DonutHole-8x7B.Q4_K_S.gguf, can fit 2 layers.

MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Aug 22, 2024
* the original oobabooga/text-generation-webui#6335 calls for 2 minimum
* some older commits from llama.cpp I forgot earlier
* Vulcan commit ggerganov/llama.cpp#9118
* fixed wrong `xtc_probability_once` type
@MaggotHATE (Contributor)

Everything works correctly so far; prompt processing is very good now, especially with short prompts.

Unrelated, but I forgot to mention earlier: the slow initialization is back. It takes quite a lot of time on each launch and barely uses the GPU. I believe it was fixed in befddd0, but I'm not sure when exactly it regressed. It's definitely not a RAM issue now since I've upgraded to 64GB, and VRAM utilization is also low.

MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Aug 27, 2024
* "fix llama3.1 rope_freqs" commit
mofosyne added the Review Complexity: Low label (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) on Aug 30, 2024
@0cc4m (Collaborator) commented Sep 8, 2024

@MaggotHATE Long initialization means it's compiling the shaders to driver-internal representations. That should only happen once after updates (shader changes) and after driver updates; if it happens on each launch, there's an issue with your driver.

@MaggotHATE (Contributor)

@0cc4m Is there a way to debug that? I'm pretty sure it was fixed at some point on the same machine I use now, with no driver update since then. The only change I can think of is switching to the iGPU as the main display device and updating its driver, but Vulkan detects and uses the discrete GPU, not the iGPU (judging from the logs, at least).

@0cc4m (Collaborator) commented Sep 8, 2024

@MaggotHATE You can build with GGML_VULKAN_DEBUG and check which calls are taking so long.

@mtavenrath (Contributor, author)

I'd be quite surprised if this has been fixed unless someone added a pipeline cache. On NVIDIA drivers the pipelines are cached by the driver, so there is only a one-time cost after each driver change or shader change. On Intel systems there doesn't seem to be a driver cache; I have to wait a few seconds on each launch while the shaders are recompiled.

This time could be reduced by adding a pipeline cache, compiling only the pipelines that are actually required, and/or using multiple threads for pipeline compilation.
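
For illustration, a manual on-disk pipeline cache could look roughly like this. This is a hypothetical sketch, not existing backend code; the function names and error handling are made up:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// Load a previously saved pipeline cache blob (empty on the first run) and
// hand it back to the driver, so shaders only need a full compile once per
// driver/shader change.
VkPipelineCache load_pipeline_cache(VkDevice device, const char * path) {
    std::vector<char> blob;
    if (FILE * f = std::fopen(path, "rb")) {
        std::fseek(f, 0, SEEK_END);
        long len = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
        if (len > 0) {
            blob.resize((size_t) len);
            if (std::fread(blob.data(), 1, blob.size(), f) != blob.size()) {
                blob.clear();  // treat a short read as "no cache"
            }
        }
        std::fclose(f);
    }
    VkPipelineCacheCreateInfo info = {};
    info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = blob.size();
    info.pInitialData    = blob.empty() ? nullptr : blob.data();
    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache;  // pass this cache to every vkCreateComputePipelines call
}

// Ask the driver for the cache contents and write them to disk for the next launch.
void save_pipeline_cache(VkDevice device, VkPipelineCache cache, const char * path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    if (FILE * f = std::fopen(path, "wb")) {
        std::fwrite(blob.data(), 1, size, f);
        std::fclose(f);
    }
}
```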

@0cc4m (Collaborator) commented Sep 8, 2024

AMD and Intel also cache the pipelines, at least on Linux. But multithreading the pipeline compilation and adding a manual cache would be good.

@MaggotHATE (Contributor)

build with GGML_VULKAN_DEBUG and check which calls are taking so long

It seems like the shaders are compiled over and over again. From the log:

...
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 12.25 B
llm_load_print_meta: model size       = 9.68 GiB (6.79 BPW)
llm_load_print_meta: general.name     = Mistral Nemo Instruct 2407
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1196 'Ä'
llm_load_print_meta: max token length = 150
ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(1)
Vulkan0: NVIDIA GeForce GTX 1060 3GB (NVIDIA) | uma: 0 | fp16: 0 | warp size: 32
ggml_vk_get_device(0)
Initializing new vk_device
ggml_vk_find_queue_family_index()
ggml_vk_find_queue_family_index()
ggml_vk_create_queue()
ggml_vk_load_shaders(NVIDIA GeForce GTX 1060 3GB)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
...

And it goes on each time I start my program. I suspect it has something to do with my Intel iGPU being the primary device. However, this is mainline Vulkan as I haven't got the time to merge this PR into the recent updates.

@0cc4m (Collaborator) commented Sep 8, 2024

And it goes on each time I start my program. I suspect it has something to do with my Intel iGPU being the primary device. However, this is mainline Vulkan as I haven't got the time to merge this PR into the recent updates.

No, it shouldn't be related to your Intel iGPU. But let's not sidetrack this PR; you could open a separate issue for the compilation time, and at a future point I or someone else can implement one of the improvements mtavenrath proposed.

0cc4m merged commit daa9623 into ggerganov:master on Sep 8, 2024 (49 of 52 checks passed)
@jeffbolznv (Collaborator)

But multithreading the pipeline compilation and adding a manual cache would be good.

I've created ggerganov/ggml#963 to multithread Vulkan pipeline compilation.
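
For reference, the basic shape of multithreaded pipeline creation. This is a hypothetical sketch with made-up names, not the code from ggerganov/ggml#963, and it assumes the per-pipeline creation jobs don't share a non-thread-safe pipeline cache:

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Farm independent pipeline-creation jobs out to worker threads. Each job
// would wrap one pipeline creation call (e.g. one vkCreateComputePipelines).
void compile_pipelines_parallel(const std::vector<std::function<void()>> & create_jobs) {
    const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&]() {
            // Each worker atomically grabs the index of the next job until none are left.
            for (size_t i = next.fetch_add(1); i < create_jobs.size(); i = next.fetch_add(1)) {
                create_jobs[i]();
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```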

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
…by submitting smaller cmdbuffers early. (ggerganov#9118)

* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
…by submitting smaller cmdbuffers early. (ggerganov#9118)

* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
…by submitting smaller cmdbuffers early. (ggerganov#9118)

* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
jeffbolznv added a commit to jeffbolznv/llama.cpp that referenced this pull request Nov 25, 2024
This is an incremental improvement over ggerganov#9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
0cc4m pushed a commit that referenced this pull request Nov 29, 2024
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
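
A hypothetical sketch of such a ramp-up policy (illustrative values and names only, not the actual implementation):

```cpp
#include <algorithm>
#include <cstdio>

// Pick the size of the next early-submit batch: start small so the GPU gets
// its first work quickly, then double toward the steady-state cap.
static int next_submit_size(int current) {
    const int cap = 100;                  // steady-state nodes per submit
    return std::min(current * 2, cap);
}

int main() {
    const int n_nodes = 500;              // pretend graph size
    int batch = 10;                       // small initial batch (illustrative)
    for (int done = 0; done < n_nodes; ) {
        const int count = std::min(batch, n_nodes - done);
        std::printf("submit %d nodes (nodes %d..%d)\n", count, done, done + count - 1);
        done += count;
        batch = next_submit_size(batch);  // 10, 20, 40, 80, then 100 per submit
    }
    return 0;
}
```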