
Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. #9118

Merged: 7 commits, Sep 8, 2024

Conversation

@mtavenrath (Contributor) commented Aug 21, 2024

By overlapping cmdbuffer creation and cmdbuffer submission, the GPU spends less time idle, resulting in a 10% perf increase for stablelm 3B Q8_0 on an RTX 6000 Ada.

master:
| model                          |       size |     params | backend    | ngl | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | ------------: | ---------------: |
| stablelm 3B Q8_0               |   2.77 GiB |     2.80 B | Vulkan     | 100 |    0 |         pp512 |   3255.25 ± 7.62 |
| stablelm 3B Q8_0               |   2.77 GiB |     2.80 B | Vulkan     | 100 |    0 |         tg512 |    133.49 ± 0.42 |

master async:
| model                          |       size |     params | backend    | ngl | mmap |          test |              t/s | speedup |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | ------------: | ---------------: | ------: |
| stablelm 3B Q8_0               |   2.77 GiB |     2.80 B | Vulkan     | 100 |    0 |         pp512 |   3289.29 ± 5.79 |   1.01x |
| stablelm 3B Q8_0               |   2.77 GiB |     2.80 B | Vulkan     | 100 |    0 |         tg512 |    146.49 ± 2.24 |   1.10x |
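
For illustration, the submission strategy can be sketched as follows. This is a minimal standalone sketch with hypothetical helper names (record_node_commands, submit_async, submit_with_fence_and_wait), not the actual ggml-vulkan code:

```cpp
#include <cstdio>
#include <vector>

// Minimal stand-ins for the real ggml/Vulkan types and helpers.
struct Node { int id; };

static void record_node_commands(const Node & n) { std::printf("record node %d\n", n.id); }
static void submit_async()                       { std::printf("submit batch (no fence)\n"); }
static void submit_with_fence_and_wait()         { std::printf("submit final batch + wait on fence\n"); }

// Instead of recording the whole graph into one command buffer and submitting
// once at the end, record smaller batches and submit them early so the GPU can
// start executing while the CPU is still encoding the remaining nodes.
static void compute_graph(const std::vector<Node> & nodes) {
    const size_t nodes_per_submit = 100;  // batch size the PR settled on
    size_t pending = 0;

    for (size_t i = 0; i < nodes.size(); ++i) {
        record_node_commands(nodes[i]);
        ++pending;

        const bool last = (i + 1 == nodes.size());
        if (pending >= nodes_per_submit && !last) {
            submit_async();   // GPU starts working on this batch immediately
            pending = 0;
        }
    }
    submit_with_fence_and_wait();  // only the final submit is synchronized with
}

int main() {
    std::vector<Node> graph(250);
    for (size_t i = 0; i < graph.size(); ++i) graph[i].id = (int) i;
    compute_graph(graph);
}
```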

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 21, 2024
mtavenrath changed the title from "[Draft] Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early." to "Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early." on Aug 21, 2024
@mtavenrath (Contributor, author)

The Vulkan documentation states:

When vkCmdPipelineBarrier is submitted to a queue, it defines a memory dependency between commands that were submitted to the same queue before it, and those submitted to the same queue after it.

Thus there is no need for additional pipeline barriers between multiple submits to the same queue, which is exactly what happens here. The change is ready for review.
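
For illustration, a minimal sketch of the pattern this enables, assuming cmd_a already ends with the vkCmdPipelineBarrier that covers the writes cmd_b reads. Plain Vulkan C API, not the backend's actual helpers:

```cpp
#include <vulkan/vulkan.h>

// Submit two command buffers recorded for the same queue. The barrier at the
// end of cmd_a also orders against the commands in cmd_b, because barrier
// scope covers all earlier/later commands on the queue, not just commands in
// the same command buffer, so no extra synchronization is needed between the
// two submits.
void submit_in_two_batches(VkQueue queue, VkCommandBuffer cmd_a, VkCommandBuffer cmd_b, VkFence fence) {
    VkSubmitInfo submit_a = {};
    submit_a.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_a.commandBufferCount = 1;
    submit_a.pCommandBuffers    = &cmd_a;
    vkQueueSubmit(queue, 1, &submit_a, VK_NULL_HANDLE);  // submitted early, no fence

    VkSubmitInfo submit_b = {};
    submit_b.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_b.commandBufferCount = 1;
    submit_b.pCommandBuffers    = &cmd_b;
    vkQueueSubmit(queue, 1, &submit_b, fence);            // only the last submit signals the fence
}
```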

@MaggotHATE (Contributor) commented Aug 21, 2024

Trying to compile on w64devkit (Win10, Vulkan SDK 1.3.280.0, upgrading to 1.3.290.0 doesn't help), getting errors:

base/ggml/ggml-vulkan.cpp:5860:13: error: 'bool ggml_vk_compute_forward(ggml_backend_vk_context*, ggml_tensor*, int, bool)' was declared 'extern' and later 'static' [-fpermissive]
 5860 | static bool ggml_vk_compute_forward(ggml_backend_vk_context * ctx, ggml_tensor * tensor, int tensor_idx, bool use_fence = true){
      |             ^~~~~~~~~~~~~~~~~~~~~~~
base/ggml/ggml-vulkan.cpp:5620:6: note: previous declaration of 'bool ggml_vk_compute_forward(ggml_backend_vk_context*, ggml_tensor*, int, bool)'
 5620 | bool ggml_vk_compute_forward(ggml_backend_vk_context* ctx, ggml_tensor* tensor, int tensor_idx, bool use_fence);
      |      ^~~~~~~~~~~~~~~~~~~~~~~
base/ggml/ggml-vulkan.cpp: In function 'bool ggml_vk_compute_forward(ggml_backend_vk_context*, ggml_tensor*, int, bool)':
base/ggml/ggml-vulkan.cpp:5932:31: error: operands to '?:' have different types 'vk::Fence' and 'VkFence' {aka 'VkFence_T*'}
 5932 |     VkFence fence = use_fence ? ctx->fence : VkFence{};
      |                     ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
base/ggml/ggml-vulkan.cpp:5932:31: note:   and each type can be converted to the other

While the declaration errors can be fixed by moving `static bool ggml_vk_compute_forward(...)` above `static bool ggml_vk_build_graph(...)`, I'm not sure how to resolve the VkFence one.

@mtavenrath (Contributor, author)

I fixed the forward declaration and type of the fence.
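
A sketch of what such a fix can look like, based on the declarations in the error output above (a hypothetical sketch, not necessarily the exact commit):

```cpp
// Hypothetical sketch based on the errors above, not the exact change:

// 1. The forward declaration must also be `static` so it matches the later
//    static definition, and the default argument belongs on the declaration:
static bool ggml_vk_compute_forward(ggml_backend_vk_context * ctx, ggml_tensor * tensor, int tensor_idx, bool use_fence = true);

// 2. Keeping both branches of the conditional as the same type avoids the
//    mixed vk::Fence / VkFence operands flagged by the compiler:
vk::Fence fence = use_fence ? ctx->fence : vk::Fence{};
```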

@MaggotHATE (Contributor) commented Aug 21, 2024

@mtavenrath Thank you! It compiles now; however, it crashes after:

ggml_vulkan memory: ggml_vk_preallocate_buffers(x_size: 0)
ggml_vulkan memory: ggml_vk_preallocate_buffers(y_size: 0)
ggml_vulkan memory: ggml_vk_preallocate_buffers(split_k_size: 0)
ggml_backend_vk_graph_compute: error: failed to enqueue cmdbuffer

Testing setup:

  • GPU - 1060 3GB
  • model - Mistral-Nemo-Instruct-2407.q5_k.gguf
  • n_gpu_layers = 2 (can fit more, just for testing)
  • flash_attn = false
  • no_kv_offload = true

Previous version works correctly on these settings and model.

Update: added the relevant portion of the debugging info: report_cmdbuffer.txt

0cc4m self-requested a review on August 22, 2024 at 13:06
@mtavenrath (Contributor, author)

@MaggotHATE I've rewritten the portion of the code that handles the 'last node' of a graph, which was causing the failures you mentioned. I'm no longer able to reproduce the crash with no_kv_offload = true.

@MaggotHATE (Contributor)

@mtavenrath works now, thank you! Tested with DonutHole-8x7B.Q4_K_S.gguf, can fit 2 layers.

MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Aug 22, 2024
* the original oobabooga/text-generation-webui#6335 calls for 2 minimum
* some older commits from llama.cpp I forgot earlier
* Vulcan commit ggerganov/llama.cpp#9118
* fixed wrong `xtc_probability_once` type
@MaggotHATE (Contributor)

Everything works correctly so far; prompt processing is very good now, especially with short prompts.

Unrelated, but I forgot to mention earlier: the slow initialization is back. It takes quite a lot of time on each launch and barely uses the GPU. I believe it was fixed in befddd0, but I'm not sure when exactly it regressed. It's definitely not a RAM issue now since I've upgraded to 64GB, and VRAM utilization is also low.

MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Aug 27, 2024
* "fix llama3.1 rope_freqs" commit
mofosyne added the Review Complexity: Low label (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) on Aug 30, 2024
@0cc4m (Collaborator) commented Sep 8, 2024

@MaggotHATE Long initialization means it's compiling the shaders to driver-internal representations. That should only happen once after updates (shader changes) and after driver updates; if it happens on each launch, there's an issue with your driver.

@MaggotHATE (Contributor)

@0cc4m Is there a way to debug that? I'm pretty sure it was fixed at some point on the same machine I use now, with no driver update since then. The only change I can think of is switching to the iGPU as the main display device and updating its driver, but Vulkan detects and uses the discrete GPU, not the iGPU (judging from the logs, at least).

@0cc4m (Collaborator) commented Sep 8, 2024

@MaggotHATE You can build with GGML_VULKAN_DEBUG and check which calls are taking so long.

@mtavenrath (Contributor, author)

I'd be quite surprised if this has been fixed unless someone added a pipeline cache. On NVIDIA drivers the pipelines are cached by the driver, so there is only a one-time cost after each driver change or shader change. On Intel systems there doesn't seem to be a driver cache; I have to wait a few seconds on each launch while the shaders are recompiled.

This time could be reduced by adding a pipeline cache, compiling only the pipelines that are actually required, and/or using multiple threads for pipeline compilation.
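
For illustration, a manual on-disk pipeline cache could look roughly like this. This is a hypothetical sketch, not existing backend code; the function names and error handling are made up:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// Load a previously saved pipeline cache blob (empty on the first run) and
// hand it back to the driver, so shaders only need a full compile once per
// driver/shader change.
VkPipelineCache load_pipeline_cache(VkDevice device, const char * path) {
    std::vector<char> blob;
    if (FILE * f = std::fopen(path, "rb")) {
        std::fseek(f, 0, SEEK_END);
        long len = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
        if (len > 0) {
            blob.resize((size_t) len);
            if (std::fread(blob.data(), 1, blob.size(), f) != blob.size()) {
                blob.clear();  // treat a short read as "no cache"
            }
        }
        std::fclose(f);
    }
    VkPipelineCacheCreateInfo info = {};
    info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = blob.size();
    info.pInitialData    = blob.empty() ? nullptr : blob.data();
    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache;  // pass this cache to every vkCreateComputePipelines call
}

// Ask the driver for the cache contents and write them to disk for the next launch.
void save_pipeline_cache(VkDevice device, VkPipelineCache cache, const char * path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    if (FILE * f = std::fopen(path, "wb")) {
        std::fwrite(blob.data(), 1, size, f);
        std::fclose(f);
    }
}
```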

@0cc4m (Collaborator) commented Sep 8, 2024

AMD and Intel also cache the pipelines, at least on Linux. But multithreading the pipeline compilation and adding a manual cache would be good.

@MaggotHATE (Contributor)

build with GGML_VULKAN_DEBUG and check which calls are taking so long

It seems like the shaders are compiled over and over again. From the log:

...
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 12.25 B
llm_load_print_meta: model size       = 9.68 GiB (6.79 BPW)
llm_load_print_meta: general.name     = Mistral Nemo Instruct 2407
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1196 'Ä'
llm_load_print_meta: max token length = 150
ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(1)
Vulkan0: NVIDIA GeForce GTX 1060 3GB (NVIDIA) | uma: 0 | fp16: 0 | warp size: 32
ggml_vk_get_device(0)
Initializing new vk_device
ggml_vk_find_queue_family_index()
ggml_vk_find_queue_family_index()
ggml_vk_create_queue()
ggml_vk_load_shaders(NVIDIA GeForce GTX 1060 3GB)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(NVIDIA GeForce GTX 1060 3GB, matmul_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
...

And it goes on each time I start my program. I suspect it has something to do with my Intel iGPU being the primary device. However, this is mainline Vulkan as I haven't got the time to merge this PR into the recent updates.

@0cc4m (Collaborator) commented Sep 8, 2024

And it goes on each time I start my program. I suspect it has something to do with my Intel iGPU being the primary device. However, this is mainline Vulkan as I haven't got the time to merge this PR into the recent updates.

No, it shouldn't be related to your Intel iGPU. But let's not sidetrack this PR; you could open a separate issue for the compilation time, and at a future point I or someone else can implement one of the improvements mtavenrath proposed.

0cc4m merged commit daa9623 into ggerganov:master on Sep 8, 2024 (49 of 52 checks passed)
@jeffbolznv (Collaborator)

But multithreading the pipeline compilation and adding a manual cache would be good.

I've created ggerganov/ggml#963 to multithread Vulkan pipeline compilation.
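
For reference, the basic shape of multithreaded pipeline creation. This is a hypothetical sketch with made-up names, not the code from ggerganov/ggml#963, and it assumes the per-pipeline creation jobs don't share a non-thread-safe pipeline cache:

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Farm independent pipeline-creation jobs out to worker threads. Each job
// would wrap one pipeline creation call (e.g. one vkCreateComputePipelines).
void compile_pipelines_parallel(const std::vector<std::function<void()>> & create_jobs) {
    const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&]() {
            // Each worker atomically grabs the index of the next job until none are left.
            for (size_t i = next.fetch_add(1); i < create_jobs.size(); i = next.fetch_add(1)) {
                create_jobs[i]();
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```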

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
…by submitting smaller cmdbuffers early. (ggerganov#9118)

* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
…by submitting smaller cmdbuffers early. (ggerganov#9118)

* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
…by submitting smaller cmdbuffers early. (ggerganov#9118)

* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
jeffbolznv added a commit to jeffbolznv/llama.cpp that referenced this pull request Nov 25, 2024
This is an incremental improvement over ggerganov#9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
0cc4m pushed a commit that referenced this pull request Nov 29, 2024
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
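
A hypothetical sketch of such a ramp-up policy (illustrative values and names only, not the actual implementation):

```cpp
#include <algorithm>
#include <cstdio>

// Pick the size of the next early-submit batch: start small so the GPU gets
// its first work quickly, then double toward the steady-state cap.
static int next_submit_size(int current) {
    const int cap = 100;                  // steady-state nodes per submit
    return std::min(current * 2, cap);
}

int main() {
    const int n_nodes = 500;              // pretend graph size
    int batch = 10;                       // small initial batch (illustrative)
    for (int done = 0; done < n_nodes; ) {
        const int count = std::min(batch, n_nodes - done);
        std::printf("submit %d nodes (nodes %d..%d)\n", count, done, done + count - 1);
        done += count;
        batch = next_submit_size(batch);  // 10, 20, 40, 80, then 100 per submit
    }
    return 0;
}
```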