Vulkan Optimizations and Fixes #8959
Conversation
…ng and more
* fixed default sampling queue to include p_step
* changed sampling queue display to better reflect the actual logic
* added VK-specific settings `use_mmap_vk`, `flash_attn_vk`, `no_kv_offload_vk`
* added new presets for testing
I missed a validation issue in #8943, but the fix is now in this branch. I think this is ready for review and merging.
// old (diff context, truncated at the start): plain sum of products
sz * FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 4] & 0x0f) | ((data_a[ib0 + i].scales[v_im] & 0xc0) >> 2)) + sw * FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 5] & 0x0f) | ((data_a[ib0 + i].scales[v_im + 1] & 0xc0) >> 2))) - dmin * smin);
// new: the same accumulation expressed as a nested fma() chain
const uint tmp_idx = 16 * ix + tid;
tmp[tmp_idx] = fma(dall, (fma(sx, FLOAT_TYPE(data_a[ib0 + i].scales[v_im] & 0x3f), fma(sy, FLOAT_TYPE(data_a[ib0 + i].scales[v_im + 1] & 0x3f),
fma(sz, FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 4] & 0x0f) | ((data_a[ib0 + i].scales[v_im] & 0xc0) >> 2)), sw * FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 5] & 0x0f) | ((data_a[ib0 + i].scales[v_im + 1] & 0xc0) >> 2)))))), fma(-dmin, smin, tmp[tmp_idx]));
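The nested fma() calls are dense, so here is the shape of the transformation reduced to a minimal, self-contained sketch (hypothetical variables `s0`..`s3` stand in for the unpacked scales; this is not the actual shader code):

```glsl
#version 450
layout(local_size_x = 1) in;

void main() {
    // Hypothetical stand-ins for the shader's scales, partial sums and accumulator.
    float dall = 1.0, dmin = 0.5, smin = 2.0, acc = 0.0;
    float sx = 1.0, sy = 2.0, sz = 3.0, sw = 4.0;
    float s0 = 1.0, s1 = 2.0, s2 = 3.0, s3 = 4.0;

    // Before: plain sum of products; FMA contraction is left to the optimizer.
    acc = dall * (sx * s0 + sy * s1 + sz * s2 + sw * s3) - dmin * smin + acc;

    // After: the same value as an explicit fma() chain; the previous
    // accumulator value and the -dmin * smin term fold into the outer fma.
    acc = fma(dall, fma(sx, s0, fma(sy, s1, fma(sz, s2, sw * s3))), fma(-dmin, smin, acc));
}
```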
If you consider only the FMA changes, is there a measurable performance gain?
It's very hard to tell. The GLSL compiler should be using FMA instructions anyway; this change just makes that certain instead of leaving it to the optimizer. Hopefully it means a few more FMA calls in the generated SPIR-V, which could be checked.
But the SPIR-V then gets compiled again into a device-specific, driver-internal representation, where further optimization takes place. Since there are many device and driver combinations, I can't really be sure whether this helped anywhere, but at least I'm sure it doesn't cause slowdowns. I haven't seen a significant performance difference on my devices.
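To illustrate the "makes it certain" point, a minimal sketch (hypothetical values, unrelated to this PR's shaders):

```glsl
#version 450
layout(local_size_x = 1) in;

void main() {
    float a = 1.5, b = 2.0, c = 0.25;

    // The compiler may or may not contract this into a fused multiply-add.
    float r0 = a * b + c;

    // fma() is lowered to the GLSL.std.450 Fma extended instruction,
    // so the fusion intent reaches the SPIR-V unambiguously.
    float r1 = fma(a, b, c);
}
```

One way to check the SPIR-V side is to disassemble the compiled shader with spirv-dis from SPIRV-Tools and count the Fma extended instructions; whether the driver's backend compiler then keeps them fused is, as noted, device-specific.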
@ggerganov @slaren Can one of you review the non-Vulkan parts of this PR and approve if that's fine?
Make sure to fix the CI before merging
Vulkan Optimizations and Fixes (ggerganov#8959)
* Optimize Vulkan REPEAT performance
* Use Vulkan GLSL fused multiply-add instruction where possible
* Add GGML_VULKAN_PERF option to output performance data per operator
* Rework and fix Vulkan descriptor set and descriptor pool handling
* Fix float32 concat f16 shader validation error
* Add Vulkan GROUP_NORM eps parameter
* Fix validation error with transfer queue memory barrier flags
* Remove trailing whitespaces
I have implemented a number of Vulkan optimizations and fixes:

* Optimize Vulkan REPEAT performance
* Use Vulkan GLSL fused multiply-add instruction where possible
* Add GGML_VULKAN_PERF option to output performance data per operator
* Rework and fix Vulkan descriptor set and descriptor pool handling
* Fix float32 concat f16 shader validation error
* Add Vulkan GROUP_NORM eps parameter
* Fix validation error with transfer queue memory barrier flags
* Remove trailing whitespaces
I will keep this on draft while I check a few more things, but feel free to test and benchmark. Don't expect a huge difference.