vulkan: get the first command buffer submitted sooner #10499

jeffbolznv · 2024-11-25T16:54:05Z

This is an incremental improvement over #9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.
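As a rough illustration of the ramp-up idea described above, here is a minimal sketch that plans where submits would land for a graph of `n_nodes` nodes. The function name `plan_submits`, the starting batch size of 20, and the doubling rule are assumptions chosen for illustration. Only the 100 nodes/submit cap comes from the description, and the actual schedule in the patch may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not from the patch): returns the 0-based node indices
// after which a command buffer would be submitted. The batch size starts small
// so the GPU gets work early, then grows until it reaches max_batch.
std::vector<int> plan_submits(int n_nodes, int initial_batch, int max_batch) {
    assert(initial_batch > 0 && max_batch >= initial_batch);
    std::vector<int> submit_after;
    int nodes_per_submit = initial_batch;  // assumed starting size
    int pending = 0;                       // nodes recorded since the last submit
    for (int i = 0; i < n_nodes; ++i) {
        ++pending;
        const bool last_node = (i == n_nodes - 1);
        if (pending >= nodes_per_submit || last_node) {
            submit_after.push_back(i);
            pending = 0;
            // Assumed growth rule: double after each submit, capped at max_batch.
            nodes_per_submit = std::min(nodes_per_submit * 2, max_batch);
        }
    }
    return submit_after;
}
```

With `plan_submits(400, 20, 100)` the batches are 20, 40, 80, 100, 100, and a final 60 nodes, so the first submit happens after 20 nodes instead of 100.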

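before:

| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        109.93 ± 0.59 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     | 1000 |         tg128 |        100.70 ± 0.71 |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     | 1000 |         tg128 |         73.39 ± 0.61 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |         92.45 ± 1.01 |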

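after:

| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        112.04 ± 0.39 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     | 1000 |         tg128 |        100.59 ± 0.17 |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     | 1000 |         tg128 |         73.90 ± 0.25 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |         95.13 ± 0.75 |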

I did some timings of how long it takes to do the dryrun, and get to the first submit. These numbers are averaged over 32 evaluations of the model, and are all in microseconds. For Llama-3.2-3B-Instruct-Q8_0.gguf:

before:
dryRunTime 122 firstSubmitTime 238 beforeLastSubmitTime 1600 totalTime 10070
after:
dryRunTime 84 firstSubmitTime 130 beforeLastSubmitTime 1639 totalTime 9725

dryRunTime and firstSubmitTime are the amounts of time spent before we submit any work to the GPU (they are disjoint, i.e. firstSubmitTime does not include dryRunTime). beforeLastSubmitTime is roughly the total CPU time, and totalTime is roughly the GPU time. (Note that the GPU time seems to vary from execution to execution; I think this may be related to the KV cache.)

So before we had about 0.35ms of GPU idle time out of 10ms of total GPU time, and this reduces it to about 0.21ms, corresponding to around a 1% speedup. The numbers all tend to be a little noisy (the table above shows +3% for this model), but there's a clear improvement and it's generally aligned with the measurements I did of this idle bubble.
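The idle-bubble arithmetic can be reproduced from the four measurements. The sketch below is only a worked check of the numbers quoted above. The `Timings` struct and the function names are hypothetical and are not part of the patch or its instrumentation.

```cpp
#include <cassert>

// Hypothetical container for the four reported timings, all in microseconds.
struct Timings {
    double dry_run_us;
    double first_submit_us;
    double before_last_submit_us;
    double total_us;
};

// Values quoted in the description for Llama-3.2-3B-Instruct-Q8_0.gguf.
const Timings kBefore{122.0, 238.0, 1600.0, 10070.0};
const Timings kAfter{84.0, 130.0, 1639.0, 9725.0};

// Time spent before any work reaches the GPU. The two terms are disjoint,
// so the idle bubble is their sum.
double idle_before_first_submit_us(const Timings& t) {
    return t.dry_run_us + t.first_submit_us;
}

// Idle bubble as a fraction of the total (roughly GPU) time.
double idle_fraction(const Timings& t) {
    assert(t.total_us > 0.0);
    return idle_before_first_submit_us(t) / t.total_us;
}
```

From these inputs the bubble is 122 + 238 = 360 µs before and 84 + 130 = 214 µs after, a saving of 146 µs, or a bit over 1% of a roughly 10 ms evaluation.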

I have read the contributing guidelines

jeffbolznv · 2024-11-25T16:54:30Z

CC @mtavenrath

0cc4m

I reproduced the slight performance improvement. Looks good.

jeffbolznv requested a review from 0cc4m November 25, 2024 16:54

jeffbolznv added the Vulkan label (Issues specific to the Vulkan backend) Nov 28, 2024

0cc4m approved these changes Nov 29, 2024

0cc4m merged commit f095a64 into ggerganov:master Nov 29, 2024
54 checks passed
