Simplify and improve CUDA graphs through use of indirect copy pointers #9017

agray3 · 2024-08-13T08:56:22Z

Previously there was complexity in the CUDA graphs implementation due frequently changing parameters to copy kernels associated with K and V cache pointers. This patch simplifies by using indirection to avoid such parameters frequently changing, avoiding the need for frequent graph updates.

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

Previously there was complexity in the CUDA graphs implementation due frequently changing parameters to copy kernels associated with K and V cache pointers. This patch simplifies by using indirection to avoid such parameters frequently changing, avoiding the need for frequent graph updates.

agray3 · 2024-08-13T08:57:48Z

@slaren could you possibly review this whenever you get the bandwidth? Note that as well as simplifying the CUDA graphs code, this change also gives ~1-2% performance uplift by avoiding CUDA Graph updates for each token.

Nexesenex · 2024-08-13T12:52:16Z

Is this PR compatible with #8366, or does it supersedes it?

agray3 · 2024-08-13T13:05:46Z

Is this PR compatible with #8366, or does it supersedes it?

Yes, it is compatible (it doesn't supersede since #8366 provides further benefits). If/when this change is merged then I will re-base #8366 on it (which will actually also simplify #8366).
__

slaren · 2024-08-13T14:35:06Z

The idea of keeping a list of pointers in device memory to avoid the update to the graphs is interesting, but the way this is implemented is shifting some of the complexity from the CUDA backend to the application side. My view generally is that adding custom functions to the backends that require special handling from the application side should only be done as a last resort, and the priority should be to provide a simple and unified interface to the applications.

I think it would be possible to implement this entirely in the CUDA backend side by scanning the graph to obtain and update the list of pointers. I suppose it may be worth it if updating the nodes in the CUDA graph is significantly slower than copying a list of pointers to device memory, but if the difference is small, it may be hard to justify the added complexity to the CUDA backend code.

agray3 · 2024-08-13T16:14:57Z

Thanks @slaren. The current code involves repeated updates to the graph, and the proposed approach does give a significant performance advantage (even with the exrtra memcopies). E.g. On A100 for llama 7B Q4 I get (tokens/s):

        PR9017  Current 
Run1	157.02	154.35
Run2	157.03	154.62
Run3	156.78	154.45
Run4	156.45	153.98
Run5	156.78	154.26
		
Average	156.82	154.33
Speedup	1.016   1

This 1.6% speedup is not dramatic, but given the huge worldwide usage of llama.cpp I'd argue that it would accumulate to an enormous overall time, cost and energy saving. Plus it is a step in the right direction (IMO) of reducing the need to do a full rebuild of the GGML graph every step.

But I acknowledge that it does add e few lines of extra complexity to the llama.cpp file - I'll have a think about that can be better abstracted into GGML.

agray3 · 2024-08-14T10:05:15Z

I've now fully abstracted into the GGML CUDA backend, with just a single call from llama.cpp.

slaren · 2024-08-14T14:04:27Z

@agray3 I am sorry, I think there has been a misunderstanding. The problem is not the location of the few lines of code to build the list of pointers, the problem is skipping several layers of abstraction and going directly from llama.cpp to the CUDA backend code. Not only this code is going to be hard to maintain and will certainly require exceptions for some architectures, but ggml is a separate library from llama.cpp and it used in more applications, and the goal is to continue expanding the capabilities to use ggml in other projects. Simply put, it is not ok to add new functions to the CUDA backend interface to achieve this, and much less so to the ggml-backend interface. The only way I can see to implement this would be to build the list of pointers automatically and transparently by inspecting the graph within the CUDA backend.

agray3 · 2024-08-14T15:03:13Z

OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present?

Note that we can't do this by "inspecting the graph within the CUDA backend" since this pointer array don't exist there, it is built up token-by-token.

slaren · 2024-08-14T15:17:43Z

OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present?

The problem is that we cannot add new functions to the backend interface every time it is more convenient to implement some optimization by doing so, because it will pollute the application code and the backend interface, and will quickly become unmaintainable. Even if this is a small change now, there are currently 7 backends supported in ggml, and all of them would like to add similar functions to simplify their implementation. We cannot go this route unless it is absolutely necessary, and I don't think that this case qualifies.

Note that we can't do this by "inspecting the graph within the CUDA backend" since this pointer array don't exist there, it is built up token-by-token.

Please correct me if I am wrong, but as far as I can tell, these pointers are the same that appear as the destination of the GGML_OP_CPY operations, and thus could be collected from the graph in the same way that they are currently when updating the graph nodes. The only difference is how the kernel receives the updated pointers, either as a kernel argument argument updated in the CUDA graph, or as a pointer obtained from device memory.

agray3 · 2024-08-14T15:44:42Z

Please correct me if I am wrong, but as far as I can tell, these pointers are the same that appear as the destination of the GGML_OP_CPY operations, and thus could be collected from the graph in the same way that they are currently when updating the graph nodes.

Yes, you are right. I was getting mixed up between GGML and CUDA graphs. Currently we extract from the GGML graph and insert into the CUDA graph, but we could instead extract from the GGML graph and pass to the GPU via a memcpy. I'll experiment with that, thanks.

ggerganov#9017 Co-Authored-By: agray3 <[email protected]>

…y pointers ggerganov#9017" This reverts commit 1dea402e4cb8f64737aa49ba98bc9647656e4d26.

Nexesenex · 2024-10-17T19:34:19Z

Hey @agray3. Would you mind to actualize this PR as well, so I can merge it with my fork?

Any boost of performance, even small, is welcome! :D

Thanks in any case!

agray3 · 2024-10-19T19:24:39Z

Hey @agray3. Would you mind to actualize this PR as well, so I can merge it with my fork?

Any boost of performance, even small, is welcome! :D

Thanks in any case!

This will require a bit more rebasing to be compatible with my other patch - I'm away for a few days so will take a look when I'm back.

Nexesenex · 2024-10-19T20:01:45Z

@agray3: Thanks! Have a great time meanwhile!

github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Aug 13, 2024

Abstract into GGML

38f4863

agray3 marked this pull request as draft August 14, 2024 15:47

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 15, 2024

Simplify and improve CUDA graphs through use of indirect copy pointers

ccd8e56

ggerganov#9017 Co-Authored-By: agray3 <[email protected]>

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 15, 2024

Revert " Simplify and improve CUDA graphs through use of indirect cop…

56fc498

…y pointers ggerganov#9017" This reverts commit 1dea402e4cb8f64737aa49ba98bc9647656e4d26.

Nexesenex mentioned this pull request Oct 18, 2024

Adding @agray3's graph caching approach ikawrakow/ik_llama.cpp#94

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify and improve CUDA graphs through use of indirect copy pointers #9017

Simplify and improve CUDA graphs through use of indirect copy pointers #9017

agray3 commented Aug 13, 2024

agray3 commented Aug 13, 2024

Nexesenex commented Aug 13, 2024

agray3 commented Aug 13, 2024

slaren commented Aug 13, 2024

agray3 commented Aug 13, 2024

agray3 commented Aug 14, 2024

slaren commented Aug 14, 2024

agray3 commented Aug 14, 2024

slaren commented Aug 14, 2024

agray3 commented Aug 14, 2024

Nexesenex commented Oct 17, 2024

agray3 commented Oct 19, 2024

Nexesenex commented Oct 19, 2024

Simplify and improve CUDA graphs through use of indirect copy pointers #9017

Are you sure you want to change the base?

Simplify and improve CUDA graphs through use of indirect copy pointers #9017

Conversation

agray3 commented Aug 13, 2024

agray3 commented Aug 13, 2024

Nexesenex commented Aug 13, 2024

agray3 commented Aug 13, 2024

slaren commented Aug 13, 2024

agray3 commented Aug 13, 2024

agray3 commented Aug 14, 2024

slaren commented Aug 14, 2024

agray3 commented Aug 14, 2024

slaren commented Aug 14, 2024

agray3 commented Aug 14, 2024

Nexesenex commented Oct 17, 2024

agray3 commented Oct 19, 2024

Nexesenex commented Oct 19, 2024