llama : reduce useless copies when saving session #8916
Merged
Should help with #8915.
Since #8699, session size calculation and session file writing have shared mostly the same code, which introduced useless copies into temporary buffers that are immediately discarded. On CPU the overhead was tolerable, but on CUDA, as reported in #8915, it makes the session size calculation much too slow.
To fix this, it's possible to simply avoid calling `ggml_backend_tensor_get` when the data won't be used (i.e. when calculating the session file size). I've also eliminated the double tensor copies when saving the state to a buffer.
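
For illustration, here is a minimal, self-contained C++ sketch of the pattern (the names `data_write`, `data_write_dummy`, `toy_tensor`, and `toy_backend_tensor_get` are stand-ins, not the actual llama.cpp API): the buffer-saving path and the size-calculation path implement the same writer interface, but only the buffer writer performs the device-to-host tensor copy; the dummy writer just accumulates byte counts.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Stand-in for a backend tensor; the real code operates on ggml_tensor
// and uses ggml_backend_tensor_get for the device->host copy, which is
// what made the size calculation slow on CUDA.
struct toy_tensor {
    std::vector<uint8_t> data; // pretend this lives on the device
};

// stand-in for ggml_backend_tensor_get (device -> host copy)
static void toy_backend_tensor_get(const toy_tensor * t, void * dst, size_t offset, size_t size) {
    std::memcpy(dst, t->data.data() + offset, size);
}

// Shared writer interface used by both the "save to buffer" and the
// "just compute the size" paths.
struct data_write {
    virtual ~data_write() = default;
    virtual void write(const void * src, size_t size) = 0;
    virtual void write_tensor(const toy_tensor * t, size_t offset, size_t size) = 0;
    virtual size_t size_written() const = 0;
};

// Real writer: the device->host copy is actually needed here.
struct data_write_buffer : data_write {
    std::vector<uint8_t> out;
    void write(const void * src, size_t size) override {
        const uint8_t * p = static_cast<const uint8_t *>(src);
        out.insert(out.end(), p, p + size);
    }
    void write_tensor(const toy_tensor * t, size_t offset, size_t size) override {
        std::vector<uint8_t> tmp(size);
        toy_backend_tensor_get(t, tmp.data(), offset, size);
        write(tmp.data(), size);
    }
    size_t size_written() const override { return out.size(); }
};

// Dummy writer used for size calculation. Before the fix it went through
// the same copy path and discarded the bytes; now it only accumulates
// sizes and never triggers the device->host copy.
struct data_write_dummy : data_write {
    size_t written = 0;
    void write(const void *, size_t size) override { written += size; }
    void write_tensor(const toy_tensor *, size_t, size_t size) override {
        written += size; // no toy_backend_tensor_get: nothing is copied
    }
    size_t size_written() const override { return written; }
};

int main() {
    toy_tensor t { std::vector<uint8_t>(1024, 42) };

    data_write_dummy  counter;
    data_write_buffer writer;

    data_write * ctxs[] = { &counter, &writer };
    for (data_write * ctx : ctxs) {
        uint32_t n_tensors = 1;
        ctx->write(&n_tensors, sizeof(n_tensors));
        ctx->write_tensor(&t, 0, t.data.size());
    }

    std::printf("size = %zu, written = %zu\n",
                counter.size_written(), writer.size_written());
    return 0;
}
```

With this split, computing the session size touches no tensor data at all, so it stays fast regardless of the backend.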
@josharian does this help with your use-case?
TODO