llama : reduce useless copies when saving session #8916
Merged
Should help with #8915.
Since #8699, session size calculation and session file writing have shared mostly the same code, which introduced useless copies into temporary buffers that are immediately discarded. On CPU the overhead was tolerable, but on CUDA, as reported in #8915, it makes the session size calculation much too slow.
To fix this, it's possible to simply avoid calling `ggml_backend_tensor_get` when the data won't be used (i.e. when calculating the session file size). I've also eliminated the double tensor copies when saving the state to a buffer.
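
For illustration, here is a minimal, self-contained C++ sketch of the pattern (the names `data_write`, `data_write_dummy`, `toy_tensor`, and `toy_backend_tensor_get` are stand-ins, not the actual llama.cpp API): the buffer-saving path and the size-calculation path implement the same writer interface, but only the buffer writer performs the device-to-host tensor copy; the dummy writer just accumulates byte counts.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Stand-in for a backend tensor; the real code operates on ggml_tensor
// and uses ggml_backend_tensor_get for the device->host copy, which is
// what made the size calculation slow on CUDA.
struct toy_tensor {
    std::vector<uint8_t> data; // pretend this lives on the device
};

// stand-in for ggml_backend_tensor_get (device -> host copy)
static void toy_backend_tensor_get(const toy_tensor * t, void * dst, size_t offset, size_t size) {
    std::memcpy(dst, t->data.data() + offset, size);
}

// Shared writer interface used by both the "save to buffer" and the
// "just compute the size" paths.
struct data_write {
    virtual ~data_write() = default;
    virtual void write(const void * src, size_t size) = 0;
    virtual void write_tensor(const toy_tensor * t, size_t offset, size_t size) = 0;
    virtual size_t size_written() const = 0;
};

// Real writer: the device->host copy is actually needed here.
struct data_write_buffer : data_write {
    std::vector<uint8_t> out;
    void write(const void * src, size_t size) override {
        const uint8_t * p = static_cast<const uint8_t *>(src);
        out.insert(out.end(), p, p + size);
    }
    void write_tensor(const toy_tensor * t, size_t offset, size_t size) override {
        std::vector<uint8_t> tmp(size);
        toy_backend_tensor_get(t, tmp.data(), offset, size);
        write(tmp.data(), size);
    }
    size_t size_written() const override { return out.size(); }
};

// Dummy writer used for size calculation. Before the fix it went through
// the same copy path and discarded the bytes; now it only accumulates
// sizes and never triggers the device->host copy.
struct data_write_dummy : data_write {
    size_t written = 0;
    void write(const void *, size_t size) override { written += size; }
    void write_tensor(const toy_tensor *, size_t, size_t size) override {
        written += size; // no toy_backend_tensor_get: nothing is copied
    }
    size_t size_written() const override { return written; }
};

int main() {
    toy_tensor t { std::vector<uint8_t>(1024, 42) };

    data_write_dummy  counter;
    data_write_buffer writer;

    data_write * ctxs[] = { &counter, &writer };
    for (data_write * ctx : ctxs) {
        uint32_t n_tensors = 1;
        ctx->write(&n_tensors, sizeof(n_tensors));
        ctx->write_tensor(&t, 0, t.data.size());
    }

    std::printf("size = %zu, written = %zu\n",
                counter.size_written(), writer.size_written());
    return 0;
}
```

With this split, computing the session size touches no tensor data at all, so it stays fast regardless of the backend.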
@josharian does this help with your use-case?
TODO